Weight of Evidence (Operator Toolbox)

Synopsis

This Operator discretizes the selected numerical Attributes into user-specified classes and applies a Weight of Evidence transformation to the values. As a result, all Examples that belong to the same class will have a common numerical value. The type of the selected Attributes will remain numerical.

Description

This Operator applies a Weight of Evidence transformation to the selected Attributes. The new numeric values are calculated on the basis of user-specified classes. Within each class, the Weight of Evidence value is calculated using the binominal Attribute specified in the base of distribution parameter. First, for each class, the share of negative values and the share of positive values (each relative to the whole data set) are calculated. Then, the Weight of Evidence value is calculated as: ln(% of negatives / % of positives). All Examples that belong to the same class receive this new common numeric value. A separate class for missing values can also be created by checking the class for missing values parameter.
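The calculation described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the Operator's actual implementation: the function names are hypothetical, each upper limit is assumed to be inclusive, the last upper limit is assumed to be infinity, and every class is assumed to contain at least one positive and one negative Example.

```python
import math
from bisect import bisect_left

def assign_class(value, upper_limits):
    # Index of the first class whose upper limit is >= value
    # (assumption: upper limits are sorted and inclusive).
    return bisect_left(upper_limits, value)

def weight_of_evidence(values, labels, upper_limits, positive_label):
    # Count positive and negative Examples per class.
    n_classes = len(upper_limits)
    pos = [0] * n_classes
    neg = [0] * n_classes
    for v, y in zip(values, labels):
        k = assign_class(v, upper_limits)
        if y == positive_label:
            pos[k] += 1
        else:
            neg[k] += 1
    total_pos, total_neg = sum(pos), sum(neg)
    # WoE per class: ln(% of negatives / % of positives),
    # where both percentages are relative to the whole data set.
    return [
        math.log((neg[k] / total_neg) / (pos[k] / total_pos))
        for k in range(n_classes)
    ]

def transform(values, upper_limits, woe):
    # Replace every value by the WoE value of its class.
    return [woe[assign_class(v, upper_limits)] for v in values]

# Two classes: values up to 30, and everything above.
values = [10, 20, 25, 40, 50, 60]
labels = ["no", "no", "yes", "yes", "yes", "no"]
limits = [30, math.inf]
woe = weight_of_evidence(values, labels, limits, positive_label="yes")
new_values = transform(values, limits, woe)
```

In this small example the first class holds two of the three negatives and one of the three positives, so its value is ln(2); the second class is the mirror image and gets ln(0.5).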

Differentiation

Discretize by Binning

The Discretize By Binning Operator creates bins in such a way that the range of all bins is (almost) equal.

Discretize by Frequency

The Discretize By Frequency Operator creates bins in such a way that the number of unique values in all bins is (almost) equal.

Discretize by Size

The Discretize By Size Operator creates bins in such a way that each bin has a user-specified size (i.e. number of Examples).

Discretize by Entropy

The discretization is performed by selecting bin boundaries such that the entropy is minimized in the induced partitions.

Discretize by User Specification

This Operator discretizes the selected numerical attributes into user-specified classes.

Input

example set (Data table)
This input port expects an ExampleSet. Note that there should be at least one numerical and one binominal attribute in the input ExampleSet, otherwise the use of this operator does not make sense.

Output

example set (Data table)
The values of the selected numerical Attributes are replaced by the calculated Weight of Evidence values of their classes, and the resulting ExampleSet is delivered through this port.
original (Data table)
The ExampleSet that was given as input is passed through this port unchanged. This is usually used to reuse the same ExampleSet in further Operators or to view the ExampleSet in the Results Workspace.
preprocessing model (Preprocessing Model)
This port delivers the preprocessing model, which has information regarding the parameters of this Operator in the current process.

Parameters

create_view It is possible to create a View instead of changing the underlying data. Simply select this parameter to enable this option. The transformation that would be normally performed directly on the data will then be computed every time a value is requested and the result is returned without changing the data. Range: boolean
attribute_filter_type This parameter allows you to select the attribute selection filter; the method you want to use for selecting attributes. It has the following options:
- all: This option simply selects all the attributes of the ExampleSet. This is the default option.
- single: This option allows selection of a single attribute. When this option is selected another parameter (attribute) becomes visible in the Parameters panel.
- subset: This option allows selection of multiple attributes through a list. All attributes of ExampleSet are present in the list; required attributes can be easily selected. This option will not work if meta data is not known. When this option is selected another parameter becomes visible in the Parameters panel.
- regular_expression: This option allows you to specify a regular expression for attribute selection. When this option is selected some other parameters (regular expression, use except expression) become visible in the Parameters panel.
- value_type: This option allows selection of all the attributes of a particular type. It should be noted that types are hierarchical. For example real and integer types both belong to the numeric type. Users should have basic understanding of type hierarchy when selecting attributes through this option. When this option is selected some other parameters (value type, use value type exception) become visible in the Parameters panel.
- block_type: This option is similar in working to the value_type option. This option allows selection of all the attributes of a particular block type. It should be noted that block types may be hierarchical. For example value_series_start and value_series_end block types both belong to the value_series block type. When this option is selected some other parameters (block type, use block type exception) become visible in the Parameters panel.
- no_missing_values: This option simply selects all the attributes of the ExampleSet which don't contain a missing value in any example. Attributes that have even a single missing value are removed.
- numeric_value_filter: When this option is selected another parameter (numeric condition) becomes visible in the Parameters panel. All numeric attributes whose examples all satisfy the mentioned numeric condition are selected. Please note that all nominal attributes are also selected irrespective of the given numerical condition.
Range: selection
attribute The required attribute can be selected from this option. The attribute name can be selected from the drop down box of the parameter attribute if the meta data is known. Range: string
attributes The required attributes can be selected from this option. This opens a new window with two lists. All attributes are present in the left list and can be shifted to the right list, which is the list of selected attributes. Range: string
regular_expression The attributes whose names match this expression will be selected. Regular expressions are a very powerful tool but need a detailed explanation for beginners. It is always good to specify the regular expression through the edit and preview regular expression menu. This menu gives a good idea of regular expressions and it also allows you to try different expressions and preview the results simultaneously. Range: string
use_except_expression If enabled, an exception to the first regular expression can be specified. When this option is selected another parameter (except regular expression) becomes visible in the Parameters panel. Range: boolean
except_regular_expression This option allows you to specify a regular expression. Attributes matching this expression will be filtered out even if they match the first regular expression (regular expression that was specified in the regular expression parameter). Range: string
value_type The type of attributes to be selected can be chosen from a drop down list. Range: selection
use_value_type_exception If enabled, an exception to the selected type can be specified. When this option is enabled, another parameter (except value type) becomes visible in the Parameters panel. Range: boolean
except_value_type The attributes matching this type will not be selected even if they match the previously mentioned type i.e. value type parameter's value. Range: selection
block_type The block type of attributes to be selected can be chosen from a drop down list. Range: selection
use_block_type_exception If enabled, an exception to the selected block type can be specified. When this option is selected another parameter (except block type) becomes visible in the Parameters panel. Range: boolean
except_block_type The attributes matching this block type will not be selected even if they match the previously mentioned block type i.e. block type parameter's value. Range: selection
numeric_condition The numeric condition for testing examples of numeric attributes is specified here. For example the numeric condition '> 6' will keep all nominal attributes and all numeric attributes having a value of greater than 6 in every example. A combination of conditions is possible: '> 6 && < 11' or '<= 5 || < 0'. But && and || cannot be used together in one numeric condition. Conditions like '(> 0 && < 2) || (>10 && < 12)' are not allowed because they use both && and ||. Use a blank space after '>', '=' and '<' e.g. '<5' will not work, so use '< 5' instead. Range: string
invert_selection If this parameter is set to true, it acts as a NOT gate and reverses the selection: all previously selected attributes are unselected and all previously unselected attributes are selected. For example, if attribute 'att1' is selected and attribute 'att2' is unselected before this parameter is checked, then after checking it 'att1' will be unselected and 'att2' will be selected. Range: boolean
include_special_attributes The special attributes are attributes with special roles which identify the examples. In contrast regular attributes simply describe the examples. Special attributes are: id, label, prediction, cluster, weight and batch. By default all special attributes are selected irrespective of the conditions in the Select Attribute operator. If this parameter is set to true, special attributes are also tested against conditions specified in the Select Attribute operator and only those attributes are selected that satisfy the conditions. Range: boolean
base_of_distribution The binary attribute which is the basis for calculating the distribution. Please note that this attribute must be included in the previous attribute filter. Range:
classes Defines the upper limits of each class. Range:
replace_infinite_WoE_values Defines whether infinite Weight of Evidence values should be replaced with constants. The Weight of Evidence value of a class is (positive or negative) infinity if the class contains no positive or no negative values. Range:
positive_infinite_substitute Substitute for classes with positive infinite values. Range:
negative_infinite_substitute Substitute for classes with negative infinite values. Range:
WoE_of_empty_classes Weight of Evidence value for empty classes. Range:
class_for_missing_values Defines whether an extra class for missing values should be created. Range:
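
The interplay of the last five parameters can be illustrated with a short Python sketch. It is a hedged illustration, not the Operator's code: the function names and the default substitute values (10.0, -10.0 and 0.0) are assumptions made for this example, and the signs of the infinite cases simply follow from the formula ln(% of negatives / % of positives) given in the Description.

```python
import math
from bisect import bisect_left

def class_woe(neg, pos, total_neg, total_pos,
              replace_infinite_woe_values=True,
              positive_infinite_substitute=10.0,   # hypothetical default
              negative_infinite_substitute=-10.0,  # hypothetical default
              woe_of_empty_classes=0.0):           # hypothetical default
    # Empty class: no Examples at all fall into it.
    if neg == 0 and pos == 0:
        return woe_of_empty_classes
    # No positives: ln(x / 0) diverges to positive infinity.
    if pos == 0:
        if replace_infinite_woe_values:
            return positive_infinite_substitute
        return math.inf
    # No negatives: ln(0 / x) diverges to negative infinity.
    if neg == 0:
        if replace_infinite_woe_values:
            return negative_infinite_substitute
        return -math.inf
    return math.log((neg / total_neg) / (pos / total_pos))

def class_index(value, upper_limits, class_for_missing_values=True):
    # Missing values (None or NaN) go to one extra class placed after
    # the regular ones, or to no class at all if the option is off.
    if value is None or (isinstance(value, float) and math.isnan(value)):
        return len(upper_limits) if class_for_missing_values else None
    return bisect_left(upper_limits, value)

limits = [30, math.inf]
missing_class = class_index(None, limits)      # extra class after the regular two
no_positives = class_woe(2, 0, 3, 3)           # replaced by the positive substitute
no_negatives = class_woe(0, 2, 3, 3, replace_infinite_woe_values=False)
```

Replacing infinite values with finite constants keeps the transformed Attribute usable by Operators that cannot handle infinity; which constants are sensible depends on the range of the finite Weight of Evidence values in the data.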

Tutorial Processes

Creating Weight of Evidence values for Age groups in Titanic data set

This Example Process demonstrates a use case for the Weight of Evidence Operator. The aim is to assign a numeric value to age groups that expresses the chance of the Survived attribute being Yes or No. This value will be the same for all the examples that belong to the same group. The examples with a missing Age attribute value are put into a separate group. Furthermore, some attributes describing the group are also added to the examples.

The Weight of Evidence value is easy to understand and interpret. If the value is positive, the examples from that age group are more likely to have the value Yes for the attribute Survived than the data set as a whole. The higher the Weight of Evidence value, the greater the chance of survival. Likewise, a negative value means that the value No for the attribute Survived is more frequent within this age group than in the whole data set. A lower Weight of Evidence value implies a lower chance of survival.

What makes this value highly useful is that there is originally no strong connection between the Age and Survived attributes. For example, people aged between 40 and 65 had a better chance of survival than those in the 18-40 and 65-80 age groups. By applying these groupings and calculating the Weight of Evidence values, the new attribute becomes more meaningful and more closely connected to Survived than the original Age attribute.
