Synopsis

This Operator calculates the pairwise correlations between all Attributes of an ExampleSet and can also produce an Attribute weights vector based on these correlations. Correlation is a statistical measure that shows whether, and how strongly, pairs of Attributes are related.

Description

A correlation is a number between -1 and +1 that measures the degree of association between two Attributes (call them X and Y). A positive value for the correlation implies a positive association. In this case large values of X tend to be associated with large values of Y and small values of X tend to be associated with small values of Y. A negative value for the correlation implies a negative or inverse association. In this case large values of X tend to be associated with small values of Y and vice versa.

Suppose we have two Attributes X and Y, with means X' and Y' and standard deviations S(X) and S(Y) respectively. The correlation is computed by summing the products (X(i)-X').(Y(i)-Y') for i from 1 to n and then dividing this sum by the product (n-1).S(X).S(Y), where n is the total number of Examples and i is the summation index. Other formulas and definitions exist, but this one is used here for simplicity.
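
Written out as a formula, the computation described above reads as follows. The symbol r(X,Y) for the correlation is introduced here only for illustration, and the definition of S(X) with the divisor n-1 is an assumption; the text does not spell it out, but it is the definition under which the result stays between -1 and +1.

```latex
r(X,Y) = \frac{\sum_{i=1}^{n} \bigl(X(i) - X'\bigr)\,\bigl(Y(i) - Y'\bigr)}{(n-1)\, S(X)\, S(Y)},
\qquad
S(X) = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} \bigl(X(i) - X'\bigr)^{2}}
```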

As discussed earlier, a positive value for the correlation implies a positive association. Suppose that an X value is above average and that the associated Y value is also above average. Then the product (X(i)-X').(Y(i)-Y') is the product of two positive numbers, which is positive. If the X value and the Y value are both below average, the product is of two negative numbers, which is also positive. Therefore, a positive correlation is evidence of a general tendency that large values of X are associated with large values of Y and small values of X are associated with small values of Y.

As discussed earlier, a negative value for the correlation implies a negative or inverse association. Suppose that an X value is above average and that the associated Y value is instead below average. Then the product (X(i)-X').(Y(i)-Y') is the product of a positive and a negative number, which is negative. If the X value is below average and the Y value is above average, the product is also negative. Therefore, a negative correlation is evidence of a general tendency that large values of X are associated with small values of Y and small values of X are associated with large values of Y.
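
The sign argument above can be checked with a few lines of Python. This is a minimal sketch of the formula from the Description, not the Operator's actual implementation; the function name pearson and the sample values are made up for illustration.

```python
import math

def pearson(xs, ys):
    """Correlation of two equally long numeric sequences (formula from the text)."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equally long sequences with at least 2 values")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of the products (X(i)-X').(Y(i)-Y').
    cross = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Standard deviations with divisor n - 1 (an assumption, see lead-in).
    s_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs) / (n - 1))
    s_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys) / (n - 1))
    if s_x == 0 or s_y == 0:
        return float("nan")  # undefined when one sequence is constant
    return cross / ((n - 1) * s_x * s_y)

# Illustrative values only.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
rising = [2.0, 4.0, 5.0, 4.0, 6.0]   # tends to grow with x
falling = [6.0, 4.0, 5.0, 4.0, 2.0]  # tends to shrink as x grows

print(pearson(x, rising))   # positive value
print(pearson(x, falling))  # negative value
```

For the rising series most products of deviations are positive, so the sum and therefore the correlation are positive; for the falling series the same products change sign.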

This Operator can be used to create a correlation matrix that shows the correlations between all Attributes of the input ExampleSet. It can also return an Attribute weights vector based on these correlations. Using this weights vector, highly correlated Attributes can be removed from the ExampleSet with the help of the Select by Weights Operator. A simpler way to remove highly correlated Attributes is the Remove Correlated Attributes Operator. Correlated Attributes are usually removed because they behave similarly and therefore have little additional influence on predictions. They may also increase run time and memory usage.

Input

  • example set (Data table)

    This input port expects an ExampleSet on which the correlation matrix will be calculated.

Output

  • example set (Data table)

    The ExampleSet that was given as input is passed through without changes.

  • matrix (Numerical Matrix)

    The correlations of all Attributes of the input ExampleSet are calculated and the resultant correlation matrix is returned from this port. The correlation for nominal Attributes is not well defined and results in a missing value. When Attributes contain missing values, only pairwise complete tuples are used for calculating the correlation.

  • weights (Attribute Weights)

    The Attribute weights vector based on the correlations of the Attributes is delivered through this output port.
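
The matrix port description mentions that only pairwise complete tuples are used when Attributes contain missing values. A minimal sketch of that idea follows, assuming missing values are represented as None; the function name and the sample data are hypothetical, and the Operator's internal handling may differ in detail.

```python
import math

def pairwise_complete_correlation(xs, ys):
    """Correlation using only the rows where both values are present."""
    # Drop every row in which at least one of the two values is missing.
    pairs = [(x, y) for x, y in zip(xs, ys) if x is not None and y is not None]
    n = len(pairs)
    if n < 2:
        return float("nan")  # not enough complete rows
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    cross = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    ss_x = sum((x - mean_x) ** 2 for x, _ in pairs)
    ss_y = sum((y - mean_y) ** 2 for _, y in pairs)
    if ss_x == 0 or ss_y == 0:
        return float("nan")  # undefined for a constant column
    # Equivalent to dividing by (n-1).S(X).S(Y) with S computed on the kept rows.
    return cross / math.sqrt(ss_x * ss_y)

# Illustrative data: rows 3 and 4 are incomplete and are ignored.
a = [1.0, 2.0, None, 4.0, 5.0]
b = [2.0, 4.0, 9.0, None, 10.0]
print(pairwise_complete_correlation(a, b))
```

Note that the rows are dropped per pair of Attributes, so different entries of the matrix may be based on different numbers of Examples.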

Parameters

  • type This parameter can be used to decide whether to include or exclude the selected Attributes. include attributes is the default option. It configures the Operator to keep the selected Attributes and remove the remainder. exclude attributes leads to the inverse behaviour. It configures the Operator to remove the selected Attributes and keep the remainder. This also applies to special attributes if the also apply to special attributes parameter is set to true.
  • normalize weights This parameter indicates if the weights of the resultant Attribute weights vector should be normalized. If set to true, all weights are normalized such that the minimum weight is 0 and the maximum weight is 1.
  • squared correlation This parameter indicates if the squared correlation should be calculated. If set to true, the correlation matrix shows squares of correlations instead of simple correlations.
  • attribute filter type This parameter allows you to select the Attribute selection filter; the method you want to use for selecting Attributes. It has the following options:
    • all attributes: This option selects all the Attributes of the ExampleSet; no Attributes are removed. This is the default option.
    • one attribute: This option allows the selection of a single Attribute. The Attribute is selected by the select attribute parameter.
    • a subset: This option allows the selection of multiple Attributes through a list (see parameter select subset). If the meta data of the ExampleSet is known, all Attributes are present in the list and the required ones can easily be selected.
    • regular expression: This option allows you to specify a regular expression for the Attribute selection. The regular expression filter is configured via the parameters expression and exclude expression.
    • type(s) of values: This option allows the selection of Attributes of particular type(s). The value type filter is configured via the parameter type of value.
    • no missing values: This option selects all Attributes of the ExampleSet which do not contain a missing value in any Example. Attributes that have even a single missing value are removed.
  • select attribute The required Attribute can be selected from this option. The Attribute name can be selected from the drop down box of the parameter if the meta data is known. Otherwise, the attribute name can be typed in manually.
  • select subset The required Attributes can be selected from this option. This opens a new window with two lists. All Attributes are present in the left list, if the meta data is known. They can be shifted to the right list, which is the list of selected Attributes that will make it to the output port. If the meta data is unknown, you can manually type in attribute names and use the green plus-button to add them to the list of selected attributes.
  • expression Attributes whose names match this expression will be selected. The expression can be specified through the button on the right that will open the Edit Regular Expression menu. This menu gives a good idea of regular expressions, and it also allows you to try different expressions and preview the results simultaneously.
  • exclude expression This option allows you to specify a regular expression. Attributes matching this expression will be filtered out even if they match the first expression (expression that was specified via the expression parameter).
  • type of value This option allows you to select Attribute types. A subset of the following types can be chosen: real, integer, date-time, time, binominal, non-binominal.
  • also apply to special attributes (id, label..) Special Attributes are Attributes with roles (e.g. id, label). By default, all special Attributes are delivered to the output port regardless of the conditions in the Select Attributes Operator. If this parameter is set to true, special Attributes are also tested against the specified conditions and only those Attributes are selected that match the conditions.
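
The effect of the normalize weights and squared correlation parameters can be illustrated with a short sketch. The helper names, the Attribute names and the weight values below are hypothetical, and the handling of the case where all weights are equal is an assumption; the parameter description only states that the minimum weight becomes 0 and the maximum becomes 1.

```python
def min_max_normalize(weights):
    """Rescale weights so that the smallest becomes 0 and the largest becomes 1."""
    lo = min(weights.values())
    hi = max(weights.values())
    if hi == lo:
        # Assumption: if all weights are equal there is no range to rescale,
        # so every weight is mapped to 1.0.
        return {name: 1.0 for name in weights}
    return {name: (w - lo) / (hi - lo) for name, w in weights.items()}

def square_entries(matrix):
    """Replace every correlation in the matrix by its square."""
    return [[value * value for value in row] for row in matrix]

# Hypothetical weights and a hypothetical 2x2 correlation matrix.
weights = {"att1": 0.2, "att2": 0.5, "att3": 0.8}
print(min_max_normalize(weights))

correlations = [[1.0, -0.5], [-0.5, 1.0]]
print(square_entries(correlations))  # [[1.0, 0.25], [0.25, 1.0]]
```

Squaring discards the sign, so a squared correlation only expresses the strength of an association, not its direction.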

Tutorial Processes

Correlation matrix of the Golf data set

The 'Golf' data set is loaded using the Retrieve Operator. A breakpoint is inserted here so that you can view the ExampleSet. As you can see, the ExampleSet has 4 regular Attributes i.e. 'Outlook', 'Temperature', 'Humidity' and 'Wind' and the label Attribute 'Play'.

All Attributes with only two nominal values are converted to binominal Attributes using Nominal to Binominal. Then the Correlation Matrix Operator is applied on the result. The weights vector generated by this Operator is provided to the Select by Weights Operator along with the data set. The parameters of the Select by Weights Operator are adjusted such that the Attributes with weights greater than 0.5 are selected and all other Attributes are removed. This is why the resultant ExampleSet only has the 'Play' and the 'Temperature' Attribute.

The correlation matrix, weights vector and the resultant ExampleSet can be viewed in the Results Workspace. In the correlation matrix you can see that Outlook is a nominal Attribute, so no correlation can be calculated with it. The correlation of an Attribute with itself is always one, so the diagonal entries are all 1.
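
The structure of such a matrix can be reproduced in a small sketch: ones on the diagonal, a missing value (NaN here) wherever a nominal Attribute is involved, and symmetric entries elsewhere. The table below is made-up data, not the Golf data set, and the function names are hypothetical; setting the diagonal to 1 directly follows the description above rather than any knowledge of the Operator's internals.

```python
import math

def pearson(xs, ys):
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cross = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    ss_y = sum((y - mean_y) ** 2 for y in ys)
    if ss_x == 0 or ss_y == 0:
        return float("nan")
    return cross / math.sqrt(ss_x * ss_y)

def is_numeric(column):
    return all(isinstance(v, (int, float)) and not isinstance(v, bool) for v in column)

def correlation_matrix(table):
    """Return the column names and a square matrix of pairwise correlations."""
    names = list(table)
    matrix = []
    for a in names:
        row = []
        for b in names:
            if a == b:
                row.append(1.0)  # an Attribute is perfectly correlated with itself
            elif is_numeric(table[a]) and is_numeric(table[b]):
                row.append(pearson(table[a], table[b]))
            else:
                row.append(float("nan"))  # not defined for nominal columns
        matrix.append(row)
    return names, matrix

# Made-up table with two numeric columns and one nominal column.
table = {
    "x": [1.0, 2.0, 3.0, 4.0],
    "y": [2.0, 1.0, 4.0, 3.0],
    "kind": ["a", "b", "a", "b"],
}
names, matrix = correlation_matrix(table)
for name, row in zip(names, matrix):
    print(name, row)
```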