Correlation Matrix (RapidMiner Studio Core)

Synopsis

This operator determines the correlation between all pairs of attributes and can also produce a weights vector based on these correlations. Correlation is a statistical technique that shows whether, and how strongly, pairs of attributes are related.

Description

A correlation is a number between -1 and +1 that measures the degree of association between two attributes (call them X and Y). A positive value for the correlation implies a positive association. In this case large values of X tend to be associated with large values of Y and small values of X tend to be associated with small values of Y. A negative value for the correlation implies a negative or inverse association. In this case large values of X tend to be associated with small values of Y and vice versa.

Suppose we have two attributes X and Y, with means X' and Y' and standard deviations S(X) and S(Y) respectively. The correlation is computed by summing the products (X(i)-X').(Y(i)-Y') for i from 1 to n and then dividing this sum by the product (n-1).S(X).S(Y), where n is the total number of examples and i is the summation index. Other formulas and definitions exist, but this one is used here for simplicity.
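The formula above can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not the operator's actual implementation; the function name pearson_correlation is hypothetical, and the sketch assumes that S(X) and S(Y) are sample standard deviations (computed with n-1 in the denominator), which is what the (n-1) factor in the formula suggests.

```python
import math

def pearson_correlation(xs, ys):
    """Correlation of two equal-length numeric sequences (illustrative sketch)."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length with at least 2 values")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Assumption: sample standard deviations, i.e. n - 1 in the denominator.
    s_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs) / (n - 1))
    s_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys) / (n - 1))
    if s_x == 0 or s_y == 0:
        raise ValueError("correlation is undefined for a constant attribute")
    # Sum of the products (X(i)-X').(Y(i)-Y') over all examples.
    product_sum = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return product_sum / ((n - 1) * s_x * s_y)
```

For two attributes that lie exactly on an increasing straight line the function returns a value of 1 (up to floating-point rounding), and for an exactly decreasing line it returns -1.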

As discussed earlier a positive value for the correlation implies a positive association. Suppose that an X value was above average, and that the associated Y value was also above average. Then the product (X(i)-X').(Y(i)-Y') would be the product of two positive numbers which would be positive. If the X value and the Y value were both below average, then the product above would be of two negative numbers, which would also be positive. Therefore, a positive correlation is evidence of a general tendency that large values of X are associated with large values of Y and small values of X are associated with small values of Y.

As discussed earlier a negative value for the correlation implies a negative or inverse association. Suppose that an X value was above average, and that the associated Y value was instead below average. Then the product (X(i)-X').(Y(i)-Y') would be the product of a positive and a negative number which would make the product negative. If the X value was below average and the Y value was above average, then the product above would also be negative. Therefore, a negative correlation is evidence of a general tendency that large values of X are associated with small values of Y and small values of X are associated with large values of Y.
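The sign argument of the last two paragraphs can be checked numerically. The following sketch computes the individual products (X(i)-X').(Y(i)-Y') for two small, made-up value lists; the function name deviation_products and the sample values are purely illustrative.

```python
def deviation_products(xs, ys):
    """Return the products (X(i)-X').(Y(i)-Y') for each example."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return [(x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)]

# Made-up values: Y rises with X, so no product is negative.
same_direction = deviation_products([1, 2, 3, 4, 5], [10, 20, 30, 40, 50])

# Made-up values: Y falls as X rises, so no product is positive.
opposite_direction = deviation_products([1, 2, 3, 4, 5], [50, 40, 30, 20, 10])

print(same_direction)
print(opposite_direction)
```

Because the correlation is the sum of these products divided by a positive quantity, the sign of the sum determines the sign of the correlation.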

This operator can be used for creating a correlation matrix that shows the correlations of all the attributes of the input ExampleSet. Please note that this operator performs a data scan for each attribute combination and might therefore take some time for non-memory ExampleSets. The operator can also return an attribute weights vector based on the correlations. Using this weights vector, highly correlated attributes can be removed from the ExampleSet with the help of the Select by Weights operator. A simpler way to remove highly correlated attributes is to use the Remove Correlated Attributes operator. Correlated attributes are usually removed because they behave similarly and have a similar impact on prediction calculations, so keeping all of them is redundant. Removing correlated attributes saves storage space and reduces the computation time of complex algorithms.
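To make the idea of a correlation matrix and of removing redundant attributes concrete, here is a minimal Python sketch. It does not reproduce how Correlation Matrix, Select by Weights or Remove Correlated Attributes work internally: the function names, the 0.95 threshold and the greedy "keep the first attribute of a correlated pair" strategy are all assumptions made for illustration.

```python
import math

def correlation(xs, ys):
    """Correlation of two equal-length sequences (the n-1 factors cancel out)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant attribute")
    return cov / math.sqrt(var_x * var_y)

def correlation_matrix(columns):
    """columns maps attribute name -> list of values; returns a nested dict.

    One pass over the data per attribute pair, as described in the text.
    """
    names = list(columns)
    return {a: {b: correlation(columns[a], columns[b]) for b in names}
            for a in names}

def drop_correlated(columns, threshold=0.95):
    """Greedy sketch: keep an attribute only if its absolute correlation
    with every attribute kept so far is at most the (assumed) threshold."""
    matrix = correlation_matrix(columns)
    kept = []
    for name in columns:
        if all(abs(matrix[name][other]) <= threshold for other in kept):
            kept.append(name)
    return kept

# Hypothetical data: "b" is almost a multiple of "a", "c" is not.
data = {"a": [1, 2, 3, 4], "b": [2, 4, 6, 8.5], "c": [4, 1, 3, 2]}
print(drop_correlated(data))
```

With this made-up data, "b" is dropped because it is nearly a linear function of "a", while "a" and "c" are kept.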

Input

example set (Data Table)
This input port expects an ExampleSet. It is the output of the Retrieve operator in the attached Example Process. The output of other operators can also be used as input.

Output

example set (Data Table)
The ExampleSet that was given as input is passed unchanged to the output through this port. This is usually used to reuse the same ExampleSet in further operators or to view the ExampleSet in the Results Workspace.
matrix (Numerical Matrix)
The correlations of all attributes of the input ExampleSet are calculated and the resultant correlation matrix is returned from this port.
weights (Attribute Weights)
The attribute weights vector based on the correlations of the attributes is delivered through this output port.

Parameters

normalize_weights
This parameter indicates if the weights of the resultant attribute weights vector should be normalized. If set to true, all weights are normalized such that the minimum weight is 0 and the maximum weight is 1. Range: boolean
squared_correlation
This parameter indicates if the squared correlation should be calculated. If set to true, the correlation matrix shows squares of correlations instead of simple correlations. Range: boolean
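The effect of the two parameters can be sketched as follows. This is an illustration of the described behavior only, not the operator's code: the function names and the input values are hypothetical, and the handling of the case where all weights are equal is an assumption, since the description does not cover it.

```python
def normalize_weights(weights):
    """Rescale weights so that the minimum becomes 0 and the maximum becomes 1."""
    low = min(weights.values())
    high = max(weights.values())
    if high == low:
        # Assumption: with identical weights there is no range to rescale,
        # so this sketch maps every weight to 0.
        return {name: 0.0 for name in weights}
    return {name: (w - low) / (high - low) for name, w in weights.items()}

def square_correlations(matrix):
    """Replace every correlation in a matrix (list of rows) by its square."""
    return [[value * value for value in row] for row in matrix]

# Hypothetical inputs, chosen only to show the transformation.
print(normalize_weights({"a": 2.0, "b": 4.0, "c": 3.0}))
print(square_correlations([[1.0, -0.5], [-0.5, 1.0]]))
```

Squaring removes the sign, so a squared correlation matrix shows only the strength of each association, not its direction.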

Tutorial Processes

Correlation matrix of the Golf data set

The 'Golf' data set is loaded using the Retrieve operator. A breakpoint is inserted here so that you can view the ExampleSet. As you can see, the ExampleSet has four regular attributes: 'Outlook', 'Temperature', 'Humidity' and 'Wind'. The Correlation Matrix operator is applied to it. The weights vector generated by this operator is provided to the Select by Weights operator along with the 'Golf' data set. The parameters of the Select by Weights operator are adjusted such that the attributes with weights greater than 0.5 are selected and all other attributes are removed. This is why the resultant ExampleSet does not have the 'Temperature' attribute (weight=0). The correlation matrix, the weights vector and the resultant ExampleSet can be viewed in the Results Workspace.