Singular Value Decomposition (RapidMiner Studio Core)

Synopsis

This operator performs a dimensionality reduction of the given ExampleSet based on Singular Value Decomposition (SVD). The user can specify the required number of dimensions or specify the cumulative variance threshold. In the latter case all components having cumulative variance above this threshold are discarded.

Description

Singular Value Decomposition (SVD) can be used to better understand an ExampleSet by showing how many of its dimensions are important. It can also be used to simplify the ExampleSet by reducing its number of attributes. This reduction removes unnecessary attributes that are linearly dependent in the linear-algebra sense. It is useful when you have collected data on a number of attributes (possibly a large number) and believe that there is some redundancy among them. Here, redundancy means that some of the attributes are correlated with one another, possibly because they measure the same construct. Because of this redundancy, it should be possible to reduce the observed attributes to a smaller number of components (artificial attributes) that account for most of the variance in the observed attributes. For example, imagine an ExampleSet with one attribute that stores the temperature of several water samples and another that stores the state of each sample (solid, liquid or gas). The second attribute clearly depends on the first, so SVD could show that it is not important for the analysis.
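The idea of redundancy can be illustrated outside RapidMiner with a small sketch. It assumes NumPy and synthetic data, and the helper name singular_values is hypothetical. It does not reproduce the operator's implementation. When one attribute is an exact linear combination of the others, one singular value drops to (numerically) zero, which shows that the data has fewer important dimensions than attributes.

```python
import numpy as np

def singular_values(data):
    """Return the singular values of a numeric data matrix (rows = examples)."""
    return np.linalg.svd(np.asarray(data, dtype=float), compute_uv=False)

# Synthetic data (an assumption for illustration): two independent attributes
# plus a third that is an exact linear combination of the first two.
rng = np.random.default_rng(0)
a = rng.normal(size=100)
b = rng.normal(size=100)
redundant = np.column_stack([a, b, 2.0 * a - b])

sv = singular_values(redundant)

# Count singular values that are not negligible relative to the largest one.
effective_rank = int(np.sum(sv > 1e-10 * sv[0]))
print("singular values:", sv)
print("important dimensions:", effective_rank)
```

With real data the redundancy is rarely exact, so the smallest singular values are small rather than zero, and the cut-off becomes a judgment call.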

RapidMiner provides various dimensionality reduction operators, for example the Principal Component Analysis operator. The Principal Component Analysis (PCA) technique is a specific case of SVD. It is a mathematical procedure that uses an orthogonal transformation to convert a set of observations of possibly correlated attributes into a set of values of uncorrelated attributes called principal components. The number of principal components is less than or equal to the number of original attributes. The transformation is defined so that the first principal component has the highest possible variance (it accounts for as much of the variability in the data as possible), and each succeeding component in turn has the highest variance possible under the constraint that it is orthogonal to (uncorrelated with) the preceding components.
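The relationship between PCA and SVD can be checked numerically. The following is a minimal sketch, assuming NumPy and synthetic data, with a hypothetical helper pca_via_svd. It computes principal components by applying SVD to mean-centered data and compares the resulting variances with the eigenvalues of the covariance matrix. It illustrates the mathematics only and is not RapidMiner's code.

```python
import numpy as np

def pca_via_svd(data):
    """Compute PCA scores, component variances and directions using SVD."""
    x = np.asarray(data, dtype=float)
    centered = x - x.mean(axis=0)            # PCA works on mean-centered data
    _, s, vt = np.linalg.svd(centered, full_matrices=False)
    variances = s ** 2 / (x.shape[0] - 1)    # variance along each component
    scores = centered @ vt.T                 # data expressed in component space
    return scores, variances, vt

# Synthetic correlated attributes (an assumption for illustration).
rng = np.random.default_rng(1)
data = rng.normal(size=(50, 4)) @ rng.normal(size=(4, 4))

scores, variances, components = pca_via_svd(data)

# Reference: eigenvalues of the sample covariance matrix, largest first.
eigvals = np.sort(np.linalg.eigvalsh(np.cov(data, rowvar=False)))[::-1]

print(np.allclose(variances, eigvals))
print(np.allclose(np.cov(scores, rowvar=False), np.diag(variances)))
```

The second check confirms that the components are uncorrelated: the covariance matrix of the scores is diagonal, and its diagonal holds the component variances in decreasing order.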

Differentiation

Principal Component Analysis

PCA is a dimensionality reduction procedure. PCA is a specific case of SVD.

Input

example set input (Data Table)
This input port expects an ExampleSet. It is the output of the Retrieve operator in the attached Example Process. The output of other operators can also be used as input. The input data must have meta data attached, because the attributes are specified in the meta data. The Retrieve operator provides meta data along with the data. Please note that this operator cannot handle nominal attributes; it works only on numerical attributes.

Output

example set output (Data Table)
The Singular Value Decomposition is performed on the input ExampleSet and the resultant ExampleSet is delivered through this port.
original (Data Table)
The ExampleSet that was given as input is passed unchanged to the output through this port. This is usually used to reuse the same ExampleSet in further operators or to view the ExampleSet in the Results Workspace.
preprocessing model (Preprocessing Model)
This port delivers the preprocessing model, which has information regarding the parameters of this operator in the current process.

Parameters

dimensionality_reduction
This parameter indicates which type of dimensionality reduction (reduction in number of attributes) should be applied.
- none: if this option is selected, dimensionality reduction is not performed.
- keep_percentage: if this option is selected, all the components with a cumulative variance greater than the given threshold are removed from the ExampleSet. The threshold is specified by the percentage threshold parameter.
- fixed_number: if this option is selected, only a fixed number of components are kept. The number of components to keep is specified by the dimensions parameter.
Range: selection
percentage_threshold
This parameter is only available when the dimensionality reduction parameter is set to 'keep percentage'. All the components with a cumulative variance greater than the percentage threshold are removed from the ExampleSet.
Range: real
dimensions
This parameter is only available when the dimensionality reduction parameter is set to 'fixed number'. The number of components to keep is specified by the dimensions parameter.
Range: integer
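How the three options translate into a number of retained components can be sketched as follows. This is a literal reading of the descriptions above, not the operator's source code. The function name components_to_keep is hypothetical. Measuring each component's variance share by its squared singular value is an assumption, and so is the handling of a component that lands exactly on the threshold.

```python
import numpy as np

def components_to_keep(singular_values, mode="none",
                       percentage_threshold=0.95, dimensions=1):
    """Return how many leading components are kept for a given mode.

    Assumption: a component's variance share is proportional to its
    squared singular value; the operator may measure this differently.
    """
    s = np.asarray(singular_values, dtype=float)
    if mode == "none":
        return len(s)                      # no reduction at all
    if mode == "fixed_number":
        return min(dimensions, len(s))     # cannot keep more than exist
    if mode == "keep_percentage":
        share = s ** 2 / np.sum(s ** 2)
        cumulative = np.cumsum(share)
        # Components whose cumulative variance exceeds the threshold are removed.
        return int(np.sum(cumulative <= percentage_threshold + 1e-12))
    raise ValueError(f"unknown mode: {mode}")

example = [2.0, 1.0, 1.0]   # cumulative shares: about 0.667, 0.833, 1.0
print(components_to_keep(example, "none"))
print(components_to_keep(example, "fixed_number", dimensions=1))
print(components_to_keep(example, "keep_percentage", percentage_threshold=0.9))
```

Under this reading, a threshold lower than the first component's share would keep no components at all, so an actual implementation may well keep at least one.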

Tutorial Processes

Dimensionality reduction of the Sonar data set using the Singular Value Decomposition operator

The 'Sonar' data set is loaded using the Retrieve operator. A breakpoint is inserted here so that you can have a look at the ExampleSet. You can see that the ExampleSet has 60 attributes. The Singular Value Decomposition operator is applied to the 'Sonar' data set. The dimensionality reduction parameter is set to 'fixed number' and the dimensions parameter is set to 10. Thus the resultant ExampleSet will be composed of 10 dimensions (artificial attributes). You can see the resultant ExampleSet in the Results Workspace and verify that it has only 10 attributes. Please note that these attributes are not original attributes of the 'Sonar' data set; they were created by the SVD procedure.
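The same kind of reduction can be approximated outside RapidMiner. The sketch below assumes NumPy and uses a random 60-attribute matrix as a stand-in, not the actual 'Sonar' data. The helper svd_reduce is hypothetical. Applying SVD to the raw, uncentered data is also an assumption, since the preprocessing the operator applies is not described here.

```python
import numpy as np

def svd_reduce(data, dimensions):
    """Project the data onto its first `dimensions` right singular vectors."""
    x = np.asarray(data, dtype=float)
    _, _, vt = np.linalg.svd(x, full_matrices=False)
    return x @ vt[:dimensions].T          # one artificial attribute per column

# Stand-in data with 60 numerical attributes (row count chosen arbitrarily).
rng = np.random.default_rng(42)
stand_in = rng.random(size=(200, 60))

reduced = svd_reduce(stand_in, dimensions=10)
print(stand_in.shape, "->", reduced.shape)
```

Each of the 10 resulting columns is a linear combination of all 60 original attributes, which is why none of them matches an original attribute.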