Unsupervised Feature Selection (Model Simulator)

Synopsis

This operator performs a fully automated feature selection for centroid-based clustering techniques like k-Means.

Description

This is a new operator for simpler automatic feature selection for unsupervised learning. It has much simpler settings and is more robust than the existing feature engineering operators. The operator also supports multi-objective feature selection and lets you define a balance value between 1 (few features) and 0 (most features, i.e. the cluster model closest to the original cluster model built on all the input data). Based on this setting, the final solution is selected from the Pareto front. As a rule of thumb, a value of 0.5 roughly halves the number of features.
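The role of the balance value can be illustrated with a minimal sketch. The selection rule below is an assumption made for illustration, not the operator's documented behavior: it maps the balance linearly onto the range of feature counts found on the Pareto front and returns the solution whose size is closest to that target. The function name pick_by_balance and the tuple layout (number of features, validation score) are hypothetical.

```python
def pick_by_balance(front, balance):
    """Pick one solution from a Pareto front using a balance in [0, 1].

    front: list of (n_features, score) tuples (hypothetical layout).
    balance: 1 favors few features, 0 favors most features.
    ASSUMPTION: linear mapping of balance to a target feature count;
    the operator's real selection rule may differ.
    """
    if not front:
        raise ValueError("front must not be empty")
    if not 0.0 <= balance <= 1.0:
        raise ValueError("balance must be between 0 and 1")
    sizes = [n for n, _ in front]
    smallest, largest = min(sizes), max(sizes)
    # balance = 0 -> target is the largest set, balance = 1 -> the smallest
    target = largest - balance * (largest - smallest)
    # Ties are broken in favor of the smaller feature set
    return min(front, key=lambda sol: (abs(sol[0] - target), sol[0]))


# Made-up front for a data set with 60 features
front = [(1, 0.4), (15, 0.6), (30, 0.8), (45, 1.0), (60, 1.2)]
print(pick_by_balance(front, 0.5))
```

Under this assumed rule, a balance of 0.5 on a front ranging from 1 to 60 features lands near 30 features, which matches the rule of thumb stated above.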

IMPORTANT: Unlike other optimization operators in RapidMiner, this one only works with a specific cluster validation measure and only with centroid-based cluster models. It therefore requires the inner process to deliver such a cluster model together with the clustered data. The k-Means operator, for example, generates both outputs directly.

The two basic working modes are "no selection" and "selection". In the first mode, the resulting feature set describes the complete input example set. In the second mode, the resulting feature set describes a subset of the input features. In both cases, other data sets (such as scoring or validation data) can be brought to the same format with the Apply Feature Set operator.
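Conceptually, bringing another data set to the same format means keeping only the selected features. The following is a minimal sketch of that idea, not the implementation of Apply Feature Set. Rows are modeled as plain dictionaries, and the function name apply_feature_set and the feature names a1 to a3 are hypothetical.

```python
def apply_feature_set(rows, feature_set):
    """Keep only the selected features, in the order given by feature_set.

    rows: list of dicts mapping feature name -> value (assumed layout).
    feature_set: list of selected feature names.
    """
    projected = []
    for index, row in enumerate(rows):
        missing = [name for name in feature_set if name not in row]
        if missing:
            # The new data must contain every selected feature
            raise KeyError(f"row {index} lacks features: {missing}")
        projected.append({name: row[name] for name in feature_set})
    return projected


selected = ["a1", "a3"]  # hypothetical result of the feature selection
scoring_data = [
    {"a1": 0.2, "a2": 0.9, "a3": 0.5},
    {"a1": 0.7, "a2": 0.1, "a3": 0.3},
]
print(apply_feature_set(scoring_data, selected))
```

Raising an error for missing features, rather than filling in defaults, is a design choice of this sketch: a cluster model trained on the selected features cannot be applied meaningfully if one of them is absent.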

The operator uses a multi-objective evolutionary algorithm to find the best feature sets. Each feature set is Pareto-optimal with respect to complexity vs. model validation. The complexity is calculated from the feature set, where each feature in the set contributes a complexity of one. The cluster model performance is measured by the Davies-Bouldin index, which this operator calculates automatically. Lower values of the Davies-Bouldin index indicate better cluster separation.
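For readers unfamiliar with the measure, here is a minimal sketch of the textbook Davies-Bouldin index using Euclidean distances. It is meant to show why lower values indicate better separation. Whether the operator uses exactly this variant is an assumption, and the function name davies_bouldin is hypothetical. The sketch requires Python 3.8 or later for math.dist.

```python
import math


def davies_bouldin(points, labels):
    """Textbook Davies-Bouldin index (lower is better).

    points: list of equal-length numeric tuples.
    labels: cluster label for each point.
    ASSUMPTION: Euclidean distance and mean-based centroids.
    """
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    if len(clusters) < 2:
        raise ValueError("at least two clusters are required")

    # Centroid: coordinate-wise mean of the cluster's points
    centroids = {
        label: tuple(sum(coords) / len(members) for coords in zip(*members))
        for label, members in clusters.items()
    }
    # Scatter: average distance of a cluster's points to its centroid
    scatter = {
        label: sum(math.dist(p, centroids[label]) for p in members) / len(members)
        for label, members in clusters.items()
    }
    # For each cluster, take its worst (largest) similarity to any other
    # cluster; coinciding centroids would cause a division by zero here.
    total = 0.0
    for i in clusters:
        total += max(
            (scatter[i] + scatter[j]) / math.dist(centroids[i], centroids[j])
            for j in clusters
            if j != i
        )
    return total / len(clusters)


loose = [(0, 0), (0, 2), (10, 0), (10, 2)]
tight = [(0, 0.5), (0, 1.5), (10, 0.5), (10, 1.5)]
labels = [0, 0, 1, 1]
print(davies_bouldin(loose, labels), davies_bouldin(tight, labels))
```

In this example both clusterings have the same centroids, but the tighter one has a smaller scatter and therefore a lower index.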

The first output is the best feature set from the Pareto set according to the balancing parameter. The second output is the complete final population of the optimization run, i.e. the full Pareto front of all optimal trade-offs between complexity and model errors. Finally, the log data of the best error rates, the smallest feature set size, and the largest feature set size for all generations is also delivered for plotting purposes.
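The notion of a Pareto front of trade-offs can be sketched generically. The code below keeps every candidate that no other candidate dominates. It assumes, purely for illustration, that each candidate is a pair of costs (complexity, error) and that both are to be minimized. The function name pareto_front and the sample values are hypothetical, and the operator's internal representation may differ.

```python
def pareto_front(solutions):
    """Return the non-dominated solutions.

    solutions: list of (complexity, error) tuples.
    ASSUMPTION: both values are costs to be minimized.
    """

    def dominates(a, b):
        # a dominates b if it is no worse in both objectives
        # and strictly better in at least one of them
        no_worse = a[0] <= b[0] and a[1] <= b[1]
        strictly_better = a[0] < b[0] or a[1] < b[1]
        return no_worse and strictly_better

    return [
        candidate
        for candidate in solutions
        if not any(dominates(other, candidate) for other in solutions)
    ]


# Made-up candidates: (number of features, error)
candidates = [(1, 0.9), (2, 0.7), (2, 0.8), (3, 0.7), (4, 0.5)]
print(pareto_front(candidates))
```

Here (2, 0.8) and (3, 0.7) are dropped because (2, 0.7) is at least as good in both objectives and strictly better in one. The remaining solutions are genuine trade-offs, since improving one objective means giving up some of the other.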

Input

example set in (Data Table)
This input port expects a data set which is used as training data to create the best feature set.

Output

feature set
The resulting optimal feature set selected from the optimal trade-offs based on the balance parameter.
population
All optimal trade-offs between error rates and complexity.
optimization log (Data Table)
A table with log data about the optimization run.

Parameters

mode The mode for the feature engineering: keep all original features ("no selection") or perform feature selection ("selection"). Range: selection
balance for accuracy Defines a balance between 1 (few features) and 0 (most features) to pick the final solution. Range: real
show progress dialog Indicates if a dialog should be shown during the optimization, offering details about the optimization progress. This should not be used if the process is run on systems without a graphical user interface, but it can be useful during process testing. Range: boolean
use optimization heuristics Indicates if heuristics should be used to determine a good population size and maximum number of generations. Range: boolean
use time limit Indicates if a time limit should be used to stop the optimization. Range: boolean
time limit in seconds The number of seconds after which the optimization will be stopped. Range: integer

Tutorial Processes

Finding feature sets and applying them

This process performs an unsupervised feature selection for k-Means clustering on the Sonar data set, which has 60 features in total. As you can see in the results, a cluster model has been generated for almost all possible complexities, where fewer features naturally lead to denser and therefore better clusterings. Only the user can decide at what point too much information has been omitted and the clusters no longer make sense. This should drive the choice of the balance point. In this tutorial, we set the balance to 0.5, which leads to about half of the features.

The optimal feature set is then applied to the input data, and the final cluster model is created on that data. Of course, this feature set could also be applied to new data sets. Finally, we also visualize the cluster model.
