Automatic Feature Engineering (Model Simulator)

Synopsis

This operator performs a fully automated feature engineering process which covers feature selection and feature generation.

Description

This is a new operator for simpler automatic feature engineering. It provides much simpler settings and is more robust than the existing feature engineering operators. It also supports multi-objective feature engineering and lets you define a balance value between 0 (simplest) and 1 (most accurate, i.e. the model with the lowest error rate, see below) to pick the final solution from the Pareto front.

IMPORTANT: Unlike other optimization operators in RapidMiner, this one requires the inner performance to be an error measurement, i.e. a performance criterion which should be minimized. Using measurements like accuracy would lead to wrong results. This ensures consistent behaviour of the operator between classification and regression tasks and avoids maximizing "negative" error rates, which is often confusing to users.
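As a small illustration of the difference between a criterion to be maximized and one to be minimized, the sketch below computes a classification error and converts an accuracy into an error. It is plain Python, not RapidMiner code; the function names and the example labels are assumptions made for this illustration.

```python
def classification_error(y_true, y_pred):
    """Fraction of misclassified examples; lower values mean better models."""
    if len(y_true) != len(y_pred):
        raise ValueError("label lists must have the same length")
    if not y_true:
        raise ValueError("cannot compute an error on an empty set")
    wrong = sum(1 for t, p in zip(y_true, y_pred) if t != p)
    return wrong / len(y_true)


def accuracy_to_error(accuracy):
    """Turn an accuracy (to be maximized) into an error (to be minimized)."""
    return 1.0 - accuracy


# Hypothetical labels, for illustration only.
y_true = ["yes", "no", "yes", "yes"]
y_pred = ["yes", "no", "no", "yes"]
error = classification_error(y_true, y_pred)
```

For regression tasks, a criterion such as the root mean squared error is already an error measurement and needs no conversion.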

The three basic working modes are "no selection", "only selection", and "selection and generation". In the first mode, the resulting feature set describes the complete input example set. In the second mode, the resulting feature set describes a subset of the input features. In the third mode, the resulting feature set describes a subset of the input features and / or newly generated features. In all three cases, other data sets (like scoring or validation data) can be brought to the same format by using the operator Apply Feature Set.
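The idea of bringing other data sets to the same format can be sketched as follows. This is not the implementation of Apply Feature Set; it assumes, purely for illustration, that a feature set is a mapping from a feature name to a function of a data row, and all column names are hypothetical.

```python
def apply_feature_set(rows, feature_set):
    """Return new rows that contain exactly the features in feature_set."""
    return [
        {name: compute(row) for name, compute in feature_set.items()}
        for row in rows
    ]


# Hypothetical feature set: one selected original feature, one generated feature.
feature_set = {
    "age": lambda row: row["age"],
    "income_per_age": lambda row: row["income"] / row["age"],
}

training = [{"age": 40, "income": 80000, "zip": "10115"}]
scoring = [{"age": 25, "income": 50000, "zip": "80331"}]

# Both data sets end up with the same columns in the same order.
train_out = apply_feature_set(training, feature_set)
score_out = apply_feature_set(scoring, feature_set)
```

The unselected column "zip" is dropped from both results, and the generated column is computed in the same way for training and scoring data.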

The operator uses a multi-objective evolutionary algorithm to find the best feature sets. Each resulting feature set is Pareto-optimal with respect to complexity vs. model error. The complexity is calculated from the feature set, where each feature in the set contributes a complexity of one. The same applies to each additional function application in the case of feature generation. The error rate is measured by the performance calculation delivered by the inner operators. Please make sure that the delivered criterion is actually an error rate, i.e. a performance criterion which needs to be minimized for better models.
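The complexity count and the notion of Pareto-optimality described above can be sketched in a few lines of Python. The sketch only filters a fixed list of candidates; it does not implement the evolutionary search. The Candidate type, its field names, and the example values are assumptions made for this illustration.

```python
from typing import NamedTuple


class Candidate(NamedTuple):
    name: str
    n_features: int
    n_function_applications: int
    error: float

    @property
    def complexity(self):
        # One unit per feature plus one unit per function application.
        return self.n_features + self.n_function_applications


def dominates(a, b):
    """True if a is no worse than b in both objectives and better in one."""
    no_worse = a.complexity <= b.complexity and a.error <= b.error
    strictly_better = a.complexity < b.complexity or a.error < b.error
    return no_worse and strictly_better


def pareto_front(candidates):
    """Keep only candidates that no other candidate dominates."""
    front = [
        c for c in candidates
        if not any(dominates(other, c) for other in candidates)
    ]
    return sorted(front, key=lambda c: c.complexity)


# Hypothetical candidates, for illustration only.
candidates = [
    Candidate("A", 1, 0, 0.30),
    Candidate("B", 2, 0, 0.22),
    Candidate("C", 2, 1, 0.25),  # dominated by B
    Candidate("D", 3, 2, 0.15),
    Candidate("E", 4, 2, 0.18),  # dominated by D
]
front = pareto_front(candidates)
```

Here the front consists of A, B, and D: each of them can only be improved in one objective by getting worse in the other.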

The first output is the best feature set from the Pareto set according to the balancing parameter. The second output is the complete final population of the optimization run, i.e. the full Pareto front of all optimal trade-offs between complexity and model error. Finally, the log data of the best error rate, smallest feature set size, and largest feature set size for all generations is also delivered for plotting purposes.
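How a balance value can select one solution from a Pareto front is sketched below. The exact rule used by the operator is not documented here, so the weighted sum of normalized error and normalized complexity is an assumption chosen only because it reproduces the two documented end points: 0 picks the simplest solution and 1 picks the most accurate one.

```python
def pick_by_balance(front, balance):
    """Pick one (complexity, error) pair from a Pareto front.

    Assumed rule: minimize balance * normalized error
    + (1 - balance) * normalized complexity.
    """
    if not 0.0 <= balance <= 1.0:
        raise ValueError("balance must be between 0 and 1")
    if not front:
        raise ValueError("front must not be empty")

    complexities = [c for c, _ in front]
    errors = [e for _, e in front]

    def normalize(value, values):
        lo, hi = min(values), max(values)
        return 0.0 if hi == lo else (value - lo) / (hi - lo)

    def score(solution):
        complexity, error = solution
        return (balance * normalize(error, errors)
                + (1.0 - balance) * normalize(complexity, complexities))

    return min(front, key=score)


# Hypothetical Pareto front as (complexity, error) pairs.
front_pairs = [(1, 0.30), (2, 0.22), (5, 0.15)]
simplest = pick_by_balance(front_pairs, 0.0)
most_accurate = pick_by_balance(front_pairs, 1.0)
compromise = pick_by_balance(front_pairs, 0.5)
```

With these example values, a balance of 0.5 selects the middle solution with complexity 2, since it gives up little accuracy for a much smaller feature set.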

Input

example set in (Data Table)
This input port expects a data set which is used as training data to create the best feature set.

Output

feature set
The resulting optimal feature set selected from the optimal trade-offs based on the balance parameter.
population
All optimal trade-offs between error rates and complexity.
optimization log (Data Table)
A table with log data about the optimization run.

Parameters

mode The mode for the feature engineering: keep all original features, feature selection, feature selection and generation. Range: selection
balance for accuracy Defines a balance between 0 (most simple feature set) and 1 (most accurate feature set) to pick the final solution. Range: real
show progress dialog Indicates if a dialog should be shown during the optimization offering details about the optimization progress. This should not be used if the process is run on systems without a graphical user interface, but it can be useful during process testing. Range: boolean
use optimization heuristics Indicates if heuristics should be used to determine a good population size and maximum number of generations. Range: boolean
use time limit Indicates if a time limit should be used to stop the optimization. Range: boolean
time limit in seconds The number of seconds after which the optimization will be stopped. Range: integer
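The effect of a time limit on a generation-based optimization can be sketched as a loop that checks the elapsed time before each generation. This is an illustration only and not the operator's implementation; the function optimize, its parameters, and the injectable clock are assumptions.

```python
import time


def optimize(step, max_generations, time_limit_seconds=None,
             clock=time.monotonic):
    """Run step() once per generation until a stopping criterion is met.

    Returns the number of completed generations. The time limit is only
    checked between generations, so a running generation is not interrupted.
    """
    start = clock()
    completed = 0
    for _ in range(max_generations):
        if (time_limit_seconds is not None
                and clock() - start >= time_limit_seconds):
            break
        step()
        completed += 1
    return completed


# Without a time limit, all generations are run.
generations_run = optimize(lambda: None, max_generations=5)
```

Passing time_limit_seconds=60 would stop the loop once at least 60 seconds have passed, even if fewer than max_generations generations were completed.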

Tutorial Processes

Finding feature sets and applying them

This process creates an optimal feature set which is then applied to the complete training data to build the final model. The same feature set is also applied to an independent validation set before the prediction model is applied.

Categories

Versions