Wrapper-X-Validation (RapidMiner Studio Core)
Synopsis
This operator performs a cross-validation in order to evaluate the performance of a feature weighting or selection scheme. It is mainly used for estimating how accurately a scheme will perform in practice.Description
The Wrapper-X-Validation operator is a nested operator. It has three subprocesses: an Attribute Weighting subprocess, a Model Building subprocess and a Model Evaluation subprocess. The Attribute Weighting subprocess contains the algorithm to be evaluated. It must return an attribute weights vector which is then applied on the training data set. The Model Building subprocess is used for training a new model in each iteration. This model is trained on the same training data set that was used in the first subprocess. But the training data set for this subprocess does not contain those attributes that had weight 0 in the weights vector of the first subprocess. The trained model is then applied and evaluated in the Model Evaluation subprocess. The model is tested on the testing data set. This subprocess must return a performance vector. This performance vector serves as a performance indicator of the actual algorithm.
The input ExampleSet is partitioned into k subsets of equal size. Of the k subsets, a single subset is retained as the testing data set (i.e. input of the third subprocess), and the remaining k − 1 subsets are used as training data set (i.e. input of the first two subprocesses). The cross-validation process is then repeated k times, with each of the k subsets used exactly once as the testing data. The k results from the k iterations then can be averaged (or otherwise combined) to produce a single estimation. The value k can be adjusted using the number of validations parameter. Please study the attached Example Process for more information.
Just as for learning, it is also possible that overfitting occurs during preprocessing. In order to estimate the generalization performance of a preprocessing method RapidMiner supports several validation operators for preprocessing steps. The basic idea is the same as for all other validation operators with a slight difference: the first inner operator must produce a transformed example set, the second must produce a model from this transformed data set and the third operator must produce a performance vector of this model on a test set transformed in the same way.
Input
- example set in (Data Table)
This input port expects an ExampleSet. Subsets of this ExampleSet will be used as training and testing data sets.
Output
- performance vector out (Performance Vector)
The Model Evaluation subprocess must return a Performance Vector in each iteration. This is usually generated by applying the model and measuring its performance. Two such ports are provided but more can also be used if required. Please note that the statistical performance calculated by this estimation scheme is only an estimate (instead of an exact calculation) of the performance which would be achieved with the model built on the complete delivered data set.
- attribute weights out (Attribute Weights)
The Attribute Weighting subprocess must return an attribute weights vector in each iteration. Please note that the attribute weights vector built on the complete input ExampleSet is delivered from this port.
Parameters
- leave_one_outAs the name suggests, the leave one out cross-validation involves using a single example from the original ExampleSet as the testing data, and the remaining examples as the training data. This is repeated such that each example in the ExampleSet is used once as the testing data. Thus, it is repeated 'n' number of times, where 'n' is the total number of examples in the ExampleSet. This is the same as applying the Batch-X-Validation operator with the number of validations parameter set equal to the number of examples in the original ExampleSet. This is usually very expensive for large ExampleSets from a computational point of view because the training process is repeated a large number of times (number of examples time). If set to true, the number of validations parameter is ignored. Range: boolean
- number_of_validationsThis parameter specifies the number of subsets the ExampleSet should be divided into (each subset has an equal number of examples). Also the same number of iterations will take place. If this is set equal to the total number of examples in the ExampleSet, it is equivalent to the Batch-X-Validation operator with the leave one out parameter set to true. Range: integer
- sampling_typeThe Batch-X-Validation operator can use several types of sampling for building the subsets. Following options are available:
- linear_sampling: The linear sampling simply divides the ExampleSet into partitions without changing the order of the examples i.e. subsets with consecutive examples are created.
- shuffled_sampling: The shuffled sampling builds random subsets of the ExampleSet. Examples are chosen randomly for making subsets.
- stratified_sampling: The stratified sampling builds random subsets and ensures that the class distribution in the subsets is the same as in the whole ExampleSet. For example, in the case of a binominal classification, stratified sampling builds random subsets such that each subset contains roughly the same proportions of the two values of class labels.
- use_local_random_seedThis parameter indicates if a local random seed should be used for randomizing examples of a subset. Using the same value of the local random seed will produce the same subsets. Changing the value of this parameter changes the way examples are randomized, thus subsets will have a different set of examples. This parameter is available only if Shuffled or Stratified sampling is selected. It is not available for Linear sampling because it requires no randomization, examples are selected in sequence. Range: boolean
- local_random_seedThis parameter specifies the local random seed. This parameter is available only if the use local random seed parameter is set to true. Range: integer
Tutorial Processes
Evaluating an attribute selection scheme
This Example Process starts with the Subprocess operator which provides an ExampleSet. A breakpoint is inserted here so that you can have a look at the ExampleSet. You can see that there are 60 examples, uniquely identified by the id attribute. There are 6 attributes in the ExampleSet. The Wrapper-X-Validation operator is applied on this ExampleSet for evaluating an attribute selection scheme. The scheme to be evaluated is placed in the Attribute Weighting subprocess of the Wrapper-X-Validation operator. The Optimize Selection operator is used in this Example Process. Its subprocess is not discussed here for the sake of simplicity.
Have a look at the parameters of the Wrapper-X-Validation operator. The number of validations parameter is set to 6 and the sampling type parameter is set to 'linear sampling'. Thus the given ExampleSet will be broken into 6 subsets linearly (i.e. each subset will have consecutive examples). The Wrapper-X-Validation operator will have 6 iterations. In every iteration 5 out of 6 subsets will serve as the training data set and the remaining subset will serve as the testing subset.
The following steps are followed in every iteration: The Attribute Weighting subprocess trains an attribute selection scheme using the training data set. The Model Building subprocess receives the training data set but with only those attributes that had non-zero weight in the resultant weights vector of the first subprocess. A model is trained using this data set. The Model Evaluation subprocess tests this model on the testing data set and delivers a performance vector.
Breakpoints are inserted at the following places in the process: Before the Attribute Weighting subprocess so that you can see the training data set of the iteration. After the Attribute Weighting subprocess so that you can see the attribute weights vector. Before the Model Building subprocess so that you can see the training data set (without attributes that had 0 weight) that will be used for training the model. Before the Model Evaluation subprocess so that you can see the testing data set of the iteration.