Split Validation (RapidMiner Studio Core)

Synopsis

This operator performs a simple validation i.e. randomly splits up the ExampleSet into a training set and test set and evaluates the model. This operator performs a split validation in order to estimate the performance of a learning operator (usually on unseen data sets). It is mainly used to estimate how accurately a model (learnt by a particular learning operator) will perform in practice.

Description

The Split Validation operator is a nested operator. It has two subprocesses: a training subprocess and a testing subprocess. The training subprocess is used for learning or building a model. The trained model is then applied in the testing subprocess. The performance of the model is also measured during the testing phase.

The input ExampleSet is partitioned into two subsets. One subset is used as the training set and the other one is used as the test set. The size of two subsets can be adjusted through different parameters. The model is learned on the training set and is then applied on the test set. This is done in a single iteration, as compared to the Cross Validation operator that iterates a number of times using different subsets for testing and training purposes.

Usually the learning process optimizes the model parameters to make the model fit the training data as well as possible. If we then take an independent sample of testing data, it will generally turn out that the model does not fit the testing data as well as it fits the training data. This is called 'over-fitting', and is particularly likely to happen when the size of the training data set is small, or when the number of parameters in the model is large. Split Validation is a way to predict the fit of a model to a hypothetical testing set when an explicit testing set is not available. The Split Validation operator also allows training on one data set and testing on another explicit testing data set.

Input

training example set (Data Table)
This input port expects an ExampleSet for training a model (training data set). The same ExampleSet will be used during the testing subprocess for testing the model if no other data set is provided.

Output

model (Model)
The training subprocess must return a model, which is trained on the input ExampleSet. Please note that the model built on the complete input ExampleSet is delivered from this port.
training example set (Data Table)
The ExampleSet that was given as input at the training input port is passed without changing to the output through this port. This is usually used to reuse the same ExampleSet in further operators or to view the ExampleSet in the Results Workspace.
averagable (Performance Vector)
The testing subprocess must return a Performance Vector. This is usually generated by applying the model and measuring its performance. Two such ports are provided but more can also be used if required. Please note that the performance calculated by this estimation scheme is only an estimate (instead of an exact calculation) of the performance which would be achieved with the model built on the complete delivered data set.

Parameters

splitThis parameter specifies how the ExampleSet should be split
- relative: If a relative split is required, the relative size of the training set should be provided in the split ratio parameter. Afterwards the relative size of the test set is automatically calculated by subtracting the value of the split ratio from 1.
- absolute: If an absolute split is required, you have to specify the exact number of examples to use in the training or test set in the training set size parameter or in the test set size parameter. If either of these parameters is set to -1, its value is calculated automatically using the other one.
Range: selection
split_ratioThis parameter is only available when the split parameter is set to 'relative'. It specifies the relative size of the training set. It should be between 1 and 0, where 1 means that the entire ExampleSet will be used as training set. Range: real
training_set_sizeThis parameter is only available when the split parameter is set to 'absolute'. It specifies the exact number of examples to be used as training set. If it is set to -1, the test size set number of examples will be used for the test set and the remaining examples will be used as training set. Range: integer
test_set_sizeThis parameter is only available when the split parameter is set to 'absolute'. It specifies the exact number of examples to be used as test set. If it is set to -1, the training size set number of examples will be used for training set and the remaining examples will be used as test set. Range: integer
sampling_typeThe Split Validation operator can use several types of sampling for building the subsets. Following options are available:
- linear_sampling: The linear sampling simply divides the ExampleSet into partitions without changing the order of the examples i.e. subsets with consecutive examples are created.
- shuffled_sampling: The shuffled sampling builds random subsets of the ExampleSet. Examples are chosen randomly for making subsets.
- stratified_sampling: The stratified sampling builds random subsets and ensures that the class distribution in the subsets is the same as in the whole ExampleSet. For example, in the case of a binominal classification, stratified sampling builds random subsets such that each subset contains roughly the same proportions of the two values of class labels.
- automatic: The automated mode uses stratified sampling per default. If it isn't applicable, e.g., if the ExampleSet doesn't contain a nominal label, shuffled sampling will be used instead.
Range: selection
use_local_random_seedIndicates if a local random seedshould be used for randomizing examples of a subset. Using the same value of local random seed will produce the same subsets. Changing the value of this parameter changes the way examples are randomized, thus subsets will have a different set of examples. This parameter is only available if Shuffled or Stratified sampling is selected. It is not available for Linear sampling because it requires no randomization, examples are selected in sequence. Range: boolean
local_random_seedThis parameter specifies the local random seed. This parameter is only available if the use local random seed parameter is set to true. Range: integer

Tutorial Processes

Validating Models using Split Validation

The 'Golf' data set is loaded using the Retrieve operator. The Generate ID operator is applied on it to uniquely identify examples. This is done so that you can understand this process easily; otherwise IDs are not required here. A breakpoint is added after this operator so that you can preview the data before the Split Validation operator starts. Double click the Split Validation operator and you will see training and testing subprocesses. The Decision Tree operator is used in the training subprocess. The trained model (i.e. Decision Tree) is passed to the testing subprocess through the model ports. The testing subprocess receives testing data from the testing port.

Now, have a look at the parameters of the Split Validation operator. The split parameter is set to 'absolute'. The training set size parameter is set to 10 and the test set size parameter is set to -1. As there are 14 total examples in the 'Golf' data set, the test set automatically gets 4 remaining examples. The sampling type parameter is set to Linear Sampling. Remaining parameters have default values. Thus two subsets of the 'Golf' data set will be created. You will observe later that these two subsets are created: training set: examples with IDs 1 to 10 (10 examples) test set: examples with IDs 11 to 14 (4 examples)

You can see that all examples in a subset are consecutive (i.e. with consecutive IDs). This is because Linear Sampling is used.

Breakpoints are inserted to make you understand the process. Here is what happens when you run the process: First the 'Golf' data set is displayed with all rows uniquely identified using the ID attribute. There are 14 rows with ids 1 to 14. Press the green-colored Run button to continue. Now a Decision tree is shown. This was trained from the training set of the 'Golf' data set. Hit the Run button to continue. The Decision tree was applied on the testing data. Here you can see the results after application of the Decision Tree model. Have a look at IDs of the testing data here. They are 11 to 14. Compare the label and prediction columns and you will see that only 2 predictions out of 4 are correct (only ID 1 and 3 are correct predictions). Hit the Run button again. Now the Performance Vector of the Decision tree is shown. As only 2 out of 4 predictions were correct, the accuracy is 50%. Press the Run button again. Now you can see a different Decision tree. It was trained on the complete 'Golf' data set that is why it is different from the previous decision tree.

You can run the same process with different values of sampling type parameter. If linear sampling is used, as in our example process, you will see that IDs of examples in subsets will be consecutive values. If shuffled sampling is used you will see that IDs of examples in subsets will be random values. If stratified sampling is used you will see that IDs of examples in subsets will be random values but the class distribution in the subsets will be nearly the same as in the whole 'Golf' data set.

To get an understanding of how objects are passed using through ports please study the Example Process of Cross Validation operator.