Cross Validation (Concurrency)

Synopsis

This operator performs a cross-validation to estimate the statistical performance of a learning operator on unseen data. It is mainly used to estimate how accurately a model (learned by a particular learning operator) will perform in practice. It can also return the labeled data if desired.

Description

The Cross Validation operator is a nested operator. It has two subprocesses: a training subprocess and a testing subprocess. The training subprocess is used for training a model. The trained model is then applied in the testing subprocess. The performance of the model is also measured during the testing phase.

The input ExampleSet is partitioned into k subsets of equal size. Of the k subsets, a single subset is retained as the testing data set (i.e. the input of the testing subprocess), and the remaining k − 1 subsets are used as the training data set (i.e. the input of the training subprocess). The cross-validation process is then repeated k times, with each of the k subsets used exactly once as the testing data. The k results from the k iterations can then be averaged (or otherwise combined) to produce a single estimation. The value k can be adjusted using the number of folds parameter.
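The following is a minimal Python sketch of this procedure (illustrative only, not RapidMiner code; `examples`, `train_fn` and `eval_fn` are placeholders standing in for the data and for the training and testing subprocesses):

    # Minimal sketch of k-fold cross-validation (illustrative, not RapidMiner code).

    def k_fold_cross_validation(examples, k, train_fn, eval_fn):
        n = len(examples)
        # Contiguous folds of (almost) equal size, as produced by linear sampling.
        bounds = [round(n * i / k) for i in range(k + 1)]
        folds = [examples[bounds[i]:bounds[i + 1]] for i in range(k)]
        scores = []
        for i in range(k):
            testing = folds[i]  # each fold is used exactly once as testing data
            training = [ex for j in range(k) if j != i for ex in folds[j]]
            model = train_fn(training)               # training subprocess
            scores.append(eval_fn(model, testing))   # testing subprocess
        return sum(scores) / k  # average the k results to a single estimation

With k equal to the number of examples, this reduces to leave-one-out validation (see the leave_one_out parameter below).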

A learning process usually optimizes the model to fit the training data as well as possible. If such a model is tested on an independent data set, it usually does not perform as well on the testing data as it did on the data used to generate it. This is called 'overfitting'. The Cross Validation operator estimates how well a model will fit unseen testing data, which is especially useful when no separate testing data is available.
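As a toy illustration of overfitting (plain NumPy, unrelated to the operator itself): an over-flexible model reproduces its training data almost perfectly but typically does much worse on fresh data drawn from the same distribution.

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.uniform(-1, 1, 20)
    y = x ** 2 + rng.normal(0, 0.1, 20)          # noisy quadratic relationship

    coeffs = np.polyfit(x, y, deg=15)            # over-flexible degree-15 polynomial
    train_mse = np.mean((np.polyval(coeffs, x) - y) ** 2)

    x_new = rng.uniform(-1, 1, 20)               # independent data, same distribution
    y_new = x_new ** 2 + rng.normal(0, 0.1, 20)
    test_mse = np.mean((np.polyval(coeffs, x_new) - y_new) ** 2)

    print(train_mse, test_mse)                   # test error is typically far larger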

Input

  • example_set (Data Table)

    This input port receives the ExampleSet on which the cross-validation is performed.

Output

  • model (Model)

    This port delivers the prediction model trained on the whole ExampleSet. Please note that this port should only be connected if the model is actually needed; if it is left unconnected, the additional training run on the complete ExampleSet is skipped.

  • performance (IOObject)

    The average of the performances of the k iterations is the output of this port.

  • example_set (Data Table)

    This port returns the same ExampleSet which has been given as input.

  • test_result_set (Data Table)

    This port delivers the labeled test sets of the individual iterations (i.e. the ExampleSets after the model has been applied), merged into one ExampleSet.

Parameters

  • split_on_batch_attribute If this parameter is enabled, the special attribute 'batch' is used to partition the data instead of a random split. This gives you control over the exact examples which are used to train the model in each fold. All other split parameters are ignored in that case. Range: boolean
  • leave_one_out If this parameter is enabled, the cross-validation uses a single example from the original ExampleSet as the testing data (in the testing subprocess) and the remaining examples as the training data (in the training subprocess). This is repeated such that each example in the ExampleSet is used exactly once as the testing data, i.e. 'n' times, where 'n' is the total number of examples in the ExampleSet. This is the same as applying the Cross Validation operator with the number of folds parameter set equal to the number of examples in the original ExampleSet. For large ExampleSets this is usually very expensive from a computational point of view, because the training process is repeated as many times as there are examples. If set to true, the number of folds parameter is ignored. Range: boolean
  • number_of_folds This parameter specifies the number of folds (i.e. the number of subsets) the ExampleSet should be divided into; each subset has an (almost) equal number of examples. The same number of iterations will take place, each involving training a model and testing that model. If this parameter is set equal to the total number of examples in the ExampleSet, it is equivalent to setting the leave one out parameter to true. Range: integer
  • sampling_type The Cross Validation operator can use several types of sampling for building the subsets (a sketch contrasting these options follows this parameter list). The following options are available:
    • linear_sampling: Linear sampling simply divides the ExampleSet into partitions without changing the order of the examples, i.e. subsets with consecutive examples are created.
    • shuffled_sampling: Shuffled sampling builds random subsets of the ExampleSet. Examples are chosen randomly for building the subsets.
    • stratified_sampling: Stratified sampling builds random subsets and ensures that the class distribution in the subsets is the same as in the whole ExampleSet. For example, in the case of a binominal classification, stratified sampling builds random subsets such that each subset contains roughly the same proportions of the two values of the class label.
    • automatic: The automatic mode uses stratified sampling by default. If it is not applicable, e.g. if the ExampleSet does not contain a nominal label, shuffled sampling is used instead.
    Range: selection
  • use_local_random_seed This parameter indicates if a local random seed should be used for randomizing the examples of a subset. Using the same value of the local random seed will produce the same subsets; changing the value changes the way examples are randomized, thus the subsets will contain a different set of examples. This parameter is only available if shuffled or stratified sampling is selected. It is not available for linear sampling because linear sampling requires no randomization; examples are selected in sequence. Range: boolean
  • local_random_seed If the use local random seed parameter is checked this parameter determines the local random seed. The same subsets will be created every time if the same value is used. Range: integer
  • enable_parallel_execution This parameter enables the parallel execution of the inner processes. Please disable the parallel execution if you run into memory problems. Range: boolean
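To make the sampling types concrete, here is a minimal Python sketch of the three strategies (illustrative only; the exact partitioning inside RapidMiner may differ). The seed argument plays the role of the local random seed.

    import random
    from collections import defaultdict

    def linear_folds(indices, k):
        # Consecutive blocks; the order of the examples is unchanged.
        n = len(indices)
        bounds = [round(n * i / k) for i in range(k + 1)]
        return [indices[bounds[i]:bounds[i + 1]] for i in range(k)]

    def shuffled_folds(indices, k, seed=1992):
        # Random subsets; the same seed always yields the same folds.
        shuffled = list(indices)
        random.Random(seed).shuffle(shuffled)
        return linear_folds(shuffled, k)

    def stratified_folds(indices, labels, k, seed=1992):
        # Shuffle within each class, then deal the examples round-robin so that
        # every fold receives roughly the same class proportions.
        by_class = defaultdict(list)
        for i in indices:
            by_class[labels[i]].append(i)
        folds = [[] for _ in range(k)]
        rnd = random.Random(seed)
        for members in by_class.values():
            rnd.shuffle(members)
            for pos, i in enumerate(members):
                folds[pos % k].append(i)
        return folds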

Tutorial Processes

Validating Models using Cross Validation

The 'Deals' data set is loaded using the Retrieve operator. The Generate ID operator is applied on it to uniquely identify the examples. This is done only so that you can follow this process easily; IDs are not otherwise required here. A breakpoint is added after this operator so that you can preview the data before the Cross Validation operator starts. Double click the Cross Validation operator and you will see the training and testing subprocesses. The Decision Tree operator is used in the training subprocess. The trained model (i.e. the Decision Tree) is passed to the testing subprocess through the model ports. The testing subprocess receives the testing data from the testing port.

Now, have a look at the parameters of the Cross Validation operator. The number of folds parameter is set to 3 and the sampling type parameter is set to linear sampling. The remaining parameters have default values. Setting the number of folds to 3 implies that 3 subsets of the 'Deals' data set will be created. You will observe later that these three subsets are created:
  • sub1: examples with IDs 1 to 333 (333 examples)
  • sub2: examples with IDs 334 to 667 (334 examples)
  • sub3: examples with IDs 668 to 1000 (333 examples)

You can see that all examples in a subset are consecutive (i.e. have consecutive IDs). This is because linear sampling is used. Also note that all subsets have an almost equal number of examples. An exactly equal number was not possible because 1000 examples cannot be divided equally into 3 subsets.
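One plausible way to arrive at exactly these sizes is to round the fold boundaries, as in this short Python check:

    n, k = 1000, 3
    bounds = [round(n * i / k) for i in range(k + 1)]      # [0, 333, 667, 1000]
    sizes = [bounds[i + 1] - bounds[i] for i in range(k)]
    print(sizes)                                           # [333, 334, 333]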

As the number of folds parameter is set to 3, there will be three iterations:
  • Iteration 1: A model (decision tree) is trained on sub2 and sub3 during the training subprocess. The trained model is applied on sub1 during the testing subprocess.
  • Iteration 2: A model (decision tree) is trained on sub1 and sub3 during the training subprocess. The trained model is applied on sub2 during the testing subprocess.
  • Iteration 3: A model (decision tree) is trained on sub1 and sub2 during the training subprocess. The trained model is applied on sub3 during the testing subprocess.

Breakpoints are inserted to help you follow the process. Here is what happens when you run it:
  • First the 'Deals' data set is displayed, with all rows uniquely identified by the ID attribute. There are 1000 rows with IDs 1 to 1000. Press the Run button to continue.
  • Now a Decision Tree is shown. It was trained on a subset (the combination of sub2 and sub3) of the 'Deals' data set. Press the Run button to continue.
  • The Decision Tree was applied on the testing data, which for this iteration was sub1. Here you can see the results after the application of the Decision Tree model. Have a look at the IDs of the testing data: they are 1 to 333. This means the tree was trained on the remaining examples, i.e. the examples with IDs 334 to 1000 (sub2 + sub3). Press the Run button again.
  • Now the Performance Vector of the Decision Tree is shown. Press the Run button again.
  • Now you can see a different Decision Tree. It was trained on another subset, which is why it differs from the previous tree. Keep pressing the Run button and you will see the testing data and the Performance Vector for this tree.
This repeats 3 times because the number of folds parameter was set to 3. At the end of the 3 iterations, you will see the Average Performance Vector in the Results Workspace; it averages all the performance vectors.

You can run the same process with different values of the sampling type parameter. If linear sampling is used, as in our example process, you will see that the IDs of the examples in the subsets will be consecutive values. If shuffled sampling is used you will see that the IDs of the examples in the subsets will be randomized. If stratified sampling is used you will also see randomized IDs but the class distribution in the subsets will be nearly the same as in the whole 'Deals' data set.

Passing results from the training to the testing subprocess using through ports

This process is similar to the first process, but it applies a weighting operator to select a subset of the attributes before training a Linear Regression model. The focus of this example process is to highlight the usage of the through ports and the additional performance ports.

Please note the use of the through ports for transferring objects between the training and testing subprocesses. Results generated during training that have to be applied in the same way to the test set before the model can be applied may be passed using the through ports.

Have a look at the subprocesses. In the training subprocess, the Weight by Correlation operator is applied on the training data set. The Select by Weights operator receives its results and is followed by the Linear Regression operator. Note the use of the through ports: they transfer the attribute weights from the training subprocess to the testing subprocess. There, the Select by Weights operator is applied on the testing data set with the same parameter values as in the training subprocess, using the weights transferred via the through ports. In the testing subprocess two performance vectors are created and provided via the performance ports.
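The same pattern expressed outside RapidMiner, as a minimal scikit-learn sketch (SelectKBest stands in for Weight by Correlation plus Select by Weights; the fitted selector plays the role of the object passed via the through ports; the data is synthetic):

    import numpy as np
    from sklearn.feature_selection import SelectKBest, f_regression
    from sklearn.linear_model import LinearRegression

    rng = np.random.default_rng(0)
    X_train, y_train = rng.normal(size=(80, 10)), rng.normal(size=80)
    X_test = rng.normal(size=(20, 10))

    # Training side: compute attribute weights on the TRAINING data only
    # and keep the 5 highest-weighted attributes.
    selector = SelectKBest(f_regression, k=5).fit(X_train, y_train)
    model = LinearRegression().fit(selector.transform(X_train), y_train)

    # Testing side: the fitted selector is "passed through" and applied
    # unchanged, so the test data is reduced to the very same attributes.
    predictions = model.predict(selector.transform(X_test))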

Using the batch attribute to split the training data

The 'Deals' data set is loaded using the Retrieve operator. The Set Role operator is applied on it to assign the special role 'batch' to the existing Payment Method attribute. This is done so that the Cross Validation operator can split the data into one subset for each of the three different values of the Payment Method attribute. A breakpoint is added after this operator so that you can preview the data before the Cross Validation operator starts. Double click the Cross Validation operator and you will see the training and testing subprocesses. The Decision Tree operator is used in the training subprocess. The trained model (i.e. the Decision Tree) is passed to the testing subprocess through the model ports. The testing subprocess receives the testing data from the testing port.

Now, have a look at the parameters of the Cross Validation operator. The split on batch attribute parameter is selected. The other split parameters are hidden and are ignored while this parameter is set. You will observe later that these three subsets are created:
  • sub1: examples with Payment Method of credit card (652 examples)
  • sub2: examples with Payment Method of cheque (68 examples)
  • sub3: examples with Payment Method of cash (280 examples)
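Splitting on a batch attribute corresponds to group-wise validation. A minimal scikit-learn analogue (LeaveOneGroupOut plays the role of the batch split; the data is synthetic):

    import numpy as np
    from sklearn.model_selection import LeaveOneGroupOut

    rng = np.random.default_rng(0)
    X = rng.normal(size=(12, 3))
    y = rng.integers(0, 2, size=12)
    # The group labels play the role of the 'batch' attribute
    # (e.g. credit card / cheque / cash).
    groups = np.array(["credit card", "cheque", "cash"] * 4)

    for train_idx, test_idx in LeaveOneGroupOut().split(X, y, groups):
        print(f"testing on batch '{groups[test_idx][0]}' "
              f"({len(test_idx)} examples), training on {len(train_idx)}")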

Apart from how the input data is split, the operator works the same way as in the first tutorial process. But notice how much lower the performance is. Only split on a custom batch attribute if you know what you are doing; otherwise the default random splitting of the Cross Validation operator will usually be far superior.