Loop Data Sets (RapidMiner Studio Core)

Synopsis

This operator executes its subprocess once for every ExampleSet given at its input ports.

Description

The subprocess of the Loop Data Sets operator is executed n times, where n is the number of ExampleSets provided as input to this operator. A basic understanding of subprocesses is needed to use this operator; for more information, see the documentation of the Subprocess operator. For each input ExampleSet, the Loop Data Sets operator executes the inner operators of the subprocess like an operator chain. It can therefore be used to run the same process consecutively on a number of different data sets.

If the only best parameter is set to true, only the results generated during the iteration with the best performance are delivered as output. This option requires a performance vector to be attached to the performance port in the subprocess of this operator. The Loop Data Sets operator uses this performance vector to select the iteration with the best performance.
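The looping behavior described above can be sketched in Python. This is only an illustration of the semantics, not RapidMiner code: the function loop_data_sets, the subprocess callable and its (result, performance) return value are hypothetical names introduced here. The sketch also assumes that a higher performance value is better, that ties keep the earlier iteration, and that all per-iteration results are returned when only best is false; this page does not specify those details.

```python
from typing import Callable, List, Sequence, Tuple, TypeVar

D = TypeVar("D")  # stands in for an ExampleSet
R = TypeVar("R")  # stands in for whatever the subprocess delivers


def loop_data_sets(
    data_sets: Sequence[D],
    subprocess: Callable[[D], Tuple[R, float]],
    only_best: bool = False,
) -> List[R]:
    """Run `subprocess` once per data set, in input order (hypothetical sketch)."""
    results: List[R] = []
    best_result = None
    best_performance = None

    for data_set in data_sets:
        # One iteration: the subprocess yields a result and a performance value.
        result, performance = subprocess(data_set)
        results.append(result)
        # Assumption: larger performance is better; ties keep the earlier iteration.
        if best_performance is None or performance > best_performance:
            best_result, best_performance = result, performance

    if only_best:
        # Deliver only the result of the best-performing iteration.
        return [best_result] if results else []
    # Assumption: without only_best, every iteration's result is returned.
    return results
```

With three inputs the subprocess callable runs three times, and with only_best=True the returned list holds just the result of the iteration whose performance value was highest.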

Input

  • example set (IOObject)

    This operator can have multiple inputs. When one input is connected, another input port becomes available, ready to accept another ExampleSet (if any). The order of inputs is preserved: the ExampleSet supplied at the first input port of this operator is available at the first input port of the nested chain (inside the subprocess). Connect all inputs in the correct order, and make sure that the right number of ports is connected at the subprocess level.

Output

  • output (IOObject)

    The Loop Data Sets operator can have multiple output ports. When one output is connected, another output port becomes available, ready to deliver another output (if any). The order of outputs is preserved: the object delivered at the first output port of the subprocess is delivered at the first output of the outer process. Connect all outputs in the correct order, and make sure that the right number of ports is connected at all levels of the chain.

Parameters

  • only_best

    If the only best parameter is set to true then only the results generated during the iteration with the best performance are delivered as output. For this option it is compulsory to attach a performance vector to the performance port in the subprocess of this operator. The Loop Data Sets operator uses this performance vector to select the iteration with the best performance. Range: boolean

Tutorial Processes

Selecting the ExampleSet with best performance

This Example Process explains the usage of the only best parameter of the Loop Data Sets operator. The 'Golf', 'Golf-Testset' and 'Iris' data sets are loaded using the Retrieve operator. All these ExampleSets are provided as input to the Loop Data Sets operator. Have a look at the subprocess of the Loop Data Sets operator. The Split Validation operator is used for training and testing a K-NN model on the given ExampleSet. The Split Validation operator returns the performance vector of the model. This performance vector is used by the Loop Data Sets operator for finding the iteration with the best performance. The results of the iteration with the best performance are delivered because the only best parameter is set to true.

When this process is executed, the 'Iris' data set is delivered as the result, because the iteration with the 'Iris' data set had the best performance vector. If you insert a breakpoint after the Split Validation operator and run the process again, you can see that the 'Golf', 'Golf-Testset' and 'Iris' data sets have 25%, 50% and 93.33% accuracy respectively. Because the iteration with the 'Iris' data set had the best performance, its results are returned by this operator (remember that the only best parameter is set to true). This operator can also return other objects, such as a model.
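The selection in this tutorial can be checked by hand with a few lines of Python. The accuracies are the ones reported above; the dictionary and variable names are hypothetical and only serve the illustration, which assumes that the highest accuracy counts as the best performance.

```python
# Accuracy (in percent) observed for each data set in the tutorial process.
accuracies = {"Golf": 25.0, "Golf-Testset": 50.0, "Iris": 93.33}

# Pick the data set whose iteration had the highest accuracy.
best_data_set = max(accuracies, key=accuracies.get)
print(best_data_set)  # prints "Iris"
```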