RapidMiner Studio documentation, version 9.2
Split Data (RapidMiner Studio Core)
Synopsis
This operator produces the desired number of subsets of the given ExampleSet. The ExampleSet is partitioned into subsets according to the specified relative sizes.
Description
The Split Data operator takes an ExampleSet as its input and delivers subsets of that ExampleSet through its output ports. The number of subsets (or partitions) and the relative size of each partition are specified through the partitions parameter. The ratios of all partitions should sum to 1. The sampling type parameter determines how examples are assigned to the resulting partitions. See the Parameters section of this description for details. Unlike other sampling and filtering operators, this operator can deliver multiple partitions of the given ExampleSet.
Input
- example set (IOObject)
This input port expects an ExampleSet. It is the output of the Retrieve operator in the attached Example Process.
Output
- partition (IOObject)
This operator can have multiple partition ports. The number of partition ports that carry output depends on the number of partitions (or subsets) this operator is configured to produce. The partitions parameter specifies the desired number of partitions.
Parameters
- partitions
This is the most important parameter of this operator. It specifies the number of partitions and the relative ratio of each partition. The user only needs to specify the ratio of each partition. The number of partitions is not specified explicitly; it is derived from the number of ratios entered in this parameter. Each ratio should be between 0 and 1, and the sum of all ratios should be 1. For a better understanding of this parameter, please study the attached Example Process. Range: enumeration
- sampling_type
The Split Data operator can use several types of sampling for building the subsets. The following options are available:
- Linear sampling: Linear sampling simply divides the ExampleSet into partitions without changing the order of the examples i.e. subsets with consecutive examples are created.
- Shuffled sampling: Shuffled sampling builds random subsets of the ExampleSet. Examples are chosen randomly for making subsets.
- Stratified sampling: Stratified sampling builds random subsets and ensures that the class distribution in the subsets is the same as in the whole ExampleSet. For example in the case of a binominal classification, Stratified sampling builds random subsets such that each subset contains roughly the same proportions of the two values of the class labels.
- Automatic: Uses stratified sampling if the label is nominal, shuffled sampling otherwise.
- use_local_random_seed
Indicates if a local random seed should be used for randomizing the examples of a subset. Using the same value of the local random seed will produce the same subsets. Changing the value of this parameter changes the way examples are randomized, so the subsets will contain a different set of examples. This parameter is only available if Shuffled or Stratified sampling is selected. It is not available for Linear sampling, which requires no randomization because examples are selected in sequence. Range: boolean
- local_random_seed
This parameter specifies the local random seed. It is only available if the use local random seed parameter is set to true. Range: integer
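The three explicit sampling types and the effect of a fixed random seed can be illustrated with a minimal Python sketch. This is not RapidMiner code and does not claim to reproduce the operator's internals: the function names (`linear_split`, `shuffled_split`, `stratified_split`) and the helper `_boundaries` are hypothetical, and the cumulative-rounding rule used to place the partition boundaries is an assumption chosen so that the partition sizes always add up to the number of examples.

```python
import random
from collections import defaultdict


def _boundaries(n, ratios):
    """Return cut points [0, ..., n] for n examples (assumed rounding rule)."""
    total = sum(ratios)
    bounds = [0]
    cumulative = 0.0
    for ratio in ratios:
        cumulative += ratio
        bounds.append(round(n * cumulative / total))
    bounds[-1] = n  # guard against floating-point drift on the last cut
    return bounds


def linear_split(examples, ratios):
    """Linear sampling: consecutive slices, original order preserved."""
    bounds = _boundaries(len(examples), ratios)
    return [examples[a:b] for a, b in zip(bounds, bounds[1:])]


def shuffled_split(examples, ratios, seed=None):
    """Shuffled sampling: shuffle a copy, then cut it into slices."""
    rng = random.Random(seed)  # same seed -> same subsets
    shuffled = list(examples)
    rng.shuffle(shuffled)
    return linear_split(shuffled, ratios)


def stratified_split(examples, labels, ratios, seed=None):
    """Stratified sampling: split each class separately, then merge."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for example, label in zip(examples, labels):
        by_class[label].append(example)
    parts = [[] for _ in ratios]
    for members in by_class.values():
        rng.shuffle(members)
        for part, chunk in zip(parts, linear_split(members, ratios)):
            part.extend(chunk)
    return parts


if __name__ == "__main__":
    ids = list(range(1, 15))  # 14 example ids, as in the tutorial process
    print(linear_split(ids, [0.8, 0.2]))
    print(shuffled_split(ids, [0.8, 0.2], seed=1992))
```

Calling `shuffled_split` twice with the same `seed` returns identical partitions, which mirrors the documented behaviour of the local random seed. The Automatic option is not shown; per the description above it would simply choose between the stratified and shuffled variants depending on the label type.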
Tutorial Processes
Creating partitions of the Golf data set using the Split Data operator
The 'Golf' data set is loaded using the Retrieve operator. The Generate ID operator is applied on it so the examples can be identified uniquely. A breakpoint is inserted here so the ExampleSet can be seen before the application of the Split Data operator. It can be seen that the ExampleSet has 14 examples which can be uniquely identified by the id attribute. The examples have ids from 1 to 14. The Split Data operator is applied next. The sampling type parameter is set to 'linear sampling'. The partitions parameter is configured to produce two partitions with ratios 0.8 and 0.2 respectively. The partitions can be seen in the Results Workspace. The number of examples in each partition is calculated by this formula:
(Total number of examples) / (sum of ratios) * ratio of this partition
If the answer is a decimal number it is rounded off. The number of examples in each partition turns out to be:
(14) / (0.8 + 0.2) * (0.8) = 11.2 which is rounded off to 11
(14) / (0.8 + 0.2) * (0.2) = 2.8 which is rounded off to 3
It is good practice to adjust the ratios so that they sum to 1, but this operator also works if the sum of the ratios is lower or greater than 1. For example, if two partitions are created with ratios 1.0 and 0.4, the resultant partitions are calculated as follows:
(14) / (1.0 + 0.4) * (1.0) = 10
(14) / (1.0 + 0.4) * (0.4) = 4
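The size formula from this tutorial can be checked with a few lines of Python. This is only a sketch of the arithmetic shown above, not the operator's implementation; the function name `partition_sizes` is hypothetical, and Python's built-in `round` is assumed to be an adequate stand-in for "rounded off" in the two cases worked here.

```python
def partition_sizes(total_examples, ratios):
    """Apply (total) / (sum of ratios) * ratio to each partition, then round."""
    ratio_sum = sum(ratios)
    # round() rounds exact .5 ties to the nearest even integer; that
    # tie-breaking behaviour is an assumption, not documented for the operator.
    return [round(total_examples / ratio_sum * ratio) for ratio in ratios]


if __name__ == "__main__":
    print(partition_sizes(14, [0.8, 0.2]))  # [11, 3]
    print(partition_sizes(14, [1.0, 0.4]))  # [10, 4]
```

Note that rounding each partition independently does not guarantee that the sizes add up to the total in general: three equal ratios applied to 14 examples give 5 + 5 + 5 = 15 under this sketch. The text above does not say how the operator resolves such cases.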