Sample (RapidMiner Studio Core)

Synopsis

This operator creates a sample from an ExampleSet by selecting examples randomly. The size of the sample can be specified as an absolute number, as a relative fraction, or as a per-example probability.

Description

This operator is similar to the Filter Examples operator in that it takes an ExampleSet as input and delivers a subset of that ExampleSet as output. The difference is that the Filter Examples operator selects examples on the basis of specified conditions, whereas the Sample operator selects examples randomly and focuses on the number of examples and the class distribution in the resulting sample. The number of examples in the sample can be specified on an absolute, relative or probability basis, depending on the setting of the sample parameter. The class distribution of the sample can be controlled by the balance data parameter.

Input

  • example set input (IOObject)

    This input port expects an ExampleSet. In the attached Example Process, it is the output of the Retrieve operator.

Output

  • example set output (IOObject)

    A randomized sample of the input ExampleSet is delivered through this port.

  • original (IOObject)

    The ExampleSet that was given as input is passed through this port without changes. This is usually used to reuse the same ExampleSet in further operators or to view the ExampleSet in the Results Workspace.

Parameters

  • sample: This parameter determines how the amount of data is specified.
    • absolute: If the sample parameter is set to 'absolute', the sample consists of an exactly specified number of examples. The required number of examples is specified in the sample size parameter.
    • relative: If the sample parameter is set to 'relative', the sample is created as a fraction of the total number of examples in the input ExampleSet. The required ratio of examples is specified in the sample ratio parameter.
    • probability: If the sample parameter is set to 'probability', each example is selected for the sample with the specified probability. The required probability is specified in the sample probability parameter.
    Range: selection
  • balance_data: Set this parameter to true if you need to sample differently for examples of a certain class. If this parameter is set to true, the sample size, sample ratio and sample probability parameters are replaced by the sample size per class, sample ratio per class and sample probability per class parameters respectively. These parameters allow you to specify different sample sizes for different values of the label attribute. Range: boolean
  • sample_size: This parameter specifies the exact number of examples that should be sampled. This parameter is only available when the sample parameter is set to 'absolute' and the balance data parameter is not set to true. Range: integer
  • sample_ratio: This parameter specifies the fraction of examples that should be sampled. This parameter is only available when the sample parameter is set to 'relative' and the balance data parameter is not set to true. Range: real
  • sample_probability: This parameter specifies the sample probability for each example. This parameter is only available when the sample parameter is set to 'probability' and the balance data parameter is not set to true. Range: real
  • sample_size_per_class: This parameter specifies the absolute sample size per class. This parameter is only available when the sample parameter is set to 'absolute' and the balance data parameter is set to true. Range:
  • sample_ratio_per_class: This parameter specifies the fraction of examples per class. This parameter is only available when the sample parameter is set to 'relative' and the balance data parameter is set to true. Range:
  • sample_probability_per_class: This parameter specifies the probability of examples per class. This parameter is only available when the sample parameter is set to 'probability' and the balance data parameter is set to true. Range:
  • use_local_random_seed: This parameter indicates if a local random seed should be used for randomizing the examples of the sample. Using the same value of the local random seed produces the same sample. Changing the value of the local random seed changes the way the examples are randomized, so the sample will contain a different set of examples. Range: boolean
  • local_random_seed: This parameter specifies the local random seed. This parameter is only available if the use local random seed parameter is set to true. Range: integer
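
The three settings of the sample parameter and the effect of a fixed seed can be illustrated with a small Python sketch. This is not RapidMiner code: the function sample_examples, its arguments and the rounding rule used for 'relative' are assumptions made for illustration, and the operator may order or round its results differently.

```python
import random


def sample_examples(examples, mode, value, seed=None):
    """Hypothetical stand-in for the three sampling modes (not RapidMiner code)."""
    rng = random.Random(seed)  # the same seed always yields the same draw
    if mode == "absolute":
        # Exactly `value` examples, drawn without replacement.
        return rng.sample(examples, value)
    if mode == "relative":
        # A fixed fraction of the input; rounding to the nearest integer is an assumption.
        return rng.sample(examples, round(len(examples) * value))
    if mode == "probability":
        # Each example is kept independently, so the sample size is not fixed in advance.
        return [e for e in examples if rng.random() < value]
    raise ValueError(f"unknown mode: {mode}")


examples = list(range(250))
absolute = sample_examples(examples, "absolute", 50, seed=1992)
relative = sample_examples(examples, "relative", 0.1, seed=1992)
by_probability = sample_examples(examples, "probability", 0.1, seed=1992)
print(len(absolute), len(relative), len(by_probability))
```

In this sketch the 'absolute' and 'relative' calls always return 50 and 25 examples respectively, while the size of the 'probability' sample depends on the random draws. Calling the function again with the same seed returns the same examples, which mirrors the behavior described for the local random seed parameter.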

Tutorial Processes

Sampling the Ripley-Set data set

The 'Ripley-Set' data set is loaded using the Retrieve operator. The Generate ID operator is applied on it so that the examples can be identified uniquely. A breakpoint is inserted at this stage so that you can see the ExampleSet before the Sample operator is applied. You can see that there are 250 examples with two possible classes: 0 and 1. 125 examples have class 0 and 125 examples have class 1.

Now the Sample operator is applied on the ExampleSet. The sample parameter is set to 'relative' and the balance data parameter is set to true. The sample ratio per class parameter specifies two ratios. Class 0 is assigned ratio 0.2, so only 20 percent of the examples whose label attribute is 0 will be selected. There are 125 examples with class 0, so 25 (i.e. 20% of 125) examples will be selected. Class 1 is assigned ratio 1, so 100 percent of the examples whose label attribute is 1 will be selected. There are 125 examples with class 1, so all 125 (i.e. 100% of 125) examples will be selected. The resulting sample therefore contains 150 examples.

Run the process to verify the results. Also note that the examples are taken randomly. The randomization can be changed by changing the local random seed parameter.
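
The per-class arithmetic of this tutorial can be reproduced with a short Python sketch. It uses a synthetic stand-in for the data (250 ids, 125 per class) rather than the actual Ripley-Set, and the helper sample_ratio_per_class is a hypothetical name, not part of RapidMiner.

```python
import random
from collections import defaultdict


def sample_ratio_per_class(examples, ratios, seed=None):
    """Hypothetical per-class relative sampling over (id, label) pairs."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for example_id, label in examples:
        by_class[label].append((example_id, label))
    sampled = []
    for label, members in by_class.items():
        # Number of examples to keep for this class (rounding is an assumption).
        k = round(len(members) * ratios[label])
        sampled.extend(rng.sample(members, k))
    return sampled


# Synthetic data: ids 1..250, the first 125 labeled 0 and the rest labeled 1.
examples = [(i, 0 if i <= 125 else 1) for i in range(1, 251)]
sample = sample_ratio_per_class(examples, {0: 0.2, 1: 1.0}, seed=1)
counts = {label: sum(1 for _, l in sample if l == label) for label in (0, 1)}
print(counts)  # {0: 25, 1: 125}
```

Changing the seed changes which 25 examples of class 0 are drawn, but the per-class counts stay the same.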