RapidMiner Studio documentation, version 8.2

Sample (Bootstrapping) (RapidMiner Studio Core)

Synopsis

This operator creates a bootstrapped sample from an ExampleSet. Bootstrapped sampling draws examples with replacement, so the sample may contain the same example more than once. The size of the sample can be specified either as an absolute number of examples or as a ratio relative to the input ExampleSet.

Description

This operator differs from other sampling operators in that it samples with replacement. In sampling with replacement, every example has an equal probability of being selected at each step (unless example weights are used, see the use_weights parameter). Once an example has been selected for the sample, it remains a candidate and can be selected again in any later step. A sample drawn with replacement can therefore contain the same example multiple times. More importantly, sampling with replacement can produce a sample that is larger than the original ExampleSet. The number of examples in the sample is specified as an absolute number or as a ratio, depending on the setting of the sample parameter.
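
The idea of sampling with replacement can be illustrated outside RapidMiner with a minimal Python sketch. It is not the operator's implementation; the list `examples` is a hypothetical stand-in for an ExampleSet, and the seed value is arbitrary.

```python
import random

# Hypothetical stand-in for an ExampleSet: five examples identified by name.
examples = ["e1", "e2", "e3", "e4", "e5"]

rng = random.Random(42)  # fixed seed only so the sketch is repeatable

# Each draw picks uniformly from the full list. Nothing is removed after a
# draw, so the same example can be selected again in a later step.
sample = [rng.choice(examples) for _ in range(20)]

print(len(sample))                       # 20, four times the size of the list
print(len(set(sample)) <= len(examples)) # True
```

Because 20 draws are taken from only 5 examples, the sample necessarily contains repeated examples.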

Input

  • example set input (Data Table)

    This input port expects an ExampleSet. In the attached Example Process, it is the output of the Generate ID operator.

Output

  • example set output (Data Table)

    This port delivers a bootstrapped sample of the input ExampleSet.

  • original (Data Table)

    The ExampleSet that was given as input is passed through to this port unchanged. This is usually used to reuse the same ExampleSet in further operators or to view it in the Results Workspace.

Parameters

  • sample: This parameter determines how the size of the sample is specified.
    • absolute: The sample contains exactly the specified number of examples. The required number of examples is set in the sample size parameter.
    • relative: The sample size is a fraction of the total number of examples in the input ExampleSet. The required ratio of examples is set in the sample ratio parameter.
    Range: selection
  • sample_size: This parameter specifies the exact number of examples to be sampled. It is only available when the sample parameter is set to 'absolute'. Range: integer
  • sample_ratio: This parameter specifies the fraction of examples to be sampled. It is only available when the sample parameter is set to 'relative'. Range: real
  • use_weights: If set to true, example weights are considered during the bootstrapping, provided such weights are present. Range: boolean
  • use_local_random_seed: This parameter indicates whether a local random seed should be used for randomizing the examples of the sample. Using the same value of the local random seed produces the same sample. Changing the value of the local random seed changes the way the examples are randomized, so the sample will contain a different set of examples. Range: boolean
  • local_random_seed: This parameter specifies the local random seed. It is only available if the use local random seed parameter is set to true. Range: integer
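
How these parameters interact can be sketched in Python. The function `bootstrap_sample` below is hypothetical and only mirrors the parameter names listed above; it is not RapidMiner code. In particular, rounding the relative size to the nearest integer and treating weights as relative selection probabilities are assumptions made for this sketch.

```python
import random

def bootstrap_sample(examples, sample, sample_size=None, sample_ratio=None,
                     weights=None, seed=None):
    """Draw a sample with replacement (illustrative only, not RapidMiner code)."""
    if sample == "absolute":
        k = sample_size
    elif sample == "relative":
        # Assumption: the fractional size is rounded to the nearest integer.
        k = round(sample_ratio * len(examples))
    else:
        raise ValueError("sample must be 'absolute' or 'relative'")
    # seed=None gives a differently seeded generator on each call;
    # a fixed seed makes the sample reproducible.
    rng = random.Random(seed)
    # weights=None means every example is equally likely on each draw.
    return rng.choices(examples, weights=weights, k=k)

rows = list(range(1, 11))  # hypothetical ExampleSet with ids 1..10

a = bootstrap_sample(rows, "relative", sample_ratio=0.5, seed=7)
b = bootstrap_sample(rows, "relative", sample_ratio=0.5, seed=7)
print(len(a))  # 5
print(a == b)  # True

# Only the last example has a non-zero weight, so only it can be drawn.
w = bootstrap_sample(rows, "absolute", sample_size=30, weights=[0] * 9 + [1])
print(set(w))  # {10}
```

The same seed yields the same sample, which corresponds to the behavior described for the local random seed. A different seed would generally yield a different selection.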

Tutorial Processes

Bootstrapped Sampling of the Golf data set

The 'Golf' data set is loaded using the Retrieve operator. The Generate ID operator is applied to it to create an id attribute with ids starting from 1. This is done only so that the examples can be identified uniquely; the id attribute is not otherwise required here. A breakpoint is inserted here so that you can view the ExampleSet before the Sample (Bootstrapping) operator is applied. As you can see, the ExampleSet has 14 examples. The Sample (Bootstrapping) operator is then applied to the ExampleSet. The sample parameter is set to 'absolute' and the sample size parameter is set to 140, so the generated sample is 10 times the size of the original ExampleSet. Rather than repeating each example of the input ExampleSet exactly 10 times, the operator selects examples randomly. You can verify this by inspecting the results of this process in the Results Workspace.
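
The effect described in this tutorial can be approximated in plain Python. The sketch below assumes 14 examples with ids 1 to 14 and draws 140 of them with replacement; the seed is arbitrary, and the counts it produces will not match those of the RapidMiner process.

```python
import random
from collections import Counter

ids = list(range(1, 15))          # 14 examples with ids 1..14, as in the tutorial
rng = random.Random(2024)         # arbitrary seed, chosen only for repeatability
sample = rng.choices(ids, k=140)  # corresponds to 'absolute' with sample size 140

# Count how often each id was drawn.
counts = Counter(sample)
for example_id in ids:
    print(example_id, counts[example_id])

print(sum(counts.values()))  # 140
```

The per-id counts always sum to 140, but because the examples are drawn randomly, individual ids typically appear more or fewer than 10 times.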