Decision Stump (RapidMiner Studio Core)

Synopsis

This operator learns a Decision Tree with only a single split. It can be applied to both nominal and numerical data sets.

Description

The Decision Stump operator generates a decision tree with only a single split. The resulting tree can be used for classifying unseen examples. On its own a stump is a weak learner, but it can be very effective when boosted with operators like the AdaBoost operator. The examples of the given ExampleSet have several attributes, and every example belongs to a class (like yes or no). The leaf nodes of a decision tree carry the class labels, whereas each non-leaf node is a decision node: a test on one attribute, with each branch (leading to another subtree) corresponding to a possible value of that attribute. For more information about decision trees, please study the Decision Tree operator.
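
For readers who want to see the idea in code, here is a minimal sketch using scikit-learn (an assumption of this example; it is not part of RapidMiner Studio). A decision stump is simply a decision tree whose depth is capped at one, so it contains exactly one decision node and its leaf children:

```python
# Minimal sketch, not RapidMiner's implementation: a decision stump
# is just a decision tree of depth 1.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)

# max_depth=1 forces a single split: one decision node, two leaves.
stump = DecisionTreeClassifier(max_depth=1).fit(X, y)
print(export_text(stump))
```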

Input

  • training set (Data Table)

    This input port expects an ExampleSet. It is the output of the Retrieve operator in the attached Example Process. The output of other operators can also be used as input.

Output

  • model (Decision Tree)

    The Decision Tree with just a single split is delivered from this output port. This classification model can now be applied to unseen data sets to predict the label attribute.

  • example set (Data Table)

    The ExampleSet that was given as input is passed through this port without any changes. This is usually done to reuse the same ExampleSet in further operators or to view it in the Results Workspace.

Parameters

  • criterion: This parameter specifies the criterion used to select the attribute for splitting. It can have one of the following values (an illustrative computation of each measure follows this list):
    • information_gain: The entropy of the subsets produced by each candidate split is calculated, and the attribute yielding the minimum entropy (i.e. the maximum information gain) is selected for the split. This method has a bias towards selecting attributes with a large number of values.
    • gain_ratio: A variant of information gain. It adjusts the information gain of each attribute to account for the breadth and uniformity of the attribute's values.
    • gini_index: A measure of the impurity of an ExampleSet. The attribute whose split gives the greatest reduction in the weighted average Gini index of the resulting subsets is selected.
    • accuracy: The attribute whose split maximizes the accuracy of the whole tree is selected.
    Range: selection
  • minimal_leaf_size: The size of a leaf node is the number of examples in its subset. The tree is generated in such a way that every leaf node's subset contains at least this many examples. Range: integer
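
To make these measures concrete, the following hand-rolled sketch computes each criterion for a single nominal attribute using the textbook definitions. RapidMiner's internal implementation may differ in details such as tie-breaking and the handling of numerical attributes:

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a label list, in bits."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def gini(labels):
    """Gini impurity of a label list."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def branches(values, labels):
    """Group the labels by the attribute value of their example."""
    subsets = {}
    for v, l in zip(values, labels):
        subsets.setdefault(v, []).append(l)
    return list(subsets.values())

def information_gain(values, labels):
    n = len(labels)
    remainder = sum(len(s) / n * entropy(s) for s in branches(values, labels))
    return entropy(labels) - remainder

def gain_ratio(values, labels):
    # Normalize by the entropy of the split itself (its "intrinsic value"),
    # which penalizes attributes with many distinct values.
    iv = entropy(values)
    return information_gain(values, labels) / iv if iv else 0.0

def gini_reduction(values, labels):
    n = len(labels)
    remainder = sum(len(s) / n * gini(s) for s in branches(values, labels))
    return gini(labels) - remainder

def split_accuracy(values, labels):
    # Accuracy when every branch predicts its majority class.
    n = len(labels)
    return sum(Counter(s).most_common(1)[0][1]
               for s in branches(values, labels)) / n

# Toy data: an Outlook-style attribute against a yes/no label.
outlook = ["sunny", "sunny", "overcast", "rain", "rain"]
play    = ["no",    "no",    "yes",      "yes",  "no"]
print(information_gain(outlook, play), gain_ratio(outlook, play))
print(gini_reduction(outlook, play), split_accuracy(outlook, play))
```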

Tutorial Processes

Introduction to the Decision Stump operator

To understand the basic terminology of trees, please study the Example Process of the Decision Tree operator.

The 'Golf' data set is loaded using the Retrieve operator. A breakpoint is inserted here so that you can have a look at the ExampleSet. The Decision Stump operator is then applied, with the criterion parameter set to 'information_gain' and the minimal leaf size parameter set to 1. The resulting decision tree model is connected to the result port of the process and can be seen in the Results Workspace. Note that this tree has just a single split. A rough scikit-learn analogue of the process is sketched below.
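
The toy table in this sketch only imitates the 'Golf' data set (the column names and values are illustrative assumptions, not the exact schema), and pandas' get_dummies stands in for RapidMiner's native handling of nominal attributes:

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

# A golf-like toy ExampleSet (hypothetical values, not the real 'Golf' data).
golf = pd.DataFrame({
    "Outlook":  ["sunny", "sunny", "overcast", "rain", "rain", "rain"],
    "Humidity": [85, 90, 78, 96, 80, 70],
    "Play":     ["no", "no", "yes", "yes", "yes", "no"],
})

X = pd.get_dummies(golf[["Outlook", "Humidity"]])  # nominal -> indicator columns
y = golf["Play"]

stump = DecisionTreeClassifier(criterion="entropy",  # information gain
                               max_depth=1,          # a single split
                               min_samples_leaf=1)   # minimal leaf size = 1
stump.fit(X, y)
print(export_text(stump, feature_names=list(X.columns)))
```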