Random Forest (RapidMiner Studio Core)

Synopsis

This operator generates a specified number of random trees, i.e. a random forest. The resulting model is a voting model of all the trees.

Description

The Random Forest operator generates a set of random trees. Each tree is generated in exactly the same way as the Random Tree operator generates a tree. The number of trees parameter specifies how many random tree models the resulting forest contains. The resulting model is a voting model of all the random trees. For more information about random trees, please study the Random Tree operator.
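
To illustrate what a voting model means, the following minimal Python sketch combines the predictions of several trees. It is not RapidMiner's implementation; the function names and the input layout (one predicted label or one class-to-confidence dictionary per tree) are assumptions made for this example. The two functions mirror the majority_vote and confidence_vote options of the voting strategy parameter described under Parameters.

```python
from collections import Counter, defaultdict


def majority_vote(predictions):
    """Return the class predicted by the most trees.

    `predictions` holds one predicted label per tree (hypothetical layout).
    Ties go to the label that was seen first.
    """
    return Counter(predictions).most_common(1)[0][0]


def confidence_vote(confidences):
    """Return the class with the highest accumulated confidence.

    `confidences` holds one dict per tree, mapping class -> confidence
    (hypothetical layout).
    """
    totals = defaultdict(float)
    for per_tree in confidences:
        for label, confidence in per_tree.items():
            totals[label] += confidence
    return max(totals, key=totals.get)


# Three trees: two lean weakly towards "yes", one is strongly for "no".
tree_confidences = [
    {"yes": 0.55, "no": 0.45},
    {"yes": 0.60, "no": 0.40},
    {"yes": 0.10, "no": 0.90},
]
tree_predictions = [max(c, key=c.get) for c in tree_confidences]

majority_result = majority_vote(tree_predictions)
confidence_result = confidence_vote(tree_confidences)
```

In this made-up case the two strategies disagree: two of the three trees predict "yes", but the accumulated confidence is higher for "no".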

Representing the data as a tree has the advantage over other approaches of being meaningful and easy to interpret. The goal is to create a classification model that predicts the value of a target attribute (often called class or label) based on several input attributes of the ExampleSet. Each interior node of the tree corresponds to one of the input attributes. The number of outgoing edges of a nominal interior node is equal to the number of possible values of the corresponding input attribute. Outgoing edges of numerical attributes are labeled with disjoint ranges. Each leaf node represents a value of the label attribute given the values of the input attributes represented by the path from the root to the leaf. For a better understanding of the structure of a tree, please study the Example Process of the Decision Tree operator.
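
The structure described above can be sketched in Python as nested dictionaries. This is only an illustration: the tree below is written by hand, not learned, and the attribute names, split threshold, and dictionary layout are hypothetical.

```python
# Hand-written example tree (hypothetical values, not learned from data).
# Nominal interior node: one child per attribute value.
# Numerical interior node: children labeled with disjoint ranges.
# Leaf: a plain string holding the predicted label.
tree = {
    "attribute": "Outlook",
    "kind": "nominal",
    "children": {
        "overcast": "yes",
        "sunny": {
            "attribute": "Humidity",
            "kind": "numerical",
            # Each entry: (lower bound, exclusive; upper bound, inclusive; subtree)
            "children": [
                (float("-inf"), 77.5, "yes"),
                (77.5, float("inf"), "no"),
            ],
        },
        "rain": {
            "attribute": "Wind",
            "kind": "nominal",
            "children": {"true": "no", "false": "yes"},
        },
    },
}


def predict(node, example):
    """Follow the path from the root to a leaf and return the leaf's label."""
    while isinstance(node, dict):
        value = example[node["attribute"]]
        if node["kind"] == "nominal":
            node = node["children"][value]
        else:
            # Pick the single range that contains the numerical value.
            node = next(
                child
                for low, high, child in node["children"]
                if low < value <= high
            )
    return node


label = predict(tree, {"Outlook": "sunny", "Humidity": 85, "Wind": "false"})
```

Here the example follows the "sunny" edge, then the humidity range above 77.5, and ends in the leaf "no".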

Pruning is a technique in which leaf nodes that do not add to the discriminative power of the tree are removed. This is done to convert an over-specific or over-fitted tree to a more general form in order to enhance its predictive power on unseen datasets. Pre-pruning is a type of pruning performed in parallel with the tree creation process. Post-pruning, on the other hand, is done after the tree creation process is complete.
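
As a rough sketch of how pre-pruning can act during tree creation, the following Python function decides whether a node may be split. It follows the wording of the minimal gain, minimal leaf size, and minimal size for split parameters described under Parameters. The function name, its signature, and the default values are assumptions chosen for illustration; they are not the operator's actual defaults or code.

```python
def should_split(node_size, best_gain, child_sizes,
                 minimal_size_for_split=4,
                 minimal_leaf_size=2,
                 minimal_gain=0.1):
    """Return True if pre-pruning allows the node to be split.

    node_size   -- number of examples in the node's subset
    best_gain   -- gain of the best candidate split at this node
    child_sizes -- sizes of the subsets the candidate split would create
    The default thresholds are illustrative values only.
    """
    # Only nodes with at least minimal_size_for_split examples are split.
    if node_size < minimal_size_for_split:
        return False
    # The gain must be strictly greater than the minimal gain.
    if best_gain <= minimal_gain:
        return False
    # Every resulting leaf must hold at least minimal_leaf_size examples.
    if any(size < minimal_leaf_size for size in child_sizes):
        return False
    return True


allowed = should_split(node_size=10, best_gain=0.3, child_sizes=[5, 5])
blocked_by_gain = should_split(node_size=10, best_gain=0.05, child_sizes=[5, 5])
```

Post-pruning is not covered by this sketch, since it operates on the finished tree.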

Input

  • training set (Data Table)

    This input port expects an ExampleSet. It is the output of the Retrieve operator in the attached Example Process. The output of other operators can also be used as input.

Output

  • model (Random Forest Model)

    The Random Forest model is delivered from this output port. This model can be applied to unseen data sets for the prediction of the label attribute. This model is a voting model of all the random trees.

  • example set (Data Table)

    The ExampleSet that was given as input is passed unchanged to the output through this port. This is usually used to reuse the same ExampleSet in further operators or to view the ExampleSet in the Results Workspace.

Parameters

  • number_of_trees: This parameter specifies the number of random trees to generate. Range: integer
  • criterion: Selects the criterion on which attributes will be selected for splitting. It can have one of the following values:
    • information_gain: The entropy of all the attributes is calculated. The attribute with minimum entropy (that is, the highest information gain) is selected for the split. This method has a bias towards selecting attributes with a large number of values.
    • gain_ratio: It is a variant of information gain. It adjusts the information gain for each attribute to allow for the breadth and uniformity of the attribute values.
    • gini_index: This is a measure of impurity of an ExampleSet. Splitting on a chosen attribute gives a reduction in the average gini index of the resulting subsets.
    • accuracy: The attribute that maximizes the accuracy of the whole tree is selected for the split.
    Range: selection
  • maximal_depth: The depth of a tree varies depending on the size and nature of the ExampleSet. This parameter is used to restrict the depth of each tree. The tree generation process stops when the tree depth is equal to the maximal depth. If its value is set to '-1', the maximal depth parameter puts no bound on the depth of the tree, and a tree of maximum depth is generated. If its value is set to '1', a tree with a single node is generated. Range: integer
  • apply_prepruning: By default each tree is generated with prepruning. Setting this parameter to false disables prepruning and delivers trees without any prepruning. Range: boolean
  • minimal_gain: The gain of a node is calculated before splitting it. The node is split if its gain is greater than the minimal gain. A higher minimal gain results in fewer splits and thus a smaller tree. A value that is too high prevents splitting completely, and a tree with a single node is generated. Range: real
  • minimal_leaf_size: The size of a leaf node is the number of examples in its subset. The tree is generated in such a way that every leaf node subset has at least the minimal leaf size number of examples. Range: integer
  • minimal_size_for_split: The size of a node is the number of examples in its subset. The size of the root node is equal to the total number of examples in the ExampleSet. Only those nodes are split whose size is greater than or equal to the minimal size for split parameter. Range: integer
  • number_of_prepruning_alternatives: As prepruning runs in parallel with the tree generation process, it may prevent splitting at certain nodes when splitting at that node does not add to the discriminative power of the entire tree. In such a case alternative nodes are tried for splitting. This parameter adjusts the number of alternative nodes tried for splitting when a split is prevented by prepruning at a certain node. Range: integer
  • apply_pruning: By default each tree is generated with pruning. Setting this parameter to false disables pruning and delivers unpruned trees. Range: boolean
  • confidence: This parameter specifies the confidence level used for the pessimistic error calculation of pruning. Range: real
  • guess_subset_ratio: If this parameter is set to true, log(m) + 1 attributes are used; otherwise a ratio should be specified by the subset ratio parameter. Range: boolean
  • voting_strategy: Specifies the prediction strategy in case of dissenting tree model predictions:
    • confidence_vote: Selects the class that has the highest accumulated confidence.
    • majority_vote: Selects the class that was predicted by the majority of tree models.
    Range: selection
  • subset_ratio: This parameter specifies the ratio of randomly chosen attributes to test. Range: real
  • use_local_random_seed: This parameter indicates if a local random seed should be used for randomization. Using the same value of local random seed will produce the same randomization. Range: boolean
  • local_random_seed: This parameter specifies the local random seed. This parameter is only available if the use local random seed parameter is set to true. Range: integer
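
The impurity measures behind the information_gain and gini_index criteria, and the attribute count used by guess_subset_ratio, can be sketched in Python as follows. These are the standard textbook formulas, not RapidMiner's code, and all function names are hypothetical. For the log(m) + 1 rule the sketch assumes that m is the number of available attributes, that the logarithm is base 2, and that the result is rounded down; the description above does not state any of these details.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())


def gini_index(labels):
    """Gini impurity of a list of class labels."""
    total = len(labels)
    return 1.0 - sum((n / total) ** 2 for n in Counter(labels).values())


def information_gain(labels, subsets):
    """Entropy of the parent minus the size-weighted entropy of the subsets."""
    total = len(labels)
    weighted = sum(len(s) / total * entropy(s) for s in subsets)
    return entropy(labels) - weighted


def guessed_subset_size(m):
    """log(m) + 1 attributes. Assumption: base-2 logarithm, rounded down."""
    return int(math.log2(m)) + 1


labels = ["yes", "yes", "no", "no"]
perfect_split = [["yes", "yes"], ["no", "no"]]

parent_entropy = entropy(labels)
parent_gini = gini_index(labels)
gain = information_gain(labels, perfect_split)
subset_size = guessed_subset_size(8)
```

A split that separates the classes perfectly, as in this toy example, leaves subsets with zero entropy, so the information gain equals the entropy of the parent node.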

Tutorial Processes

Generating a set of random trees using the Random Forest operator

The 'Golf' data set is loaded using the Retrieve operator. The Split Validation operator is applied to it for training and testing a classification model. The Random Forest operator is applied in the training subprocess of the Split Validation operator. The number of trees parameter is set to 10, so this operator generates a set of 10 random trees. The resulting model is a voting model of all the random trees. The Apply Model operator is used in the testing subprocess to apply this model. The resulting labeled ExampleSet is used by the Performance operator for measuring the performance of the model. The random forest model and its performance vector are connected to the output and can be seen in the Results Workspace.