Decision Tree (RapidMiner Studio Core)
SynopsisGenerates a Decision Tree for classification of both nominal and numerical data.
A decision tree is a tree-like graph or model. It is more like an inverted tree because it has its root at the top and it grows downwards. This representation of the data has the advantage compared with other approaches of being meaningful and easy to interpret. The goal is to create a classification model that predicts the value of a target attribute (often called class or label) based on several input attributes of the ExampleSet. In RapidMiner an attribute with label role is predicted by the Decision Tree operator. Each interior node of tree corresponds to one of the input attributes. The number of edges of a nominal interior node is equal to the number of possible values of the corresponding input attribute. Outgoing edges of numerical attributes are labeled with disjoint ranges. Each leaf node represents a value of the label attribute given the values of the input attributes represented by the path from the root to the leaf. This description can be easily understood by studying the attached Example Process.
Decision Trees are generated by recursive partitioning. Recursive partitioning means repeatedly splitting on the values of attributes. In every recursion the algorithm follows the following steps:
- An attribute A is selected to split on. Making a good choice of attributes to split on each stage is crucial to generation of a useful tree. The attribute is selected depending upon a selection criterion which can be selected by the criterion parameter.
- Examples in the ExampleSet are sorted into subsets, one for each value of the attribute A in case of a nominal attribute. In case of numerical attributes, subsets are formed for disjoint ranges of attribute values.
- A tree is returned with one edge or branch for each subset. Each branch has a descendant subtree or a label value produced by applying the same algorithm recursively.
In general, the recursion stops when all the examples or instances have the same label value, i.e. the subset is pure. Or recursion may stop if most of the examples are of the same label value. This is a generalization of the first approach; with some error threshold. However there are other halting conditions such as:
- There are less than a certain number of instances or examples in the current subtree. This can be adjusted by using the minimal size for split parameter.
- No attribute reaches a certain threshold. This can be adjusted by using the minimum gain parameter.
- The maximal depth is reached. This can be adjusted by using the maximal depth parameter.
Pruning is a technique in which leaf nodes that do not add to the discriminative power of the decision tree are removed. This is done to convert an over-specific or over-fitted tree to a more general form in order to enhance its predictive power on unseen datasets. Pre-pruning is a type of pruning performed parallel to the tree creation process. Post-pruning, on the other hand, is done after the tree creation process is complete.
CHAIDThe CHAID operator works exactly like the Decision Tree operator with one exception: it uses a chi-squared based criterion instead of the information gain or gain ratio criteria. Moreover this operator cannot be applied on ExampleSets with numerical attributes.
- training set (Data Table)
This input port expects an ExampleSet. It is the output of the Retrieve operator in the attached Example Process. The output of other operators can also be used as input.
- model (Decision Tree)
The Decision Tree is delivered from this output port. This classification model can now be applied on unseen data sets for the prediction of the label attribute.
- example set (Data Table)
The ExampleSet that was given as input is passed without changing to the output through this port. This is usually used to reuse the same ExampleSet in further operators or to view the ExampleSet in the Results Workspace.
- criterionSelects the criterion on which attributes will be selected for splitting. It can have one of the following values:
- information_gain: The entropy of all the attributes is calculated. The attribute with minimum entropy is selected for split. This method has a bias towards selecting attributes with a large number of values.
- gain_ratio: It is a variant of information gain. It adjusts the information gain for each attribute to allow the breadth and uniformity of the attribute values.
- gini_index: This is a measure of impurity of an ExampleSet. Splitting on a chosen attribute gives a reduction in the average gini index of the resulting subsets.
- accuracy: Such an attribute is selected for split that maximizes the accuracy of the whole Tree.
- maximal_depthThe depth of a tree varies depending upon size and nature of the ExampleSet. This parameter is used to restrict the size of the Decision Tree. The tree generation process is not continued when the tree depth is equal to the maximal depth. If its value is set to '-1', the maximal depth parameter puts no bound on the depth of the tree, a tree of maximum depth is generated. If its value is set to '1', a Tree with a single node is generated. Range: integer
- apply_pruningBy default the Decision Tree is generated with pruning. Setting this parameter to false disables the pruning and delivers an unpruned Tree. Range: boolean
- confidenceThis parameter specifies the confidence level used for the pessimistic error calculation of pruning. Range: real
- apply_prepruningBy default the Decision Tree is generated with prepruning. Setting this parameter to false disables the prepruning and delivers a tree without any prepruning. Range: boolean
- minimal_gainThe gain of a node is calculated before splitting it. The node is split if its Gain is greater than the minimal gain. Higher value of minimal gain results in fewer splits and thus a smaller tree. A too high value will completely prevent splitting and a tree with a single node is generated. Range: real
- minimal_leaf_sizeThe size of a leaf node is the number of examples in its subset. The tree is generated in such a way that every leaf node subset has at least the minimal leaf size number of instances. Range: integer
- minimal_size_for_splitThe size of a node is the number of examples in its subset. The size of the root node is equal to the total number of examples in the ExampleSet. Only those nodes are split whose size is greater than or equal to the minimal size for split parameter. Range: integer
- number_of_prepruning_alternativesAs prepruning runs parallel to the tree generation process, it may prevent splitting at certain nodes when splitting at that node does not add to the discriminative power of the entire tree. In such a case alternative nodes are tried for splitting. This parameter adjusts the number of alternative nodes tried for splitting when split is prevented by prepruning at a certain node. Range: integer
Getting started with Decision Trees
The 'Golf' data set is retrieved using the Retrieve operator. Then the Decision Tree operator is applied on it. Click on the Run button to go to the Results Workspace. First take a look at the resultant tree. First of all, basic terminology of trees is explained here using this resultant tree. The first node of the tree is called the root of the tree; 'Outlook' is the root in this case. As you can see from this tree, trees grow downwards. This example clearly shows why interpretation of data is easy through trees. Just take a glance at this tree and you will come to know that whenever the 'Outlook' attribute value is 'overcast', the 'Play' attribute will have the value 'yes'. Similarly whenever the 'Outlook' attribute value is 'rain' and the 'Wind' attribute has value 'false', then the 'Play' attribute will have the value 'yes'. The Decision Tree operator predicts the value of an attribute with the label role, in this example the 'Play' attribute is predicted. The nodes that do not have child nodes are called the leaf nodes. All leaf non-leaf nodes correspond to one of the input attributes. In this example the 'Outlook', 'Wind' and 'Humidity' attributes are non-leaf nodes. The number of edges of a nominal interior node is equal to the number of possible values of the corresponding input attribute. The 'Outlook' attribute has three possible values: 'overcast', 'rain' and 'sunny' thus it has three outgoing edges. Similarly the 'Wind' attribute has two outgoing edges. As 'Humidity' is a numerical attribute, its outgoing edges are labeled with disjoint ranges. Each leaf node represents a value of the label given the values of the input attributes represented by the path from the root to the leaf. That is why all leaf nodes assume possible values of the label attribute i.e. 'yes' or 'no'.
In this Example Process the 'Gain ratio' is used as the selection criterion. However using any other criterion on this ExampleSet produces the same tree. This is because this is a very simple data set. On large and complex data sets different selection criterion may produce different trees.
When the tree is split on the 'Outlook' attribute, it is divided into three subtrees, one with each value of the 'Outlook' attribute. The 'overcast' subtree is pure i.e. all its label values are same ('yes'), thus it is not split again. The 'rain' subtree and the 'sunny' subtree are split again in the next iteration. In this Example Process the minimal size of split parameter is set to 4. Set it to 10 and run the process again. You will see that you get a tree with a single node. This is because this time the nodes with size less than 10 cannot be split. As the size of all subtrees of the 'Outlook' attribute is less than 10, thus no splitting takes place, and we get a tree with just a single node.
The minimal gain parameter is set to 0.1. In general if you want to reduce the size of tree, you can increase the minimal gain. Similarly if you want to increase the size of your tree, you can reduce the value of the minimal gain parameter. The maximal depth parameter is set to 20. The actual depth of the resultant tree is 3. You can set an upper bound to the depth of tree using the maximal depth parameter. Prepruning is enabled in this Example Process. To disable it, check the no prepruning parameter. Now click the Run button. The resultant tree is much more complex than the previous one. The previous tree was more useful and comprehendible than this one.