Decision Tree (Concurrency)
Synopsis
This Operator generates a decision tree model, which can be used for classification and regression.
A decision tree is a tree-like collection of nodes intended to create a decision on a value's affiliation to a class or an estimate of a numerical target value. Each node represents a splitting rule for one specific Attribute. For classification this rule separates values belonging to different classes; for regression it separates them in order to reduce the error in an optimal way for the selected parameter criterion.
The building of new nodes is repeated until the stopping criteria are met. A prediction for the class label Attribute is determined by the majority class of the Examples that reached the respective leaf during generation, while an estimate for a numerical value is obtained by averaging the values in a leaf.
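The leaf prediction rule can be sketched in a few lines of Python. This is an illustration only, not the Operator's implementation; the function names `leaf_class_prediction` and `leaf_regression_estimate` and the sample values are hypothetical.

```python
from collections import Counter
from statistics import mean


def leaf_class_prediction(labels):
    """Return the majority class among the training labels that reached a leaf."""
    return Counter(labels).most_common(1)[0][0]


def leaf_regression_estimate(values):
    """Return the average of the numerical target values that reached a leaf."""
    return mean(values)


# Hypothetical leaf contents, made up for illustration.
majority = leaf_class_prediction(["yes", "yes", "no"])
estimate = leaf_regression_estimate([10.0, 12.0, 14.0])
```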
This Operator can process ExampleSets containing both nominal and numerical Attributes. The label Attribute must be nominal for classification and numerical for regression.
After generation, the decision tree model can be applied to new Examples using the Apply Model Operator. Each Example follows the branches of the tree in accordance with the splitting rules until a leaf is reached.
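How an Example travels from the root to a leaf can be sketched as follows. The nested-dictionary layout, the function name `apply_model`, and the toy Attributes 'color' and 'size' are assumptions made for this illustration; they do not reflect how the Apply Model Operator stores or evaluates a model.

```python
def apply_model(node, example):
    """Follow the splitting rules from the root until a leaf is reached."""
    while "prediction" not in node:
        value = example[node["attribute"]]
        if "threshold" in node:
            # Numerical Attribute: compare against the split value.
            branch = "le" if value <= node["threshold"] else "gt"
        else:
            # Nominal Attribute: one branch per value.
            branch = value
        node = node["branches"][branch]
    return node["prediction"]


# Hypothetical toy tree, not produced by the Operator.
toy_tree = {
    "attribute": "color",
    "branches": {
        "red": {"prediction": "A"},
        "blue": {
            "attribute": "size",
            "threshold": 5.0,
            "branches": {"le": {"prediction": "B"}, "gt": {"prediction": "A"}},
        },
    },
}

result = apply_model(toy_tree, {"color": "blue", "size": 3.2})
```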
To configure the decision tree, please read the documentation on parameters as explained below.
The CHAID Operator provides a pruned decision tree that uses a chi-squared based criterion instead of the information gain or gain ratio criteria. This Operator cannot be applied to ExampleSets with numerical Attributes, only to ExampleSets with nominal Attributes.
The ID3 Operator provides a basic implementation of an unpruned decision tree. It only works on ExampleSets with nominal Attributes.
The Random Forest Operator creates several random trees on different Example subsets. The resulting model is based on voting of all these trees. Due to this difference, it is less prone to overtraining.
Bootstrap aggregating (bagging) is a machine learning ensemble meta-algorithm to improve classification and regression models in terms of stability and classification accuracy. It also reduces variance and helps to avoid 'overfitting'. Although it is usually applied to decision tree models, it can be used with any type of model.
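The idea behind bagging can be sketched in plain Python: draw bootstrap samples with replacement, train one model per sample, and let the models vote. The one-threshold base learner `train_stump` and the six-row dataset are stand-ins chosen for brevity, not what the Bagging Operator uses internally.

```python
import random
from collections import Counter


def bootstrap_sample(data, rng):
    """Draw len(data) rows from data with replacement."""
    return [rng.choice(data) for _ in data]


def train_stump(sample):
    """Toy base learner (an assumption): one threshold on one numerical feature."""
    best = None
    for threshold in sorted({x for x, _ in sample}):
        left = [y for x, y in sample if x <= threshold]
        right = [y for x, y in sample if x > threshold]
        left_label = Counter(left).most_common(1)[0][0]
        right_label = Counter(right).most_common(1)[0][0] if right else left_label
        errors = sum(y != left_label for y in left) + sum(y != right_label for y in right)
        if best is None or errors < best[0]:
            best = (errors, threshold, left_label, right_label)
    _, threshold, left_label, right_label = best
    return lambda x: left_label if x <= threshold else right_label


def bagging_fit(data, train, n_models, seed=0):
    """Train n_models base learners, each on its own bootstrap sample."""
    rng = random.Random(seed)
    return [train(bootstrap_sample(data, rng)) for _ in range(n_models)]


def bagging_predict(models, x):
    """Combine the individual predictions by majority vote."""
    return Counter(model(x) for model in models).most_common(1)[0][0]


# Made-up data: (feature value, label).
data = [(1.0, "no"), (2.0, "no"), (3.0, "no"), (7.0, "yes"), (8.0, "yes"), (9.0, "yes")]
models = bagging_fit(data, train_stump, n_models=25, seed=42)
prediction = bagging_predict(models, 8.5)
```

Because each model sees a slightly different sample, the vote averages out the quirks of individual models, which is where the variance reduction comes from.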
- training set (Data Table)
The input data which is used to generate the decision tree model.
- model (Decision Tree)
The decision tree model is delivered from this output port.
- example set (Data Table)
The ExampleSet that was given as input is passed unchanged to the output through this port.
- weights (Attribute Weights)
An ExampleSet containing Attributes and weight values, where each weight represents the feature importance for the given Attribute. A weight is given by the sum of improvements the selection of a given Attribute provided at a node. The amount of improvement is dependent on the chosen criterion.
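The weight computation described above amounts to summing, per Attribute, the improvements recorded at every node that splits on it. A minimal sketch follows. The improvement values are made up, and whether the Operator rescales or normalizes the sums afterwards is not stated here, so no normalization is applied.

```python
from collections import defaultdict


def attribute_weights(splits):
    """Sum the improvement of every split, grouped by the Attribute used."""
    weights = defaultdict(float)
    for attribute, improvement in splits:
        weights[attribute] += improvement
    return dict(weights)


# Hypothetical (Attribute, improvement) pairs, one per inner node of a tree.
splits = [("Outlook", 0.25), ("Humidity", 0.5), ("Wind", 0.125), ("Humidity", 0.25)]
weights = attribute_weights(splits)
```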
Selects the criterion on which Attributes will be selected for splitting. For each of these criteria the split value is optimized with regard to the chosen criterion. It can have one of the following values:
- information_gain: The entropies of all the Attributes are calculated and the one with the least entropy is selected for the split. This method has a bias towards selecting Attributes with a large number of values.
- gain_ratio: A variant of information gain that adjusts the information gain for each Attribute to allow for the breadth and uniformity of the Attribute values.
- gini_index: A measure of inequality between the distributions of label characteristics. Splitting on a chosen Attribute results in a reduction in the average gini index of the resulting subsets.
- accuracy: The Attribute that maximizes the accuracy of the whole tree is selected for splitting.
- least_square: The Attribute that minimizes the squared distance between the average of the values in the node and the true values is selected for splitting.
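The three entropy- and gini-based criteria can be illustrated with their common textbook definitions for a nominal Attribute. This is a sketch under that assumption; the Operator's exact computation may differ in detail, and the four-row sample is invented.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy of a list of nominal values, in bits."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def gini(labels):
    """Gini index of a list of nominal values."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def partition(values, labels):
    """Group the labels by the Attribute value of their Example."""
    groups = {}
    for value, label in zip(values, labels):
        groups.setdefault(value, []).append(label)
    return list(groups.values())


def information_gain(values, labels):
    """Entropy before the split minus the weighted entropy after it."""
    n = len(labels)
    remainder = sum(len(g) / n * entropy(g) for g in partition(values, labels))
    return entropy(labels) - remainder


def gain_ratio(values, labels):
    """Information gain divided by the entropy of the Attribute values."""
    split_info = entropy(values)
    return information_gain(values, labels) / split_info if split_info > 0 else 0.0


def gini_reduction(values, labels):
    """Gini index before the split minus the weighted gini index after it."""
    n = len(labels)
    return gini(labels) - sum(len(g) / n * gini(g) for g in partition(values, labels))


# Made-up sample: Attribute values and the corresponding labels.
outlook = ["sunny", "sunny", "overcast", "rain"]
play = ["no", "no", "yes", "yes"]
ig = information_gain(outlook, play)
gr = gain_ratio(outlook, play)
gd = gini_reduction(outlook, play)
```

In this sample the split is perfect, so the information gain equals the full label entropy, while the gain ratio is lower because the Attribute has three distinct values.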
The depth of a tree varies depending upon the size and characteristics of the ExampleSet. This parameter is used to restrict the depth of the decision tree. If its value is set to '-1', the maximal depth parameter puts no bound on the depth of the tree. In this case the tree is built until other stopping criteria are met. If its value is set to '1', a tree with a single node is generated.
The decision tree model can be pruned after generation. If checked, some branches are replaced by leaves according to the confidence parameter.
This parameter specifies the confidence level used for the pessimistic error calculation of pruning.
This parameter specifies if more stopping criteria than the maximal depth should be used during generation of the decision tree model. If checked, the parameters minimal gain, minimal leaf size, minimal size for split and number of prepruning alternatives are used as stopping criteria.
The gain of a node is calculated before splitting it. The node is split if its gain is greater than the minimal gain. A higher value of minimal gain results in fewer splits and thus a smaller tree. A value that is too high will completely prevent splitting and a tree with a single node is generated.
The size of a leaf is the number of Examples in its subset. The tree is generated in such a way that every leaf has at least the minimal leaf size number of Examples.
The size of a node is the number of Examples in its subset. Only those nodes are split whose size is greater than or equal to the minimal size for split parameter.
When prepruning prevents a split at a certain node, this parameter sets the number of alternative nodes that are tested for splitting. This situation occurs because prepruning runs in parallel to the tree generation process and may prevent splitting at a node when that split does not add to the discriminative power of the entire tree. In such a case, alternative nodes are tried for splitting.
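The stopping criteria described above (maximal depth, minimal size for split, minimal gain and minimal leaf size) can be combined into a single check. The sketch below is one reading of those descriptions. The function name `may_split`, the convention that the root has depth 1, and the default values are assumptions made for illustration, and the handling of prepruning alternatives is left out.

```python
def may_split(node_size, depth, gain, child_sizes, maximal_depth=-1,
              minimal_gain=0.01, minimal_leaf_size=2, minimal_size_for_split=4):
    """Return True if a node passes all stopping criteria and may be split.

    Assumes the root has depth 1, so maximal_depth=1 yields a single node.
    The default values are illustrative, not taken from the Operator.
    """
    if maximal_depth != -1 and depth >= maximal_depth:
        return False  # depth bound reached; -1 means unbounded
    if node_size < minimal_size_for_split:
        return False  # too few Examples in the node
    if gain <= minimal_gain:
        return False  # gain must be greater than the minimal gain
    # Every resulting leaf needs at least minimal_leaf_size Examples.
    return all(size >= minimal_leaf_size for size in child_sizes)


ok = may_split(node_size=10, depth=1, gain=0.2, child_sizes=[6, 4])
```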
Train a Decision Tree model
Goal: RapidMiner Studio comes with a sample dataset called 'Golf'. It contains Attributes describing the weather, namely 'Outlook', 'Temperature', 'Humidity' and 'Wind'. These are important features for deciding whether the game can be played or not. Our goal is to train a decision tree for predicting the 'Play' Attribute.
The 'Golf' dataset is retrieved using the Retrieve Operator. This data is fed to the Decision Tree Operator by connecting the output port of Retrieve to the input port of the Decision Tree Operator. Click on the Run button. This trains the decision tree model and takes you to the Results View, where you can examine it graphically as well as in a textual description.
The tree shows that whenever the Attribute 'Outlook' has the value 'overcast', the Attribute 'Play' will have the value 'yes'. If the Attribute 'Outlook' has the value 'rain', then two outcomes are possible:
a) if the Attribute 'Wind' has the value 'false', the 'Play' Attribute has the value 'yes'
b) if the 'Wind' Attribute has the value 'true', the Attribute 'Play' is 'no'.
Finally, if the Attribute 'Outlook' has the value 'sunny', there are again two possibilities.
The Attribute 'Play' is 'yes' if the value of Attribute 'Humidity' is less than or equal to 77.5 and it is 'no' if 'Humidity' is greater than 77.5.
In this example, every leaf node leads to exactly one of the two possible values of the label Attribute. The 'Play' Attribute is either 'yes' or 'no' in each leaf, which shows that the tree model fits the data very well.
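The rules read off the tree above can be written as a small function. This is only a restatement of the described tree for illustration; the function name `predict_play` is hypothetical and 'Wind' is passed as the strings 'true' and 'false', as in the description.

```python
def predict_play(outlook, wind, humidity):
    """Return the predicted 'Play' value according to the tree described above."""
    if outlook == "overcast":
        return "yes"
    if outlook == "rain":
        return "yes" if wind == "false" else "no"
    if outlook == "sunny":
        return "yes" if humidity <= 77.5 else "no"
    raise ValueError(f"unknown Outlook value: {outlook!r}")


sunny_dry = predict_play("sunny", "false", 70)
rainy_windy = predict_play("rain", "true", 80)
```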
Train a Decision Tree model and apply it to predict the outcome
Goal: In this tutorial a predictive analytics process using a decision tree is shown. It is slightly more advanced than the first tutorial. It also introduces basic but important concepts such as splitting the dataset into two partitions. The larger partition is used for training the decision tree model and the smaller partition is used for testing it. Our goal is to see how well the tree model predicts the fate of passengers in the test data set.
In this tutorial process a Decision Tree is used for regression. The 'Polynominal' data set, which has a numerical target Attribute as its label, is used. Before training the model, the data set is split into a training set and a test set. Afterwards, the predicted values are compared with the label values to obtain a performance measure using the Performance (Regression) Operator.
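The split, train and evaluate workflow can be sketched outside RapidMiner as well. The example below does not use the 'Polynominal' data set. It uses a synthetic data set (y = x squared), a single-split regression tree chosen by least squares, and the root mean squared error as the performance measure. All function names and the 70/30 split ratio are assumptions made for illustration.

```python
import random
from math import sqrt
from statistics import mean


def split_data(rows, train_ratio=0.7, seed=0):
    """Shuffle the rows and split them into a training and a test set."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_ratio)
    return shuffled[:cut], shuffled[cut:]


def sse(values):
    """Sum of squared distances between the values and their average."""
    m = mean(values)
    return sum((v - m) ** 2 for v in values)


def fit_regression_stump(train):
    """Pick the one threshold that minimizes the squared error of both leaves."""
    best = None
    thresholds = sorted({x for x, _ in train})[:-1]  # keep both sides non-empty
    for threshold in thresholds:
        left = [y for x, y in train if x <= threshold]
        right = [y for x, y in train if x > threshold]
        error = sse(left) + sse(right)
        if best is None or error < best[0]:
            best = (error, threshold, mean(left), mean(right))
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean


def rmse(model, test):
    """Root mean squared error of the model on the test rows."""
    return sqrt(mean((model(x) - y) ** 2 for x, y in test))


# Synthetic data: (x, x squared) for x = 0..9.
rows = [(float(x), float(x * x)) for x in range(10)]
train, test = split_data(rows, train_ratio=0.7, seed=1)
model = fit_regression_stump(train)
error = rmse(model, test)
```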