Random Forest (Concurrency)

Synopsis

This Operator generates a random forest model, which can be used for classification and regression.

Description

A random forest is an ensemble of random trees; the size of the ensemble is specified by the number of trees parameter. These trees are created/trained on bootstrapped sub-sets of the ExampleSet provided at the Input Port. Each node of a tree represents a splitting rule for one specific Attribute. Only a sub-set of Attributes, specified with the subset ratio parameter, is considered for the splitting rule selection. This rule separates values in an optimal way for the selected criterion parameter. For classification the rule separates values belonging to different classes, while for regression it separates them in order to reduce the error made by the estimation. The building of new nodes is repeated until the stopping criteria are met.
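The two sources of randomness described above (bootstrapped Examples per tree, a random sub-set of Attributes per node) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the Operator's implementation: the function names are hypothetical, and rounding the ratio to a whole number of Attributes is an assumption.

```python
import random

def bootstrap_sample(examples, rng):
    """Draw len(examples) Examples with replacement (a bootstrapped sub-set)."""
    return [rng.choice(examples) for _ in examples]

def candidate_attributes(attributes, subset_ratio, rng):
    """Pick the random sub-set of Attributes that is tested at one node.

    Rounding to a whole number (and keeping at least one Attribute)
    is an assumption made for this sketch.
    """
    k = max(1, int(round(subset_ratio * len(attributes))))
    return rng.sample(attributes, k)

rng = random.Random(42)  # fixed seed so the sketch is repeatable
rows = list(range(10))   # stand-in for the Examples of an ExampleSet
sample = bootstrap_sample(rows, rng)
attrs = candidate_attributes(["Outlook", "Temperature", "Humidity", "Wind"], 0.5, rng)
```

Because the sample is drawn with replacement, some Examples typically appear several times in one tree's training data while others are left out.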

After generation, the random forest model can be applied to new Examples using the Apply Model Operator. Each random tree generates a prediction for each Example by following the branches of the tree in accordance with the splitting rules and evaluating the leaf. Class predictions are based on the majority of Examples, while estimations are obtained through the average of values reaching a leaf. The resulting model is a voting model of all created random trees. Since all single predictions are considered equally important and are based on sub-sets of Examples, the resulting prediction tends to vary less than the single predictions.
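As a rough illustration of how a prediction is obtained for regression, the sketch below follows the splitting rules of each tree down to a leaf and then averages the leaf values over all trees. The nested-dict tree encoding and the function names are hypothetical and chosen only for this example.

```python
from statistics import mean

# Hypothetical encoding: an inner node is a dict holding a numerical
# splitting rule, a leaf is a plain number (the average label value of
# the training Examples that reached it).
tree_a = {"attribute": "x", "threshold": 5.0, "left": 1.0, "right": 3.0}
tree_b = {
    "attribute": "x", "threshold": 4.0, "left": 2.0,
    "right": {"attribute": "y", "threshold": 0.0, "left": 4.0, "right": 6.0},
}

def predict_tree(node, example):
    """Follow the branches according to the splitting rules until a leaf."""
    while isinstance(node, dict):
        goes_left = example[node["attribute"]] <= node["threshold"]
        node = node["left"] if goes_left else node["right"]
    return node

def predict_forest(trees, example):
    """Average the single tree estimations; every tree counts equally."""
    return mean(predict_tree(tree, example) for tree in trees)

example = {"x": 4.5, "y": 1.0}
estimate = predict_forest([tree_a, tree_b], example)
```

Here the first tree estimates 1.0 and the second 6.0, so the forest returns 3.5.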

A concept called pruning can be leveraged to reduce the complexity of the model by replacing sub-trees that provide only little predictive power with leaves. For the different types of pruning, refer to the parameter descriptions.

Extremely randomized trees are a method similar to random forest, which can be obtained by checking the random splits parameter and disabling pruning. Important parameters to tune for this method are the minimal leaf size and the subset ratio, which can be changed after disabling guess subset ratio. Good default choices for the minimal leaf size are 2 for classification and 5 for regression problems.

Differentiation

Decision Tree

The Decision Tree Operator creates one tree, where all Attributes are available at each node for selecting the optimal one with regard to the chosen criterion. Since only one tree is generated, the prediction is more comprehensible for humans, but the model might be prone to overtraining.

Bagging

Bootstrap aggregating (bagging) is a machine learning ensemble meta-algorithm to improve classification and regression models in terms of stability and classification accuracy. It also reduces variance and helps to avoid 'overfitting'. Although it is usually applied to decision tree models, it can be used with any type of model. The random forest uses bagging with random trees.

Gradient Boosted Trees

The Gradient Boosted Trees Operator trains a model by iteratively improving a single tree model. After each iteration step the Examples are reweighted based on their previous prediction. The final model is a weighted sum of all created models. Training parameters are optimized based on the gradient of the function described by the errors made.

Input

training set (Data Table)
The input data which is used to generate the random forest model.

Output

model (Random Forest Model)
The random forest model is delivered from this output port.
example set (Data Table)
The ExampleSet that was given as input is passed unchanged to the output through this port.
weights (Attribute Weights)
An ExampleSet containing Attributes and weight values, where each weight represents the feature importance for the given Attribute. A weight is given by the sum of improvements the selection of a given Attribute provided at a node. The amount of improvement is dependent on the chosen criterion.
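The weight computation described above can be pictured as a simple accumulation over all nodes of all trees. The sketch below assumes a hypothetical list of (Attribute, improvement) pairs, one per split node; the improvement values are made up for illustration and do not come from a real model.

```python
from collections import defaultdict

# Hypothetical record of split nodes: (Attribute used, improvement gained).
splits = [("Outlook", 0.25), ("Humidity", 0.10), ("Outlook", 0.05), ("Wind", 0.02)]

def attribute_weights(split_nodes):
    """Sum the improvements contributed by each Attribute."""
    weights = defaultdict(float)
    for attribute, improvement in split_nodes:
        weights[attribute] += improvement
    return dict(weights)

weights = attribute_weights(splits)
```

An Attribute that is never selected for a split does not appear in the result, which corresponds to a weight of zero.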

Parameters

number_of_trees
This parameter specifies the number of random trees to generate. For each tree a sub-set of Examples is selected via bootstrapping. If the parameter enable parallel execution is checked, the trees are trained in parallel across available processor threads.
Range:
criterion
Selects the criterion on which Attributes will be selected for splitting. For each of these criteria the split value is optimized with regards to the chosen criterion. It can have one of the following values:
- information_gain: The entropies of all the Attributes are calculated and the one with the least entropy is selected for the split. This method has a bias towards selecting Attributes with a large number of values.
- gain_ratio: A variant of information gain that adjusts the information gain for each Attribute to allow for the breadth and uniformity of the Attribute values.
- gini_index: A measure of inequality between the distributions of label characteristics. Splitting on a chosen Attribute results in a reduction in the average gini index of the resulting subsets.
- accuracy: The Attribute selected for splitting is the one that maximizes the accuracy of the whole tree.
- least_square: The Attribute selected for splitting is the one that minimizes the squared distance between the average of values in the node and the true value.
Range:
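To make the information_gain and gini_index criteria more concrete, the following sketch computes both for one candidate split using the textbook definitions. It is an illustration only; the Operator's exact formulas may differ in detail, and all function names are hypothetical.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy of the label distribution, in bits."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def gini(labels):
    """Gini index of the label distribution."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def weighted(impurity, partitions):
    """Average impurity of the subsets, weighted by subset size."""
    total = sum(len(p) for p in partitions)
    return sum(len(p) / total * impurity(p) for p in partitions)

def information_gain(parent, partitions):
    """Reduction in entropy achieved by splitting parent into partitions."""
    return entropy(parent) - weighted(entropy, partitions)

parent = ["yes"] * 5 + ["no"] * 5
split = [["yes"] * 4 + ["no"], ["yes"] + ["no"] * 4]
gain = information_gain(parent, split)
gini_after = weighted(gini, split)
```

For this split the gini index drops from 0.5 in the parent to 0.32 on average in the subsets, and the information gain is roughly 0.28 bits.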
maximal_depth
The depth of a tree varies depending upon the size and characteristics of the ExampleSet. This parameter is used to restrict the depth for each random tree. If its value is set to '-1', the maximal depth parameter puts no bound on the depth of the trees. In this case all trees are built until other stopping criteria are met. If its value is set to '1', only trees with a single node are generated.
Range:
apply_prepruning
This parameter specifies if more stopping criteria than the maximal depth should be used during generation of the decision trees. If checked, the parameters minimal gain, minimal leaf size, minimal size for split and number of prepruning alternatives are used as stopping criteria.
Range:
minimal_gain
The gain of a node is calculated before splitting it. The node is split if its gain is greater than the minimal gain. A higher value of minimal gain results in fewer splits and thus smaller trees. A value that is too high will completely prevent splitting and trees with single nodes are generated.
Range:
minimal_leaf_size
The size of a leaf is the number of Examples in its subset. The trees of the random forest are generated in such a way that every leaf has at least the minimal leaf size number of Examples.
Range:
minimal_size_for_split
The size of a node is the number of Examples in its subset. Only those nodes are split whose size is greater than or equal to the minimal size for split parameter.
Range:
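The three prepruning conditions described so far (minimal gain, minimal leaf size, minimal size for split) can be read as a single check performed before a node is split. The sketch below is a hypothetical helper, not the Operator's code, and the limit values are illustrative only, not documented defaults.

```python
def may_split(node_size, child_sizes, gain,
              minimal_gain, minimal_leaf_size, minimal_size_for_split):
    """Return True if none of the prepruning conditions forbids the split."""
    if node_size < minimal_size_for_split:
        return False  # node holds too few Examples to be split at all
    if gain <= minimal_gain:
        return False  # the gain must be greater than the minimal gain
    # every resulting leaf must hold at least minimal_leaf_size Examples
    return all(size >= minimal_leaf_size for size in child_sizes)

# Illustrative limits chosen for this example only.
limits = dict(minimal_gain=0.01, minimal_leaf_size=2, minimal_size_for_split=4)

allowed = may_split(10, [6, 4], 0.2, **limits)
too_small = may_split(3, [2, 1], 0.2, **limits)
tiny_leaf = may_split(10, [9, 1], 0.2, **limits)
low_gain = may_split(10, [5, 5], 0.01, **limits)
```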
number_of_prepruning_alternatives
When a split is prevented by prepruning at a certain node, this parameter adjusts the number of alternative nodes tested for splitting. This occurs because prepruning runs in parallel to the tree generation process, which may prevent splitting at certain nodes when splitting at that node does not add to the discriminative power of the entire tree. In such a case, alternative nodes are tried for splitting.
Range:
apply_pruning
The random trees of the random forest model can be pruned after generation. If checked, some branches are replaced by leaves according to the confidence parameter. This parameter is not available for the 'least_square' criterion.
Range:
confidence
This parameter specifies the confidence level used for the pessimistic error calculation of pruning.
Range:
random_splits
If checked, this parameter causes the splits of numerical Attributes to be chosen randomly instead of being optimized. For the random selection a uniform sampling between the minimal and maximal value for the current Attribute is performed. Activating this parameter while disabling pruning configures the random forest to become an extremely randomized tree (also known as Extra-Tree). This also speeds up the model building process.
Range:
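A random split as described for random_splits can be sketched as drawing a threshold uniformly between the smallest and largest value of the Attribute. The helper name and the sample values below are hypothetical.

```python
import random

def random_split(values, rng):
    """Choose the split threshold uniformly between min and max of values."""
    threshold = rng.uniform(min(values), max(values))
    left = [v for v in values if v <= threshold]
    right = [v for v in values if v > threshold]
    return threshold, left, right

rng = random.Random(7)  # fixed seed so the sketch is repeatable
temperatures = [64, 65, 68, 69, 70, 71, 72, 75, 80, 81, 83, 85]
threshold, left, right = random_split(temperatures, rng)
```

No candidate thresholds are scored against the criterion here, which is why this mode builds trees faster than an optimized split search.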
guess_subset_ratio
If this parameter is set to true, then int(log(m) + 1) Attributes are used; otherwise a ratio should be specified by the subset ratio parameter.
Range:
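The expression int(log(m) + 1) can be evaluated as follows. Two assumptions are made here that the text does not state: m is taken to be the number of available Attributes, and log is taken to be the natural logarithm.

```python
import math

def guessed_attribute_count(m):
    """int(log(m) + 1), assuming m = number of Attributes and natural log."""
    return int(math.log(m) + 1)

counts = {m: guessed_attribute_count(m) for m in (1, 4, 100)}
```

Under these assumptions, 4 Attributes lead to 2 candidates per node and 100 Attributes to 5, so the number of tested Attributes grows slowly with the width of the data.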
subset_ratio
This parameter specifies the ratio of randomly chosen Attributes to test.
Range:
voting_strategy
Specifies the prediction strategy in case of dissenting tree model predictions:
- confidence_vote: Selects the class that has the highest accumulated confidence.
- majority_vote: Selects the class that was predicted by the majority of tree models.
This parameter is not available for the 'least_square' criterion.
Range:
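The two voting strategies can disagree, as the following sketch shows. It assumes that each tree reports one confidence value per class; the values and function names are made up for illustration.

```python
from collections import Counter, defaultdict

# Hypothetical per-tree confidences for one Example.
tree_confidences = [
    {"yes": 0.55, "no": 0.45},
    {"yes": 0.60, "no": 0.40},
    {"yes": 0.05, "no": 0.95},
]

def majority_vote(confidences):
    """Each tree votes for its most confident class; the most votes win."""
    votes = Counter(max(c, key=c.get) for c in confidences)
    return votes.most_common(1)[0][0]

def confidence_vote(confidences):
    """Accumulate confidences per class; the highest total wins."""
    totals = defaultdict(float)
    for c in confidences:
        for label, value in c.items():
            totals[label] += value
    return max(totals, key=totals.get)

by_majority = majority_vote(tree_confidences)
by_confidence = confidence_vote(tree_confidences)
```

Two of the three trees narrowly prefer "yes", so the majority vote returns "yes", while the one very confident tree tips the accumulated confidence towards "no".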
use_local_random_seed
This parameter indicates if a local random seed should be used for randomization.
Range:
local_random_seed
If the use local random seed parameter is checked this parameter determines the local random seed.
Range:
enable_parallel_execution
This parameter enables the parallel execution of the model building process by distributing the Random Tree generation between all available CPU threads. Please disable the parallel execution if you run into memory problems.
Range:

Tutorial Processes

Generating a set of random trees using the random forest Operator

In this tutorial process the 'Golf' data set is retrieved and used to train a random forest for classification with 10 random trees. The generated model is afterwards applied to a test data set. Resulting predictions, the generated model and feature importance values provided by the Operators are viewed.

Checking the output of the Apply Model Operator's 'lab' port reveals the labeled data set with predictions obtained from applying the model to an unseen data set. Inspecting the model shows a Collection of 10 random trees that make up the random forest and contribute to the predictive process. Looking at the output of the 'wei' port from the Random Forest Operator provides information about the Attribute weights. These weights contain importance values regarding the predictive power of an Attribute to the overall decision of the random forest.

Random forest for regression

In this tutorial process a random forest is used for regression. The 'Polynominal' data set is used, with a numerical target Attribute as the label. Before training the model, the data set is split into a training and a test set. Afterwards the regressed values are compared with the label values to obtain a performance measure using the Performance (Regression) Operator.

Comparison between decision tree and random forest

In this tutorial process a comparison highlighting the difference between decision trees and random forest is shown. The 'Polynominal' sample data set is split into a training and a test set. Afterwards the training data set is used to generate a decision tree and a random forest model for regression. Applying the models to the test data set and evaluating the performance shows that both methods provide similar results, but differ in the deviation of the results when applied to test data.