Weight by Tree Importance (RapidMiner Studio Core)

Synopsis

This operator calculates the weight of the attributes by analyzing the split points of a Random Forest model. The attributes with higher weight are considered more relevant and important.

Description

This weighting scheme uses a given random forest to extract the implicit importance of the attributes used in it. To do so, each node of each tree is visited and the benefit created by the respective split is retrieved. This benefit is summed per attribute that was used for the split. The mean benefit over all trees is used as the importance.
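The traversal described above can be sketched in a few lines of Python. This is only an illustration, not the operator's actual implementation: the Node class, its fields and the function names are hypothetical, and the sketch assumes that an attribute which does not appear in a tree contributes a benefit of zero for that tree.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Node:
    """Hypothetical tree node; a leaf has attribute=None."""
    attribute: Optional[str] = None  # attribute used for the split at this node
    benefit: float = 0.0             # benefit of the split under the chosen criterion
    children: List["Node"] = field(default_factory=list)


def tree_benefits(root: Node) -> Dict[str, float]:
    """Visit every node of one tree and sum the split benefit per attribute."""
    totals: Dict[str, float] = defaultdict(float)
    stack = [root]
    while stack:
        node = stack.pop()
        if node.attribute is not None:  # leaves carry no split
            totals[node.attribute] += node.benefit
        stack.extend(node.children)
    return totals


def forest_importance(trees: List[Node]) -> Dict[str, float]:
    """Average the per-tree sums over all trees of the forest.

    Assumption: a tree that never splits on an attribute counts as 0 for it.
    """
    sums: Dict[str, float] = defaultdict(float)
    for tree in trees:
        for attribute, benefit in tree_benefits(tree).items():
            sums[attribute] += benefit
    return {attribute: total / len(trees) for attribute, total in sums.items()}


# Two tiny hand-built trees with made-up benefit values.
tree_a = Node("Outlook", 0.25, [Node("Humidity", 0.5, [Node(), Node()]), Node()])
tree_b = Node("Humidity", 0.25, [Node(), Node()])
importance = forest_importance([tree_a, tree_b])
print(importance)
```

In this toy forest, Humidity is split on in both trees and therefore ends up with a higher mean benefit than Outlook, which is used only once.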

This algorithm is implemented following the idea from "A comparison of random forest and its Gini importance with standard chemometric methods for the feature selection and classification of spectral data" by Menze, Bjoern H. et al. (2009). It has been extended with additional criteria for computing the benefit created by a split. The original paper only mentioned the Gini index; this operator additionally supports the more reliable criteria Information Gain and Information Gain Ratio.
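To make the three criteria concrete, the following sketch computes the benefit of a single split using the textbook definitions of Gini index, information gain and gain ratio. The function names are hypothetical, and the operator's internal computation may differ in details such as weighting or handling of missing values.

```python
import math
from collections import Counter
from typing import Callable, Sequence


def gini(labels: Sequence[str]) -> float:
    """Gini impurity of a set of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def entropy(labels: Sequence[str]) -> float:
    """Shannon entropy (in bits) of a set of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def impurity_decrease(
    parent: Sequence[str],
    partitions: Sequence[Sequence[str]],
    impurity: Callable[[Sequence[str]], float],
) -> float:
    """Parent impurity minus the size-weighted impurity of the child partitions."""
    n = len(parent)
    weighted = sum(len(p) / n * impurity(p) for p in partitions)
    return impurity(parent) - weighted


def gain_ratio(parent: Sequence[str], partitions: Sequence[Sequence[str]]) -> float:
    """Information gain divided by the entropy of the partition sizes."""
    n = len(parent)
    split_info = -sum(
        len(p) / n * math.log2(len(p) / n) for p in partitions if len(p) > 0
    )
    gain = impurity_decrease(parent, partitions, entropy)
    return gain / split_info if split_info > 0 else 0.0


# A perfectly separating split of eight examples into two pure halves.
parent = ["yes"] * 4 + ["no"] * 4
partitions = [["yes"] * 4, ["no"] * 4]

gini_benefit = impurity_decrease(parent, partitions, gini)
info_gain = impurity_decrease(parent, partitions, entropy)
ratio = gain_ratio(parent, partitions)
print(gini_benefit, info_gain, ratio)
```

Whichever criterion is selected, the resulting value plays the role of the per-split benefit that is summed per attribute and averaged over the trees.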

Input

  • random forest (Random Forest Model)

    The input port expects a Random Forest model, which is a voting model of random trees. In the attached Example Process, it is the output of the Random Forest operator.

Output

  • weights (Average Vector)

    This port delivers the weights of the attributes with respect to the label attribute. The attributes with higher weight are considered more relevant.

  • random forest (Random Forest Model)

    The Random Forest model that was given as input is passed unchanged to the output through this port. This is usually used to reuse the same model in further operators or to view the model in the Results Workspace.

Parameters

  • criterion

    This parameter specifies the criterion to be used for weighting the attributes. It can have one of the following values: information gain, gain ratio, gini index or accuracy. Range: selection

  • normalize_weights

    This parameter indicates if the calculated weights should be normalized or not. If set to true, all weights are normalized in a range from 0 to 1. Range: boolean
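As an illustration of what normalizing weights to the range from 0 to 1 can look like, here is a minimal sketch. It assumes min-max scaling, which this documentation does not confirm as the exact method used; the function name and the handling of identical weights are likewise assumptions.

```python
from typing import Dict


def normalize_weights(weights: Dict[str, float]) -> Dict[str, float]:
    """Min-max scale weights so the smallest becomes 0 and the largest 1.

    Assumption: if all weights are equal, every weight is mapped to 1.0.
    """
    low, high = min(weights.values()), max(weights.values())
    if high == low:
        return {name: 1.0 for name in weights}
    return {name: (w - low) / (high - low) for name, w in weights.items()}


# Made-up raw importance values for three Golf attributes.
raw = {"Outlook": 0.125, "Humidity": 0.375, "Wind": 0.25}
normalized = normalize_weights(raw)
print(normalized)
```

Under this assumption the ranking of the attributes is unchanged; only the scale of the weights differs.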

Tutorial Processes

Calculating the attribute weights of the Golf data set using a Random Forest model

The 'Golf' data set is loaded using the Retrieve operator. The Random Forest operator is applied to it to generate a random forest model. A breakpoint is inserted here so that you can have a look at the generated model. The resulting model is provided as input to the Weight by Tree Importance operator to calculate the weights of the attributes of the 'Golf' data set. All parameters are used with default values. The normalize weights parameter is set to true, so all weights are normalized in a range from 0 to 1. You can verify this by viewing the results of this process in the Results Workspace.