Gradient Boosted Trees (H2O)
Synopsis
Executes the GBT algorithm using H2O 188.8.131.52.
Please note that the result of this algorithm may depend on the number of threads used. Different settings may lead to slightly different outputs.
A gradient boosted model is an ensemble of either regression or classification tree models. Both are forward-learning ensemble methods that obtain predictive results through gradually improved estimations. Boosting is a flexible nonlinear regression procedure that helps improve the accuracy of trees. By sequentially applying weak classification algorithms to the incrementally changed data, a series of decision trees is created that forms an ensemble of weak prediction models. While boosting trees increases their accuracy, it also decreases speed and human interpretability. The gradient boosting method generalizes tree boosting to minimize these issues.
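The sequential idea described above can be sketched in a few lines of plain Python. This is only a minimal illustration, assuming a single numeric input, squared-error loss, and depth-1 trees (stumps); the helper names fit_stump, fit_gbt, and mse are hypothetical and are not part of H2O or RapidMiner, whose implementation is far more elaborate.

```python
# Minimal gradient boosting for squared-error regression with depth-1 trees.
# Illustrative sketch only; not the H2O implementation.

def fit_stump(xs, residuals):
    """Find the single threshold split on x that minimizes squared error."""
    best = None
    # Every candidate threshold leaves at least one row on each side.
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean


def fit_gbt(xs, ys, number_of_trees=20, learning_rate=0.1):
    """Return a predictor built from sequentially fitted stumps."""
    base = sum(ys) / len(ys)  # initial constant prediction
    stumps = []
    predictions = [base] * len(xs)
    for _ in range(number_of_trees):
        # For squared error, the negative gradient is simply the residual.
        residuals = [y - p for y, p in zip(ys, predictions)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        # Each new weak model corrects the current ensemble a little.
        predictions = [p + learning_rate * stump(x)
                       for x, p in zip(xs, predictions)]
    return lambda x: base + learning_rate * sum(s(x) for s in stumps)


def mse(model, xs, ys):
    """Mean squared error of a model on the given data."""
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
ys = [x * x for x in xs]  # a nonlinear target
few_trees = fit_gbt(xs, ys, number_of_trees=2)
many_trees = fit_gbt(xs, ys, number_of_trees=50)
```

On this toy data the training error of many_trees is lower than that of few_trees, which mirrors the "gradually improved estimations" described above. Training error alone says nothing about overfitting.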
The operator starts a 1-node local H2O cluster and runs the algorithm on it. Although it uses one node, the execution is parallel. You can set the level of parallelism by changing the Settings/Preferences/General/Number of threads setting. By default it uses the recommended number of threads for the system. Only one instance of the cluster is started and it remains running until you close RapidMiner Studio.
- training set (Data Table)
The input port expects a labeled ExampleSet.
The Gradient Boosted classification or regression model is delivered from this output port. This classification or regression model can be applied on unseen data sets for prediction of the label attribute.
- example set (Data Table)
The ExampleSet that was given as input is passed through this port unchanged. This is usually used to reuse the same ExampleSet in further operators or to view the ExampleSet in the Results Workspace.
- weights (Attribute Weights)
This port delivers the weights of the attributes with respect to the label attribute.
- number_of_trees A non-negative integer that defines the number of trees. The default is 20. Range: integer
- reproducible Makes model building reproducible. If set, the maximum_number_of_threads parameter controls the parallelism level of model building. If not set, the parallelism level is defined by the number of threads in the General Preferences. Range: boolean
- maximum_number_of_threads Controls parallelism level of model building. Range: integer
- use_local_random_seed Available only if reproducible is set to true. Indicates if a local random seed should be used for randomization. Range: boolean
- local_random_seed This parameter specifies the local random seed. This parameter is only available if the use_local_random_seed parameter is set to true. Range: integer
- maximal_depth The user-defined tree depth. The default is 5. Range: integer
- min_rows The minimum number of rows to assign to the terminal nodes. The default is 10.0. If a weight column is specified, the row counts are weighted as well. For example, if a terminal node contains two rows with the weights 0.3 and 0.4, it counts as 0.7 toward the minimum number of rows. Range: real
- min_split_improvement Minimum relative improvement in squared error reduction for a split to happen. Range: real
- number_of_bins For numerical columns (real/integer), build a histogram of at least the specified number of bins, then split at the best point. The default is 20. Range: integer
- learning_rate The learning rate. Smaller learning rates tend to produce better models, but at the cost of longer training and scoring times, because a lower learning rate requires more iterations. The default is 0.1 and the range is 0.0 to 1.0. Range: real
- sample_rate Row sample rate per tree (from 0.0 to 1.0). Range: real
- distribution The distribution function for the training data. For some functions (e.g. tweedie), further tuning can be achieved via the expert parameters.
- AUTO: Automatic selection. Uses multinomial for nominal and gaussian for numeric labels.
- bernoulli: Bernoulli distribution. Can be used for binominal or 2-class polynominal labels.
- gaussian, poisson, gamma, tweedie, quantile: Distribution functions for regression.
- early_stopping If true, the parameters for early stopping need to be specified. Range: boolean
- stopping_rounds Early stopping based on convergence of stopping_metric. Stop if simple moving average of length k of the stopping_metric does not improve for k:=stopping_rounds scoring events. This parameter is visible only if early_stopping is set. Range: integer
- stopping_metric Metric to use for early stopping. Set stopping_tolerance to tune it. This parameter is visible only if early_stopping is set.
- AUTO: Automatic selection. Uses logloss for classification, deviance for regression.
- deviance, logloss, MSE, AUC, lift_top_group, r2, misclassification: The metric to use to decide if the algorithm should be stopped.
- stopping_tolerance Relative tolerance for metric-based stopping criterion (stop if relative improvement is not at least this much). This parameter is visible only if early_stopping is set. Range: real
- max_runtime_seconds Maximum allowed runtime in seconds for model training. Use 0 to disable. Range: integer
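The interaction of stopping_rounds, stopping_metric, and stopping_tolerance can be illustrated with a small sketch. It is one plausible reading of the moving-average rule described above, not H2O's actual code: the function names, the use of the last 2k scoring events, and the lower_is_better flag are assumptions made for illustration.

```python
def moving_averages(values, k):
    """Simple moving averages of window length k over a list of values."""
    return [sum(values[i:i + k]) / k for i in range(len(values) - k + 1)]


def should_stop(metric_history, stopping_rounds, stopping_tolerance,
                lower_is_better=True):
    """Decide whether to stop, given one metric value per scoring event.

    Assumption: the oldest of the last k + 1 moving averages is the
    reference, and training stops when none of the k later averages
    improves on it by at least the relative tolerance.
    """
    k = stopping_rounds
    if k <= 0 or len(metric_history) < 2 * k:
        return False  # early stopping disabled or too few scoring events
    averages = moving_averages(metric_history[-2 * k:], k)  # k + 1 values
    reference, later = averages[0], averages[1:]
    scale = abs(reference) or 1.0  # avoid dividing by zero
    if lower_is_better:  # e.g. logloss, MSE, deviance
        improvement = (reference - min(later)) / scale
    else:  # e.g. AUC
        improvement = (max(later) - reference) / scale
    return improvement < stopping_tolerance


flat_logloss = [0.50, 0.40, 0.35, 0.35, 0.35, 0.35, 0.35, 0.35]
improving_logloss = [1.0, 0.8, 0.6, 0.5, 0.4, 0.3]
stop_flat = should_stop(flat_logloss, stopping_rounds=3, stopping_tolerance=0.01)
stop_improving = should_stop(improving_logloss, stopping_rounds=3, stopping_tolerance=0.01)
```

In this sketch a flat metric triggers stopping while a steadily improving one does not. A larger stopping_rounds value smooths out noisy scoring events at the price of stopping later.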
These parameters are for fine tuning the algorithm. Usually the default values provide a decent model, but in some cases it may be useful to change them. Please use true/false values for boolean parameters and the exact attribute name for columns. Arrays can be provided by separating the values with the comma (,) character.
More information on the parameters can be found in the H2O documentation.
- score_each_iteration: Whether to score during each iteration of model training. Type: boolean, Default: false
- score_tree_interval: Score the model after every so many trees. Disabled if set to 0. Type: integer, Default: 0
- fold_assignment: Cross-validation fold assignment scheme, if fold_column is not specified. Options: AUTO, Random, Modulo, Stratified. Type: enumeration, Default: AUTO
- fold_column: Column name with cross-validation fold index assignment per observation. Type: column, Default: no fold column
- offset_column: Offset column name. Type: column, Default: no offset column
- balance_classes: Balance training data class counts via over/under-sampling (for imbalanced data). Type: boolean, Default: false
- max_after_balance_size: Maximum relative size of the training data after balancing class counts (can be less than 1.0). Requires balance_classes. Type: real, Default: 5.0
- max_confusion_matrix_size: Maximum size (# classes) for confusion matrices to be printed in the Logs. Type: integer, Default: 20
- nbins_top_level: For numerical columns (real/int), build a histogram of (at most) this many bins at the root level, then decrease by factor of two per level. Type: integer, Default: 1024
- nbins_cats: For categorical columns (factors), build a histogram of this many bins, then split at the best point. Higher values can lead to more overfitting. Type: integer, Default: 1024
- r2_stopping: Stop making trees when the R^2 metric equals or exceeds this. Type: double, Default: 0.999999
- quantile_alpha: Desired quantile for quantile regression (from 0.0 to 1.0). Type: double, Default: 0.5
- tweedie_power: Tweedie Power (between 1 and 2). Type: double, Default: 1.5
- col_sample_rate: Column sample rate (from 0.0 to 1.0). Type: double, Default: 1.0
- col_sample_rate_per_tree: Column sample rate per tree (from 0.0 to 1.0). Type: double, Default: 1.0
- keep_cross_validation_predictions: Keep cross-validation model predictions. Type: boolean, Default: false
- keep_cross_validation_fold_assignment: Keep cross-validation fold assignment. Type: boolean, Default: false
- class_sampling_factors: Desired over/under-sampling ratios per class (in lexicographic order). If not specified, sampling factors will be automatically computed to obtain class balance during training. Requires balance_classes=true. Type: float array, Default: empty
- learn_rate_annealing: Scale down the learning rate by this factor after each tree. Type: double, Default: 1.0
- sample_rate_per_class: Row sample rate per tree per class (from 0.0 to 1.0). Type: double array, Default: empty
- col_sample_rate_change_per_level: Relative change of the column sampling rate for every level (from 0.0 to 2.0). Type: double, Default: 1.0
- max_abs_leafnode_pred: Maximum absolute value of a leaf node prediction. Type: double, Default: Infinity
- nfolds: Number of folds for cross-validation. Use 0 to turn off cross-validation. Type: integer, Default: 0
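As a small illustration of how learn_rate_annealing interacts with learning_rate, the sketch below lists the rate each successive tree would receive. It assumes that the first tree uses the unscaled learning rate and that the factor is applied once per additional tree; the function name is hypothetical.

```python
def effective_learning_rates(learning_rate, learn_rate_annealing, number_of_trees):
    """Learning rate applied to each successive tree under geometric annealing."""
    # Assumption: tree 0 uses the full rate, tree k uses rate * factor ** k.
    return [learning_rate * learn_rate_annealing ** tree_index
            for tree_index in range(number_of_trees)]


annealed = effective_learning_rates(0.1, 0.99, 20)
constant = effective_learning_rates(0.1, 1.0, 20)  # default factor: no annealing

for tree_index in (0, 9, 19):
    print(tree_index, round(annealed[tree_index], 5))
```

With the default factor of 1.0 every tree gets the same rate. Factors slightly below 1.0 let early trees take larger steps while later trees make finer corrections.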
Classification using GBT
The H2O GBT operator is used to predict the future_customer attribute of the Deals sample dataset. Since the label is nominal, classification is performed. The GBT parameters are changed slightly: the number of trees is decreased to 10 to lower the execution time and to prevent overfitting, and the learning rate is increased to 0.3 for similar reasons. The resulting model is connected to an Apply Model operator that applies the GBT model to the Deals_Testset sample data. The labeled ExampleSet is connected to a Performance (Binominal Classification) operator that calculates the Accuracy metric. The Performance Vector and the Gradient Boosted Model are shown on the process output. The trees of the Gradient Boosted Model can be inspected in the Results view.
Classification with Split Validation using GBT
The H2O GBT operator is used to predict the label attribute of the Iris sample dataset. Since the label is polynominal, classification is performed. The learner operator is placed inside a Split Validation so that the performance of the classification can be checked. The number of trees is set to 10; all other parameters are kept at their default values. The Performance (Classification) operator delivers the accuracy and the classification error. The model contains 30 trees, because H2O creates 10 trees for every unique label value.
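The tree count mentioned above follows from simple arithmetic, sketched here for the multi-class case only. The function name is hypothetical, and the sketch relies solely on the statement that one set of trees is built per unique label value; two-class labels may be handled differently.

```python
def total_trees_multiclass(number_of_trees, label_values):
    """Trees in a multi-class model: one tree per class per boosting round."""
    return number_of_trees * len(set(label_values))


iris_labels = ["Iris-setosa", "Iris-versicolor", "Iris-virginica"]
print(total_trees_multiclass(10, iris_labels))
```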
Regression using GBT
The H2O GBT operator is used to predict the label attribute of the Polynomial sample dataset. Since the label is real, regression is performed. The sample data is retrieved, then split into two parts with the Split Data operator. The first output is used as the training data set, the second as the scoring data set. The GBT operator's distribution parameter is changed to "gamma". After the model is applied to the scoring ExampleSet, the output contains the GradientBoostedModel and the labeled data. If you select the Charts/Series chart style for the labeled data and choose label and prediction label in the Plot Series field, you can check the accuracy of the prediction visually.