Generalized Linear Model (H2O)
Synopsis
Executes GLM algorithm using H2O 3.42.0.1.
Description
Please note that the result of this algorithm may depend on the number of threads used. Different settings may lead to slightly different outputs.
Generalized linear models (GLMs) are an extension of traditional linear models. This algorithm fits generalized linear models to the data by maximizing the log-likelihood. The elastic net penalty can be used for parameter regularization. The model fitting computation is parallel, extremely fast, and scales extremely well for models with a limited number of predictors with non-zero coefficients.
The operator starts a 1-node local H2O cluster and runs the algorithm on it. Although it uses one node, the execution is parallel. You can set the level of parallelism by changing the Settings/Preferences/General/Number of threads setting. By default it uses the recommended number of threads for the system. Only one instance of the cluster is started and it remains running until you close RapidMiner Studio.
Please note that below version 7.6, a threshold value optimized for maximal F-measure is used for prediction by default.
Input
- training set (Data table)
The input port expects a labeled ExampleSet.
Output
- model
The Generalized Linear classification or regression model is delivered from this output port. This classification or regression model can be applied on unseen data sets for prediction of the label attribute.
- example set (Data table)
The ExampleSet that was given as input is passed unchanged to the output through this port. This is usually used to reuse the same ExampleSet in further operators or to view the ExampleSet in the Results Workspace.
- weights (Attribute Weights)
This port delivers the weights of the attributes with respect to the label attribute.
- threshold
This port is used only for binominal classification tasks. It provides a threshold value optimized for maximal F-measure. (This threshold is used in the trained model by default.)
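As an illustration of what "optimized for maximal F-measure" means, the following minimal sketch tries every distinct confidence value as a cut-off and keeps the one with the highest F1 score. The function names and the toy data are hypothetical, and the operator's actual search may differ.

```python
def f_measure(labels, scores, threshold):
    """F1 score when scores >= threshold are predicted as the positive class (1)."""
    tp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 1)
    fp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 0)
    fn = sum(1 for y, s in zip(labels, scores) if s < threshold and y == 1)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def best_f_threshold(labels, scores):
    """Try each distinct score as a cut-off and return the one with the highest F1."""
    candidates = sorted(set(scores))
    return max(candidates, key=lambda t: f_measure(labels, scores, t))


# Toy data (made up): true classes and predicted confidences for class 1.
labels = [0, 0, 1, 1, 1]
scores = [0.10, 0.40, 0.35, 0.80, 0.90]
threshold = best_f_threshold(labels, scores)
```

On this toy data the cut-off 0.35 wins, because it keeps all three positive examples while admitting only one negative.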
Parameters
- family
Family. Use binomial for classification with logistic regression (or multinomial for labels with more than two classes); the other families are for regression problems.
- AUTO: Automatic selection. Uses multinomial for polynominal, binomial for binominal and gaussian for numeric labels.
- gaussian: The data must be numeric (real or integer).
- binomial: The data must be binominal or polynominal with 2 levels/classes.
- multinomial: The data must be polynominal with more than two levels/classes.
- poisson: The data must be numeric and non-negative (integer).
- gamma: The data must be numeric and continuous and positive (real or integer).
- tweedie: The data must be numeric and continuous (real) and non-negative.
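The AUTO rule described above can be written as a small lookup. This is only a sketch of the documented mapping; the function name and the label type strings are assumptions chosen for illustration.

```python
def auto_family(label_type):
    """Map a label type to the family that AUTO is documented to select."""
    mapping = {
        "numeric": "gaussian",
        "binominal": "binomial",
        "polynominal": "multinomial",
    }
    if label_type not in mapping:
        raise ValueError(f"unsupported label type: {label_type!r}")
    return mapping[label_type]
```

The poisson, gamma and tweedie families are never chosen by this rule, so they have to be selected explicitly.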
- solver
Select the solver to use.
IRLSM is fast on problems with a small number of predictors and for lambda-search with L1 penalty,
while L_BFGS scales better for datasets with many columns. COORDINATE_DESCENT is IRLSM with the covariance
updates version of cyclical coordinate descent in the innermost loop. COORDINATE_DESCENT_NAIVE is IRLSM with
the naive updates version of cyclical coordinate descent in the innermost loop. COORDINATE_DESCENT_NAIVE and
COORDINATE_DESCENT are currently experimental.
Values:
- AUTO
- IRLSM
- L_BFGS
- COORDINATE_DESCENT (experimental)
- COORDINATE_DESCENT_NAIVE (experimental)
- link
The link function relates the linear predictor to the distribution function. The default is the canonical link
for the specified family. Only available for gaussian, poisson and gamma families, because only one link type is possible for the others:
- Family: binomial; Link: logit
- Family: multinomial; Link: multinomial
- Family: tweedie; Link: tweedie
- family_default: Uses identity for gaussian, log for poisson and inverse for gamma family.
- identity: Possible family options: Gaussian, Poisson, Gamma
- log: Possible family options: Gaussian, Poisson, Gamma
- inverse: Possible family options: Gaussian, Gamma
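To make the role of the link concrete, the sketch below pairs each link with its inverse and turns a linear predictor value into a predicted mean. The dictionary and function names are hypothetical; only the family-to-link defaults are taken from the list above.

```python
import math

# Each entry: (link, inverse link). The link maps the mean to the linear
# predictor; the inverse link maps the linear predictor back to the mean.
LINKS = {
    "identity": (lambda mu: mu, lambda eta: eta),
    "log": (math.log, math.exp),
    "inverse": (lambda mu: 1.0 / mu, lambda eta: 1.0 / eta),
    "logit": (lambda mu: math.log(mu / (1.0 - mu)),
              lambda eta: 1.0 / (1.0 + math.exp(-eta))),
}

# Defaults as documented: family_default for gaussian, poisson and gamma,
# and the single possible link for binomial.
DEFAULT_LINK = {
    "gaussian": "identity",
    "poisson": "log",
    "gamma": "inverse",
    "binomial": "logit",
}


def predict_mean(family, eta, link="family_default"):
    """Convert a linear predictor value eta into the predicted mean."""
    name = DEFAULT_LINK[family] if link == "family_default" else link
    _, inverse_link = LINKS[name]
    return inverse_link(eta)
```

For example, `predict_mean("poisson", 0.0)` applies the inverse of the log link, so a linear predictor of 0 corresponds to a mean of 1.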
- reproducible Makes model building reproducible. If set then maximum_number_of_threads parameter controls parallelism level of model building. If not set then parallelism level is defined by number of threads in General Preferences. Range: boolean
- maximum_number_of_threads Controls parallelism level of model building. Range: integer
- specify_beta_constraints If enabled, beta constraints for the regular attributes can be provided. Range: boolean
- use_regularization Check this box if regularization should be used. For regularization, you can specify the lambda, alpha and the lambda search related parameters. If alpha or lambda is undefined (default), H2O will calculate default values for them based on the training data and the other parameters. If this parameter is set to false, lambda is set to 0.0 (means no regularization). Range: boolean
- lambda The lambda parameter controls the amount of regularization applied. If lambda is 0.0, no regularization is applied and the alpha parameter is ignored (you can set this by disabling the use regularization parameter). The default value for lambda is calculated by H2O using a heuristic based on the training data. Providing multiple lambda values via the advanced parameters triggers a search. Range: real
- lambda_search A logical value indicating whether to conduct a search over the space of lambda values, starting from the max lambda. If enabled, the given lambda is interpreted as the min lambda. Default is false. Range: boolean
- number_of_lambdas The number of lambda values when lambda search = true. 0 means no preference. Range: integer
- lambda_min_ratio Smallest value for lambda as a fraction of lambda.max, the entry value, which is the smallest value for which all coefficients in the model are zero. If the number of observations is greater than the number of variables then default lambda_min_ratio = 0.0001; if the number of observations is less than the number of variables then default lambda_min_ratio = 0.01. Default is 0.0, which means no preference. Range: real
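The relationship between lambda.max, lambda_min_ratio and number_of_lambdas can be sketched as a decreasing sequence of candidate values. The logarithmic spacing used here is an assumption made for illustration; the exact grid H2O generates may differ, and the function name is hypothetical.

```python
def lambda_path(lambda_max, lambda_min_ratio, number_of_lambdas):
    """Return candidate lambdas from lambda_max down to lambda_max * lambda_min_ratio.

    Assumption: the values are spaced evenly on a logarithmic scale.
    """
    if number_of_lambdas < 2:
        raise ValueError("need at least two lambda values for a search")
    if not 0.0 < lambda_min_ratio < 1.0:
        raise ValueError("lambda_min_ratio must be in (0, 1)")
    # Constant multiplicative step between consecutive lambdas.
    step = lambda_min_ratio ** (1.0 / (number_of_lambdas - 1))
    return [lambda_max * step ** i for i in range(number_of_lambdas)]


path = lambda_path(lambda_max=1.0, lambda_min_ratio=0.0001, number_of_lambdas=5)
```

With these inputs the path runs from 1.0 down to roughly 0.0001 in factors of about 10, so the strongest regularization is tried first.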
- early_stopping Check this box if early stopping should be performed on the lambda search based on the stopping rounds and stopping tolerance parameters. The used stopping metric is always deviance. Range: boolean
- stopping_rounds Early stopping based on convergence of stopping_metric. Stop if simple moving average of length k of the stopping_metric does not improve for k:=stopping_rounds scoring events. Range: integer
- stopping_tolerance Relative tolerance for metric-based stopping criterion (stop if relative improvement is not at least this much). Range: real
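The interplay of stopping_rounds and stopping_tolerance can be sketched as follows. This is a simplified reading of the rule, assuming a positive deviance where lower is better and comparing the latest moving average with the best of the preceding ones; the function name is hypothetical and H2O's implementation may differ in detail.

```python
def should_stop(deviances, stopping_rounds, stopping_tolerance):
    """Decide whether to stop, given the deviance at each scoring event so far."""
    k = stopping_rounds
    if k <= 0 or len(deviances) < 2 * k:
        return False  # disabled, or not enough scoring events yet
    # Simple moving averages of length k over the scoring history.
    averages = [sum(deviances[i:i + k]) / k
                for i in range(len(deviances) - k + 1)]
    latest = averages[-1]
    reference = min(averages[-k - 1:-1])  # best of the k preceding averages
    # Stop unless the latest average improved by at least the relative tolerance.
    return latest > reference * (1.0 - stopping_tolerance)
```

A history such as `[100, 80, 70, 69.99, 69.98, 69.97]` with `stopping_rounds=2` and `stopping_tolerance=0.001` triggers a stop, because the latest average improves on the earlier ones by far less than 0.1%.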
- alpha The alpha parameter controls the distribution between the L1 (Lasso) and L2 (Ridge regression) penalties. A value of 1.0 for alpha represents Lasso, and an alpha value of 0.0 produces Ridge regression. Providing multiple alpha values via the advanced parameters triggers a search. Default is 0.0 for the L-BFGS solver, else 0.5. Range: real
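How lambda and alpha combine can be illustrated with the usual elastic net penalty, lambda * (alpha * L1 + (1 - alpha) / 2 * L2 squared). The sketch below evaluates that expression for a coefficient vector without the intercept; the function name is hypothetical.

```python
def elastic_net_penalty(beta, lam, alpha):
    """Elastic net penalty: lam * (alpha * ||beta||_1 + (1 - alpha) / 2 * ||beta||_2^2)."""
    l1 = sum(abs(b) for b in beta)
    l2_squared = sum(b * b for b in beta)
    return lam * (alpha * l1 + 0.5 * (1.0 - alpha) * l2_squared)
```

For `beta = [1.0, -2.0]` and `lam = 0.1`, alpha 1.0 gives a pure Lasso penalty of about 0.3 and alpha 0.0 a pure Ridge penalty of about 0.25. With `lam = 0.0` the penalty vanishes regardless of alpha, which is why alpha is ignored when no regularization is applied.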
- standardize Standardize numeric columns to have zero mean and unit variance. Range: boolean
- non-negative_coefficients Restrict coefficients (not intercept) to be non-negative. Range: boolean
- compute_p-values Request p-values computation. P-values work only with IRLSM solver and no regularization. Intercept must also be added to the model. Moreover, non-negative coefficients and specify beta constraints parameters have to be set to false to compute p-values. Range: boolean
- remove_collinear_columns In case of linearly dependent columns remove some of the dependent columns. Works only if intercept is added to the model. Range: boolean
- add_intercept Include constant term in the model. Range: boolean
- missing_values_handling
Handling of missing values. Either Skip or MeanImputation.
- Skip: Missing values are skipped.
- MeanImputation: Missing values are replaced with the mean value.
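The difference between the two options can be shown on a small numeric table. The sketch assumes that Skip drops every row containing a missing value and that MeanImputation uses the column mean of the available values; the function name is hypothetical and `None` stands in for a missing value.

```python
def handle_missing(rows, mode):
    """Apply Skip or MeanImputation to a list of numeric rows (None = missing)."""
    if mode == "Skip":
        # Assumption: a row with any missing value is left out of training.
        return [row for row in rows if None not in row]
    if mode == "MeanImputation":
        n_cols = len(rows[0])
        means = []
        for j in range(n_cols):
            present = [row[j] for row in rows if row[j] is not None]
            means.append(sum(present) / len(present))
        return [[means[j] if value is None else value
                 for j, value in enumerate(row)]
                for row in rows]
    raise ValueError(f"unknown mode: {mode!r}")


rows = [[1.0, 2.0], [None, 4.0], [3.0, None]]
skipped = handle_missing(rows, "Skip")
imputed = handle_missing(rows, "MeanImputation")
```

Skip keeps only the first row here, while MeanImputation keeps all three and fills the gaps with 2.0 and 3.0.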
- max_iterations Maximum number of iterations. 0 means no limit. Range: integer
- beta_constraints
Constraints for beta values. A row consists of the following values:
Names
- Attribute name: The name of the attribute.
- Category: A value from the attribute's domain. Please take care to provide the exact value. Use more rows to specify constraints for multiple categories.
- Lower bound: Lower bound of the beta.
- Upper bound: Upper bound of the beta.
- Beta given: Specifies the given solution in proximal operator interface. The proximal operator interface allows you to run the GLM with a proximal penalty on a distance from a specified given solution.
- Beta start: Starting value of the beta.
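As a sketch of how the lower and upper bounds in these rows are meant to be read, the snippet below checks a set of coefficients against a list of constraint rows. The attribute names, categories, bounds and the dictionary layout are all made up for illustration and do not reflect the operator's internal format.

```python
# Hypothetical constraint rows; a category of None marks a numeric attribute.
constraints = [
    {"attribute": "age", "category": None, "lower": 0.0, "upper": 1.0},
    {"attribute": "region", "category": "north", "lower": -0.5, "upper": 0.5},
]


def violated_constraints(coefficients, constraints):
    """Return the (attribute, category) keys whose coefficient is out of bounds."""
    violated = []
    for row in constraints:
        key = (row["attribute"], row["category"])
        beta = coefficients[key]
        if not row["lower"] <= beta <= row["upper"]:
            violated.append(key)
    return violated


# Made-up coefficients keyed the same way as the constraint rows.
coefficients = {("age", None): 0.3, ("region", "north"): 0.8}
out_of_bounds = violated_constraints(coefficients, constraints)
```

Each category of a nominal attribute has its own coefficient, which is why a separate row per category is needed.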
- max_runtime_seconds Maximum allowed runtime in seconds for model training. Use 0 to disable. Range: integer
- expert_parameters
These parameters are for fine tuning the algorithm. Usually the default values provide a decent model,
but in some cases it may be useful to change them. Please use true/false values for boolean parameters and the exact attribute name for columns.
Arrays can be provided by splitting the values with the comma (,) character.
More information on the parameters can be found in the H2O documentation.
- score_each_iteration: Whether to score during each iteration of model training. Type: boolean, Default: false
- fold_assignment: Cross-validation fold assignment scheme, if fold_column is not specified. Options: AUTO, Random, Modulo, Stratified. Type: enumeration, Default: AUTO
- fold_column: Column name with cross-validation fold index assignment per observation. Type: column, Default: no fold column
- offset_column: Offset column name. Type: Column, Default: no offset column
- max_confusion_matrix_size: Maximum size (# classes) for confusion matrices to be printed in the Logs. Type: integer, Default: 20
- keep_cross_validation_predictions: Keep cross-validation model predictions. Type: boolean, Default: false
- keep_cross_validation_fold_assignment: Keep cross-validation fold assignment. Type: boolean, Default: false
- tweedie_variance_power: A numeric value specifying the power for the variance function when family = "tweedie". Type: real, Default: 0
- tweedie_link_power: A numeric value specifying the power for the link function when family = "tweedie". Type: real, Default: 1
- prior: A numeric value specifying the prior probability of class 1 in the response when family = "binomial". Must be from (0,1) exclusive range or -1 (no prior). The default value is the observation frequency of class 1. Type: real, Default: -1 (no prior)
- beta_epsilon: A non-negative number specifying the magnitude of the maximum difference between the coefficient estimates from successive iterations. Defines the convergence criterion. Type: real, Default: 0.0001
- objective_epsilon: Specify a threshold for convergence. If the objective value is less than this threshold, the model is converged. Type: real, Default: -1 (no threshold)
- gradient_epsilon: (For L-BFGS only) Specify a threshold for convergence. If the objective value (using the L-infinity norm) is less than this threshold, the model is converged. Type: real, Default: 0.0001
- max_active_predictors: Specify the maximum number of active predictors during computation. This value is used as a stopping criterion to prevent expensive model building with many predictors. Type: integer, Default: -1 (no limit)
- obj_reg: Likelihood divider in objective value computation. Type: real, Default: 1/nobs
- additional_alphas: Providing additional alphas triggers a search. Ignored if alpha is undefined.
- additional_lambdas: Providing additional lambdas triggers a search. Ignored if lambda is undefined.
- nfolds: Number of folds for cross-validation. Use 0 to turn off cross-validation. Type: integer, Default: 0
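The beta_epsilon criterion from the list above can be sketched as a comparison of two successive coefficient vectors. The function name is hypothetical, and the sketch covers only this one criterion, not the objective or gradient based ones.

```python
def coefficients_converged(previous, current, beta_epsilon=0.0001):
    """True if no coefficient changed by beta_epsilon or more between iterations."""
    largest_change = max(abs(c - p) for p, c in zip(previous, current))
    return largest_change < beta_epsilon
```

For example, moving from `[0.50000, -1.20000]` to `[0.50002, -1.20001]` counts as converged with the default value, whereas a change of 0.001 in any single coefficient does not.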
Tutorial Processes
Classification using GLM
The GLM operator is used to predict the Future customer attribute of the Deals sample data set. All parameters of the GLM operator are kept at their default values. Because the label is binominal, the Family parameter is automatically set to "binomial" and the corresponding Link function to "logit". The resulting model is connected to an Apply Model operator that applies the Generalized Linear model to the Deals_Testset sample data. The labeled ExampleSet is connected to a Performance (Binominal Classification) operator that calculates the Accuracy metric. The process output shows the Performance Vector, the Generalized Linear Model and the output ExampleSet.
Regression using GLM
The GLM operator is used to predict the label attribute of the Polynomial sample data set using the Split Validation operator. The label is numerical, which means that regression is performed. The "compute p-values" parameter is set to true, which requires several other parameters to be set: the lambda parameter is set to 0.0 (no regularization), the collinear columns are removed and no beta constraints are specified. The Solver parameter is set to AUTO, which means that the IRLSM solver is used; this allows the computation of the p-values. The resulting model is applied in the Testing subprocess of the Split Validation operator. The labeled ExampleSet is connected to a Performance (Regression) operator that calculates the Root mean squared error metric. The process output shows the Performance Vector and the Generalized Linear Model.
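The prerequisites for p-values named in this tutorial and under the compute_p-values parameter can be collected into a small check. This is a sketch only: the function and argument names are hypothetical, and treating AUTO as acceptable relies on the statement above that AUTO resolves to IRLSM in this process, which may not hold for every configuration.

```python
def p_value_blockers(solver="AUTO", lambda_value=0.0, add_intercept=True,
                     non_negative_coefficients=False,
                     specify_beta_constraints=False):
    """Return the reasons why p-values could not be computed (empty if none)."""
    blockers = []
    if solver not in ("AUTO", "IRLSM"):
        blockers.append("solver must be IRLSM (or AUTO resolving to it)")
    if lambda_value != 0.0:
        blockers.append("regularization must be off (lambda = 0.0)")
    if not add_intercept:
        blockers.append("an intercept must be added to the model")
    if non_negative_coefficients:
        blockers.append("non-negative coefficients must be disabled")
    if specify_beta_constraints:
        blockers.append("beta constraints must not be specified")
    return blockers
```

Calling `p_value_blockers()` with the tutorial's settings returns an empty list, while `p_value_blockers(solver="L_BFGS", lambda_value=0.1)` reports two problems.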