Deep Learning (H2O)

Synopsis

Executes the Deep Learning algorithm using H2O 3.42.0.1.

Description

Please note that this algorithm is deterministic only if the reproducible parameter is set to true. In this case the algorithm uses only 1 thread.

Deep Learning is based on a multi-layer feed-forward artificial neural network that is trained with stochastic gradient descent using back-propagation. The network can contain a large number of hidden layers consisting of neurons with tanh, rectifier and maxout activation functions. Advanced features such as adaptive learning rate, rate annealing, momentum training, dropout and L1 or L2 regularization enable high predictive accuracy. Each compute node trains a copy of the global model parameters on its local data with multi-threading (asynchronously), and contributes periodically to the global model via model averaging across the network.

The operator starts a 1-node local H2O cluster and runs the algorithm on it. Although it uses one node, the execution is parallel. You can set the level of parallelism by changing the Settings/Preferences/General/Number of threads setting. By default it uses the recommended number of threads for the system. Only one instance of the cluster is started and it remains running until you close RapidMiner Studio.

Input

training set (Data table)
The input port expects a labeled ExampleSet.

Output

model
The Deep Learning classification or regression model is delivered from this output port. This classification or regression model can be applied on unseen data sets for prediction of the label attribute.
example set (Data table)
The ExampleSet that was given as input is passed unchanged to the output through this port. This is usually used to reuse the same ExampleSet in further operators or to view the ExampleSet in the Results Workspace.
weights (Attribute Weights)
This port delivers the weights of the attributes with respect to the label attribute.

Parameters

activation The activation function (non-linearity) to be used by the neurons in the hidden layers.
- Tanh: Hyperbolic tangent function (same as scaled and shifted sigmoid).
- Rectifier: Rectified Linear Unit. Chooses the maximum of (0, x), where x is the input value.
- Maxout: Chooses the maximum coordinate of the input vector.
- ExpRectifier: Exponential Rectified Linear Unit function.
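The four activation options can be illustrated with a minimal Python sketch. The function names are hypothetical, and the alpha scale of the exponential rectifier is an assumption (the text above does not state it). This is not the H2O implementation.

```python
import math

def tanh(x):
    """Hyperbolic tangent; output lies in (-1, 1)."""
    return math.tanh(x)

def rectifier(x):
    """Rectified Linear Unit: the maximum of (0, x)."""
    return max(0.0, x)

def maxout(channels):
    """Maxout: the largest coordinate of the input vector."""
    return max(channels)

def exp_rectifier(x, alpha=1.0):
    """Exponential Rectified Linear Unit; alpha=1.0 is an assumed scale."""
    return x if x > 0 else alpha * (math.exp(x) - 1.0)

def sigmoid(x):
    """Logistic sigmoid, used only to show its relation to tanh."""
    return 1.0 / (1.0 + math.exp(-x))

# tanh is a scaled and shifted sigmoid: tanh(x) = 2 * sigmoid(2x) - 1
x = 0.7
difference = abs(tanh(x) - (2.0 * sigmoid(2.0 * x) - 1.0))
```

Tanh is bounded, while Rectifier and Maxout grow without limit for large inputs.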
hidden layer sizes The number and size of each hidden layer in the model. For example, if a user specifies "100,200,100" a model with 3 hidden layers will be produced, and the middle hidden layer will have 200 neurons.
hidden dropout ratios A fraction of the inputs for each hidden layer to be omitted from training in order to improve generalization. Defaults to 0.5 for each hidden layer if omitted. Visible only if an activation function with dropout is selected.
reproducible Force reproducibility on small data (will be slow - only uses 1 thread).
use local random seed Indicates if a local random seed should be used for randomization. Available only if reproducible is set to true.
local random seed Local random seed for random generation. This parameter is only available if the use local random seed parameter is set to true.
epochs How many times the dataset should be iterated (streamed), can be fractional.
compute variable importances Whether to compute variable importances for input features. The implemented method considers the weights connecting the input features to the first two hidden layers.
train samples per iteration The number of training data rows to be processed per iteration. Note that independent of this parameter, each row is used immediately to update the model with (online) stochastic gradient descent. This parameter controls the frequency at which scoring and model cancellation can happen. Special values are 0 for one epoch per iteration, -1 for processing the maximum amount of data per iteration. Special value of -2 turns on automatic mode (auto-tuning).
adaptive rate The implemented adaptive learning rate algorithm (ADADELTA) automatically combines the benefits of learning rate annealing and momentum training to avoid slow convergence. Specification of only two parameters (rho and epsilon) simplifies hyper parameter search. In some cases, manually controlled (non-adaptive) learning rate and momentum specifications can lead to better results, but require the specification (and hyper parameter search) of up to 7 parameters. If the model is built on a topology with many local minima or long plateaus, it is possible for a constant learning rate to produce sub-optimal results. Learning rate annealing allows digging deeper into local minima, while rate decay allows specification of different learning rates per layer. When the gradient is being estimated in a long valley in the optimization landscape, a large learning rate can cause the gradient to oscillate and move in the wrong direction. When the gradient is computed on a relatively flat surface with small learning rates, the model can converge far slower than necessary.
epsilon Similar to learning rate annealing during initial training and momentum at later stages where it allows forward progress. Typical values are between 1e-10 and 1e-4. This parameter is only active if adaptive learning rate is enabled.
rho Similar to momentum and relates to the memory to prior weight updates. Typical values are between 0.9 and 0.999. This parameter is only active if adaptive learning rate is enabled.
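To make the roles of rho and epsilon concrete, here is a minimal sketch of one ADADELTA update for a single weight, following the commonly published form of the algorithm. The function name and the toy objective are hypothetical, and H2O's internal implementation may differ in detail.

```python
import math

def adadelta_step(gradient, state, rho, epsilon):
    """Return (weight_change, new_state) for one ADADELTA update.

    state is a pair (avg_sq_grad, avg_sq_step) of running averages.
    rho controls how long past updates are remembered; epsilon keeps
    the ratio finite and lets the very first steps make progress.
    """
    avg_sq_grad, avg_sq_step = state
    # Decaying average of squared gradients.
    avg_sq_grad = rho * avg_sq_grad + (1.0 - rho) * gradient * gradient
    # Step size is the ratio of the two root-mean-square values.
    step = -math.sqrt(avg_sq_step + epsilon) / math.sqrt(avg_sq_grad + epsilon) * gradient
    # Decaying average of squared weight changes.
    avg_sq_step = rho * avg_sq_step + (1.0 - rho) * step * step
    return step, (avg_sq_grad, avg_sq_step)

# Toy objective f(w) = (w - 3)^2 with gradient 2 * (w - 3); illustration only.
w, state = 0.0, (0.0, 0.0)
for _ in range(100):
    step, state = adadelta_step(2.0 * (w - 3.0), state, rho=0.99, epsilon=1e-8)
    w += step
```

No learning rate appears in the update; only rho and epsilon have to be chosen.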
standardize If enabled, automatically standardize the data. If disabled, the user must provide properly scaled input data.
learning rate When adaptive learning rate is disabled, the magnitude of the weight updates is determined by the user-specified learning rate (potentially annealed) and is a function of the difference between the predicted value and the target value. That difference, generally called delta, is only available at the output layer. To correct the output at each hidden layer, back-propagation is used. Momentum modifies back-propagation by allowing prior iterations to influence the current update. Using the momentum parameter can aid in avoiding local minima and the associated instability. Too much momentum can lead to instabilities, which is why the momentum is best ramped up slowly. This parameter is only active if adaptive learning rate is disabled.
rate annealing Learning rate annealing reduces the learning rate to "freeze" into local minima in the optimization landscape. The annealing rate is the inverse of the number of training samples it takes to cut the learning rate in half (e.g., 1e-6 means that it takes 1e6 training samples to halve the learning rate). This parameter is only active if adaptive learning rate is disabled.
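The halving rule described above is consistent with an effective rate of rate / (1 + rate_annealing * samples). The sketch below assumes that formula; the function name is hypothetical.

```python
def annealed_rate(rate, rate_annealing, samples_seen):
    """Effective learning rate after samples_seen training samples.

    Assumed formula: rate / (1 + rate_annealing * samples_seen).
    """
    return rate / (1.0 + rate_annealing * samples_seen)

# With rate_annealing = 1e-6, the rate is halved after 1e6 training samples.
start = annealed_rate(0.005, 1e-6, 0)
after_million = annealed_rate(0.005, 1e-6, 1_000_000)
```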
rate decay The learning rate decay parameter controls the change of learning rate across layers. For example, assume the rate parameter is set to 0.01, and the rate_decay parameter is set to 0.5. Then the learning rate for the weights connecting the input and first hidden layer will be 0.01, the learning rate for the weights connecting the first and the second hidden layer will be 0.005, and the learning rate for the weights connecting the second and third hidden layer will be 0.0025, etc. This parameter is only active if adaptive learning rate is disabled.
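The per-layer rates in this example follow a geometric progression, which can be reproduced with a short sketch (the function name is hypothetical):

```python
def layer_rates(rate, rate_decay, n_weight_layers):
    """Learning rate for each set of weights, counted from the input layer."""
    return [rate * rate_decay ** i for i in range(n_weight_layers)]

rates = layer_rates(0.01, 0.5, 3)
print(rates)  # [0.01, 0.005, 0.0025]
```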
momentum start The momentum_start parameter controls the amount of momentum at the beginning of training. This parameter is only active if adaptive learning rate is disabled.
momentum ramp The momentum_ramp parameter controls the amount of learning for which momentum increases (assuming momentum_stable is larger than momentum_start). The ramp is measured in the number of training samples. This parameter is only active if adaptive learning rate is disabled.
momentum stable The momentum_stable parameter controls the final momentum value reached after momentum_ramp training samples. The momentum used for training will remain the same for training beyond reaching that point. This parameter is only active if adaptive learning rate is disabled.
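The interaction of the three momentum parameters can be sketched as follows, assuming the momentum grows linearly during the ramp. The text above does not specify the shape of the ramp, and the function name is hypothetical.

```python
def momentum_at(samples_seen, momentum_start, momentum_ramp, momentum_stable):
    """Momentum in effect after samples_seen training samples.

    Assumes a linear ramp from momentum_start to momentum_stable over
    momentum_ramp samples, then a constant value of momentum_stable.
    """
    if samples_seen >= momentum_ramp:
        return momentum_stable
    fraction = samples_seen / momentum_ramp
    return momentum_start + (momentum_stable - momentum_start) * fraction

# Halfway through a ramp of 1e6 samples from 0.5 to 0.9.
halfway = momentum_at(500_000, 0.5, 1_000_000, 0.9)
```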
nesterov accelerated gradient The Nesterov accelerated gradient descent method is a modification to traditional gradient descent for convex functions. The method relies on gradient information at various points to build a polynomial approximation that minimizes the residuals in fewer iterations of the descent. This parameter is only active if adaptive learning rate is disabled.
L1 A regularization method that constrains the absolute value of the weights and has the net effect of dropping some weights (setting them to zero) from a model to reduce complexity and avoid overfitting.
L2 A regularization method that constrains the sum of the squared weights. This method introduces bias into parameter estimates, but frequently produces substantial gains in modeling as estimate variance is reduced.
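As an illustration, both penalties can be written as terms added to the training loss. The function names are hypothetical, and the exact scaling used by H2O (for example a factor of 1/2 on the L2 term) is not stated here and is left out.

```python
def l1_penalty(weights, l1):
    """L1 term: l1 times the sum of absolute weight values."""
    return l1 * sum(abs(w) for w in weights)

def l2_penalty(weights, l2):
    """L2 term: l2 times the sum of squared weight values."""
    return l2 * sum(w * w for w in weights)

weights = [1.0, -2.0, 0.0]
total_penalty = l1_penalty(weights, 0.1) + l2_penalty(weights, 0.1)
```

The L1 term grows linearly with each weight, so small weights are pushed all the way to zero. The L2 term mostly shrinks large weights.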
max w2 A maximum on the sum of the squared incoming weights into any one neuron. This tuning parameter is especially useful for unbounded activation functions such as Maxout or Rectifier. A special value of 0 means infinity.
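One common way to enforce such a limit is to rescale a neuron's incoming weights whenever their squared sum exceeds the maximum. The sketch below assumes that approach. The function name is hypothetical, and how H2O enforces the constraint internally is not described here.

```python
import math

def clip_incoming_weights(weights, max_w2):
    """Rescale one neuron's incoming weights so that sum(w^2) <= max_w2.

    max_w2 == 0 is treated as "no limit", matching the special value above.
    """
    if max_w2 == 0:
        return list(weights)
    squared_sum = sum(w * w for w in weights)
    if squared_sum <= max_w2:
        return list(weights)
    scale = math.sqrt(max_w2 / squared_sum)
    return [w * scale for w in weights]

clipped = clip_incoming_weights([3.0, 4.0], 1.0)  # squared sum 25 is scaled down to 1
```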
loss function The loss (error) function to be minimized by the model. Absolute, Quadratic, and Huber are applicable for regression or classification, while CrossEntropy is only applicable for classification. Huber can improve results for regression problems with outliers. CrossEntropy loss is used when the model output consists of independent hypotheses, and the outputs can be interpreted as the probability that each hypothesis is true. Cross entropy is the recommended loss function when the target values are class labels, and especially for imbalanced data. It strongly penalizes error in the prediction of the actual class label. Quadratic loss is used when the model output consists of continuous real values, but can be used for classification as well (where it emphasizes the error on all output classes, not just for the actual class).
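The four loss options can be compared with textbook per-sample definitions. This is a sketch: the function names are hypothetical, and the Huber threshold delta and any constant factors are assumptions that may differ from H2O's definitions.

```python
import math

def absolute_loss(target, prediction):
    """Absolute error; grows linearly with the residual."""
    return abs(target - prediction)

def quadratic_loss(target, prediction):
    """Squared error; large residuals dominate."""
    return (target - prediction) ** 2

def huber_loss(target, prediction, delta=1.0):
    """Quadratic near zero, linear beyond delta (delta is an assumed threshold)."""
    residual = abs(target - prediction)
    if residual <= delta:
        return 0.5 * residual * residual
    return delta * (residual - 0.5 * delta)

def cross_entropy_loss(true_class, probabilities):
    """Negative log of the probability assigned to the actual class.

    Requires probabilities[true_class] > 0.
    """
    return -math.log(probabilities[true_class])

# An outlier residual of 10 costs 100 under quadratic loss but only 9.5 under Huber.
outlier_quadratic = quadratic_loss(0.0, 10.0)
outlier_huber = huber_loss(0.0, 10.0)
```

The cross-entropy term depends only on the probability of the actual class, whereas quadratic loss on class outputs accumulates the error over all classes.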
distribution function The distribution function for the training data. For some functions (e.g. tweedie), further tuning can be achieved via the expert parameters.
- AUTO: Automatic selection. Uses multinomial for nominal and gaussian for numeric labels.
- bernoulli: Bernoulli distribution. Can be used for binominal or 2-class polynominal labels.
- gaussian, poisson, gamma, tweedie, quantile, laplace: Distribution functions for regression.
early stopping If true, parameters for early stopping need to be specified.
stopping rounds Early stopping based on convergence of stopping_metric. Stop if simple moving average of length k of the stopping_metric does not improve for k:=stopping_rounds scoring events (0 to disable). This parameter is visible only if early_stopping is set.
stopping metric Metric to use for early stopping. Set stopping_tolerance to tune it. This parameter is visible only if early_stopping is set.
- AUTO: Automatic selection. Uses logloss for classification, deviance for regression.
- deviance, logloss, MSE, AUC, lift_top_group, r2, misclassification, mean_per_class_error: The metric to use to decide if the algorithm should be stopped.
stopping tolerance Relative tolerance for metric-based stopping criterion (stop if relative improvement is not at least this much). This parameter is visible only if early_stopping is set.
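The interplay of stopping rounds and stopping tolerance can be sketched as below. This is a simplified reading of the rule described above and not H2O's exact procedure: it compares the best of the last k moving averages with the moving average just before them. The function name is hypothetical, and positive metric values are assumed.

```python
def should_stop(history, stopping_rounds, stopping_tolerance, lower_is_better=True):
    """Decide whether training should stop, given the metric at each scoring event."""
    k = stopping_rounds
    if k <= 0 or len(history) < 2 * k:
        return False  # disabled, or not enough scoring events yet
    # Simple moving averages of length k over the scoring history.
    averages = [sum(history[i:i + k]) / k for i in range(len(history) - k + 1)]
    recent = averages[-k:]
    reference = averages[-k - 1]
    if lower_is_better:
        # Stop if the relative improvement is smaller than the tolerance.
        return min(recent) > reference * (1.0 - stopping_tolerance)
    return max(recent) < reference * (1.0 + stopping_tolerance)

# logloss values that flatten out at 0.5
flat = should_stop([1.0, 0.8, 0.6, 0.5, 0.5, 0.5, 0.5, 0.5], 2, 1e-3)
improving = should_stop([1.0, 0.8, 0.6, 0.4], 2, 1e-3)
```

For metrics where higher is better, such as AUC, pass lower_is_better=False.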
missing values handling Handling of missing values. Either Skip or MeanImputation.
- Skip: Missing values are skipped.
- MeanImputation: Missing values are replaced with the mean value.
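The two options can be contrasted on a small table in which None marks a missing value. The sketch assumes that Skip drops every row containing a missing value and that MeanImputation uses the per-column mean of the observed values. Both function names are hypothetical.

```python
def skip_missing(rows):
    """Skip: keep only rows without missing values (assumed row-wise behavior)."""
    return [row for row in rows if None not in row]

def mean_impute(rows):
    """MeanImputation: replace each missing value with its column mean.

    Assumes every column has at least one observed value.
    """
    n_columns = len(rows[0])
    means = []
    for j in range(n_columns):
        observed = [row[j] for row in rows if row[j] is not None]
        means.append(sum(observed) / len(observed))
    return [[means[j] if value is None else value for j, value in enumerate(row)]
            for row in rows]

rows = [[1.0, None], [3.0, 4.0], [None, 8.0]]
skipped = skip_missing(rows)
imputed = mean_impute(rows)
```

In this example Skip keeps a single row, while MeanImputation keeps all three.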
max runtime seconds Maximum allowed runtime in seconds for model training. Use 0 to disable.
expert parameters These parameters are for fine tuning the algorithm. Usually the default values provide a decent model, but in some cases it may be useful to change them. Please use true/false values for boolean parameters and the exact attribute name for columns. Arrays can be provided by splitting the values with the comma (,) character. More information on the parameters can be found in the H2O documentation.
- score_each_iteration: Whether to score during each iteration of model training. Type: boolean, Default: false
- fold_assignment: Cross-validation fold assignment scheme, if fold_column is not specified. Options: AUTO, Random, Modulo, Stratified. Type: enumeration, Default: AUTO
- fold_column: Column name with cross-validation fold index assignment per observation. Type: column, Default: no fold column
- offset_column: Offset column name. Type: Column, Default: no offset column
- balance_classes: Balance training data class counts via over/under-sampling (for imbalanced data). Type: boolean, Default: false
- keep_cross_validation_predictions: Keep cross-validation model predictions. Type: boolean, Default: false
- keep_cross_validation_fold_assignment: Keep cross-validation fold assignment. Type: boolean, Default: false
- max_after_balance_size: Maximum relative size of the training data after balancing class counts (can be less than 1.0). Requires balance_classes. Type: real, Default: 5.0
- max_confusion_matrix_size: Maximum size (# classes) for confusion matrices to be printed in the Logs. Type: integer, Default: 20
- quantile_alpha: Desired quantile for quantile regression (from 0.0 to 1.0). Type: real, Default: 0.5
- tweedie_power: Tweedie Power (between 1 and 2). Type: real, Default: 1.5
- class_sampling_factors: Desired over/under-sampling ratios per class (in lexicographic order). If not specified, sampling factors will be automatically computed to obtain class balance during training. Requires balance_classes=true. Type: float array, Default: empty
- input_dropout_ratio: A fraction of the features for each training row to be omitted from training in order to improve generalization (dimension sampling). Type: real, Default: 0
- score_interval: The minimum time (in seconds) to elapse between model scoring. The actual interval is determined by the number of training samples per iteration and the scoring duty cycle. Type: integer, Default: 5
- score_training_samples: The number of training dataset points to be used for scoring. Will be randomly sampled. Use 0 for selecting the entire training dataset. Type: integer, Default: 10000
- score_validation_samples: The number of validation dataset points to be used for scoring. Can be randomly sampled or stratified (if "balance classes" is set and "score validation sampling" is set to stratify). Use 0 for selecting the entire validation dataset. Type: integer, Default: 0
- score_duty_cycle: Maximum fraction of wall clock time spent on model scoring on training and validation samples, and on diagnostics such as computation of feature importances (i.e., not on training). Lower: more training, higher: more scoring. Type: real, Default: 0.1
- overwrite_with_best_model: If enabled, store the best model under the destination key of this model at the end of training. Type: boolean, Default: true.
- initial_weight_distribution: The distribution from which initial weights are to be drawn. The default option is an optimized initialization that considers the size of the network. The "uniform" option uses a uniform distribution with a mean of 0 and a given interval. The "normal" option draws weights from a normal distribution with a mean of 0 and a given standard deviation. Type: enumeration, Options: UniformAdaptive, Uniform, Normal. Default: UniformAdaptive
- initial_weight_scale: The scale of the distribution function for Uniform or Normal distributions. For Uniform, the values are drawn uniformly from -initial_weight_scale...initial_weight_scale. For Normal, the values are drawn from a Normal distribution with a standard deviation of initial_weight_scale. Type: real, Default: 1
- classification_stop: The stopping criteria in terms of classification error (1-accuracy) on the training data scoring dataset. When the error is at or below this threshold, training stops. Type: real, Default: 0
- regression_stop: The stopping criteria in terms of regression error (MSE) on the training data scoring dataset. When the error is at or below this threshold, training stops. Type: real, Default: 1e-6.
- score_validation_sampling: Method used to sample the validation dataset for scoring. Type: enumeration, Options: Uniform, Stratified. Default: Uniform.
- fast_mode: Enable fast mode (minor approximation in back-propagation), should not affect results significantly. Type: boolean, Default: true.
- force_load_balance: Increase training speed on small datasets by splitting it into many chunks to allow utilization of all cores. Type: boolean, Default: true.
- shuffle_training_data: Enable shuffling of training data (on each node). This option is recommended if training data is replicated on N nodes, and the number of training samples per iteration is close to N times the dataset size, where all nodes train with (almost) all the data. It is automatically enabled if the number of training samples per iteration is set to -1 (or to N times the dataset size or larger). This parameter usually doesn't need to be set, because RapidMiner always runs H2O on 1 node. Type: boolean, Default: false.
- quiet_mode: Enable quiet mode for less output to standard output. Type: boolean, Default: false.
- sparse: Sparse data handling (more efficient for data with lots of 0 values). Type: boolean, Default: false.
- average_activation: Average activation for sparse auto-encoder (Experimental) Type: double, Default: 0.0.
- sparsity_beta: Sparsity regularization. (Experimental) Type: double, Default: 0.0.
- max_categorical_features: Max. number of categorical features, enforced via hashing (Experimental) Type: integer, Default: 2147483647.
- export_weights_and_biases: Whether to export Neural Network weights and biases to H2O Frames. Type: boolean, Default: false.
- mini_batch_size: Mini-batch size (smaller leads to better fit, larger can speed up and generalize better). Type: integer, Default: 1
- elastic_averaging: Elastic averaging between compute nodes can improve distributed model convergence. (Experimental) Type: boolean, Default: false.
- use_all_factor_levels: Use all factor levels of categorical variables. Otherwise, the first factor level is omitted (without loss of accuracy). Useful for variable importances and auto-enabled for autoencoder. Type: boolean, Default: true.
- nfolds: Number of folds for cross-validation. Use 0 to turn off cross-validation. Type: integer, Default: 0

Tutorial Processes

Classification with Split Validation using Deep Learning

The H2O Deep Learning operator is used to predict the Survived attribute of the Titanic sample dataset. Since the label is binominal, classification will be performed. To check the quality of the model, the Split Validation operator is used to generate the training and testing datasets. The Deep Learning operator's parameters are left at their default values, which means that 2 hidden layers, each with 50 neurons, will be constructed. The labeled ExampleSet is connected to a Performance (Binominal Classification) operator that calculates the Accuracy metric. The process output shows the Deep Learning model, the labeled data and the Performance Vector.

Regression using Deep Learning

The H2O Deep Learning operator is used to predict the numerical label attribute of a generated dataset. Since the label is real, regression is performed. The data is generated, then split into two parts with the Split Data operator. The first output is used as the training data set, the second as the scoring data set. The Deep Learning operator uses the adaptive learning rate option (default). The algorithm automatically determines the learning rate based on the epsilon and rho parameters. The only non-default parameter is the hidden layer sizes, where 3 layers are used, each with 50 neurons. After the model is applied to the testing ExampleSet, the labeled data is connected to the process output.
