One-Hot Encoding (Model Simulator)

Synopsis

This operator can remove nominal attributes with too many values and encode the remaining nominal attributes, transforming them into numerical ones.

Description

This operator works in two steps. First, if desired, all nominal columns with too many distinct values are removed from the data set. Second, if desired, the remaining nominal attributes are encoded. These settings, together with the removed and transformed columns, are stored in a preprocessing model which can be applied to new data sets in scoring situations to produce a data set with the same, compatible structure.
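The first step can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the operator's implementation: rows are plain dictionaries, nominal values are assumed to be strings, and the helper names fit_removal and apply_removal are hypothetical.

```python
def fit_removal(rows, max_values):
    """Return the names of string-valued columns with more than max_values distinct values."""
    distinct = {}
    for row in rows:
        for name, value in row.items():
            if isinstance(value, str):  # assumption: nominal values are strings
                distinct.setdefault(name, set()).add(value)
    return sorted(name for name, values in distinct.items() if len(values) > max_values)


def apply_removal(rows, removed):
    """Drop the previously selected columns from any data set with the same structure."""
    return [{k: v for k, v in row.items() if k not in removed} for row in rows]


# Made-up toy data: "name" has 3 distinct values, "sex" has 2.
train = [
    {"name": "Ann", "sex": "female", "age": 29},
    {"name": "Bob", "sex": "male", "age": 41},
    {"name": "Cid", "sex": "male", "age": 35},
]
removed = fit_removal(train, max_values=2)

# The stored list of removed columns plays the role of the preprocessing model:
# it is applied unchanged to new data.
new_data = [{"name": "Dee", "sex": "female", "age": 52}]
print(apply_removal(new_data, removed))
```

Keeping the fitted column list separate from the data is what lets the same removal be repeated on new data sets later.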

But why do we need to perform encoding in the first place? Some machine learning algorithms cannot work with nominal data and require all input attributes to be numeric, so nominal data must first be converted to a numeric form. This is exactly what encoding does. For each nominal value of a column, we generate one or several new numeric columns.

With target encoding, we replace the original nominal attribute with one new numerical attribute: each category of the nominal attribute is replaced with the corresponding probability of the label (if the label is categorical) or the average of the label (if the label is numerical). The label is also known as the target, hence the name. If there are more than two classes, we get one column for each class instead. But since the number of classes is typically small, the number of additional columns is much smaller than for one-hot encoding.
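A minimal sketch of this idea for a two-class label, using only the Python standard library. The function name fit_target_encoding and the toy data are made up for illustration and are not taken from the operator. For a numerical label, the per-category average of the label would be used instead of the class share.

```python
from collections import defaultdict


def fit_target_encoding(categories, labels, positive_class):
    """Map each category to the share of its rows that carry positive_class."""
    counts = defaultdict(int)
    hits = defaultdict(int)
    for category, label in zip(categories, labels):
        counts[category] += 1
        hits[category] += label == positive_class  # True counts as 1
    return {category: hits[category] / counts[category] for category in counts}


# Made-up toy data, not the real Titanic values.
sex = ["female", "female", "male", "male", "male", "male"]
survived = ["yes", "yes", "yes", "no", "no", "no"]

encoding = fit_target_encoding(sex, survived, positive_class="yes")

# The nominal column is replaced by a single numerical column.
encoded_column = [encoding[value] for value in sex]
print(encoding)
print(encoded_column)
```

Here both "female" rows are positive, so the category maps to 1.0, while one of four "male" rows is positive, so it maps to 0.25.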

The problem with target encoding is the risk of overfitting. You should always validate correctly: create the preprocessing model with this operator on the training data and then apply it to the validation data, rather than allowing label leakage by running this operator before the validation split. The other technique offered by this operator is smoothing. Smoothing balances the average for each category with the overall average, so that small categories suffer less from extreme values.
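Both safeguards can be sketched together. The exact smoothing formula used by the operator is not stated here, so the weighted average below, (n * category mean + strength * global mean) / (n + strength), is an assumption chosen because it matches the described behavior: a strength of 0 gives no smoothing and large values approach the global average. The function names and the fallback for unseen categories are likewise hypothetical.

```python
from collections import defaultdict


def fit_smoothed_encoding(categories, targets, strength):
    """Blend each category mean with the global mean; strength=0 means no smoothing."""
    global_mean = sum(targets) / len(targets)
    sums = defaultdict(float)
    counts = defaultdict(int)
    for category, target in zip(categories, targets):
        sums[category] += target
        counts[category] += 1
    # Assumed formula: a category with n rows gets weight n, the global mean gets weight strength.
    encoding = {
        c: (sums[c] + strength * global_mean) / (counts[c] + strength)
        for c in counts
    }
    return encoding, global_mean


def apply_encoding(categories, encoding, fallback):
    """Encode new data using only values learned on the training data."""
    # Assumption: categories never seen in training fall back to the global mean.
    return [encoding.get(c, fallback) for c in categories]


# Made-up training data: category "C" occurs only once, so its raw mean (1.0) is extreme.
train_port = ["S", "S", "S", "S", "C"]
train_label = [0, 1, 0, 1, 1]  # 1 = positive class

encoding, global_mean = fit_smoothed_encoding(train_port, train_label, strength=4)

# The validation labels are never used, which avoids label leakage.
validation_port = ["C", "S", "Q"]  # "Q" was not seen in training
print(apply_encoding(validation_port, encoding, fallback=global_mean))
```

With a strength of 0, "C" would be encoded as its raw mean of 1.0. With a strength of 4, the single "C" row is outweighed and the value is pulled toward the global mean of 0.6.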

Input

example set input (Data table)
This port expects an ExampleSet for which nominal columns should be removed or encoded.

Output

example set output (Data table)
The processed data where all nominal columns with too many values have been removed (if so desired) and all remaining nominal columns have been transformed into one or multiple numerical columns using the selected encoding approach.
original (Data table)
The original data set.
preprocessing model (Preprocessing Model)
You can apply this model on new data sets with Apply Model so that the same nominal attributes are removed from that data and all remaining nominal attributes will be transformed with the encoding settings derived from the training phase.

Parameters

remove with too many values
Indicates if nominal attributes with too many values should be removed.
maximum number of values
Attributes with more values than this will be removed from the example set.
perform encoding
Indicates if an encoding should be performed on the remaining nominal columns.
perform smoothing
Indicates if smoothing should be performed. Smoothing can reduce the risk of overfitting by balancing global averages with the average for the categories. Only available for target encoding.
smoothing strength
Smoothing balances the category value with the overall average values to reduce the influence of small groups. A value of 0 means that no smoothing is used and large values will eventually lead to global averages. Only available for target encoding and if smoothing is activated.

Tutorial Processes

Target Encoding for Titanic

This process performs a target encoding on the Titanic data. The resulting columns will all be numeric. We first remove all columns with more than 20 values, which removes the Ticket Number, Name, Cabin, and Lifeboat columns.

The remaining three nominal columns are Passenger Class, Sex, and Port of Embarkation. They will be transformed into numerical columns using the target encoding approach.
