One-Hot Encoding (Model Simulator)
Synopsis
This operator can remove nominal attributes with too many values and perform an encoding on the remaining nominal attributes which will transform them into new numerical ones.
Description
This operator works in two steps. First, if desired, all nominal columns which have too many different values are removed from the data set. Second, if desired, the operator performs an encoding on the remaining nominal attributes. The settings, together with the removed and transformed columns, are stored in a preprocessing model. This model can be applied to new data sets in scoring situations to produce a data set with the same, compatible structure.
But why do we need to perform encoding in the first place? Some machine learning algorithms cannot work with nominal data and require all input attributes to be numeric. Nominal data must therefore be converted to a numeric form first, and this is exactly what encoding does.
The original nominal attribute is removed and a set of new binary (0/1) attributes is added instead. There is one new column for each possible nominal value of the original column except one (the so-called comparison group). If the original column is Color and the possible values are red, green, and blue, this results in two new columns, Color = red and Color = green. The column Color = red contains a 1 if the original color value was red and a 0 otherwise. The same holds for green. But what about blue? No extra column is needed for blue, since a 0 in both other columns automatically means that the color was the remaining option, blue. The nominal value which does not get a new attribute is called the comparison group. The operator automatically selects the least frequent value as the comparison group.
In contrast to the operator Nominal to Numerical, this operator performs additional column removal and, for one-hot encoding, automatically chooses the comparison group by picking the value with the lowest count. This further reduces the number of resulting columns. For example, if you have three possible values A, B, and C, and C is the least frequent value, then values of 0 in both the new A and B columns indicate that the value was C.
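The encoding described above can be sketched in a few lines of plain Python. This is only an illustration of the idea, not the operator's actual implementation: the function name `one_hot_drop_least_frequent` is hypothetical, and the tie-breaking rule (the first value seen wins when two values are equally rare) is an assumption that the text above does not specify.

```python
from collections import Counter

def one_hot_drop_least_frequent(values, name):
    """Encode a nominal column as 0/1 columns, omitting the least frequent value.

    Hypothetical helper for illustration only.
    """
    counts = Counter(values)  # keeps values in order of first appearance
    # The least frequent value becomes the comparison group.
    # Assumption: on ties, the value that appeared first is chosen.
    comparison = min(counts, key=lambda v: counts[v])
    kept = [v for v in counts if v != comparison]
    columns = {
        f"{name} = {v}": [1 if x == v else 0 for x in values]
        for v in kept
    }
    return columns, comparison

colors = ["red", "green", "red", "blue", "green", "red"]
columns, comparison = one_hot_drop_least_frequent(colors, "Color")
print(comparison)   # blue
print(columns)
```

Rows where both `Color = red` and `Color = green` are 0 correspond to the comparison group, here blue.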
Input
- example set input (Data table)
This port expects an ExampleSet for which nominal columns should be removed or encoded.
Output
- example set output (Data table)
The processed data where all nominal columns with too many values have been removed (if so desired) and all remaining nominal columns have been transformed into one or multiple numerical columns using the selected encoding approach.
- original (Data table)
The original data set.
- preprocessing model (Preprocessing Model)
You can apply this model on new data sets with Apply Model so that the same nominal attributes are removed from that data and all remaining nominal attributes will be transformed with the encoding settings derived from the training phase.
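To illustrate why a stored model yields the same column structure on new data, here is a minimal Python sketch. The dictionary layout of `model` and the function `apply_model` are hypothetical and do not reflect the operator's internal format. The sketch also assumes that a value not kept during training is encoded as 0 in every indicator column; how the real operator treats unseen values is not stated here.

```python
def apply_model(model, rows):
    """Apply stored preprocessing settings to new rows (a list of dicts).

    Hypothetical helper for illustration only.
    """
    result = []
    for row in rows:
        new_row = {}
        for column, value in row.items():
            if column in model["removed"]:
                continue  # removed during training, so removed here too
            if column in model["encoded"]:
                # One indicator column per value kept during training.
                for kept in model["encoded"][column]:
                    new_row[f"{column} = {kept}"] = 1 if value == kept else 0
            else:
                new_row[column] = value  # other columns pass through unchanged
        result.append(new_row)
    return result

# Settings as they might have been derived in a training phase (assumed values).
model = {"removed": ["Name"], "encoded": {"Color": ["red", "green"]}}

new_data = [
    {"Name": "x", "Color": "blue", "Size": 3},
    {"Name": "y", "Color": "red", "Size": 5},
]
scored = apply_model(model, new_data)
print(scored)
```

Because the kept values come from the model and not from the new data, the output always has the columns `Color = red` and `Color = green`, regardless of which colors occur in the data being scored.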
Parameters
- remove with too many values
Indicates if nominal attributes with too many values should be removed.
- maximum number of values
Attributes with more values than this will be removed from the example set.
- perform encoding
Indicates if an encoding should be performed on the remaining nominal columns.
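The removal step controlled by the first two parameters can be sketched as follows. This is an assumption-laden illustration in plain Python: the function `drop_high_cardinality` and its arguments are hypothetical, and the caller is assumed to know which columns are nominal. The comparison is strictly "more than", following the parameter description above.

```python
def drop_high_cardinality(rows, nominal_columns, max_values):
    """Remove nominal columns that have more than max_values distinct values.

    Hypothetical helper for illustration only.
    """
    removed = [
        c for c in nominal_columns
        if len({r[c] for r in rows}) > max_values
    ]
    kept_rows = [
        {k: v for k, v in r.items() if k not in removed}
        for r in rows
    ]
    return kept_rows, removed

rows = [
    {"Name": "Ann", "Sex": "Female"},
    {"Name": "Bob", "Sex": "Male"},
    {"Name": "Cid", "Sex": "Male"},
]
kept_rows, removed = drop_high_cardinality(rows, ["Name", "Sex"], max_values=2)
print(removed)     # ['Name']
print(kept_rows)
```

Here Name has three distinct values and exceeds the limit of 2, while Sex has exactly two and is kept.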
Tutorial Processes
One-Hot Encoding for Titanic
This process performs a one-hot encoding on the Titanic data. The resulting columns will all be of type numeric. We first remove all columns with more than 20 values, which removes the Ticket Number, Name, Cabin, and Lifeboat columns.
The remaining three nominal columns are Passenger Class, Sex, and Port of Embarkation. They will be transformed into numerical columns using the one-hot encoding approach. We get two new numerical columns for Passenger Class, namely Passenger Class = First and Passenger Class = Third. A one in the first column means that the original passenger class was First. A zero in both columns means it was neither First nor Third, i.e. it was Second. The second class became the comparison group since it was the least frequent in the data.
The same logic applies to the columns Sex and Port of Embarkation. Please note that we only get one new column for Sex, called Sex = Male. Since Sex has only two values, a second column is not necessary: a 0 in Sex = Male automatically means that this was a female passenger.