Handle Unknown Values (Model Simulator)

Synopsis

This operator collects the known nominal values of the data it is applied to. The result is a preprocessing model that can be applied to new data sets, where every nominal value that was not known before is replaced by a missing value.

Description

This operator collects the values of all nominal columns in a data set and stores them in a preprocessing model. The operator does not change the input data at all, but the preprocessing model is very useful when you want to ensure that new data sets only use nominal values that were known before. Many models cannot deal well with nominal values they have not seen before and may break if you do not handle this beforehand.

When the preprocessing model is applied to new data sets, all nominal values that were not part of the input of this operator are replaced by missing values. These can then be handled by a regular missing value handling operator.
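The behavior described above can be sketched in a few lines of Python. This is only an illustration of the idea, not the operator's actual implementation: the function names `collect_known_values` and `apply_known_values` are hypothetical, a table is represented as a plain dictionary of columns, every column that holds only strings is assumed to be nominal, and `None` stands in for a missing value.

```python
from typing import Dict, List, Optional, Set

# Assumption: a table is a dict mapping column names to lists of values.
Table = Dict[str, List[Optional[object]]]


def collect_known_values(table: Table) -> Dict[str, Set[str]]:
    """Remember the distinct values of every nominal column.

    Assumption: a column counts as nominal if all of its non-missing
    values are strings. The input table is not modified.
    """
    known: Dict[str, Set[str]] = {}
    for name, column in table.items():
        values = {v for v in column if v is not None}
        if values and all(isinstance(v, str) for v in values):
            known[name] = values
    return known


def apply_known_values(known: Dict[str, Set[str]], table: Table) -> Table:
    """Return a copy of the table in which unknown nominal values are missing."""
    result: Table = {}
    for name, column in table.items():
        if name in known:
            # Keep known values, replace everything else by None (missing).
            result[name] = [v if v in known[name] else None for v in column]
        else:
            # Columns that were not remembered are copied unchanged.
            result[name] = list(column)
    return result


# Hypothetical data: the model only ever sees "First" and "Second".
training = {"class": ["First", "Second", "First"], "age": [29, 41, 35]}
model = collect_known_values(training)

new_data = {"class": ["Second", "Third"], "age": [22, 50]}
cleaned = apply_known_values(model, new_data)
```

Here `cleaned["class"]` contains a missing value where `new_data` had the unseen value `"Third"`, while the numerical `age` column passes through untouched.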

Input

  • example set input (Data Table)

    This port expects an ExampleSet whose nominal values should be remembered.

Output

  • example set output (Data Table)

    The processed data. It is identical to the input data because no processing takes place; the values are only remembered.

  • original (Data Table)

    The original data set.

  • preprocessing model (Preprocessing Model)

    You can apply this model to new data sets with Apply Model so that all nominal values that were not part of the input data are replaced by missing values.

Tutorial Processes

Handle Unknown Values for Titanic

This process first separates the Titanic data by passenger class. We then use the Handle Unknown Values operator on the first data set, which contains only the First and Second class values. Finally, we use Apply Model to apply the resulting preprocessing model to the data containing Second and Third class. As you can see, all rows that used to be Third class (unknown to the preprocessing model) now show a missing value, indicated by a question mark.
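The effect of this process on the passenger class column can be mimicked with a short Python sketch. The values below are hypothetical stand-ins for the Titanic data, and the variable names are chosen for illustration only; the question mark is used purely to imitate how a missing value is displayed.

```python
# Hypothetical passenger-class values standing in for the Titanic data.
passenger_class = ["First", "Second", "Third", "Second", "First", "Third"]

# Separate the data by passenger class, as the process does.
first_and_second = [c for c in passenger_class if c in ("First", "Second")]
second_and_third = [c for c in passenger_class if c in ("Second", "Third")]

# "Handle Unknown Values" step: remember the values of the first data set.
known_classes = set(first_and_second)

# "Apply Model" step: unknown values become missing, shown here as "?".
displayed = [c if c in known_classes else "?" for c in second_and_third]
```

Every entry that was `"Third"` in `second_and_third` appears as `"?"` in `displayed`, while the `"Second"` entries are kept.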