Normalize (RapidMiner Studio Core)

Synopsis

This operator normalizes the attribute values of the selected attributes.

Description

Normalization is a preprocessing technique used to rescale attribute values to fit in a specific range. It is very important when dealing with attributes of different units and scales. For example, some data mining techniques use the Euclidean distance, so all attributes should have the same scale for a fair comparison between them. In other words, normalization levels the playing field for attributes whose values vary widely in size as a result of the units selected for representation.

This operator performs normalization of the selected attributes. Four normalization methods are provided. These methods are explained in the parameters.
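
The effect of scale on the Euclidean distance can be illustrated with a small sketch. The attributes (income, age), their values and the bounds used for rescaling are made up for this illustration and do not come from any RapidMiner data set.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equally long tuples of numbers."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def rescale(value, lo, hi):
    """Map value from [lo, hi] to [0, 1]."""
    return (value - lo) / (hi - lo)


# Hypothetical examples: (income in dollars, age in years).
p = (50_000.0, 25.0)
q = (51_000.0, 60.0)

# Unscaled: the income difference (1000) swamps the age difference (35).
raw = euclidean(p, q)

# Assumed attribute bounds, chosen only for this illustration.
bounds = [(20_000.0, 120_000.0), (18.0, 80.0)]
p_scaled = tuple(rescale(v, lo, hi) for v, (lo, hi) in zip(p, bounds))
q_scaled = tuple(rescale(v, lo, hi) for v, (lo, hi) in zip(q, bounds))

# Scaled: the large age gap now dominates, as one would expect.
scaled = euclidean(p_scaled, q_scaled)

print(f"raw distance:    {raw:.2f}")
print(f"scaled distance: {scaled:.4f}")
```

Before rescaling, the distance is almost entirely determined by income; after rescaling, both attributes contribute in proportion to how far apart the examples are within each attribute's range.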

Input

  • example set (Data Table)

    This input port expects an ExampleSet. In the attached Example Process it is the output of the Retrieve operator, but the output of other operators can also be used as input. The input must carry meta data, because the attributes are specified in the meta data. The Retrieve operator provides meta data along with the data.

Output

  • example set (Data Table)

    The ExampleSet with the selected attributes in normalized form is delivered through this port.

  • original (Data Table)

    The ExampleSet that was given as input is passed through this port unchanged. This is usually used to reuse the same ExampleSet in further operators or to view the ExampleSet in the Results Workspace.

  • preprocessing model

    This port delivers the preprocessing model, which has information regarding the parameters of this operator in the current process.

Parameters

  • create_view It is possible to create a View instead of changing the underlying data. Simply select this parameter to enable this option. The transformation that would be normally performed directly on the data will then be computed every time a value is requested and the result is returned without changing the data. Range: boolean
  • attribute_filter_type This parameter allows you to select the attribute selection filter, that is, the method used for selecting the attributes that you want to normalize. It has the following options:
    • all: This option simply selects all the attributes of the ExampleSet. This is the default option.
    • single: This option allows selection of a single attribute. When this option is selected another parameter (attribute) becomes visible in the Parameters panel.
    • subset: This option allows selection of multiple attributes through a list. All attributes of ExampleSet are present in the list; required attributes can be easily selected. This option will not work if the meta data is not known. When this option is selected another parameter becomes visible in the Parameters panel.
    • regular_expression: This option allows you to specify a regular expression for attribute selection. When this option is selected some other parameters (regular expression, use except expression) become visible in the Parameters panel.
    • value_type: This option allows selection of all the attributes of a particular type. It should be noted that types are hierarchical. For example real and integer types both belong to numeric type. The user should have a basic understanding of type hierarchy when selecting attributes through this option. When this option is selected some other parameters (value type, use value type exception) become visible in the Parameters panel.
    • block_type: This option is similar in working to the value_type option. This option allows selection of all the attributes of a particular block type. It should be noted that block types may be hierarchical. For example value_series_start and value_series_end block types both belong to the value_series block type. When this option is selected some other parameters (block type, use block type exception) become visible in the Parameters panel.
    • no_missing_values: This option simply selects all the attributes of the ExampleSet which don't contain a missing value in any example. Attributes that have even a single missing value are not selected.
    • numeric_value_filter: When this option is selected another parameter (numeric condition) becomes visible in the Parameters panel. All numeric attributes whose examples all satisfy the mentioned numeric condition are selected. Please note that all nominal attributes are also selected irrespective of the given numerical condition.
    Range: selection
  • attribute The required attribute can be selected with this parameter. If the meta data is known, the attribute name can be chosen from the drop-down box of the parameter. Range: string
  • attributes The required attributes can be selected with this parameter. It opens a new window with two lists. All attributes are present in the left list and can be moved to the right list, which is the list of selected attributes. Range: string
  • regular_expression Attributes whose names match this expression will be selected. Regular expressions are a very powerful tool but need a detailed explanation for beginners. It is advisable to specify the regular expression through the edit and preview regular expression menu, which lets you try different expressions and preview the results simultaneously. Range: string
  • use_except_expression If enabled, an exception to the first regular expression can be specified. When this option is selected another parameter (except regular expression) becomes visible in the Parameters panel. Range: boolean
  • except_regular_expression This option allows you to specify a regular expression. Attributes matching this expression will be filtered out even if they match the first expression (the expression that was specified in the regular expression parameter). Range: string
  • value_type The type of attributes to be selected can be chosen from a drop-down list. Range: selection
  • use_value_type_exception If enabled, an exception to the selected type can be specified. When this option is selected another parameter (except value type) becomes visible in the Parameters panel. Range: boolean
  • except_value_type Attributes matching this type will be removed from the final output even if they matched the previously mentioned type, i.e. the value of the value type parameter. Range: selection
  • block_type The block type of the attributes to be selected can be chosen from a drop-down list. Range: selection
  • use_block_type_exception If enabled, an exception to the selected block type can be specified. When this option is selected another parameter (except block type) becomes visible in the Parameters panel. Range: boolean
  • except_block_type Attributes matching this block type will be removed from the final output even if they matched the previously mentioned block type. Range: selection
  • numeric_condition The numeric condition for testing examples of numeric attributes is specified here. For example, the numeric condition '> 6' will keep all nominal attributes and all numeric attributes having a value greater than 6 in every example. A combination of conditions is possible: '> 6 && < 11' or '<= 5 || < 0'. However, && and || cannot be used together in one numeric condition. Conditions like '(> 0 && < 2) || (>10 && < 12)' are not allowed because they use both && and ||. Use a blank space after '>', '=' and '<'; for example, '<5' will not work, so use '< 5' instead. Range: string
  • include_special_attributes Special attributes are attributes with special roles. Special attributes are those attributes which identify the examples. In contrast, regular attributes simply describe the examples. Special attributes are: id, label, prediction, cluster, weight and batch. By default all special attributes are selected irrespective of the conditions in the Select Attributes operator. If this parameter is set to true, special attributes are also tested against the conditions specified in the Select Attributes operator and only those attributes are selected that satisfy the conditions. Range: boolean
  • invert_selection If this parameter is set to true, it acts as a NOT gate and reverses the selection: all previously selected attributes are unselected and all previously unselected attributes are selected. For example, if attribute 'att1' was selected and attribute 'att2' was not before this parameter was enabled, then after enabling it 'att1' is unselected and 'att2' is selected. Range: boolean
  • method Four methods are provided here for normalizing data. These methods are also explained in the attached Example Process.
    • z_transformation: This is also called statistical normalization. It rescales the attribute values so that they have mean 0 and variance 1. The formula is Z = (X - u) / s: take the attribute values as a vector X, subtract the mean of the attribute values, u, and divide the difference by the standard deviation, s. The resulting vector Z has zero mean and unit variance. The transformation does not change the shape of the distribution; only if the original values are normally distributed does Z follow the standard Normal distribution, N(0,1). Note that the transformed values are not confined to [0,1]. They can in principle range from -infinity to +infinity, although for normally distributed data about 99.7% of the values fall between -3 and +3.
    • range_transformation: When this method is selected, two other parameters (min, max) appear in the Parameters panel. Range transformation normalizes all attribute values in the specified range [min,max]. min and max are specified using min and max parameters respectively.
    • proportion_transformation: Each attribute value is normalized as a proportion of the total sum of the respective attribute, i.e. each attribute value is divided by the sum of that attribute's values. The sum is formed only from finite values, ignoring NaN/missing values as well as positive and negative infinity. When this method is selected, another parameter (allow negative values) appears in the Parameters panel. If the additional parameter is checked, negative values will be treated as absolute values; otherwise they will produce an error when the process is executed.
    • interquartile_range: Normalization is performed using the interquartile range. The range is the difference between the largest and the smallest value in the data set. Since the range only takes into account two values from the entire data set, it may be heavily influenced by outliers in the data. Therefore, another criterion, the interquartile range, is commonly used. It is the distance between the 25th and 75th percentiles (Q3 - Q1), so it is essentially the range of the middle 50% of the data. Because it uses only the middle 50%, the interquartile range is far less sensitive to outliers or extreme values than the range. NaN/missing values as well as infinite values will be ignored for this method. Also, if no finite values could be found, the corresponding attribute will be ignored.
    Range: selection
  • min This parameter is available only when the method parameter is set to 'range transformation'. It is used to specify the minimum point of the range. Range: real
  • max This parameter is available only when the method parameter is set to 'range transformation'. It is used to specify the maximum point of the range. Range: real
  • allow_negative_values This parameter is available only when the method parameter is set to 'proportion transformation'. It is used to allow or disallow negative values in the processed attributes. If allowed, negative values will be counted as their absolute values. Range: boolean
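
The four options of the method parameter can be sketched in plain Python as follows. This is an illustrative sketch under stated assumptions, not RapidMiner's implementation. The function names are hypothetical. The z-transformation uses the sample standard deviation, which is what reproduces the value 2.5 in the tutorial process. The range transformation uses the usual min-max formula. For the interquartile range method the exact formula is not given here, so centering on the median and dividing by Q3 - Q1 is an assumption, as is the quartile convention of Python's statistics.quantiles. For the proportion transformation, "treated as absolute values" is interpreted as applying the absolute value both in the sum and in the result.

```python
import math
import statistics


def z_transformation(values):
    """Subtract the mean and divide by the sample standard deviation."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # divides by n - 1
    return [(v - mean) / sd for v in values]


def range_transformation(values, new_min=0.0, new_max=1.0):
    """Min-max rescaling to [new_min, new_max]; assumes max(values) > min(values)."""
    lo, hi = min(values), max(values)
    return [new_min + (v - lo) * (new_max - new_min) / (hi - lo) for v in values]


def proportion_transformation(values, allow_negative_values=False):
    """Divide each value by the sum of the finite values of the attribute."""
    finite = [v for v in values if math.isfinite(v)]
    if any(v < 0 for v in finite):
        if not allow_negative_values:
            raise ValueError("negative value found but allow_negative_values is False")
        # Interpretation (assumption): use absolute values throughout.
        values = [abs(v) for v in values]
        finite = [abs(v) for v in finite]
    total = sum(finite)
    return [v / total for v in values]


def interquartile_range(values):
    """Assumed formula: (x - median) / (Q3 - Q1), computed from finite values only."""
    finite = [v for v in values if math.isfinite(v)]
    q1, median, q3 = statistics.quantiles(finite, n=4)
    return [(v - median) / (q3 - q1) for v in values]


humidity = [65, 70, 70, 70]
print(z_transformation(humidity))
print(range_transformation(humidity, 0.0, 1.0))
print([round(p, 3) for p in proportion_transformation(humidity)])
print(interquartile_range([1, 2, 3, 4, 5, 6, 7]))
```

The sketch leaves out edge cases such as constant attributes, where the standard deviation, the range or the interquartile range is zero and the division is undefined.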

Tutorial Processes

Different methods of normalization

The focus of this process is to show different methods available for normalization. All parameters other than the method parameter are for selection of attributes on which normalization is to be applied. To understand these parameters please study the Example Process of the Select Attributes operator.

In this process the Retrieve operator is used to load the 'golf' data set from the Repository. The Filter Examples operator is applied to it to select just four examples of the 'golf' data set. This is done only to simplify the calculations. A breakpoint is inserted after this operator so that you can have a look at the examples. There are four examples with 'Humidity' attribute values 65, 70, 70 and 70. The 'Humidity' attribute is selected for normalization in the Normalize operator.

The method parameter is set to 'proportion transformation'. All values of the 'Humidity' attribute are divided by the sum of all values of the 'Humidity' attribute. The sum is 275 (65+70+70+70). Thus the values after normalization are 0.236 (65/275) and 0.255 (70/275).

Now run the process again with the method parameter set to 'z-transformation'. The mean of the four 'Humidity' attribute values (65, 70, 70, 70) is 68.75. The sample standard deviation of these values is 2.5. Now for each attribute value, subtract the mean from the attribute value and divide the result by the standard deviation. You will see that the results match the values shown in the Results Workspace.
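
The arithmetic behind these numbers can be written out as follows, assuming the sample standard deviation (dividing by n - 1), which is the convention that reproduces the stated value of 2.5:

```latex
\begin{align*}
\bar{x} &= \frac{65 + 70 + 70 + 70}{4} = 68.75 \\
s &= \sqrt{\frac{(65 - 68.75)^2 + 3\,(70 - 68.75)^2}{4 - 1}}
   = \sqrt{\frac{14.0625 + 4.6875}{3}} = \sqrt{6.25} = 2.5 \\
z_{65} &= \frac{65 - 68.75}{2.5} = -1.5 \\
z_{70} &= \frac{70 - 68.75}{2.5} = 0.5
\end{align*}
```

The normalized 'Humidity' values are therefore -1.5 for the example with value 65 and 0.5 for the three examples with value 70.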

Select the 'Temperature' attribute and set the method parameter to 'range transformation'. Use 0 and 1 for the min and max parameters. Run the process. You will see that all values of the 'Temperature' attribute lie in the range [0,1].