Discretize by Entropy (AI Studio Core)

Synopsis

This operator converts the selected numerical attributes into nominal attributes. The boundaries of the bins are chosen so that the entropy is minimized in the induced partitions.

Description

This operator discretizes the selected numerical attributes into nominal attributes. Each bin range is named automatically; the naming format can be changed using the range name type parameter. Values that fall within the range of a bin are replaced by the name of that range.

The discretization is performed by selecting a bin boundary that minimizes the entropy in the induced partitions. The method is then applied recursively for both new partitions until the stopping criterion is reached. For more information please study:

  • Multi-Interval Discretization of Continuous-Valued Attributes for Classification Learning (Fayyad, Irani)
  • Supervised and Unsupervised Discretization of Continuous Features (Dougherty, Kohavi, Sahami)
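
The recursive procedure described above can be sketched in Python as follows. This is a minimal illustration, not the operator's actual implementation: the function names are hypothetical, boundaries are assumed to lie midway between neighboring values, and the stopping criterion shown is the minimum description length (MDL) test from the Fayyad and Irani paper, which is assumed rather than confirmed to be the one the operator applies.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a non-empty sequence of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def mdl_accepts(labels, left, right):
    """MDL stopping test (assumption: taken from Fayyad and Irani)."""
    n = len(labels)
    k, k1, k2 = len(set(labels)), len(set(left)), len(set(right))
    ent, ent1, ent2 = entropy(labels), entropy(left), entropy(right)
    gain = ent - (len(left) * ent1 + len(right) * ent2) / n
    delta = math.log2(3 ** k - 2) - (k * ent - k1 * ent1 - k2 * ent2)
    return gain > (math.log2(n - 1) + delta) / n


def cut_points(values, labels):
    """Recursively collect bin boundaries for one numerical attribute."""
    pairs = sorted(zip(values, labels), key=lambda p: p[0])
    xs = [x for x, _ in pairs]
    ys = [y for _, y in pairs]
    best = None  # (weighted entropy of the two partitions, split index)
    for i in range(1, len(xs)):
        if xs[i - 1] == xs[i]:
            continue  # no boundary can separate equal values
        w = (i * entropy(ys[:i]) + (len(ys) - i) * entropy(ys[i:])) / len(ys)
        if best is None or w < best[0]:
            best = (w, i)
    if best is None:
        return []  # all values identical: nothing to split
    i = best[1]
    if not mdl_accepts(ys, ys[:i], ys[i:]):
        return []  # stopping criterion reached
    boundary = (xs[i - 1] + xs[i]) / 2  # assumption: midpoint boundary
    # Apply the same method to both new partitions.
    return cut_points(xs[:i], ys[:i]) + [boundary] + cut_points(xs[i:], ys[i:])


values = [1, 2, 3, 4, 10, 11, 12, 13]
labels = ["a"] * 4 + ["b"] * 4
print(cut_points(values, labels))  # [7.0]
```

An attribute for which this sketch returns no boundary ends up with a single range, which corresponds to the case where the entropy criterion is not fulfilled.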

This operator can automatically remove all attributes with only one range, i.e. those attributes that are not actually discretized because the entropy criterion is not fulfilled. This behavior is controlled by the remove useless parameter.
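
As a rough illustration of this behavior, the sketch below drops every attribute that received no bin boundary and therefore has only one range. The function name and the dictionary layout (attribute name mapped to its list of boundaries) are hypothetical and chosen only for this example.

```python
def remove_useless(boundaries_by_attribute):
    """Keep only attributes with at least one boundary, i.e. more than one range."""
    return {
        name: boundaries
        for name, boundaries in boundaries_by_attribute.items()
        if boundaries  # an empty list means a single range
    }


boundaries = {"attribute_1": [0.25, 0.6], "attribute_2": [], "attribute_3": [0.1]}
print(sorted(remove_useless(boundaries)))  # ['attribute_1', 'attribute_3']
```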

Differentiation

Discretize by Binning

The Discretize By Binning operator creates bins in such a way that the range of all bins is (almost) equal.

Discretize by Frequency

The Discretize By Frequency operator creates bins in such a way that the number of unique values in all bins is (almost) equal.

Discretize by Size

The Discretize By Size operator creates bins in such a way that each bin has user-specified size (i.e. number of examples).

Discretize by User Specification

This operator discretizes the selected numerical attributes into user-specified classes.
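
To make the difference between the first two alternatives concrete, the following sketch computes boundaries for bins of equal range and for bins holding an equal number of unique values. It is only an illustration under stated assumptions: the function names are hypothetical, boundaries are placed midway between neighboring unique values in the frequency variant, and neither function is taken from the operators themselves.

```python
def equal_width_boundaries(values, bins):
    """Boundaries that give every bin the same range (Discretize by Binning idea)."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins
    return [lo + width * i for i in range(1, bins)]


def equal_frequency_boundaries(values, bins):
    """Boundaries that give every bin (almost) the same number of unique values."""
    distinct = sorted(set(values))
    n = len(distinct)
    cuts = []
    for i in range(1, bins):
        idx = i * n // bins  # number of unique values left of this boundary
        if idx == 0:
            continue  # more bins requested than unique values available
        cut = (distinct[idx - 1] + distinct[idx]) / 2  # assumption: midpoint
        if not cuts or cut > cuts[-1]:
            cuts.append(cut)
    return cuts


values = [1, 2, 3, 4, 10, 20, 40, 100]
print(equal_width_boundaries(values, 2))      # [50.5]
print(equal_frequency_boundaries(values, 2))  # [7.0]
```

With skewed data such as this, equal ranges put seven of the eight values into the first bin, whereas the frequency variant places four unique values in each bin. Neither approach looks at the label, which is what distinguishes them from the entropy-based operator.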

Input

  • example set input (Data table)

    This input port expects an ExampleSet. It is the output of the Retrieve operator in the attached Example Process. The output of other operators can also be used as input. Please note that there should be at least one numerical attribute in the input ExampleSet, otherwise the use of this operator does not make sense.

Output

  • example set output (Data table)

    The selected numerical attributes are converted into nominal attributes by discretization and the resultant ExampleSet is delivered through this port.

  • original (Data table)

    The ExampleSet that was given as input is passed without changing to the output through this port. This is usually used to reuse the same ExampleSet in further operators or to view the ExampleSet in the Results Workspace.

  • preprocessing model (Preprocessing Model)

    This port delivers the preprocessing model, which has information regarding the parameters of this operator in the current process.

Parameters

  • attribute filter type: This parameter allows you to select the attribute selection filter, i.e. the method you want to use for selecting the required attributes. It has the following options:
    • all: This option simply selects all the attributes of the ExampleSet. This is the default option.
    • single: This option allows selection of a single attribute. When this option is selected another parameter (attribute) becomes visible in the Parameters panel.
    • subset: This option allows selection of multiple attributes through a list. All attributes of the ExampleSet are present in the list; required attributes can be easily selected. This option will not work if the meta data is not known. When this option is selected another parameter (attributes) becomes visible in the Parameters panel.
    • regular expression: This option allows you to specify a regular expression for attribute selection. When this option is selected some other parameters (regular expression, use except expression) become visible in the Parameters panel.
    • value type: This option allows selection of all the attributes of a particular type. It should be noted that types are hierarchical. For example real and integer types both belong to the numeric type. Users should have a basic understanding of type hierarchy when selecting attributes through this option. When this option is selected some other parameters (value type, use value type exception) become visible in the Parameters panel.
    • block type: This option is similar in working to the value type option. This option allows selection of all the attributes of a particular block type. When this option is selected some other parameters (block type, use block type exception) become visible in the Parameters panel.
    • no missing values: This option simply selects all the attributes of the ExampleSet which don't contain a missing value in any example. Attributes that have even a single missing value are removed.
    • numeric value filter: When this option is selected another parameter (numeric condition) becomes visible in the Parameters panel. All numeric attributes whose examples all satisfy the mentioned numeric condition are selected. Please note that all nominal attributes are also selected irrespective of the given numerical condition.
  • attribute: The desired attribute can be selected from this option. The attribute name can be selected from the drop-down box of the attribute parameter if the meta data is known.
  • attributes: The required attributes can be selected from this option. This opens a new window with two lists. All attributes are present in the left list and can be shifted to the right list, which is the list of selected attributes on which the discretization will take place; all other attributes will remain unchanged.
  • regular expression: The attributes whose name matches this expression will be selected. Regular expressions are a very powerful tool, but beginners may need a detailed introduction to them. It is advisable to specify the regular expression through the edit and preview regular expression menu, which lets you try different expressions and preview the results simultaneously.
  • use except expression: If enabled, an exception to the first regular expression can be specified. When this option is selected another parameter (except regular expression) becomes visible in the Parameters panel.
  • except regular expression: This option allows you to specify a regular expression. Attributes matching this expression will be filtered out even if they match the first expression (the expression that was specified in the regular expression parameter).
  • value type: The type of attributes to be selected can be chosen from a drop-down list. One of the following types can be chosen: nominal, text, binominal, polynominal, file_path.
  • use value type exception: If enabled, an exception to the selected type can be specified. When this option is selected another parameter (except value type) becomes visible in the Parameters panel.
  • except value type: The attributes matching this type will be removed from the final output even if they matched the previously mentioned type, i.e. the value type parameter's value. One of the following types can be selected here: nominal, text, binominal, polynominal, file_path.
  • block type: The block type of attributes to be selected can be chosen from a drop-down list. The only possible value here is 'single_value'.
  • use block type exception: If enabled, an exception to the selected block type can be specified. When this option is selected another parameter (except block type) becomes visible in the Parameters panel.
  • except block type: The attributes matching this block type will be removed from the final output even if they matched the previously mentioned block type.
  • numeric condition: The numeric condition for testing examples of numeric attributes is specified here. For example, the numeric condition '> 6' will keep all nominal attributes and all numeric attributes having a value greater than 6 in every example. A combination of conditions is possible: '> 6 && < 11' or '<= 5 || < 0'. However, && and || cannot be used together in one numeric condition. Conditions like '(> 0 && < 2) || (>10 && < 12)' are not allowed because they use both && and ||. Use a blank space after '>', '=' and '<', e.g. '<5' will not work, so use '< 5' instead.
  • include special attributes: Special attributes are attributes with special roles which identify the examples. In contrast, regular attributes simply describe the examples. Special attributes are: id, label, prediction, cluster, weight and batch. If this parameter is set to true, special attributes are also included in the selection.
  • invert selection: If this parameter is set to true, it acts as a NOT gate and reverses the selection. In that case all the selected attributes are unselected and previously unselected attributes are selected. For example, if attribute 'att1' is selected and attribute 'att2' is unselected before this parameter is checked, then after checking it 'att1' will be unselected and 'att2' will be selected.
  • remove useless: This parameter indicates whether useless attributes, i.e. attributes containing only a single range, should be removed. If this parameter is set to true, all attributes that are not actually discretized because the entropy criterion is not fulfilled are removed.
  • range name type: This parameter is used for changing the naming format of the ranges. The 'long', 'short' and 'interval' formats are available.
  • automatic number of digits: This is an expert parameter. It is only available when the range name type parameter is set to 'interval'. It indicates if the number of digits should be automatically determined for the range names.
  • number of digits: This is an expert parameter. It is used to specify the minimum number of digits used for the interval names.
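
The interplay of the regular expression, except regular expression and invert selection parameters described above can be sketched as follows. This is a hypothetical illustration written with Python's re module; the function name is invented, the sketch assumes that an expression must match the whole attribute name, and the regular expression dialect used by the operator may differ from Python's.

```python
import re


def select_attributes(names, regular_expression,
                      except_regular_expression=None, invert_selection=False):
    """Return the attribute names chosen by the 'regular expression' filter."""
    selected = []
    for name in names:
        # Assumption: the expression has to match the complete name.
        keep = re.fullmatch(regular_expression, name) is not None
        if keep and except_regular_expression is not None:
            # The except expression filters out names that matched the first one.
            keep = re.fullmatch(except_regular_expression, name) is None
        if keep != invert_selection:  # invert selection acts as a NOT gate
            selected.append(name)
    return selected


names = ["attribute_1", "attribute_2", "attribute_10", "class"]
print(select_attributes(names, r"attribute_.*", r".*_10"))
# ['attribute_1', 'attribute_2']
print(select_attributes(names, r"attribute_.*", r".*_10", invert_selection=True))
# ['attribute_10', 'class']
```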

Tutorial Processes

Discretizing the 'Sonar' data set by entropy

The focus of this Example Process is the discretization procedure. For understanding the parameters related to attribute selection please study the Example Process of the Select Attributes operator.

The 'Sonar' data set is loaded using the Retrieve operator. A breakpoint is inserted here so that you can have a look at the ExampleSet. You can see that this data set has 60 regular attributes (all of real type). The Discretize by Entropy operator is applied to it. The attribute filter type parameter is set to 'all', thus all the numerical attributes will be discretized. The remove useless parameter is set to true, thus attributes with only one range are removed from the ExampleSet. Run the process and switch to the Results Workspace. You can see that the 'Sonar' data set has been reduced to just 22 regular attributes. These numerical attributes have been discretized to nominal attributes.