Filter Examples (RapidMiner Studio Core)

Synopsis

This operator selects which examples (i.e. rows) of an ExampleSet should be kept and which examples should be removed. Examples satisfying the given condition are kept; the remaining examples are removed.

Description

This operator takes an ExampleSet as input and returns a new ExampleSet including only those examples that satisfy the specified condition. Several predefined conditions are provided; users can select any of them. Users can also define their own conditions to filter examples. This operator may reduce the number of examples in an ExampleSet, but it has no effect on the number of attributes. The Select Attributes operator is used to select the required attributes.

The Filter Examples operator is frequently used to filter examples that have (or do not have) missing values. It is also frequently used to filter examples with correct or wrong predictions (usually after testing a learnt model).
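
The two common uses mentioned above can be sketched in plain Python. This is only an illustration of the idea, not RapidMiner code: the `CONDITIONS` table, the `keep` helper, the row layout and the sample rows are all hypothetical, and `None` stands in for a missing value. How the real operator treats a missing label when comparing predictions is not modeled here.

```python
MISSING = None  # stand-in for a missing value (shown as '?' in RapidMiner)

# Hypothetical predicates mirroring the predefined conditions by name.
CONDITIONS = {
    "all": lambda row: True,
    "correct_predictions": lambda row: row["label"] == row["prediction"],
    "wrong_predictions": lambda row: row["label"] != row["prediction"],
    "no_missing_attributes": lambda row: all(
        value is not MISSING for value in row["attributes"].values()),
    "missing_attributes": lambda row: any(
        value is MISSING for value in row["attributes"].values()),
    "no_missing_labels": lambda row: row["label"] is not MISSING,
    "missing_label": lambda row: row["label"] is MISSING,
}

# Made-up sample rows, for illustration only.
rows = [
    {"attributes": {"Outlook": "sunny", "Temperature": 85},
     "label": "no", "prediction": "no"},
    {"attributes": {"Outlook": MISSING, "Temperature": 70},
     "label": "yes", "prediction": "no"},
    {"attributes": {"Outlook": "rain", "Temperature": 65},
     "label": MISSING, "prediction": "yes"},
]


def keep(name):
    """Return the indices of the rows that satisfy the named condition."""
    return [i for i, row in enumerate(rows) if CONDITIONS[name](row)]


print(keep("missing_attributes"))   # [1]
print(keep("correct_predictions"))  # [0]
print(keep("no_missing_labels"))    # [0, 1]
```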

Input

  • example set input (Data Table)

    This input port expects an ExampleSet. In the attached Example Process it is the output of the Retrieve operator.

Output

  • example set output (Data Table)

    The new ExampleSet, including only the examples that satisfied the specified condition, is the output of this port.

  • original (Data Table)

    The ExampleSet that was given as input is passed unchanged to the output through this port. This is usually used to reuse the same ExampleSet in further operators or to view the ExampleSet in the Results Workspace.

  • unmatched example set (Data Table)

    An ExampleSet including only the examples that did not satisfy the specified condition is the output of this port.
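
How the three output ports relate to each other can be sketched in plain Python. The `filter_examples` helper and the sample rows are hypothetical and only illustrate the split; they are not part of RapidMiner.

```python
def filter_examples(examples, condition):
    """Split rows the way the three output ports do (hypothetical helper)."""
    matched = [row for row in examples if condition(row)]
    unmatched = [row for row in examples if not condition(row)]
    # The 'original' output is simply the input passed through unchanged.
    return matched, examples, unmatched


# Made-up sample rows, for illustration only.
rows = [
    {"id": 1, "Temperature": 85},
    {"id": 2, "Temperature": 64},
    {"id": 3, "Temperature": 72},
]

matched, original, unmatched = filter_examples(
    rows, lambda row: row["Temperature"] > 70)

print([row["id"] for row in matched])    # [1, 3]
print([row["id"] for row in unmatched])  # [2]
print(len(original))                     # 3
```

Every input row ends up in exactly one of the matched and unmatched outputs, so their sizes always add up to the size of the original.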

Parameters

  • condition_class

    Various predefined conditions are available for filtering examples. Users can select any of them. Examples satisfying the selected condition are passed to the output port; the others are removed. The following conditions are available:
    • all: if this option is selected, no examples are removed.
    • correct_predictions: if this option is selected, only those examples make it to the output port that have correct predictions, i.e. the values of the label attribute and the prediction attribute are the same.
    • wrong_predictions: if this option is selected, only those examples make it to the output port that have wrong predictions, i.e. the values of the label attribute and the prediction attribute are not the same.
    • no_missing_attributes: if this option is selected, only those examples make it to the output port that have no missing values in their attribute values. Missing values or null values are usually shown by '?' in RapidMiner.
    • missing_attributes: if this option is selected, only those examples make it to the output port that have some missing values in their attribute values.
    • no_missing_labels: if this option is selected, only those examples make it to the output port that do not have any missing values in their label attribute values. Missing values or null values are usually shown by '?' in RapidMiner.
    • missing_label: if this option is selected, only those examples make it to the output port that have some missing values in their label attribute values.
    • attribute_value_filter: if this option is selected, another parameter (parameter string) is enabled in the Parameters panel.
    Range: selection
  • parameter_string

    Instead of using one of the predefined conditions, users can define their own conditions here. It is important to understand how to specify conditions, because the true power of this operator lies in defining custom conditions that fit the requirements.

    For numerical attributes, conditions are specified in the "attribute op value" format, where 'attribute' is the name of the attribute, 'value' is a value that the attribute can take and 'op' is one of the binary comparison operators >, <, >=, <=, = and !=.

    For nominal attributes, conditions are specified in the "attribute op exp" format, where 'attribute' is the name of the attribute, 'op' is either '=' or '!=' and 'exp' is a regular expression. Users should have a good understanding of regular expressions. A good way to get familiar with them is to use the Select Attributes operator with the attribute filter type parameter set to regular_expression and then open the edit and preview regular expression menu.

    Multiple conditions can be linked by using the logical AND (written as &&) or logical OR (written as ||) operators. Instead of writing multiple AND conditions, you can use multiple Filter Examples operators in a row to reduce complexity.

    Missing values or null values can be written as '?' for numerical attributes and as '\?' for nominal attributes. '\?' is used instead of '?' for nominal attributes because the condition is read as a regular expression, in which a bare '?' is a special character and therefore has to be escaped.

    For 'unknown_attributes' the parameter string must be empty. This filter removes all examples containing attributes that have missing or illegal values. For 'unknown_label' the parameter string must also be empty. This filter removes all examples with an unknown label value.

    Range: string
  • invert_filter

    If this parameter is set to true, it acts as a NOT gate and reverses the selection: all the examples that would have been selected are removed and the examples that would have been removed are selected. In other words, it inverts the condition. For example, if the missing_attributes option is selected in the condition class parameter and the invert filter parameter is set to true, the output port delivers an ExampleSet with no missing values.

    Range: boolean
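
The way a parameter string and the invert filter parameter interact can be sketched in plain Python. This is a simplified illustration, not RapidMiner's implementation. The names `term_holds` and `keep_example` are hypothetical, and several points are assumptions of the sketch: the operator spellings it accepts (for example `>=`), that && binds more tightly than ||, that a regular expression must match the whole nominal value, and that a missing numerical value never satisfies a comparison. Regular expressions that themselves contain && or || are not handled.

```python
import operator
import re

MISSING = None  # stand-in for a missing value (shown as '?' in RapidMiner)

NUMERIC_OPS = {">=": operator.ge, "<=": operator.le, "!=": operator.ne,
               ">": operator.gt, "<": operator.lt, "=": operator.eq}
# Two-character operators are listed first so '>=' is not read as '>'.
TERM = re.compile(r"\s*(\w+)\s*(>=|<=|!=|>|<|=)\s*(.+?)\s*$")


def term_holds(row, term, nominal):
    """Evaluate a single 'attribute op value' or 'attribute op exp' term."""
    name, op, value = TERM.match(term).groups()
    actual = row[name]
    is_nominal = name in nominal
    missing_token = r"\?" if is_nominal else "?"
    if (is_nominal or value == missing_token) and op not in ("=", "!="):
        raise ValueError("only '=' and '!=' are supported here: " + term)
    if value == missing_token:
        hit = actual is MISSING  # test for a missing value
        return hit if op == "=" else not hit
    if is_nominal:
        # Assumption: the expression has to match the whole value.
        hit = actual is not MISSING and re.fullmatch(value, actual) is not None
        return hit if op == "=" else not hit
    # Numerical comparison; a missing value never satisfies it (assumption).
    return actual is not MISSING and NUMERIC_OPS[op](actual, float(value))


def keep_example(row, condition, nominal, invert_filter=False):
    """Decide whether a row is kept; '||' separates groups of '&&' terms."""
    result = any(
        all(term_holds(row, term, nominal) for term in group.split("&&"))
        for group in condition.split("||")
    )
    # invert_filter acts as a NOT gate on the final decision.
    return result != invert_filter


condition = "Outlook = .*n.* && Temperature>70"
sunny = {"Outlook": "sunny", "Temperature": 85}
overcast = {"Outlook": "overcast", "Temperature": 83}

print(keep_example(sunny, condition, {"Outlook"}))                      # True
print(keep_example(overcast, condition, {"Outlook"}))                   # False
print(keep_example(sunny, condition, {"Outlook"}, invert_filter=True))  # False
```

The set of nominal attribute names is passed in explicitly because the sketch has no schema to consult; in the operator itself the attribute type is known from the ExampleSet.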

Tutorial Processes

Filtering correctly predicted examples

The 'Golf' data set is loaded using the Retrieve operator and the k-NN operator is applied on it to generate a classification model. That model is then applied on the 'Golf-Testset' data set using the Apply Model operator, which records the predicted values in a new attribute named 'prediction(Play)'. The labeled data from the Apply Model operator serves as input to the Filter Examples operator. The correct_predictions option is selected in the condition class parameter, which on its own would keep only those examples that have correct predictions, i.e. examples in which the label and prediction attributes have the same value. However, as the invert filter parameter is set to true, the selection is reversed and the wrong predictions are delivered through the output port instead. It can be seen in the Results Workspace that the label attribute (Play) and the prediction attribute (prediction(Play)) have different values in all the resulting examples. A breakpoint is inserted before the Filter Examples operator so that you can have a look at the examples before the Filter Examples operator is applied. Press the green Run button to continue with the process.

Filtering examples according to their values

The 'Golf' data set is loaded using the Retrieve operator and the Filter Examples operator is applied on it with the parameter string "Outlook = .*n.* && Temperature>70". The Outlook attribute is nominal, so a regular expression is used to describe it. The regular expression in "Outlook = .*n.*" matches all examples that have the letter 'n' in their Outlook attribute value. 10 examples qualify; all of them have 'Outlook = rain' or 'Outlook = sunny'. The Temperature attribute is numerical, so the "attribute op value" syntax is used to select rows. 9 examples satisfy the condition that the Temperature attribute has a value greater than 70. As these two conditions are joined using the logical AND (&&), the examples finally selected are those that meet both conditions. Only 6 rows have an 'n' in their Outlook attribute value and a Temperature attribute value greater than 70. This can be seen clearly in the Results Workspace.
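
The counts in this tutorial can be reproduced with a short Python sketch. The (Outlook, Temperature) pairs below are typed in from the commonly published golf (weather) data and are assumed to match the 'Golf' sample used in the process; the variable names are hypothetical, and whole-value regular expression matching is an assumption as well.

```python
import re

# (Outlook, Temperature) pairs; assumed to match the 'Golf' sample data.
golf = [
    ("sunny", 85), ("sunny", 80), ("overcast", 83), ("rain", 70),
    ("rain", 68), ("rain", 65), ("overcast", 64), ("sunny", 72),
    ("sunny", 69), ("rain", 75), ("sunny", 75), ("overcast", 72),
    ("overcast", 81), ("rain", 71),
]

# Outlook = .*n.*  : the value contains the letter 'n'
outlook_hits = [row for row in golf if re.fullmatch(".*n.*", row[0])]
# Temperature>70
temperature_hits = [row for row in golf if row[1] > 70]
# Both conditions joined with && (logical AND)
both = [row for row in golf
        if re.fullmatch(".*n.*", row[0]) and row[1] > 70]

print(len(outlook_hits), len(temperature_hits), len(both))  # 10 9 6
```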

Filtering examples according to their values with an OR condition

The 'Labor-Negotiations' data set is loaded using the Retrieve operator and the Filter Examples operator is applied on it with the parameter string "duration=? || pension !=\?". The Duration attribute is numerical, so the "attribute op value" syntax is used to select rows. 1 example satisfies the condition that the Duration attribute has a missing value. The Pension attribute is nominal, so a regular expression is used to describe it. The condition "pension !=\?" matches all examples that do not have a missing value in their Pension attribute. 18 examples qualify; none of them has a missing value in the Pension attribute. Note that '?' is used for missing values of numerical attributes, whereas for nominal attributes the question mark must be escaped ('\?') because, as noted above, a regular expression is expected in this case. As these two conditions are joined using the logical OR (||), the examples finally selected are those that meet at least one of the conditions. 18 rows have no missing value in the Pension attribute or have a missing value in the Duration attribute. This can be seen clearly in the Results Workspace.
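
The effect of the logical OR can be sketched in plain Python as a set union. The rows below are made up for illustration and are not taken from the Labor-Negotiations data set; `None` stands in for a missing value.

```python
MISSING = None  # stand-in for a missing value

# Hypothetical rows, for illustration only.
rows = [
    {"duration": 1, "pension": "none"},
    {"duration": MISSING, "pension": MISSING},
    {"duration": 3, "pension": MISSING},
    {"duration": 2, "pension": "empl_contr"},
]

# duration=?     : the numerical value is missing
duration_missing = {i for i, row in enumerate(rows)
                    if row["duration"] is MISSING}
# pension !=\?   : the nominal value is not missing
pension_present = {i for i, row in enumerate(rows)
                   if row["pension"] is not MISSING}

# '||' keeps every row that satisfies at least one of the two conditions.
selected = sorted(duration_missing | pension_present)
print(selected)  # [0, 1, 3]
```

Because the result is a union, its size lies between the size of the larger set and the sum of both sizes; a row matched by both conditions is counted only once.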