Detect Outlier (Distances) (RapidMiner Studio Core)

Synopsis

This operator identifies n outliers in the given ExampleSet based on the distance to their k nearest neighbors. The variables n and k can be specified through parameters.

Description

This operator performs an outlier search according to the approach proposed by Ramaswamy, Rastogi and Shim in "Efficient Algorithms for Mining Outliers from Large Data Sets". Their paper formulates distance-based outliers in terms of the distance of a point from its k-th nearest neighbor. Each point is ranked by its distance to its k-th nearest neighbor, and the top n points in this ranking are declared to be outliers. The values of k and n are set by the number of neighbors and number of outliers parameters respectively. This search builds on the simple and intuitive distance-based definition of outliers by Knorr and Ng, which in simple words is: 'A point p in a data set is an outlier with respect to parameters k and d if no more than k points in the data set are at a distance of d or less from p'.
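The ranking described above can be sketched in a few lines of Python. This is a minimal brute-force illustration, not the operator's actual implementation: the function names `kth_neighbor_distance` and `top_n_outliers` are hypothetical, Euclidean distance is assumed, and the sample points are made up.

```python
import math


def kth_neighbor_distance(points, index, k):
    """Euclidean distance from points[index] to its k-th nearest neighbor."""
    p = points[index]
    distances = sorted(
        math.dist(p, q) for j, q in enumerate(points) if j != index
    )
    return distances[k - 1]


def top_n_outliers(points, k, n):
    """Indices of the n points with the largest k-th neighbor distance."""
    scores = [kth_neighbor_distance(points, i, k) for i in range(len(points))]
    ranked = sorted(range(len(points)), key=lambda i: scores[i], reverse=True)
    return ranked[:n]


# Illustrative data: a tight cluster near the origin plus two distant points.
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1), (5.0, 5.0), (-4.0, 3.0)]
outliers = top_n_outliers(points, k=2, n=2)
print(outliers)  # indices of the two distant points
```

The brute-force search compares every pair of points, so its cost grows quadratically with the number of examples. It is only meant to make the ranking rule concrete.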

This operator adds a new boolean attribute named 'outlier' to the given ExampleSet. If the value of this attribute is true, the example is an outlier; if it is false, the example is not an outlier. Exactly n examples will have the value true in the 'outlier' attribute (where n is the value specified in the number of outliers parameter). The operator supports different distance functions; the desired one can be selected with the distance function parameter.

An outlier is an example that is numerically distant from the rest of the examples of the ExampleSet, that is, one that appears to deviate markedly from the other examples. Outliers are often (but not always) indicative of measurement error. When that is the case, such examples should be discarded.

Input

• example set input (Data Table)

This input port expects an ExampleSet. It is the output of the Generate Data operator in the attached Example Process. The output of other operators can also be used as input.

Output

• example set output (Data Table)

A new boolean attribute 'outlier' is added to the given ExampleSet and the ExampleSet is delivered through this output port.

• original (Data Table)

The ExampleSet that was given as input is passed unchanged to the output through this port. This is usually used to reuse the same ExampleSet in further operators or to view the ExampleSet in the Results Workspace.

Parameters

• number_of_neighbors: This parameter specifies the value of k, that is, which nearest neighbor (the k-th) is used when computing each example's distance. The minimum and maximum values for this parameter are 1 and 1 million respectively. Range: integer
• number_of_outliers: This parameter specifies n, the number of top-ranked outliers to look for. The resultant ExampleSet will have n examples that are considered outliers. The minimum and maximum values for this parameter are 2 and 1 million respectively. Range: integer
• distance_function: This parameter specifies the distance function that will be used for calculating the distance between two examples. Range: selection
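
The choice of distance function can change which examples are ranked as outliers. The sketch below illustrates this with a pluggable distance argument. The names `top_outlier`, `euclidean` and `manhattan` are hypothetical, and the two distance functions were chosen only for illustration; this page does not list which functions the distance function parameter actually offers.

```python
import math


def euclidean(p, q):
    """Straight-line distance between two points."""
    return math.dist(p, q)


def manhattan(p, q):
    """Sum of absolute coordinate differences (illustrative choice only)."""
    return sum(abs(a - b) for a, b in zip(p, q))


def top_outlier(points, k, distance):
    """Index of the point with the largest distance to its k-th nearest neighbor."""
    def score(i):
        d = sorted(distance(points[i], q) for j, q in enumerate(points) if j != i)
        return d[k - 1]
    return max(range(len(points)), key=score)


# Made-up data: three clustered points and two candidates for "most outlying".
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (3.0, 3.0), (0.0, 4.5)]
by_euclidean = top_outlier(points, 2, euclidean)
by_manhattan = top_outlier(points, 2, manhattan)
print(by_euclidean, by_manhattan)  # the two metrics pick different points
```

With k = 2, the Euclidean ranking puts the point (0.0, 4.5) first, while the Manhattan ranking puts (3.0, 3.0) first, so the same data yields a different top outlier under each metric.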

Tutorial Processes

Detecting outliers from an ExampleSet

The Generate Data operator is used for generating an ExampleSet. The target function parameter is set to 'gaussian mixture clusters'. The number of examples and number of attributes parameters are set to 200 and 2 respectively. A breakpoint is inserted here so that you can view the ExampleSet in the Results Workspace. A good plot of the ExampleSet can be seen by switching to the 'Plot View' tab. Set Plotter to 'Scatter', x-Axis to 'att1' and y-Axis to 'att2' to view the scatter plot of the ExampleSet.

The Detect Outlier (Distances) operator is applied to this ExampleSet. The number of neighbors and number of outliers parameters are set to 4 and 12 respectively. Thus 12 examples of the resultant ExampleSet will have the value true in the 'outlier' attribute. This can be verified by viewing the ExampleSet in the Results Workspace. For a better understanding, switch to the 'Plot View' tab. Set Plotter to 'Scatter', x-Axis to 'att1', y-Axis to 'att2' and Color Column to 'outlier' to view the scatter plot of the ExampleSet (the outliers are marked red).
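
For readers who want to reason about this process outside RapidMiner, the following Python sketch mimics it under stated assumptions. It does not reproduce the Generate Data operator: the cluster centers, the unit standard deviation and the random seed are all made up, and Euclidean distance is assumed. Only the sizes (200 examples, 2 attributes), the attribute names and the parameter values (4 neighbors, 12 outliers) come from the tutorial.

```python
import math
import random

random.seed(0)  # fixed seed so the sketch is repeatable

# Hypothetical cluster centers; the Generate Data operator chooses its own.
centers = [(-5.0, -5.0), (0.0, 5.0), (5.0, -2.0), (6.0, 6.0)]

# 200 examples with 2 attributes, drawn from a simple Gaussian mixture.
examples = []
for i in range(200):
    cx, cy = centers[i % len(centers)]
    examples.append({"att1": random.gauss(cx, 1.0), "att2": random.gauss(cy, 1.0)})


def kth_distance(rows, index, k):
    """Euclidean distance from rows[index] to its k-th nearest neighbor."""
    p = (rows[index]["att1"], rows[index]["att2"])
    d = sorted(
        math.dist(p, (r["att1"], r["att2"]))
        for j, r in enumerate(rows)
        if j != index
    )
    return d[k - 1]


k, n = 4, 12  # number of neighbors and number of outliers from the tutorial
scores = [kth_distance(examples, i, k) for i in range(len(examples))]
ranked = sorted(range(len(examples)), key=lambda i: scores[i], reverse=True)
top = set(ranked[:n])

# Add the boolean 'outlier' attribute to every example.
for i, row in enumerate(examples):
    row["outlier"] = i in top

outlier_count = sum(row["outlier"] for row in examples)
print(outlier_count)  # 12, because exactly n examples are flagged
```

Because the flag is assigned to the top n ranked examples, the count of true values always equals n regardless of how the data is distributed.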