Detect Outlier (COF) (RapidMiner Studio Core)

Synopsis

This operator identifies outliers in the given ExampleSet based on the Class Outlier Factors (COF).

Description

The main concept of an ECODB (Enhanced Class Outlier - Distance Based) algorithm is to rank each instance in the ExampleSet given the parameters N (top N class outliers), and K (the number of nearest neighbors). The rank of each instance is found using the formula:

COF = PCL(T,K) - norm(deviation(T)) + norm(kDist(T))

PCL(T,K) is the Probability of the Class Label of the instance T with respect to the class labels of its K nearest neighbors.
norm(Deviation(T)) and norm(KDist(T)) are the normalized values of Deviation(T) and KDist(T) respectively and their values fall in the range [0 - 1].
Deviation(T) is how much the instance T deviates from instances of the same class. It is computed by summing the distances between the instance T and every instance belonging to the same class.
KDist(T) is the summation of the distance between the instance T and its K nearest neighbors.

This operator adds a new boolean attribute named 'outlier' to the given ExampleSet. If the value of this attribute is true, that example is an outlier and vice versa. Another special attribute 'COF Factor' is also added to the ExampleSet. This attribute measures the degree of being Class Outlier for an example.

An outlier is an example that is numerically distant from the rest of the examples of the ExampleSet. An outlying example is one that appears to deviate markedly from other examples of the ExampleSet. Outliers are often (not always) indicative of measurement error. In this case such examples should be discarded.

Input

example set input (Data Table)
This input port expects an ExampleSet. It is the output of the Generate Data operator in the attached Example Process. The output of other operators can also be used as input.

Output

example set output (Data Table)
A new boolean attribute 'outlier' and a real attribute 'COF Factor' is added to the given ExampleSet and the ExampleSet is delivered through this output port.
original (Data Table)
The ExampleSet that was given as input is passed without changing to the output through this port. This is usually used to reuse the same ExampleSet in further operators or to view the ExampleSet in the Results Workspace.

Parameters

number_of_neighborsThis parameter specifies the k value for the k nearest neighbors to be the analyzed. The minimum and maximum values for this parameter are 1 and 1 million respectively. Range: integer
number_of_class_outliersThis parameter specifies the number of top-n Class Outliers to be looked for. The resultant ExampleSet will have n number of examples that are considered outliers. The minimum and maximum values for this parameter are 2 and 1 million respectively. Range: integer
measure_typesThis parameter is used for selecting the type of measure to be used for measuring the distance between points.The following options are available: mixed measures, nominal measures, numerical measures and Bregman divergences. Range: selection
mixed_measureThis parameter is available when the measure type parameter is set to 'mixed measures'. The only available option is the 'Mixed Euclidean Distance' Range: selection
nominal_measureThis parameter is available when the measure type parameter is set to 'nominal measures'. This option cannot be applied if the input ExampleSet has numerical attributes. In this case the 'numerical measure' option should be selected. Range: selection
numerical_measureThis parameter is available when the measure type parameter is set to 'numerical measures'. This option cannot be applied if the input ExampleSet has nominal attributes. If the input ExampleSet has nominal attributes the 'nominal measure' option should be selected. Range: selection
divergenceThis parameter is available when the measure type parameter is set to 'Bregman divergences'. Range: selection
kernel_typeThis parameter is only available when the numerical measure parameter is set to 'Kernel Euclidean Distance'. The type of the kernel function is selected through this parameter. Following kernel types are supported:
- dot: The dot kernel is defined by k(x,y)=x*y i.e.it is inner product of x and y.
- radial: The radial kernel is defined by exp(-g ||x-y||^2) where g is the gamma that is specified by the kernel gamma parameter. The adjustable parameter gamma plays a major role in the performance of the kernel, and should be carefully tuned to the problem at hand.
- polynomial: The polynomial kernel is defined by k(x,y)=(x*y+1)^d where d is the degree of the polynomial and it is specified by the kernel degree parameter. The Polynomial kernels are well suited for problems where all the training data is normalized.
- neural: The neural kernel is defined by a two layered neural net tanh(a x*y+b) where a is alpha and b is the intercept constant. These parameters can be adjusted using the kernel a and kernel b parameters. A common value for alpha is 1/N, where N is the data dimension. Note that not all choices of a and b lead to a valid kernel function.
- sigmoid: This is the sigmoid kernel. Please note that the sigmoid kernel is not valid under some parameters.
- anova: This is the anova kernel. It has adjustable parameters gamma and degree.
- epachnenikov: The Epanechnikov kernel is this function (3/4)(1-u2) for u between -1 and 1 and zero for u outside that range. It has two adjustable parameters kernel sigma1 and kernel degree.
- gaussian_combination: This is the gaussian combination kernel. It has adjustable parameters kernel sigma1, kernel sigma2 and kernel sigma3.
- multiquadric: The multiquadric kernel is defined by the square root of ||x-y||^2 + c^2. It has adjustable parameters kernel sigma1 and kernel sigma shift.
Range: selection
kernel_gammaThis is the SVM kernel parameter gamma. This parameter is only available when the numerical measure parameter is set to 'Kernel Euclidean Distance' and the kernel type parameter is set to radial or anova. Range: real
kernel_sigma1This is the SVM kernel parameter sigma1. This parameter is only available when the numerical measure parameter is set to 'Kernel Euclidean Distance' and the kernel type parameter is set to epachnenikov, gaussian combination or multiquadric. Range: real
kernel_sigma2This is the SVM kernel parameter sigma2. This parameter is only available when the numerical measure parameter is set to 'Kernel Euclidean Distance' and the kernel type parameter is set to gaussian combination. Range: real
kernel_sigma3This is the SVM kernel parameter sigma3. This parameter is only available when the numerical measure parameter is set to 'Kernel Euclidean Distance' and the kernel type parameter is set to gaussian combination. Range: real
kernel_shiftThis is the SVM kernel parameter shift. This parameter is only available when the numerical measure parameter is set to 'Kernel Euclidean Distance' and the kernel type parameter is set to multiquadric. Range: real
kernel_degreeThis is the SVM kernel parameter degree. This parameter is only available when the numerical measure parameter is set to 'Kernel Euclidean Distance' and the kernel type parameter is set to polynomial, anova or epachnenikov. Range: real
kernel_aThis is the SVM kernel parameter a. This parameter is only available when the numerical measure parameter is set to 'Kernel Euclidean Distance' and the kernel type parameter is set to neural. Range: real
kernel_bThis is the SVM kernel parameter b. This parameter is only available when the numerical measure parameter is set to 'Kernel Euclidean Distance' and the kernel type parameter is set to neural. Range: real

Tutorial Processes

Detecting outliers from an ExampleSet

The Generate Data operator is used for generating an ExampleSet. The target function parameter is set to 'gaussian mixture clusters'. The number examples and number of attributes parameters are set to 200 and 2 respectively. A breakpoint is inserted here so that you can view the ExampleSet in the Results Workspace. A good plot of the ExampleSet can be seen by switching to the 'Plot View' tab. Set Plotter to 'Scatter', x-Axis to 'att1' and y-Axis to 'att2' to view the scatter plot of the ExampleSet.

The Detect Outlier (COF) operator is applied on the ExampleSet. The number of neighbors and number of class outliers parameters are set to 7. The resultant ExampleSet can be viewed in the Results Workspace. For better understanding, switch to the 'Plot View' tab. Set Plotter to 'Scatter', x-Axis to 'att1', y-Axis to 'att2' and Color Column to 'outlier' to view the scatter plot of the ExampleSet (the outliers are marked red).