k-NN (RapidMiner Studio Core)

Synopsis

This Operator generates a k-Nearest Neighbor model, which is used for classification or regression.

Description

The k-Nearest Neighbor algorithm is based on comparing an unknown Example with the k training Examples which are the nearest neighbors of the unknown Example.

The first step of the application of the k-Nearest Neighbor algorithm on a new Example is to find the k closest training Examples. "Closeness" is defined in terms of a distance in the n-dimensional space, defined by the n Attributes in the training ExampleSet.

Different metrices, such as the Euclidean distance, can be used to calculate the distance between the unknown Example and the training Examples. Due to the fact that distances often depends on absolut values, it is recommended to normalize data before training and applying the k-Nearest Neighbor algorithm. The metric used and its exact configuration are defined by the parameters of the Operator.

In the second step, the k-Nearest Neighbor algorithm classify the unknown Example by a majority vote of the found neighbors. In case of a regression, the predicted value is the average of the values of the found neighbors.

It can be useful to weight the contributions of the neighbors, so that the nearer neighbors contribute more to the average than the more distant ones.

Differentiation

k-Means

This Operator performs clustering on an unlabeled data set. It can be considered as the equivalent of the k-NN algorithm for unsupervised learning. A cluster calculated by the k-Means algorithm consists of similar Examples, where the similarity is calculated by a given measure in the Attribute space.

Input

  • training set (Data Table)

    This input port expects an ExampleSet.

Output

  • model (Model)

    The k-Nearest Neighbor model is delivered from this output port. This model can now be applied on unseen data sets for prediction of the label Attribute.

  • example set (Data Table)

    The ExampleSet that was given as input is passed through without changes.

Parameters

  • k

    Finding the k training Examples that are closest to the unknown Example is the first step of the k-NN algorithm. If k = 1, the Example is simply assigned to the class of its nearest neighbor. k is typically a small, positive and odd integer.

    Range:
  • weighted_vote

    If this parameter is set, the distance values between the Examples are also taken into account for the prediction. It can be useful to weight the contributions of the neighbors, so that nearer neighbors contribute more than more distant ones.

    Range:
  • measure_types

    This parameter is used for selecting the type of measure to be used for finding the nearest neighbors. The following options are available:

    • MixedMeasures: Mixed Measures are used to calculate distances in case of both nominal and numerical Attributes.
    • NominalMeasures: In case of only nominal Attributes different distance metrices can be used to calculate distances on this nominal Attributes.
    • NumericalMeasures: In case of only numerical Attributes different distance metrices can be used to calculate distances on this numerical Attributes.
    • BregmannDivergences: Bregmann divergences are more generic "closeness" measure types with does not satisfy the triangle inequality or symmetry. For more details see the parameter divergence.
    Range:
  • mixed_measure

    The only available option for mixed measure is the 'Mixed Euclidean Distance'. For numerical values the euclidean distance is calculated. For nomimal values, a distance of 0 is taken if both values are the same and a distance of one is taken otherwise. This parameter is available when the measure type parameter is set to 'mixed measures'.

    Range:
  • nominal_measure

    This parameters defines how to calculate distances for only nominal Attributes in the input ExampleSet, in case the measure type is set to nominal measure. In case of using a similarity as a distance measure, the actual distance is calculated as the negative similarity. For the different similarities the following variables are defined:

    e: number of Attribute for which both Examples have equal and non-zero values

    u: number of Attribute for which both Examples have not equal values

    z: number of Attribute for which both Examples have zero values

    • NominalDistance: Distance of two values is 0 if both values are the same and 1 otherwise.
    • DiceSimilarity: With the above mentioned defintions the DiceSimilarity is: 2*e/(2*e+u)
    • JaccardSimilarity: With the above mentioned defintions the JaccardSimilarity is: e/(e+u)
    • KulczynskiSimilarity: With the above mentioned defintions the KulczynskiSimilarity is: e/u
    • RogersTanimotoSimilarity: With the above mentioned defintions the RogersTanimotoSimilarity is: (e+z)/(e+2*u+z)
    • RussellRaoSimilarity: With the above mentioned defintions the RussellRaoSimilarity is: e/(e+u+z)
    • SimpleMatchingSimilarity: With the above mentioned defintions the SimpleMatchingSimilarity is: (e+z)/(e+u+z)
    Range:
  • numerical_measure

    This parameters defines how to calculate distances for only numerical Attributes in the input ExampleSet, in case the measure type is set to numerical measure. For the different distance measures the following variable is defined:

    y(i,j) : Value of the j.th Attribute of the i.th Example. Hence y(1,3) - y(2,3) is the difference of the values of the third Attribute of the first and second Example.

    In case of using a similarity as a distance measure, the actual distance is calculated as the negative similarity.

    • EuclideanDistance: Square root of the sum of quadratic differences over all Attributes. Dist = Sqrt ( Sum_(j=1) [y(1,j)-y(2,j)]^2 )
    • CanberraDistance: Sum over all Attributes. The summand is the absolut of the difference of the value, divided by the sum of the absolute values. Dist = Sum_(j=1) |y(1,j)-y(2,j)| / (|y(1,j)|+|y(2,j)|) The CanberraDistance is often used to compare ranked list or for intrusion detection in computer security.
    • ChebychevDistance: Maximum of all differences of all Attributes. Dist = max_(j=1) (|y(1,j)-y(2,j)|)
    • CorrelationSimilarity: The similarity is calculated as the correlation between the Attribute vectors of the two Examples.
    • CosineSimilarity: Similarity measure measuring the cosine of the angle between the Attribute vectors of the two Examples.
    • DiceSimilarity: The DiceSimilarity for numerical Attributes is calculated as 2*Y1Y2/(Y1+Y2). Y1Y2 = Sum over product of values = Sum_(j=1) y(1,j)*y(2,j). Y1 = Sum over values of first Example = Sum_(j=1) y(1,j) Y2 = Sum over values of second Example = Sum_(j=1) y(2,j)
    • DynamicTimeWarpingDistance: Dynamic Time Warping is often use in Time Series analysis for measuring the distance between two temporal sequences. Here the distance on an optimal "warping" path from the Attribute vector of the first Example to the second Example is calculated.
    • InnerProductSimilarity: The similarity is calculated as the sum of the product of the Attribute vectors of the two Examples. Dist = -Similarity = -Sum_(j=1) y(1,j)*y(2,j)
    • JaccardSimilarity: The JaccardSimilarity is calculated as Y1Y2/(Y1+Y2-Y1Y2). See DiceSimilarity for the definition of Y1Y2, Y1 and Y2.
    • KernelEuclideanDistance: The distance is calculated by the euclidean distance of the two Examples, in a transformed space. The transformation is defined by the chosen kernel and configured by the parameters kernel type, gamma, sigma1, sigma2, sigma 3, shift, degree, a, b.
    • ManhattanDistance: Sum of the absolute distances of the Attribute values. Dist = Sum_(j=1) |y(1,j)-y(2,j)|
    • MaxProductSimilarity: The similarity is the maximum of all products of all Attribute values. If the maximum is less or equal to zero the similarity is not defined. Dist = -Similarity = -max_(j=1) (y(1,j)*y(2,j))
    • OverlapSimilarity: The similarity is a variant of simple matching for numerical Attributes and is calculated as minY1Y2 / min(Y1,Y2). See DiceSimilarity for the definition of Y1, Y2. minY1Y2 = Sum over the minimum of values = Sum_(j=1) min [y(1,j),y(2,j)] .
    Range:
  • divergence

    This parameter defines which type of Bregmann divergence is used when the measure type parameter is set to 'bregman divergences'. For the different distance measures the following variable is defined:

    y(i,j) : Value of the j.th Attribute of the i.th Example. Hence y(1,3) - y(2,3) is the difference of the values of the third Attribute of the first and second Example.

    • GeneralizedIDivergence: The distance is caluclated as Sum1 Sum2. It is not applicable if any Attribute value is less or equal to 0. Sum1 = Sum_(j=1) y(1,j)*ln[y(1,j)/y(2,j)] Sum2 = Sum_(j=1) [y(1,j)-y(2,j)]
    • ItakuraSaitoDistance: The ItakuraSaitoDistance can only be calculated for ExampleSets with 1 Attribute and values larger 0. Dist = y(1,1)/y(2,1)-ln[y(1,1)/y(2,1)]-1
    • KLDivergence: The Kullback-Leibler divergence is a measure of how one probability distribution diverges from a second expected probability distribution. Dist = Sum_(j=1) [y(1,j)*log_2(y(1,j)/y(2,j))]
    • LogarithmicLoss: The LogarithmicLoss can only be calculated for ExampleSets with 1 Attribute and values larger 0. Dist = y(1,1)*ln[y(1,1)/y(2,1)]-(y(1,1)-y(2,1))
    • LogisticLoss: The LogisticLoss can only be calculated for ExampleSets with 1 Attribute and values larger 0. Dist = y(1,1)*ln[y(1,1)/y(2,1)]+(1-y(1,1))*ln[(1-y(1,1))/(1-y(2,1))]
    • MahalanobisDistance: The Mahalanobis distance measures the distance between the two Examples under the assumption they are both random vectors of the same distribution. Therefore the covariance matrix S is calculated on the whole ExampleSet and the Distance is calculated as: Dist = Sqrt [ (vecY1-vecY2) S (vecY1-vecY2) ] vecY1 = Attribute vector of Example 1 vecY2 = Attribute vector of Example 2
    • SquaredEuclideanDistance: Sum of quadratic differences over all Attributes. Dist = Sum_(j=1) [y(1,j)-y(2,j)]^2
    • SquaredLoss: The SquaredLoss can only be calculated for ExampleSets with 1 Attribute. Dist = [y(1,1)-y(2,1)]^2
    Range:
  • kernel_type

    This parameter is available only when the numerical measure parameter is set to 'Kernel Euclidean Distance'. The type of the kernel function is selected through this parameter. Following kernel types are supported:

    • dot: The dot kernel is defined by k(x,y) = x*y i.e. it is the inner product of x and y.
    • radial: The radial kernel is defined by k(x,y) = exp(-g*||x-y||^2) where g is gamma, specified by the kernel gamma parameter. The adjustable parameter gamma plays a major role in the performance of the kernel, and should be carefully tuned to the problem at hand.
    • polynomial: The polynomial kernel is defined by k(x,y) = (x*y+1)^d where d is the degree of the polynomial and is specified by the _kernel degree_ parameter. Polynomial kernels are well suited for problems where all the training data is normalized.
    • sigmoid: This is a hyperbolic tangent sigmoid kernel. The distance is calculated as tanh[a*Y1Y2+b] where Y1Y2 is the inner product of the Attribute vector of the two Examples. a and b can be adjusted using the kernel a and kernel b parameters. A common value for a is 1/N, where N is the data dimension. Note that not all choices of a and b lead to a valid kernel function.
    • anova: The anova kernel is defined by the raised to the power d of summation of exp(-g(x-y)) where g is gamma and d is degree. The two are adjusted by the kernel gamma and kernel degree parameters respectively.
    • epachnenikov: The Epanechnikov kernel is this function (3/4)(1-u2) for u between -1 and 1 and zero for u outside that range. It has the two adjustable parameters kernel sigma1 and kernel degree.
    • gaussian_combination: This is the gaussian combination kernel. It has the adjustable parameters kernel sigma1, kernel sigma2 and kernel sigma3.
    • multiquadric: The multiquadric kernel is defined by the square root of ||x-y||^2+c^2. It has the adjustable parameters kernel sigma1 and kernel sigma shift.
    Range:
  • kernel_gamma

    This is the SVM kernel parameter gamma. This parameter is only available when the numerical measure parameter is set to 'Kernel Euclidean Distance' and the kernel type parameter is set to radial or anova.

    Range:
  • kernel_sigma1

    This is the SVM kernel parameter sigma1. This parameter is only available when the numerical measure parameter is set to 'Kernel Euclidean Distance' and the kernel type parameter is set to epachnenikov, gaussian combination or multiquadric.

    Range:
  • kernel_sigma2

    This is the SVM kernel parameter sigma2. This parameter is only available when the numerical measure parameter is set to 'Kernel Euclidean Distance' and the kernel type parameter is set to gaussian combination.

    Range:
  • kernel_sigma3

    This is the SVM kernel parameter sigma3. This parameter is only available when the numerical measure parameter is set to 'Kernel Euclidean Distance' and the kernel type parameter is set to gaussian combination.

    Range:
  • kernel_shift

    This is the SVM kernel parameter shift. This parameter is only available when the numerical measure parameter is set to 'Kernel Euclidean Distance' and the kernel type parameter is set to multiquadric.

    Range:
  • kernel_degree

    This is the SVM kernel parameter degree. This parameter is only available when the numerical measure parameter is set to 'Kernel Euclidean Distance' and the kernel type parameter is set to polynomial, anova or epachnenikov.

    Range:
  • kernel_a

    This is the SVM kernel parameter a. This parameter is only available when the numerical measure parameter is set to 'Kernel Euclidean Distance' and the kernel type parameter is set to neural.

    Range:
  • kernel_b

    This is the SVM kernel parameter b. This parameter is only available when the numerical measure parameter is set to 'Kernel Euclidean Distance' and the kernel type parameter is set to neural.

    Range:

Tutorial Processes

Classification of the Golf-Testset data set using the K-NN Operator

This tutorial process demonstrate the usage of the k-NN Operator to classify the Golf-Testset from the Samples folder. The k-NN is trained on the Golf data set and applied to the independent Golf-Testset. Both data sets are retrieved from the Samples folder.

k is set to 3, so each Example in the Golf-Testset is classified as the majority class of its 3 nearest neighbors in the training set. For the distance measure the MixedEuclideanDistance (which is the default setting) is used.

For more details, see the comments in the process.

Performance difference between normalized and not normalized data

This tutorial process demonstrate the enhancement of the performance of the k-NN in case the data is normalized before the k-NN is trained.

The Sonar data set is retrieved from the Samples folder and fed into two Cross Validation Operators. In both Cross Validations a k-NN model with k=3 is trained. But in the second the training data is normalized before the training of the k-NN. A Group Models Operator ensures that the test data is normalized in the same way, before the k-NN is applied on the test data.

Note that the accuracy of this k-NN is higher for the normalized values (84.24 % > 81.69 %).

Optimizing k of a k-NN model and logging all results

In this tutorial process the parameter k of a k-NN model is optimized and results are logged to investigate the dependency of the performance on the parameter.

The Sonar data set is retrieved from the Samples folder. An Optimized Parameters (Grid) Operator is used to iterate over k (from 1 to 30) and calculate the best performing k-NN model. A Log Operator is used within the Cross Validation Operator in the Optimize Parameters (Grid) Operator to log the parameter k and the corresponding performance. The Log to Data Operator and the Aggregate Operator are used to convert this log entries into an ExampleSet, which can be investigated in the results view.