K-NN (RapidMiner Studio Core)

Synopsis

This operator generates a k-Nearest Neighbor model from the input ExampleSet. This model can be a classification or regression model depending on the input ExampleSet.

Description

The k-Nearest Neighbor algorithm is based on learning by analogy, that is, by comparing a given test example with training examples that are similar to it. The training examples are described by n attributes. Each example represents a point in an n-dimensional space. In this way, all of the training examples are stored in an n-dimensional pattern space. When given an unknown example, a k-nearest neighbor algorithm searches the pattern space for the k training examples that are closest to the unknown example. These k training examples are the k "nearest neighbors" of the unknown example. "Closeness" is defined in terms of a distance metric, such as the Euclidean distance.

The k-nearest neighbor algorithm is amongst the simplest of all machine learning algorithms: an example is classified by a majority vote of its neighbors, with the example being assigned to the class most common amongst its k nearest neighbors (k is a positive integer, typically small). If k = 1, then the example is simply assigned to the class of its nearest neighbor.The same method can be used for regression, by simply assigning the label value for the example to be the average of the values of its k nearest neighbors. It can be useful to weight the contributions of the neighbors, so that the nearer neighbors contribute more to the average than the more distant ones.

The neighbors are taken from a set of examples for which the correct classification (or, in the case of regression, the value of the label) is known. This can be thought of as the training set for the algorithm, though no explicit training step is required.

The basic k-Nearest Neighbor algorithm is composed of two steps: Find the k training examples that are closest to the unseen example. Take the most commonly occurring classification for these k examples (or, in the case of regression, take the average of these k label values).

Input

  • training set (Data Table)

    This input port expects an ExampleSet. It is output of the Select Attributes operator in the attached Example Processes. Output of other operators can also be used as input.

Output

  • model

    The k-Nearest Neighbor model is delivered from this output port. This model can now be applied on unseen data sets for prediction of the label attribute.

  • example set (Data Table)

    The ExampleSet that was given as input is passed without changing to the output through this port. This is usually used to reuse the same ExampleSet in further operators or to view the ExampleSet in the Results Workspace.

Parameters

  • kFinding the k training examples that are closest to the unseen example is the first step of the k-NN algorithm. If k = 1, the example is simply assigned to the class of its nearest neighbor. k is a typically small positive integer. Mostly k is provided with an odd integer value. Range: integer
  • weighted_voteIf this parameter is set, the distance values between the examples are also taken into account. It can be useful to weight the contributions of the neighbors, so that the nearer neighbors contribute more than the more distant ones. Range: boolean
  • measure_typesThis parameter is used for selecting the type of measure to be used for finding the nearest neighbors.The following options are available: mixed measures, nominal measures, numerical measures and Bregman divergences. Range: selection
  • mixed_measureThis parameter is available when the measure type parameter is set to 'mixed measures'. The only available option is the 'Mixed Euclidean Distance' Range: selection
  • nominal_measureThis parameter is available when the measure type parameter is set to 'nominal measures'. This option cannot be applied if the input ExampleSet has numerical attributes. Otherwise the 'numerical measure' option should be selected. Range: selection
  • numerical_measureThis parameter is available when the measure type parameter is set to 'numerical measures'. This option cannot be applied if the input ExampleSet has nominal attributes. Otherwise the 'nominal measure' option should be selected. Range: selection
  • divergenceThis parameter is available when the measure type parameter is set to 'bregman divergences'. Range: selection
  • kernel_typeThis parameter is available only when the numerical measure parameter is set to 'Kernel Euclidean Distance'. The type of the kernel function is selected through this parameter. Following kernel types are supported:
    • dot: The dot kernel is defined by k(x,y)=x*y i.e.it is the inner product of x and y.
    • radial: The radial kernel is defined by exp(-g ||x-y||^2) where g is the gamma, it is specified by the kernel gamma parameter. The adjustable parameter gamma plays a major role in the performance of the kernel, and should be carefully tuned to the problem at hand.
    • polynomial: The polynomial kernel is defined by k(x,y)=(x*y+1)^d where d is the degree of the polynomial and it is specified by the kernel degree parameter. The Polynomial kernels are well suited for problems where all the training data is normalized.
    • neural: The neural kernel is defined by a two layered neural net tanh(a x*y+b) where a is alpha and b is the intercept constant. These parameters can be adjusted using the kernel a and kernel b parameters. A common value for alpha is 1/N, where N is the data dimension. Note that not all choices of a and b lead to a valid kernel function.
    • sigmoid: This is the sigmoid kernel. Please note that it is not valid under some parameters.
    • anova: The anova kernel is defined by the raised to the power d of summation of exp(-g (x-y)) where g is gamma and d is degree. The two are adjusted by the kernel gamma and kernel degree parameters respectively.
    • epachnenikov: The Epanechnikov kernel is this function (3/4)(1-u2) for u between -1 and 1 and zero for u outside that range. It has the two adjustable parameters kernel sigma1 and kernel degree.
    • gaussian_combination: This is the gaussian combination kernel. It has the adjustable parameters kernel sigma1, kernel sigma2 and kernel sigma3.
    • multiquadric: The multiquadric kernel is defined by the square root of ||x-y||^2 + c^2. It has the adjustable parameters kernel sigma1 and kernel sigma shift.
    Range: selection
  • kernel_gammaThis is the SVM kernel parameter gamma.This parameter is only available when the numerical measure parameter is set to 'Kernel Euclidean Distance' and the kernel type parameter is set to radial or anova. Range: real
  • kernel_sigma1This is the SVM kernel parameter sigma1. This parameter is only available when the numerical measure parameter is set to 'Kernel Euclidean Distance' and the kernel type parameter is set to epachnenikov, gaussian combination or multiquadric. Range: real
  • kernel_sigma2This is the SVM kernel parameter sigma2. This parameter is only available when the numerical measure parameter is set to 'Kernel Euclidean Distance' and the kernel type parameter is set to gaussian combination. Range: real
  • kernel_sigma3This is the SVM kernel parameter sigma3. This parameter is only available when the numerical measure parameter is set to 'Kernel Euclidean Distance' and the kernel type parameter is set to gaussian combination. Range: real
  • kernel_shiftThis is the SVM kernel parameter shift. This parameter is only available when the numerical measure parameter is set to 'Kernel Euclidean Distance' and the kernel type parameter is set to multiquadric. Range: real
  • kernel_degreeThis is the SVM kernel parameter degree. This parameter is only available when the numerical measure parameter is set to 'Kernel Euclidean Distance' and the kernel type parameter is set to polynomial, anova or epachnenikov. Range: real
  • kernel_aThis is the SVM kernel parameter a. This parameter is only available when the numerical measure parameter is set to 'Kernel Euclidean Distance' and the kernel type parameter is set to neural. Range: real
  • kernel_bThis is the SVM kernel parameter b. This parameter is only available when the numerical measure parameter is set to 'Kernel Euclidean Distance' and the kernel type parameter is set to neural. Range: real

Tutorial Processes

Classification of the 'Golf-Testset' data set using the K-NN operator

The 'Golf' data set is loaded using the Retrieve operator. The Select Attributes operator is applied on it to select just the 'Play' and 'Temperature' attributes to simplify this example process. Then the K-NN operator is applied on it. All the parameters of the K-NN operator are used with default values. The resultant classification model is applied on the 'Golf-Testset' data set using the Apply Model operator. Note that the same two attributes of the 'Golf-Testset' data set were selected before the application of the classification model on it.

Run the process. You can see the 'Golf' data set and the labeled 'Golf-Testset' data set in the Results Workspace. As the k parameter was set to 1, each example of the 'Golf-Testset' data set is simply assigned the class of its nearest neighbor in the 'Golf' data set. To understand how examples were classified, simply select an example in the 'Golf-Testset' data set and note the value of the 'Temperature' attribute of that example, we call it t1 here. Now, have a look at the 'Golf' data set and find the example where the 'Temperature' value is closest to t1. The example in the 'Golf-Testset' data set is assigned the same class as the class of this example in the 'Golf' data set. For example let us consider the first example of the 'Golf-Testset' data set. The value of the 'Temperature' attribute is 85 in this example. Now find the example in the 'Golf' data set where 'Temperature' value is closest to 85. As you can see that the first example of the 'Golf' data set has a 'Temperature' value equal to 85. This example is labeled 'no', thus the corresponding example in the 'Golf-Testset' data set is also predicted as 'no'.

Regression of the Polynomial data set using the K-NN operator

The 'Polynomial' data set is loaded using the Retrieve operator. The Filter Example Range operator is applied on it to select just the first 100 examples. Then the Select Attributes operator is applied on it to select just the 'label' and 'a1' attributes to simplify this example process. Then the K-NN operator is applied on it. The k parameter is set to 3, the measure types parameter is set to 'Numerical Measures' and the numerical measure parameter is set to 'Euclidean Distance'. The resultant regression model is applied on the last 100 examples of the 'Polynomial' data set using the Apply Model operator. Note that the same two attributes of the 'Polynomial' data set were selected before the application of the regression model on it.

Run the process. You can see the 'Polynomial' data set (first 100 examples) and the labeled 'Polynomial' data set (last 100 examples) in the Results Workspace. For convenience we call these data sets as 'Polynomial'_first and 'Polynomial'_last data sets respectively. As the k parameter was set to 3, each example of the 'Polynomial'_last data set is simply assigned the average label value of its 3 nearest neighbors in the 'Polynomial'_first data set. To understand how regression was performed, simply select an example in the 'Polynomial'_last data set and note the value of the 'a1' attribute of that example, we call it a1_val here. Now, have a look at the 'Polynomial'_first data set and find 3 examples where the 'a1' attribute value is closest to a1_val. The 'label' attribute of the example of the 'Polynomial'_last data set is assigned the average of these three label values of the 'Polynomial'_first data set. For example let us consider the last example of the 'Polynomial'_last data set. The value of the 'a1' attribute is 4.788 in this example. Now find 3 examples in the 'Polynomial'_first data set where the value of the 'a1' attribute is closest to 4.788. These 3 examples are at Row No. 65, 71 and 86. The value of the 'label' attribute of these examples is 41.798, 124.250 and 371.814 respectively. The average of these three 'label' values is 179.288. Thus the value of the 'label' attribute in the last example of the 'Polynomial'_last data set is predicted to be 179.288.