k-Means (RapidMiner Studio Core)

Synopsis

This Operator performs clustering using the k-means algorithm.

Description

This Operator performs clustering using the k-means algorithm. Clustering groups Examples together which are similar to each other. As no Label Attribute is necessary, Clustering can be used on unlabelled data and is an algorithm of unsupervised machine learning.

The k-means algorithm determines a set of k clusters and assignes each Examples to exact one cluster. The clusters consist of similar Examples. The similarity between Examples is based on a distance measure between them.

A cluster in the k-means algorithm is determined by the position of the center in the n-dimensional space of the n Attributes of the ExampleSet. This position is called centroid. It can, but do not have to be the position of an Example of the ExampleSets.

The k-means algorithm starts with k points which are treated as the centroid of k potential clusters. These start points are either the position of k randomly drawn Examples of the input ExampleSet, or are determined by the k-means++ heuristic if determine good start values is set to true.

All Examples are assigned to their nearest cluster (nearest is defined by the measure type). Next the centroids of the clusters are recalculated by averaging over all Examples of one cluster. The previous steps are repeated for the new centroids until the the centroids no longer move or max optimization steps is reached. Be aware that it is not ensured that the k-means algorithm converges if the measure type is not based on Euclidean Distance calculation (cause the recalculation of the centroids by averaging is assuming Euclidean space).

The procedure is repeated max runs times with each time a different set of start points. The set of clusters is delivered which has the minimal sum of squared distances of all Examples to their corresponding centroids.

Differentiation

k-Medoids

In case of the k-medoids algorithm the centroid of a cluster will always be one of the points in the cluster. This is the major difference between the k-means and k-medoids algorithm.

k-Means (Kernel)

Kernel k-means uses kernels to estimate distances between Examples and clusters. Because of the nature of kernels it is necessary to sum over all Examples of a cluster to calculate one distance. So this algorithm is quadratic in number of Examples and does not return a Centroid Cluster Model (on the contrary the K-Means operator returns a Centroid Cluster Model).

Input

  • example set input (Data Table)

    This input port expects an ExampleSet.

Output

  • cluster model (Centroid Cluster Model)

    This port delivers the cluster model. It contains the information which Examples are part of which cluster. It also stores the position of the centroids of the clusters. It can be used by the Apply Model Operator to perform the specified clustering on another ExampleSet. The cluster model can also be grouped together with other clustering models, preprocessing models and learning models by the Group Models Operator.

  • clustered set (Data Table)

    An Attribute 'id' with special role 'Id' is added to the input ExampleSet to distinguish Examples. Depending on the add cluster attribute and the add as label parameters an Attribute 'cluster' with special role 'Cluster' or 'Label' is also added. The resulting ExampleSet is delivered at this output port.

Parameters

  • add_cluster_attribute

    If true, a new Attribute called 'cluster' with the cluster_id for each Example is generated. By default the Attribute is created with the special role 'Cluster', except the parameter add as label is true. If so the new Attribute is called 'label' and has the special role 'Label'. The Attribute can also be generated later by using the Apply Model Operator.

    Range:
  • add_as_label

    If true the new Attribute with the cluster_id is called 'label' and has the special role 'Label'. If the parameter add cluster attribute is false, no new Attribute is created.

    Range:
  • remove_unlabeled

    If set to true, Examples which cannot be assigned to a cluster are removed from the output ExampleSet.

    Range:
  • k

    This parameter specifies the number of clusters to deterime.

    Range:
  • max_runs

    This parameter specifies the maximal number of runs of k-Means with random initialization of the start points that are performed.

    Range:
  • determine_good_start_values

    If true the k start points are determined using the k-means++ heuristic, descriped in "k-means++: The Advantages of Careful Seeding" by David Arthur and Sergei Vassilvitskii 2007.

    Range:
  • max_optimization_steps

    This parameter specifies the maximal number of iterations performed for one run of k-Means.

    Range:
  • measure_types

    This parameter is used for selecting the type of measure to be used for finding the nearest neighbors. The following options are available:

    • MixedMeasures: Mixed Measures are used to calculate distances in case of both nominal and numerical Attributes.
    • NominalMeasures: In case of only nominal Attributes different distance metrices can be used to calculate distances on this nominal Attributes.
    • NumericalMeasures: In case of only numerical Attributes different distance metrices can be used to calculate distances on this numerical Attributes.
    • BregmannDivergences: Bregmann divergences are more generic "closeness" measure types with does not satisfy the triangle inequality or symmetry. For more details see the parameter divergence.
    Range:
  • mixed_measure

    The only available option for mixed measure is the 'Mixed Euclidean Distance'. For numerical values the euclidean distance is calculated. For nomimal values, a distance of 0 is taken if both values are the same and a distance of one is taken otherwise. This parameter is available when the measure type parameter is set to 'mixed measures'.

    Range:
  • nominal_measure

    This parameters defines how to calculate distances for only nominal Attributes in the input ExampleSet, in case the measure type is set to nominal measure. In case of using a similarity as a distance measure, the actual distance is calculated as the negative similarity. For the different similarities the following variables are defined:

    e: number of Attribute for which both Examples have equal and non-zero values

    u: number of Attribute for which both Examples have not equal values

    z: number of Attribute for which both Examples have zero values

    • NominalDistance: Distance of two values is 0 if both values are the same and 1 otherwise.
    • DiceSimilarity: With the above mentioned defintions the DiceSimilarity is: 2*e/(2*e+u)
    • JaccardSimilarity: With the above mentioned defintions the JaccardSimilarity is: e/(e+u)
    • KulczynskiSimilarity: With the above mentioned defintions the KulczynskiSimilarity is: e/u
    • RogersTanimotoSimilarity: With the above mentioned defintions the RogersTanimotoSimilarity is: (e+z)/(e+2*u+z)
    • RussellRaoSimilarity: With the above mentioned defintions the RussellRaoSimilarity is: e/(e+u+z)
    • SimpleMatchingSimilarity: With the above mentioned defintions the SimpleMatchingSimilarity is: (e+z)/(e+u+z)
    Range:
  • numerical_measure

    This parameters defines how to calculate distances for only numerical Attributes in the input ExampleSet, in case the measure type is set to numerical measure. For the different distance measures the following variable is defined:

    y(i,j) : Value of the j.th Attribute of the i.th Example. Hence y(1,3) - y(2,3) is the difference of the values of the third Attribute of the first and second Example.

    In case of using a similarity as a distance measure, the actual distance is calculated as the negative similarity.

    • EuclideanDistance: Square root of the sum of quadratic differences over all Attributes. Dist = Sqrt ( Sum_(j=1) [y(1,j)-y(2,j)]^2 )
    • CanberraDistance: Sum over all Attributes. The summand is the absolut of the difference of the value, divided by the sum of the absolute values. Dist = Sum_(j=1) |y(1,j)-y(2,j)| / (|y(1,j)|+|y(2,j)|) The CanberraDistance is often used to compare ranked list or for intrusion detection in computer security.
    • ChebychevDistance: Maximum of all differences of all Attributes. Dist = max_(j=1) (|y(1,j)-y(2,j)|)
    • CorrelationSimilarity: The similarity is calculated as the correlation between the Attribute vectors of the two Examples.
    • CosineSimilarity: Similarity measure measuring the cosine of the angle between the Attribute vectors of the two Examples.
    • DiceSimilarity: The DiceSimilarity for numerical Attributes is calculated as 2*Y1Y2/(Y1+Y2). Y1Y2 = Sum over product of values = Sum_(j=1) y(1,j)*y(2,j). Y1 = Sum over values of first Example = Sum_(j=1) y(1,j) Y2 = Sum over values of second Example = Sum_(j=1) y(2,j)
    • DynamicTimeWarpingDistance: Dynamic Time Warping is often use in Time Series analysis for measuring the distance between two temporal sequences. Here the distance on an optimal "warping" path from the Attribute vector of the first Example to the second Example is calculated.
    • InnerProductSimilarity: The similarity is calculated as the sum of the product of the Attribute vectors of the two Examples. Dist = -Similarity = -Sum_(j=1) y(1,j)*y(2,j)
    • JaccardSimilarity: The JaccardSimilarity is calculated as Y1Y2/(Y1+Y2-Y1Y2). See DiceSimilarity for the definition of Y1Y2, Y1 and Y2.
    • KernelEuclideanDistance: The distance is calculated by the euclidean distance of the two Examples, in a transformed space. The transformation is defined by the chosen kernel and configured by the parameters kernel type, gamma, sigma1, sigma2, sigma 3, shift, degree, a, b.
    • ManhattanDistance: Sum of the absolute distances of the Attribute values. Dist = Sum_(j=1) |y(1,j)-y(2,j)|
    • MaxProductSimilarity: The similarity is the maximum of all products of all Attribute values. If the maximum is less or equal to zero the similarity is not defined. Dist = -Similarity = -max_(j=1) (y(1,j)*y(2,j))
    • OverlapSimilarity: The similarity is a variant of simple matching for numerical Attributes and is calculated as minY1Y2 / min(Y1,Y2). See DiceSimilarity for the definition of Y1, Y2. minY1Y2 = Sum over the minimum of values = Sum_(j=1) min [y(1,j),y(2,j)] .
    Range:
  • divergence

    This parameter defines which type of Bregmann divergence is used when the measure type parameter is set to 'bregman divergences'. For the different distance measures the following variable is defined:

    y(i,j) : Value of the j.th Attribute of the i.th Example. Hence y(1,3) - y(2,3) is the difference of the values of the third Attribute of the first and second Example.

    • GeneralizedIDivergence: The distance is caluclated as Sum1 Sum2. It is not applicable if any Attribute value is less or equal to 0. Sum1 = Sum_(j=1) y(1,j)*ln[y(1,j)/y(2,j)] Sum2 = Sum_(j=1) [y(1,j)-y(2,j)]
    • ItakuraSaitoDistance: The ItakuraSaitoDistance can only be calculated for ExampleSets with 1 Attribute and values larger 0. Dist = y(1,1)/y(2,1)-ln[y(1,1)/y(2,1)]-1
    • KLDivergence: The Kullback-Leibler divergence is a measure of how one probability distribution diverges from a second expected probability distribution. Dist = Sum_(j=1) [y(1,j)*log_2(y(1,j)/y(2,j))]
    • LogarithmicLoss: The LogarithmicLoss can only be calculated for ExampleSets with 1 Attribute and values larger 0. Dist = y(1,1)*ln[y(1,1)/y(2,1)]-(y(1,1)-y(2,1))
    • LogisticLoss: The LogisticLoss can only be calculated for ExampleSets with 1 Attribute and values larger 0. Dist = y(1,1)*ln[y(1,1)/y(2,1)]+(1-y(1,1))*ln[(1-y(1,1))/(1-y(2,1))]
    • MahalanobisDistance: The Mahalanobis distance measures the distance between the two Examples under the assumption they are both random vectors of the same distribution. Therefore the covariance matrix S is calculated on the whole ExampleSet and the Distance is calculated as: Dist = Sqrt [ (vecY1-vecY2) S (vecY1-vecY2) ] vecY1 = Attribute vector of Example 1 vecY2 = Attribute vector of Example 2
    • SquaredEuclideanDistance: Sum of quadratic differences over all Attributes. Dist = Sum_(j=1) [y(1,j)-y(2,j)]^2
    • SquaredLoss: The SquaredLoss can only be calculated for ExampleSets with 1 Attribute. Dist = [y(1,1)-y(2,1)]^2
    Range:
  • kernel_type

    This parameter is available only when the numerical measure parameter is set to 'Kernel Euclidean Distance'. The type of the kernel function is selected through this parameter. Following kernel types are supported:

    • dot: The dot kernel is defined by k(x,y) = x*y i.e. it is the inner product of x and y.
    • radial: The radial kernel is defined by k(x,y) = exp(-g*||x-y||^2) where g is gamma, specified by the kernel gamma parameter. The adjustable parameter gamma plays a major role in the performance of the kernel, and should be carefully tuned to the problem at hand.
    • polynomial: The polynomial kernel is defined by k(x,y) = (x*y+1)^d where d is the degree of the polynomial and is specified by the _kernel degree_ parameter. Polynomial kernels are well suited for problems where all the training data is normalized.
    • sigmoid: This is a hyperbolic tangent sigmoid kernel. The distance is calculated as tanh[a*Y1Y2+b] where Y1Y2 is the inner product of the Attribute vector of the two Examples. a and b can be adjusted using the kernel a and kernel b parameters. A common value for a is 1/N, where N is the data dimension. Note that not all choices of a and b lead to a valid kernel function.
    • anova: The anova kernel is defined by the raised to the power d of summation of exp(-g(x-y)) where g is gamma and d is degree. The two are adjusted by the kernel gamma and kernel degree parameters respectively.
    • epachnenikov: The Epanechnikov kernel is this function (3/4)(1-u2) for u between -1 and 1 and zero for u outside that range. It has the two adjustable parameters kernel sigma1 and kernel degree.
    • gaussian_combination: This is the gaussian combination kernel. It has the adjustable parameters kernel sigma1, kernel sigma2 and kernel sigma3.
    • multiquadric: The multiquadric kernel is defined by the square root of ||x-y||^2+c^2. It has the adjustable parameters kernel sigma1 and kernel sigma shift.
    Range:
  • kernel_gamma

    This is the SVM kernel parameter gamma. This parameter is only available when the numerical measure parameter is set to 'Kernel Euclidean Distance' and the kernel type parameter is set to radial or anova.

    Range:
  • kernel_sigma1

    This is the SVM kernel parameter sigma1. This parameter is only available when the numerical measure parameter is set to 'Kernel Euclidean Distance' and the kernel type parameter is set to epachnenikov, gaussian combination or multiquadric.

    Range:
  • kernel_sigma2

    This is the SVM kernel parameter sigma2. This parameter is only available when the numerical measure parameter is set to 'Kernel Euclidean Distance' and the kernel type parameter is set to gaussian combination.

    Range:
  • kernel_sigma3

    This is the SVM kernel parameter sigma3. This parameter is only available when the numerical measure parameter is set to 'Kernel Euclidean Distance' and the kernel type parameter is set to gaussian combination.

    Range:
  • kernel_shift

    This is the SVM kernel parameter shift. This parameter is only available when the numerical measure parameter is set to 'Kernel Euclidean Distance' and the kernel type parameter is set to multiquadric.

    Range:
  • kernel_degree

    This is the SVM kernel parameter degree. This parameter is only available when the numerical measure parameter is set to 'Kernel Euclidean Distance' and the kernel type parameter is set to polynomial, anova or epachnenikov.

    Range:
  • kernel_a

    This is the SVM kernel parameter a. This parameter is only available when the numerical measure parameter is set to 'Kernel Euclidean Distance' and the kernel type parameter is set to neural.

    Range:
  • kernel_b

    This is the SVM kernel parameter b. This parameter is only available when the numerical measure parameter is set to 'Kernel Euclidean Distance' and the kernel type parameter is set to neural.

    Range:
  • use_local_random_seed

    This parameter indicates if a local random seed should be used for randomization of the k different starting points of the algorithm.

    Range:
  • local_random_seed

    If the use local random seed parameter is checked this parameter determines the local random seed.

    Range:

Tutorial Processes

Clustering of the Iris Data Set

In this tutorial process the Iris data set is clustered using the k-means Operator. The Iris data set is retrieved from the Samples folder. Also the label Attribute is also retrieved, but only for comparison of the cluster assignment with the class of the Examples. The label Attribute is not used in the Clustering. For more details see the comments in the Process.