K-Means (RapidMiner Studio Core)

Synopsis

This operator performs clustering using the k-means algorithm. Clustering is concerned with grouping together objects that are similar to each other and dissimilar to the objects belonging to other clusters; it is a technique for extracting information from unlabeled data. k-means is an exclusive clustering algorithm, i.e. each object is assigned to exactly one of a set of clusters.

Description

This operator performs clustering using the k-means algorithm. k-means is an exclusive clustering algorithm, i.e. each object is assigned to exactly one of a set of clusters. Objects in one cluster are similar to each other; the similarity between objects is based on a measure of the distance between them.

Clustering is concerned with grouping together objects that are similar to each other and dissimilar to the objects belonging to other clusters. It is a technique for extracting information from unlabeled data, and it can be very useful in many different scenarios, e.g. in a marketing application we may be interested in finding clusters of customers with similar buying behavior.

Here is a simple explanation of how the k-means algorithm works. First of all, we need to introduce the notion of the center of a cluster, generally called its centroid. Assuming that we are using Euclidean distance or a similar measure, we can define the centroid of a cluster as the point whose attribute values are the averages of the corresponding attribute values over all points in the cluster. The centroid will sometimes be one of the points in the cluster, but frequently it is an imaginary point, not part of the cluster itself, which we can take as marking its center.
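As a concrete illustration of this definition, here is a minimal Python sketch (not RapidMiner code; the function name is our own) that computes a centroid as the attribute-wise mean:

```python
def centroid(points):
    """Return the attribute-wise mean of a list of points.

    The result is frequently not one of the points itself, which is
    exactly the 'imaginary point' described above.
    """
    n = len(points)
    dims = len(points[0])
    return tuple(sum(p[d] for p in points) / n for d in range(dims))

# Three points of a hypothetical 2-attribute cluster:
cluster = [(1.0, 2.0), (3.0, 4.0), (5.0, 6.0)]
print(centroid(cluster))  # (3.0, 4.0) -- not a member of the cluster
```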

For this method of clustering we start by deciding how many clusters we would like to form from our data. We call this value k. The value of k is generally a small integer, such as 2, 3, 4 or 5, but may be larger.

We next select k points. These are treated as the centroids of k clusters, or more precisely as the centroids of k potential clusters, which at present have no members. We can select these points in any way we wish, but the method may work better if we pick k initial points that are fairly far apart. We now assign each point, one by one, to the cluster with the nearest centroid. When all the objects have been assigned we will have k clusters based on the original k centroids, but the centroids will no longer be the true centers of the clusters. Next we recalculate the centroids of the clusters, and then repeat the previous steps, assigning each object to the cluster with the nearest centroid, and so on. The entire algorithm can be summarized as follows:

1. Choose a value of k.
2. Select k objects in an arbitrary fashion; use these as the initial set of k centroids.
3. Assign each object to the cluster whose centroid is nearest.
4. Recalculate the centroids of the k clusters.
5. Repeat steps 3 and 4 until the centroids no longer move.
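The steps above can be sketched in a few lines of Python. This is an illustrative, self-contained implementation, not RapidMiner's actual code; the function and parameter names are hypothetical:

```python
import math
import random

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def centroid(points):
    n = len(points)
    return tuple(sum(p[d] for p in points) / n for d in range(len(points[0])))

def k_means(data, k, max_steps=100, seed=None):
    rng = random.Random(seed)
    centroids = rng.sample(data, k)          # steps 1-2: k arbitrary initial centroids
    clusters = [[] for _ in range(k)]
    for _ in range(max_steps):
        # step 3: assign each object to the nearest centroid
        clusters = [[] for _ in range(k)]
        for point in data:
            nearest = min(range(k), key=lambda i: euclidean(point, centroids[i]))
            clusters[nearest].append(point)
        # step 4: recalculate centroids (keep the old one if a cluster is empty)
        new_centroids = [centroid(c) if c else centroids[i]
                         for i, c in enumerate(clusters)]
        if new_centroids == centroids:       # step 5: stop when centroids no longer move
            break
        centroids = new_centroids
    return centroids, clusters
```

For example, running `k_means` with k = 2 on six points forming two well-separated groups recovers those two groups as the clusters.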

Differentiation

k-Medoids

In the k-medoids algorithm the centroid of a cluster is always one of the points in the cluster. This is the major difference between the k-means and k-medoids algorithms: in k-means the centroid of a cluster is frequently an imaginary point, not part of the cluster itself, which we can take as marking its center.
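The difference can be made concrete with a small Python sketch (illustrative only, not RapidMiner code): the medoid is chosen from the cluster's members, while the centroid is computed as a mean and is usually not a member:

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def medoid(cluster):
    """The medoid: the member with the smallest total distance to all
    other members.  Always an actual point of the cluster."""
    return min(cluster, key=lambda p: sum(euclidean(p, q) for q in cluster))

def centroid(cluster):
    """The centroid: the attribute-wise mean, often not a member."""
    n = len(cluster)
    return tuple(sum(p[d] for p in cluster) / n for d in range(len(cluster[0])))

points = [(0.0, 0.0), (1.0, 0.0), (10.0, 0.0)]
print(medoid(points))    # (1.0, 0.0)  -- one of the points
print(centroid(points))  # (3.666..., 0.0) -- an imaginary point
```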

k-Means (Kernel)

Kernel k-means uses kernels to estimate the distance between objects and clusters. Because of the nature of kernels it is necessary to sum over all elements of a cluster to calculate one distance, so the algorithm is quadratic in the number of examples. Unlike the K-Means operator, it does not return a Centroid Cluster Model.
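The quadratic cost can be seen in the distance computation itself: in feature space, the squared distance between a point and a cluster's implicit centroid expands into kernel sums, including a double sum over all pairs of cluster members. Here is a minimal Python sketch (illustrative, not RapidMiner code; the RBF kernel and function names are assumptions):

```python
import math

def rbf(a, b, gamma=1.0):
    """Gaussian (RBF) kernel, a common choice for kernel k-means."""
    return math.exp(-gamma * sum((x - y) ** 2 for x, y in zip(a, b)))

def kernel_distance_sq(x, cluster, kernel=rbf):
    """Squared feature-space distance between point x and the implicit
    centroid of `cluster`:

        k(x, x) - (2/n) * sum_c k(x, c) + (1/n^2) * sum_{c, d} k(c, d)

    The double sum over all pairs of cluster members is why kernel
    k-means scales quadratically and cannot expose an explicit centroid.
    """
    n = len(cluster)
    self_term = kernel(x, x)
    cross = 2.0 / n * sum(kernel(x, c) for c in cluster)
    within = sum(kernel(c, d) for c in cluster for d in cluster) / n ** 2
    return self_term - cross + within
```

With a linear kernel (plain dot product) this reduces to the ordinary squared Euclidean distance to the cluster mean, which is a handy sanity check.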

Input

  • example set input (Data Table)

    The input port expects an ExampleSet. It is the output of the Retrieve operator in the attached Example Process. The output of other operators can also be used as input.

Output

  • cluster model (Centroid Cluster Model)

This port delivers the cluster model, which records the clustering performed: which examples belong to which cluster, and the centroid of each cluster.

  • clustered set (Data Table)

    The ExampleSet that was given as input is passed with minor changes to the output through this port. An attribute with id role is added to the input ExampleSet to distinguish examples. An attribute with cluster role may also be added depending on the state of the add cluster attribute parameter.

Parameters

  • add_cluster_attribute: If enabled, a new attribute with cluster role is generated directly by this operator; otherwise this operator does not add the cluster attribute, and you have to use the Apply Model operator to generate it. Range: boolean
  • add_as_label: If true, the cluster id is stored in an attribute with the label role instead of the cluster role (see the add cluster attribute parameter). Range: boolean
  • remove_unlabeled: If set to true, unlabeled examples are deleted. Range: boolean
  • k: This parameter specifies the number of clusters to form. There is no hard and fast rule for the number of clusters, but generally a small number of clusters is preferred, with the examples distributed around them in a balanced way. Range: integer
  • max_runs: This parameter specifies the maximal number of runs of k-means with random initialization that are performed. Range: integer
  • max_optimization_steps: This parameter specifies the maximal number of iterations performed for one run of k-means. Range: integer
  • use_local_random_seed: This parameter indicates if a local random seed should be used for randomization. Randomization may be used for selecting k different points at the start of the algorithm as potential centroids. Range: boolean
  • local_random_seed: This parameter specifies the local random seed. It is only available if the use local random seed parameter is set to true. Range: integer
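The interplay of max runs, max optimization steps, and the random seed can be sketched as follows: each run starts from a different random initialization, runs for at most a fixed number of optimization steps, and the run with the lowest total within-cluster squared distance is kept. This is an illustrative, self-contained Python sketch, not RapidMiner's implementation; all names are our own:

```python
import math
import random

def dist_sq(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def one_run(data, k, max_steps, rng):
    """A single k-means run from one random initialization."""
    centroids = rng.sample(data, k)
    for _ in range(max_steps):
        clusters = [[] for _ in range(k)]
        for p in data:
            clusters[min(range(k), key=lambda i: dist_sq(p, centroids[i]))].append(p)
        new = [tuple(sum(c[d] for c in cl) / len(cl) for d in range(len(data[0])))
               if cl else centroids[i]
               for i, cl in enumerate(clusters)]
        if new == centroids:
            break
        centroids = new
    # Total squared distance of each point to its nearest centroid
    inertia = sum(min(dist_sq(p, c) for c in centroids) for p in data)
    return inertia, centroids

def best_of_runs(data, k, max_runs=10, max_optimization_steps=100, seed=None):
    """Repeat k-means max_runs times and keep the best (lowest-inertia) run."""
    rng = random.Random(seed)
    return min(one_run(data, k, max_optimization_steps, rng)
               for _ in range(max_runs))
```

Fixing the seed makes the whole sequence of initializations, and hence the result, reproducible, which is the purpose of the local random seed parameters.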

Tutorial Processes

Simple clustering of Ripley-Set data set

In many cases, no target attribute (i.e. label) can be defined and the data should be grouped automatically. This procedure is called clustering. RapidMiner supports a wide range of clustering schemes, which can be used in the same way as any other learning scheme, including in combination with all preprocessing operators.

In this Example Process, the 'Ripley-Set' data set is loaded using the Retrieve operator. Note that the label is loaded too, but it is only used for visualization and comparison, not for building the clusters themselves. A breakpoint is inserted at this step so that you can have a look at the ExampleSet before the K-Means operator is applied. Besides the label attribute, the 'Ripley-Set' has two real attributes: 'att1' and 'att2'.

The K-Means operator is applied to this data set with default values for all parameters. Run the process and you will see that two new attributes are created by the K-Means operator. The id attribute is created to distinguish examples clearly, and the cluster attribute shows which cluster each example belongs to. As the parameter k was set to 2, only two clusters are possible, so each example is assigned to either 'cluster_0' or 'cluster_1'. Also note the Plot View of this data: you can clearly see how the algorithm has created two separate groups.

A cluster model is also delivered through the cluster model output port; it carries information about the clustering performed. Under the Folder View you can see the members of each cluster in folder format, and the Centroid Table and Centroid Plot View tabs show information about the centroids.