Expectation Maximization Clustering (RapidMiner Studio Core)
Synopsis
This operator performs clustering using the Expectation Maximization algorithm. Clustering groups objects so that those in the same cluster are similar to each other and dissimilar to objects in other clusters. The Expectation Maximization algorithm extends this basic approach to clustering in some important ways.
The general purpose of clustering is to detect clusters in examples and to assign those examples to the clusters. A typical application for this type of analysis is a marketing research study in which a number of variables related to consumer behavior are measured for a large sample of respondents. The purpose of the study is to detect 'market segments', i.e., groups of respondents who are more similar to the other members of their own cluster than to respondents in other clusters. In addition to identifying such clusters, it is usually of equal interest to determine how the clusters differ, i.e., which variables or dimensions vary between clusters and how they vary.
The EM (expectation maximization) technique is similar to the k-means technique. The basic operation of k-means clustering algorithms is relatively simple: given a fixed number k of clusters, assign observations to those clusters so that the means across clusters (for all variables) are as different from each other as possible. The EM algorithm extends this basic approach to clustering in two important ways:
- Instead of assigning examples to clusters to maximize the differences in means for continuous variables, the EM clustering algorithm computes probabilities of cluster memberships based on one or more probability distributions. The goal of the clustering algorithm then is to maximize the overall probability or likelihood of the data, given the (final) clusters.
- Unlike the classic implementation of k-means clustering, the general EM algorithm can be applied to both continuous and categorical variables (note that the classic k-means algorithm can also be modified to accommodate categorical variables).
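The likelihood mentioned in the first point can be written out explicitly. The notation below is a sketch using symbols that do not appear in the operator's documentation: n examples x_i, k clusters, mixing weights \pi_j, per-cluster densities p_j with parameters \theta_j, and membership probabilities \gamma_{ij}.

```latex
% Mixture density of one example (all symbols are illustrative assumptions)
p(x_i \mid \Theta) = \sum_{j=1}^{k} \pi_j \, p_j(x_i \mid \theta_j),
\qquad \sum_{j=1}^{k} \pi_j = 1

% Log-likelihood of the data; an EM iteration never decreases this quantity
\log L(\Theta) = \sum_{i=1}^{n} \log \sum_{j=1}^{k} \pi_j \, p_j(x_i \mid \theta_j)

% Probability that example x_i belongs to cluster j
\gamma_{ij} = \frac{\pi_j \, p_j(x_i \mid \theta_j)}
                   {\sum_{l=1}^{k} \pi_l \, p_l(x_i \mid \theta_l)}
```

For each example the values \gamma_{i1}, ..., \gamma_{ik} sum to 1, which is why they can be read as cluster membership probabilities.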
Expectation Maximization algorithm
The basic approach and logic of this clustering method are as follows. Suppose you measure a single continuous variable in a large sample of observations. Further, suppose that the sample consists of two clusters of observations with different means (and perhaps different standard deviations), and that within each cluster the values of the continuous variable follow a normal distribution. The goal of EM clustering is to estimate the mean and standard deviation of each cluster so as to maximize the likelihood of the observed data. Put another way, the EM algorithm attempts to approximate the observed distribution of values by a mixture of distributions, one per cluster.
The results of EM clustering differ from those computed by k-means clustering. k-means assigns observations to clusters so as to maximize the distances between clusters. The EM algorithm does not compute actual assignments of observations to clusters, but classification probabilities: each observation belongs to each cluster with a certain probability. As a final result you can usually still review an actual assignment of observations to clusters, based on the largest classification probability.
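The single-variable, two-cluster case described above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the operator's implementation: the function names (normal_pdf, em_two_gaussians), the synthetic data, the min/max initialization and the stopping tolerance are all hypothetical choices made for the example.

```python
import math
import random


def normal_pdf(x, mean, std):
    """Density of the normal distribution N(mean, std^2) at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def em_two_gaussians(data, steps=200, tol=1e-8):
    """Fit a two-component 1-D Gaussian mixture with plain EM (illustrative)."""
    n = len(data)
    # Crude initialization (an assumption): extreme values as means,
    # the overall standard deviation for both clusters, equal weights.
    means = [min(data), max(data)]
    overall_mean = sum(data) / n
    overall_std = math.sqrt(sum((x - overall_mean) ** 2 for x in data) / n)
    stds = [overall_std, overall_std]
    weights = [0.5, 0.5]
    prev_ll = -math.inf
    resp = []
    for _ in range(steps):
        # E-step: membership probabilities of every observation.
        resp = []
        ll = 0.0
        for x in data:
            joint = [weights[j] * normal_pdf(x, means[j], stds[j]) for j in range(2)]
            total = max(joint[0] + joint[1], 1e-300)  # guard against log(0)
            ll += math.log(total)
            resp.append([joint[0] / total, joint[1] / total])
        # M-step: re-estimate weight, mean and standard deviation per cluster.
        for j in range(2):
            n_j = sum(r[j] for r in resp)
            weights[j] = n_j / n
            means[j] = sum(r[j] * x for r, x in zip(resp, data)) / n_j
            var = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, data)) / n_j
            stds[j] = max(math.sqrt(var), 1e-6)
        # Stop once the log-likelihood no longer improves noticeably.
        if ll - prev_ll < tol:
            break
        prev_ll = ll
    return weights, means, stds, resp


# Synthetic sample: two clusters with means 0 and 5 and standard deviation 1.
random.seed(0)
data = [random.gauss(0.0, 1.0) for _ in range(300)]
data += [random.gauss(5.0, 1.0) for _ in range(300)]

weights, means, stds, resp = em_two_gaussians(data)
print("weights:", weights)
print("means:", means)
print("standard deviations:", stds)
print("membership probabilities of the first observation:", resp[0])
```

Each row of resp holds the two classification probabilities of one observation. Taking the index of the larger value gives the hard assignment described in the text.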
k-Means
The K-Means operator performs clustering using the k-means algorithm. k-means clustering is an exclusive clustering algorithm, i.e. each object is assigned to precisely one of a set of clusters. Objects in one cluster are similar to each other, and the similarity between objects is based on a measure of the distance between them. The K-Means operator assigns observations to clusters so as to maximize the distances between clusters. The Expectation Maximization Clustering operator, on the other hand, computes classification probabilities.
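The difference between an exclusive assignment and classification probabilities shows up most clearly for a value that lies between two clusters. The sketch below is illustrative only: hard_assignment and soft_assignment are hypothetical helper names, and the cluster parameters (means 0 and 5, standard deviation 1, equal weights) are assumed rather than estimated.

```python
import math


def hard_assignment(x, centroids):
    """k-means style: index of the nearest centroid."""
    return min(range(len(centroids)), key=lambda j: abs(x - centroids[j]))


def soft_assignment(x, means, stds, weights):
    """EM style: probability of each cluster for x under a Gaussian mixture."""
    joint = [
        w * math.exp(-0.5 * ((x - m) / s) ** 2) / (s * math.sqrt(2.0 * math.pi))
        for m, s, w in zip(means, stds, weights)
    ]
    total = sum(joint)
    return [p / total for p in joint]


x = 2.4  # a value close to the midpoint between the two cluster centers
hard = hard_assignment(x, [0.0, 5.0])
soft = soft_assignment(x, [0.0, 5.0], [1.0, 1.0], [0.5, 0.5])
print("hard assignment:", hard)
print("classification probabilities:", soft)
```

The hard assignment reports only cluster 0, while the probabilities (roughly 0.62 and 0.38) show that the value is a borderline case.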
- example set (Data Table)
The input port expects an ExampleSet. It is the output of the Retrieve operator in the attached Example Process. The output of other operators can also be used as input.
- cluster model (Cluster Model)
This port delivers the cluster model which has information regarding the clustering performed. It has information about cluster probabilities and cluster means.
- clustered set (Data Table)
The ExampleSet that was given as input is passed with minor changes to the output through this port. An attribute with id role is added to the input ExampleSet to distinguish examples. An attribute with cluster role may also be added depending on the state of the add cluster attribute parameter. If the show probabilities parameter is set to true, one probability column is added for each cluster.
- k
This parameter specifies the number of clusters to form. There is no hard and fast rule for choosing the number of clusters. Generally, a small number of clusters is preferred, with the examples distributed around them in a balanced way (not too scattered). Range: integer
- add_cluster_attribute
If enabled, a new attribute with the cluster role is generated directly by this operator. Otherwise the operator does not add the cluster attribute, and you have to use the Apply Model operator to generate it. Range: boolean
- add_as_label
If true, the cluster id is stored in an attribute with the label role instead of the cluster role. Range: boolean
- remove_unlabeled
If set to true, unlabeled examples are deleted. Range: boolean
- max_runs
This parameter specifies the maximum number of runs of this operator to be performed with random initialization. Range: integer
- max_optimization_steps
This parameter specifies the maximum number of iterations performed in one run of this operator. Range: integer
- quality
This parameter specifies the quality that must be reached before the algorithm stops, i.e. the algorithm stops once the increase in the log-likelihood falls below this value. Range: real
- use_local_random_seed
This parameter indicates whether a local random seed should be used for randomization. Range: boolean
- local_random_seed
This parameter specifies the local random seed. It is only available if the use local random seed parameter is set to true. Range: integer
- show_probabilities
This parameter indicates whether the probabilities for every cluster should be added to each example in the ExampleSet. Range: boolean
- inital_distribution
This parameter specifies the initial distribution of the centroids. Range: selection
- correlated_attributes
This parameter should be set to true if the ExampleSet contains correlated attributes. Range: boolean
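The following sketch shows one plausible way the max_runs, max_optimization_steps, quality and local_random_seed parameters could interact. It is an assumption about the control flow, not RapidMiner's actual implementation: the helper names (fit, em_step, log_likelihood), the default values, the restriction to one-dimensional Gaussian clusters and the random choice of initial means from the data are all hypothetical.

```python
import math
import random


def normal_pdf(x, mean, std):
    """Density of the normal distribution N(mean, std^2) at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def log_likelihood(data, params):
    """Log-likelihood of 1-D data; params is a list of (weight, mean, std)."""
    total = 0.0
    for x in data:
        density = sum(w * normal_pdf(x, m, s) for w, m, s in params)
        total += math.log(max(density, 1e-300))  # guard against log(0)
    return total


def em_step(data, params):
    """One E-step followed by one M-step; returns updated parameters."""
    k = len(params)
    sum_r, sum_rx, sum_rxx = [0.0] * k, [0.0] * k, [0.0] * k
    for x in data:
        joint = [w * normal_pdf(x, m, s) for w, m, s in params]
        norm = max(sum(joint), 1e-300)
        for j in range(k):
            r = joint[j] / norm  # membership probability of x in cluster j
            sum_r[j] += r
            sum_rx[j] += r * x
            sum_rxx[j] += r * x * x
    new_params = []
    for j in range(k):
        n_j = max(sum_r[j], 1e-12)
        mean = sum_rx[j] / n_j
        var = max(sum_rxx[j] / n_j - mean * mean, 1e-6)
        new_params.append((n_j / len(data), mean, math.sqrt(var)))
    return new_params


def fit(data, k=2, max_runs=5, max_optimization_steps=100,
        quality=1e-10, local_random_seed=None):
    """Run EM several times and keep the run with the highest log-likelihood."""
    # A fixed seed makes runs reproducible; None leaves the seeding to Python.
    rng = random.Random(local_random_seed)
    spread = (max(data) - min(data)) / (2 * k) or 1.0
    best_ll, best_params = -math.inf, None
    for run in range(max_runs):
        # Random initialization: k distinct data points serve as initial means.
        params = [(1.0 / k, m, spread) for m in rng.sample(data, k)]
        ll = log_likelihood(data, params)
        for step in range(max_optimization_steps):
            params = em_step(data, params)
            new_ll = log_likelihood(data, params)
            gain = new_ll - ll
            ll = new_ll
            if gain < quality:  # the improvement undercut the threshold
                break
        if ll > best_ll:
            best_ll, best_params = ll, params
    return best_ll, best_params


# Small synthetic sample with two clusters (means 0 and 5).
sample_rng = random.Random(0)
data = [sample_rng.gauss(0.0, 1.0) for _ in range(100)]
data += [sample_rng.gauss(5.0, 1.0) for _ in range(100)]

best_ll, best_params = fit(data, k=2, local_random_seed=1992)
print("best log-likelihood:", best_ll)
for weight, mean, std in best_params:
    print("weight:", weight, "mean:", mean, "std:", std)
```

In this reading, quality and max_optimization_steps bound a single run, while max_runs guards against an unlucky random initialization by keeping only the best of several runs.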
Clustering of the Ripley-Set data set using the Expectation Maximization Clustering operator
The 'Ripley-Set' data set is loaded using the Retrieve operator. Note that the label is loaded too, but it is only used for visualization and comparison and not for building the clusters themselves. A breakpoint is inserted at this step so that you can have a look at the ExampleSet before the Expectation Maximization Clustering operator is applied. Besides the label attribute, the 'Ripley-Set' has two real attributes: 'att1' and 'att2'. The Expectation Maximization Clustering operator is applied to this data set with default values for all parameters.
Run the process and you will see that a few new attributes are created by the Expectation Maximization Clustering operator. The id attribute is created to distinguish examples clearly. The cluster attribute is created to show which cluster each example belongs to. As the parameter k was set to 2, only two clusters are possible, which is why each example is assigned to either 'cluster_0' or 'cluster_1'. Note that the Expectation Maximization Clustering operator has also added a probability attribute for each cluster, showing the probability that an example belongs to that cluster. The operator assigns each example to the cluster with the maximum probability.
Also note the Plot View of this data, where you can clearly see the two separate groups the algorithm has created. A cluster model is also delivered through the cluster model output port. It has information regarding the clustering performed, including cluster probabilities and cluster means. Under Folder View you can see the members of each cluster in folder format.