K-Means (H2O) (H2O)

Synopsis

Performs clustering using the k-means algorithm found in H2O 3.30.0.1.

Description

This Operator performs clustering on the provided ExampleSet and produces a model. The model can be applied later or used for cluster visualization using the Cluster Model Visualizer Operator.

This Operator performs clustering using H2O's k-means algorithm. Clustering groups Examples together which are similar to each other. As no Label Attribute is necessary, Clustering can be used on unlabelled data and is an algorithm of unsupervised machine learning.

The k-means algorithm determines a set of k clusters and assigns each Example to exactly one cluster. The clusters consist of similar Examples. The similarity between Examples is based on the Euclidean distance between them.

A cluster in the k-means algorithm is determined by the position of the center in the n-dimensional space of the n Attributes of the ExampleSet. This position is called centroid. It can, but does not have to be the position of an Example of the ExampleSet.

The k-means algorithm starts with k points which are treated as the centroid of k potential clusters. These start points are determined by some heuristic:

  • Random initialization randomly samples the k-specified value of the Examples of the ExampleSet as cluster centers.
  • PlusPlus initialization chooses one initial center at random and weights the random selection of subsequent centers so that points furthest from the first center are more likely to be chosen.
  • Furthest initialization chooses one initial center at random and then chooses the next center to be the point furthest away in terms of Euclidean distance.

All Examples are assigned to their nearest cluster (in terms of Euclidean distance). Next, the centroids of the clusters are recalculated by averaging over all Examples of one cluster. The previous steps are repeated for the new centroids until the centroids no longer move or max iterations is reached.

The procedure is repeated max iterations times with each time a different set of start points. The set of clusters is delivered which has the minimal sum of squared distances of all Examples to their corresponding centroids.

This Operator supports the automatic inference of k. The estimate k parameter specifies whether to estimate the number of clusters (<=maximum k) iteratively (independent of the seed) and deterministically (beginning with k=1,2,3...). If enabled, for each k that, the estimate will go up to max iterations. This option is disabled by default.

Standardization is highly recommended; if you do not use standardization, the results can include components that are dominated by variables that appear to have larger variances relative to other attributes as a matter of scale, rather than true contribution. This Operator can take care of standardization as well. This option is enabled by default.

Differentiation

k-Medoids

In case of the k-medoids algorithm the centroid of a cluster will always be one of the points in the cluster. This is the major difference between the k-means and k-medoids algorithm.