Top Down Clustering (RapidMiner Studio Core)

Synopsis

This operator performs top down clustering by applying the inner flat clustering scheme recursively. Top down clustering is a strategy of hierarchical clustering. The result of this operator is a hierarchical cluster model.

Description

This operator is a nested operator, i.e. it has a subprocess. The subprocess must contain a flat clustering operator, e.g. the K-Means operator. This operator builds a hierarchical cluster model using the clustering operator provided in its subprocess. You need a basic understanding of subprocesses in order to apply this operator; please study the documentation of the Subprocess operator if you are not familiar with them.

The basic idea of top down clustering is that all observations start in one cluster, and splits are performed recursively as one moves down the hierarchy. Hierarchical clustering (also known as connectivity-based clustering) is a method of cluster analysis which seeks to build a hierarchy of clusters. It is based on the core idea that objects are more related to nearby objects than to objects farther away. As such, these algorithms connect 'objects' (or examples, in the case of an ExampleSet) to form clusters based on their distance. A cluster can be described largely by the maximum distance needed to connect parts of the cluster. At different distances, different clusters will form. These algorithms do not provide a single partitioning of the data set, but instead provide an extensive hierarchy of clusters that merge with each other at certain distances.

Strategies for hierarchical clustering generally fall into two types:

Agglomerative: This is a bottom-up approach: each observation starts in its own cluster, and pairs of clusters are merged as one moves up the hierarchy. This type of clustering is implemented in RapidMiner as the Agglomerative Clustering operator.
Divisive: This is a top-down approach: all observations start in one cluster, and splits are performed recursively as one moves down the hierarchy. This is the strategy implemented by the Top Down Clustering operator.
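
The divisive strategy can be sketched in a few lines of Python. This is only an illustration of the idea, not RapidMiner's implementation: the functions `two_means`, `top_down` and `leaves` are hypothetical names, the small 2-means routine merely stands in for whatever flat clustering operator sits in the subprocess, and the way `max_depth` and `max_leaf_size` stop the recursion is an assumption based on the parameter descriptions.

```python
import random


def two_means(points, iterations=10, seed=0):
    """Tiny flat clusterer (k-means with k=2), a stand-in for the inner operator."""
    if len(points) < 2:
        return [points]
    rng = random.Random(seed)
    centers = rng.sample(points, 2)
    groups = [[], []]
    for _ in range(iterations):
        groups = [[], []]
        for p in points:
            # Assign each point to the nearer center (squared Euclidean distance).
            dists = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centers]
            groups[dists.index(min(dists))].append(p)
        # Recompute each center as the mean of its group; keep the old
        # center if a group happens to be empty.
        centers = [
            tuple(sum(xs) / len(g) for xs in zip(*g)) if g else c
            for g, c in zip(groups, centers)
        ]
    return [g for g in groups if g]


def top_down(points, flat_clusterer, depth=0, max_depth=3, max_leaf_size=2):
    """Recursively split `points` with `flat_clusterer` (assumed stopping rules)."""
    node = {"points": points, "children": []}
    # Assumption: stop when the tree is deep enough or the cluster is small enough.
    if depth >= max_depth or len(points) <= max_leaf_size:
        return node
    parts = flat_clusterer(points)
    if len(parts) < 2:  # the flat clusterer could not split this cluster
        return node
    node["children"] = [
        top_down(part, flat_clusterer, depth + 1, max_depth, max_leaf_size)
        for part in parts
    ]
    return node


def leaves(node):
    """Yield the point lists stored in the leaf nodes of the tree."""
    if not node["children"]:
        yield node["points"]
    else:
        for child in node["children"]:
            yield from leaves(child)


# Two well-separated groups of two-dimensional points (made-up data).
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
tree = top_down(points, two_means, max_depth=3, max_leaf_size=2)
```

All points start in the root node, and each recursive call applies the same flat clusterer to one of the resulting parts, which mirrors how the operator reuses its subprocess at every level of the hierarchy.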

Clustering is concerned with grouping together objects that are similar to each other and dissimilar to the objects belonging to other clusters. It is a technique for extracting information from unlabeled data and can be very useful in many different scenarios. For example, in a marketing application we may be interested in finding clusters of customers with similar buying behavior.

Input

example set (Data Table)
This input port expects an ExampleSet. It is the output of the Retrieve operator in the attached Example Process.

Output

cluster model (Hierarchical Cluster Model)
This port delivers the hierarchical cluster model. It has information regarding the clustering performed.
clustered set (Data Table)
The ExampleSet that was given as input is passed with minor changes to the output through this port. An attribute with id role is added to the input ExampleSet to distinguish examples. An attribute with cluster role may also be added, depending on the state of the create cluster label parameter.
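
As a rough illustration of these two added attributes, the following Python sketch attaches an id to each example and, optionally, a cluster label. It is not RapidMiner code: the functions `add_ids`, `leaf_id_lists` and `attach_cluster_labels`, the dictionary-based tree layout, the `cluster_N` label format and the choice to treat every leaf of the hierarchy as one cluster are all assumptions made for the example.

```python
def add_ids(examples):
    """Give every example a running id so it can be told apart later."""
    return [{"id": i, **example} for i, example in enumerate(examples, start=1)]


def leaf_id_lists(node):
    """Yield the list of example ids held by each leaf of a hypothetical tree."""
    if not node["children"]:
        yield node["ids"]
    else:
        for child in node["children"]:
            yield from leaf_id_lists(child)


def attach_cluster_labels(rows, tree, create_cluster_label=True):
    """Add a 'cluster' value per row; assumes each leaf is one cluster."""
    if not create_cluster_label:
        return rows  # only the id attribute is present in this case
    cluster_of = {}
    for index, ids in enumerate(leaf_id_lists(tree)):
        for example_id in ids:
            cluster_of[example_id] = f"cluster_{index}"
    return [{**row, "cluster": cluster_of[row["id"]]} for row in rows]


# Made-up examples and a hand-written two-leaf hierarchy over their ids.
rows = add_ids([{"att1": 0.1}, {"att1": 0.2}, {"att1": 0.9}])
tree = {
    "ids": [1, 2, 3],
    "children": [
        {"ids": [1, 2], "children": []},
        {"ids": [3], "children": []},
    ],
}
labeled = attach_cluster_labels(rows, tree, create_cluster_label=True)
```

With `create_cluster_label=False` the rows are returned with the id only, which corresponds to the case where the operator does not add the cluster attribute.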

Parameters

create_cluster_label: This parameter specifies if a cluster label should be created. If this parameter is set to true, a new attribute with cluster role is generated in the resultant ExampleSet; otherwise this operator does not add the cluster attribute. Range: boolean
max_depth: This parameter specifies the maximal depth of the cluster tree. Range: integer
max_leaf_size: This parameter specifies the maximal number of items in each cluster leaf. Range: integer

Tutorial Processes

Top down clustering of Ripley-Set data set

The 'Ripley-Set' data set is loaded using the Retrieve operator. Note that the label is loaded too, but it is only used for visualization and comparison and not for building the clusters themselves. A breakpoint is inserted at this step so that you can have a look at the ExampleSet before the Top Down Clustering operator is applied. Other than the label attribute, the 'Ripley-Set' has two real attributes: 'att1' and 'att2'. The Top Down Clustering operator is applied on this data set. Run the process and you will see that two new attributes are created by the Top Down Clustering operator. The id attribute is created to distinguish examples clearly. The cluster attribute is created to show which cluster each example belongs to. Note the Graph View of the results. You can see that the algorithm has not created separate groups or clusters as other clustering algorithms (like k-means) do; instead, the result is a hierarchy of clusters. Under the Folder View you can see the members of each cluster in folder format. You can see that it is a hierarchy of folders.