You are viewing the RapidMiner Studio documentation for version 9.0.

Random Clustering (RapidMiner Studio Core)

Synopsis

This operator performs a random flat clustering of the given ExampleSet. Clustering is concerned with grouping together objects that are similar to each other and dissimilar to the objects belonging to other clusters.

Description

This operator performs a random flat clustering of the given ExampleSet. Note that this algorithm does not guarantee that all clusters will be non-empty. The operator creates a cluster attribute in the resulting ExampleSet if the add cluster attribute parameter is set to true. Because this operator assigns examples to clusters at random, use an operator that implements an actual clustering algorithm, such as the K-Means operator, if you want meaningful clusters.
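The following minimal Python sketch illustrates why empty clusters are possible under random assignment. It is not RapidMiner's implementation: the function name `assign_random_clusters` is hypothetical, and drawing each cluster uniformly at random is an assumption, since the documentation only states that the assignment is random.

```python
import random
from collections import Counter

def assign_random_clusters(n_examples, number_of_clusters, seed=None):
    """Assign each example index to a cluster drawn at random.

    Assumption: clusters are drawn uniformly and independently per example.
    """
    rng = random.Random(seed)
    return [rng.randrange(number_of_clusters) for _ in range(n_examples)]

# With fewer examples than clusters, some clusters must stay empty.
assignments = assign_random_clusters(n_examples=4, number_of_clusters=10, seed=0)
sizes = Counter(assignments)
empty = [c for c in range(10) if sizes[c] == 0]
print("assignments:", assignments)
print("empty clusters:", empty)
```

Nothing in the assignment step looks at the attribute values, which is why the result carries no information about similarity between examples.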

Clustering is concerned with grouping together objects that are similar to each other and dissimilar to the objects belonging to other clusters. Clustering is a technique for extracting information from unlabeled data. It is useful in many scenarios; for example, in a marketing application we may be interested in finding clusters of customers with similar buying behavior.

Input

  • example set (Data Table)

    The input port expects an ExampleSet. It is the output of the Retrieve operator in the attached Example Process. The output of other operators can also be used as input.

Output

  • cluster model (Cluster Model)

    This port delivers the cluster model which has information regarding the clustering performed. It tells which examples are part of which cluster.

  • clustered set (Data Table)

    The ExampleSet that was given as input is passed with minor changes to the output through this port. An attribute with id role is added to the input ExampleSet to distinguish examples. An attribute with cluster role may also be added depending on the state of the add cluster attribute parameter.

Parameters

  • add_cluster_attribute

    If enabled, a new attribute with cluster role is generated directly in this operator; otherwise this operator does not add the cluster attribute. In the latter case you have to use the Apply Model operator to generate the cluster attribute. Range: boolean

  • add_as_label

    If true, the cluster id is stored in an attribute with the label role instead of the cluster role (see the add cluster attribute parameter). Range: boolean

  • remove_unlabeled

    If set to true, unlabeled examples are deleted. Range: boolean

  • number_of_clusters

    This parameter specifies the desired number of clusters to form. There is no hard and fast rule for choosing this number. Generally, however, a small number of clusters is preferred, with examples distributed around them in a balanced way (not too scattered). Range: integer

  • use_local_random_seed

    This parameter indicates if a local random seed should be used for randomization. Range: boolean

  • local_random_seed

    This parameter specifies the local random seed. This parameter is only available if the use local random seed parameter is set to true. Range: integer
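To make the interplay of these parameters concrete, here is a hedged Python sketch that mimics the operator's two outputs. It is only an illustration under stated assumptions, not RapidMiner's code: the function `random_clustering`, the dict-based representation of examples, the default values, and the uniform random draw are all hypothetical. The remove_unlabeled parameter is left out because the documentation does not describe its behavior in enough detail to model it.

```python
import random

def random_clustering(examples, number_of_clusters=3, add_cluster_attribute=True,
                      add_as_label=False, use_local_random_seed=False,
                      local_random_seed=42):
    """Hypothetical stand-in for the operator (not RapidMiner's implementation).

    examples: list of dicts mapping attribute names to values.
    Returns (cluster_model, clustered_set).
    """
    # A local seed makes the assignment reproducible; otherwise it varies per run.
    rng = random.Random(local_random_seed) if use_local_random_seed else random.Random()

    # Cluster model: which example ids belong to which cluster.
    cluster_model = {f"cluster_{k}": [] for k in range(number_of_clusters)}
    clustered_set = []
    for example_id, example in enumerate(examples, start=1):
        name = f"cluster_{rng.randrange(number_of_clusters)}"
        cluster_model[name].append(example_id)
        row = {"id": example_id, **example}  # an id attribute is always added
        if add_cluster_attribute:
            # Store the cluster under the label role if add_as_label is set.
            row["label" if add_as_label else "cluster"] = name
        clustered_set.append(row)
    return cluster_model, clustered_set

# Illustrative values only.
data = [{"att1": 0.1, "att2": 0.7}, {"att1": 0.4, "att2": 0.2}, {"att1": 0.9, "att2": 0.5}]
model, clustered = random_clustering(data, use_local_random_seed=True, local_random_seed=7)
print(model)
print(clustered)
```

In this sketch, calling the function twice with the same local seed yields the same assignment, while setting `add_cluster_attribute=False` returns rows that carry only the id and the original attributes, leaving the cluster model as the only record of the assignment.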

Tutorial Processes

Random clustering of the Ripley-Set data set

In many cases, no target attribute (i.e. label) can be defined and the data should be grouped automatically. This procedure is called clustering. RapidMiner supports a wide range of clustering schemes, which can be used in the same way as any other learning scheme. This includes combining them with all preprocessing operators.

In this Example Process, the 'Ripley-Set' data set is loaded using the Retrieve operator. Note that the label is loaded too, but it is only used for visualization and comparison, not for building the clusters themselves. A breakpoint is inserted at this step so that you can have a look at the ExampleSet before the Random Clustering operator is applied. Besides the label attribute, the 'Ripley-Set' has two real attributes: 'att1' and 'att2'.

The Random Clustering operator is applied on this data set with default values for all parameters. Run the process and you will see that two new attributes are created by the Random Clustering operator. The id attribute is created to distinguish examples clearly. The cluster attribute is created to show which cluster each example belongs to. As the number of clusters parameter was set to 3, only three clusters are possible. That is why each example is assigned to 'cluster_0', 'cluster_1' or 'cluster_2'.

Also note the Plot View of this data, where you can see the three groups this operator has created. A cluster model is also delivered through the cluster model output port. It has information regarding the clustering performed. Under Folder View you can see the members of each cluster in folder format.

It is important to note that this operator randomly assigns examples to clusters (this can be seen easily in the Plot View). If you want proper clustering of your ExampleSet, please use an operator that implements a clustering algorithm, such as the K-Means operator.
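One way to check that randomly assigned clusters are unrelated to the label is to cross-tabulate the two. The Python sketch below does this on synthetic stand-in data; it does not load the 'Ripley-Set', and the example count, label values, seed and uniform draw are assumptions made purely for illustration.

```python
import random
from collections import Counter

rng = random.Random(0)

# Synthetic stand-in data, NOT the Ripley-Set: 200 examples with a binary label.
labels = [rng.choice(["0", "1"]) for _ in range(200)]

# Random cluster assignment that ignores both attributes and labels.
clusters = [f"cluster_{rng.randrange(3)}" for _ in labels]

# Count how many examples fall into each (cluster, label) combination.
crosstab = Counter(zip(clusters, labels))
for c in sorted(set(clusters)):
    print(c, {lab: crosstab[(c, lab)] for lab in ("0", "1")})
```

With a random assignment, each cluster is expected to contain a similar mix of labels, whereas a clustering algorithm that exploits structure in the data would typically produce clusters dominated by one label.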