Support Vector Clustering (RapidMiner Studio Core)
Synopsis
This operator performs clustering with support vectors. Clustering is concerned with grouping together objects that are similar to each other and dissimilar to the objects belonging to other clusters. Clustering is a technique for extracting information from unlabeled data.
Description
This operator is an implementation of Support Vector Clustering (SVC) based on Ben-Hur et al. (2001). In the SVC algorithm, data points are mapped from data space to a high-dimensional feature space using a Gaussian kernel. In feature space, the algorithm looks for the smallest sphere that encloses the image of the data. This sphere is mapped back to data space, where it forms a set of contours that enclose the data points. These contours are interpreted as cluster boundaries, and points enclosed by the same contour are assigned to the same cluster. As the width parameter of the Gaussian kernel is decreased, the number of disconnected contours in data space increases, leading to an increasing number of clusters. Since the contours can be interpreted as delineating the support of the underlying probability distribution, the algorithm can be viewed as identifying valleys in this probability distribution.
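The steps described above can be illustrated with a minimal Python sketch. It is not the implementation used by this operator. It assumes that no outliers are allowed, approximates the smallest enclosing sphere with a simple Frank-Wolfe iteration instead of a full quadratic-programming solver, and decides whether two points share a contour by testing sample points on the straight line between them. All function and argument names (gaussian_kernel, fit_sphere, sphere_distance_sq, svc_labels, n_sample_points, tol) are hypothetical and chosen only for this illustration.

```python
import numpy as np


def gaussian_kernel(A, B, gamma):
    """Pairwise Gaussian kernel exp(-gamma * ||a - b||^2) for rows of A and B."""
    sq_dists = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-gamma * sq_dists)


def fit_sphere(X, gamma, n_iter=500):
    """Approximate the dual weights beta of the smallest enclosing sphere.

    For a Gaussian kernel K(x, x) = 1, so without outliers the dual problem
    reduces to minimizing beta^T K beta over the probability simplex.
    Frank-Wolfe is used only because it is short (assumption of this sketch).
    """
    K = gaussian_kernel(X, X, gamma)
    beta = np.full(len(X), 1.0 / len(X))
    for t in range(n_iter):
        grad = 2.0 * K @ beta
        vertex = np.zeros_like(beta)
        vertex[np.argmin(grad)] = 1.0      # best simplex corner for this step
        step = 2.0 / (t + 2.0)
        beta = (1.0 - step) * beta + step * vertex
    return beta, float(beta @ K @ beta)


def sphere_distance_sq(Y, X, beta, const, gamma):
    """Squared feature-space distance of each row of Y from the sphere center."""
    return 1.0 - 2.0 * gaussian_kernel(Y, X, gamma) @ beta + const


def svc_labels(X, gamma=1.0, n_sample_points=10, tol=1e-9):
    """Assign a cluster label to every row of X."""
    X = np.asarray(X, dtype=float)
    beta, const = fit_sphere(X, gamma)
    # Without outliers, the radius is the largest distance of a training point.
    radius_sq = sphere_distance_sq(X, X, beta, const, gamma).max()
    n = len(X)
    # Interior sample positions on the segment between two points.
    ts = np.linspace(0.0, 1.0, n_sample_points + 2)[1:-1, None]
    adjacent = np.eye(n, dtype=bool)
    for i in range(n):
        for j in range(i + 1, n):
            segment = X[i] + ts * (X[j] - X[i])
            dist_sq = sphere_distance_sq(segment, X, beta, const, gamma)
            # Two points are linked if the whole segment stays inside the sphere.
            adjacent[i, j] = adjacent[j, i] = (dist_sq <= radius_sq + tol).all()
    # Clusters are the connected components of the adjacency graph.
    labels = np.full(n, -1)
    current = 0
    for start in range(n):
        if labels[start] != -1:
            continue
        labels[start] = current
        stack = [start]
        while stack:
            node = stack.pop()
            for nb in np.flatnonzero(adjacent[node] & (labels == -1)):
                labels[nb] = current
                stack.append(nb)
        current += 1
    return labels


# Two small, well-separated groups of points (made-up demonstration data).
group = np.array([[0.0, 0.0], [0.2, 0.0], [0.0, 0.2], [0.2, 0.2]])
points = np.vstack([group, group + 5.0])
print(svc_labels(points, gamma=1.0))
```

In this sketch the two groups should receive different labels, because every straight line between them passes through a region that lies outside the sphere in feature space. Increasing gamma (a narrower kernel) makes the contours tighter and tends to split the data into more clusters, which matches the behavior described above.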
Clustering is concerned with grouping together objects that are similar to each other and dissimilar to the objects belonging to other clusters. It is a technique for extracting information from unlabeled data and is useful in many scenarios. For example, in a marketing application we may be interested in finding clusters of customers with similar buying behavior.
Input
- example set (Data Table)
This input port expects an ExampleSet. It is the output of the Generate Data operator in the attached Example Process.
Output
- cluster model (Cluster Model)
This port delivers the cluster model. It has information regarding the clustering performed. It tells which examples are part of which cluster.
- clustered set (Data Table)
The ExampleSet that was given as input is passed with minor changes to the output through this port. An attribute with id role is added to the input ExampleSet to distinguish examples. An attribute with cluster role may also be added depending on the state of the add cluster attribute parameter.
Parameters
- add_cluster_attribute: If this parameter is set to true, a new attribute with cluster role is generated in the resultant ExampleSet; otherwise this operator does not add the cluster attribute. In the latter case you have to use the Apply Model operator to generate the cluster attribute. Range: boolean
- add_as_label: If this parameter is set to true, the cluster id is stored in an attribute with the label role instead of cluster role (see add cluster attribute parameter). Range: boolean
- remove_unlabeled: If this parameter is set to true, unlabeled examples are deleted from the ExampleSet. Range: boolean
- min_pts: This parameter specifies the minimum number of points in each cluster. Range: integer
- kernel_type: The type of the kernel function is selected through this parameter. The following kernel types are supported: dot, radial, polynomial, neural.
  - dot: The dot kernel is defined by k(x,y)=x*y, i.e. it is the inner product of x and y.
  - radial: The radial kernel is defined by k(x,y)=exp(-g ||x-y||^2), where g is the gamma specified by the kernel gamma parameter. Gamma plays a major role in the performance of the kernel and should be carefully tuned to the problem at hand.
  - polynomial: The polynomial kernel is defined by k(x,y)=(x*y+1)^d, where d is the degree of the polynomial, specified by the kernel degree parameter. Polynomial kernels are well suited for problems where all the training data is normalized.
  - neural: The neural kernel is defined by a two-layered neural net, k(x,y)=tanh(a x*y+b), where a is alpha and b is the intercept constant. These parameters can be adjusted using the kernel a and kernel b parameters. A common value for alpha is 1/N, where N is the data dimension. Note that not all choices of a and b lead to a valid kernel function.
- kernel_gamma: This is the SVM kernel parameter gamma. It is available only when the kernel type parameter is set to radial. Range: real
- kernel_degree: This is the SVM kernel parameter degree. It is available only when the kernel type parameter is set to polynomial. Range: real
- kernel_a: This is the SVM kernel parameter a. It is available only when the kernel type parameter is set to neural. Range: real
- kernel_b: This is the SVM kernel parameter b. It is available only when the kernel type parameter is set to neural. Range: real
- kernel_cache: This is an expert parameter. It specifies the size of the cache for kernel evaluations in megabytes. Range: real
- convergence_epsilon: This is an optimizer parameter. It specifies the precision on the KKT conditions. Range: real
- max_iterations: This is an optimizer parameter. The optimizer stops after the specified number of iterations. Range: integer
- p: This parameter specifies the fraction of allowed outliers. Range: real
- r: If this parameter is set to -1, the calculated radius is used as the radius. Otherwise the value specified in this parameter is used as the radius. Range: real
- number_sample_points: This parameter specifies the number of virtual sample points to check for neighborhood. Range: real
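The four kernel formulas listed under the kernel type parameter can be written out as a short Python sketch. This only illustrates the formulas as stated in this document and is not the operator's own code. The function names and the default argument values are hypothetical; they are not the operator's defaults.

```python
import numpy as np


def dot_kernel(x, y):
    """k(x, y) = x * y (inner product)."""
    return float(np.dot(x, y))


def radial_kernel(x, y, gamma=1.0):
    """k(x, y) = exp(-gamma * ||x - y||^2); gamma default is illustrative."""
    diff = np.asarray(x, dtype=float) - np.asarray(y, dtype=float)
    return float(np.exp(-gamma * np.dot(diff, diff)))


def polynomial_kernel(x, y, degree=2.0):
    """k(x, y) = (x * y + 1)^degree; degree default is illustrative."""
    return float((np.dot(x, y) + 1.0) ** degree)


def neural_kernel(x, y, a=1.0, b=0.0):
    """k(x, y) = tanh(a * (x * y) + b); a and b defaults are illustrative."""
    return float(np.tanh(a * np.dot(x, y) + b))


x = [1.0, 2.0]
y = [3.0, 4.0]
print(dot_kernel(x, y))                    # inner product of x and y
print(radial_kernel(x, y, gamma=0.5))
print(polynomial_kernel(x, y, degree=2.0))
print(neural_kernel(x, y, a=0.5, b=0.0))   # a = 1/N with N = 2 dimensions
```

The radial kernel always returns 1 when x and y are identical and decays toward 0 as the points move apart, with gamma controlling how quickly it decays.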
Tutorial Processes
Clustering of Ripley-Set data set by the Support Vector Clustering operator
In many cases, no target attribute (i.e. label) can be defined and the data should be grouped automatically. This procedure is called clustering. RapidMiner supports a wide range of clustering schemes, which can be used in the same way as any other learning scheme. This includes the combination with all preprocessing operators.
In this Example Process, the Generate Data operator is used for generating an ExampleSet. Note that the label is loaded too, but it is only used for visualization and comparison and not for building the clusters themselves. A breakpoint is inserted at this step so that you can have a look at the ExampleSet before the clustering operator is applied. Other than the label attribute, the ExampleSet has two real attributes: 'att1' and 'att2'. The Support Vector Clustering operator is applied on this data set. Run the process and you will see that two new attributes are created by the Support Vector Clustering operator. The id attribute is created to distinguish examples clearly. The cluster attribute is created to show which cluster each example belongs to. Examples that are not in any cluster are considered noise. Also note the Plot View of this data set. Switch to Plot View and set the Plotter to 'Scatter', x-Axis to 'att1', y-Axis to 'att2' and Color Column to 'cluster'. You can clearly see how the algorithm has created three separate clusters (noise is also visible separately). A cluster model is also delivered through the cluster model output port. It has information regarding the clustering performed. Under Folder View you can see the members of each cluster in folder format.