DBSCAN (RapidMiner Studio Core)

Synopsis

This operator performs clustering with DBSCAN. DBSCAN (for density-based spatial clustering of applications with noise) is a density-based clustering algorithm because it finds a number of clusters starting from the estimated density distribution of corresponding nodes.

Description

DBSCAN's definition of a cluster is based on the notion of density reachability. Basically, a point q is directly density-reachable from a point p if it is not farther away than a given distance epsilon (i.e. it is part of its epsilon-neighborhood) and if p is surrounded by sufficiently many points such that one may consider p and q to be part of a cluster. q is called density-reachable (note the distinction from "directly density-reachable") from p if there is a sequence p(1),…,p(n) of points with p(1) = p and p(n) = q where each p(i+1) is directly density-reachable from p(i).

Note that the relation of density-reachable is not symmetric. q might lie on the edge of a cluster, having insufficiently many neighbors to count as dense itself. This would halt the process of finding a path that stops with the first non-dense point. By contrast, starting the process with q would lead to p (though the process would halt there, p being the first non-dense point). Due to this asymmetry, the notion of density-connected is introduced: two points p and q are density-connected if there is a point o such that both p and q are density-reachable from o. Density-connectedness is symmetric.

A cluster, which is a subset of the points of the data set, satisfies two properties: All points within the cluster are mutually density-connected. If a point is density-connected to any point of the cluster, it is part of the cluster as well.

DBSCAN requires two parameters: epsilon and the minimum number of points required to form a cluster (minPts). epsilon and minPts can be specified through the epsilon and min points parameters respectively. DBSCAN starts with an arbitrary starting point that has not been visited. This point's epsilon-neighborhood is retrieved, and if it contains sufficiently many points, a cluster is started. Otherwise, the point is labeled as noise. Note that this point might later be found in a sufficiently sized epsilon-environment of a different point and hence be made part of a cluster.

If a point is found to be a dense part of a cluster, its epsilon-neighborhood is also part of that cluster. Hence, all points that are found within the epsilon-neighborhood are added, as is their own epsilon-neighborhood when they are also dense. This process continues until the density-connected cluster is completely found. Then, a new unvisited point is retrieved and processed, leading to the discovery of a further cluster or noise.

If no id attribute is present, this operator will create one. The 'Cluster 0' assigned by DBSCAN operator corresponds to points that are labeled as noise. These are the points that have less than min points points in their epsilon-neighborhood.

Clustering is concerned with grouping together objects that are similar to each other and dissimilar to the objects belonging to other clusters. It is a technique for extracting information from unlabeled data and can be very useful in many different scenarios e.g. in a marketing application we may be interested in finding clusters of customers with similar buying behavior.

Input

example set (Data Table)
This input port expects an ExampleSet. It is output of the Retrieve operator in the attached Example Process.

Output

cluster model (Cluster Model)
This port delivers the cluster model. It has information regarding the clustering performed. It tells which examples are part of which cluster.
clustered set (Data Table)
The ExampleSet that was given as input is passed with minor changes to the output through this port. An attribute with id role is added to the input ExampleSet to distinguish examples. An attribute with cluster role may also be added depending on the state of the add cluster attribute parameter.

Parameters

epsilonThis parameter specifies the epsilon parameter of the DBSCAN algorithm. epsilon specifies the size of the neighborhood. Range: real
min_pointsThis parameter specifies the minimal number of points forming a cluster. Range: integer
add_cluster_attributeIf this parameter is set to true, a new attribute with cluster role is generated in the resultant ExampleSet, otherwise this operator does not add the cluster attribute. In the latter case you have to use the Apply Model operator to generate the cluster attribute. Range: boolean
add_as_labelIf this parameter is set to true, the cluster id is stored in an attribute with the label role instead of cluster role (see add cluster attribute parameter). Range: boolean
remove_unlabeledIf this parameter is set to true, unlabeled examples are deleted from the ExampleSet. Range: boolean
measure_typesThis parameter is used for selecting the type of measure to be used for measuring the distance between points.The following options are available: mixed measures, nominal measures, numerical measures and Bregman divergences. Range: selection
mixed_measureThis parameter is available when the measure type parameter is set to 'mixed measures'. The only available option is the 'Mixed Euclidean Distance' Range: selection
nominal_measureThis parameter is available when the measure type parameter is set to 'nominal measures'. This option cannot be applied if the input ExampleSet has numerical attributes. In this case the 'numerical measure' option should be selected. Range: selection
numerical_measureThis parameter is available when the measure type parameter is set to 'numerical measures'. This option cannot be applied if the input ExampleSet has nominal attributes. If the input ExampleSet has nominal attributes the 'nominal measure' option should be selected. Range: selection
divergenceThis parameter is available when the measure type parameter is set to 'bregman divergences'. Range: selection
kernel_typeThis parameter is only available when the numerical measure parameter is set to 'Kernel Euclidean Distance'. The type of the kernel function is selected through this parameter. Following kernel types are supported:
- dot: The dot kernel is defined by k(x,y)=x*y i.e.it is inner product of x and y.
- radial: The radial kernel is defined by exp(-g ||x-y||^2) where g is the gamma that is specified by the kernel gamma parameter. The adjustable parameter gamma plays a major role in the performance of the kernel, and should be carefully tuned to the problem at hand.
- polynomial: The polynomial kernel is defined by k(x,y)=(x*y+1)^d where d is the degree of the polynomial and it is specified by the kernel degree parameter. The Polynomial kernels are well suited for problems where all the training data is normalized.
- neural: The neural kernel is defined by a two layered neural net tanh(a x*y+b) where a is alpha and b is the intercept constant. These parameters can be adjusted using the kernel a and kernel b parameters. A common value for alpha is 1/N, where N is the data dimension. Note that not all choices of a and b lead to a valid kernel function.
- sigmoid: This is the sigmoid kernel. Please note that the sigmoid kernel is not valid under some parameters.
- anova: This is the anova kernel. It has adjustable parameters gamma and degree.
- epachnenikov: The Epanechnikov kernel is this function (3/4)(1-u2) for u between -1 and 1 and zero for u outside that range. It has two adjustable parameters kernel sigma1 and kernel degree.
- gaussian_combination: This is the gaussian combination kernel. It has adjustable parameters kernel sigma1, kernel sigma2 and kernel sigma3.
- multiquadric: The multiquadric kernel is defined by the square root of ||x-y||^2 + c^2. It has adjustable parameters kernel sigma1 and kernel sigma shift.
Range: selection
kernel_gammaThis is the SVM kernel parameter gamma. This parameter is only available when the numerical measure parameter is set to 'Kernel Euclidean Distance' and the kernel type parameter is set to radial or anova. Range: real
kernel_sigma1This is the SVM kernel parameter sigma1. This parameter is only available when the numerical measure parameter is set to 'Kernel Euclidean Distance' and the kernel type parameter is set to epachnenikov, gaussian combination or multiquadric. Range: real
kernel_sigma2This is the SVM kernel parameter sigma2. This parameter is only available when the numerical measure parameter is set to 'Kernel Euclidean Distance' and the kernel type parameter is set to gaussian combination. Range: real
kernel_sigma3This is the SVM kernel parameter sigma3. This parameter is only available when the numerical measure parameter is set to 'Kernel Euclidean Distance' and the kernel type parameter is set to gaussian combination. Range: real
kernel_shiftThis is the SVM kernel parameter shift. This parameter is only available when the numerical measure parameter is set to 'Kernel Euclidean Distance' and the kernel type parameter is set to multiquadric. Range: real
kernel_degreeThis is the SVM kernel parameter degree. This parameter is only available when the numerical measure parameter is set to 'Kernel Euclidean Distance' and the kernel type parameter is set to polynomial, anova or epachnenikov. Range: real
kernel_aThis is the SVM kernel parameter a. This parameter is only available when the numerical measure parameter is set to 'Kernel Euclidean Distance' and the kernel type parameter is set to neural. Range: real
kernel_bThis is the SVM kernel parameter b. This parameter is only available when the numerical measure parameter is set to 'Kernel Euclidean Distance' and the kernel type parameter is set to neural. Range: real

Tutorial Processes

Clustering of Ripley-Set data set by the DBSCAN operator

In many cases, no target attribute (i.e. label) can be defined and the data should be automatically grouped. This procedure is called Clustering. RapidMiner supports a wide range of clustering schemes which can be used in just the same way like any other learning scheme. This includes the combination with all preprocessing operators.

In this Example Process, the 'Ripley-Set' data set is loaded using the Retrieve operator. Note that the label is loaded too, but it is only used for visualization and comparison and not for building the clusters itself. A breakpoint is inserted at this step so that you can have a look at the ExampleSet before application of the DBSCAN operator. Other than the label attribute the 'Ripley-Set' has two real attributes; 'att1' and 'att2'. The DBSCAN operator is applied on this data set with default values for all parameters except the epsilon parameter which is set to 0.1. Run the process and you will see that two new attributes are created by the DBSCAN operator. The id attribute is created to distinguish examples clearly. The cluster attribute is created to show which cluster the examples belong to. Each example is assigned to a particular cluster. The examples in 'cluster_0' are considered as noise. Also note the Plot View of this data set. Switch to Plot View and set the the Plotter to 'Scatter', x-Axis to 'att1', y-Axis to 'att2' and Color Column to 'cluster'. You can clearly see how the algorithm has created three separate groups (noise i.e. cluster_0 is also visible separately). A cluster model is also delivered through the cluster model output port. It has information regarding the clustering performed. Under Folder View you can see members of each cluster in folder format.