DBSCAN (RapidMiner Studio Core)
Synopsis
This operator performs clustering with DBSCAN. DBSCAN (for density-based spatial clustering of applications with noise) is a density-based clustering algorithm because it finds a number of clusters starting from the estimated density distribution of corresponding nodes.Description
DBSCAN's definition of a cluster is based on the notion of density reachability. Basically, a point q is directly density-reachable from a point p if it is not farther away than a given distance epsilon (i.e. it is part of its epsilon-neighborhood) and if p is surrounded by sufficiently many points such that one may consider p and q to be part of a cluster. q is called density-reachable (note the distinction from "directly density-reachable") from p if there is a sequence p(1),…,p(n) of points with p(1) = p and p(n) = q where each p(i+1) is directly density-reachable from p(i).
Note that the relation of density-reachable is not symmetric. q might lie on the edge of a cluster, having insufficiently many neighbors to count as dense itself. This would halt the process of finding a path that stops with the first non-dense point. By contrast, starting the process with q would lead to p (though the process would halt there, p being the first non-dense point). Due to this asymmetry, the notion of density-connected is introduced: two points p and q are density-connected if there is a point o such that both p and q are density-reachable from o. Density-connectedness is symmetric.
A cluster, which is a subset of the points of the data set, satisfies two properties: All points within the cluster are mutually density-connected. If a point is density-connected to any point of the cluster, it is part of the cluster as well.
DBSCAN requires two parameters: epsilon and the minimum number of points required to form a cluster (minimal_points). epsilon and minPts can be specified through the epsilon and min points parameters respectively. DBSCAN starts with an arbitrary starting point that has not been visited. This point's epsilon-neighborhood is retrieved, and if it contains sufficiently many points, a cluster is started. Otherwise, the point is labeled as noise. Note that this point might later be found in a sufficiently sized epsilon-environment of a different point and hence be made part of a cluster.
If a point is found to be a dense part of a cluster, its epsilon-neighborhood is also part of that cluster. Hence, all points that are found within the epsilon-neighborhood are added, as is their own epsilon-neighborhood when they are also dense. This process continues until the density-connected cluster is completely found. Then, a new unvisited point is retrieved and processed, leading to the discovery of a further cluster or noise.
This operator also creates scores for the assignments. Scores are the minimal distances to a member of the respective cluster.
In the application phase the algorithm checks if there are at least minPoints of a given cluster in it's epsilon neighbourhood. If so, the data point is assigned to the given cluster. If there is no cluster which fulfills the condition, the cluster assignment is 'Noise'. If there is more than one cluster which fulfills the condition, then the cluster which has the minimal point to point distance is chosen. This is the minimum distance to any data point in the given cluster.
Input
- example set (Data Table)
This input port expects an ExampleSet. It is output of the Retrieve operator in the attached Example Process.
Output
- example set (Data Table)
The resulting data set with the cluster assignments.
- original (Data Table)
The original data set passed through.
- model (Cluster model)
This port delivers the cluster model. It has information regarding the clustering performed. It tells which examples are part of which cluster.
Parameters
- epsilonThis parameter specifies the epsilon parameter of the DBSCAN algorithm. epsilon specifies the size of the neighborhood. Range: real
- minimal_pointsThis parameter specifies the minimal number of points which needs to be in the epsilon neighborhood to call a point part of the cluster. Range: integer
Tutorial Processes
Clustering Spiral Clusters
In this process we generate a spiral cluster, i.e 2 spirals moving around each other. We use both a DBScan clustering and a k-means clustering to see the differences between the two approaches. In order to see the difference you can use a scatter plotting charting att1 and att2 against each other. You can then put dbscan_cluster or kmeans_cluster on the color axis. It can be clearly seen that the dbscan clustering algorithm is able to distinguish between the two spirals, while the kmeans algorithm does fail to do so.