Overview of DBSCAN

DBSCAN is a clustering algorithm that is particularly appropriate for spatial data. It is a non-parametric method that applies the concept of density and, beyond identifying clusters, is also able to recognize noise observations. All of those ideas are embedded in its name, Density-Based Spatial Clustering of Applications with Noise (DBSCAN), which classifies it as a density-based, non-parametric clustering algorithm.

In a certain way, the algorithm tries to mimic the human ability to recognize groups of points that are close to each other, while points that are far away are considered outside the group.

Figure 01 - Spatial data example from the first DBSCAN paper

This way, we can identify points that belong to a dense region (a cluster) and points that do not belong to any cluster, which are called noise.

The user chooses the hyperparameters min_points and ϵ, which correspond, respectively, to the minimum number of neighbors required for an observation to be considered a core point, and the radius within which each point searches for the points that surround it.
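For concreteness, these hyperparameters are exactly the two extra arguments passed to the dbscan call used later in this post (the values below are just the ones from the benchmark section, and X is assumed to be the data matrix):

ϵ = 0.35       # radius of the neighborhood search
min_pts = 10   # minimum number of neighbors for a core point
m = dbscan(X, ϵ, min_pts)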

Logic of DBSCAN

The algorithm uses two important data structures in its implementation:

  • A labels vector

To keep track of the points that have already been labelled.

  • A KDTree

Used to perform the recurring search for neighboring points in a tree structure. That search feeds and consumes a FIFO queue, and this structure makes the code very performant, thanks to the KDTree struct from the NearestNeighbors.jl package.
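As a rough illustration of how that neighbor search behaves (a minimal sketch on toy data, not the package's internal code), KDTree and inrange from NearestNeighbors.jl answer the radius queries:

using NearestNeighbors

# Toy data: NearestNeighbors.jl expects each *column* to be an observation.
X = rand(2, 100)
tree = KDTree(X, leafsize=20)

# Indices of every observation within radius ϵ of the first point;
# this is exactly the kind of range query the algorithm issues repeatedly.
ϵ = 0.35
neighbors = inrange(tree, X[:, 1], ϵ)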

The algorithm sets as core points those points that have at least min_points neighbors within the radius ϵ. All points inside this radius belong to the same cluster (the same dense region). Points with fewer than min_points neighbors can be labelled as one of the following (a minimal sketch of this classification appears right after the list):

  • Noise

Because it does not touch any point already labelled as belonging to a cluster.

  • Border point

Because it touches at least one labelled point, and it will receive the same label as the point it touches.
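Putting those rules together, the core/border/noise classification could be sketched like this (a hypothetical helper written for illustration, not the package's actual internals):

using NearestNeighbors

# Illustrative classifier: tags each row of X as :core, :border, or :noise.
function classify_points(X::Matrix{<:AbstractFloat}, ϵ::Real, min_pts::Int)
    tree = KDTree(X')                # columns are observations
    n = size(X, 1)
    kind = fill(:noise, n)

    # Core points: at least min_pts neighbors (itself included) within ϵ.
    for i in 1:n
        if length(inrange(tree, X[i, :], ϵ)) >= min_pts
            kind[i] = :core
        end
    end

    # Border points: non-core points that touch at least one core point.
    for i in 1:n
        kind[i] == :core && continue
        if any(j -> kind[j] == :core, inrange(tree, X[i, :], ϵ))
            kind[i] = :border
        end
    end
    return kind
end

# Usage on toy data: classify_points(rand(100, 2), 0.35, 10)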

The distance used in this implementation to compute a point's neighbors is the Euclidean distance. In the code, this neighbor search is referred to as the RangeQuery function, and that function is the most performance-critical part of the algorithm. Because of that, the search relies on the KDTree structure, cited before, already implemented in NearestNeighbors.jl.
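For reference, that Euclidean distance is simply the following (a one-line sketch; Euclidean is also the default metric of NearestNeighbors.jl):

using LinearAlgebra

# Euclidean distance between two observations p and q: √(Σᵢ (pᵢ - qᵢ)²)
euclidean_distance(p, q) = norm(p .- q)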

Figure 02 - Illustration of DBSCAN from this paper

Pseudocode

We were inspired by the pseudo-code presented in the paper "DBSCAN Revisited". Here, the reader can get an overview of the steps of the algorithm.
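The abstract DBSCAN loop goes roughly like this (paraphrased from memory, not a verbatim copy of the paper's listing):

foreach point p in database DB
    if label(p) ≠ undefined then continue        # already processed
    N ← RangeQuery(DB, dist, p, ϵ)               # find the neighbors of p
    if |N| < minPts then
        label(p) ← Noise                         # p is not a core point
        continue
    c ← next cluster label
    label(p) ← c
    seed set S ← N \ {p}                         # FIFO queue of points to expand
    foreach q in S
        if label(q) = Noise then label(q) ← c    # noise becomes a border point
        if label(q) ≠ undefined then continue
        label(q) ← c
        N ← RangeQuery(DB, dist, q, ϵ)
        if |N| ≥ minPts then S ← S ∪ N           # q is a core point: expand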

A cool visualization that explains the algorithm

Now, I need to share with the world this amazing website created by Naftali Harris, from his detailed post about DBSCAN. I cropped a GIF from his interactive visualization, hoping it makes it easier for the reader to connect all the information presented above.

Benchmarking code

using ClusterAnalysis, DataFrames, CSV, BenchmarkTools

# load blob dataset 
df = CSV.read("algo_overview/blob_data.csv", DataFrame, drop=[1]);
X = df[:,1:2];
y = df[:,end];

# parameters of DBSCAN
ϵ = 0.35;
min_pts = 10;

# benchmarking algorithm
@benchmark m = dbscan(X, ϵ, min_pts)

Scikit-Learn with C backend

R with C++ backend

Machine settings used in benchmarking:
Processor: Intel(R) Core(TM) i5-7200U CPU @ 2.50GHz (2.71 GHz)
RAM: 8.00 GB

DBSCAN Results

using Plots, StatsPlots

m = dbscan(X, ϵ, min_pts);

scatter(X[:,1], X[:,2], zcolor=m.labels, 
        leg=false, 
        title="DBSCAN prediction\n(ϵ=$(ϵ), minPts=$(min_pts))")

DBSCAN Struct

struct DBSCAN{T<:AbstractFloat, KD<:KDTree}
    df::Matrix{T}                  # data matrix, one observation per row
    ϵ::T                           # neighborhood radius
    min_pts::Int                   # minimum neighbors for a core point
    labels::Vector{Int}            # cluster assignment per observation
    tree::KD                       # KDTree used by the RangeQuery search
    clusters::Vector{Vector{Int}}  # indices of the points in each cluster

    # Internal constructor: every point starts with label -1 (unassigned),
    # and the KDTree is built once over the transposed data (columns as points).
    function DBSCAN(df::Matrix{T}, ϵ::T, min_pts::Int) where {T<:AbstractFloat}
        labels = fill(-1, size(df, 1))
        tree = KDTree(df', leafsize=20)
        clusters = Vector{Vector{Int}}()
        KD = typeof(tree)

        new{T, KD}(df, ϵ, min_pts, labels, tree, clusters)
    end
end
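As a quick illustration of the internal constructor above (a sketch on hypothetical random data; it assumes NearestNeighbors.jl is loaded so that KDTree is in scope):

using NearestNeighbors

X = rand(100, 2)
model = DBSCAN(X, 0.35, 10)

model.labels    # every entry is -1, the initial "unassigned" value
model.clusters  # empty until the fitting routine populates it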

References and Papers