Overview of K-Means and Code

K-Means is a clustering algorithm based on centroids. The method partitions the data into k clusters, where each observation belongs to the cluster whose centroid is nearest. The user must provide the desired number of clusters k; to choose a good value of k, there are methods such as the elbow plot or the silhouette plot that can be implemented with this package.
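For example, here is a minimal elbow-plot sketch (it assumes Plots.jl for the figure; `kmeans` and the `withinss` field come from the result struct shown at the end of this page):

using ClusterAnalysis, RDatasets, Plots

# Elbow-plot sketch: fit K-Means for a range of k and plot the
# total within-cluster variance (withinss) against k.
df = dataset("datasets", "iris")[:, 1:4]
ks = 2:8
wss = [kmeans(df, k).withinss for k in ks]
plot(ks, wss, marker=:circle, xlabel="k", ylabel="total within-cluster SSE", legend=false)

The "elbow" is the value of k after which the curve flattens, i.e. adding more clusters stops reducing the SSE substantially.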

K-Means is an iterative method: it minimizes a measure of intra-cluster variance through the sequence of iterations detailed below. There is an equivalence between the sum of squared errors (SSE) and the total intra-cluster variance, see Kriegel et al. (2016), which allows us to optimize the SSE in the code.
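Concretely, for clusters $C_1, \dots, C_k$ with centroids $\mu_1, \dots, \mu_k$, the quantity minimized is

$$SSE = \sum_{j=1}^{k} \sum_{x_i \in C_j} \lVert x_i - \mu_j \rVert^2,$$

and since each inner sum is $|C_j|$ times the (biased) variance of cluster $C_j$, minimizing the SSE is the same as minimizing the total intra-cluster variance.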

The inspiration for K-Means came from reading the articles listed in the references and from watching Andrew Ng's lectures and StatQuest videos about it. We then started prototyping the pseudocode in Julia, which resulted in this package.

Pseudocode

  1. Initialize k centroids, either with the K-Means++ algorithm or with random initialization.
  2. Calculate the distance from every point to every centroid.
  3. Assign every point to a cluster by choosing the centroid with the minimum distance to the point.
  4. Calculate the total within-cluster variance as the SSE of this iteration.
  5. Recalculate the centroids as the mean of the points assigned to them.
  6. Repeat steps 2 to 5 up to maxiter times, until convergence to the minimum total variance at step 4.

After that, repeat all steps nstart times and select the centroids with the minimum total variance.
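The loop below is a minimal, self-contained sketch of these steps (random initialization only; it is illustrative and not the package's actual source code):

using Statistics, LinearAlgebra, Random

# Minimal sketch of the pseudocode above (not the package's source).
function kmeans_sketch(X::Matrix{Float64}, k::Int; nstart::Int=10, maxiter::Int=10)
    n = size(X, 1)
    best_centroids, best_cluster, best_sse = Matrix{Float64}(undef, 0, 0), Int[], Inf
    for _ in 1:nstart                                   # repeat everything nstart times
        centroids = X[randperm(n)[1:k], :]              # step 1: random initialization
        cluster = zeros(Int, n)
        sse = Inf
        for _ in 1:maxiter                              # step 6: iterate up to maxiter times
            for i in 1:n                                # steps 2-3: assign nearest centroid
                cluster[i] = argmin([norm(X[i, :] - centroids[j, :]) for j in 1:k])
            end
            # step 4: total within-cluster variance as the SSE of this iteration
            sse = sum(norm(X[i, :] - centroids[cluster[i], :])^2 for i in 1:n)
            for j in 1:k                                # step 5: recompute centroids
                members = X[cluster .== j, :]
                isempty(members) || (centroids[j, :] = vec(mean(members, dims=1)))
            end
        end
        if sse < best_sse                               # keep the run with minimum variance
            best_centroids, best_cluster, best_sse = centroids, copy(cluster), sse
        end
    end
    return best_centroids, best_cluster, best_sse
end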

The default arguments nstart, maxiter, and init in the code are 10, 10, and :kmpp, respectively. They can be changed by passing the arguments to the function, e.g. kmeans(data, K, nstart=10, maxiter=10).
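For example (the keyword names below are the ones stated above, applied to the iris DataFrame df from the benchmark section):

# defaults: nstart=10, maxiter=10, init=:kmpp
model = kmeans(df, 3)

# more restarts and iterations, with random initialization instead
model = kmeans(df, 3, nstart=50, maxiter=100, init=:random)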

The default initialization is the K-Means++ algorithm (:kmpp), because it usually achieves faster convergence than random initialization; it can be switched to random initialization with :random.
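For reference, here is a sketch of the K-Means++ seeding idea (illustrative only, using StatsBase for weighted sampling; not the package's internal code):

using LinearAlgebra, StatsBase

# K-Means++ seeding sketch: each new centroid is sampled with probability
# proportional to its squared distance from the nearest centroid chosen so far.
function kmpp_sketch(X::Matrix{Float64}, k::Int)
    n = size(X, 1)
    centroids = [X[rand(1:n), :]]                       # first centroid: uniform at random
    for _ in 2:k
        d2 = [minimum(norm(X[i, :] - c)^2 for c in centroids) for i in 1:n]
        push!(centroids, X[sample(1:n, Weights(d2)), :])
    end
    return centroids
end

Spreading the initial centroids out this way tends to reduce the number of iterations needed afterwards, which is why it is the default.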

Cool visualizations that explain the K-Means algorithm

Figure 01 - From Stanford ML CheatSheet

Figure 02 - From K-Means wikipedia page

Benchmarking the algorithm

# load packages
using ClusterAnalysis, RDatasets, DataFrames, BenchmarkTools

# load iris dataset 
iris = dataset("datasets", "iris");
df = iris[:, 1:end-1];

# parameters of k-means
k, nstart, maxiter = 3, 10, 10;

# benchmarking the algorithm
@benchmark model = kmeans($df, $k, nstart=$nstart, maxiter=$maxiter)

ClusterAnalysis.jl implementation

This implementation has excellent computational performance: it is faster than Scikit-Learn's KMeans and very close to R's kmeans, which calls C and Fortran in its source code.

Scikit-Learn with C in backend

R with C and FORTRAN in backend

Machine settings used in benchmarking: Intel(R) Core(TM) i5-7200U CPU @ 2.50GHz (2.71 GHz), 8.00 GB RAM.

References and Papers

  • The first paper that mentioned K-Means.
  • The pseudocode used to prototype the first version of the code, extracted from the book Information Theory, Inference, and Learning Algorithms.
  • The K-Means++ paper, describing the initialization used as the default in this package.
  • Stanford slides about K-Means.
  • Kriegel, Hans-Peter; Schubert, Erich; Zimek, Arthur (2016). "The (black) art of runtime evaluation: Are we comparing algorithms or implementations?".

K-Means Struct

struct KmeansResult{T<:Real}
    K::Int                        # number of clusters
    centroids::Vector{Vector{T}}  # coordinates of each centroid
    cluster::Vector{Int}          # cluster assigned to each observation
    withinss::T                   # total within-cluster sum of squares (SSE)
    iter::Int                     # number of iterations performed
end
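As an illustration of the fields, assuming a model fitted on the iris DataFrame df from the benchmark section:

model = kmeans(df, 3)
model.K          # number of clusters requested
model.centroids  # vector with each centroid's coordinates
model.cluster    # cluster assignment of every observation
model.withinss   # total within-cluster sum of squares (SSE)
model.iter       # iterations executed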