Documentation of ClusterAnalysis functions.
ClusterAnalysis.dbscan
ClusterAnalysis.euclidean
ClusterAnalysis.kmeans
ClusterAnalysis.squared_error
ClusterAnalysis.totalwithinss
ClusterAnalysis.kmeans
— Function
kmeans(table, K::Int; nstart::Int = 10, maxiter::Int = 10, init::Symbol = :kmpp)
kmeans(data::AbstractMatrix, K::Int; nstart::Int = 10, maxiter::Int = 10, init::Symbol = :kmpp)
Classify all data observations into K clusters by minimizing the total variance within each cluster.
Arguments (positional)
- table or data: table or matrix of data observations.
- K: number of clusters.

Keyword
- nstart: number of random starts.
- maxiter: maximum number of iterations.
- init: centroid initialization algorithm, :kmpp (default) or :random.
Example
julia> using ClusterAnalysis
julia> using CSV, DataFrames
julia> iris = CSV.read(joinpath(pwd(), "path/to/iris.csv"), DataFrame);
julia> df = iris[:, 1:end-1];
julia> model = kmeans(df, 3)
KmeansResult{Float64}:
K = 3
centroids = [
[5.932307692307693, 2.755384615384615, 4.42923076923077, 1.4384615384615382]
[5.006, 3.4279999999999995, 1.462, 0.24599999999999997]
[6.874285714285714, 3.088571428571429, 5.791428571428571, 2.117142857142857]
]
cluster = [2, 2, 2, 2, 2, 2, 2, 2, 2, 2 … 3, 3, 1, 3, 3, 3, 1, 3, 3, 1]
within-cluster sum of squares = 78.85144142614601
iterations = 7
Pseudo-code of the algorithm:
- Repeat nstart times:
    - Initialize K cluster centroids using the KMeans++ algorithm or random init.
    - Estimate clusters.
    - Repeat maxiter times:
        - Update centroids using the mean().
        - Reestimate the clusters.
        - Calculate the total variance within clusters.
        - Evaluate the stop rule.
- Keep the best result (minimum total variance within clusters) of all nstart executions.
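The loop above can be sketched in plain Julia. This is a minimal illustration of the assignment/update iteration with random init (the package defaults to :kmpp and adds the nstart restarts and stop rule); `kmeans_sketch` and its internals are illustrative names, not the package's implementation:

```julia
using Statistics, Random

# One simplified k-means run: random init, then assignment/update iterations.
function kmeans_sketch(X::AbstractMatrix, K::Int; maxiter::Int = 10)
    n = size(X, 1)
    centroids = X[randperm(n)[1:K], :]   # random init (the package defaults to :kmpp)
    cluster = zeros(Int, n)
    for _ in 1:maxiter
        # Assignment step: each observation goes to its nearest centroid.
        for i in 1:n
            dists = [sqrt(sum(abs2, X[i, :] .- centroids[k, :])) for k in 1:K]
            cluster[i] = argmin(dists)
        end
        # Update step: each centroid becomes the mean of its cluster's points.
        for k in 1:K
            members = X[cluster .== k, :]
            isempty(members) || (centroids[k, :] = vec(mean(members, dims = 1)))
        end
    end
    return centroids, cluster
end

Random.seed!(1)
X = vcat(randn(50, 2), randn(50, 2) .+ 5)   # two well-separated blobs
centroids, cluster = kmeans_sketch(X, 2)
```

A full implementation would also track the total variance within clusters after each update and stop early once it no longer improves.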
For a more detailed explanation of the algorithm, check the Algorithm's Overview of KMeans.
ClusterAnalysis.dbscan
— Function
dbscan(df, ϵ::Real, min_pts::Int)
Classify data observations into clusters and noise using a density concept derived from the input parameters (ϵ, min_pts).
The number of clusters is determined during execution of the model, so the user does not know beforehand how many clusters will be obtained. The algorithm uses the KDTree structure from NearestNeighbors.jl to compute the RangeQuery operation more efficiently.
For a more detailed explanation of the algorithm, check the Algorithm's Overview of DBSCAN.
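The RangeQuery operation mentioned above, which the KDTree accelerates, can be illustrated with a naive pure-Julia version (an O(n) scan for illustration only; `range_query` is a made-up name, not the package's API):

```julia
# Naive ϵ-neighborhood query: indices of all points within distance ϵ of X[i, :].
# In DBSCAN, a point is a core point when this neighborhood holds at least min_pts points.
function range_query(X::AbstractMatrix, i::Int, ϵ::Real)
    p = X[i, :]
    return [j for j in 1:size(X, 1) if sqrt(sum(abs2, X[j, :] .- p)) <= ϵ]
end

X = [0.0 0.0; 0.1 0.0; 5.0 5.0]
range_query(X, 1, 0.5)   # → [1, 2]: the point itself and its close neighbor
```

A KDTree answers the same query without scanning every point, which is why the package delegates it to NearestNeighbors.jl.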
ClusterAnalysis.euclidean
— Function
ClusterAnalysis.euclidean(a::AbstractVector, b::AbstractVector)
Calculate the Euclidean distance between two vectors: √∑(aᵢ - bᵢ)².
Arguments (positional)
- a: first vector.
- b: second vector.
Example
julia> using ClusterAnalysis
julia> a = rand(100); b = rand(100);
julia> ClusterAnalysis.euclidean(a, b)
3.8625780213774954
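The formula above maps directly onto a one-line sketch (`euclidean_sketch` is an illustrative equivalent, not the package's source):

```julia
# √∑(aᵢ - bᵢ)²: square the elementwise differences, sum, take the root.
euclidean_sketch(a, b) = sqrt(sum(abs2, a .- b))

euclidean_sketch([0.0, 0.0], [3.0, 4.0])   # → 5.0, the classic 3-4-5 triangle
```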
ClusterAnalysis.squared_error
— Function
ClusterAnalysis.squared_error(data::AbstractMatrix)
ClusterAnalysis.squared_error(col::AbstractVector)
Evaluate the k-means result using the Sum of Squared Errors (SSE).
Arguments (positional)
- data or col: matrix of data observations, or a vector representing one column of the data.
Example
julia> using ClusterAnalysis
julia> a = rand(100, 4);
julia> ClusterAnalysis.squared_error(a)
34.71086095943974
julia> ClusterAnalysis.squared_error(a[:, 1])
10.06029322934825
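Assuming the SSE here is the sum of squared deviations from the mean, computed per column and summed over columns (an assumption based on the method signatures, not confirmed from the source), the computation can be sketched as:

```julia
using Statistics

# Assumed definition: sum of squared deviations from the column mean.
sse(col::AbstractVector) = sum(abs2, col .- mean(col))
sse(data::AbstractMatrix) = sum(sse(data[:, j]) for j in 1:size(data, 2))

sse([1.0, 2.0, 3.0])   # → 2.0 (mean is 2; deviations -1, 0, 1)
```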
ClusterAnalysis.totalwithinss
— Function
ClusterAnalysis.totalwithinss(data::AbstractMatrix, K::Int, cluster::Vector)
Calculate the total variance within clusters using the squared_error() function.
Arguments (positional)
- data: matrix of data observations.
- K: number of clusters.
- cluster: vector with the cluster assignment of each data observation.
Example
julia> using ClusterAnalysis
julia> using CSV, DataFrames
julia> iris = CSV.read(joinpath(pwd(), "path/to/iris.csv"), DataFrame);
julia> df = iris[:, 1:end-1];
julia> model = kmeans(df, 3);
julia> ClusterAnalysis.totalwithinss(Matrix(df), model.K, model.cluster)
78.85144142614601
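Under the assumption that squared_error is the per-column sum of squared deviations from the mean (an assumption, not confirmed from the source), the total within-cluster variance is that quantity summed over the rows of each cluster. A self-contained sketch with illustrative names (`sse`, `totalwithinss_sketch`):

```julia
using Statistics

# Assumed per-column squared error, mirroring squared_error.
sse(col::AbstractVector) = sum(abs2, col .- mean(col))
sse(data::AbstractMatrix) = sum(sse(data[:, j]) for j in 1:size(data, 2))

# Total within-cluster variance: squared error summed over each cluster's rows.
totalwithinss_sketch(data, K, cluster) =
    sum(sse(data[cluster .== k, :]) for k in 1:K)

data = [0.0 0.0; 1.0 0.0; 10.0 0.0; 11.0 0.0]
totalwithinss_sketch(data, 2, [1, 1, 2, 2])   # → 1.0 (0.5 from each cluster)
```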