Documentation of ClusterAnalysis functions.
ClusterAnalysis.dbscan
ClusterAnalysis.euclidean
ClusterAnalysis.kmeans
ClusterAnalysis.squared_error
ClusterAnalysis.totalwithinss
ClusterAnalysis.kmeans — Function

kmeans(table, K::Int; nstart::Int = 10, maxiter::Int = 10, init::Symbol = :kmpp)
kmeans(data::AbstractMatrix, K::Int; nstart::Int = 10, maxiter::Int = 10, init::Symbol = :kmpp)

Classify all data observations into K clusters by minimizing the total variance within each cluster.

Arguments (positional)

- table or data: table or Matrix of data observations.
- K: number of clusters.

Keyword

- nstart: number of random starts.
- maxiter: maximum number of iterations.
- init: centroid initialization algorithm, :kmpp (default) or :random.
Example
julia> using ClusterAnalysis
julia> using CSV, DataFrames
julia> iris = CSV.read(joinpath(pwd(), "path/to/iris.csv"), DataFrame);
julia> df = iris[:, 1:end-1];
julia> model = kmeans(df, 3)
KmeansResult{Float64}:
K = 3
centroids = [
[5.932307692307693, 2.755384615384615, 4.42923076923077, 1.4384615384615382]
[5.006, 3.4279999999999995, 1.462, 0.24599999999999997]
[6.874285714285714, 3.088571428571429, 5.791428571428571, 2.117142857142857]
]
cluster = [2, 2, 2, 2, 2, 2, 2, 2, 2, 2 … 3, 3, 1, 3, 3, 3, 1, 3, 3, 1]
within-cluster sum of squares = 78.85144142614601
iterations = 7

Pseudo-code of the algorithm:

- Repeat nstart times:
  - Initialize K cluster centroids using the KMeans++ algorithm or random init.
  - Estimate the clusters.
  - Repeat maxiter times:
    - Update the centroids using mean().
    - Re-estimate the clusters.
    - Calculate the total variance within each cluster.
    - Evaluate the stop rule.
- Keep the best result (minimum total variance within cluster) of all nstart executions.
For a more detailed explanation of the algorithm, check the Algorithm's Overview of KMeans.
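The inner loop of the pseudo-code above can be sketched in plain Julia. This is a minimal illustration of Lloyd's iteration with random initialization, not the package implementation; empty-cluster handling, the KMeans++ init, and the stop rule are omitted for brevity:

```julia
using Statistics  # for mean

# Assign each row of X to its nearest centroid (squared Euclidean distance).
function assign_clusters(X, centroids)
    [argmin([sum((X[i, :] .- c) .^ 2) for c in centroids]) for i in 1:size(X, 1)]
end

# Recompute each centroid as the mean of its assigned rows.
# Note: a cluster left empty would produce NaN centroids; a real
# implementation must handle that case.
function update_centroids(X, cluster, K)
    [vec(mean(X[cluster .== k, :], dims = 1)) for k in 1:K]
end

# One run of Lloyd's iteration (a single "start" in the pseudo-code).
function lloyd_sketch(X, K; maxiter = 10)
    centroids = [X[i, :] for i in rand(1:size(X, 1), K)]  # random init
    cluster = assign_clusters(X, centroids)
    for _ in 1:maxiter
        centroids = update_centroids(X, cluster, K)
        cluster = assign_clusters(X, centroids)
    end
    return centroids, cluster
end

centroids, cluster = lloyd_sketch(rand(100, 4), 3)
```

Running this nstart times and keeping the result with the smallest total within-cluster variance gives the outer loop of the pseudo-code.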
ClusterAnalysis.dbscan — Function

dbscan(df, ϵ::Real, min_pts::Int)

Classify data observations into clusters and noise using a density concept derived from the input parameters (ϵ, min_pts).

The number of clusters is determined during execution of the model, so the user does not know in advance how many clusters will be obtained. The algorithm uses the KDTree structure from NearestNeighbors.jl to compute the RangeQuery operation more efficiently.

For a more detailed explanation of the algorithm, check the Algorithm's Overview of DBSCAN.
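A usage sketch following the same setup as the kmeans example above; the ϵ and min_pts values here are illustrative choices, not recommended defaults (both depend strongly on the data's scale and density):

```julia
julia> using ClusterAnalysis

julia> using CSV, DataFrames

julia> iris = CSV.read(joinpath(pwd(), "path/to/iris.csv"), DataFrame);

julia> df = iris[:, 1:end-1];

julia> model = dbscan(df, 0.5, 5);  # ϵ = 0.5, min_pts = 5
```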
ClusterAnalysis.euclidean — Function

ClusterAnalysis.euclidean(a::AbstractVector, b::AbstractVector)

Calculate the Euclidean distance between two vectors: √∑(aᵢ - bᵢ)².

Arguments (positional)

- a: first vector.
- b: second vector.
Example
julia> using ClusterAnalysis
julia> a = rand(100); b = rand(100);
julia> ClusterAnalysis.euclidean(a, b)
3.8625780213774954

ClusterAnalysis.squared_error — Function

ClusterAnalysis.squared_error(data::AbstractMatrix)
ClusterAnalysis.squared_error(col::AbstractVector)

Evaluate the k-means result using the Sum of Squared Errors (SSE).

Arguments (positional)

- data or col: Matrix of data observations, or a Vector representing one column of the data.
Example
julia> using ClusterAnalysis
julia> a = rand(100, 4);
julia> ClusterAnalysis.squared_error(a)
34.71086095943974
julia> ClusterAnalysis.squared_error(a[:, 1])
10.06029322934825

ClusterAnalysis.totalwithinss — Function

ClusterAnalysis.totalwithinss(data::AbstractMatrix, K::Int, cluster::Vector)

Calculate the total variance within each cluster using the squared_error() function.

Arguments (positional)

- data: Matrix of data observations.
- K: number of clusters.
- cluster: Vector with the cluster assignment of each data observation.
Example
julia> using ClusterAnalysis
julia> using CSV, DataFrames
julia> iris = CSV.read(joinpath(pwd(), "path/to/iris.csv"), DataFrame);
julia> df = iris[:, 1:end-1];
julia> model = kmeans(df, 3);
julia> ClusterAnalysis.totalwithinss(Matrix(df), model.K, model.cluster)
78.85144142614601
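The relation between squared_error and totalwithinss can be sketched with standalone helpers. This assumes SSE means the sum of squared deviations from the column means, summed over columns; it is an illustration of the definitions above, not the package code:

```julia
using Statistics  # for mean

# SSE of one column: sum of squared deviations from the column mean.
sse(col::AbstractVector) = sum((col .- mean(col)) .^ 2)

# SSE of a matrix: column-wise SSEs summed.
sse(data::AbstractMatrix) = sum(sse(data[:, j]) for j in 1:size(data, 2))

# Total within-cluster variance: SSE computed cluster by cluster and summed.
function totalwithinss_sketch(data::AbstractMatrix, K::Int, cluster::Vector)
    sum(sse(data[cluster .== k, :]) for k in 1:K)
end

X = rand(100, 4)
cluster = rand(1:3, 100)
# The within-cluster SSE never exceeds the total SSE of the data,
# since total SSE = within-cluster SSE + between-cluster SSE.
totalwithinss_sketch(X, 3, cluster) <= sse(X)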