kmeans
A C++ library for k-means
Loading...
Searching...
No Matches
C++ library for k-means

Unit tests Documentation stats comparison Codecov

Overview

This repository contains a header-only C++ library for k-means clustering. Initialization can be performed with user-supplied centers, random selection of points, weighted sampling with kmeans++ (Arthur and Vassilvitskii, 2007) or variance partitioning (Su and Dy, 2007). Refinement can be performed using the Hartigan-Wong approach or Lloyd's algorithm. The Hartigan-Wong implementation is derived from the Fortran code in the R stats package, heavily refactored for more idiomatic C++.

Quick start

kmeans is a header-only library, so it can be easily used by just #includeing the relevant source files:

#include "kmeans/Kmeans.hpp"
// assorted boilerplate here...
// Wrap your matrix in a SimpleMatrix.
kmeans::SimpleMatrix kmat(ndim, nobs, matrix.data());
auto res = kmeans::compute(
kmat,
kmeans::InitializeKmeanspp(), // initialize with kmeans++
kmeans::RefineLloyd(), // refine with Lloyd's algorithm
ncenters
);
res.centers; // Matrix of centroid coordinates, stored in column-major format
res.clusters; // Vector of cluster assignments
res.details; // Details from the clustering algorithm
// Compute the WCSS if we want it:
std::vector<double> wcss(ncenters);
kmat,
ncenters,
res.centers.data(),
res.clusters.data(),
wcss.data()
);
k-means++ initialization of Arthur and Vassilvitskii (2007).
Definition InitializeKmeanspp.hpp:146
Implements the Lloyd algorithm for k-means clustering.
Definition RefineLloyd.hpp:61
A simple matrix of observations.
Definition SimpleMatrix.hpp:22
Details< typename Matrix_::index_type > compute(const Matrix_ &data, const Initialize< Matrix_, Cluster_, Float_ > &initialize, const Refine< Matrix_, Cluster_, Float_ > &refine, Cluster_ num_centers, Float_ *centers, Cluster_ *clusters)
Definition kmeans.hpp:49
void compute_wcss(const Matrix_ &data, Cluster_ ncenters, const Float_ *centers, const Cluster_ *clusters, Float_ *wcss)
Definition compute_wcss.hpp:30

If we already allocated arrays for the centroids and clusters, we can fill the arrays directly:

std::vector<double> centers(ndim * ncenters);
std::vector<int> clusters(nobs);
auto deets = kmeans::compute(
kmat,
kmeans::InitializeRandom(), // random initialization
kmeans::RefineHartiganWong(), // refine with Hartigan-Wong
ncenters
centers.data(),
clusters.data()
);
Initialize by sampling random observations without replacement.
Definition InitializeRandom.hpp:40
Implements the Hartigan-Wong algorithm for k-means clustering.
Definition RefineHartiganWong.hpp:499

We can tune the clustering by passing options into the constructors of the relevant classes:

vp_opt.optimize_partition = false;
ll_opt.max_iterations = 10;
ll_opt.num_threads = 3;
auto res2 = kmeans::compute(kmat, pp, ll, ncenters);
Implements the variance partitioning method of Su and Dy (2007).
Definition InitializeVariancePartition.hpp:164
Options for InitializeVariancePartition.
Definition InitializeVariancePartition.hpp:22
bool optimize_partition
Definition InitializeVariancePartition.hpp:33
Options for RefineLloyd construction.
Definition RefineLloyd.hpp:25
int num_threads
Definition RefineLloyd.hpp:36
int max_iterations
Definition RefineLloyd.hpp:30

See the reference documentation for more details.

Building projects

CMake with FetchContent

If you're using CMake, you just need to add something like this to your CMakeLists.txt:

include(FetchContent)
FetchContent_Declare(
kmeans
GIT_REPOSITORY https://github.com/LTLA/CppKmeans
GIT_TAG master # or any version of interest
)
FetchContent_MakeAvailable(kmeans)

Then you can link to kmeans to make the headers available during compilation:

# For executables:
target_link_libraries(myexe ltla::kmeans)
# For libaries
target_link_libraries(mylib INTERFACE ltla::kmeans)

CMake with find_package()

To install the library, clone an appropriate version of this repository and run:

mkdir build && cd build
cmake .. -DKMEANS_TESTS=OFF
cmake --build . --target install

Then we can use find_package() as usual:

find_package(ltla_kmeans CONFIG REQUIRED)
target_link_libraries(mylib INTERFACE ltla::kmeans)

By default, this will use FetchContent to fetch all external dependencies (see extern/CMakeLists.txt for a list). If you want to install them manually, use -DKMEANS_FETCH_EXTERN=OFF.

Manual

If you're not using CMake, the simple approach is to just copy the files in include/ - either directly or with Git submodules - and include their path during compilation with, e.g., GCC's -I. This requires the external dependencies listed in extern/CMakeLists.txt, which also need to be made available during compilation.

References

Hartigan, J. A. and Wong, M. A. (1979). Algorithm AS 136: A K-means clustering algorithm. Applied Statistics 28, 100-108.

Arthur, D. and Vassilvitskii, S. (2007). k-means++: the advantages of careful seeding. Proceedings of the eighteenth annual ACM-SIAM symposium on Discrete algorithms, 1027-1035.

Su, T. and Dy, J. G. (2007). In Search of Deterministic Methods for Initializing K-Means and Gaussian Mixture Clustering, Intelligent Data Analysis 11, 319-338.