kmeans
A C++ library for k-means

Overview

This repository contains a header-only C++ library for k-means clustering. Initialization can be performed with user-supplied centers, random selection of points, weighted sampling with kmeans++ (Arthur and Vassilvitskii, 2007) or variance partitioning (Su and Dy, 2007). Refinement can be performed using the Hartigan-Wong approach or Lloyd's algorithm. The Hartigan-Wong implementation is derived from the Fortran code in the R stats package, heavily refactored for more idiomatic C++.

Quick start

kmeans is a header-only library, so it can be used by simply #include-ing the relevant headers and calling compute():

#include "kmeans/Kmeans.hpp"
int ndim = 5;
int nobs = 1000;
std::vector<double> matrix(ndim * nobs); // column-major ndim x nobs matrix of coordinates
// Wrap your matrix in a SimpleMatrix.
kmeans::SimpleMatrix<
    int, /* type for the column index */
    double /* type of the data */
> kmat(ndim, nobs, matrix.data());
int ncenters = 10;
auto res = kmeans::compute(
    kmat,
    // initialize with kmeans++
    kmeans::InitializeKmeanspp<
        /* column index type */ int,
        /* input matrix data type */ double,
        /* cluster ID type */ int,
        /* centroid type */ double
    >(),
    // refine with Lloyd's algorithm
    kmeans::RefineLloyd<
        /* column index type */ int,
        /* input matrix data type */ double,
        /* cluster ID type */ int,
        /* centroid type */ double
    >(),
    ncenters
);
res.centers; // Matrix of centroid coordinates, stored in column-major format
res.clusters; // Vector of cluster assignments
res.details; // Details from the clustering algorithm

See the reference documentation for more details.

Changing parameters

We can tune the clustering by passing options into the constructors of the relevant classes:

kmeans::InitializeVariancePartitionOptions vp_opt;
vp_opt.optimize_partition = false;
kmeans::InitializeVariancePartition<int, double, int, double> vp(vp_opt);

kmeans::RefineLloydOptions ll_opt;
ll_opt.max_iterations = 10;
ll_opt.num_threads = 3;
kmeans::RefineLloyd<int, double, int, double> ll(ll_opt);

auto res2 = kmeans::compute(kmat, vp, ll, ncenters);

The initialization and refinement classes can themselves be swapped at run-time via pointers to their respective interfaces. This design also allows the kmeans library to be easily extended with additional methods from third-party developers.

std::unique_ptr<kmeans::Initialize<int, double, int, double> > init_ptr;
if (init_method == "random") {
    init_ptr.reset(new kmeans::InitializeRandom<int, double, int, double>);
} else if (init_method == "kmeans++") {
    kmeans::InitializeKmeansppOptions opt;
    opt.seed = 42;
    init_ptr.reset(new kmeans::InitializeKmeanspp<int, double, int, double>(opt));
} else {
    // do something else
}

std::unique_ptr<kmeans::Refine<int, double, int, double> > ref_ptr;
if (ref_method == "lloyd") {
    kmeans::RefineLloydOptions opt;
    opt.max_iterations = 10;
    ref_ptr.reset(new kmeans::RefineLloyd<int, double, int, double>(opt));
} else {
    kmeans::RefineHartiganWongOptions opt;
    opt.max_iterations = 100;
    ref_ptr.reset(new kmeans::RefineHartiganWong<int, double, int, double>(opt));
}

auto res3 = kmeans::compute(kmat, *init_ptr, *ref_ptr, ncenters);

Template parameters can also be altered to control the input and output data types. As shown above, these should be set consistently for all classes used in compute(). While int and double are suitable for most cases, advanced users may wish to use other types. For example, we might consider the following parametrization for various reasons:

kmeans::InitializeKmeanspp<
    /* If our input data has too many observations to fit into an 'int', we
     * might need to use a 'size_t' instead.
     */
    size_t,
    /* Perhaps our input data is in single-precision floating point to save
     * space and to speed up processing.
     */
    float,
    /* If we know that we will never ask for more than 255 clusters, we can use
     * a smaller integer for the cluster IDs to save space.
     */
    uint8_t,
    /* We still want our centroids and distances to be computed in high
     * precision, even though the input data is only single precision.
     */
    double
> initpp;

Other bits and pieces

If we want the within-cluster sum of squares, this can be easily computed from the output of compute():

std::vector<double> wcss(ncenters);
kmeans::compute_wcss(
    kmat,
    ncenters,
    res.centers.data(),
    res.clusters.data(),
    wcss.data()
);

If we already allocated arrays for the centroids and clusters, we can fill the arrays directly. This allows us to skip a copy when interfacing with other languages that manage their own memory (e.g., R, Python).

std::vector<double> centers(ndim * ncenters);
std::vector<int> clusters(nobs);
auto deets = kmeans::compute(
    kmat,
    kmeans::InitializeRandom<int, double, int, double>(), // random initialization
    kmeans::RefineHartiganWong<int, double, int, double>(), // refine with Hartigan-Wong
    ncenters,
    centers.data(),
    clusters.data()
);

Building projects

CMake with FetchContent

If you're using CMake, you just need to add something like this to your CMakeLists.txt:

include(FetchContent)
FetchContent_Declare(
kmeans
GIT_REPOSITORY https://github.com/LTLA/CppKmeans
GIT_TAG master # or any version of interest
)
FetchContent_MakeAvailable(kmeans)

Then you can link to kmeans to make the headers available during compilation:

# For executables:
target_link_libraries(myexe ltla::kmeans)
# For libraries:
target_link_libraries(mylib INTERFACE ltla::kmeans)

CMake with find_package()

To install the library, clone an appropriate version of this repository and run:

mkdir build && cd build
cmake .. -DKMEANS_TESTS=OFF
cmake --build . --target install

Then we can use find_package() as usual:

find_package(ltla_kmeans CONFIG REQUIRED)
target_link_libraries(mylib INTERFACE ltla::kmeans)

By default, this will use FetchContent to fetch all external dependencies (see extern/CMakeLists.txt for a list). If you want to install them manually, use -DKMEANS_FETCH_EXTERN=OFF.

Manual

If you're not using CMake, the simplest approach is to copy the files in include/ - either directly or with Git submodules - and include their path during compilation with, e.g., GCC's -I. This requires the external dependencies listed in extern/CMakeLists.txt, which also need to be made available during compilation.
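For example, a compile command might look like the following. The directory layout here is hypothetical; the actual paths depend on where you place the headers and on the dependencies listed in extern/CMakeLists.txt.

```shell
# Assumed layout: this library's headers under extern/kmeans/include, with each
# external dependency's include/ directory alongside it.
g++ -std=c++17 -O2 \
    -Iextern/kmeans/include \
    -Iextern/some_dependency/include \
    main.cpp -o myexe
```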

References

Hartigan, J. A. and Wong, M. A. (1979). Algorithm AS 136: A K-means clustering algorithm. Applied Statistics 28, 100-108.

Arthur, D. and Vassilvitskii, S. (2007). k-means++: the advantages of careful seeding. Proceedings of the eighteenth annual ACM-SIAM symposium on Discrete algorithms, 1027-1035.

Su, T. and Dy, J. G. (2007). In Search of Deterministic Methods for Initializing K-Means and Gaussian Mixture Clustering, Intelligent Data Analysis 11, 319-338.