kmeans
A C++ library for k-means
|
This repository contains a header-only C++ library for k-means clustering. Initialization can be performed with user-supplied centers, random selection of points, weighted sampling with kmeans++ (Arthur and Vassilvitskii, 2007) or variance partitioning (Su and Dy, 2007). Refinement can be performed using the Hartigan-Wong approach or Lloyd's algorithm. The Hartigan-Wong implementation is derived from the Fortran code in the R stats package, heavily refactored for more idiomatic C++.
kmeans is a header-only library, so it can be easily used by just #include
ing the relevant source files and running compute()
:
See the reference documentation for more details.
We can tune the clustering by passing options into the constructors of the relevant classes:
The initialization and refinement classes can themselves be swapped at run-time via pointers to their respective interfaces. This design also allows the kmeans library to be easily extended to additional methods from third-party developers.
Template parameters can also be altered to control the input and output data types. As shown above, these should be set consistently for all classes used in compute()
. While int
and double
are suitable for most cases, advanced users may wish to use other types. For example, we might consider the following parametrization for various reasons:
If we want the within-cluster sum of squares, this can be easily computed from the output of compute()
:
If we already allocated arrays for the centroids and clusters, we can fill the arrays directly. This allows us to skip a copy when interfacing with other languages that manage their own memory (e.g., R, Python).
FetchContent
If you're using CMake, you just need to add something like this to your CMakeLists.txt
:
Then you can link to kmeans to make the headers available during compilation:
find_package()
To install the library, clone an appropriate version of this repository and run:
Then we can use find_package()
as usual:
By default, this will use FetchContent
to fetch all external dependencies (see extern/CMakeLists.txt
for a list). If you want to install them manually, use -DKMEANS_FETCH_EXTERN=OFF
.
If you're not using CMake, the simple approach is to just copy the files in include/
- either directly or with Git submodules - and include their path during compilation with, e.g., GCC's -I
. This requires the external dependencies listed in extern/CMakeLists.txt
, which also need to be made available during compilation.
Hartigan, J. A. and Wong, M. A. (1979). Algorithm AS 136: A K-means clustering algorithm. Applied Statistics 28, 100-108.
Arthur, D. and Vassilvitskii, S. (2007). k-means++: the advantages of careful seeding. Proceedings of the eighteenth annual ACM-SIAM symposium on Discrete algorithms, 1027-1035.
Su, T. and Dy, J. G. (2007). In Search of Deterministic Methods for Initializing K-Means and Gaussian Mixture Clustering, Intelligent Data Analysis 11, 319-338.