|
kmeans
k-means clustering in C++
|
This repository contains a header-only C++ library for k-means clustering. Initialization can be performed with user-supplied centers, random selection of points, weighted sampling with kmeans++ (Arthur and Vassilvitskii, 2007) or variance partitioning (Su and Dy, 2007). Refinement can be performed using the Hartigan-Wong method, Lloyd's algorithm or a custom mini-batch implementation. The Hartigan-Wong implementation is derived from the Fortran code in the R stats package, heavily refactored for more idiomatic C++.
kmeans is a header-only library, so it can be easily used by just #includeing the relevant source files and running compute():
See the reference documentation for more details.
We can tune the clustering by passing options into the constructors of the relevant classes:
The initialization and refinement classes can themselves be swapped at run-time via pointers to their respective interfaces. This design also allows the kmeans library to be easily extended to additional methods from third-party developers.
Template parameters can also be altered to control the input and output data types. As shown above, these should be set consistently for all classes used in compute(). While int and double are suitable for most cases, advanced users may wish to use other types. For example, we might consider the following parametrization for various reasons:
If we want the within-cluster sum of squares, this can be easily calculated from the output of compute():
If we already allocated arrays for the centroids and clusters, we can fill the arrays directly. This allows us to skip a copy when interfacing with other languages that manage their own memory (e.g., R, Python).
FetchContentIf you're using CMake, you just need to add something like this to your CMakeLists.txt:
Then you can link to kmeans to make the headers available during compilation:
By default, this will use FetchContent to fetch all external dependencies. Applications are advised to pin the versions of each dependency for stability - see extern/CMakeLists.txt for suggested versions. If you want to install them manually, use -DKMEANS_FETCH_EXTERN=OFF.
find_package()To install the library, clone an appropriate version of this repository and run:
Then we can use find_package() as usual:
Again, this will automatically acquire all its dependencies, see recommendations above.
If you're not using CMake, the simple approach is to just copy the files in include/ - either directly or with Git submodules - and include their path during compilation with, e.g., GCC's -I. This requires the external dependencies listed in extern/CMakeLists.txt.
Hartigan, J. A. and Wong, M. A. (1979). Algorithm AS 136: A K-means clustering algorithm. Applied Statistics 28, 100-108.
Arthur, D. and Vassilvitskii, S. (2007). k-means++: the advantages of careful seeding. Proceedings of the eighteenth annual ACM-SIAM symposium on Discrete algorithms, 1027-1035.
Su, T. and Dy, J. G. (2007). In Search of Deterministic Methods for Initializing K-Means and Gaussian Mixture Clustering, Intelligent Data Analysis 11, 319-338.