kmeans
A C++ library for k-means
|
This repository contains a header-only C++ library for k-means clustering. Initialization can be performed with user-supplied centers, random selection of points, weighted sampling with kmeans++ (Arthur and Vassilvitskii, 2007) or variance partitioning (Su and Dy, 2007). Refinement can be performed using the Hartigan-Wong approach or Lloyd's algorithm. The Hartigan-Wong implementation is derived from the Fortran code in the R stats package, heavily refactored for more idiomatic C++.
kmeans is a header-only library, so it can be easily used by just #include
ing the relevant source files:
If we already allocated arrays for the centroids and clusters, we can fill the arrays directly:
We can tune the clustering by passing options into the constructors of the relevant classes:
See the reference documentation for more details.
FetchContent
If you're using CMake, you just need to add something like this to your CMakeLists.txt
:
Then you can link to kmeans to make the headers available during compilation:
find_package()
To install the library, clone an appropriate version of this repository and run:
Then we can use find_package()
as usual:
By default, this will use FetchContent
to fetch all external dependencies (see extern/CMakeLists.txt
for a list). If you want to install them manually, use -DKMEANS_FETCH_EXTERN=OFF
.
If you're not using CMake, the simple approach is to just copy the files in include/
- either directly or with Git submodules - and include their path during compilation with, e.g., GCC's -I
. This requires the external dependencies listed in extern/CMakeLists.txt
, which also need to be made available during compilation.
Hartigan, J. A. and Wong, M. A. (1979). Algorithm AS 136: A K-means clustering algorithm. Applied Statistics 28, 100-108.
Arthur, D. and Vassilvitskii, S. (2007). k-means++: the advantages of careful seeding. Proceedings of the eighteenth annual ACM-SIAM symposium on Discrete algorithms, 1027-1035.
Su, T. and Dy, J. G. (2007). In Search of Deterministic Methods for Initializing K-Means and Gaussian Mixture Clustering, Intelligent Data Analysis 11, 319-338.