kmeans
k-means clustering in C++
|
Implements the Hartigan-Wong algorithm for k-means clustering. More...
#include <RefineHartiganWong.hpp>
Public Member Functions | |
RefineHartiganWong (RefineHartiganWongOptions options) | |
RefineHartiganWong ()=default | |
RefineHartiganWongOptions & | get_options () |
![]() | |
virtual Details< Index_ > | run (const Matrix_ &data, Cluster_ num_centers, Float_ *centers, Cluster_ *clusters) const =0 |
Implements the Hartigan-Wong algorithm for k-means clustering.
The Hartigan-Wong algorithm performs several iterations of transferring points between clusters, involving a computationally expensive "optimal transfer" step that checks each observation against each cluster to determine its best assignment, followed by a cheaper "quick transfer" step, which iterates between the best and second-best cluster choices for each observation. The latter enables rapid exploration of the local solution space without the unnecessary expense of repeatedly comparing each observation to all clusters. The choice of "best" cluster for each observation considers the gain/loss in the sum of squares when an observation moves between clusters, even accounting for the shift in the cluster centers after the transfer. The algorithm terminates when no observation wishes to transfer between clusters.
This implementation is derived from the Fortran code underlying the kmeans
function in the stats R package, which in turn is taken from Hartigan and Wong (1979).
In the Details::status
returned by run()
, the status code is either 0 (success), 2 (maximum optimal transfer iterations reached without convergence) or 4 (maximum quick transfer iterations reached without convergence, if RefineHartiganWongOptions::quit_on_quick_transfer_convergence_failure = true
). Previous versions of the library would report a status code of 1 upon encountering an empty cluster, but these are now just ignored.
In the Details::iterations
returned by run()
, the reported number of iterations is that of the optimal transfers.
Index_ | Integer type of the observation indices. This should be the same as the index type of Matrix_ . |
Data_ | Numeric type of the input dataset. This should be the same as the data type of Matrix_ . |
Cluster_ | Integer type of the cluster assignments. |
Float_ | Floating-point type of the centroids. This will also be used for the internal distance calculations. |
Matrix_ | Class satisfying the Matrix interface. |
|
inline |
options | Further options for the Hartigan-Wong algorithm. |
|
default |
Default constructor.
|
inline |
run()
.