Implements the mini-batch algorithm for k-means clustering.
The mini-batch approach is similar to Lloyd's algorithm in that it runs through a set of observations, assigns each to the closest centroid, updates the centroids and repeats. The key difference is that each iteration is performed with a random subset of observations (i.e., a "mini-batch"), instead of the full set of observations. This reduces computational time and memory usage at the cost of some solution quality.
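The sampling step described above can be sketched as follows. This is a minimal illustration, not the library's implementation: the function name `sample_mini_batch`, the use of `std::sample` (sampling without replacement) and the choice of `std::mt19937_64` are all assumptions made for the example.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <iterator>
#include <numeric>
#include <random>
#include <vector>

// Hypothetical helper (not part of the library): choose the indices of the
// observations that make up one mini-batch, sampling without replacement.
inline std::vector<std::size_t> sample_mini_batch(std::size_t num_obs, std::size_t batch_size, std::mt19937_64& rng) {
    // Materialize all indices for simplicity; a real implementation would
    // likely avoid this allocation when the dataset is large.
    std::vector<std::size_t> all(num_obs);
    std::iota(all.begin(), all.end(), static_cast<std::size_t>(0));

    std::vector<std::size_t> batch;
    batch.reserve(std::min(num_obs, batch_size));

    // std::sample copies at most min(batch_size, num_obs) elements.
    std::sample(all.begin(), all.end(), std::back_inserter(batch), batch_size, rng);
    assert(batch.size() == std::min(num_obs, batch_size));
    return batch;
}
```

Only the sampled observations are then assigned to their closest centroids in that iteration, which is where the savings in time and memory come from.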
The update procedure for a cluster's centroid adjusts its coordinates using the observations in the mini-batch that were assigned to that cluster. The resulting vector can be interpreted as the mean of all observations that have ever been sampled and assigned to that cluster, counting an observation once for each time it was sampled. Thus, the magnitude of the updates decreases in later iterations as the relative effect of newly sampled points is reduced. This ensures that the centroids stabilize after a sufficiently large number of iterations.
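To illustrate why the updates shrink over time, the sketch below folds one observation at a time into a centroid with a per-cluster learning rate of 1/count. The helper `update_centroid` and its signature are hypothetical and chosen only for this example; the library's internal update may be organized differently, though the running-mean interpretation is the same.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper (not part of the library): fold one sampled observation
// into a centroid so that the centroid stays equal to the running mean of
// every observation ever assigned to its cluster.
template<typename Float_>
void update_centroid(std::vector<Float_>& centroid, unsigned long long& count, const std::vector<Float_>& observation) {
    assert(centroid.size() == observation.size());

    ++count; // number of observations assigned to this cluster so far
    const Float_ rate = static_cast<Float_>(1) / static_cast<Float_>(count);

    for (std::size_t d = 0; d < centroid.size(); ++d) {
        // Equivalent to (1 - rate) * centroid[d] + rate * observation[d].
        centroid[d] += rate * (observation[d] - centroid[d]);
    }
}
```

Starting from a count of zero, feeding in (2, 4) and then (4, 8) leaves the centroid at (3, 6), the mean of the two points. As the count grows, the rate falls towards zero, so each new observation moves the centroid less.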
We may stop the algorithm before the maximum number of iterations if only a few observations are reassigned at each iteration. Specifically, every \(h\) iterations, we compute the proportion of sampled observations for each cluster in the past \(h\) mini-batches that were reassigned to/from that cluster. If this proportion is less than some threshold \(p\) for all clusters, we consider that the algorithm has converged.
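The convergence check can be sketched as below, assuming that per-cluster tallies have been accumulated over the past \(h\) mini-batches. The function `has_converged` and its parameter names are hypothetical. Treating a cluster with no sampled observations as unchanged is an assumption of this sketch, not documented behavior.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper (not part of the library).
// total_sampled[c]: observations sampled for cluster c in the past h mini-batches.
// total_changed[c]: how many of those were reassigned to or from cluster c.
// max_change_proportion: the threshold p.
inline bool has_converged(
    const std::vector<unsigned long long>& total_sampled,
    const std::vector<unsigned long long>& total_changed,
    double max_change_proportion)
{
    assert(total_sampled.size() == total_changed.size());

    for (std::size_t c = 0; c < total_sampled.size(); ++c) {
        if (total_sampled[c] == 0) {
            continue; // assumption: no samples means no evidence of change
        }
        // Compare counts rather than dividing, i.e. changed / sampled >= p.
        if (static_cast<double>(total_changed[c]) >= static_cast<double>(total_sampled[c]) * max_change_proportion) {
            return false; // at least one cluster is still changing too much
        }
    }
    return true; // the proportion is below p for every cluster
}
```

A caller would run this check every \(h\) iterations, stop if it returns true, and otherwise reset the tallies before continuing.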
In the Details::status returned by run(), the status code is either 0 (success) or 2 (maximum iterations reached without convergence). Previous versions of the library would report a status code of 1 upon encountering an empty cluster, but these are now just ignored.
Template Parameters
Matrix_ | Matrix type for the input data. This should satisfy the MockMatrix contract. |
Cluster_ | Integer type for the cluster assignments. |
Float_ | Floating-point type for the centroids. |