Processing math: 100%
kmeans
A C++ library for k-means
All Classes Namespaces Files Functions Variables Pages
kmeans::RefineMiniBatch< Index_, Data_, Cluster_, Float_, Matrix_ > Class Template Reference

Implements the mini-batch algorithm for k-means clustering. More...

#include <RefineMiniBatch.hpp>

Inheritance diagram for kmeans::RefineMiniBatch< Index_, Data_, Cluster_, Float_, Matrix_ >:
Collaboration diagram for kmeans::RefineMiniBatch< Index_, Data_, Cluster_, Float_, Matrix_ >:

Public Member Functions

 RefineMiniBatch (RefineMiniBatchOptions options)
 
 RefineMiniBatch ()=default
 
RefineMiniBatchOptionsget_options ()
 
- Public Member Functions inherited from kmeans::Refine< Index_, Data_, Cluster_, Float_, Matrix_ >
virtual Details< Index_ > run (const Matrix_ &data, Cluster_ num_centers, Float_ *centers, Cluster_ *clusters) const =0
 

Detailed Description

template<typename Index_, typename Data_, typename Cluster_, typename Float_, typename Matrix_ = Matrix<Index_, Data_>>
class kmeans::RefineMiniBatch< Index_, Data_, Cluster_, Float_, Matrix_ >

Implements the mini-batch algorithm for k-means clustering.

The mini-batch approach is similar to Lloyd's algorithm in that it runs through a set of observations, assigns each to the closest centroid, updates the centroids and repeats. The key difference is that each iteration is performed with a random subset of observations (i.e., a "mini-batch"), instead of the full set of observations. This reduces computational time and memory usage at the cost of some solution quality.

The update procedure for a cluster's centroid involves adjusting the coordinates by the assigned observations in the mini-batch. The resulting vector can be interpreted as the mean of all observations that have ever been sampled (possibly multiple times) to that cluster. Thus, the magnitude of the updates will decrease in later iterations as the relative effect of newly sampled points is reduced. This ensures that the centroids will stabilize at a sufficiently large number of iterations.

We may stop the algorithm before the maximum number of iterations if only a few observations are reassigned at each iteration. Specifically, every h iterations, we compute the proportion of sampled observations for each cluster in the past h mini-batches that were reassigned to/from that cluster. If this proportion is less than some threshold p for all clusters, we consider that the algorithm has converged.

In the Details::status returned by run(), the status code is either 0 (success) or 2 (maximum iterations reached without convergence). Previous versions of the library would report a status code of 1 upon encountering an empty cluster, but these are now just ignored.

Template Parameters
Index_Integer type for the observation indices.
Data_Numeric type for the data.
Cluster_Integer type for the cluster assignments.
Float_Floating-point type for the centroids. This will also be used for any internal distance calculations.
Matrix_Type for the input data matrix. This should satisfy the Matrix interface.

Constructor & Destructor Documentation

◆ RefineMiniBatch() [1/2]

template<typename Index_ , typename Data_ , typename Cluster_ , typename Float_ , typename Matrix_ = Matrix<Index_, Data_>>
kmeans::RefineMiniBatch< Index_, Data_, Cluster_, Float_, Matrix_ >::RefineMiniBatch ( RefineMiniBatchOptions options)
inline
Parameters
optionsFurther options for the mini-batch algorithm.

◆ RefineMiniBatch() [2/2]

template<typename Index_ , typename Data_ , typename Cluster_ , typename Float_ , typename Matrix_ = Matrix<Index_, Data_>>
kmeans::RefineMiniBatch< Index_, Data_, Cluster_, Float_, Matrix_ >::RefineMiniBatch ( )
default

Default constructor.

Member Function Documentation

◆ get_options()

template<typename Index_ , typename Data_ , typename Cluster_ , typename Float_ , typename Matrix_ = Matrix<Index_, Data_>>
RefineMiniBatchOptions & kmeans::RefineMiniBatch< Index_, Data_, Cluster_, Float_, Matrix_ >::get_options ( )
inline
Returns
Options for mini-batch partitioning, to be modified prior to calling run().

The documentation for this class was generated from the following file: