kmeans
A C++ library for k-means
kmeans::RefineMiniBatch< Matrix_, Cluster_, Float_ > Class Template Reference

Implements the mini-batch algorithm for k-means clustering.

#include <RefineMiniBatch.hpp>

Public Member Functions

 RefineMiniBatch (RefineMiniBatchOptions options)
 
 RefineMiniBatch ()=default
 
RefineMiniBatchOptions & get_options ()
 
Details< typename Matrix_::index_type > run (const Matrix_ &data, Cluster_ ncenters, Float_ *centers, Cluster_ *clusters) const
 

Detailed Description

template<typename Matrix_ = SimpleMatrix<double, int>, typename Cluster_ = int, typename Float_ = double>
class kmeans::RefineMiniBatch< Matrix_, Cluster_, Float_ >

Implements the mini-batch algorithm for k-means clustering.

The mini-batch approach is similar to Lloyd's algorithm in that it runs through a set of observations, assigns each to the closest centroid, updates the centroids and repeats. The key difference is that each iteration is performed with a random subset of observations (i.e., a "mini-batch"), instead of the full set of observations. This reduces computational time and memory usage at the cost of some solution quality.
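The sampling and assignment steps described above can be sketched as follows. This is only an illustration and not the library's implementation: `closest_center` and `sample_mini_batch` are hypothetical helper names, the centroids are assumed to be stored column-major, and the full shuffle is chosen for simplicity rather than speed.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <limits>
#include <numeric>
#include <random>
#include <vector>

// Hypothetical helper: index of the centroid closest to one observation.
// 'centers' is assumed to be column-major, with 'ndim' values per centroid.
inline int closest_center(const double* obs, const std::vector<double>& centers, int ndim, int ncenters) {
    assert(ndim > 0 && ncenters > 0);
    int best = 0;
    double best_dist = std::numeric_limits<double>::infinity();
    for (int c = 0; c < ncenters; ++c) {
        double dist = 0;
        for (int d = 0; d < ndim; ++d) {
            const double delta = obs[d] - centers[static_cast<std::size_t>(c) * ndim + d];
            dist += delta * delta; // squared Euclidean distance
        }
        if (dist < best_dist) {
            best_dist = dist;
            best = c;
        }
    }
    return best;
}

// Hypothetical helper: draw up to 'batch_size' distinct observation indices
// from [0, nobs) by shuffling all indices and keeping the first few.
inline std::vector<int> sample_mini_batch(int nobs, int batch_size, std::mt19937_64& rng) {
    assert(nobs >= 0 && batch_size >= 0);
    std::vector<int> indices(static_cast<std::size_t>(nobs));
    std::iota(indices.begin(), indices.end(), 0);
    std::shuffle(indices.begin(), indices.end(), rng);
    indices.resize(static_cast<std::size_t>(std::min(nobs, batch_size)));
    return indices;
}
```

Each iteration would then call `closest_center` only for the observations returned by `sample_mini_batch`, rather than for every observation as in Lloyd's algorithm.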

To update a cluster's centroid, its coordinates are adjusted using the observations in the mini-batch that were assigned to that cluster. The resulting vector can be interpreted as the mean of all observations that have ever been sampled and assigned to that cluster, where an observation is counted once for each time it was sampled. Thus, the magnitude of the updates decreases in later iterations as the relative effect of newly sampled points shrinks. This ensures that the centroids stabilize given a sufficiently large number of iterations.
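A running mean is one way to obtain this behavior. The sketch below assumes a per-cluster counter and a learning rate equal to the reciprocal of that counter; `RunningCentroid` and `add_observation` are hypothetical names, and the library's own update may be organized differently.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical per-cluster state for a running-mean update.
struct RunningCentroid {
    std::vector<double> center;   // current coordinates, one per dimension
    unsigned long long count = 0; // observations assigned so far, across all mini-batches
};

// Move the centroid toward 'obs' with a step of 1/count. After n calls,
// 'center' equals the mean of the n observations, so the initial
// coordinates are overwritten by the first call (where the step is 1).
inline void add_observation(RunningCentroid& state, const std::vector<double>& obs) {
    assert(obs.size() == state.center.size());
    ++state.count;
    const double rate = 1.0 / static_cast<double>(state.count);
    for (std::size_t d = 0; d < obs.size(); ++d) {
        state.center[d] += rate * (obs[d] - state.center[d]);
    }
}
```

Because the step is `1 / count`, an observation added to a cluster that has already absorbed 999 observations moves the centroid by only one thousandth of the distance between them, which is why later updates are smaller.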

We may stop the algorithm before the maximum number of iterations if only a few observations are reassigned at each iteration. Specifically, every \(h\) iterations, we compute, for each cluster, the proportion of observations sampled in the past \(h\) mini-batches that were reassigned to or from that cluster. If this proportion is less than some threshold \(p\) for all clusters, we consider the algorithm to have converged.
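One plausible way to keep the counts needed for this check is sketched below. The bookkeeping is an assumption made for illustration: `ReassignmentTracker` is a hypothetical type, a reassigned observation is counted against both its old and new cluster, and clusters with no sampled observations are skipped. The library's exact accounting may differ.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical tracker for the convergence check, reset every h iterations.
struct ReassignmentTracker {
    std::vector<unsigned long long> sampled; // sampled observations touching each cluster
    std::vector<unsigned long long> changed; // of those, how many were reassigned

    explicit ReassignmentTracker(std::size_t nclusters) : sampled(nclusters, 0), changed(nclusters, 0) {}

    // Record one sampled observation whose assignment went from 'previous' to 'current'.
    void record(std::size_t previous, std::size_t current) {
        assert(previous < sampled.size() && current < sampled.size());
        ++sampled[previous];
        if (previous != current) {
            ++sampled[current];
            ++changed[previous]; // reassigned from this cluster
            ++changed[current];  // reassigned to this cluster
        }
    }

    // True if every cluster's reassigned proportion is below the threshold p.
    bool converged(double p) const {
        for (std::size_t k = 0; k < sampled.size(); ++k) {
            if (sampled[k] == 0) {
                continue; // assumption: an unsampled cluster does not block convergence
            }
            if (static_cast<double>(changed[k]) >= p * static_cast<double>(sampled[k])) {
                return false;
            }
        }
        return true;
    }

    // Clear the counts before the next window of h mini-batches.
    void reset() {
        sampled.assign(sampled.size(), 0);
        changed.assign(changed.size(), 0);
    }
};
```

In use, `record` would be called for every observation in each mini-batch, and `converged` followed by `reset` would be called once every \(h\) iterations.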

In the Details::status returned by run(), the status code is either 0 (success) or 2 (maximum iterations reached without convergence). Previous versions of the library would report a status code of 1 upon encountering an empty cluster, but empty clusters are now simply ignored.

Template Parameters
Matrix_     Matrix type for the input data. This should satisfy the MockMatrix contract.
Cluster_    Integer type for the cluster assignments.
Float_      Floating-point type for the centroids.

Constructor & Destructor Documentation

◆ RefineMiniBatch() [1/2]

template<typename Matrix_ = SimpleMatrix<double, int>, typename Cluster_ = int, typename Float_ = double>
kmeans::RefineMiniBatch< Matrix_, Cluster_, Float_ >::RefineMiniBatch ( RefineMiniBatchOptions  options)
inline
Parameters
options     Further options for the mini-batch algorithm.

◆ RefineMiniBatch() [2/2]

template<typename Matrix_ = SimpleMatrix<double, int>, typename Cluster_ = int, typename Float_ = double>
kmeans::RefineMiniBatch< Matrix_, Cluster_, Float_ >::RefineMiniBatch ( )
default

Default constructor.

Member Function Documentation

◆ get_options()

template<typename Matrix_ = SimpleMatrix<double, int>, typename Cluster_ = int, typename Float_ = double>
RefineMiniBatchOptions & kmeans::RefineMiniBatch< Matrix_, Cluster_, Float_ >::get_options ( )
inline
Returns
Options for mini-batch partitioning, to be modified prior to calling run().

◆ run()

template<typename Matrix_ = SimpleMatrix<double, int>, typename Cluster_ = int, typename Float_ = double>
Details< typename Matrix_::index_type > kmeans::RefineMiniBatch< Matrix_, Cluster_, Float_ >::run ( const Matrix_ &  data,
Cluster_  num_centers,
Float_ *  centers,
Cluster_ *  clusters 
) const
inline virtual
Parameters
data                A matrix-like object (see MockMatrix) containing per-observation data.
num_centers         Number of cluster centers.
[in,out] centers    Pointer to an array of length equal to the product of num_centers and data.num_dimensions(). This contains a column-major matrix where rows correspond to dimensions and columns correspond to cluster centers. On input, each column should contain the initial centroid location for its cluster. On output, each column will contain the final centroid location for its cluster.
[out] clusters      Pointer to an array of length equal to the number of observations (from data.num_observations()). On output, this will contain the cluster assignment for each observation.
Returns
centers and clusters are filled, and a Details object is returned containing clustering statistics. If num_centers is greater than data.num_observations(), only the first data.num_observations() columns of the centers array will be filled.
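The buffer shapes described for centers and clusters can be illustrated with a small sketch. `ClusterBuffers` is a hypothetical helper and not part of the library. It only shows how the arrays might be sized and how the column-major layout maps a (dimension, center) pair to an array offset, assuming `double` centroids and `int` assignments as in the default template arguments.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical holder for the two arrays that run() reads and writes.
struct ClusterBuffers {
    int ndim;
    int ncenters;
    std::vector<double> centers; // ndim * ncenters values, column-major
    std::vector<int> clusters;   // one assignment per observation

    ClusterBuffers(int ndim_, int ncenters_, int nobs_) :
        ndim(ndim_),
        ncenters(ncenters_),
        centers(static_cast<std::size_t>(ndim_) * static_cast<std::size_t>(ncenters_)),
        clusters(static_cast<std::size_t>(nobs_))
    {}

    // Coordinate 'dim' of centroid 'center': each centroid occupies one
    // contiguous column of 'ndim' values.
    double& center_coord(int center, int dim) {
        assert(center >= 0 && center < ncenters && dim >= 0 && dim < ndim);
        return centers[static_cast<std::size_t>(center) * static_cast<std::size_t>(ndim) + static_cast<std::size_t>(dim)];
    }
};
// centers.data() and clusters.data() would be the pointers passed as the
// 'centers' and 'clusters' arguments, after filling 'centers' with the
// initial centroid locations.
```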

Implements kmeans::Refine< Matrix_, Cluster_, Float_ >.

