kmeans
k-means clustering in C++
Loading...
Searching...
No Matches
kmeans Namespace Reference

Perform k-means clustering. More...

Classes

class  ConsecutiveAccessExtractor
 Extractor for accessing consecutive observations. More...
 
struct  Details
 Additional statistics from the k-means algorithm. More...
 
class  IndexedAccessExtractor
 Extractor for accessing indexed observations. More...
 
class  Initialize
 Interface for k-means initialization algorithms. More...
 
class  InitializeKmeanspp
 k-means++ initialization of Arthur and Vassilvitskii (2007). More...
 
struct  InitializeKmeansppOptions
 Options for InitializeKmeanspp. More...
 
class  InitializeNone
 No-op "initialization" with existing cluster centers. More...
 
class  InitializeRandom
 Initialize by sampling random observations without replacement. More...
 
struct  InitializeRandomOptions
 Options for InitializeRandom. More...
 
class  InitializeVariancePartition
 Implements the variance partitioning method of Su and Dy (2007). More...
 
struct  InitializeVariancePartitionOptions
 Options for InitializeVariancePartition. More...
 
class  Matrix
 Interface for matrix data. More...
 
class  RandomAccessExtractor
 Extractor for accessing random observations. More...
 
class  Refine
 Interface for k-means refinement algorithms. More...
 
class  RefineHartiganWong
 Implements the Hartigan-Wong algorithm for k-means clustering. More...
 
struct  RefineHartiganWongOptions
 Options for RefineHartiganWong. More...
 
class  RefineLloyd
 Implements the Lloyd algorithm for k-means clustering. More...
 
struct  RefineLloydOptions
 Options for RefineLloyd. More...
 
class  RefineMiniBatch
 Implements the mini-batch algorithm for k-means clustering. More...
 
struct  RefineMiniBatchOptions
 Options for RefineMiniBatch. More...
 
struct  Results
 Results of the k-means clustering. More...
 
class  SimpleMatrix
 A simple matrix of observations. More...
 

Typedefs

typedef std::mt19937_64 InitializeKmeansppRng
 
typedef std::mt19937_64 InitializeRandomRng
 
typedef std::mt19937_64 RefineMiniBatchRng
 

Functions

template<class Matrix_ , typename Cluster_ , typename Float_ >
void compute_wcss (const Matrix_ &data, const Cluster_ num_centers, const Float_ *const centers, const Cluster_ *const clusters, Float_ *const wcss)
 
template<typename Index_ , typename Data_ , typename Cluster_ , typename Float_ , class Matrix_ = Matrix<Index_, Data_>>
Details< Index_ > compute (const Matrix_ &data, const Initialize< Index_, Data_, Cluster_, Float_, Matrix_ > &initialize, const Refine< Index_, Data_, Cluster_, Float_, Matrix_ > &refine, Cluster_ num_centers, Float_ *centers, Cluster_ *clusters)
 
template<typename Index_ , typename Data_ , typename Cluster_ , typename Float_ >
Details< Index_ > compute (const Matrix< Index_, Data_ > &data, const Initialize< Index_, Data_, Cluster_, Float_, Matrix< Index_, Data_ > > &initialize, const Refine< Index_, Data_, Cluster_, Float_, Matrix< Index_, Data_ > > &refine, const Cluster_ num_centers, Float_ *const centers, Cluster_ *const clusters)
 
template<typename Index_ , typename Data_ , typename Cluster_ , typename Float_ , class Matrix_ = Matrix<Index_, Data_>>
Results< Index_, Cluster_, Float_ > compute (const Matrix_ &data, const Initialize< Index_, Data_, Cluster_, Float_, Matrix_ > &initialize, const Refine< Index_, Data_, Cluster_, Float_, Matrix_ > &refine, const Cluster_ num_centers)
 
template<typename Index_ , typename Data_ , typename Cluster_ , typename Float_ >
Results< Index_, Cluster_, Float_ > compute (const Matrix< Index_, Data_ > &data, const Initialize< Index_, Data_, Cluster_, Float_, Matrix< Index_, Data_ > > &initialize, const Refine< Index_, Data_, Cluster_, Float_, Matrix< Index_, Data_ > > &refine, const Cluster_ num_centers)
 
template<typename Task_ , class Run_ >
void parallelize (const int num_workers, const Task_ num_tasks, Run_ run_task_range)
 
template<typename Index_ , typename Cluster_ , typename Float_ >
Cluster_ remove_unused_centers (const std::size_t num_dimensions, const Index_ num_observations, Cluster_ *const clusters, const Cluster_ num_centers, Float_ *const centers, std::vector< Index_ > &sizes)
 

Detailed Description

Perform k-means clustering.

Typedef Documentation

◆ InitializeKmeansppRng

typedef std::mt19937_64 kmeans::InitializeKmeansppRng

Type of the pseudo-random number generator for InitializeKmeanspp.

◆ InitializeRandomRng

typedef std::mt19937_64 kmeans::InitializeRandomRng

Type of the pseudo-random number generator for InitializeRandom.

◆ RefineMiniBatchRng

typedef std::mt19937_64 kmeans::RefineMiniBatchRng

Type of the pseudo-random number generator for RefineMiniBatch.

Function Documentation

◆ compute() [1/4]

template<typename Index_ , typename Data_ , typename Cluster_ , typename Float_ >
Results< Index_, Cluster_, Float_ > kmeans::compute ( const Matrix< Index_, Data_ > & data,
const Initialize< Index_, Data_, Cluster_, Float_, Matrix< Index_, Data_ > > & initialize,
const Refine< Index_, Data_, Cluster_, Float_, Matrix< Index_, Data_ > > & refine,
const Cluster_ num_centers )

Overload of compute() to assist template deduction for the default Matrix. This allocates and returns the vectors for the centroids and cluster assignments.

Template Parameters
Index_Integer type of the observation indices.
Data_Numeric type of the input dataset.
Cluster_Integer type of the cluster assignments.
Float_Floating-point type of the centroids.
Parameters
dataA matrix containing data for each observation.
initializeInitialization method to use.
refineRefinement method to use.
num_centersNumber of cluster centers.
Returns
Results of the clustering, including the centroid locations and cluster assignments.

◆ compute() [2/4]

template<typename Index_ , typename Data_ , typename Cluster_ , typename Float_ >
Details< Index_ > kmeans::compute ( const Matrix< Index_, Data_ > & data,
const Initialize< Index_, Data_, Cluster_, Float_, Matrix< Index_, Data_ > > & initialize,
const Refine< Index_, Data_, Cluster_, Float_, Matrix< Index_, Data_ > > & refine,
const Cluster_ num_centers,
Float_ *const centers,
Cluster_ *const clusters )

Overload of compute() to assist template deduction for the default Matrix.

Template Parameters
Index_Integer type of the observation indices.
Data_Numeric type of the input dataset.
Cluster_Integer type of the cluster assignments.
Float_Floating-point type of the centroids.
Parameters
dataA matrix containing data for each observation.
initializeInitialization method to use.
refineRefinement method to use.
num_centersNumber of cluster centers.
[out]centersPointer to an array of length equal to the product of num_centers and data.num_dimensions(). This contains a column-major matrix where rows correspond to dimensions and columns correspond to cluster centers. On output, each column should contain the initial centroid location for its cluster.
[in]clustersPointer to an array of length equal to the number of observations (from data.num_observations()). On output, this will contain the 0-based cluster assignment for each observation, where each entry is less than num_centers.
Returns
Details of the clustering, including the size of each cluster and the status of the algorithm.

◆ compute() [3/4]

template<typename Index_ , typename Data_ , typename Cluster_ , typename Float_ , class Matrix_ = Matrix<Index_, Data_>>
Details< Index_ > kmeans::compute ( const Matrix_ & data,
const Initialize< Index_, Data_, Cluster_, Float_, Matrix_ > & initialize,
const Refine< Index_, Data_, Cluster_, Float_, Matrix_ > & refine,
Cluster_ num_centers,
Float_ * centers,
Cluster_ * clusters )
Template Parameters
Index_Integer type of the observation indices. This should be the same as the index type of Matrix_.
Data_Numeric type of the input dataset. This should be the same as the data type of Matrix_.
Cluster_Integer type of the cluster assignments.
Float_Floating-point type of the centroids.
Matrix_Class satisfying the Matrix interface.
Parameters
dataA matrix containing data for each observation.
initializeInitialization method to use.
refineRefinement method to use.
num_centersNumber of cluster centers.
[out]centersPointer to an array of length equal to the product of num_centers and data.num_dimensions(). This contains a column-major matrix where rows correspond to dimensions and columns correspond to cluster centers. On output, each column should contain the initial centroid location for its cluster.
[in]clustersPointer to an array of length equal to the number of observations (from data.num_observations()). On output, this will contain the 0-based cluster assignment for each observation, where each entry is less than num_centers.
Returns
Details of the clustering, including the size of each cluster and the status of the algorithm.

◆ compute() [4/4]

template<typename Index_ , typename Data_ , typename Cluster_ , typename Float_ , class Matrix_ = Matrix<Index_, Data_>>
Results< Index_, Cluster_, Float_ > kmeans::compute ( const Matrix_ & data,
const Initialize< Index_, Data_, Cluster_, Float_, Matrix_ > & initialize,
const Refine< Index_, Data_, Cluster_, Float_, Matrix_ > & refine,
const Cluster_ num_centers )

Overload of compute() that allocates and returns the vectors for the centroids and cluster assignments.

Template Parameters
Index_Integer type of the observation indices. This should be the same as the index type of Matrix_.
Data_Numeric type of the input dataset. This should be the same as the data type of Matrix_.
Cluster_Integer type of the cluster assignments.
Float_Floating-point type of the centroids.
Matrix_Class satisfying the Matrix interface.
Parameters
dataA matrix containing data for each observation.
initializeInitialization method to use.
refineRefinement method to use.
num_centersNumber of cluster centers.
Returns
Results of the clustering, including the centroid locations and cluster assignments.

◆ compute_wcss()

template<class Matrix_ , typename Cluster_ , typename Float_ >
void kmeans::compute_wcss ( const Matrix_ & data,
const Cluster_ num_centers,
const Float_ *const centers,
const Cluster_ *const clusters,
Float_ *const wcss )
Template Parameters
Matrix_Class satisfying the Matrix interface.
Cluster_Integer type of the cluster assignments.
Float_Floating-point type of the centers and output.
Parameters
dataA matrix containing data for each observation.
num_centersNumber of cluster centers.
[in]centersPointer to an array of length equal to the product of num_centers and data.num_dimensions(). This contains a column-major matrix where rows correspond to dimensions and columns correspond to cluster centers. The j-th column should contain the initial centroid location for cluster j.
[in]clustersPointer to an array of length equal to the number of observations (from data.num_observations()). This should contain the 0-based cluster assignment for each observation. Specifically, each entry is an index that refers to a column of centers and is no greater than num_centers.
[out]wcssPointer to an array of length equal to num_centers. On output, this will contain the within-cluster sum of squares for each cluster.

◆ parallelize()

template<typename Task_ , class Run_ >
void kmeans::parallelize ( const int num_workers,
const Task_ num_tasks,
Run_ run_task_range )
Template Parameters
Task_Integer type of the number of tasks.
Run_Function to execute a range of tasks.
Parameters
num_workersNumber of workers.
num_tasksNumber of tasks.
run_task_rangeFunction to iterate over a range of tasks within a worker.

By default, this is an alias to subpar::parallelize_range(). However, if the KMEANS_CUSTOM_PARALLEL function-like macro is defined, it is called instead. Any user-defined macro should accept the same arguments as subpar::parallelize_range().

◆ remove_unused_centers()

template<typename Index_ , typename Cluster_ , typename Float_ >
Cluster_ kmeans::remove_unused_centers ( const std::size_t num_dimensions,
const Index_ num_observations,
Cluster_ *const clusters,
const Cluster_ num_centers,
Float_ *const centers,
std::vector< Index_ > & sizes )

Remove unused k-means centroids from the centers array filled by Refine::run(). Specifically, clusters are relabelled so that all empty clusters have higher cluster indices in clusters than the non-empty clusters. On output, clusters will contain all and only integers in [0, N) where N is the number of non-empty clusters. This ensures that downstream applications do not waste time processing num_centers clusters when some of them are empty.

Template Parameters
Index_Integer type of the observation indices.
Cluster_Integer type of the cluster assignments.
Float_Floating-point type of the centroids.
Parameters
num_dimensionsNumber of dimensions.
num_observationsNumber of observations.
[in,out]clustersPointer to an array of length num_observations. On input, clusters[i] should contain the index of the cluster assigned to observation i, typically from Refine::run(). On output, clusters[i] contains a possibly-remapped index that will be less than the number of non-empty clusters (returned by this function).
num_centersNumber of cluster centers.
[in,out]centersPointer to an array of length equal to the product of num_dimensions and num_centers. On input, this should contain a column-major matrix where rows correspond to dimensions and columns correspond to cluster centers, i.e., the j-th column defines the centroid for cluster j, which may be present in clusters. On output, the columns are rearranged to match the remapped indices in clusters. Empty clusters are assigned all-zero coordinates.
sizesVector of length equal to num_centers, containing the number of observations in each cluster. On input, the j-th value should be equal to the frequency of cluster j in clusters, typically from Details::sizes. On output, the values may be rearranged to match the remapped indices in clusters.
Returns
Number of non-empty clusters. If this is equal to num_centers, this function is a no-op.