kmeans
k-means clustering in C++
Loading...
Searching...
No Matches
kmeans Namespace Reference

Perform k-means clustering. More...

Classes

class  ConsecutiveAccessExtractor
 Extractor for accessing consecutive observations. More...
 
struct  Details
 Additional statistics from the k-means algorithm. More...
 
class  IndexedAccessExtractor
 Extractor for accessing indexed observations. More...
 
class  Initialize
 Interface for k-means initialization algorithms. More...
 
class  InitializeKmeanspp
 k-means++ initialization of Arthur and Vassilvitskii (2007). More...
 
struct  InitializeKmeansppOptions
 Options for InitializeKmeanspp. More...
 
class  InitializeNone
 No-op "initialization" with existing cluster centers. More...
 
class  InitializeRandom
 Initialize by sampling random observations without replacement. More...
 
struct  InitializeRandomOptions
 Options for InitializeRandom. More...
 
class  InitializeVariancePartition
 Implements the variance partitioning method of Su and Dy (2007). More...
 
struct  InitializeVariancePartitionOptions
 Options for InitializeVariancePartition. More...
 
class  Matrix
 Interface for matrix data. More...
 
class  RandomAccessExtractor
 Extractor for accessing random observations. More...
 
class  Refine
 Interface for k-means refinement algorithms. More...
 
class  RefineHartiganWong
 Implements the Hartigan-Wong algorithm for k-means clustering. More...
 
struct  RefineHartiganWongOptions
 Options for RefineHartiganWong. More...
 
class  RefineLloyd
 Implements the Lloyd algorithm for k-means clustering. More...
 
struct  RefineLloydOptions
 Options for RefineLloyd. More...
 
class  RefineMiniBatch
 Implements the mini-batch algorithm for k-means clustering. More...
 
struct  RefineMiniBatchOptions
 Options for RefineMiniBatch. More...
 
struct  Results
 Results of the k-means clustering. More...
 
class  SimpleMatrix
 A simple matrix of observations. More...
 

Typedefs

typedef std::mt19937_64 InitializeKmeansppRng
 
typedef std::mt19937_64 InitializeRandomRng
 
typedef std::mt19937_64 RefineMiniBatchRng
 

Functions

template<class Matrix_ , typename Cluster_ , typename Float_ >
void compute_wcss (const Matrix_ &data, const Cluster_ num_centers, const Float_ *const centers, const Cluster_ *const clusters, Float_ *const wcss)
 
template<typename Index_ , typename Data_ , typename Cluster_ , typename Float_ , class Matrix_ = Matrix<Index_, Data_>>
Details< Index_ > compute (const Matrix_ &data, const Initialize< Index_, Data_, Cluster_, Float_, Matrix_ > &initialize, const Refine< Index_, Data_, Cluster_, Float_, Matrix_ > &refine, Cluster_ num_centers, Float_ *centers, Cluster_ *clusters)
 
template<typename Index_ , typename Data_ , typename Cluster_ , typename Float_ >
Details< Index_ > compute (const Matrix< Index_, Data_ > &data, const Initialize< Index_, Data_, Cluster_, Float_, Matrix< Index_, Data_ > > &initialize, const Refine< Index_, Data_, Cluster_, Float_, Matrix< Index_, Data_ > > &refine, const Cluster_ num_centers, Float_ *const centers, Cluster_ *const clusters)
 
template<typename Index_ , typename Data_ , typename Cluster_ , typename Float_ , class Matrix_ = Matrix<Index_, Data_>>
Results< Index_, Cluster_, Float_ > compute (const Matrix_ &data, const Initialize< Index_, Data_, Cluster_, Float_, Matrix_ > &initialize, const Refine< Index_, Data_, Cluster_, Float_, Matrix_ > &refine, const Cluster_ num_centers)
 
template<typename Index_ , typename Data_ , typename Cluster_ , typename Float_ >
Results< Index_, Cluster_, Float_ > compute (const Matrix< Index_, Data_ > &data, const Initialize< Index_, Data_, Cluster_, Float_, Matrix< Index_, Data_ > > &initialize, const Refine< Index_, Data_, Cluster_, Float_, Matrix< Index_, Data_ > > &refine, const Cluster_ num_centers)
 
template<typename Task_ , class Run_ >
void parallelize (const int num_workers, const Task_ num_tasks, Run_ run_task_range)
 

Detailed Description

Perform k-means clustering.

Typedef Documentation

◆ InitializeKmeansppRng

typedef std::mt19937_64 kmeans::InitializeKmeansppRng

Type of the pseudo-random number generator for InitializeKmeanspp.

◆ InitializeRandomRng

typedef std::mt19937_64 kmeans::InitializeRandomRng

Type of the pseudo-random number generator for InitializeRandom.

◆ RefineMiniBatchRng

typedef std::mt19937_64 kmeans::RefineMiniBatchRng

Type of the pseudo-random number generator for RefineMiniBatch.

Function Documentation

◆ compute() [1/4]

template<typename Index_ , typename Data_ , typename Cluster_ , typename Float_ >
Results< Index_, Cluster_, Float_ > kmeans::compute ( const Matrix< Index_, Data_ > & data,
const Initialize< Index_, Data_, Cluster_, Float_, Matrix< Index_, Data_ > > & initialize,
const Refine< Index_, Data_, Cluster_, Float_, Matrix< Index_, Data_ > > & refine,
const Cluster_ num_centers )

Overload of compute() to assist template deduction for the default Matrix. This allocates and returns the vectors for the centroids and cluster assignments.

Template Parameters
Index_Integer type of the observation indices.
Data_Numeric type of the input dataset.
Cluster_Integer type of the cluster assignments.
Float_Floating-point type of the centroids.
Parameters
dataA matrix containing data for each observation.
initializeInitialization method to use.
refineRefinement method to use.
num_centersNumber of cluster centers.
Returns
Results of the clustering, including the centroid locations and cluster assignments.

◆ compute() [2/4]

template<typename Index_ , typename Data_ , typename Cluster_ , typename Float_ >
Details< Index_ > kmeans::compute ( const Matrix< Index_, Data_ > & data,
const Initialize< Index_, Data_, Cluster_, Float_, Matrix< Index_, Data_ > > & initialize,
const Refine< Index_, Data_, Cluster_, Float_, Matrix< Index_, Data_ > > & refine,
const Cluster_ num_centers,
Float_ *const centers,
Cluster_ *const clusters )

Overload of compute() to assist template deduction for the default Matrix.

Template Parameters
Index_Integer type of the observation indices.
Data_Numeric type of the input dataset.
Cluster_Integer type of the cluster assignments.
Float_Floating-point type of the centroids.
Parameters
dataA matrix containing data for each observation.
initializeInitialization method to use.
refineRefinement method to use.
num_centersNumber of cluster centers.
[out]centersPointer to an array of length equal to the product of num_centers and data.num_dimensions(). This contains a column-major matrix where rows correspond to dimensions and columns correspond to cluster centers. On output, each column should contain the initial centroid location for its cluster.
[in]clustersPointer to an array of length equal to the number of observations (from data.num_observations()). On output, this will contain the 0-based cluster assignment for each observation, where each entry is less than num_centers.
Returns
Details of the clustering, including the size of each cluster and the status of the algorithm.

◆ compute() [3/4]

template<typename Index_ , typename Data_ , typename Cluster_ , typename Float_ , class Matrix_ = Matrix<Index_, Data_>>
Details< Index_ > kmeans::compute ( const Matrix_ & data,
const Initialize< Index_, Data_, Cluster_, Float_, Matrix_ > & initialize,
const Refine< Index_, Data_, Cluster_, Float_, Matrix_ > & refine,
Cluster_ num_centers,
Float_ * centers,
Cluster_ * clusters )
Template Parameters
Index_Integer type of the observation indices. This should be the same as the index type of Matrix_.
Data_Numeric type of the input dataset. This should be the same as the data type of Matrix_.
Cluster_Integer type of the cluster assignments.
Float_Floating-point type of the centroids.
Matrix_Class satisfying the Matrix interface.
Parameters
dataA matrix containing data for each observation.
initializeInitialization method to use.
refineRefinement method to use.
num_centersNumber of cluster centers.
[out]centersPointer to an array of length equal to the product of num_centers and data.num_dimensions(). This contains a column-major matrix where rows correspond to dimensions and columns correspond to cluster centers. On output, each column should contain the initial centroid location for its cluster.
[in]clustersPointer to an array of length equal to the number of observations (from data.num_observations()). On output, this will contain the 0-based cluster assignment for each observation, where each entry is less than num_centers.
Returns
Details of the clustering, including the size of each cluster and the status of the algorithm.

◆ compute() [4/4]

template<typename Index_ , typename Data_ , typename Cluster_ , typename Float_ , class Matrix_ = Matrix<Index_, Data_>>
Results< Index_, Cluster_, Float_ > kmeans::compute ( const Matrix_ & data,
const Initialize< Index_, Data_, Cluster_, Float_, Matrix_ > & initialize,
const Refine< Index_, Data_, Cluster_, Float_, Matrix_ > & refine,
const Cluster_ num_centers )

Overload of compute() that allocates and returns the vectors for the centroids and cluster assignments.

Template Parameters
Index_Integer type of the observation indices. This should be the same as the index type of Matrix_.
Data_Numeric type of the input dataset. This should be the same as the data type of Matrix_.
Cluster_Integer type of the cluster assignments.
Float_Floating-point type of the centroids.
Matrix_Class satisfying the Matrix interface.
Parameters
dataA matrix containing data for each observation.
initializeInitialization method to use.
refineRefinement method to use.
num_centersNumber of cluster centers.
Returns
Results of the clustering, including the centroid locations and cluster assignments.

◆ compute_wcss()

template<class Matrix_ , typename Cluster_ , typename Float_ >
void kmeans::compute_wcss ( const Matrix_ & data,
const Cluster_ num_centers,
const Float_ *const centers,
const Cluster_ *const clusters,
Float_ *const wcss )
Template Parameters
Matrix_Class satisfying the Matrix interface.
Cluster_Integer type of the cluster assignments.
Float_Floating-point type of the centers and output.
Parameters
dataA matrix containing data for each observation.
num_centersNumber of cluster centers.
[in]centersPointer to an array of length equal to the product of num_centers and data.num_dimensions(). This contains a column-major matrix where rows correspond to dimensions and columns correspond to cluster centers. Each column should contain the initial centroid location for its cluster.
[in]clustersPointer to an array of length equal to the number of observations (from data.num_observations()). This should contain the 0-based cluster assignment for each observation, where each value is no greater than num_centers.
[out]wcssPointer to an array of length equal to num_centers. On output, this will contain the within-cluster sum of squares.

◆ parallelize()

template<typename Task_ , class Run_ >
void kmeans::parallelize ( const int num_workers,
const Task_ num_tasks,
Run_ run_task_range )
Template Parameters
Task_Integer type of the number of tasks.
Run_Function to execute a range of tasks.
Parameters
num_workersNumber of workers.
num_tasksNumber of tasks.
run_task_rangeFunction to iterate over a range of tasks within a worker.

By default, this is an alias to subpar::parallelize_range(). However, if the KMEANS_CUSTOM_PARALLEL function-like macro is defined, it is called instead. Any user-defined macro should accept the same arguments as subpar::parallelize_range().