kmeans
A C++ library for k-means
Loading...
Searching...
No Matches
Public Member Functions | List of all members
kmeans::InitializeVariancePartition< Matrix_, Cluster_, Float_ > Class Template Reference

Implements the variance partitioning method of Su and Dy (2007). More...

#include <InitializeVariancePartition.hpp>

Inheritance diagram for kmeans::InitializeVariancePartition< Matrix_, Cluster_, Float_ >:
Inheritance graph
[legend]
Collaboration diagram for kmeans::InitializeVariancePartition< Matrix_, Cluster_, Float_ >:
Collaboration graph
[legend]

Public Member Functions

 InitializeVariancePartition (InitializeVariancePartitionOptions options)
 
 InitializeVariancePartition ()=default
 
InitializeVariancePartitionOptionsget_options ()
 
Cluster_ run (const Matrix_ &data, Cluster_ ncenters, Float_ *centers) const
 

Detailed Description

template<typename Matrix_ = SimpleMatrix<double, int>, typename Cluster_ = int, typename Float_ = double>
class kmeans::InitializeVariancePartition< Matrix_, Cluster_, Float_ >

Implements the variance partitioning method of Su and Dy (2007).

We start from a single cluster containing all points. At each iteration, we select the cluster with the largest within-cluster sum of squares (WCSS); we identify the dimension with the largest variance within that cluster; and we split the cluster at its center along that axis to obtain two new clusters. This is repeated until the desired number of clusters is obtained. The idea is to deterministically partition the dataset so that the initial centers are evenly distributed along the axes of greatest variation.

The original algorithm favors selection and partitioning of the largest cluster, which has the greatest chance of having the highest WCSS. For more fine-grained control, we modify the procedure to adjust the effective number of observations contributing to the WCSS. Specifically, we choose the cluster to partition based on the product of \(N\) and the mean squared difference within each cluster, where \(N\) is the cluster size raised to some user-specified power (i.e., the "size adjustment") between 0 and 1. An adjustment of 1 recapitulates the original algorithm, while smaller values of the size adjustment will reduce the preference towards larger clusters. A value of zero means that the cluster size is completely ignored, though this seems unwise as it causes excessive splitting of small clusters with unstable WCSS.

The original algorithm splits the cluster at the center (i.e., mean) along its most variable dimension. We provide an improvement on this approach where the partition boundary is chosen to minimizes the sum of sum of squares of the two child partitions. This often provides more appropriate partitions by considering the distribution of observations within the cluster, at the cost of some extra computation. Users can switch back to the original approach by setting InitializeVariancePartitionOptions::optimize_partition = false.

Template Parameters
Matrix_Matrix type for the input data. This should satisfy the MockMatrix contract.
Cluster_Integer type for the cluster assignments.
Float_Floating-point type for the centroids.
See also
Su, T. and Dy, J. G. (2007). In Search of Deterministic Methods for Initializing K-Means and Gaussian Mixture Clustering, Intelligent Data Analysis 11, 319-338.

Constructor & Destructor Documentation

◆ InitializeVariancePartition() [1/2]

template<typename Matrix_ = SimpleMatrix<double, int>, typename Cluster_ = int, typename Float_ = double>
kmeans::InitializeVariancePartition< Matrix_, Cluster_, Float_ >::InitializeVariancePartition ( InitializeVariancePartitionOptions  options)
inline
Parameters
optionsOptions for variance partitioning.

◆ InitializeVariancePartition() [2/2]

template<typename Matrix_ = SimpleMatrix<double, int>, typename Cluster_ = int, typename Float_ = double>
kmeans::InitializeVariancePartition< Matrix_, Cluster_, Float_ >::InitializeVariancePartition ( )
default

Default constructor.

Member Function Documentation

◆ get_options()

template<typename Matrix_ = SimpleMatrix<double, int>, typename Cluster_ = int, typename Float_ = double>
InitializeVariancePartitionOptions & kmeans::InitializeVariancePartition< Matrix_, Cluster_, Float_ >::get_options ( )
inline
Returns
Options for variance partitioning, to be modified prior to calling run().

◆ run()

template<typename Matrix_ = SimpleMatrix<double, int>, typename Cluster_ = int, typename Float_ = double>
Cluster_ kmeans::InitializeVariancePartition< Matrix_, Cluster_, Float_ >::run ( const Matrix_ data,
Cluster_  num_centers,
Float_ centers 
) const
inlinevirtual
Parameters
dataA matrix-like object (see MockMatrix) containing per-observation data.
num_centersNumber of cluster centers.
[out]centersPointer to an array of length equal to the product of num_centers and data.num_dimensions(). This contains a column-major matrix where rows correspond to dimensions and columns correspond to cluster centers. On output, each column will contain the final centroid locations for each cluster.
Returns
centers is filled with the new cluster centers. The number of filled centers is returned - this is usually equal to num_centers, but may not be if, e.g., num_centers is greater than the number of observations. If the returned value is less than num_centers, only the first few centers in centers will be filled.

Implements kmeans::Initialize< Matrix_, Cluster_, Float_ >.


The documentation for this class was generated from the following file: