Implements the variance partitioning method of Su and Dy (2007). More...

#include <InitializeVariancePartition.hpp>

Inheritance diagram for kmeans::InitializeVariancePartition< Index_, Data_, Cluster_, Float_, Matrix_ >:

Collaboration diagram for kmeans::InitializeVariancePartition< Index_, Data_, Cluster_, Float_, Matrix_ >:

Public Member Functions
	InitializeVariancePartition (InitializeVariancePartitionOptions options)

	InitializeVariancePartition ()=default

InitializeVariancePartitionOptions &	get_options ()

Public Member Functions inherited from kmeans::Initialize< Index_, Data_, Cluster_, Float_, Matrix_ >
virtual Cluster_	run (const Matrix_ &data, Cluster_ num_centers, Float_ *centers) const =0

Detailed Description

template<typename Index_, typename Data_, typename Cluster_, typename Float_, class Matrix_ = Matrix<Index_, Data_>>
class kmeans::InitializeVariancePartition< Index_, Data_, Cluster_, Float_, Matrix_ >

Implements the variance partitioning method of Su and Dy (2007).

We start from a single cluster containing all points. At each iteration, we select the cluster with the largest within-cluster sum of squares (WCSS); we identify the dimension with the largest variance within that cluster; and we split the cluster at its center along that axis to obtain two new clusters. This is repeated until the desired number of clusters is obtained. The idea is to deterministically partition the dataset so that the initial centers are evenly distributed along the axes of greatest variation.

The original algorithm favors selection and partitioning of the largest cluster, which has the greatest chance of having the highest WCSS. For more fine-grained control, we modify the procedure to adjust the effective number of observations contributing to the WCSS. Specifically, we choose the cluster to partition based on the product of \(N\) and the mean squared difference within each cluster, where \(N\) is the cluster size raised to some user-specified power (i.e., the "size adjustment") between 0 and 1. An adjustment of 1 recapitulates the original algorithm, while smaller values of the size adjustment will reduce the preference towards larger clusters. A value of zero means that the cluster size is completely ignored, though this seems unwise as it causes excessive splitting of small clusters with unstable WCSS.

The original algorithm splits the cluster at the center (i.e., mean) along its most variable dimension. We provide an improvement on this approach where the partition boundary is chosen to minimizes the sum of sum of squares of the two child partitions. This often provides more appropriate partitions by considering the distribution of observations within the cluster, at the cost of some extra computation. Users can switch back to the original approach by setting InitializeVariancePartitionOptions::optimize_partition = false.

Template Parameters

Index_	Integer type for the observation indices in the input dataset.
Data_	Numeric type for the input dataset.
Cluster_	Integer type for the cluster assignments.
Float_	Floating-point type for the centroids. This will also be used for any internal distance calculations.
Matrix_	Class of the input data matrix. This should satisfy the `Matrix` interface.

See also: Su, T. and Dy, J. G. (2007). In Search of Deterministic Methods for Initializing K-Means and Gaussian Mixture Clustering, Intelligent Data Analysis 11, 319-338.

Constructor & Destructor Documentation

◆ InitializeVariancePartition() [1/2]

template<typename Index_ , typename Data_ , typename Cluster_ , typename Float_ , class Matrix_ = Matrix<Index_, Data_>>

kmeans::InitializeVariancePartition< Index_, Data_, Cluster_, Float_, Matrix_ >::InitializeVariancePartition ( InitializeVariancePartitionOptions options )

inline

Parameters

options Options for variance partitioning.

◆ InitializeVariancePartition() [2/2]

template<typename Index_ , typename Data_ , typename Cluster_ , typename Float_ , class Matrix_ = Matrix<Index_, Data_>>

kmeans::InitializeVariancePartition< Index_, Data_, Cluster_, Float_, Matrix_ >::InitializeVariancePartition ( )

default

Default constructor.

Member Function Documentation

◆ get_options()

template<typename Index_ , typename Data_ , typename Cluster_ , typename Float_ , class Matrix_ = Matrix<Index_, Data_>>

InitializeVariancePartitionOptions & kmeans::InitializeVariancePartition< Index_, Data_, Cluster_, Float_, Matrix_ >::get_options ( )

inline

Returns: Options for variance partitioning, to be modified prior to calling run().

The documentation for this class was generated from the following file:

kmeans/InitializeVariancePartition.hpp

Public Member Functions

Detailed Description

Constructor & Destructor Documentation

◆ InitializeVariancePartition() [1/2]

◆ InitializeVariancePartition() [2/2]

Member Function Documentation

◆ get_options()