Implements the variance partitioning method of Su and Dy (2007).
We start from a single cluster containing all points. At each iteration, we select the cluster with the largest within-cluster sum of squares (WCSS); we identify the dimension with the largest variance within that cluster; and we split the cluster at its center along that axis to obtain two new clusters. This is repeated until the desired number of clusters is obtained. The idea is to deterministically partition the dataset so that the initial centers are evenly distributed along the axes of greatest variation.
The original algorithm favors selection and partitioning of the largest cluster, which has the greatest chance of having the highest WCSS. For more fine-grained control, we modify the procedure to adjust the effective number of observations contributing to the WCSS. Specifically, we choose the cluster to partition based on the product of \(N\) and the mean squared difference within each cluster, where \(N\) is the cluster size raised to some user-specified power (i.e., the "size adjustment") between 0 and 1. An adjustment of 1 recapitulates the original algorithm, while smaller values of the size adjustment will reduce the preference towards larger clusters. A value of zero means that the cluster size is completely ignored, though this seems unwise as it causes excessive splitting of small clusters with unstable WCSS.
The original algorithm splits the cluster at the center (i.e., mean) along its most variable dimension. We provide an improvement on this approach where the partition boundary is chosen to minimizes the sum of sum of squares of the two child partitions. This often provides more appropriate partitions by considering the distribution of observations within the cluster, at the cost of some extra computation. Users can switch back to the original approach by setting InitializeVariancePartitionOptions::optimize_partition = false
.
- Template Parameters
-
Matrix_ | Matrix type for the input data. This should satisfy the MockMatrix contract. |
Cluster_ | Integer type for the cluster assignments. |
Float_ | Floating-point type for the centroids. |
- See also
- Su, T. and Dy, J. G. (2007). In Search of Deterministic Methods for Initializing K-Means and Gaussian Mixture Clustering, Intelligent Data Analysis 11, 319-338.