kmeans
A C++ library for k-means
Loading...
Searching...
No Matches
RefineHartiganWong.hpp
Go to the documentation of this file.
1#ifndef KMEANS_HARTIGAN_WONG_HPP
2#define KMEANS_HARTIGAN_WONG_HPP
3
4#include <vector>
5#include <algorithm>
6#include <numeric>
7#include <cstdint>
8
9#include "Refine.hpp"
10#include "Details.hpp"
11#include "QuickSearch.hpp"
12#include "parallelize.hpp"
13#include "compute_centroids.hpp"
14#include "is_edge_case.hpp"
15
22namespace kmeans {
23
52
57
58/*
59 * Alright, get ready for a data dump, because this is where it gets real.
60 * Here I attempt to untangle the code spaghetti from the original Fortran
61 * implementation, which is exemplified by the NCP and LIVE arrays.
62 *
63 * Okay, NCP first. Each element L in NCP has a dual interpretation:
64 *
65 * - In the optimal-transfer stage, NCP(L) stores the step at which cluster L
66 * was last updated. Each step is just the observation index as the optimal
67 * transfer only does one pass through the dataset.
68 * - In the quick-transfer stage, NCP(L) stores the step at which cluster L is
69 * last updated plus M (i.e., the number of observations). Here, the step
70 * corresponding to an observation will be 'M * X + obs' for some integer X
71 * >= 0, where X is the iteration of the quick transfer. This means that the
72 * stored value is defined as '(M + 1) * X + obs'.
73 * - After each quick_transfer call, NCP(L) is set back to zero so any existing
74 * values will not carry over into the next optimal_transfer call.
75 *
76 * Note that these two definitions bleed into each other as the NCP(L) set by
77 * optimal_transfer is still referenced in the first few iterations of
78 * quick_transfer before it eventually gets overwritten. The easiest way to
79 * interpret this is to consider the optimal transfer as iteration 'M = -1'
80 * from the perspective of the quick transfer iterations.
81 *
82 * In short, the NCP array is used to determine whether a cluster was modified
83 * within the last M steps of the optimal_transfer or quick_transfer.
84 *
85 * Let's move onto LIVE, which has an even more painful interpretation:
86 *
87 * - Before each optimal_transfer call, LIVE(L) stores the observation at which
88 * cluster L was updated in the _previous_ optimal_transfer call.
89 * - During the optimal_transfer call, LIVE(L) is updated to the observation at
90 * which L was updated in this call, plus M (i.e., number of observations).
91 * - After the optimal_transfer call, LIVE(L) is updated by subtracting M, so
92 * that the interpretation is correct in the next call.
93 * - After the quick_transfer call, LIVE(L) is set to M + 1 if a quick transfer
94 * took place for L, effectively mimicking a "very recent" update.
95 *
96 * It basically tells us whether there was a recent transfer (optimal or quick)
97 * within the last M steps of optimal_transfer. If so, the cluster is "live".
98 * This differs very slightly from NCP as LIVE only counts optimal transfer
99 * steps while NCP counts both optimal and quick transfer steps.
100 *
101 * We simplify this mess by consolidating NCP and LIVE into a single update
102 * history that can serve both purposes. This is possible as LIVE is not used
103 * in quick_transfer, while any modifications made to NCP by quick_transfer are
104 * ignored and overwritten in optimal_transfer. Thus, there won't be any
105 * conflicts between the two meanings in a consolidated update history.
106 *
107 * The update history for cluster L now holds the iteration ('X') and the
108 * observation ('obs') at which the cluster was last modified. This has
109 * an obvious mapping to NCP(L), which was already defined in terms of
110 * 'X' and 'obs' anyway. For LIVE(L), the mapping is more complex:
111 *
112 * - If an optimal_transfer occurred but no quick_transfer occurs for L,
113 * we set 'X = -2'. 'obs' is not modified as it is the observation at
114 * which L was modified in the previous optimal_transfer iteration;
115 * this is basically just the definition of LIVE(L).
116 * - If a quick transfer does occur, we set 'X = -2' and 'obs = M + 1'.
117 * Again, this is equivalent to the behavior of LIVE(L).
118 * - If neither a quick_transfer nor optimal_transfer occurs for L,
119 * we set 'X = -3' to indicate that L didn't change in any manner.
120 * - During the optimal_transfer, we consider a cluster to be live at a
121 * particular observation if it is greater than 'obs'.
122 *
123 * Note that Fortran is 1-based so the actual code is a little different than
124 * described above for 'obs' - specifically, we just need to set it to 'M'
125 * when a quick transfer occurs. Meaning of 'X' is unchanged.
126 *
127 * Incidentally, we can easily detect if a quick transfer occurs based on
128 * whether 'X > -1'. This obviates the need for the ITRAN array.
129 */
130template<typename Index_>
131class UpdateHistory {
132private:
133 Index_ my_last_observation = 0;
134
135 static constexpr int current_optimal_transfer = -1;
136 static constexpr int previous_optimal_transfer = -2;
137 static constexpr int ancient_history = -3;
138
139 // Starting at -3 as we can't possibly have any updates if we haven't done
140 // any transfers, optimal or quick!
142
143public:
144 void reset(Index_ total_obs) {
145 if (my_last_iteration > current_optimal_transfer) { // i.e., quick_transfer.
149 // Preserve the existing 'my_last_observation', just bump the iteration downwards.
151 } else {
153 }
154 }
155
156 void set_optimal(Index_ obs) {
159 }
160
161 // Here, iter should be from '[0, max_quick_transfer_iterations)'.
162 void set_quick(int iter, Index_ obs) {
165 }
166
167public:
168 bool changed_after(int iter, Index_ obs) const {
169 if (my_last_iteration == iter) {
170 return my_last_observation > obs;
171 } else {
172 return my_last_iteration > iter;
173 }
174 }
175
176 bool changed_after_or_at(int iter, Index_ obs) const {
177 if (my_last_iteration == iter) {
178 return my_last_observation >= obs;
179 } else {
180 return my_last_iteration > iter;
181 }
182 }
183
184 bool is_live(Index_ obs) const {
186 }
187};
188
189template<typename Float_, typename Index_, typename Cluster_>
190struct Workspace {
191 // Array arguments in the same order as supplied to R's kmns_ function.
192 std::vector<Cluster_> best_destination_cluster; // i.e., IC2
193 std::vector<Index_> cluster_sizes; // i.e., NC
194
195 std::vector<Float_> loss_multiplier; // i.e., AN1
196 std::vector<Float_> gain_multiplier; // i.e., AN2
197 std::vector<Float_> wcss_loss; // i.e., D
198
199 std::vector<UpdateHistory<Index_> > update_history; // i.e., NCP, LIVE, and ITRAN.
200
201 Index_ optra_steps_since_last_transfer = 0; // i.e., INDX
202
203public:
204 Workspace(Index_ nobs, Cluster_ ncenters) :
205 // Sizes taken from the .Fortran() call in stats::kmeans().
206 best_destination_cluster(nobs),
207 cluster_sizes(ncenters),
208 loss_multiplier(ncenters),
209 gain_multiplier(ncenters),
210 wcss_loss(nobs),
211 update_history(ncenters)
212 {}
213};
214
215template<typename Data_, typename Float_, typename Dim_>
217 Float_ output = 0;
218 for (decltype(ndim) dim = 0; dim < ndim; ++dim, ++data, ++center) {
219 Float_ delta = static_cast<Float_>(*data) - *center; // cast to float for consistent precision regardless of Data_.
220 output += delta * delta;
221 }
222 return output;
223}
224
225template<class Matrix_, typename Cluster_, typename Float_>
226void find_closest_two_centers(const Matrix_& data, Cluster_ ncenters, const Float_* centers, Cluster_* best_cluster, std::vector<Cluster_>& best_destination_cluster, int nthreads) {
227 auto ndim = data.num_dimensions();
228
229 // We assume that there are at least two centers here, otherwise we should
230 // have detected that this was an edge case in RefineHartiganWong::run.
231 internal::QuickSearch<Float_, Cluster_, decltype(ndim)> index(ndim, ncenters, centers);
232
233 auto nobs = data.num_observations();
234 typedef typename Matrix_::index_type Index_;
235 parallelize(nthreads, nobs, [&](int, Index_ start, Index_ length) -> void {
236 auto matwork = data.create_workspace(start, length);
237 for (Index_ obs = start, end = start + length; obs < end; ++obs) {
238 auto optr = data.get_observation(matwork);
239 auto res2 = index.find2(optr);
240 best_cluster[obs] = res2.first;
242 }
243 });
244}
245
246template<typename Float_>
247constexpr Float_ big_number() {
248 return 1e30; // Some very big number.
249}
250
251template<typename Dim_, typename Data_, typename Index_, typename Cluster_, typename Float_>
253 // Yes, casts to float are deliberate here, so that the
254 // multipliers can be computed correctly.
255 Float_ al1 = work.cluster_sizes[l1], alw = al1 - 1;
256 Float_ al2 = work.cluster_sizes[l2], alt = al2 + 1;
257
258 size_t long_ndim = ndim;
259 auto copy1 = centers + static_cast<size_t>(l1) * long_ndim; // cast to avoid overflow.
260 auto copy2 = centers + static_cast<size_t>(l2) * long_ndim;
261 for (decltype(ndim) dim = 0; dim < ndim; ++dim, ++copy1, ++copy2, ++obs_ptr) {
262 Float_ oval = *obs_ptr; // cast to float for consistent precision regardless of Data_.
263 *copy1 = (*copy1 * al1 - oval) / alw;
264 *copy2 = (*copy2 * al2 + oval) / alt;
265 }
266
267 --work.cluster_sizes[l1];
268 ++work.cluster_sizes[l2];
269
270 work.gain_multiplier[l1] = alw / al1;
271 work.loss_multiplier[l1] = (alw > 1 ? alw / (alw - 1) : big_number<Float_>());
272 work.loss_multiplier[l2] = alt / al2;
273 work.gain_multiplier[l2] = alt / (alt + 1);
274
276 work.best_destination_cluster[obs_id] = l1;
277}
278
279/* ALGORITHM AS 136.1 APPL. STATIST. (1979) VOL.28, NO.1
280 * This is the OPtimal TRAnsfer stage.
281 * ----------------------
282 * Each point is re-assigned, if necessary, to the cluster that will induce a
283 * maximum reduction in the within-cluster sum of squares. In this stage,
284 * there is only one pass through the data.
285 */
286template<class Matrix_, typename Cluster_, typename Float_>
288 auto nobs = data.num_observations();
289 auto ndim = data.num_dimensions();
290 auto matwork = data.create_workspace();
291 size_t long_ndim = ndim;
292
293 for (decltype(nobs) obs = 0; obs < nobs; ++obs) {
294 ++work.optra_steps_since_last_transfer;
295
296 auto l1 = best_cluster[obs];
297 if (work.cluster_sizes[l1] != 1) {
298 auto obs_ptr = data.get_observation(obs, matwork);
299
300 // The original Fortran implementation re-used the WCSS loss for
301 // each observation, only recomputing it if its cluster had
302 // experienced an optimal transfer for an earlier observation. In
303 // theory, this sounds great to avoid recomputation, but the
304 // existing WCSS loss was computed in a running fashion during the
305 // quick transfers. This makes them susceptible to accumulation of
306 // numerical errors, even after the centroids are freshly
307 // recomputed in the run() loop. So, we simplify matters and
308 // improve accuracy by just recomputing the loss all the time,
309 // which doesn't take too much extra effort.
310 auto& wcss_loss = work.wcss_loss[obs];
311 auto l1_ptr = centers + long_ndim * static_cast<size_t>(l1); // cast to avoid overflow.
313
314 // Find the cluster with minimum WCSS gain.
315 auto l2 = work.best_destination_cluster[obs];
316 auto original_l2 = l2;
317 auto l2_ptr = centers + long_ndim * static_cast<size_t>(l2); // cast to avoid overflow.
318 auto wcss_gain = squared_distance_from_cluster(obs_ptr, l2_ptr, ndim) * work.gain_multiplier[l2];
319
321 auto cen_ptr = centers + long_ndim * static_cast<size_t>(cen); // cast to avoid overflow.
323 if (candidate < wcss_gain) {
325 l2 = cen;
326 }
327 };
328
329 // If the current assigned cluster is live, we need to check all
330 // other clusters as potential transfer destinations, because the
331 // gain/loss comparison has changed. Otherwise, we only need to
332 // consider other live clusters as transfer destinations; the
333 // non-live clusters were already rejected as transfer destinations
334 // when compared to the current assigned cluster, so there's no
335 // point checking them again if they didn't change in the meantime.
336 //
337 // The exception is for the first call to optimal_transfer, where we
338 // consider all clusters as live (i.e., all_live = true). This is
339 // because no observation really knows its best transfer
340 // destination yet - the second-closest cluster is just a
341 // guesstimate - so we need to compute it exhaustively.
342 if (all_live || work.update_history[l1].is_live(obs)) {
343 for (Cluster_ cen = 0; cen < ncenters; ++cen) {
344 if (cen != l1 && cen != original_l2) {
346 }
347 }
348 } else {
349 for (Cluster_ cen = 0; cen < ncenters; ++cen) {
350 if (cen != l1 && cen != original_l2 && work.update_history[cen].is_live(obs)) {
352 }
353 }
354 }
355
356 // Deciding whether to make the transfer based on the change to the WCSS.
357 if (wcss_gain >= wcss_loss) {
358 work.best_destination_cluster[obs] = l2;
359 } else {
360 work.optra_steps_since_last_transfer = 0;
361 work.update_history[l1].set_optimal(obs);
362 work.update_history[l2].set_optimal(obs);
364 }
365 }
366
367 // Stop if we've iterated through the entire dataset and no transfer of
368 // any kind took place, be it optimal or quick.
369 if (work.optra_steps_since_last_transfer == nobs) {
370 return true;
371 }
372 }
373
374 return false;
375}
376
377/* ALGORITHM AS 136.2 APPL. STATIST. (1979) VOL.28, NO.1
378 * This is the Quick TRANsfer stage.
379 * --------------------
380 * IC1(I) is the cluster which point I currently belongs to.
381 * IC2(I) is the cluster which point I is most likely to be transferred to.
382 *
383 * For each point I, IC1(I) & IC2(I) are switched, if necessary, to reduce
384 * within-cluster sum of squares. The cluster centres are updated after each
385 * step. In this stage, we loop through the data until no further change is to
386 * take place, or we hit an iteration limit, whichever is first.
387 */
388template<class Matrix_, typename Cluster_, typename Float_>
389std::pair<bool, bool> quick_transfer(
390 const Matrix_& data,
392 Float_* centers,
395{
396 bool had_transfer = false;
397
398 auto nobs = data.num_observations();
399 auto matwork = data.create_workspace();
400 auto ndim = data.num_dimensions();
401 size_t long_ndim = data.num_dimensions();
402
403 typedef decltype(nobs) Index_;
404 Index_ steps_since_last_quick_transfer = 0; // i.e., ICOUN in the original Fortran implementation.
405
406 for (int it = 0; it < quick_iterations; ++it) {
407 int prev_it = it - 1;
408
409 for (decltype(nobs) obs = 0; obs < nobs; ++obs) {
411 auto l1 = best_cluster[obs];
412
413 if (work.cluster_sizes[l1] != 1) {
414 const typename Matrix_::data_type* obs_ptr = NULL;
415
416 // Need to update the WCSS loss if the cluster was updated recently.
417 // Otherwise, we must have already updated the WCSS in a previous
418 // iteration of the outermost loop, so this can be skipped.
419 //
420 // Note that we use changed_at_or_after; if the same
421 // observation was changed in the previous iteration of the
422 // outermost loop, its WCSS loss won't have been updated yet.
423 auto& history1 = work.update_history[l1];
424 if (history1.changed_after_or_at(prev_it, obs)) {
425 auto l1_ptr = centers + static_cast<size_t>(l1) * long_ndim; // cast to avoid overflow.
426 obs_ptr = data.get_observation(obs, matwork);
427 work.wcss_loss[obs] = squared_distance_from_cluster(obs_ptr, l1_ptr, ndim) * work.loss_multiplier[l1];
428 }
429
430 // If neither the best or second-best clusters have changed
431 // after the previous iteration that we visited this
432 // observation, then there's no point reevaluating the
433 // transfer, because nothing's going to be different anyway.
434 auto l2 = work.best_destination_cluster[obs];
435 auto& history2 = work.update_history[l2];
436 if (history1.changed_after(prev_it, obs) || history2.changed_after(prev_it, obs)) {
437 if (obs_ptr == NULL) {
438 obs_ptr = data.get_observation(obs, matwork);
439 }
440 auto l2_ptr = centers + static_cast<size_t>(l2) * long_ndim; // cast to avoid overflow.
441 auto wcss_gain = squared_distance_from_cluster(obs_ptr, l2_ptr, ndim) * work.gain_multiplier[l2];
442
443 if (wcss_gain < work.wcss_loss[obs]) {
444 had_transfer = true;
446 history1.set_quick(it, obs);
447 history2.set_quick(it, obs);
449 }
450 }
451 }
452
454 // Quit early if no transfer occurred within the past 'nobs'
455 // steps, as we've already converged for each observation.
456 return std::make_pair(had_transfer, false);
457 }
458 }
459 }
460
461 return std::make_pair(had_transfer, true);
462}
463
464}
498template<typename Matrix_ = SimpleMatrix<double, int>, typename Cluster_ = int, typename Float_ = double>
499class RefineHartiganWong : public Refine<Matrix_, Cluster_, Float_> {
500public:
505
510
511private:
512 RefineHartiganWongOptions my_options;
513 typedef typename Matrix_::index_type Index_;
514
515public:
521 return my_options;
522 }
523
524public:
525 Details<Index_> run(const Matrix_& data, Cluster_ ncenters, Float_* centers, Cluster_* clusters) const {
526 auto nobs = data.num_observations();
527 if (internal::is_edge_case(nobs, ncenters)) {
528 return internal::process_edge_case(data, ncenters, centers, clusters);
529 }
530
531 RefineHartiganWong_internal::Workspace<Float_, Index_, Cluster_> work(nobs, ncenters);
532
533 RefineHartiganWong_internal::find_closest_two_centers(data, ncenters, centers, clusters, work.best_destination_cluster, my_options.num_threads);
534 for (Index_ obs = 0; obs < nobs; ++obs) {
535 ++work.cluster_sizes[clusters[obs]];
536 }
537 internal::compute_centroids(data, ncenters, centers, clusters, work.cluster_sizes);
538
539 for (Cluster_ cen = 0; cen < ncenters; ++cen) {
540 Float_ num = work.cluster_sizes[cen]; // yes, cast is deliberate here so that the multipliers can be computed correctly.
541 work.gain_multiplier[cen] = num / (num + 1);
542 work.loss_multiplier[cen] = (num > 1 ? num / (num - 1) : RefineHartiganWong_internal::big_number<Float_>());
543 }
544
545 int iter = 0;
546 int ifault = 0;
547 while ((++iter) <= my_options.max_iterations) {
548 bool finished = RefineHartiganWong_internal::optimal_transfer(data, work, ncenters, centers, clusters, /* all_live = */ (iter == 1));
549 if (finished) {
550 break;
551 }
552
553 auto quick_status = RefineHartiganWong_internal::quick_transfer(
554 data,
555 work,
556 centers,
557 clusters,
559 );
560
561 // Recomputing the centroids to avoid accumulation of numerical
562 // errors after many transfers (e.g., adding a whole bunch of
563 // values and then subtracting them again leaves behind some
564 // cancellation error). Note that we don't have to do this if
565 // 'finished = true' as this means that there was no transfer of
566 // any kind in the final pass through the dataset.
567 internal::compute_centroids(data, ncenters, centers, clusters, work.cluster_sizes);
568
569 if (quick_status.second) { // Hit the quick transfer iteration limit.
571 ifault = 4;
572 break;
573 }
574 } else {
575 // If quick transfer converged and there are only two clusters,
576 // there is no need to re-enter the optimal transfer stage.
577 if (ncenters == 2) {
578 break;
579 }
580 }
581
582 if (quick_status.first) { // At least one quick transfer was performed.
583 work.optra_steps_since_last_transfer = 0;
584 }
585
586 for (Cluster_ c = 0; c < ncenters; ++c) {
587 work.update_history[c].reset(nobs);
588 }
589 }
590
591 if (iter == my_options.max_iterations + 1) {
592 ifault = 2;
593 }
594
595 return Details(std::move(work.cluster_sizes), iter, ifault);
596 }
597};
598
599}
600
601#endif
Report detailed clustering statistics.
Interface for k-means refinement.
Implements the variance partitioning method of Su and Dy (2007).
Definition InitializeVariancePartition.hpp:164
Implements the Hartigan-Wong algorithm for k-means clustering.
Definition RefineHartiganWong.hpp:499
RefineHartiganWong(RefineHartiganWongOptions options)
Definition RefineHartiganWong.hpp:504
RefineHartiganWongOptions & get_options()
Definition RefineHartiganWong.hpp:520
Details< Index_ > run(const Matrix_ &data, Cluster_ ncenters, Float_ *centers, Cluster_ *clusters) const
Definition RefineHartiganWong.hpp:525
Interface for all k-means refinement algorithms.
Definition Refine.hpp:23
Namespace for k-means clustering.
Definition compute_wcss.hpp:12
void parallelize(int num_workers, Task_ num_tasks, Run_ run_task_range)
Definition parallelize.hpp:28
Utilities for parallelization.
Additional statistics from the k-means algorithm.
Definition Details.hpp:20
Options for RefineHartiganWong.
Definition RefineHartiganWong.hpp:27
bool quit_on_quick_transfer_convergence_failure
Definition RefineHartiganWong.hpp:44
int max_iterations
Definition RefineHartiganWong.hpp:32
int num_threads
Definition RefineHartiganWong.hpp:50
int max_quick_transfer_iterations
Definition RefineHartiganWong.hpp:38