kmeans
A C++ library for k-means
RefineHartiganWong.hpp
#ifndef KMEANS_HARTIGAN_WONG_HPP
#define KMEANS_HARTIGAN_WONG_HPP

#include <vector>
#include <algorithm>
#include <numeric>
#include <cstdint>

#include "Refine.hpp"
#include "Details.hpp"
#include "QuickSearch.hpp"
#include "parallelize.hpp"
#include "compute_centroids.hpp"
#include "is_edge_case.hpp"
namespace kmeans {

/**
 * @brief Options for `RefineHartiganWong`.
 */
struct RefineHartiganWongOptions {
    /**
     * Maximum number of optimal transfer iterations.
     */
    int max_iterations = 10;

    /**
     * Maximum number of quick transfer iterations within each optimal transfer iteration.
     */
    int max_quick_transfer_iterations = 50;

    /**
     * Whether to quit early if the quick transfer iterations fail to converge.
     */
    bool quit_on_quick_transfer_convergence_failure = false;

    /**
     * Number of threads to use.
     */
    int num_threads = 1;
};

namespace RefineHartiganWong_internal {

/*
 * Alright, get ready for a data dump, because this is where it gets real.
 * Here I attempt to untangle the code spaghetti from the original Fortran
 * implementation, which is exemplified by the NCP and LIVE arrays.
 *
 * Okay, NCP first. Each element L in NCP has a dual interpretation:
 *
 * - In the optimal-transfer stage, NCP(L) stores the step at which cluster L
 *   was last updated. Each step is just the observation index, as the optimal
 *   transfer only does one pass through the dataset.
 * - In the quick-transfer stage, NCP(L) stores the step at which cluster L was
 *   last updated, plus M (i.e., the number of observations). Here, the step
 *   corresponding to an observation will be 'M * X + obs' for some integer X
 *   >= 0, where X is the iteration of the quick transfer. This means that the
 *   stored value is defined as 'M * (X + 1) + obs'.
 * - After each quick_transfer call, NCP(L) is set back to zero so any existing
 *   values will not carry over into the next optimal_transfer call.
 *
 * Note that these two definitions bleed into each other, as the NCP(L) set by
 * optimal_transfer is still referenced in the first few iterations of
 * quick_transfer before it eventually gets overwritten. The easiest way to
 * interpret this is to consider the optimal transfer as iteration 'X = -1'
 * from the perspective of the quick transfer iterations.
 *
 * In short, the NCP array is used to determine whether a cluster was modified
 * within the last M steps of the optimal_transfer or quick_transfer.
 *
 * Let's move onto LIVE, which has an even more painful interpretation:
 *
 * - Before each optimal_transfer call, LIVE(L) stores the observation at which
 *   cluster L was updated in the _previous_ optimal_transfer call.
 * - During the optimal_transfer call, LIVE(L) is updated to the observation at
 *   which L was updated in this call, plus M (i.e., the number of observations).
 * - After the optimal_transfer call, LIVE(L) is updated by subtracting M, so
 *   that the interpretation is correct in the next call.
 * - After the quick_transfer call, LIVE(L) is set to M + 1 if a quick transfer
 *   took place for L, effectively mimicking a "very recent" update.
 *
 * It basically tells us whether there was a recent transfer (optimal or quick)
 * within the last M steps of optimal_transfer. If so, the cluster is "live".
 * This differs very slightly from NCP, as LIVE only counts optimal transfer
 * steps while NCP counts both optimal and quick transfer steps.
 *
 * We simplify this mess by consolidating NCP and LIVE into a single update
 * history that can serve both purposes. This is possible as LIVE is not used
 * in quick_transfer, while any modifications made to NCP by quick_transfer are
 * ignored and overwritten in optimal_transfer. Thus, there won't be any
 * conflicts between the two meanings in a consolidated update history.
 *
 * The update history for cluster L now holds the iteration ('X') and the
 * observation ('obs') at which the cluster was last modified. This has
 * an obvious mapping to NCP(L), which was already defined in terms of
 * 'X' and 'obs' anyway. For LIVE(L), the mapping is more complex:
 *
 * - If an optimal_transfer occurred but no quick_transfer occurs for L,
 *   we set 'X = -2'. 'obs' is not modified as it is the observation at
 *   which L was modified in the previous optimal_transfer iteration;
 *   this is basically just the definition of LIVE(L).
 * - If a quick transfer does occur, we set 'X = -2' and 'obs = M + 1'.
 *   Again, this is equivalent to the behavior of LIVE(L).
 * - If neither a quick_transfer nor optimal_transfer occurs for L,
 *   we set 'X = -3' to indicate that L didn't change in any manner.
 * - In optimal_transfer, we consider a cluster to be live at a particular
 *   observation if its stored 'obs' is greater than that observation's
 *   index and its 'X' is -2.
 *
 * Note that Fortran is 1-based, so the actual code is a little different than
 * described above for 'obs' - specifically, we just need to set it to 'M'
 * when a quick transfer occurs. The meaning of 'X' is unchanged.
 *
 * Incidentally, we can easily detect if a quick transfer occurred based on
 * whether 'X > -1'. This obviates the need for the ITRAN array.
 */
template<typename Index_>
class UpdateHistory {
private:
    Index_ my_last_observation = 0;

    static constexpr int current_optimal_transfer = -1;
    static constexpr int previous_optimal_transfer = -2;
    static constexpr int ancient_history = -3;

    // Starting at -3 as we can't possibly have any updates if we haven't done
    // any transfers, optimal or quick!
    int my_last_iteration = ancient_history;

public:
    void reset(Index_ total_obs) {
        if (my_last_iteration > current_optimal_transfer) { // i.e., quick_transfer.
            my_last_observation = total_obs;
            my_last_iteration = previous_optimal_transfer;
        } else if (my_last_iteration == current_optimal_transfer) {
            // Preserve the existing 'my_last_observation', just bump the iteration downwards.
            my_last_iteration = previous_optimal_transfer;
        } else {
            my_last_iteration = ancient_history;
        }
    }

    void set_optimal(Index_ obs) {
        my_last_observation = obs;
        my_last_iteration = current_optimal_transfer;
    }

    // Here, iter should be from '[0, max_quick_transfer_iterations)'.
    void set_quick(int iter, Index_ obs) {
        my_last_iteration = iter;
        my_last_observation = obs;
    }

public:
    bool changed_after(int iter, Index_ obs) const {
        if (my_last_iteration == iter) {
            return my_last_observation > obs;
        } else {
            return my_last_iteration > iter;
        }
    }

    bool changed_after_or_at(int iter, Index_ obs) const {
        if (my_last_iteration == iter) {
            return my_last_observation >= obs;
        } else {
            return my_last_iteration > iter;
        }
    }

    bool is_live(Index_ obs) const {
        return changed_after(previous_optimal_transfer, obs);
    }
};
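To make the consolidated bookkeeping concrete, here is a small standalone sketch that walks through the liveness semantics. `ToyHistory` is a trimmed, non-templated replica of `UpdateHistory` written just for this illustration (it is not part of the library); the `demo_*` helpers are likewise hypothetical.

```cpp
#include <cassert>

// Trimmed replica of the UpdateHistory bookkeeping, for illustration only.
// -1 = current optimal transfer, -2 = previous optimal transfer, -3 = ancient.
struct ToyHistory {
    int last_obs = 0;
    int last_iter = -3;
    void set_optimal(int obs) { last_obs = obs; last_iter = -1; }
    void set_quick(int iter, int obs) { last_iter = iter; last_obs = obs; }
    void reset(int total_obs) {
        if (last_iter > -1) { last_obs = total_obs; last_iter = -2; } // quick transfer occurred
        else if (last_iter == -1) { last_iter = -2; }                 // optimal transfer only
        else { last_iter = -3; }                                      // untouched
    }
    bool is_live(int obs) const {
        return (last_iter == -2) ? (last_obs > obs) : (last_iter > -2);
    }
};

// Cluster updated at observation 5 of an optimal transfer pass, no quick
// transfers; after reset(), it is live only for observations before 5.
inline bool demo_live_after_optimal(int obs) {
    ToyHistory h;
    h.set_optimal(5);
    h.reset(10); // 10 observations in total
    return h.is_live(obs);
}

// A quick transfer touched the cluster; after reset() it behaves as if
// updated at observation M, i.e., live for every observation in the next pass.
inline bool demo_live_after_quick(int obs) {
    ToyHistory h;
    h.set_quick(0, 7);
    h.reset(10);
    return h.is_live(obs);
}

// No transfers at all: the cluster is never live.
inline bool demo_live_untouched(int obs) {
    ToyHistory h;
    h.reset(10);
    return h.is_live(obs);
}
```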

template<typename Float_, typename Index_, typename Cluster_>
struct Workspace {
    // Array arguments in the same order as supplied to R's kmns_ function.
    std::vector<Cluster_> best_destination_cluster; // i.e., IC2
    std::vector<Index_> cluster_sizes; // i.e., NC

    std::vector<Float_> loss_multiplier; // i.e., AN1
    std::vector<Float_> gain_multiplier; // i.e., AN2
    std::vector<Float_> wcss_loss; // i.e., D

    std::vector<UpdateHistory<Index_> > update_history; // i.e., NCP, LIVE, and ITRAN.

    Index_ optra_steps_since_last_transfer = 0; // i.e., INDX

public:
    Workspace(Index_ nobs, Cluster_ ncenters) :
        // Sizes taken from the .Fortran() call in stats::kmeans().
        best_destination_cluster(nobs),
        cluster_sizes(ncenters),
        loss_multiplier(ncenters),
        gain_multiplier(ncenters),
        wcss_loss(nobs),
        update_history(ncenters)
    {}
};

template<typename Data_, typename Float_>
Float_ squared_distance_from_cluster(const Data_* data, const Float_* center, size_t ndim) {
    Float_ output = 0;
    for (size_t d = 0; d < ndim; ++d) {
        Float_ delta = static_cast<Float_>(data[d]) - center[d]; // cast to float for consistent precision regardless of Data_.
        output += delta * delta;
    }
    return output;
}
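As a quick sanity check of the function above, the following toy restatement (with illustrative names, fixed to `double`) computes the squared Euclidean distance between an observation and a centroid:

```cpp
#include <cassert>
#include <cstddef>

// Toy restatement of squared_distance_from_cluster: accumulate the squared
// per-dimension differences between an observation and a centroid.
inline double toy_squared_distance(const double* data, const double* center, std::size_t ndim) {
    double output = 0;
    for (std::size_t d = 0; d < ndim; ++d) {
        double delta = data[d] - center[d];
        output += delta * delta;
    }
    return output;
}

// In 2 dimensions: (1, 2) vs (4, 6) differ by (3, 4), so the squared
// Euclidean distance is 3*3 + 4*4 = 25.
inline double demo_squared_distance() {
    double obs[] = { 1.0, 2.0 };
    double center[] = { 4.0, 6.0 };
    return toy_squared_distance(obs, center, 2);
}
```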

template<class Matrix_, typename Cluster_, typename Float_>
void find_closest_two_centers(const Matrix_& data, Cluster_ ncenters, const Float_* centers, Cluster_* best_cluster, std::vector<Cluster_>& best_destination_cluster, int nthreads) {
    auto ndim = data.num_dimensions();

    // We assume that there are at least two centers here, otherwise we should
    // have detected that this was an edge case in RefineHartiganWong::run.
    internal::QuickSearch<Float_, Cluster_> index(ndim, ncenters, centers);

    auto nobs = data.num_observations();
    parallelize(nthreads, nobs, [&](int, Index<Matrix_> start, Index<Matrix_> length) -> void {
        auto matwork = data.new_extractor(start, length);
        for (Index<Matrix_> obs = start, end = start + length; obs < end; ++obs) {
            auto optr = matwork->get_observation();
            auto res2 = index.find2(optr);
            best_cluster[obs] = res2.first;
            best_destination_cluster[obs] = res2.second;
        }
    });
}

template<typename Float_>
constexpr Float_ big_number() {
    return 1e30; // Some very big number.
}

template<typename Data_, typename Index_, typename Cluster_, typename Float_>
void transfer_point(size_t ndim, const Data_* obs_ptr, Index_ obs_id, Cluster_ l1, Cluster_ l2, Float_* centers, Cluster_* best_cluster, Workspace<Float_, Index_, Cluster_>& work) {
    // Yes, casts to float are deliberate here, so that the
    // multipliers can be computed correctly.
    Float_ al1 = work.cluster_sizes[l1], alw = al1 - 1;
    Float_ al2 = work.cluster_sizes[l2], alt = al2 + 1;

    auto copy1 = centers + static_cast<size_t>(l1) * ndim; // cast to avoid overflow.
    auto copy2 = centers + static_cast<size_t>(l2) * ndim;
    for (decltype(ndim) dim = 0; dim < ndim; ++dim, ++copy1, ++copy2, ++obs_ptr) {
        Float_ oval = *obs_ptr; // cast to float for consistent precision regardless of Data_.
        *copy1 = (*copy1 * al1 - oval) / alw;
        *copy2 = (*copy2 * al2 + oval) / alt;
    }

    --work.cluster_sizes[l1];
    ++work.cluster_sizes[l2];

    work.gain_multiplier[l1] = alw / al1;
    work.loss_multiplier[l1] = (alw > 1 ? alw / (alw - 1) : big_number<Float_>());
    work.loss_multiplier[l2] = alt / al2;
    work.gain_multiplier[l2] = alt / (alt + 1);

    best_cluster[obs_id] = l2;
    work.best_destination_cluster[obs_id] = l1;
}
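The running centroid updates above can be verified in one dimension: subtracting the transferred point from the losing cluster's scaled centroid (and adding it to the gaining cluster's) reproduces the means recomputed from scratch. This sketch uses hypothetical helper names, not library code:

```cpp
#include <cassert>

// One-dimensional illustration of the closed-form centroid updates used by
// transfer_point when moving a point x between clusters of sizes n1 and n2.
struct ToyCentroids {
    double losing;  // centroid of the cluster that loses the point
    double gaining; // centroid of the cluster that gains it
};

inline ToyCentroids toy_transfer(double c1, int n1, double c2, int n2, double x) {
    ToyCentroids out;
    out.losing = (c1 * n1 - x) / (n1 - 1);  // same form as '(*copy1 * al1 - oval) / alw'
    out.gaining = (c2 * n2 + x) / (n2 + 1); // same form as '(*copy2 * al2 + oval) / alt'
    return out;
}

// Cluster 1 = {1, 3, 8} (mean 4), cluster 2 = {10, 12} (mean 11); move x = 8.
// The losing centroid becomes mean({1, 3}) = 2 and the gaining centroid
// becomes mean({8, 10, 12}) = 10.
inline double demo_losing() { return toy_transfer(4.0, 3, 11.0, 2, 8.0).losing; }
inline double demo_gaining() { return toy_transfer(4.0, 3, 11.0, 2, 8.0).gaining; }
```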

/* ALGORITHM AS 136.1 APPL. STATIST. (1979) VOL.28, NO.1
 * This is the OPtimal TRAnsfer stage.
 * ----------------------
 * Each point is re-assigned, if necessary, to the cluster that will induce a
 * maximum reduction in the within-cluster sum of squares. In this stage,
 * there is only one pass through the data.
 */
template<class Matrix_, typename Cluster_, typename Float_>
bool optimal_transfer(const Matrix_& data, Workspace<Float_, Index<Matrix_>, Cluster_>& work, Cluster_ ncenters, Float_* centers, Cluster_* best_cluster, bool all_live) {
    auto nobs = data.num_observations();
    auto ndim = data.num_dimensions();
    auto matwork = data.new_extractor();

    for (decltype(nobs) obs = 0; obs < nobs; ++obs) {
        ++work.optra_steps_since_last_transfer;

        auto l1 = best_cluster[obs];
        if (work.cluster_sizes[l1] != 1) {
            auto obs_ptr = matwork->get_observation(obs);

            // The original Fortran implementation re-used the WCSS loss for
            // each observation, only recomputing it if its cluster had
            // experienced an optimal transfer for an earlier observation. In
            // theory, this sounds great to avoid recomputation, but the
            // existing WCSS loss was computed in a running fashion during the
            // quick transfers. This makes them susceptible to accumulation of
            // numerical errors, even after the centroids are freshly
            // recomputed in the run() loop. So, we simplify matters and
            // improve accuracy by just recomputing the loss all the time,
            // which doesn't take too much extra effort.
            auto& wcss_loss = work.wcss_loss[obs];
            auto l1_ptr = centers + ndim * static_cast<size_t>(l1); // cast to avoid overflow.
            wcss_loss = squared_distance_from_cluster(obs_ptr, l1_ptr, ndim) * work.loss_multiplier[l1];

            // Find the cluster with minimum WCSS gain.
            auto l2 = work.best_destination_cluster[obs];
            auto original_l2 = l2;
            auto l2_ptr = centers + ndim * static_cast<size_t>(l2); // cast to avoid overflow.
            auto wcss_gain = squared_distance_from_cluster(obs_ptr, l2_ptr, ndim) * work.gain_multiplier[l2];

            auto update_destination_cluster = [&](Cluster_ cen) -> void {
                auto cen_ptr = centers + ndim * static_cast<size_t>(cen); // cast to avoid overflow.
                auto candidate = squared_distance_from_cluster(obs_ptr, cen_ptr, ndim) * work.gain_multiplier[cen];
                if (candidate < wcss_gain) {
                    wcss_gain = candidate;
                    l2 = cen;
                }
            };

            // If the current assigned cluster is live, we need to check all
            // other clusters as potential transfer destinations, because the
            // gain/loss comparison has changed. Otherwise, we only need to
            // consider other live clusters as transfer destinations; the
            // non-live clusters were already rejected as transfer destinations
            // when compared to the current assigned cluster, so there's no
            // point checking them again if they didn't change in the meantime.
            //
            // The exception is for the first call to optimal_transfer, where
            // we consider all clusters as live (i.e., all_live = true). This
            // is because no observation really knows its best transfer
            // destination yet - the second-closest cluster is just a
            // guesstimate - so we need to compute it exhaustively.
            if (all_live || work.update_history[l1].is_live(obs)) {
                for (Cluster_ cen = 0; cen < ncenters; ++cen) {
                    if (cen != l1 && cen != original_l2) {
                        update_destination_cluster(cen);
                    }
                }
            } else {
                for (Cluster_ cen = 0; cen < ncenters; ++cen) {
                    if (cen != l1 && cen != original_l2 && work.update_history[cen].is_live(obs)) {
                        update_destination_cluster(cen);
                    }
                }
            }

            // Deciding whether to make the transfer based on the change to the WCSS.
            if (wcss_gain >= wcss_loss) {
                work.best_destination_cluster[obs] = l2;
            } else {
                work.optra_steps_since_last_transfer = 0;
                work.update_history[l1].set_optimal(obs);
                work.update_history[l2].set_optimal(obs);
                transfer_point(ndim, obs_ptr, obs, l1, l2, centers, best_cluster, work);
            }
        }

        // Stop if we've iterated through the entire dataset and no transfer of
        // any kind took place, be it optimal or quick.
        if (work.optra_steps_since_last_transfer == nobs) {
            return true;
        }
    }

    return false;
}
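The transfer criterion above relies on Hartigan and Wong's multipliers: the WCSS "loss" `d1^2 * n1 / (n1 - 1)` is exactly the decrease in the donor cluster's within-cluster sum of squares when a point leaves it, and the "gain" `d2^2 * n2 / (n2 + 1)` is exactly the increase in the receiver's. The following one-dimensional sketch (hypothetical helper names, not library code) checks both identities against WCSS recomputed from scratch:

```cpp
#include <cassert>
#include <vector>

// Within-cluster sum of squares of a 1-d point set, computed from scratch.
inline double wcss_1d(const std::vector<double>& pts) {
    double mean = 0;
    for (double p : pts) { mean += p; }
    mean /= pts.size();
    double out = 0;
    for (double p : pts) { out += (p - mean) * (p - mean); }
    return out;
}

// Hartigan-Wong loss: WCSS decrease when x leaves a cluster of size n at 'center'.
inline double hw_loss(double x, double center, int n) {
    double d = x - center;
    return d * d * n / (n - 1.0);
}

// Hartigan-Wong gain: WCSS increase when x joins a cluster of size n at 'center'.
inline double hw_gain(double x, double center, int n) {
    double d = x - center;
    return d * d * n / (n + 1.0);
}

// Donor cluster {0, 2, 4} (mean 2), receiver {9, 11} (mean 10); candidate x = 4.
inline double demo_loss() { return hw_loss(4.0, 2.0, 3); }
inline double demo_gain() { return hw_gain(4.0, 10.0, 2); }
inline double demo_wcss_drop() { return wcss_1d({0, 2, 4}) - wcss_1d({0, 2}); }
inline double demo_wcss_rise() { return wcss_1d({9, 11, 4}) - wcss_1d({9, 11}); }
```

Here the gain exceeds the loss, so this particular transfer would be rejected, matching the `wcss_gain >= wcss_loss` branch above.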

/* ALGORITHM AS 136.2 APPL. STATIST. (1979) VOL.28, NO.1
 * This is the Quick TRANsfer stage.
 * --------------------
 * IC1(I) is the cluster which point I currently belongs to.
 * IC2(I) is the cluster which point I is most likely to be transferred to.
 *
 * For each point I, IC1(I) & IC2(I) are switched, if necessary, to reduce
 * within-cluster sum of squares. The cluster centres are updated after each
 * step. In this stage, we loop through the data until no further change is to
 * take place, or we hit an iteration limit, whichever is first.
 */
template<class Matrix_, typename Cluster_, typename Float_>
std::pair<bool, bool> quick_transfer(
    const Matrix_& data,
    Workspace<Float_, Index<Matrix_>, Cluster_>& work,
    Float_* centers,
    Cluster_* best_cluster,
    int quick_iterations)
{
    bool had_transfer = false;

    auto nobs = data.num_observations();
    auto matwork = data.new_extractor();
    size_t ndim = data.num_dimensions();

    typedef decltype(nobs) Index_;
    Index_ steps_since_last_quick_transfer = 0; // i.e., ICOUN in the original Fortran implementation.

    for (int it = 0; it < quick_iterations; ++it) {
        int prev_it = it - 1;

        for (decltype(nobs) obs = 0; obs < nobs; ++obs) {
            ++steps_since_last_quick_transfer;
            auto l1 = best_cluster[obs];

            if (work.cluster_sizes[l1] != 1) {
                decltype(matwork->get_observation(obs)) obs_ptr = NULL;

                // Need to update the WCSS loss if the cluster was updated recently.
                // Otherwise, we must have already updated the WCSS in a previous
                // iteration of the outermost loop, so this can be skipped.
                //
                // Note that we use changed_after_or_at; if the same
                // observation was changed in the previous iteration of the
                // outermost loop, its WCSS loss won't have been updated yet.
                auto& history1 = work.update_history[l1];
                if (history1.changed_after_or_at(prev_it, obs)) {
                    auto l1_ptr = centers + static_cast<size_t>(l1) * ndim; // cast to avoid overflow.
                    obs_ptr = matwork->get_observation(obs);
                    work.wcss_loss[obs] = squared_distance_from_cluster(obs_ptr, l1_ptr, ndim) * work.loss_multiplier[l1];
                }

                // If neither the best nor the second-best cluster has changed
                // after the previous iteration in which we visited this
                // observation, there's no point reevaluating the transfer,
                // because nothing's going to be different anyway.
                auto l2 = work.best_destination_cluster[obs];
                auto& history2 = work.update_history[l2];
                if (history1.changed_after(prev_it, obs) || history2.changed_after(prev_it, obs)) {
                    if (obs_ptr == NULL) {
                        obs_ptr = matwork->get_observation(obs);
                    }
                    auto l2_ptr = centers + static_cast<size_t>(l2) * ndim; // cast to avoid overflow.
                    auto wcss_gain = squared_distance_from_cluster(obs_ptr, l2_ptr, ndim) * work.gain_multiplier[l2];

                    if (wcss_gain < work.wcss_loss[obs]) {
                        had_transfer = true;
                        steps_since_last_quick_transfer = 0;
                        history1.set_quick(it, obs);
                        history2.set_quick(it, obs);
                        transfer_point(ndim, obs_ptr, obs, l1, l2, centers, best_cluster, work);
                    }
                }
            }

            if (steps_since_last_quick_transfer == nobs) {
                // Quit early if no transfer occurred within the past 'nobs'
                // steps, as we've already converged for each observation.
                return std::make_pair(had_transfer, false);
            }
        }
    }

    return std::make_pair(had_transfer, true);
}

}

/**
 * @brief Implements the Hartigan-Wong algorithm for k-means clustering.
 */
template<typename Index_, typename Data_, typename Cluster_, typename Float_, class Matrix_ = Matrix<Index_, Data_> >
class RefineHartiganWong final : public Refine<Index_, Data_, Cluster_, Float_, Matrix_> {
public:
    /**
     * @param options Options for the Hartigan-Wong algorithm.
     */
    RefineHartiganWong(RefineHartiganWongOptions options) : my_options(std::move(options)) {}

    /**
     * Default constructor.
     */
    RefineHartiganWong() = default;

private:
    RefineHartiganWongOptions my_options;

public:
    /**
     * @return Options for the Hartigan-Wong algorithm, to be modified prior to calling run().
     */
    RefineHartiganWongOptions& get_options() {
        return my_options;
    }

public:
    Details<Index_> run(const Matrix_& data, Cluster_ ncenters, Float_* centers, Cluster_* clusters) const {
        auto nobs = data.num_observations();
        if (internal::is_edge_case(nobs, ncenters)) {
            return internal::process_edge_case(data, ncenters, centers, clusters);
        }

        RefineHartiganWong_internal::Workspace<Float_, Index_, Cluster_> work(nobs, ncenters);

        RefineHartiganWong_internal::find_closest_two_centers(data, ncenters, centers, clusters, work.best_destination_cluster, my_options.num_threads);
        for (Index_ obs = 0; obs < nobs; ++obs) {
            ++work.cluster_sizes[clusters[obs]];
        }
        internal::compute_centroids(data, ncenters, centers, clusters, work.cluster_sizes);

        for (Cluster_ cen = 0; cen < ncenters; ++cen) {
            Float_ num = work.cluster_sizes[cen]; // yes, cast is deliberate here so that the multipliers can be computed correctly.
            work.gain_multiplier[cen] = num / (num + 1);
            work.loss_multiplier[cen] = (num > 1 ? num / (num - 1) : RefineHartiganWong_internal::big_number<Float_>());
        }

        int iter = 0;
        int ifault = 0;
        while ((++iter) <= my_options.max_iterations) {
            bool finished = RefineHartiganWong_internal::optimal_transfer(data, work, ncenters, centers, clusters, /* all_live = */ (iter == 1));
            if (finished) {
                break;
            }

            auto quick_status = RefineHartiganWong_internal::quick_transfer(
                data,
                work,
                centers,
                clusters,
                my_options.max_quick_transfer_iterations
            );

            // Recomputing the centroids to avoid accumulation of numerical
            // errors after many transfers (e.g., adding a whole bunch of
            // values and then subtracting them again leaves behind some
            // cancellation error). Note that we don't have to do this if
            // 'finished = true' as this means that there was no transfer of
            // any kind in the final pass through the dataset.
            internal::compute_centroids(data, ncenters, centers, clusters, work.cluster_sizes);

            if (quick_status.second) { // Hit the quick transfer iteration limit.
                if (my_options.quit_on_quick_transfer_convergence_failure) {
                    ifault = 4;
                    break;
                }
            } else {
                // If quick transfer converged and there are only two clusters,
                // there is no need to re-enter the optimal transfer stage.
                if (ncenters == 2) {
                    break;
                }
            }

            if (quick_status.first) { // At least one quick transfer was performed.
                work.optra_steps_since_last_transfer = 0;
            }

            for (Cluster_ c = 0; c < ncenters; ++c) {
                work.update_history[c].reset(nobs);
            }
        }

        if (iter == my_options.max_iterations + 1) {
            ifault = 2;
        }

        return Details(std::move(work.cluster_sizes), iter, ifault);
    }
};

}

#endif