kmeans
k-means clustering in C++
RefineHartiganWong.hpp
1#ifndef KMEANS_HARTIGAN_WONG_HPP
2#define KMEANS_HARTIGAN_WONG_HPP
3
4#include <vector>
5#include <algorithm>
6#include <numeric>
7#include <cstdint>
8
9#include "sanisizer/sanisizer.hpp"
10
11#include "Refine.hpp"
12#include "Details.hpp"
13#include "QuickSearch.hpp"
14#include "parallelize.hpp"
15#include "compute_centroids.hpp"
16#include "is_edge_case.hpp"
17#include "utils.hpp"
18
25namespace kmeans {
26
57
61namespace RefineHartiganWong_internal {
62
63/*
64 * Alright, get ready for a data dump, because this is where it gets real.
65 * Here I attempt to untangle the code spaghetti from the original Fortran
66 * implementation, which is exemplified by the NCP and LIVE arrays.
67 *
68 * Okay, NCP first. Each element L in NCP has a dual interpretation:
69 *
70 * - In the optimal-transfer stage, NCP(L) stores the step at which cluster L
71 * was last updated. Each step is just the observation index as the optimal
72 * transfer only does one pass through the dataset.
73 * - In the quick-transfer stage, NCP(L) stores the step at which cluster L was
74 * last updated, plus M (i.e., the number of observations). Here, the step
75 * corresponding to an observation will be 'M * X + obs' for some integer X
76 * >= 0, where X is the iteration of the quick transfer. This means that the
77 * stored value is defined as 'M * (X + 1) + obs'.
78 * - After each quick_transfer call, NCP(L) is set back to zero so any existing
79 * values will not carry over into the next optimal_transfer call.
80 *
81 * Note that these two definitions bleed into each other as the NCP(L) set by
82 * optimal_transfer is still referenced in the first few iterations of
83 * quick_transfer before it eventually gets overwritten. The easiest way to
84 * interpret this is to consider the optimal transfer as iteration 'X = -1'
85 * from the perspective of the quick transfer iterations.
86 *
87 * In short, the NCP array is used to determine whether a cluster was modified
88 * within the last M steps of the optimal_transfer or quick_transfer.
89 *
90 * Let's move on to LIVE, which has an even more painful interpretation:
91 *
92 * - Before each optimal_transfer call, LIVE(L) stores the observation at which
93 * cluster L was updated in the _previous_ optimal_transfer call.
94 * - During the optimal_transfer call, LIVE(L) is updated to the observation at
95 * which L was updated in this call, plus M (i.e., number of observations).
96 * - After the optimal_transfer call, LIVE(L) is updated by subtracting M, so
97 * that the interpretation is correct in the next call.
98 * - After the quick_transfer call, LIVE(L) is set to M + 1 if a quick transfer
99 * took place for L, effectively mimicking a "very recent" update.
100 *
101 * It basically tells us whether there was a recent transfer (optimal or quick)
102 * within the last M steps of optimal_transfer. If so, the cluster is "live".
103 * This differs very slightly from NCP as LIVE only counts optimal transfer
104 * steps while NCP counts both optimal and quick transfer steps.
105 *
106 * We simplify this mess by consolidating NCP and LIVE into a single update
107 * history that can serve both purposes. This is possible as LIVE is not used
108 * in quick_transfer, while any modifications made to NCP by quick_transfer are
109 * ignored and overwritten in optimal_transfer. Thus, there won't be any
110 * conflicts between the two meanings in a consolidated update history.
111 *
112 * The update history for cluster L now holds the iteration ('X') and the
113 * observation ('obs') at which the cluster was last modified. This has
114 * an obvious mapping to NCP(L), which was already defined in terms of
115 * 'X' and 'obs' anyway. For LIVE(L), the mapping is more complex:
116 *
117 * - If an optimal_transfer occurred but no quick_transfer occurs for L,
118 * we set 'X = -2'. 'obs' is not modified as it is the observation at
119 * which L was modified in the previous optimal_transfer iteration;
120 * this is basically just the definition of LIVE(L).
121 * - If a quick transfer does occur, we set 'X = -2' and 'obs = M + 1'.
122 * Again, this is equivalent to the behavior of LIVE(L).
123 * - If neither a quick_transfer nor optimal_transfer occurs for L,
124 * we set 'X = -3' to indicate that L didn't change in any manner.
125 * - In optimal_transfer, we consider a cluster to be live at a particular
126 * observation if its stored 'obs' exceeds that observation's index and its 'X' is -2.
127 *
128 * Note that Fortran is 1-based so the actual code is a little different than
129 * described above for 'obs' - specifically, we just need to set it to 'M'
130 * when a quick transfer occurs. Meaning of 'X' is unchanged.
131 *
132 * Incidentally, we can easily detect if a quick transfer occurs based on
133 * whether 'X > -1'. This obviates the need for the ITRAN array.
134 */
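/*
 * A worked example: suppose M = 100 observations. If cluster L was last
 * modified during quick transfer iteration X = 2 at observation 37, the
 * original implementation would store NCP(L) = M * (X + 1) + 37 = 337, while
 * the consolidated history just records (X = 2, obs = 37). Treating the
 * preceding optimal transfer as X = -1 gives NCP(L) = obs, matching the
 * optimal-transfer definition above. After reset(), any cluster touched by a
 * quick transfer becomes (X = -2, obs = M), i.e., "updated very recently" from
 * the perspective of the next optimal transfer, mirroring LIVE(L) = M + 1.
 */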
135template<typename Index_>
136class UpdateHistory {
137private:
138 Index_ my_last_observation = 0;
139
140 static constexpr int current_optimal_transfer = -1;
141 static constexpr int previous_optimal_transfer = -2;
142 static constexpr int ancient_history = -3;
143
144 // Starting at -3 as we can't possibly have any updates if we haven't done
145 // any transfers, optimal or quick!
146 int my_last_iteration = ancient_history;
147
148public:
149 void reset(const Index_ total_obs) {
150 if (my_last_iteration > current_optimal_transfer) { // i.e., quick_transfer.
151 my_last_observation = total_obs;
152 my_last_iteration = previous_optimal_transfer;
153 } else if (my_last_iteration == current_optimal_transfer) {
154 // Preserve the existing 'my_last_observation', just bump the iteration downwards.
155 my_last_iteration = previous_optimal_transfer;
156 } else {
157 my_last_iteration = ancient_history;
158 }
159 }
160
161 void set_optimal(const Index_ obs) {
162 my_last_observation = obs;
163 my_last_iteration = current_optimal_transfer;
164 }
165
166 // Here, iter should be from '[0, max_quick_transfer_iterations)'.
167 void set_quick(const int iter, const Index_ obs) {
168 my_last_iteration = iter;
169 my_last_observation = obs;
170 }
171
172public:
173 bool changed_after(const int iter, const Index_ obs) const {
174 if (my_last_iteration == iter) {
175 return my_last_observation > obs;
176 } else {
177 return my_last_iteration > iter;
178 }
179 }
180
181 bool changed_after_or_at(const int iter, const Index_ obs) const {
182 if (my_last_iteration == iter) {
183 return my_last_observation >= obs;
184 } else {
185 return my_last_iteration > iter;
186 }
187 }
188
189 bool is_live(const Index_ obs) const {
190 return changed_after(previous_optimal_transfer, obs);
191 }
192};
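// For illustration only (not part of the algorithm): a possible sequence of
// calls for one cluster's history, assuming 10 observations in total.
//
//     UpdateHistory<int> h;
//     h.set_optimal(4); // cluster changed at observation 4 of the current optimal transfer.
//     h.is_live(7);     // true, as X == -1 means "changed in the current pass".
//     h.reset(10);      // no quick transfer occurred, so X becomes -2 and obs stays at 4.
//     h.is_live(3);     // true, as the cluster changed after observation 3 in the previous pass.
//     h.is_live(6);     // false, as 4 <= 6.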
193
194template<typename Float_, typename Index_, typename Cluster_>
195struct Workspace {
196 // Array arguments in the same order as supplied to R's kmns_ function.
197 std::vector<Cluster_> best_destination_cluster; // i.e., IC2
198 std::vector<Index_> cluster_sizes; // i.e., NC
199
200 std::vector<Float_> loss_multiplier; // i.e., AN1
201 std::vector<Float_> gain_multiplier; // i.e., AN2
202 std::vector<Float_> wcss_loss; // i.e., D
203
204 std::vector<UpdateHistory<Index_> > update_history; // i.e., NCP, LIVE, and ITRAN.
205
206 Index_ optra_steps_since_last_transfer = 0; // i.e., INDX
207
208public:
209 Workspace(Index_ nobs, Cluster_ ncenters) :
210 // Sizes taken from the .Fortran() call in stats::kmeans().
211 best_destination_cluster(sanisizer::cast<decltype(I(best_destination_cluster.size()))>(nobs)),
212 cluster_sizes(sanisizer::cast<decltype(I(cluster_sizes.size()))>(ncenters)),
213 loss_multiplier(sanisizer::cast<decltype(I(loss_multiplier.size()))>(ncenters)),
214 gain_multiplier(sanisizer::cast<decltype(I(gain_multiplier.size()))>(ncenters)),
215 wcss_loss(sanisizer::cast<decltype(I(wcss_loss.size()))>(nobs)),
216 update_history(sanisizer::cast<decltype(I(update_history.size()))>(ncenters))
217 {}
218};
219
220template<typename Data_, typename Float_>
221Float_ squared_distance_from_cluster(const Data_* const data, const Float_* const center, const std::size_t ndim) {
222 Float_ output = 0;
223 for (decltype(I(ndim)) d = 0; d < ndim; ++d) {
224 const Float_ delta = static_cast<Float_>(data[d]) - center[d]; // cast to float for consistent precision regardless of Data_.
225 output += delta * delta;
226 }
227 return output;
228}
229
230template<class Matrix_, typename Cluster_, typename Float_>
231void find_closest_two_centers(
232 const Matrix_& data,
233 const Cluster_ ncenters,
234 const Float_* const centers,
235 Cluster_* const best_cluster,
236 std::vector<Cluster_>& best_destination_cluster,
237 const int nthreads)
238{
239 const auto ndim = data.num_dimensions();
240
241 // We assume that there are at least two centers here, otherwise we should
242 // have detected that this was an edge case in RefineHartiganWong::run.
243 const internal::QuickSearch<Float_, Cluster_> index(ndim, ncenters, centers);
244
245 const auto nobs = data.num_observations();
246 parallelize(nthreads, nobs, [&](const int, const decltype(I(nobs)) start, const decltype(I(nobs)) length) -> void {
247 auto matwork = data.new_extractor(start, length);
248 for (decltype(I(start)) obs = start, end = start + length; obs < end; ++obs) {
249 const auto optr = matwork->get_observation();
250 const auto res2 = index.find2(optr);
251 best_cluster[obs] = res2.first;
252 best_destination_cluster[obs] = res2.second;
253 }
254 });
255}
256
257template<typename Float_>
258constexpr Float_ big_number() {
259 return 1e30; // Some very big number.
260}
261
262template<typename Data_, typename Index_, typename Cluster_, typename Float_>
263void transfer_point(
264 const std::size_t ndim,
265 const Data_* const obs_ptr,
266 const Index_ obs_id,
267 const Cluster_ l1,
268 const Cluster_ l2,
269 Float_* const centers,
270 Cluster_* const best_cluster,
271 Workspace<Float_, Index_, Cluster_>& work)
272{
273 // Yes, casts to float are deliberate here, so that the
274 // multipliers can be computed correctly.
275 const Float_ al1 = work.cluster_sizes[l1], alw = al1 - 1;
276 const Float_ al2 = work.cluster_sizes[l2], alt = al2 + 1;
277
278 for (decltype(I(ndim)) d = 0; d < ndim; ++d) {
279 const Float_ oval = obs_ptr[d]; // cast to float for consistent precision regardless of Data_.
280 auto& c1 = centers[sanisizer::nd_offset<std::size_t>(d, ndim, l1)];
281 c1 = (c1 * al1 - oval) / alw;
282 auto& c2 = centers[sanisizer::nd_offset<std::size_t>(d, ndim, l2)];
283 c2 = (c2 * al2 + oval) / alt;
284 }
285
286 --work.cluster_sizes[l1];
287 ++work.cluster_sizes[l2];
288
289 work.gain_multiplier[l1] = alw / al1;
290 work.loss_multiplier[l1] = (alw > 1 ? alw / (alw - 1) : big_number<Float_>());
291 work.loss_multiplier[l2] = alt / al2;
292 work.gain_multiplier[l2] = alt / (alt + 1);
293
294 best_cluster[obs_id] = l2;
295 work.best_destination_cluster[obs_id] = l1;
296}
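/*
 * For reference, the centroid updates above follow from the definition of the
 * mean: removing observation x from a cluster of size n with centroid c yields
 * a new centroid of (c * n - x) / (n - 1), while adding x to a cluster of size
 * m with centroid c' yields (c' * m + x) / (m + 1). The gain/loss multipliers
 * are then refreshed so that subsequent WCSS comparisons use the new sizes.
 */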
297
298/* ALGORITHM AS 136.1 APPL. STATIST. (1979) VOL.28, NO.1
299 * This is the OPtimal TRAnsfer stage.
300 * ----------------------
301 * Each point is re-assigned, if necessary, to the cluster that will induce a
302 * maximum reduction in the within-cluster sum of squares. In this stage,
303 * there is only one pass through the data.
304 */
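// Returns true if a full pass over the dataset completed without any transfer
// (optimal or quick), i.e., the procedure has converged; false otherwise.
// 'all_live' should be true on the first call, where every cluster is treated
// as live because the second-closest assignments are still just guesstimates.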
305template<class Matrix_, typename Cluster_, typename Float_>
306bool optimal_transfer(
307 const Matrix_& data, Workspace<Float_, Index<Matrix_>, Cluster_>& work,
308 const Cluster_ ncenters,
309 Float_* const centers,
310 Cluster_* const best_cluster,
311 const bool all_live)
312{
313 const auto nobs = data.num_observations();
314 const auto ndim = data.num_dimensions();
315 auto matwork = data.new_extractor();
316
317 for (decltype(I(nobs)) obs = 0; obs < nobs; ++obs) {
318 ++work.optra_steps_since_last_transfer;
319
320 const auto l1 = best_cluster[obs];
321 if (work.cluster_sizes[l1] != 1) {
322 const auto obs_ptr = matwork->get_observation(obs);
323
324 // The original Fortran implementation re-used the WCSS loss for
325 // each observation, only recomputing it if its cluster had
326 // experienced an optimal transfer for an earlier observation. In
327 // theory, this sounds great to avoid recomputation, but the
328 // existing WCSS loss was computed in a running fashion during the
329 // quick transfers. This makes them susceptible to accumulation of
330 // numerical errors, even after the centroids are freshly
331 // recomputed in the run() loop. So, we simplify matters and
332 // improve accuracy by just recomputing the loss all the time,
333 // which doesn't take too much extra effort.
334 auto& wcss_loss = work.wcss_loss[obs];
335 const auto l1_ptr = centers + sanisizer::product_unsafe<std::size_t>(ndim, l1);
336 wcss_loss = squared_distance_from_cluster(obs_ptr, l1_ptr, ndim) * work.loss_multiplier[l1];
337
338 // Find the cluster with minimum WCSS gain.
339 auto l2 = work.best_destination_cluster[obs];
340 const auto original_l2 = l2;
341 const auto l2_ptr = centers + sanisizer::product_unsafe<std::size_t>(ndim, l2);
342 auto wcss_gain = squared_distance_from_cluster(obs_ptr, l2_ptr, ndim) * work.gain_multiplier[l2];
343
344 const auto update_destination_cluster = [&](const Cluster_ cen) -> void {
345 auto cen_ptr = centers + sanisizer::product_unsafe<std::size_t>(ndim, cen);
346 auto candidate = squared_distance_from_cluster(obs_ptr, cen_ptr, ndim) * work.gain_multiplier[cen];
347 if (candidate < wcss_gain) {
348 wcss_gain = candidate;
349 l2 = cen;
350 }
351 };
352
353 // If the current assigned cluster is live, we need to check all
354 // other clusters as potential transfer destinations, because the
355 // gain/loss comparison has changed. Otherwise, we only need to
356 // consider other live clusters as transfer destinations; the
357 // non-live clusters were already rejected as transfer destinations
358 // when compared to the current assigned cluster, so there's no
359 // point checking them again if they didn't change in the meantime.
360 //
361 // The exception is for the first call to optimal_transfer, where we
362 // consider all clusters as live (i.e., all_live = true). This is
363 // because no observation really knows its best transfer
364 // destination yet - the second-closest cluster is just a
365 // guesstimate - so we need to compute it exhaustively.
366 if (all_live || work.update_history[l1].is_live(obs)) {
367 for (Cluster_ cen = 0; cen < ncenters; ++cen) {
368 if (cen != l1 && cen != original_l2) {
369 update_destination_cluster(cen);
370 }
371 }
372 } else {
373 for (Cluster_ cen = 0; cen < ncenters; ++cen) {
374 if (cen != l1 && cen != original_l2 && work.update_history[cen].is_live(obs)) {
375 update_destination_cluster(cen);
376 }
377 }
378 }
379
380 // Deciding whether to make the transfer based on the change to the WCSS.
381 if (wcss_gain >= wcss_loss) {
382 work.best_destination_cluster[obs] = l2;
383 } else {
384 work.optra_steps_since_last_transfer = 0;
385 work.update_history[l1].set_optimal(obs);
386 work.update_history[l2].set_optimal(obs);
387 transfer_point(ndim, obs_ptr, obs, l1, l2, centers, best_cluster, work);
388 }
389 }
390
391 // Stop if we've iterated through the entire dataset and no transfer of
392 // any kind took place, be it optimal or quick.
393 if (work.optra_steps_since_last_transfer == nobs) {
394 return true;
395 }
396 }
397
398 return false;
399}
400
401/* ALGORITHM AS 136.2 APPL. STATIST. (1979) VOL.28, NO.1
402 * This is the Quick TRANsfer stage.
403 * --------------------
404 * IC1(I) is the cluster which point I currently belongs to.
405 * IC2(I) is the cluster which point I is most likely to be transferred to.
406 *
407 * For each point I, IC1(I) & IC2(I) are switched, if necessary, to reduce
408 * within-cluster sum of squares. The cluster centres are updated after each
409 * step. In this stage, we loop through the data until no further change is to
410 * take place, or we hit an iteration limit, whichever is first.
411 */
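// Returns a pair where 'first' indicates whether at least one quick transfer
// was performed, and 'second' indicates whether the iteration limit was
// reached before the quick transfers converged.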
412template<class Matrix_, typename Cluster_, typename Float_>
413std::pair<bool, bool> quick_transfer(
414 const Matrix_& data,
415 Workspace<Float_, Index<Matrix_>, Cluster_>& work,
416 Float_* const centers,
417 Cluster_* const best_cluster,
418 const int quick_iterations)
419{
420 bool had_transfer = false;
421
422 const auto nobs = data.num_observations();
423 const auto ndim = data.num_dimensions();
424 auto matwork = data.new_extractor();
425
426 decltype(I(nobs)) steps_since_last_quick_transfer = 0; // i.e., ICOUN in the original Fortran implementation.
427
428 for (int it = 0; it < quick_iterations; ++it) {
429 const int prev_it = it - 1;
430
431 for (decltype(I(nobs)) obs = 0; obs < nobs; ++obs) {
432 ++steps_since_last_quick_transfer;
433 const auto l1 = best_cluster[obs];
434
435 if (work.cluster_sizes[l1] != 1) {
436 decltype(I(matwork->get_observation(obs))) obs_ptr = NULL;
437
438 // Need to update the WCSS loss if the cluster was updated recently.
439 // Otherwise, we must have already updated the WCSS in a previous
440 // iteration of the outermost loop, so this can be skipped.
441 //
442                 // Note that we use changed_after_or_at; if the same
443 // observation was changed in the previous iteration of the
444 // outermost loop, its WCSS loss won't have been updated yet.
445 auto& history1 = work.update_history[l1];
446 if (history1.changed_after_or_at(prev_it, obs)) {
447 const auto l1_ptr = centers + sanisizer::product_unsafe<std::size_t>(l1, ndim);
448 obs_ptr = matwork->get_observation(obs);
449 work.wcss_loss[obs] = squared_distance_from_cluster(obs_ptr, l1_ptr, ndim) * work.loss_multiplier[l1];
450 }
451
452                 // If neither the best nor the second-best cluster has changed
453                 // since the previous iteration in which we visited this
454                 // observation, then there's no point re-evaluating the
455                 // transfer, because nothing's going to be different anyway.
456 const auto l2 = work.best_destination_cluster[obs];
457 auto& history2 = work.update_history[l2];
458 if (history1.changed_after(prev_it, obs) || history2.changed_after(prev_it, obs)) {
459 if (obs_ptr == NULL) {
460 obs_ptr = matwork->get_observation(obs);
461 }
462 const auto l2_ptr = centers + sanisizer::product_unsafe<std::size_t>(l2, ndim);
463 const auto wcss_gain = squared_distance_from_cluster(obs_ptr, l2_ptr, ndim) * work.gain_multiplier[l2];
464
465 if (wcss_gain < work.wcss_loss[obs]) {
466 had_transfer = true;
467 steps_since_last_quick_transfer = 0;
468 history1.set_quick(it, obs);
469 history2.set_quick(it, obs);
470 transfer_point(ndim, obs_ptr, obs, l1, l2, centers, best_cluster, work);
471 }
472 }
473 }
474
475 if (steps_since_last_quick_transfer == nobs) {
476 // Quit early if no transfer occurred within the past 'nobs'
477 // steps, as we've already converged for each observation.
478 return std::make_pair(had_transfer, false);
479 }
480 }
481 }
482
483 return std::make_pair(had_transfer, true);
484}
485
486}
526template<typename Index_, typename Data_, typename Cluster_, typename Float_, class Matrix_ = Matrix<Index_, Data_> >
527class RefineHartiganWong final : public Refine<Index_, Data_, Cluster_, Float_, Matrix_> {
528public:
532 RefineHartiganWong(RefineHartiganWongOptions options) : my_options(std::move(options)) {}
533
538
539private:
540 RefineHartiganWongOptions my_options;
541
542public:
547    RefineHartiganWongOptions& get_options() {
548        return my_options;
549    }
550
551public:
555 Details<Index_> run(const Matrix_& data, const Cluster_ ncenters, Float_* const centers, Cluster_* const clusters) const {
556 const auto nobs = data.num_observations();
557 if (internal::is_edge_case(nobs, ncenters)) {
558 return internal::process_edge_case(data, ncenters, centers, clusters);
559 }
560
561 RefineHartiganWong_internal::Workspace<Float_, Index_, Cluster_> work(nobs, ncenters);
562
563 RefineHartiganWong_internal::find_closest_two_centers(data, ncenters, centers, clusters, work.best_destination_cluster, my_options.num_threads);
564 for (Index_ obs = 0; obs < nobs; ++obs) {
565 ++work.cluster_sizes[clusters[obs]];
566 }
567 internal::compute_centroids(data, ncenters, centers, clusters, work.cluster_sizes);
568
569 for (Cluster_ cen = 0; cen < ncenters; ++cen) {
570 const Float_ num = work.cluster_sizes[cen]; // yes, cast is deliberate here so that the multipliers can be computed correctly.
571 work.gain_multiplier[cen] = num / (num + 1);
572 work.loss_multiplier[cen] = (num > 1 ? num / (num - 1) : RefineHartiganWong_internal::big_number<Float_>());
573 }
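        // These multipliers encode the Hartigan-Wong bookkeeping: removing an
        // observation at squared distance d2 from the centroid of its cluster of
        // size n reduces that cluster's within-cluster sum of squares by
        // d2 * n / (n - 1), while adding it to a cluster of size m at squared
        // distance e2 increases that cluster's sum of squares by e2 * m / (m + 1).
        // A transfer is beneficial when the increase is smaller than the
        // reduction, which is exactly the comparison made in optimal_transfer()
        // and quick_transfer().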
574
575 decltype(I(my_options.max_iterations)) iter = 0;
576 int ifault = 0;
577 for (; iter < my_options.max_iterations; ++iter) {
578 const bool finished = RefineHartiganWong_internal::optimal_transfer(data, work, ncenters, centers, clusters, /* all_live = */ (iter == 0));
579 if (finished) {
580 break;
581 }
582
583 const auto quick_status = RefineHartiganWong_internal::quick_transfer(
584 data,
585 work,
586 centers,
587                clusters,
588                my_options.max_quick_transfer_iterations
589            );
590
591 // Recomputing the centroids to avoid accumulation of numerical
592 // errors after many transfers (e.g., adding a whole bunch of
593 // values and then subtracting them again leaves behind some
594 // cancellation error). Note that we don't have to do this if
595 // 'finished = true' as this means that there was no transfer of
596 // any kind in the final pass through the dataset.
597 internal::compute_centroids(data, ncenters, centers, clusters, work.cluster_sizes);
598
599            if (quick_status.second) { // Hit the quick transfer iteration limit.
600                if (my_options.quit_on_quick_transfer_convergence_failure) {
601                    ifault = 4;
602 break;
603 }
604 } else {
605 // If quick transfer converged and there are only two clusters,
606 // there is no need to re-enter the optimal transfer stage.
607 if (ncenters == 2) {
608 break;
609 }
610 }
611
612 if (quick_status.first) { // At least one quick transfer was performed.
613 work.optra_steps_since_last_transfer = 0;
614 }
615
616 for (Cluster_ c = 0; c < ncenters; ++c) {
617 work.update_history[c].reset(nobs);
618 }
619 }
620
621 if (iter == my_options.max_iterations) {
622 ifault = 2;
623 } else {
624 ++iter; // make it 1-based.
625 }
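        // At this point, 'ifault' is 0 if the algorithm converged, 2 if the
        // maximum number of optimal transfer iterations was reached, or 4 if the
        // quick transfer stage hit its iteration limit and the options requested
        // an early exit in that situation.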
626
627 return Details(std::move(work.cluster_sizes), iter, ifault);
628 }
632};
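/*
 * A minimal usage sketch, for illustration only. It assumes that 'mat' is some
 * implementation of the Matrix interface with 'double' data and 'int' indices,
 * that 'centers' is a std::vector<double> of length dimensions * k (column-major,
 * one column per cluster) seeded with the initial centroids, and that 'clusters'
 * is a std::vector<int> with one entry per observation:
 *
 *     kmeans::RefineHartiganWongOptions opt;
 *     opt.max_iterations = 20;
 *     kmeans::RefineHartiganWong<int, double, int, double> refiner(opt);
 *     auto details = refiner.run(mat, k, centers.data(), clusters.data());
 *     // 'details' holds the final cluster sizes, the number of iterations used,
 *     // and the convergence status code.
 */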
633
634}
635
636#endif