Ward’s method seems to work well, but complete linkage would also probably do a good job here. The problem with method selection is that the “best” method depends on the unknown nature of the underlying data. Ward and complete linkage assume compact clusters, but this might not be the case, e.g., density-based methods would do better for irregular shapes. This might suggest that ensemble or consensus clustering would perform best.
The issue is with the interpretability of whatever clusters crawl out at the end. If an assigment only occurs with a minority of methods, should it be discarded, even if those methods focus on particularly pertinent aspects of the data? This is likely to occur if you use substantially different clustering methods, given that the use of similar methods would defeat the purpose. Upon summarization, these minority assignments would be discarded and power would be lost relative to application of the minority methods by themselves.
Rather, the main utility of a consensus method is that it tells you which clusters are the most robust with respect to variability from the choice of method. These clusters can be considered conservative as you need everyone to detect them, resulting in some loss of power. However, if you assume that each method is affected by noise in different ways, then the consensus clusters are effectively denoised, which is nice.
The age-old question - how many clusters? Maximizing the average silhouette width seems to work pretty well, as it protects against too few and too many clusters. This is better than the gap statistic in that it allows you to see the quality of the clusters at the same time. In particular, you can colour the silhouette by its own colour if positive, and by the colour of its neighbour if negative. This tells you where the “wrong” cells should have been assigned to. Thus, a cluster is only well-separated if most its cells have positive width; and if only a few cells in other clusters have negative widths and have the target cluster as a closest neighbour.
Of course, it is difficult to assess how “right” or “wrong” the clustering is, especially in the context of data exploration. For example, with underclustering, one might group CD4+ and CD8+ T cells together in one cluster, and B cells in another cluster. This is not wrong per se - suboptimal, perhaps, as you’re not discovering all the heterogeneity, but not wrong. Conversely, with overclustering, you would just end up splitting a cluster into multiple smaller clusters. This complicates interpretation as some clusters are probably redundant with each other - but it’s not wrong either, just results in some extra work during annotation. Indeed, the smaller clusters probably have some differences, so it just comes down to the resolution that you want to use.
This is widely acknowledged to be a bad idea, but it is worth considering exactly why. The obvious practical issue is that it introduces (unnecessary) stochasticity to the clustering results. This is due to the sensitivity of the t-SNE’s algorithm to initialization, such that even minor deviations result in large quantitative differences in the output coordinates. It also adds an element of arbitrariness due the impact of the choice of perplexity, which further complicates the interpretation of the results.
A more subtle point is that t-SNE - or indeed, any non-linear dimensionality reduction technique - must distort the distances between points. This is required in order to effectively compress high-dimensional data into a two-dimensional space. However, this will also affect the behaviour of clustering algorithms that operate on distances. Cells that are close together in t-SNE space may not be so in the original space, resulting in incorrect clusters. Consider the following example:
set.seed(1000)
y <- matrix(rnorm(100000), ncol=200) # true location of each cell.
out <- kmeans(t(y), centers=2)
library(Rtsne)
t.out <- Rtsne(t(y))
plot(t.out$Y[,1], t.out$Y[,2], pch=16, col=factor(out$cluster))
One could argue that t-SNE’s distortion is, in fact, a good thing as it provides some denoising to improve downstream clustering. While this may be true, it is also difficult to justify theoretically when discrepancies arise between clustering results.
This leads to my final point: the t-SNE itself provides a powerful sanity check for the clustering, and vice versa. The two algorithms operate on different principles, so agreement of cluster labels with visualized clusters suggests that the underlying structure is not an analysis artifact. This reasoning would not be possible if the clustering was performed directly on the t-SNE components, as any issues with the t-SNE would propagate to the clusters.
sessionInfo()
## R version 3.5.0 RC (2018-04-16 r74624)
## Platform: x86_64-apple-darwin15.6.0 (64-bit)
## Running under: OS X El Capitan 10.11.6
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib
##
## locale:
## [1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] Rtsne_0.13 BiocStyle_2.9.6
##
## loaded via a namespace (and not attached):
## [1] Rcpp_0.12.19 bookdown_0.7 digest_0.6.18
## [4] rprojroot_1.3-2 backports_1.1.2 magrittr_1.5
## [7] evaluate_0.12 stringi_1.2.4 rmarkdown_1.10
## [10] tools_3.5.0 stringr_1.3.1 xfun_0.4
## [13] yaml_2.2.0 compiler_3.5.0 BiocManager_1.30.3
## [16] htmltools_0.3.6 knitr_1.20