1 Overview

Exploratory analyses of single-cell RNA sequencing (scRNA-seq) data often involve clustering to summarize the data for further interpretation. It is routine to assess the quality of the clustering, e.g., based on how well separated or modular the clusters are. Another aspect of quality is “cluster stability”, i.e., whether the same clusters would be reproduced in a replicate experiment. Bootstrapping is often used to assess stability, as it provides a non-parametric method for generating replicate datasets in silico. Here we will discuss some of the common uses of bootstrapping in scRNA-seq data analyses.

2 By gene

2.1 Overview

The simplest approach is to resample genes to generate a “replicate” dataset. Assume that we have already applied some clustering method to the cells of the original dataset. We generate a bootstrap replicate by sampling genes with replacement, apply the same clustering method to the replicate, record the resulting clustering, and repeat. For two cells that were originally clustered together, we report the percentage of replicates in which they were again clustered together; conversely, for two cells that were not originally clustered together, we report the percentage of replicates in which they remained apart. This provides a measure of reproducibility for the clustering, with larger values indicating that the clusters are more stable.

This approach is implemented in a number of software packages such as pvclust (Suzuki and Shimodaira 2006), and was probably motivated by the use of bootstrap confidence levels for phylogenetic trees (Efron, Halloran, and Holmes 1996). It is appealing as the identities of the cells/samples are not altered, allowing a direct comparison of clusterings between the replicate and original datasets. However, bootstrapping on genes involves a number of problematic assumptions:

  • Genes are independent of each other, conditional on the underlying structure (as genes supporting the same clustering would always have correlated expression profiles). This is untrue in practice due to correlations between co-regulated genes within clusters, which causes the bootstrap to overstate the confidence in any stochastic substructure. In contrast, the original use case of resampling loci for phylogenetic trees was more justifiable under the assumptions of neutral evolution and infinite sites.
  • Genes are interchangeable with each other. This is not appealing in principle, as the set of genes used in each experiment is fixed; a “replicate” generated by resampling genes bears little theoretical resemblance to a genuine replicate generated with the same genes. In practical terms, there is a ~37% chance of any given gene being omitted during resampling (see the quick calculation below), so clusters defined by only a few genes are liable to be considered unstable even if the signal is highly consistent.
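
The ~37% figure is easy to verify: the probability that a particular gene is never drawn in a bootstrap sample of n genes is (1 - 1/n)^n, which approaches exp(-1) ≈ 0.37 as n increases.

n <- 1000 # arbitrary number of genes, for illustration.
(1 - 1/n)^n # chance that a given gene is omitted from a replicate.
exp(-1) # limiting value for large n.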

2.2 Simulation

To illustrate, let’s have a look at 100 cells drawn from a single homogeneous population. These cells express strongly co-regulated genes in each of 5 pathways, but there is no systematic structure among the pathways themselves.

dataGen <- function(npath, perpath, ncells) {
    # Sample an activity value for each pathway in each cell.
    each.path <- matrix(rnorm(npath*ncells), ncol=ncells)
    # Each pathway drives 'perpath' co-regulated genes.
    gene.exprs <- each.path[rep(seq_len(npath), each=perpath),]
    # Add a small amount of gene-specific noise.
    gene.exprs + rnorm(length(gene.exprs), sd=0.1)
}
y <- dataGen(npath=5, perpath=100, ncells=100)

We apply simple k-means to split this into 4 clusters. This clustering is completely spurious as there is no real structure here.

clusterFun <- function(y) {
    cluster <- kmeans(t(y), centers=4)$cluster
    # Indicator matrix specifying whether two cells are co-clustered.
    outer(cluster, cluster, "==")
}
together <- clusterFun(y)

We perform bootstrapping on the genes to assess cluster stability. This is done by checking whether the co-clusterings of cells are preserved in the bootstrap replicate.

result <- matrix(0, ncol(y), ncol(y))
for (x in 1:100) {
    # Resample genes (rows) with replacement and recluster.
    new.y <- y[sample(nrow(y), nrow(y), replace=TRUE),]
    new.together <- clusterFun(new.y)
    # Count iterations in which the original co-clustering is preserved.
    result <- result + as.integer(new.together==together)
}

We can compare this to a true replicate of this dataset. Recall that the pathway activities are unstructured, so we are under no obligation to re-use the specific values in y when generating a replicate.

result2 <- matrix(0, ncol(y), ncol(y))
for (x in 1:100) {
    # Generate an independent true replicate and recluster.
    new.y <- dataGen(npath=5, perpath=100, ncells=100)
    new.together2 <- clusterFun(new.y)
    result2 <- result2 + as.integer(new.together2==together)
}

We can see that the reproducibility values are consistently overestimated by bootstrapping. This is a consequence of the correlations between genes. The true reproducibility for co-clustering is close to 25%, as cells are split across 4 clusters.

par(mfrow=c(1,2))
boxplot(split(result, together), ylab="Reproducibility (%)", 
    xlab="Same cluster", main="Bootstrapping", ylim=c(0, 100))
boxplot(split(result2, together), ylab="Reproducibility (%)", 
    xlab="Same cluster", main="Truth", ylim=c(0, 100))

Conversely, bootstrapping is accurate if the genes are actually independent. This is consistent with the underlying assumptions of the gene-based bootstrap.

y <- dataGen(npath=500, perpath=1, ncells=100)
together <- clusterFun(y)

result3 <- matrix(0, ncol(y), ncol(y))
for (x in 1:100) {
    new.y <- y[sample(nrow(y), nrow(y), replace=TRUE),]
    new.together <- clusterFun(new.y)
    result3 <- result3 + as.integer(new.together==together)
}

result4 <- matrix(0, ncol(y), ncol(y))
for (x in 1:100) {
    new.y <- dataGen(npath=500, perpath=1, ncells=100)
    new.together <- clusterFun(new.y)
    result4 <- result4 + as.integer(new.together==together)
}

par(mfrow=c(1,2))
boxplot(split(result3, together), ylab="Reproducibility (%)", 
    xlab="Same cluster", main="Bootstrapping", ylim=c(0, 100))
boxplot(split(result4, together), ylab="Reproducibility (%)", 
    xlab="Same cluster", main="Truth", ylim=c(0, 100))

More subtle variants of this problem arise when there is actually some weak underlying structure. In such cases, the correlation-induced bias of the bootstrap estimates may inflate the apparent stability of the clusters. This defeats the purpose of using these estimates as an objective criterion for determining the number and reliability of clusters.

3 By cell

3.1 Overview

The alternative approach is to bootstrap by resampling the cells with replacement. Here we are resampling observations, which is closer to the original application of the bootstrap (Efron and Tibshirani 1986). We can reasonably assume that the cells were independently drawn from the same underlying population (depending, of course, on the experimental protocol used for sampling the cells), such that the cell-bootstrapped replicate is a close approximation of a real experimental replicate. No assumptions about the independence of genes are involved, which is appealing.

It is important to keep in mind that bootstrapping on the cells fundamentally changes the cell identities in the bootstrap replicate. A resampled cell in a bootstrap replicate cannot be considered to be the same as the cell-of-origin in the original dataset. This reflects the fact that, in a real scRNA-seq experiment, we cannot obtain replicate observations from the same cell. The resampled cell is a conceptually new cell, which simply happens to have an expression profile identical to the cell-of-origin.

The question then becomes, how do we assess cluster stability if the cell identities have changed in the replicate? The fpc package (Hennig 2007) provides cluster-specific stability measures based on bootstrapping the cells with its clusterboot function. This is achieved by computing, for each original cluster, the maximum Jaccard similarity to any cluster in the bootstrap replicate, and then averaging these maxima across bootstrap iterations. Larger Jaccard indices indicate that the cluster was mostly recovered in the bootstrap replicates.
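
For reference, a typical invocation might look like the sketch below, applied to the cells of our earlier matrix y via the kmeansCBI interface function; the argument values here are illustrative rather than recommended settings.

library(fpc)
# Cells are columns of y, but clusterboot() expects observations in rows.
out <- clusterboot(t(y), B=20, bootmethod="boot",
    clustermethod=kmeansCBI, krange=4)
out$bootmean # mean Jaccard index for each original cluster.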

The clusterboot approach relies on a mapping from cells in the original dataset to those in the bootstrap replicate. This is achieved by keeping track of the identities of the resampled cells in each bootstrap iteration. At first glance, this seems inappropriate as cell identities are inherently altered by bootstrapping. However, if we did have a true replicate, we could construct a mapping using a nearest-neighbour approach; and the nearest neighbour of each resampled cell in the original dataset would simply be its cell-of-origin, as they have the same expression profile!

3.2 Simulation

We can check whether the clusterboot approach yields sensible Jaccard indices with a small simulation. First, we set up a function to map replicate cells to the original dataset, and to calculate the Jaccard indices:

library(FNN)
mapJaccard <- function(y, cluster, new.y, new.cluster) {
    # Map each replicate cell to its nearest neighbour in the original
    # dataset; for a bootstrap replicate, this is the cell-of-origin.
    m <- get.knnx(t(y), t(new.y), k=1)$nn.index
    re.cluster <- cluster[m]

    by.reclust <- split(seq_along(re.cluster), re.cluster)
    by.newclust <- split(seq_along(new.cluster), new.cluster)

    # Original clusters with no cells in the replicate get Jaccard=0.
    all.origin <- sort(unique(cluster))
    output <- numeric(length(all.origin))
    names(output) <- all.origin

    # Take the best Jaccard similarity for each original cluster.
    for (X in names(by.reclust)) {
        current <- by.reclust[[X]]
        jaccards <- lapply(by.newclust, FUN=function(other) {
            length(intersect(current, other))/length(union(current, other))
        })
        output[X] <- max(unlist(jaccards))
    }
    return(output)
}

We also set up a simple k-means function to report the clustering:

clusterFun2 <- function(y) {
    kmeans(t(y), centers=4)$cluster
}

We use bootstrapping to compute Jaccard indices:

y <- dataGen(npath=500, perpath=1, ncells=100)
cluster <- clusterFun2(y)

result3 <- vector("list", 100)
for (x in seq_along(result3)) {
    new.y <- y[sample(nrow(y), nrow(y), replace=TRUE),]
    new.cluster <- clusterFun2(new.y)
    result3[[x]] <- mapJaccard(y, cluster, new.y, new.cluster)
}
result3 <- do.call(rbind, result3)
colMeans(result3)
##         1         2         3         4 
## 0.1821198 0.2677909 0.1998753 0.1705122

… and we repeat this with true replicates. The two sets of Jaccard indices are quite similar, which suggests that the implicit mapping used by clusterboot is acceptable (and that the variability of the nearest-neighbour mapping can be ignored).

result4 <- vector("list", 100)
for (x in seq_along(result4)) { 
    new.y <- dataGen(npath=500, perpath=1, ncells=100)
    new.cluster <- clusterFun2(new.y)
    result4[[x]] <- mapJaccard(y, cluster, new.y, new.cluster)
}
result4 <- do.call(rbind, result4)
colMeans(result4)
##         1         2         3         4 
## 0.1835086 0.2614334 0.1950414 0.1650600

4 Interpreting the cluster stability

4.1 Difficulties

The real problem with clusterboot is the interpretability of the Jaccard indices. A Jaccard index of 0.9 may be very good for cell types that are difficult to separate, e.g., various flavours of T cells. In contrast, any Jaccard index below 1 would probably be unacceptable for very distinct cell types, e.g., mESCs and MEFs.

Similar issues are present with metrics based on the number of times two cells are observed in the same cluster. This describes the relationship between pairs of cells, and relating this to the stability of an entire cluster is not straightforward. Yes, we could compute any number of summary statistics, but this provides no real advantage in interpretability over the Jaccard index.

Another issue is that cluster stability becomes contextual as soon as multiple clusters are involved. A cluster may not be stable with respect to separation from a neighbouring cluster, but may be very consistently separated from other distant clusters. Strictly speaking, it is true that such a cluster is unstable, but this is only relevant when considering the differences between the adjacent clusters. Interpreting a single stability metric per cluster is difficult as the reason for any low stability is not obvious (and may or may not be important).

4.2 Potential solution

We propose a more interpretable metric with the following procedure:

  1. We bootstrap on the cells, recluster and obtain a new set of clusters.
  2. We assign each original cluster to the closest new cluster based on their mean expression profiles.
  3. For each pair of original clusters, we count the number of bootstrap iterations in which they are assigned to the same new cluster.
  4. A high frequency of co-assignments indicates that the paired clusters are not stable with respect to each other.

The co-assignment frequency can be interpreted as the probability of merging two clusters, using the nearest-neighbour mapping to determine which clusters are “equivalent” across replicates. High probabilities indicate that the two clusters are not stable with respect to each other. This is much more useful as it allows users to evaluate the reliability of separation for different clusters. For example, we could ignore any separation of cluster pairs that are merged more than 5% of the time.
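
To make this concrete, below is a minimal sketch of the procedure, re-using dataGen and clusterFun2 from before. The coAssign helper is our own construction for illustration; it measures closeness between mean profiles by Euclidean distance, via get.knnx from the FNN package.

library(FNN)
coAssign <- function(cluster, new.cluster, y, new.y) {
    # Compute the mean expression profile of each cluster.
    orig.means <- sapply(split(seq_along(cluster), cluster),
        function(i) rowMeans(y[,i,drop=FALSE]))
    new.means <- sapply(split(seq_along(new.cluster), new.cluster),
        function(i) rowMeans(new.y[,i,drop=FALSE]))

    # Assign each original cluster to the closest new cluster.
    best <- get.knnx(t(new.means), t(orig.means), k=1)$nn.index[,1]

    # Indicator for pairs of original clusters assigned to the same
    # new cluster (the diagonal is trivially TRUE).
    outer(best, best, "==")
}

y <- dataGen(npath=500, perpath=1, ncells=100)
cluster <- clusterFun2(y)

merged <- 0
for (x in 1:100) {
    # Resample cells (columns) with replacement and recluster.
    new.y <- y[,sample(ncol(y), ncol(y), replace=TRUE)]
    new.cluster <- clusterFun2(new.y)
    merged <- merged + coAssign(cluster, new.cluster, y, new.y)
}
merged/100 # co-assignment frequency for each pair of original clusters.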

An interesting case is that of rare cell populations, which may not even be sampled in the bootstrap replicate. This will result in a high co-assignment frequency, as the closest new cluster will frequently correspond to another original cluster. As far as we are concerned, this is the correct result, as the rare cell population cannot be consistently recovered in the bootstrap replicates. The closest original cluster will also have a high co-assignment frequency but only with the rare population, so interpretation of the other separations is not compromised.

5 Does this even make sense?

A stable cluster is necessary but not sufficient to be actually useful. It also has to be functionally interesting, and this cannot be assessed by bootstrapping (or, for that matter, by any stability measure). In fact, for datasets with many cells, it is likely that functional relevance is lost long before stability is compromised, given that the noise introduced by resampling has a decreasing effect on the distribution of cells on the biological manifold as the number of cells increases. This suggests that bootstrap stability is largely ineffective as a measure of cluster quality.

Another problem is that bootstrapping fails to capture the true variability across samples. A bootstrap replicate is not a good representative of a genuine replicate generated from a different set of cells (most likely with different cell type composition and changes in expression). This means that bootstrapping will be overly optimistic with respect to what it considers to be reproducible.

Conversely, we might actually consider the comparisons between clusters to be too strict. The clustering algorithm may merge or split clusters due to the noise introduced by bootstrapping, but it is often still possible to effectively recover the same clusters by tweaking the parameters for the bootstrap replicate. In such cases, the scientific conclusions are reproducible across bootstrap replicates, but bootstrap-based stability measures would be overly pessimistic.

6 Session information

sessionInfo()
## R version 3.6.0 Patched (2019-05-02 r76456)
## Platform: x86_64-apple-darwin17.7.0 (64-bit)
## Running under: macOS High Sierra 10.13.6
## 
## Matrix products: default
## BLAS:   /Users/luna/Software/R/R-3-6-branch/lib/libRblas.dylib
## LAPACK: /Users/luna/Software/R/R-3-6-branch/lib/libRlapack.dylib
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] FNN_1.1.3        BiocStyle_2.12.0
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_1.0.2         bookdown_0.14      digest_0.6.21     
##  [4] magrittr_1.5       evaluate_0.14      rlang_0.4.0       
##  [7] stringi_1.4.3      rmarkdown_1.16     tools_3.6.0       
## [10] stringr_1.4.0      xfun_0.10          yaml_2.2.0        
## [13] compiler_3.6.0     BiocManager_1.30.8 htmltools_0.4.0   
## [16] knitr_1.25

References

Efron, B., E. Halloran, and S. Holmes. 1996. “Bootstrap confidence levels for phylogenetic trees.” Proc. Natl. Acad. Sci. U.S.A. 93 (23): 13429–34.

Efron, B., and R. Tibshirani. 1986. “Bootstrap Methods for Standard Errors, Confidence Intervals, and Other Measures of Statistical Accuracy.” Statist. Sci. 1 (1): 54–75.

Hennig, C. 2007. “Cluster-Wise Assessment of Cluster Stability.” Computational Statistics & Data Analysis 52 (1): 258–71.

Suzuki, R., and H. Shimodaira. 2006. “Pvclust: an R package for assessing the uncertainty in hierarchical clustering.” Bioinformatics 22 (12): 1540–2.