1 Why t-tests?

In versions up to 1.7.0, scran used limma to compute p-values for the pairwise comparisons between clusters. However, I have switched it to use Welch t-tests instead, motivated by several factors:

  1. Speed. Only cluster-specific means and standard deviations need to be calculated, rather than fitting a linear model to each gene.
  2. Accommodation of differences in variance between clusters. This is particularly important given the mean-variance relationship, where differential expression between clusters implies differences in variance. The classic example is a cluster with all-zero expression for a gene; a linear model fitted across clusters would pool that zero variance with the others, incorrectly shrinking the variance estimate towards zero.
  3. Robustness to misspecification of clusters. This is relevant when clusters other than the two being tested are misspecified, as their information is not used in pairwise t-tests. In contrast, misspecified clusters would inflate the sample variance in a linear model fitted to all clusters.

In theory, we could perform Student’s t-test in situations where there is only one cell in one of the groups. However, we do not do so, as it is often compromised by the discreteness of scRNA-seq data. If the group with many cells contains only zero counts, the pooled sample variance is shrunk towards zero, which masks variability in expression in the other group. More generally, I do not think it is possible to report a reliable p-value when there is only one observation in one group, especially once the mean-variance relationship is taken into account.
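
A toy example of the pooling problem described above (not the scran code, and using a small group of five cells for convenience rather than a single cell):

# A large all-zero group versus a small group with some expression.
many.zeros <- rep(0, 100)
few.cells <- c(0, 5, 0, 8, 0)

# Welch's t-test uses each group's own variance:
t.test(few.cells, many.zeros)$p.value

# Student's t-test pools the variances, which are dominated by the
# zero-variance contribution of the 100 zeros, inflating significance:
t.test(few.cells, many.zeros, var.equal=TRUE)$p.value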

2 Normality requirements, or lack thereof

The distributional assumptions of the Welch t-test are surprisingly unproblematic at large sample sizes. Consider that the following examples work pretty well:

set.seed(12345)

# On highly non-normal distributions
p <- numeric(10000)
for (x in seq_along(p)) {
    a <- rexp(100, rate=1)
    b <- rexp(100, rate=1)    
    p[x] <- t.test(a, b)$p.value
}
hist(p)

# On the normalized counts.
p <- numeric(10000)
for (x in seq_along(p)) {
    a <- rnbinom(100, mu=10, size=10)
    b <- rnbinom(100, mu=1, size=1) * 10
    p[x] <- t.test(a, b)$p.value
}
hist(p)

# On the log-counts (notwithstanding the fact that the
# mean of the logs is not generally equal to the log of the means).
p <- numeric(10000)
for (x in seq_along(p)) {
    a <- log(rnbinom(100, mu=10, size=10)+1)
    b <- log(rnbinom(100, mu=10, size=10)+1)
    p[x] <- t.test(a, b)$p.value
}
hist(p)

It is worth noting that this is not driven by the CLT, which only guarantees asymptotic normality of the sample means. It does not ensure that the sample variances are (scaled) chi-squared distributed, which is also required for the t-statistic to follow a t-distribution. Rather, the continued good behaviour of the t-test is probably due to dependencies between the sample means and variances for these distributions: a large sample mean tends to be accompanied by a large sample variance, cancelling out some of the effects of non-normality. The other possibility is that the variance becomes so well estimated that the t-test converges to the z-test, but I don’t think we have the sample sizes for that.
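
A quick check of this mean-variance dependency (my own sketch, not part of scran): for negative binomial counts like those above, the sample mean and sample variance of repeated draws are positively correlated.

# Correlation between sample means and variances for NB-distributed counts.
set.seed(54321)
stats <- t(replicate(1000, {
    a <- rnbinom(100, mu=10, size=10)
    c(mean=mean(a), var=var(a))
}))
cor(stats[,"mean"], stats[,"var"]) # positive, as these distributions are right-skewed.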

3 Comparison to standard DE methods

3.1 Versus limma

The obvious downside to our use of t-tests is the loss of power from reduced information. This comes in three forms:

  • Reduced residual d.f. for the pair of groups being compared. This is because a sample variance is estimated separately for each group, rather than across both groups as would be done with Student’s t-test or a linear model. The loss of power is exacerbated when blocking is involved, where a sample variance needs to be estimated for each block/group combination. (A small d.f. tally is shown after this list.)
  • Reduced residual d.f. for each gene, due to the fact that information is only used from a pair of groups. In contrast, a linear model would use information across all groups to estimate the residual variance.
  • No empirical Bayes shrinkage from limma. This means that we do not get any benefit from a larger total d.f. due to the contribution of the prior d.f. On the other hand, this avoids any funny business with the distributional assumptions of the variances, especially at low counts and with a misspecified model.
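
To make the d.f. bookkeeping concrete, here is a small tally with made-up group sizes (my own illustration, not a scran calculation):

# d.f. available to the Welch t-test versus a one-way linear model,
# for three clusters of 100 cells each.
set.seed(99)
ncells <- 100
y1 <- rnorm(ncells)
y2 <- rnorm(ncells)
y3 <- rnorm(ncells)

# Welch d.f. for the comparison of clusters 1 and 2 (at most 2*ncells - 2):
t.test(y1, y2)$parameter

# Residual d.f. of a linear model fitted across all three clusters:
fit <- lm(c(y1, y2, y3) ~ factor(rep(1:3, each=ncells)))
fit$df.residual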

We hope that the loss of power is minor, given that we already have large residual d.f. in single-cell studies. Let us demonstrate by running a few simulations to examine the power at a fixed p-value threshold. First, we set up some functions to simulate the data and to perform the tests:

getY <- function(grouping, diff=1, ngenes=10000) {
    # Every gene is DE, with a shift of 'diff' between successive groups.
    means <- as.integer(factor(grouping))*diff
    matrix(rnorm(length(means)*ngenes, mean=means), nrow=ngenes, byrow=TRUE)
}
library(limma)
## Warning: package 'limma' was built under R version 3.5.1
runLimma <- function(grouping, y, comp) {
    g <- factor(grouping)
    design <- model.matrix(~0+g)
    fit <- lmFit(y, design)
    con <- makeContrasts(contrasts=sprintf("g%s - g%s", comp[1], comp[2]), levels=design)
    fit <- contrasts.fit(fit, con)
    fit <- eBayes(fit)
    return(fit$p.value)
}
runT <- function(grouping, y, comp) {
    y1 <- y[,grouping==comp[1]]
    y2 <- y[,grouping==comp[2]]
    outp <- numeric(nrow(y))
    for (x in seq_len(nrow(y))) {
        outp[x] <- t.test(y1[x,], y2[x,])$p.value
    }
    return(outp)
}

We run this under various conditions and examine the proportion of genes detected at a p-value threshold of 0.1%; as every simulated gene is DE, this proportion is effectively the power. We can see that the performances of limma and the t-test are comparable, though the former is consistently better by a modest degree. This benefit erodes as the size of the groups or the difference between groups increases.

for (ngroups in 2:4) {
    cat(sprintf("Number of groups is %i\n", ngroups))
    for (nsize in c(20, 50, 100)) {
        cat(sprintf("  Size of each group is %i\n", nsize))
        g <- rep(seq_len(ngroups), each=nsize)

        for (diff in c(0.5, 1, 2)) {
            cat(sprintf("    Difference is %s\n", diff))
            collected.limma <- collected.t <- numeric(5)

            for (it in 1:5) {
                y <- getY(g, diff=diff)
                limma.p <- runLimma(g, y, c(1,2))
                t.p <- runT(g, y, c(1,2))
                collected.limma[it] <- mean(limma.p <= 0.001)
                collected.t[it] <- mean(t.p <= 0.001)
            }
            cat(sprintf("      Limma = %.5g, T = %.5g\n", 
                mean(collected.limma), mean(collected.t))) 
        }
    }
}
## Number of groups is 2
##   Size of each group is 20
##     Difference is 0.5
##       Limma = 0.0427, T = 0.03292
##     Difference is 1
##       Limma = 0.45002, T = 0.35756
##     Difference is 2
##       Limma = 0.9987, T = 0.9948
##   Size of each group is 50
##     Difference is 0.5
##       Limma = 0.2161, T = 0.19682
##     Difference is 1
##       Limma = 0.9554, T = 0.94134
##     Difference is 2
##       Limma = 1, T = 1
##   Size of each group is 100
##     Difference is 0.5
##       Limma = 0.59604, T = 0.57766
##     Difference is 1
##       Limma = 0.99994, T = 0.99994
##     Difference is 2
##       Limma = 1, T = 1
## Number of groups is 3
##   Size of each group is 20
##     Difference is 0.5
##       Limma = 0.04356, T = 0.03314
##     Difference is 1
##       Limma = 0.44654, T = 0.35444
##     Difference is 2
##       Limma = 0.99882, T = 0.99446
##   Size of each group is 50
##     Difference is 0.5
##       Limma = 0.2169, T = 0.19674
##     Difference is 1
##       Limma = 0.95688, T = 0.94192
##     Difference is 2
##       Limma = 1, T = 1
##   Size of each group is 100
##     Difference is 0.5
##       Limma = 0.59572, T = 0.57708
##     Difference is 1
##       Limma = 1, T = 0.99996
##     Difference is 2
##       Limma = 1, T = 1
## Number of groups is 4
##   Size of each group is 20
##     Difference is 0.5
##       Limma = 0.04266, T = 0.03284
##     Difference is 1
##       Limma = 0.45026, T = 0.3572
##     Difference is 2
##       Limma = 0.9989, T = 0.9944
##   Size of each group is 50
##     Difference is 0.5
##       Limma = 0.21326, T = 0.19432
##     Difference is 1
##       Limma = 0.95596, T = 0.94016
##     Difference is 2
##       Limma = 1, T = 1
##   Size of each group is 100
##     Difference is 0.5
##       Limma = 0.59408, T = 0.574
##     Difference is 1
##       Limma = 0.99988, T = 0.9999
##     Difference is 2
##       Limma = 1, T = 1

Note that these simulations show limma at its best, i.e., with effectively infinite prior d.f. for empirical Bayes shrinkage, as all genes have the same true variance. In practice, the estimated prior d.f. are lower (below 10, based on communications with Charlotte Soneson), so we can assume that the differences in performance would be even smaller on real data.
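
As a quick check (my own addition, reusing getY() from above), the prior d.f. estimated from these simulated data should indeed be very large:

# Prior d.f. estimated by limma's empirical Bayes machinery on the simulated data.
set.seed(1000)
g <- rep(1:2, each=50)
y <- getY(g)
fit <- eBayes(lmFit(y, model.matrix(~factor(g))))
fit$df.prior # should be very large, often reported as Inf.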

3.2 Versus voom

Interestingly, direct application of limma seems to be better than voom for this purpose. voom’s precision weighting means that only a few cells with large size factors contribute to the DE analysis. This is because the mean-variance trend is sharply decreasing, so a small number of large cells will receive very large weights. As a result, DE genes with high variability across the majority of cells will be highly ranked, because that variability is ignored by the model.
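
To see the weighting effect, here is a small mock-up with made-up parameters (mine, not part of scran), where half the cells have ten-fold larger library sizes:

# Mock dataset: 1000 genes, 50 small cells and 50 large cells.
set.seed(2000)
lib.size <- rep(c(1, 10), each=50) * 1e4
base.mu <- runif(1000, 0.1, 20)
mu <- outer(base.mu, lib.size/1e4)
counts <- matrix(rnbinom(length(mu), mu=mu, size=1), nrow=1000)

# Run voom with a one-group design and compare the average weights
# given to the small cells versus the large cells.
v <- voom(counts, design=cbind(Intercept=rep(1, 100)), lib.size=lib.size)
c(small=mean(v$weights[,1:50]), large=mean(v$weights[,51:100]))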

Technically, this is correct, as we shouldn’t trust variability due to dropouts and other technical noise. However, if we want to be conservative, we should instead look for genes that are expressed consistently in spite of the dropouts. This favours DE genes for which we know that expression is consistent, at the expense of genes where we cannot tell either way due to dropouts/noise. To do this, we take the log-expression values at face value and just use limma without weighting.

3.3 Versus edgeR

The use of t-tests is unsatisfying as the normality assumption is frequently violated and the log-transformation distorts the log-fold changes between groups (see https://github.com/LTLA/PseudoCount2018). Count-based models would be preferable, aside from a few nagging problems:

  • They are at least an order of magnitude slower, which is exacerbated by the large number of cells.
  • The saddlepoint approximation fails at low counts and large dispersions, rendering quasi-likelihood p-values invalid. This was particularly problematic for earlier read count data, though it may be less of an issue for UMI data.

    library(edgeR)
    ## Warning: package 'edgeR' was built under R version 3.5.1
    y <- matrix(rnbinom(10000*100, mu=0.1, size=1), ncol=100)
    dev <- nbinomDeviance(y, mean=matrix(0.1, nrow(y), ncol(y)), dispersion=1)
    
    # Observed moments of the deviance distribution.
    mean(dev)
    ## [1] 40.83933
    var(dev)
    ## [1] 62.04687
    # What they should be (100 d.f., as we're using the known mean).
    simdev <- rchisq(10000, df=100)
    mean(simdev)
    ## [1] 99.86199
    var(simdev)
    ## [1] 206.589
    # And indeed, this is the case for more sane parameters.
    y <- matrix(rnbinom(10000*100, mu=100, size=10), ncol=100)
    dev <- nbinomDeviance(y, mean=matrix(100, nrow(y), ncol(y)), dispersion=0.1)
    mean(dev)
    ## [1] 101.7439
    var(dev)
    ## [1] 201.9379
  • They suffer from the same problems as linear models when fitted to the entire dataset. Namely, they assume a constant dispersion across clusters, and they are sensitive to cluster misspecification in the groups not being compared.
  • edgeR’s dispersion estimates are bounded by the grid search and cannot fully account for the variability. As a result, a few cells with large outlier expression end up driving the top DE calls. The log-transformation prior to the t-tests protects against such outliers and yields better rankings, at least for read count data where amplification biases are most prevalent. (A toy example of this outlier effect is shown after this list.)
  • More generally, it is just as valid to perform DE on the log-counts as it is on the counts. There’s no reason to think that the comparison of arithmetic means is more biologically meaningful than the comparison of (log-)geometric means.
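
A toy numerical example of the last two points (mine, not from the scran code): a single outlier cell dominates the difference in arithmetic means, while the means of the log-counts are far less affected.

# One amplification spike in group 'a' versus consistent expression in 'b'.
a <- c(rep(1, 49), 1000)
b <- rep(2, 50)

# The outlier drives a large difference in arithmetic means...
c(mean(a), mean(b))

# ...but has little effect on the means of the log-counts.
c(mean(log2(a + 1)), mean(log2(b + 1)))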

It is fairly simple to run edgeR separately, so I did not see the need to create a wrapper for the LRT-related methods. It would be easy to do so, though, by collecting all pairwise test results and supplying them to combineMarkers, as sketched below.
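
A rough, untested sketch of such a wrapper is shown here. I am assuming that combineMarkers() will accept a list of per-comparison data.frames (with p.value and logFC fields) alongside a pairs data.frame with one row per comparison; the edgeR calls themselves are the standard LRT pipeline.

library(scran)

pairwiseEdgeRLRT <- function(counts, clusters) {
    clusters <- factor(clusters)
    de.lists <- list()
    pairs <- NULL

    # Loop over all ordered pairs of clusters. This is wasteful, as the reverse
    # comparison only flips the sign of the log-fold change, but it keeps things simple.
    for (left in levels(clusters)) {
        for (right in setdiff(levels(clusters), left)) {
            keep <- clusters %in% c(left, right)
            y <- DGEList(counts[,keep])
            y <- calcNormFactors(y)

            # The second coefficient represents the 'left' vs 'right' log-fold change.
            g <- factor(clusters[keep], levels=c(right, left))
            design <- model.matrix(~g)

            y <- estimateDisp(y, design)
            fit <- glmFit(y, design)
            res <- glmLRT(fit, coef=2)

            de.lists[[length(de.lists) + 1L]] <- data.frame(
                logFC=res$table$logFC, p.value=res$table$PValue,
                row.names=rownames(counts))
            pairs <- rbind(pairs, data.frame(first=left, second=right))
        }
    }

    combineMarkers(de.lists, pairs)
}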

Note that the direct use of log-transformed counts doesn’t work if you’re comparing log-fold changes, e.g., for allele-specific expression. This is because the prior count shrinks log-fold changes towards zero at different rates depending on the abundance. Thus, you could get a significant difference in log-fold changes when the abundances differ, even if the true log-fold changes are the same. This is avoided with count-based methods, or even with voom, where the observations at low abundances (those most affected by the prior count) receive little weight.
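
To spell this out with made-up numbers: a true two-fold change yields different shrunken log-fold changes at different abundances once a prior count is added.

# A prior count of 1 shrinks low-abundance log-fold changes more strongly.
prior <- 1
log2((20 + prior)/(10 + prior)) # high abundance: close to the true value of 1.
log2((2 + prior)/(1 + prior))   # low abundance: well below 1.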

4 Reason for using pairwise tests

In findMarkers(), we perform tests between every pair of clusters, and then aggregate the results of all tests involving a particular cluster into a single DataFrame. The aggregation is performed in a manner that ensures that the top genes in each pairwise comparison remain among the top genes in the aggregated results. This means that the top set is not dominated by DE genes between the chosen cluster and the single cluster that is most different from it. Such a ranking would not be informative with respect to how the chosen cluster differs from the other clusters.
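
A bare-bones mock-up of this kind of rank-based aggregation (my own illustration, not the exact findMarkers() code): take each gene's best rank across all pairwise comparisons involving the cluster of interest, so that the top genes of every comparison stay near the top of the combined ranking.

# Mock p-values for cluster 1 against clusters 2, 3 and 4.
set.seed(300)
p12 <- runif(100)
p13 <- runif(100)
p14 <- runif(100)

# Best (smallest) rank of each gene across the three comparisons.
best.rank <- pmin(rank(p12), rank(p13), rank(p14))
head(order(best.rank), 10)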

It is for a similar reason that we do not test each cluster against the (grand) average of all other clusters. Such a test would be additionally complicated by the fact that testing against the average expression profile is difficult to interpret, as a net change in expression can belie the complexity of up/down-regulation in the individual clusters. This would be made even worse by testing against the average of all other cells, which would be dependent on the frequency of cells in each cluster.

As a side note, an alternative is to use machine learning to choose a gene set that distinguishes a cluster from the rest. This takes advantage of combinatorics but is less useful for interpreting the differences between clusters. For example, if two genes are redundant, LASSO-like methods will tend to pick only one of them, failing to report a gene that might still be important for characterizing the biological identity of a cluster.

5 Session information

sessionInfo()
## R version 3.5.0 RC (2018-04-16 r74624)
## Platform: x86_64-apple-darwin15.6.0 (64-bit)
## Running under: OS X El Capitan 10.11.6
## 
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib
## 
## locale:
## [1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] edgeR_3.23.7    limma_3.37.10   BiocStyle_2.9.6
## 
## loaded via a namespace (and not attached):
##  [1] locfit_1.5-9.1     Rcpp_0.12.19       bookdown_0.7      
##  [4] lattice_0.20-35    digest_0.6.18      rprojroot_1.3-2   
##  [7] grid_3.5.0         backports_1.1.2    magrittr_1.5      
## [10] evaluate_0.12      stringi_1.2.4      rmarkdown_1.10    
## [13] tools_3.5.0        stringr_1.3.1      xfun_0.4          
## [16] yaml_2.2.0         compiler_3.5.0     BiocManager_1.30.3
## [19] htmltools_0.3.6    knitr_1.20