1 Arguments against

The obvious argument against using the CDR is the fact that it may be confounded with condition or cell type, such that meaningful biology is regressed out. For example, we have observed differences in the total number of expressed features between mESCs cultured under different conditions (Kolod data). More severe differences would probably be present between different cell types. Also see Supplementary Figures 2A, 4 and 6 in Charlotte’s Nat Comm’s paper, where inclusion of the CDR as a covariate results in fewer DE genes.

We can also perform a little illustration with some simulated data in a linear modelling framework. The confounding effect is strongest when the groups are well separated on the number of detected genes. Inclusion of the CDR in the model severely reduces the significance of the p-value:

set.seed(1000)
y <- rnorm(100) + rep(c(0, 1), each=50)
g <- gl(2, 50)
cov <- rnorm(100) + rep(c(0, 3), each=50)
summary(lm(y ~ g + cov))$coefficient

##                Estimate Std. Error    t value     Pr(>|t|)
## (Intercept) -0.15362868  0.1414936 -1.0857645 0.2802741887
## g2           1.57361058  0.3950077  3.9837469 0.0001312611
## cov         -0.07297899  0.1112167 -0.6561874 0.5132571539

summary(lm(y ~ g))$coefficient

##               Estimate Std. Error   t value     Pr(>|t|)
## (Intercept) -0.1586302  0.1408771 -1.126018 2.629076e-01
## g2           1.3500188  0.1992302  6.776174 9.270756e-10

Fortunately, this extreme change in p-values is probably uncommon - in practice, the separation between cell types on the CDR should be minimal. Even if the population means are well-separated, there should be enough variability in sequencing depth, etc. that the two distributions are quite well mixed. This reduces the confounding impact, as cov is a much noisier proxy for g that results in a poorer fit. The effect of noise is further exacerbated when cov needs to be scaled in the linear model to match the magnitude of difference corresponding to the g effect.

set.seed(100)
y <- rnorm(100) + rep(c(0, 1), each=50)
g <- gl(2, 50)
cov <- rnorm(100) + rep(c(0, 1), each=50)
summary(lm(y ~ g + cov))$coefficient

##                Estimate Std. Error   t value     Pr(>|t|)
## (Intercept)  0.06529703  0.1448197  0.450885 0.6530787592
## g2           1.04131691  0.2585522  4.027492 0.0001120016
## cov         -0.16246584  0.1300256 -1.249491 0.2144920486

summary(lm(y ~ g))$coefficient

##               Estimate Std. Error   t value     Pr(>|t|)
## (Intercept) 0.08150894  0.1446497 0.5634918 5.743871e-01
## g2          0.84280724  0.2045656 4.1199848 7.925389e-05

Nonetheless, there is still some increase in the p-value that cannot be dismissed, especially when multiple testing correction comes into play. For example, with a different seed:

set.seed(300)
y <- rnorm(100) + rep(c(0, 1), each=50)
g <- gl(2, 50) 
cov <- rnorm(100) + rep(c(0, 1), each=50)
summary(lm(y ~ g + cov))$coefficient

##              Estimate Std. Error  t value    Pr(>|t|)
## (Intercept) 0.2687505 0.13122558 2.048004 0.043262028
## g2          0.7079405 0.21274531 3.327643 0.001238458
## cov         0.1144778 0.09415858 1.215798 0.227012493

summary(lm(y ~ g))$coefficient

##              Estimate Std. Error  t value     Pr(>|t|)
## (Intercept) 0.2788355  0.1312823 2.123939 3.619256e-02
## g2          0.8352100  0.1856612 4.498572 1.882244e-05

2 Arguments for

The main argument for using the CDR is that it mops up errors in normalization, by serving as a proxy for any leftover technical biases (see linearity.Rmd for comments). This was primarily proposed for MAST, though at least some of the benefit is probably due to the fact that the zeroes are modelled separately. Without the CDR, the hurdle model cannot handle scaling differences in means (even after normalization) due to the different frequencies of zeroes.

set.seed(0)
library(MAST)
mus <- rep(c(0.1, 1), each=100)
y <- matrix(rnbinom(10000, mu=mus, size=10), byrow=TRUE, ncol=length(mus))
sca <- FromMatrix(log2(t(t(y)/mus)+1)) # log-counts: for reasons unknown to me, log-CPMs don't work.
g <- gl(2, 100)
Z <- zlm(~g, sca)
out <- summary(Z, n=nrow(Z), doLRT='g2')
hist(as.numeric(out$datatable[,4][[1]]))

This gets a lot better if you throw in the CDR as a covariate, as this soaks up the differences between groups and restores type I error control. It suggests that the much of the benefit from using the CDR is specific to MAST’s hurdle model, where scaling of the mean count is not otherwise handled.

cdr <- colSums(y==0)
Z <- zlm(~cdr + g, sca)
out <- summary(Z, n=nrow(Z), doLRT='g2')
hist(as.numeric(out$datatable[,4][[1]]))

3 Use in data exploration

Another question is whether regression on the CDR should be performed prior to data exploration, and the residuals used for PCA and clustering. I would suggest not, as all of the underlying factors of variation are not fully known. Unless the CDR is completely orthogonal to the structure, the coefficient for the former will absorb some of the latter. At best, this will result in loss of power to detect structured heterogeneity.

At worst, artificial structure will be introduced. This is because the residuals will be spuriously correlated across genes when the model is consistently misspecified. Consider the following example where a small subset of cells with large CDRs have upregulation for a few genes. The residuals are correlated as the trend is not properly fitted, being dragged up by the subpopulation. This results in an apparent trajectory along the first PC, which is not present if the CDR was not used.

# Mocking up some data.
set.seed(0)
y <- matrix(rpois(10000, lambda=5), ncol=100)
y[,91:100] <- rpois(10*100, lambda=20)

# Assume that the CDR is just increasing.
library(edgeR)
cdr <- 1:100
design <- model.matrix(~cdr)
fit <- glmFit(y, design, dispersion=0, offset=0)
r <- (y - fit$fitted.value)/sqrt(fit$fitted.value)

pc <- prcomp(t(r))
par(mfrow=c(1,2))
col <- rgb(0,0,seq(0.9, 0, length.out=100))
plot(pc$x[,1], pc$x[,2], pch=16, col=col, main="With CDR")

# Repeat it with an empty design matrix, for comparison.
fit0 <- glmFit(y, design[,1], dispersion=0, offset=0)
r0 <- (y - fit0$fitted.value)/sqrt(fit0$fitted.value)
pc0 <- prcomp(t(r0))
plot(pc0$x[,1], pc0$x[,2], pch=16, col=col, main="No CDR")

4 Session information

sessionInfo()

## R version 3.5.0 RC (2018-04-16 r74624)
## Platform: x86_64-apple-darwin15.6.0 (64-bit)
## Running under: OS X El Capitan 10.11.6
## 
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib
## 
## locale:
## [1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8
## 
## attached base packages:
## [1] parallel  stats4    stats     graphics  grDevices utils     datasets 
## [8] methods   base     
## 
## other attached packages:
##  [1] edgeR_3.23.7                limma_3.37.10              
##  [3] MAST_1.7.2                  SingleCellExperiment_1.3.12
##  [5] SummarizedExperiment_1.11.6 DelayedArray_0.7.49        
##  [7] BiocParallel_1.15.15        matrixStats_0.54.0         
##  [9] Biobase_2.41.2              GenomicRanges_1.33.14      
## [11] GenomeInfoDb_1.17.4         IRanges_2.15.19            
## [13] S4Vectors_0.19.24           BiocGenerics_0.27.1        
## [15] BiocStyle_2.9.6            
## 
## loaded via a namespace (and not attached):
##  [1] locfit_1.5-9.1         tidyselect_0.2.5       xfun_0.4              
##  [4] reshape2_1.4.3         purrr_0.2.5            lattice_0.20-35       
##  [7] colorspace_1.3-2       htmltools_0.3.6        yaml_2.2.0            
## [10] rlang_0.3.0.1          pillar_1.3.0           glue_1.3.0            
## [13] bindrcpp_0.2.2         GenomeInfoDbData_1.2.0 plyr_1.8.4            
## [16] bindr_0.1.1            stringr_1.3.1          zlibbioc_1.27.0       
## [19] munsell_0.5.0          gtable_0.2.0           evaluate_0.12         
## [22] knitr_1.20             Rcpp_0.12.19           backports_1.1.2       
## [25] scales_1.0.0           BiocManager_1.30.3     XVector_0.21.4        
## [28] abind_1.4-5            ggplot2_3.1.0          digest_0.6.18         
## [31] stringi_1.2.4          bookdown_0.7           dplyr_0.7.7           
## [34] grid_3.5.0             rprojroot_1.3-2        tools_3.5.0           
## [37] bitops_1.0-6           magrittr_1.5           RCurl_1.95-4.11       
## [40] lazyeval_0.2.1         tibble_1.4.2           crayon_1.3.4          
## [43] pkgconfig_2.0.2        Matrix_1.2-14          data.table_1.11.8     
## [46] assertthat_0.2.0       rmarkdown_1.10         R6_2.3.0              
## [49] compiler_3.5.0

Comments on the cellular detection rate

2018-10-29

1 Arguments against

2 Arguments for

3 Use in data exploration

4 Session information