1 Background

This script demonstrates that TMM normalization fails with low counts. Namely, does it accurately estimate the normalization factors? We set up a simulation where some genes are upregulated by mult in the second sample.

simulator <- function(ngenes, genes.spiked, mu.back, mult, disp) {
    mu.spike <- mu.back*mult
    x1 <- rnbinom(ngenes, mu=mu.back, size=1/disp)
    normed <- mu.back*ngenes/(mu.back*(ngenes-genes.spiked)+mu.spike*genes.spiked)
    x2 <- rnbinom(ngenes, mu=mu.back*normed, size=1/disp)
    spiked <- sample(ngenes, genes.spiked)
    x2[spiked] <- rnbinom(genes.spiked, mu=mu.spike*normed, size=1/disp)
    return(list(counts=cbind(x1, x2), factor=normed, spiked=spiked))
}

2 Simulating with various count sizes

Simulating over three times for various count sizes:

library(edgeR)
set.seed(1000)
ngenes <- 10000
lapply(1:3, FUN=function(i) {
    x <- simulator(ngenes, 200, 2, 5, 0.05) # low count
    calcNormFactors(x$counts)
})
## [[1]]
## [1] 1.0013521 0.9986498
## 
## [[2]]
## [1] 1.0069616 0.9930865
## 
## [[3]]
## [1] 1.0052563 0.9947712
lapply(1:3, FUN=function(i) {
    x <- simulator(ngenes, 200, 10, 5, 0.05) # middle count
    calcNormFactors(x$counts)
})
## [[1]]
## [1] 1.0291659 0.9716607
## 
## [[2]]
## [1] 1.029547 0.971301
## 
## [[3]]
## [1] 1.0341057 0.9670192
lapply(1:3, FUN=function(i) {
    x <- simulator(ngenes, 200, 50, 5, 0.05) # high count
    calcNormFactors(x$counts)
})
## [[1]]
## [1] 1.0356454 0.9655815
## 
## [[2]]
## [1] 1.0345765 0.9665791
## 
## [[3]]
## [1] 1.0356693 0.9655591

We then compare these values to the truth. The lower counts do not perform well, due to the low precision for trimming when M-values are discrete. The shift of the median with unbalanced DE is also more pronounced when the non-DE M-values are more variable.

x <- simulator(ngenes, 200, 50, 5, 0.05) 
c(1/sqrt(x$factor), sqrt(x$factor)) # Truth.
## [1] 1.0392305 0.9622504

3 Failure even without understampling

Consider these simulations where there is no undersampling at all, just differences in library size. True normalization factors should be 1, but this is not the case, corresponding to loss of precision in trimming.

lapply(1:3, FUN=function(i) {
    x <- matrix(rnbinom(ngenes*2, mu=c(1, 5), size=20), nrow=ngenes, ncol=2, byrow=TRUE)
    calcNormFactors(x)
})
## [[1]]
## [1] 1.229705 0.813203
## 
## [[2]]
## [1] 1.2368873 0.8084811
## 
## [[3]]
## [1] 1.2328770 0.8111109
lapply(1:3, FUN=function(i) {
    x <- matrix(rnbinom(ngenes*2, mu=c(10, 50), size=20), nrow=ngenes, ncol=2, byrow=TRUE)
    calcNormFactors(x)
})
## [[1]]
## [1] 0.9935526 1.0064892
## 
## [[2]]
## [1] 0.9947951 1.0052322
## 
## [[3]]
## [1] 0.9936823 1.0063579

4 Wrapping up

sessionInfo()
## R Under development (unstable) (2018-12-07 r75787)
## Platform: x86_64-apple-darwin15.6.0 (64-bit)
## Running under: OS X El Capitan 10.11.6
## 
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRlapack.dylib
## 
## locale:
## [1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] edgeR_3.25.2     limma_3.39.3     BiocStyle_2.11.0
## 
## loaded via a namespace (and not attached):
##  [1] locfit_1.5-9.1     Rcpp_1.0.0         bookdown_0.9      
##  [4] lattice_0.20-38    digest_0.6.18      grid_3.6.0        
##  [7] magrittr_1.5       evaluate_0.12      stringi_1.2.4     
## [10] rmarkdown_1.11     tools_3.6.0        stringr_1.3.1     
## [13] xfun_0.4           yaml_2.2.0         compiler_3.6.0    
## [16] BiocManager_1.30.4 htmltools_0.3.6    knitr_1.21