This script demonstrates that TMM normalization fails with low counts. Namely, does it accurately estimate the normalization factors? We set up a simulation where some genes are upregulated by mult
in the second sample.
simulator <- function(ngenes, genes.spiked, mu.back, mult, disp) {
mu.spike <- mu.back*mult
x1 <- rnbinom(ngenes, mu=mu.back, size=1/disp)
normed <- mu.back*ngenes/(mu.back*(ngenes-genes.spiked)+mu.spike*genes.spiked)
x2 <- rnbinom(ngenes, mu=mu.back*normed, size=1/disp)
spiked <- sample(ngenes, genes.spiked)
x2[spiked] <- rnbinom(genes.spiked, mu=mu.spike*normed, size=1/disp)
return(list(counts=cbind(x1, x2), factor=normed, spiked=spiked))
}
Simulating over three times for various count sizes:
library(edgeR)
set.seed(1000)
ngenes <- 10000
lapply(1:3, FUN=function(i) {
x <- simulator(ngenes, 200, 2, 5, 0.05) # low count
calcNormFactors(x$counts)
})
## [[1]]
## [1] 1.0013521 0.9986498
##
## [[2]]
## [1] 1.0069616 0.9930865
##
## [[3]]
## [1] 1.0052563 0.9947712
lapply(1:3, FUN=function(i) {
x <- simulator(ngenes, 200, 10, 5, 0.05) # middle count
calcNormFactors(x$counts)
})
## [[1]]
## [1] 1.0291659 0.9716607
##
## [[2]]
## [1] 1.029547 0.971301
##
## [[3]]
## [1] 1.0341057 0.9670192
lapply(1:3, FUN=function(i) {
x <- simulator(ngenes, 200, 50, 5, 0.05) # high count
calcNormFactors(x$counts)
})
## [[1]]
## [1] 1.0356454 0.9655815
##
## [[2]]
## [1] 1.0345765 0.9665791
##
## [[3]]
## [1] 1.0356693 0.9655591
We then compare these values to the truth. The lower counts do not perform well, due to the low precision for trimming when M-values are discrete. The shift of the median with unbalanced DE is also more pronounced when the non-DE M-values are more variable.
x <- simulator(ngenes, 200, 50, 5, 0.05)
c(1/sqrt(x$factor), sqrt(x$factor)) # Truth.
## [1] 1.0392305 0.9622504
Consider these simulations where there is no undersampling at all, just differences in library size. True normalization factors should be 1, but this is not the case, corresponding to loss of precision in trimming.
lapply(1:3, FUN=function(i) {
x <- matrix(rnbinom(ngenes*2, mu=c(1, 5), size=20), nrow=ngenes, ncol=2, byrow=TRUE)
calcNormFactors(x)
})
## [[1]]
## [1] 1.229705 0.813203
##
## [[2]]
## [1] 1.2368873 0.8084811
##
## [[3]]
## [1] 1.2328770 0.8111109
lapply(1:3, FUN=function(i) {
x <- matrix(rnbinom(ngenes*2, mu=c(10, 50), size=20), nrow=ngenes, ncol=2, byrow=TRUE)
calcNormFactors(x)
})
## [[1]]
## [1] 0.9935526 1.0064892
##
## [[2]]
## [1] 0.9947951 1.0052322
##
## [[3]]
## [1] 0.9936823 1.0063579
sessionInfo()
## R Under development (unstable) (2018-12-07 r75787)
## Platform: x86_64-apple-darwin15.6.0 (64-bit)
## Running under: OS X El Capitan 10.11.6
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRlapack.dylib
##
## locale:
## [1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] edgeR_3.25.2 limma_3.39.3 BiocStyle_2.11.0
##
## loaded via a namespace (and not attached):
## [1] locfit_1.5-9.1 Rcpp_1.0.0 bookdown_0.9
## [4] lattice_0.20-38 digest_0.6.18 grid_3.6.0
## [7] magrittr_1.5 evaluate_0.12 stringi_1.2.4
## [10] rmarkdown_1.11 tools_3.6.0 stringr_1.3.1
## [13] xfun_0.4 yaml_2.2.0 compiler_3.6.0
## [16] BiocManager_1.30.4 htmltools_0.3.6 knitr_1.21