Each transcript molecule should be processed independently from each other transcript molecule in the same reaction volume. The number of reads generated from each transcript molecule is a random variable, and the count for each gene is the sum of these RVs. Doubling the number of molecules should result in a doubling of the expected count.
The concept of a “detection limit” is also misleading: it’s not as if, below a certain number of transcript molecules, no reads will be detected. This would require some mechanism of communication between transcript molecules in order that the total number of molecules can affect the processing of each molecule. It makes more sense to consider the probability of observing zero reads with decreasing abundance, as cDNA molecules are randomly sampled for sequencing. Having a lot of zero counts beyond a certain expected count is not evidence for a detection limit - this is a simple consequence of sampling.
Linearity should still hold if sequencing resources are saturated. This would provide some sort of message passing mechanism as transcripts compete for sequencing resources. However, composition biases should affect all transcripts equally, and the final count should still be linear with respect to the number of input molecules.
The count distributions for lower-abundance spike-in transcripts often have higher dispersions. This shouldn’t be possible if transcript molecules are captured independently of one another with the same fixed probability. In such a system, everything should be effectively Poisson distributed (assuming rare capture).
Instead, we can consider the number of reads generated from each transcript molecule as a NB distribution. For UMIs, we can similarly consider a Beta-binomial distribution where the capture probability fluctuates between molecules. The sum of counts from many transcripts will subsequently result in a lower dispersion for high-abundance genes, due to averaging of fluctuations in the process between molecules.
Ideally, linearity is demonstrated by diluting spike-ins and adding it to the same pool of cellular RNA. You would then look for linearity in the counts for each spike-in compared to the dilution. This should be done per transcript to avoid transcript-specific effects that could cause scatter around the trend. The availability of a pool of endogenous RNA also simplifies normalization. Something similar is shown in Figure 3 of the CEL-seq paper (http://dx.doi.org/10.1016/j.celrep.2012.08.003).
The simpler, more common approach is to plot the theoretical abundances of the spike-ins against the observed count. Here are several instances where this can be found:
As previously mentioned, this requires a message-passing mechanism between transcripts. The most obvious way that this can be managed is through hybridization. If reverse transcription is only partially completed and the cDNA dissociates from the RNA, it is easier to re-hybridize if there are many other transcripts of the same type. This means that the probability of successful reverse transcription becomes dependent on the abundance of the transcript. Subsequently, it is possible to have non-linear behaviour, most obviously at low abundances where the chance of re-hybridization may be low.
Comparison of single-cell averages to bulk equivalents in @hicks2018missing show discrepancies at low abundances. This corresponds to non-linear behaviour where the single-cell averages systematically drop below their bulk counterparts for a number of genes.
That said, it is unsurprising that you get non-linearities when you increase the amount of input RNA by several orders of magnitude. You wouldn’t perform (as much) pre-amplification of the cDNA in bulk samples (18 cycles for Smart-seq2, <8 for Smart-seq on 1000 cells). This would result in PCR biases where geenes that are more easily amplified get higher coverage. One could argue that such effects would be minor between cells with more similar amounts of RNA and processed with similar numbers of amplification cycles.
Similar arguments motivate the development of SCnorm (https://doi.org/10.1038/nmeth.4263), based on a non-linear fitted trend to the counts against the library size. However, their trend fit does not include zero counts. This will inevitably result in non-linearity at low-abundance genes, as zeroes occur naturally when you scale down the mean.
One way to test for coverage-associated non-linearity is to examine size factors computed with low- and high-abundance genes. In particular, we examine the variances of the two sets of log-factors, and the variance in the difference of the log-factors. We then consider three scenarios:
For example, the data set below corresponds to scenario 1, though this depends on the data set and the exact threshold in use.
library(scran)
library(scater)
library(scRNAseq)
data(allen)
sce <- as(allen, "SingleCellExperiment")
counts(sce) <- assay(sce, "tophat_counts")
ab <- calcAverage(sce)
bottom <- ab < median(ab)
summary(ab[bottom])
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00000 0.00000 0.01699 1.56256 1.49028 13.85898
summary(ab[!bottom])
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 13.86 60.07 161.71 494.29 423.38 84046.39
sf.low <- computeSumFactors(sce, subset.row=bottom, min.mean=1, sf.out=TRUE)
sf.hi <- computeSumFactors(sce, subset.row=!bottom, min.mean=1, sf.out=TRUE)
var(log10(sf.hi))
## [1] 0.04281206
var(log10(sf.low))
## [1] 0.07250319
var(log10(sf.hi/sf.low))
## [1] 0.02792404
Non-linearity is problematic as it breaks out global scaling normalization methods. Differences in the expectation of low-abundance genes are no longer properly eliminated by scaling methods. This could result in systematic differences between cells in the expression space, which are technically driven and not of interest. (Or at the very least, they are driven by differences in the initial RNA content, which is not totally uninteresting but may be undesirable nonetheless.) There are two obvious workarounds to this - non-linear normalization, and blocking on the RNA content of each cell.
It’s very hard to fit a robust trend with respect to abundance. Most robust methods (e.g., loess) rely on normality, and the log-transformation becomes highly dependent on the pseudo-count at low counts. We could use discrete GLMs but that depends on proper specification of dispersions and is also less robust to outliers.
Another problem is that non-linear normalization assumes that most genes at each point of the covariate range are not DE. This is reasonable in bulk data, but might not hold at the single-cell level. Cell-to-cell heterogeneity and intra-cell correlations means that it is entirely possible that we get large-scale shifts in highly expressed genes. For example, a subpopulation may upregulate a set of genes, resulting in a skew in a particular abundance category. This would be eliminated upon normalization, which would not be ideal.
Consider the situation where we fail to normalize non-DE genes correctly due to non-linearity. This requires having differences in transcript numbers of non-DE genes during RT, etc. which is most obviously driven by changes in total RNA content between cellls. In effect, total RNA content drives normalization errors, which introduces systematic variation in homogeneous populations. (The errors involved are probably too small to distort genuine biological effects.) This is despite the fact that it should have been normalized out already.
To get around this, we could imagine blocking on measures of total RNA content, e.g., the cellular detection rate as done in MAST. This would allow us to regress out the normalization error and avoid problems in downstream analyses. The obvious problem is that this may also regress out meaningful biology; see cdr.Rmd
for details.
sessionInfo()
## R version 3.5.0 RC (2018-04-16 r74624)
## Platform: x86_64-apple-darwin15.6.0 (64-bit)
## Running under: OS X El Capitan 10.11.6
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib
##
## locale:
## [1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8
##
## attached base packages:
## [1] parallel stats4 stats graphics grDevices utils datasets
## [8] methods base
##
## other attached packages:
## [1] scRNAseq_1.7.0 scater_1.9.24
## [3] ggplot2_3.1.0 scran_1.9.39
## [5] SingleCellExperiment_1.3.12 SummarizedExperiment_1.11.6
## [7] DelayedArray_0.7.49 matrixStats_0.54.0
## [9] Biobase_2.41.2 GenomicRanges_1.33.14
## [11] GenomeInfoDb_1.17.4 IRanges_2.15.19
## [13] S4Vectors_0.19.24 BiocGenerics_0.27.1
## [15] BiocParallel_1.15.15 BiocStyle_2.9.6
##
## loaded via a namespace (and not attached):
## [1] viridis_0.5.1 dynamicTreeCut_1.63-1
## [3] edgeR_3.23.7 viridisLite_0.3.0
## [5] DelayedMatrixStats_1.3.11 assertthat_0.2.0
## [7] statmod_1.4.30 BiocManager_1.30.3
## [9] GenomeInfoDbData_1.2.0 vipor_0.4.5
## [11] yaml_2.2.0 pillar_1.3.0
## [13] backports_1.1.2 lattice_0.20-35
## [15] glue_1.3.0 limma_3.37.10
## [17] digest_0.6.18 XVector_0.21.4
## [19] colorspace_1.3-2 htmltools_0.3.6
## [21] Matrix_1.2-14 plyr_1.8.4
## [23] pkgconfig_2.0.2 bookdown_0.7
## [25] zlibbioc_1.27.0 purrr_0.2.5
## [27] scales_1.0.0 HDF5Array_1.9.19
## [29] tibble_1.4.2 withr_2.1.2
## [31] lazyeval_0.2.1 magrittr_1.5
## [33] crayon_1.3.4 evaluate_0.12
## [35] beeswarm_0.2.3 tools_3.5.0
## [37] stringr_1.3.1 Rhdf5lib_1.3.3
## [39] munsell_0.5.0 locfit_1.5-9.1
## [41] bindrcpp_0.2.2 compiler_3.5.0
## [43] rlang_0.3.0.1 rhdf5_2.25.11
## [45] grid_3.5.0 RCurl_1.95-4.11
## [47] BiocNeighbors_0.99.22 igraph_1.2.2
## [49] bitops_1.0-6 rmarkdown_1.10
## [51] gtable_0.2.0 reshape2_1.4.3
## [53] R6_2.3.0 gridExtra_2.3
## [55] knitr_1.20 dplyr_0.7.7
## [57] bindr_0.1.1 rprojroot_1.3-2
## [59] stringi_1.2.4 ggbeeswarm_0.6.0
## [61] Rcpp_0.12.19 tidyselect_0.2.5
## [63] xfun_0.4