1 Why should measurements be linear?

1.1 Theoretical arguments

1.1.1 Mean effects

Each transcript molecule should be processed independently from each other transcript molecule in the same reaction volume. The number of reads generated from each transcript molecule is a random variable, and the count for each gene is the sum of these RVs. Doubling the number of molecules should result in a doubling of the expected count.

The concept of a “detection limit” is also misleading: it’s not as if, below a certain number of transcript molecules, no reads will be detected. This would require some mechanism of communication between transcript molecules in order that the total number of molecules can affect the processing of each molecule. It makes more sense to consider the probability of observing zero reads with decreasing abundance, as cDNA molecules are randomly sampled for sequencing. Having a lot of zero counts beyond a certain expected count is not evidence for a detection limit - this is a simple consequence of sampling.

Linearity should still hold if sequencing resources are saturated. This would provide some sort of message passing mechanism as transcripts compete for sequencing resources. However, composition biases should affect all transcripts equally, and the final count should still be linear with respect to the number of input molecules.

1.1.2 An aside into variance effects

The count distributions for lower-abundance spike-in transcripts often have higher dispersions. This shouldn’t be possible if transcript molecules are captured independently of one another with the same fixed probability. In such a system, everything should be effectively Poisson distributed (assuming rare capture).

Instead, we can consider the number of reads generated from each transcript molecule as a NB distribution. For UMIs, we can similarly consider a Beta-binomial distribution where the capture probability fluctuates between molecules. The sum of counts from many transcripts will subsequently result in a lower dispersion for high-abundance genes, due to averaging of fluctuations in the process between molecules.

1.2 Practical arguments

Ideally, linearity is demonstrated by diluting spike-ins and adding it to the same pool of cellular RNA. You would then look for linearity in the counts for each spike-in compared to the dilution. This should be done per transcript to avoid transcript-specific effects that could cause scatter around the trend. The availability of a pool of endogenous RNA also simplifies normalization. Something similar is shown in Figure 3 of the CEL-seq paper (http://dx.doi.org/10.1016/j.celrep.2012.08.003).

The simpler, more common approach is to plot the theoretical abundances of the spike-ins against the observed count. Here are several instances where this can be found:

2 Why should measurements not be linear?

2.1 Theoretical arguments

As previously mentioned, this requires a message-passing mechanism between transcripts. The most obvious way that this can be managed is through hybridization. If reverse transcription is only partially completed and the cDNA dissociates from the RNA, it is easier to re-hybridize if there are many other transcripts of the same type. This means that the probability of successful reverse transcription becomes dependent on the abundance of the transcript. Subsequently, it is possible to have non-linear behaviour, most obviously at low abundances where the chance of re-hybridization may be low.

2.2 Practical arguments

Comparison of single-cell averages to bulk equivalents in @hicks2018missing show discrepancies at low abundances. This corresponds to non-linear behaviour where the single-cell averages systematically drop below their bulk counterparts for a number of genes.

That said, it is unsurprising that you get non-linearities when you increase the amount of input RNA by several orders of magnitude. You wouldn’t perform (as much) pre-amplification of the cDNA in bulk samples (18 cycles for Smart-seq2, <8 for Smart-seq on 1000 cells). This would result in PCR biases where geenes that are more easily amplified get higher coverage. One could argue that such effects would be minor between cells with more similar amounts of RNA and processed with similar numbers of amplification cycles.

Similar arguments motivate the development of SCnorm (https://doi.org/10.1038/nmeth.4263), based on a non-linear fitted trend to the counts against the library size. However, their trend fit does not include zero counts. This will inevitably result in non-linearity at low-abundance genes, as zeroes occur naturally when you scale down the mean.

3 Testing for non-linearities in any data set

One way to test for coverage-associated non-linearity is to examine size factors computed with low- and high-abundance genes. In particular, we examine the variances of the two sets of log-factors, and the variance in the difference of the log-factors. We then consider three scenarios:

  1. If the measurements are linear, the variances should be similar between the two sets of log-factors. Any differences in variances of the log-factors should be due to random estimation noise. This would mean that the differences in variance would be equal to the variance of the difference.
  2. If decreases in transcript abundance resulted in a superlinear decrease in the counts, the variance of the log-factors from low-abundance genes should be larger. This is because the expression of low-abundance genes would be disproportionately lower in cells that have low size factors. However, the difference in the variance should be larger than the variance of the differences. This is because, in the latter, part of the variance in the low-abundance factors is explained by the fact that the high-abundance factors are low.
  3. If decreases in abundance resulted in a sublinear decrease in the counts, the variance of the log-factors from low-abundance genes would be smaller. This is because the expression of low-abundance genes would be closer to the average in cells that have low size factors. Again, the difference in the variance would be larger than the variance of the differences, as one set of factors explains the variance in the other.

For example, the data set below corresponds to scenario 1, though this depends on the data set and the exact threshold in use.

library(scran)
library(scater)
library(scRNAseq)

data(allen)
sce  <- as(allen, "SingleCellExperiment")
counts(sce) <- assay(sce, "tophat_counts")

ab <- calcAverage(sce)
bottom <- ab < median(ab)
summary(ab[bottom])
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##  0.00000  0.00000  0.01699  1.56256  1.49028 13.85898
summary(ab[!bottom])
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##    13.86    60.07   161.71   494.29   423.38 84046.39
sf.low <- computeSumFactors(sce, subset.row=bottom, min.mean=1, sf.out=TRUE)
sf.hi <- computeSumFactors(sce, subset.row=!bottom, min.mean=1, sf.out=TRUE)
var(log10(sf.hi))
## [1] 0.04281206
var(log10(sf.low))
## [1] 0.07250319
var(log10(sf.hi/sf.low)) 
## [1] 0.02792404

4 Solutions to non-linearity

4.1 Overview

Non-linearity is problematic as it breaks out global scaling normalization methods. Differences in the expectation of low-abundance genes are no longer properly eliminated by scaling methods. This could result in systematic differences between cells in the expression space, which are technically driven and not of interest. (Or at the very least, they are driven by differences in the initial RNA content, which is not totally uninteresting but may be undesirable nonetheless.) There are two obvious workarounds to this - non-linear normalization, and blocking on the RNA content of each cell.

4.2 Motivation for not using non-linear normalization

It’s very hard to fit a robust trend with respect to abundance. Most robust methods (e.g., loess) rely on normality, and the log-transformation becomes highly dependent on the pseudo-count at low counts. We could use discrete GLMs but that depends on proper specification of dispersions and is also less robust to outliers.

Another problem is that non-linear normalization assumes that most genes at each point of the covariate range are not DE. This is reasonable in bulk data, but might not hold at the single-cell level. Cell-to-cell heterogeneity and intra-cell correlations means that it is entirely possible that we get large-scale shifts in highly expressed genes. For example, a subpopulation may upregulate a set of genes, resulting in a skew in a particular abundance category. This would be eliminated upon normalization, which would not be ideal.

4.3 Blocking on the RNA content

Consider the situation where we fail to normalize non-DE genes correctly due to non-linearity. This requires having differences in transcript numbers of non-DE genes during RT, etc. which is most obviously driven by changes in total RNA content between cellls. In effect, total RNA content drives normalization errors, which introduces systematic variation in homogeneous populations. (The errors involved are probably too small to distort genuine biological effects.) This is despite the fact that it should have been normalized out already.

To get around this, we could imagine blocking on measures of total RNA content, e.g., the cellular detection rate as done in MAST. This would allow us to regress out the normalization error and avoid problems in downstream analyses. The obvious problem is that this may also regress out meaningful biology; see cdr.Rmd for details.

5 Session information

sessionInfo()
## R version 3.5.0 RC (2018-04-16 r74624)
## Platform: x86_64-apple-darwin15.6.0 (64-bit)
## Running under: OS X El Capitan 10.11.6
## 
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib
## 
## locale:
## [1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8
## 
## attached base packages:
## [1] parallel  stats4    stats     graphics  grDevices utils     datasets 
## [8] methods   base     
## 
## other attached packages:
##  [1] scRNAseq_1.7.0              scater_1.9.24              
##  [3] ggplot2_3.1.0               scran_1.9.39               
##  [5] SingleCellExperiment_1.3.12 SummarizedExperiment_1.11.6
##  [7] DelayedArray_0.7.49         matrixStats_0.54.0         
##  [9] Biobase_2.41.2              GenomicRanges_1.33.14      
## [11] GenomeInfoDb_1.17.4         IRanges_2.15.19            
## [13] S4Vectors_0.19.24           BiocGenerics_0.27.1        
## [15] BiocParallel_1.15.15        BiocStyle_2.9.6            
## 
## loaded via a namespace (and not attached):
##  [1] viridis_0.5.1             dynamicTreeCut_1.63-1    
##  [3] edgeR_3.23.7              viridisLite_0.3.0        
##  [5] DelayedMatrixStats_1.3.11 assertthat_0.2.0         
##  [7] statmod_1.4.30            BiocManager_1.30.3       
##  [9] GenomeInfoDbData_1.2.0    vipor_0.4.5              
## [11] yaml_2.2.0                pillar_1.3.0             
## [13] backports_1.1.2           lattice_0.20-35          
## [15] glue_1.3.0                limma_3.37.10            
## [17] digest_0.6.18             XVector_0.21.4           
## [19] colorspace_1.3-2          htmltools_0.3.6          
## [21] Matrix_1.2-14             plyr_1.8.4               
## [23] pkgconfig_2.0.2           bookdown_0.7             
## [25] zlibbioc_1.27.0           purrr_0.2.5              
## [27] scales_1.0.0              HDF5Array_1.9.19         
## [29] tibble_1.4.2              withr_2.1.2              
## [31] lazyeval_0.2.1            magrittr_1.5             
## [33] crayon_1.3.4              evaluate_0.12            
## [35] beeswarm_0.2.3            tools_3.5.0              
## [37] stringr_1.3.1             Rhdf5lib_1.3.3           
## [39] munsell_0.5.0             locfit_1.5-9.1           
## [41] bindrcpp_0.2.2            compiler_3.5.0           
## [43] rlang_0.3.0.1             rhdf5_2.25.11            
## [45] grid_3.5.0                RCurl_1.95-4.11          
## [47] BiocNeighbors_0.99.22     igraph_1.2.2             
## [49] bitops_1.0-6              rmarkdown_1.10           
## [51] gtable_0.2.0              reshape2_1.4.3           
## [53] R6_2.3.0                  gridExtra_2.3            
## [55] knitr_1.20                dplyr_0.7.7              
## [57] bindr_0.1.1               rprojroot_1.3-2          
## [59] stringi_1.2.4             ggbeeswarm_0.6.0         
## [61] Rcpp_0.12.19              tidyselect_0.2.5         
## [63] xfun_0.4