Transformation of expression data usually aims to achieve variance stabilization, i.e., the expected precision of each transformed value should be the same. This simplifies downstream analyses by circumventing the need for complex models to account for the mean-variance relationship. For example, one could apply linear models directly under the assumption of i.i.d. normal errors. However, the motivation for variance stabilization is more nebulous in the context of scRNA-seq data analysis. Most common procedures are not model-based and do not assume i.i.d. errors in the first place.
Here, we will discuss some of the consequences of using variance stabilizing transformations (VSTs) for scRNA-seq. We will base our discussions on the square root transformation, which is the VST for Poisson-distributed counts; the log-transformation for log-normal variates (which provides a near-VST for Gamma variates and large negative binomial counts); the general purpose VST from DESeq2 (Anders and Huber 2010); and the VST from sctransform (Hafemeister and Satija 2019).
The transformation determines the type of variation that contributes most to downstream analyses. In this respect, the choice of transformation also serves as an implicit feature selection (or “weighting”) step. A linear transformation would favour genes with large counts as these have the greatest absolute differences. This is usually undesirable as high-abundance genes are often constitutively expressed, and thus not informative for distinguishing cell types or states. A log-transformation is more conventional as it favours genes with large relative differences, downweighting high-abundance genes that usually exhibit small fold-differences between cells.
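To make this concrete, here is a minimal sketch (with arbitrarily chosen Poisson means) that contrasts the absolute and log-scale differences induced by the same two-fold change at low and high abundance:
lo1 <- rpois(10000, lambda=5)
lo2 <- rpois(10000, lambda=10)
hi1 <- rpois(10000, lambda=500)
hi2 <- rpois(10000, lambda=1000)
# On the original scale, the high-abundance gene dominates:
c(mean(hi2) - mean(hi1), mean(lo2) - mean(lo1))
# After log-transformation, both genes contribute similar differences:
c(mean(log2(hi2+1)) - mean(log2(hi1+1)), mean(log2(lo2+1)) - mean(log2(lo1+1)))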
Consider a scenario containing Poisson-distributed counts. Here, the VST is the square-root function, which we duly apply. This yields a mean-variance trend that plateaus at large counts, as expected.
library(matrixStats)
mu <- 2^runif(1000, -5, 10)
y <- matrix(rpois(length(mu)*1000, lambda=mu), nrow=length(mu))
v <- rowVars(sqrt(y))
plot(mu, v, xlab="Mean", ylab="Variance", log="x")
Now, let us introduce some differential expression between two groups of cells. The ADD_DE function below introduces a set of highly variable genes (HVGs) in a balanced manner, i.e., the library size of each cell and the average abundance of each gene are (approximately) unchanged. In this case, the increase in variability is due to a two-fold change in abundance between two groups of cells, but any source of additional variation could be used.
ADD_DE <- function(means, ncells, fc=2) {
    if (!is.matrix(means)) {
        means <- matrix(means, nrow=length(means), ncol=ncells)
    }

    # The first 200 genes become HVGs, split into two sets of 100.
    first <- 1:100
    second <- first + 100

    # Cells are split into two equally sized groups.
    N <- ncells/2
    g1 <- 1:N
    g2 <- N + g1

    # Applying a balanced fold change so that the average abundance of
    # each gene and the library size of each cell are preserved.
    F <- sqrt(fc)
    means[first,g1] <- means[first,g1]*F
    means[first,g2] <- means[first,g2]/F
    means[second,g1] <- means[second,g1]/F
    means[second,g2] <- means[second,g2]*F

    means
}
After transformation, the contribution of variance from each HVG is proportional to the gene's abundance. This is because the VST expresses differences between groups on the square-root scale: the fold-difference in the means is constant across abundances, but the difference in the root-means is not. As a result, downstream analyses on the rooted counts will be driven by the high-abundance genes.
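We can see why directly: for a gene with average abundance a and a balanced fold change (multiplying one group's mean by F and dividing the other's by F), the difference between group means on the square-root scale is sqrt(a*F) - sqrt(a/F) = sqrt(a) * (sqrt(F) - 1/sqrt(F)), which grows with the abundance, whereas the log-difference does not. A quick check with the same fold change as ADD_DE():
a <- 2^(0:10)          # a range of average abundances
f <- sqrt(2)           # balanced two-fold change, as in ADD_DE()
sqrt(a*f) - sqrt(a/f)  # grows steadily with abundance...
log2(a*f) - log2(a/f)  # ...while the log2-difference is constant at 1.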
means <- ADD_DE(mu, 1000)
y2 <- matrix(rpois(length(means), lambda=means), nrow=nrow(means))
v2 <- rowVars(sqrt(y2))
plot(mu, v2, xlab="Mean", ylab="Variance", log="xy")
By comparison, the log-transformation avoids a major abundance-dependent trend in the HVG variances, even though it fails to stabilize the variance due to the Poisson component.
vl <- rowVars(log(y2+1))
plot(mu, vl, xlab="Mean", ylab="Variance", log="xy")
We repeat this simulation with negative binomial-distributed counts. The parameters have been tweaked slightly to demonstrate the effects more clearly (and to avoid some warnings from DESeq2::vst()).
ngenes <- 10000
ncells <- 500
mu <- 2^runif(ngenes, 0, 10)
means <- ADD_DE(mu, ncells, fc=5)
counts <- matrix(rnbinom(ngenes*ncells, mu=means, size=10), ncol=ncells)
rownames(counts) <- sprintf("GENE_%i", seq_len(ngenes))
colnames(counts) <- sprintf("CELL_%i", seq_len(ncells))
Again, the contribution of variance from the transformed values of each HVG is proportional to the gene's abundance. This occurs because differences in the transformed values no longer correspond to the original (constant) log-fold change. Consequently, downstream analyses will be primarily driven by the high-abundance HVGs.
outDS <- DESeq2::vst(counts)
outSC <- sctransform::vst(counts, show_progress=FALSE)
par(mfrow=c(1,2))
plot(mu, matrixStats::rowVars(outDS), log="xy", main="DESeq2")
plot(mu, matrixStats::rowVars(outSC$y), log="xy", main="sctransform")
By comparison, the HVG variance after log-transformation plateaus quickly and avoids a strong mean-dependence at high counts.
vl <- rowVars(log(counts+1))
plot(mu, vl, xlab="Mean", ylab="Variance", log="xy")
These scenarios demonstrate how transformation is intimately linked with implicit feature selection. Different transformations will increase the variance of different sets of genes, which has major implications for the outcome of downstream analyses. The best transformation is the one that favours genes with the most relevant type of variation, though exactly what this might be is debatable.
The main argument in favour of variance stabilization is that it makes the population variance independent of the coverage of that population. For example, a population with deeper sequencing/coverage would have lower variance in the normalized expression values, which may be misleading for comparisons of heterogeneity (e.g., to study differentiation priming). Similarly, a population that contains a spread of library sizes will exhibit some structure whereby high-coverage cells are enveloped by low-coverage cells. A VST aims to remove such library size-dependent differences in variance, which is only possible when the mean-variance trend is eliminated.
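A minimal sketch of this point, assuming Poisson counts and two populations that differ only in their sequencing depth (size factors of 0.5 and 2, chosen arbitrarily): the variance of the normalized values depends on the depth, whereas the variance of the root-counts (the Poisson VST) does not.
mu <- 2^runif(1000, 4, 8)
shallow <- matrix(rpois(1000*100, lambda=mu*0.5), nrow=1000)
deep <- matrix(rpois(1000*100, lambda=mu*2), nrow=1000)
# Variance of the normalized expression is ~4-fold higher in the shallow population...
summary(rowVars(shallow/0.5) / rowVars(deep/2))
# ...but the variance of the root-counts is close to 0.25 in both populations.
summary(rowVars(sqrt(shallow)))
summary(rowVars(sqrt(deep)))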
Admittedly, differences in variance tend not to be the primary focus of the analysis. Of greater concern is the absence of any guarantee that the equality of expectations is preserved after transformation. To illustrate, consider a situation where the normalized expression values from cells of different coverage have equal expectations. If this equality is not preserved in the transformed data, the transformation will have re-introduced coverage-dependent structure. This clearly occurs with the log-transformation (Lun 2018), and a second-order Taylor approximation (E[f(X)] is approximately f(E[X]) + f''(E[X])Var(X)/2, where Var(X) depends on the coverage) suggests that it will occur to some extent for any non-linear transformation.
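As a quick check, assume Poisson counts with the same expected normalized expression at two different depths (values chosen arbitrarily); the expectations are equal before log-transformation but not afterwards.
mu <- 5
shallow <- rpois(100000, lambda=mu*0.5)/0.5  # normalized counts from shallow cells
deep <- rpois(100000, lambda=mu*2)/2         # normalized counts from deep cells
c(mean(shallow), mean(deep))                 # both close to 5...
c(mean(log2(shallow+1)), mean(log2(deep+1))) # ...but the shallow cells are shifted downwards.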
Consider a situation with libraries of different sizes but no other differences between cells. This should exhibit no structure after library size normalization.
ngenes <- 10000
ncells <- 200
mu <- 2^runif(ngenes, 0, 10)
libsize <- rep(rep(c(0.5, 2), each=50), 2)
means <- outer(mu, libsize)
counts <- matrix(rnbinom(ngenes*ncol(means), mu=means, size=100),
ncol=ncol(means))
rownames(counts) <- sprintf("GENE_%i", seq_len(ngenes))
colnames(counts) <- sprintf("CELL_%i", seq_len(ncells))
We can see that both VSTs fail to fully preserve the equality of expected normalized expression values, with some structure correlated with the library size.
outDS <- DESeq2::vst(counts)
outSC <- sctransform::vst(counts, show_progress=FALSE)
pcDS <- prcomp(t(outDS))
pc.sc <- prcomp(t(outSC$y))
par(mfrow=c(1,2))
plot(pcDS$x[,1], pcDS$x[,2], col=(libsize > 1)+1,
main="DESeq2", xlab="PC1", ylab="PC2")
plot(pc.sc$x[,1], pc.sc$x[,2], col=(libsize > 1)+1,
main="sctransform", xlab="PC1", ylab="PC2")
sctransform seems to perform better in this example, with weaker structure associated with the library size. However, the advantage disappears when one introduces more complexity. The simulation below separates the cells into two populations, where each population contains a variety of library sizes. After normalization, we would hope to see separation between those two populations only, with no further separation due to library size.
means <- outer(mu, libsize)
first <- 1:100
second <- first + 100
g1 <- 1:100
g2 <- 100+g1
means[first,g1] <- means[first,g1]*2
means[first,g2] <- means[first,g2]/2
means[second,g1] <- means[second,g1]/2
means[second,g2] <- means[second,g2]*2
counts <- matrix(rnbinom(ngenes*ncol(means), mu=means, size=100),
ncol=ncol(means))
rownames(counts) <- sprintf("GENE_%i", seq_len(ngenes))
colnames(counts) <- sprintf("CELL_%i", seq_len(ncells))
We see that both VSTs now yield separate structures according to the library size. This serves to demonstrate that the equality of expectations is not guaranteed to be preserved after transformation. In fact, sctransform manages to invert the positions of the size-related structure, which is rather unexpected.
outDS <- DESeq2::vst(counts)
outSC <- sctransform::vst(counts, show_progress=FALSE)
pcDS <- prcomp(t(outDS))
pcSC <- prcomp(t(outSC$y))
par(mfrow=c(1,2))
plot(pcDS$x[,1], pcDS$x[,2], col=(libsize > 1)+1,
main="DESeq2", xlab="PC1", ylab="PC2")
plot(pcSC$x[,1], pcSC$x[,2], col=(libsize > 1)+1,
main="sctransform", xlab="PC1", ylab="PC2")
We see the same effect in a more realistic example involving a continuous change in the library size. Here, the addition of a constant log-fold change introduces a library size-associated trajectory in the affected subpopulation. This is because the same log-fold change produces a different Pearson residual at low abundance than at high abundance, owing to the change along the mean-variance relationship.
set.seed(10000)
libsize2 <- 2^rnorm(ncells)
mu2 <- 2^runif(ngenes, -2, 5)
means <- outer(mu2, libsize2)
# Introducing a simulated DE population.
chosen <- sample(ncells, 50)
means[1:100,chosen] <- means[1:100,chosen] * 5
counts <- matrix(rnbinom(ngenes*ncells, mu=means, size=100), ncol=ncells)
rownames(counts) <- sprintf("GENE_%i", seq_len(ngenes))
colnames(counts) <- sprintf("CELL_%i", seq_len(ncells))
# DESeq2's vst isn't happy with low counts, so we'll just
# test sctransform here.
outSC <- sctransform::vst(counts, show_progress=FALSE)
pcSC <- prcomp(t(outSC$y))
par(mfrow=c(1,2))
plot(libsize2, outSC$y[3,], col=seq_along(libsize2) %in% chosen + 1,
log="x", xlab="Library size", ylab="Residual", main="DE gene 1")
plot(pcSC$x[,1], pcSC$x[,2], main="sctransform", xlab="PC1", ylab="PC2",
col=topo.colors(100)[cut(log2(libsize2), 100)], pch=16)
Of course, the above example is somewhat pathological. We might actually consider consistent differences in library sizes to be a valid marker of a distinct subpopulation, such that the structure observed above is scientifically informative. In practice, we would expect to see a smooth distribution of library sizes due to technical variation in coverage. This means that any library size-related structure induced by transformation would more likely form spurious trajectories - which, arguably, can be just as problematic.
No transformation can achieve complete variance stabilization in scRNA-seq data (Lun 2018). At a mean of zero, the variance is necessarily zero; at any non-zero mean, the variance can only increase from zero. Thus, there is an inherent mean-variance relationship that cannot be removed by any (monotonic) transformation. This is particularly relevant for UMI data due to the low counts and high frequency of zeroes.
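To illustrate, a quick simulation of Poisson counts across a range of low means shows the variance collapsing towards zero under both the square-root and log-transformations (any other monotonic transformation would behave similarly).
mu <- 2^seq(-10, 5, length.out=1000)
y <- matrix(rpois(length(mu)*1000, lambda=mu), nrow=length(mu))
par(mfrow=c(1,2))
plot(mu, rowVars(sqrt(y)), xlab="Mean", ylab="Variance", log="x", main="sqrt")
plot(mu, rowVars(log2(y+1)), xlab="Mean", ylab="Variance", log="x", main="log2")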
Of course, the trend near zero can be minimized by using a transformation that narrows the interval of means over which the variance has yet to stabilize. If a sufficiently narrow interval is achieved, very low-abundance genes with few non-zero counts will have the same contribution to population heterogeneity as genes with higher expression. However, this is problematic given that Poisson noise dominates at such low counts. Upweighting this technical noise would probably be counter-productive.
The log-transformation is often criticised due to the arbitrariness of choosing a pseudo-count to avoid undefined values at zero. However, this is no more arbitrary than deciding that relative differences are of greatest interest (and not, say, absolute differences in the counts, or differences in the root-means). Indeed, complaining about arbitrariness in an exploratory analysis is rather ironic, especially in light of the many downstream procedures that explicitly or implicitly involve arbitrary parameter choices.
In principle, the pseudo-count can be interpreted as a tuning parameter for the transformation. This could conceivably be used to adjust the relative contribution of genes of different abundance. Higher pseudo-counts would reduce both noise and biological signal from low-abundance genes, providing a simple approach for analysts to modify their feature selection strategy. In practice, this is not done, probably because the (arbitrary) choice of pseudo-count makes little difference!
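A hypothetical sketch of this tuning behaviour with Poisson counts (pseudo-counts of 1 and 5, chosen arbitrarily): increasing the pseudo-count shrinks the variance contributed by low-abundance genes relative to high-abundance genes.
mu <- 2^runif(1000, -2, 8)
y <- matrix(rpois(length(mu)*1000, lambda=mu), nrow=length(mu))
par(mfrow=c(1,2))
plot(mu, rowVars(log2(y+1)), xlab="Mean", ylab="Variance", log="x", main="Pseudo-count of 1")
plot(mu, rowVars(log2(y+5)), xlab="Mean", ylab="Variance", log="x", main="Pseudo-count of 5")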
The use of the log-total count as a covariate in the GLM fit of sctransform is somewhat concerning. However, it is protected by the use of regularized (i.e., abundance-trended) estimates of all coefficients. This assumes that most genes at any given abundance are not correlated with the library size, equivalent to the assumption of SCnorm. This is likely to be reasonable in the majority of cases; nonetheless, some biological variation will be regressed out in cases where the assumption fails.
ngenes <- 10000
means <- 2^sort(runif(ngenes, 0, 10))
ncells <- 500
counts <- matrix(rnbinom(ngenes*ncells, mu=means, size=10), ncol=ncells)
# Adding some structure that is (imperfectly) correlated with the library size,
# concentrated at an average abundance of mean(1:500).
N <- 1000
for (i in 1:N) {
counts[i,] <- rnbinom(ncells, mu=1:500, size=10)
}
rownames(counts) <- sprintf("GENE_%i", seq_len(ngenes))
colnames(counts) <- sprintf("CELL_%i", seq_len(ncells))
outSC <- sctransform::vst(counts, show_progress=FALSE)
normed <- t(t(counts)/scater::librarySizeFactors(counts))
plot(
matrixStats::rowVars(log(normed+1)),
matrixStats::rowVars(outSC$y),
col=1+as.integer(seq_len(ngenes) <= N),
xlab="Log expression variance",
ylab="scTransform variance"
)
As an entertaining aside, sctransform is highly sensitive to even minor fluctuations in library size. This manifests most clearly when we generate data with large counts and very little variation in library size, as shown below with a slight adaptation of our previous trajectory simulations. It is not entirely clear what drives this effect; if I were to guess, I would say that the lack of variation reduces the precision of the model coefficients such that the residuals are computed from incorrect values that under- or overcompensate for the library size effect.
set.seed(10000)
libsize2 <- 2^rnorm(ncells, sd=0.01) # minor variation
mu2 <- 2^runif(ngenes, -2, 5) * 100 # deep coverage
means <- outer(mu2, libsize2)
# Introducing a simulated DE population.
chosen <- sample(ncells, 50)
means[1:100,chosen] <- means[1:100,chosen] * 5
counts <- matrix(rnbinom(ngenes*ncells, mu=means, size=100), ncol=ncells)
rownames(counts) <- sprintf("GENE_%i", seq_len(ngenes))
colnames(counts) <- sprintf("CELL_%i", seq_len(ncells))
outSC <- sctransform::vst(counts, show_progress=FALSE)
pcSC <- prcomp(t(outSC$y))
plot(pcSC$x[,1], pcSC$x[,2], main="sctransform", xlab="PC1", ylab="PC2",
col=topo.colors(100)[cut(log2(libsize2), 100)], pch=16)
The transformed values from sctransform bear no relation to the original scale of the (log-)counts. This is not a problem for exploratory analyses, but it makes the results of differential expression analyses on these values difficult to interpret.
ngenes <- 10000
means <- 2^sort(runif(ngenes, 0, 10))
ncells <- 500
counts <- matrix(rnbinom(ngenes*ncells, mu=means, size=10), ncol=ncells)
rownames(counts) <- sprintf("GENE_%i", seq_len(ngenes))
colnames(counts) <- sprintf("CELL_%i", seq_len(ncells))
outSC <- sctransform::vst(counts, show_progress=FALSE)
smoothScatter(matrixStats::rowVars(log2(counts+1)),
matrixStats::rowVars(outSC$y))
DESeq2's VST converges to the log2-transformation at high abundances, such that differences in the transformed values approximate log-fold changes, but it is not entirely faultless. A strong mean-dispersion trend is sufficient to induce major discrepancies, even when the counts should have been sufficiently large to obtain an accurate log-fold change estimate.
ngenes <- 10000
ncells <- 200
mu <- 2^runif(ngenes, 5, 10)
dispersion <- 10/mu + 0.1
means <- ADD_DE(mu, ncells)
counts <- matrix(rnbinom(ngenes*ncells, mu=means,
size=1/dispersion), ncol=ncells)
first <- 1:100
g1 <- 1:100
g2 <- 1:100+100
outDS <- DESeq2::vst(counts)
plot(mu[first], (rowMeans(outDS[first,g1]) - rowMeans(outDS[first,g2])),
main="DESeq VST", ylab="Diff", xlab="Mean", log="x")
abline(h=1, col="red")
outL <- log2(counts + 1)
plot(mu[first], (rowMeans(outL[first,g1]) - rowMeans(outL[first,g2])),
main="Log-transformed", ylab="Diff", xlab="Mean", log="x")
abline(h=1, col="red")
A major practical issue with both VSTs is that they are not sparsity preserving. This affects the computational performance of all downstream analyses that are performed on the transformed data. One could avoid this by performing feature selection and the VST at the same time, such that the dense transformed values are only returned for the features of interest. However, it is questionable whether or not this provides benefits worth the increased complexity.
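To see the loss of sparsity directly, here is a small sketch along the lines of the earlier simulations (arbitrary parameters; only sctransform is run, given the earlier caveat about DESeq2::vst() and low counts).
ngenes <- 1000
ncells <- 100
mu <- 2^runif(ngenes, -2, 5)
counts <- matrix(rnbinom(ngenes*ncells, mu=mu, size=10), ncol=ncells)
rownames(counts) <- sprintf("GENE_%i", seq_len(ngenes))
colnames(counts) <- sprintf("CELL_%i", seq_len(ncells))
mean(counts == 0)                    # plenty of zeroes in the raw counts...
mean(log2(counts + 1) == 0)          # ...all of which stay zero after log-transformation...
outSC <- sctransform::vst(counts, show_progress=FALSE)
mean(outSC$y == 0)                   # ...but essentially none remain zero in the residuals.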
In most cases, the log-transformation is probably satisfactory. It is simple to compute, preserves sparsity and allows interpretation of differences as log-fold changes (for large counts, at least). It upweights genes with large relative changes in expression that are most likely to be biologically relevant. And if we put aside theoretical arguments, the widespread use of the log-transformation “in the wild” reflects its adequacy and reliability for most analysts.
sessionInfo()
## R Under development (unstable) (2019-10-31 r77342)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 18.04.3 LTS
##
## Matrix products: default
## BLAS: /home/luna/Software/R/trunk/lib/libRblas.so
## LAPACK: /home/luna/Software/R/trunk/lib/libRlapack.so
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] matrixStats_0.55.0 BiocStyle_2.15.5
##
## loaded via a namespace (and not attached):
## [1] bitops_1.0-6 bit64_0.9-7
## [3] RColorBrewer_1.1-2 GenomeInfoDb_1.23.7
## [5] sctransform_0.2.1 tools_4.0.0
## [7] backports_1.1.5 R6_2.4.1
## [9] irlba_2.3.3 KernSmooth_2.23-16
## [11] vipor_0.4.5 DBI_1.1.0
## [13] lazyeval_0.2.2 BiocGenerics_0.33.0
## [15] colorspace_1.4-1 tidyselect_0.2.5
## [17] gridExtra_2.3 DESeq2_1.27.19
## [19] bit_1.1-15.1 compiler_4.0.0
## [21] Biobase_2.47.2 BiocNeighbors_1.5.1
## [23] DelayedArray_0.13.2 bookdown_0.17
## [25] scales_1.1.0 genefilter_1.69.0
## [27] stringr_1.4.0 digest_0.6.23
## [29] rmarkdown_2.0 XVector_0.27.0
## [31] scater_1.15.16 pkgconfig_2.0.3
## [33] htmltools_0.4.0 rlang_0.4.2
## [35] RSQLite_2.2.0 DelayedMatrixStats_1.9.0
## [37] BiocParallel_1.21.2 dplyr_0.8.3
## [39] RCurl_1.95-4.13 magrittr_1.5
## [41] BiocSingular_1.3.1 GenomeInfoDbData_1.2.2
## [43] Matrix_1.2-18 Rcpp_1.0.3
## [45] ggbeeswarm_0.6.0 munsell_0.5.0
## [47] S4Vectors_0.25.10 viridis_0.5.1
## [49] lifecycle_0.1.0 stringi_1.4.5
## [51] yaml_2.2.0 MASS_7.3-51.5
## [53] SummarizedExperiment_1.17.1 zlibbioc_1.33.0
## [55] plyr_1.8.5 grid_4.0.0
## [57] blob_1.2.0 parallel_4.0.0
## [59] listenv_0.8.0 crayon_1.3.4
## [61] lattice_0.20-38 splines_4.0.0
## [63] annotate_1.65.0 locfit_1.5-9.1
## [65] magick_2.2 zeallot_0.1.0
## [67] knitr_1.27 pillar_1.4.3
## [69] GenomicRanges_1.39.1 geneplotter_1.65.0
## [71] future.apply_1.4.0 reshape2_1.4.3
## [73] codetools_0.2-16 stats4_4.0.0
## [75] XML_3.99-0.2 glue_1.3.1
## [77] evaluate_0.14 BiocManager_1.30.10
## [79] vctrs_0.2.1 gtable_0.3.0
## [81] purrr_0.3.3 future_1.16.0
## [83] assertthat_0.2.1 ggplot2_3.2.1
## [85] xfun_0.12 rsvd_1.0.2
## [87] xtable_1.8-4 viridisLite_0.3.0
## [89] survival_3.1-8 SingleCellExperiment_1.9.1
## [91] tibble_2.1.3 AnnotationDbi_1.49.0
## [93] beeswarm_0.2.3 memoise_1.1.0
## [95] IRanges_2.21.3 globals_0.12.5
Anders, S., and W. Huber. 2010. “Differential expression analysis for sequence count data.” Genome Biol. 11 (10): R106.
Hafemeister, C., and R. Satija. 2019. “Normalization and Variance Stabilization of Single-Cell RNA-seq Data Using Regularized Negative Binomial Regression.” bioRxiv. doi:10.1101/576827.
Lun, A. 2018. “Overcoming Systematic Errors Caused by Log-Transformation of Normalized Single-Cell RNA Sequencing Data.” bioRxiv. doi:10.1101/404962.