1 Rationale

Standardization involves scaling all features so that they have the same (unit) variance across all samples. This is commonly recommended for features that are not directly comparable (e.g., annual income, lifespan, education level) prior to computing an objective function. It ensures that the objective function is not solely determined by the feature with the largest variance, which would be meaningless when the variances are not comparable. In scRNA-seq contexts, standardization ensures that all genes contribute the same amount of variance to downstream steps like PCA and clustering. However, standardization has a number of drawbacks that are not often considered by analysts.

2 Inappropriate gene weighting

Standardizing will downweight the contribution of interesting genes with large total variances due to biological heterogeneity. This will reduce the resolution of biological differences between cell populations. Of course, some genes may just be biologically noisy, but more often than not, large biological components will represent some interesting structured variation. Conversely, standardization will upweight genes with low total variance and small (non-zero) biological components. This will amplify biological variability that was originally minor, which may be misleading.

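set.seed(10)
# Simulate 1000 genes (rows) by 100 cells (columns) of unit-variance noise.
a <- matrix(rnorm(100000), ncol=100)
# Shift the first 10 genes upwards in the first 50 cells to create two populations.
a[1:10,1:50] <- a[1:10,1:50] + 10

# PCA on the cells, without and with standardization of each gene.
out.raw <- prcomp(t(a))
out.sca <- prcomp(t(a), scale.=TRUE)

# Populations are less clearly separated after scaling.
col <- rep(c("blue", "red"), each=50)
plot(out.raw$x[,1], out.raw$x[,2], col=col, main="Raw")

plot(out.sca$x[,1], out.sca$x[,2], col=col, main="Scaled")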
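
As a rough numerical companion to the plots, the sketch below repeats the simulation and computes two summaries of the first PC: its squared correlation with the population labels, and the proportion of total variance that it captures. The names `group`, `r2` and `prop` are introduced here purely for illustration, and these summaries are only one of several ways to quantify the separation.

```r
set.seed(10)
a <- matrix(rnorm(100000), ncol=100)
a[1:10,1:50] <- a[1:10,1:50] + 10
group <- rep(c(0, 1), each=50) # hypothetical population labels

out.raw <- prcomp(t(a))
out.sca <- prcomp(t(a), scale.=TRUE)

# Squared correlation between PC1 and the labels (insensitive to the sign of the PC).
r2 <- c(raw=cor(out.raw$x[,1], group)^2,
    scaled=cor(out.sca$x[,1], group)^2)

# Proportion of the total variance captured by PC1.
prop <- c(raw=out.raw$sdev[1]^2 / sum(out.raw$sdev^2),
    scaled=out.sca$sdev[1]^2 / sum(out.sca$sdev^2))

print(rbind(r2, prop))
```

If the argument in this section holds, both summaries should be larger for the unscaled PCA, as the ten shifted genes are no longer diluted by the other 990 genes.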

Distances between subpopulations also become unnecessarily inter-dependent in standardized data. To illustrate, imagine a dataset containing two subpopulations. One of them highly expresses gene \(X\), while the other only has moderate expression. Thus, we are able to distinguish these two populations based on the expression of gene \(X\). Now, imagine adding a third subpopulation that is silent for gene \(X\). If standardization is performed, the larger variance of \(X\) across the whole dataset shrinks the scaled difference between the first two subpopulations, which reduces the power of \(X\) to discriminate between them. This is counterintuitive as nothing has changed between the first two subpopulations.
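
The three-subpopulation scenario can be sketched with a toy calculation. The expression values (10, 5 and 0), the 50 cells per subpopulation and the helper `std.gap()` are all hypothetical, and within-population noise is omitted for simplicity.

```r
# Hypothetical log-expression values of gene X, 50 cells per subpopulation.
high <- rep(10, 50)
moderate <- rep(5, 50)
silent <- rep(0, 50)

# Difference between the 'high' and 'moderate' means after standardizing
# gene X across all cells supplied to the function.
std.gap <- function(...) {
    z <- as.numeric(scale(c(...)))
    mean(z[1:50]) - mean(z[51:100])
}

gap.two <- std.gap(high, moderate)
gap.three <- std.gap(high, moderate, silent)
print(c(two=gap.two, three=gap.three))
```

The raw difference between the first two subpopulations is 5 in both cases, but the standardized difference is smaller once the silent subpopulation is included.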

3 Distortion of log-fold changes

Any scaling distorts the true log-fold changes for genes between subpopulations. This affects interpretation of relative distances between three or more groups of cells. In particular, it becomes difficult to determine whether two groups are more related to each other than to a third group.
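
A minimal sketch of this effect uses three groups and two genes. The group means below are invented for illustration, and standardization is applied to the group means directly, which is a simplification that assumes equally sized groups with no within-group variation.

```r
# Hypothetical mean log-expression of two genes (columns) in three groups (rows).
means <- rbind(A=c(0, 0), B=c(1, 0), C=c(1, 8))
colnames(means) <- c("gene1", "gene2")

# Euclidean distances between groups, before and after standardizing each gene.
d.raw <- as.matrix(dist(means))
d.sca <- as.matrix(dist(scale(means)))

# Ratio of the B-C distance to the A-B distance.
ratio <- c(raw=d.raw["B", "C"] / d.raw["A", "B"],
    scaled=d.sca["B", "C"] / d.sca["A", "B"])
print(ratio)
```

Before scaling, B is eight times further from C than from A. After scaling, B is equidistant from A and C, so we can no longer tell that A and B are the more closely related pair.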

One could argue that log-fold changes of different genes are not comparable anyway. A 2-fold change in a cell type-defining marker gene may be more important than a 10-fold change in another gene involved in, say, the cell cycle. Even so, it is hard to see how standardization does any better in this regard than using the unbiased estimates of the log-fold changes, as neither incorporates a priori knowledge about the importance of the genes.

Under certain conditions (e.g., when the log-fold changes are large relative to the within-population variation), standardization means that the magnitude of the separation between populations is driven by the number of differentially expressed genes (DEGs), not their log-fold changes. Again, this is not clearly better (or worse) than computing distances based on the magnitude of the log-fold changes.
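
The following sketch illustrates one such condition, where the log-fold changes are large relative to the within-population noise. The function `separation()` and its arguments are hypothetical, and only DE genes are simulated, so that null genes do not contribute to the distance.

```r
# Distance between the two population means after standardizing each gene.
# Simulates 'n.de' DE genes, each with log-fold change 'lfc' and unit-variance noise.
separation <- function(n.de, lfc, n.cells=50) {
    first <- seq_len(n.cells)
    y <- matrix(rnorm(n.de * n.cells * 2), nrow=n.de)
    y[, first] <- y[, first] + lfc
    z <- t(scale(t(y))) # standardize each gene (row) across cells
    delta <- rowMeans(z[, first, drop=FALSE]) - rowMeans(z[, -first, drop=FALSE])
    sqrt(sum(delta^2))
}

set.seed(100)
sep <- c(few.small=separation(10, lfc=10),
    few.large=separation(10, lfc=20),
    many.small=separation(40, lfc=10))
print(sep)
```

Under these assumptions, doubling the log-fold change should barely change the separation, while quadrupling the number of DEGs should roughly double it.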

4 Alternative scaling approaches

The use of standardization would require us to assume that biological differences in variance between genes are not interesting. A slightly more appropriate approach is to remove differences in the technical component of variation. The aim would be to avoid domination of the results by genes with large technical components due to the nature of the mean-variance trend. The problem with this strategy is that genes with very low technical components (e.g., high-abundance genes) would be strongly scaled up. This would inflate their biological components, allowing them to dominate and effectively inverting the original problem.
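
To make the inflation concrete, consider a toy example with two genes that share the same biological component but differ in their technical components. The values and column names below are hypothetical and are not derived from any real dataset.

```r
# Hypothetical variance components for two genes with the same biological component.
genes <- data.frame(row.names=c("low.abundance", "high.abundance"),
    bio=c(0.5, 0.5), tech=c(2, 0.05))

# Scaling each gene so that its technical component equals 1 is
# equivalent to dividing all of its variance components by 'tech'.
genes$bio.rescaled <- genes$bio / genes$tech
genes$tech.rescaled <- genes$tech / genes$tech
print(genes)
```

Both genes start with a biological component of 0.5, but after rescaling, the high-abundance gene has a biological component of 10 compared to 0.25 for the low-abundance gene.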

Another option is to scale each gene such that its variance becomes equal to the biological component. This accounts for the mean-variance trend and upweights genes with large biological components rather than penalizing them. Conversely, genes with near-zero biological components are effectively ignored during the PCA. However, this is still an ad hoc strategy. For total variance \(V\) decomposed into \(B + T\), the rescaled biological component becomes \(B^2/V\) while the rescaled technical component is \(TB/V\); neither of these values has much meaning to me, and treating them as \(B\) and \(T\) would clearly be wrong unless \(T = 0\).
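
The algebra can be checked with a short sketch. The values of `B` and `T.comp` are hypothetical, and `s2` denotes the squared scaling factor \(B/V\) that is implied by setting the rescaled total variance to \(B\).

```r
# Hypothetical biological (B) and technical (T) components for three genes.
B <- c(2, 0.5, 0)
T.comp <- c(1, 1, 1) # 'T' is avoided as a name, as it is shorthand for TRUE in R
V <- B + T.comp

# Multiplying each gene by sqrt(B/V) changes its total variance from V to B.
s2 <- B / V
rescaled <- data.frame(total=s2 * V, bio=s2 * B, tech=s2 * T.comp)
print(rescaled)
```

The rescaled components sum to \(B\) as intended, but the `bio` column equals \(B^2/V\) rather than \(B\), and the gene with \(B = 0\) is removed entirely.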

5 Concluding remarks

Standardization effectively weights each gene in inverse proportion to its variance. In this respect, standardization can be viewed as another feature selection step. However, I would argue that the effects of standardization are largely undesirable in scRNA-seq contexts where we expect to see large differences in variance between genes due to biology. It is true that large differences in the technical components are also present, but it is not clear how to remove them without distorting the biological differences.

I believe that not performing standardization is the best approach in routine applications. This allows genes with large biological components to drive systematic separation in downstream analyses. While genes with large technical components will also drive separation, this should be stochastic and removable by denoising.

6 Session information

sessionInfo()
## R Under development (unstable) (2019-03-02 r76189)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 16.04.6 LTS
## 
## Matrix products: default
## BLAS: /home/aaron/Software/R/trunk/lib/libRblas.so
## LAPACK: /home/aaron/Software/R/trunk/lib/libRlapack.so
## 
## locale:
##  [1] LC_CTYPE=en_GB.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_GB.UTF-8        LC_COLLATE=en_GB.UTF-8    
##  [5] LC_MONETARY=en_GB.UTF-8    LC_MESSAGES=en_GB.UTF-8   
##  [7] LC_PAPER=en_GB.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] BiocStyle_2.11.0
## 
## loaded via a namespace (and not attached):
##  [1] BiocManager_1.30.4 compiler_3.6.0     magrittr_1.5      
##  [4] bookdown_0.9       tools_3.6.0        htmltools_0.3.6   
##  [7] yaml_2.2.0         Rcpp_1.0.0         stringi_1.4.3     
## [10] rmarkdown_1.12     knitr_1.22         stringr_1.4.0     
## [13] xfun_0.5           digest_0.6.18      evaluate_0.13