1 Motivation

The idea with detecting correlated gene pairs is to provide an additional level of verification for follow-up studies. It does not depend on explicit subpopulation/trajectory identification and thus avoids problems with their assumptions. Direct calculation of correlations is closer to the data and allows sanity checking of the inferred structure. A related motivation is that validation involves genes rather than subpopulations (as the current cells are destroyed). Thus, by obtaining reliable statistics for correlated gene pairs, we can mirror the results that might be expected from two-marker FACS or FISH.

2 Why use a modified rho?

2.1 Overview

The default tie-handling behaviour in correlatePairs is to use ties.method="expected". This computes the expected rho by averaging across all possible permutations of randomly broken tied values. Specifically, given set of tied values, one can imagine randomly assigning unique ranks to each value. By comparison, the standard rho calculation (with ties.method="average") would assign the average rank to all tied values.

Calculation of the expected rho is actually rather simple. Like ties.method="average", it involves just replacing the tied observations with the average rank when computing the covariance between genes. The expected rank for a tied observation is simply the average, which can be used in the product as tie breaking is done independently between genes. The difference from the standard rho calculation is that the sum of squared differences (in the denominator of the correlation expression) is still computed with unique ranks, representing the denominator for any given tie breaking permutation.

2.2 Computational speed

The use of tie-breaking simplifies things by allowing the same tie-free null distribution to be used for all genes. This means that we do not have to generate new permutations for every gene pair, which would be very time-consuming. As shown below, type I error control is maintained under the null with randomly broken ties.

library(scran)
set.seed(1023423)
ncells <- 100
null.dist <- correlateNull(ncells)
all.p <- list()
for (it in 1:10000) {
    x1 <- rpois(ncells, lambda=10)
    x2 <- rpois(ncells, lambda=20)
    rho2 <- cor(rank(x1, ties.method="random"), rank(x2, ties.method="random"), method="spearman")
    all.p[[it]] <- sum(null.dist >= rho2)/length(null.dist)
}
sum(unlist(all.p) <= 0.01)/10000

## [1] 0.0089

sum(unlist(all.p) <= 0.05)/10000

## [1] 0.0518

In practice, the expected rho does not follow the distribution returned by correlateNull. This is because the expected rho is an average across tie-breaking permutations while the null distribution refers to the rho for each individual permutation. We can expect the expected rho to be less variable and thus the \(p\)-values will probably be somewhat conservative.

The opposite applies for the standard rho computed by ties.method="average". This yields larger absolute rho in the presence of ties, resulting in anticonservative p-values when compared to correlateNull. A more accurate p-value would require permutations for each gene pair to account for the pair-specific pattern of ties. This would be unacceptably slow (and have exchangeability problems anyway, see below) so I’m not doing this..

2.3 Avoid tie-based inflation

Computing the expected rho from random tie breaking avoids spuriously large correlations with the standard rho. The example below provides a worst-case scenario where two genes are perfectly correlated despite only being expressed in one cell. The standard calculation with averaged tie ranks would yield a perfect correlation in spite of the weakness of the correlation. (A permutation-based p-value won’t provide any protection here; the expected p-value would always be 1 over the number of cells.)

r1 <- rep(0:1, c(99, 1))
r2 <- r1
cor(r1, r2, method="spearman")

## [1] 1

Tie breaking ensures that the correlation is not purely driven by the few cells that happen to express both genes. This favours genes with correlations driven by expression across many cells in the data set.

set.seed(1000)
correlatePairs(rbind(r1, r2))

## DataFrame with 1 row and 6 columns
##         gene1       gene2                rho           p.value
##   <character> <character>          <numeric>         <numeric>
## 1          r1          r2 0.0297029702970297 0.768311231688768
##                 FDR   limited
##           <numeric> <logical>
## 1 0.768311231688768     FALSE

One could argue that a large correlation is desirable in the above example, e.g., to favour gene pairs that are co-expressed in rare subpopulations. However, a low correlation for a small subpopulation makes more sense to me.

3 Constructing a null distribution with `design=`

We can also check what happens with a design matrix. Naively comparing against a null distribution of correlations that was constructed without considering design will result in loss of control. Rather, the null distribution should be compared to an appropriate null that accounts for the loss of residual d.f., as shown below.

set.seed(12120)
design <- model.matrix(~factor(rep(1:5, 2)))
y <- matrix(rnorm(1000, mean=rep(1:5, 5), sd=2), ncol=10, byrow=TRUE)

null <- correlateNull(ncol(y))
out <- correlatePairs(y, design=design, null.dist=null, lower.bound=-Inf)
plot(log10(sort(out$p.value)/1:nrow(out)*nrow(out)),
    ylab="Log(Expected/Observed)") # wrong

null <- correlateNull(design=design)
out <- correlatePairs(y, design=design, null.dist=null, lower.bound=-Inf)
plot(log10(sort(out$p.value)/1:nrow(out)*nrow(out)),
    ylab="Log(Expected/Observed)") # right

Note that the construction of the null distribution assumes that the residual error is normally distributed and that design is correctly specified. Normality is almost certainly not going to hold in real scRNA-seq data, so any use of design= should be treated with caution. It is for this reason that we prefer block=, as discussed in ?correlatePairs.

4 Statistical issues to be solved

4.1 Problems with exchangeability

Generation of the null distribution assumes exchangeability of observations. Specifically, there is the assumption that all observations are equally likely to receive any rank when performing the permutations. This will not be the case in practice as some observations are more variable than others, depending on the mean-variance relationship. As such, the variance of the correlations under the null will be underestimated:

means <- rep(c(5, 50), each=50)
disp <- rep(c(1, 0.1), each=50)
counts <- matrix(rnbinom(50000, mu=means, size=1/disp), byrow=TRUE, ncol=length(means))
counts <- t(t(counts)/means)

actual.cor <- cor(t(counts), method="spearman") 
pretend.cor <- correlateNull(100, iters=10000)
var(as.vector(actual.cor))

## [1] 0.01434036

var(pretend.cor)

## [1] 0.01006008

testing <- correlatePairs(counts, pretend.cor) 
hist(testing$p.value) # fairly substantial loss of type I error control

I’m not sure that there’s any way to get around this, without making some strong parametric assumptions about how the variance affects the ranking. I guess we’ll just have to suck it up - at least we get some level of protection from spurious correlations.

4.2 Deficiencies with residuals

An obvious approach is to just estimate the correlations between residuals. However, this is problematic, even in simple one-way layouts. Consider a situation where you have two groups, with zeroes in almost all cells except for a few. When you calculate residuals for each gene, you’ll get blocks of values corresponding to the zeroes. The exact value of these blocks with likely differ between groups; this can generate apparent correlations between genes.

X <- model.matrix(~rep(LETTERS[1:2], each=50))
g1 <- integer(100)
g1[1] <- 100
g1[51] <- 1000
r1 <- lm.fit(X, g1)$residuals
g2 <- integer(100)
g2[3] <- 200
g2[53] <- 2000
r2 <- lm.fit(X, g2)$residuals
cor(r1, r2, method="spearman")

## [1] 0.890356

The problem above is why we calculate correlations within each group. However, this is not possible for complex designs where we need to know the exact effect of each nuisance term on expression and thus the rank. Consider the following, where you get correlations of 1 because the residual effects will be increasingly negative for zeros with larger covariate values. (Of course, the same problem would be present if you misspecified the model, regardless of the presence of zeroes.)

covariates <- 1:100
Y <- model.matrix(~covariates)
g3 <- integer(100)
g3[100] <- 1000
r3 <- lm.fit(Y, g3)$residuals
g4 <- integer(100)
g4[100] <- 2000
r4 <- lm.fit(Y, g4)$residuals
cor(r3, r4, method="spearman")

## [1] 1

4.3 Motivating the use of a lower bound on the ranks

An ad hoc solution is to set all residuals computed from zeroes to a constant value. This preserves the ties between zeroes, thus avoiding the problems with correlations above. To justify this, consider the process of correcting the raw expression values to remove the nuisance effects:

There is a lower bound on the expression values, derived from applying the equivalent transformation to a count of zero.
Correction involves modifying the expression values such that the coefficients for the nuisance effects are equal to zero. This is most easily done by replacing the expression values with their residuals plus some intercept term.
An expression value at the lower bound cannot drop below the bound upon correction, by definition. Similarly, an expression value at the lower bound cannot increase upon correction, as this suggests expression where there is no evidence for it. Thus, all expression values at the lower bound should stay at the bound upon correction.
The intercept is defined so that the corrected values of non-lower-bound observations are always greater than the lower bound. This is reasonable as there is evidence for expression for those observations compared to values at the bound.

To implement this, we fit a linear model, compute the residuals, and set the residuals for all lower-bound observations to a value below the smallest residual. This is equivalent to computing corrected values with an intercept value that fulfills requirement 4 above. The exact value does not matter for rank-based methods, as long as it is clearly lower than other residuals.

We now look at the performance of scran’s correlation calculator with a lower bound. This avoids the problems with overstatement of the correlation. While the p-values are unlikely to be accurate here, the normality assumption in simulating the observations is a bigger problem, so don’t sweat the small stuff.

set.seed(1020)
nulls <- correlateNull(design=X, iters=1e4)
correlatePairs(rbind(g1, g2), design=X, null.dist=nulls, lower.bound=NA) # Bad

## DataFrame with 1 row and 6 columns
##         gene1       gene2              rho            p.value
##   <character> <character>        <numeric>          <numeric>
## 1          g1          g2 0.69021302130213 0.0001999800019998
##                  FDR   limited
##            <numeric> <logical>
## 1 0.0001999800019998      TRUE

correlatePairs(rbind(g1, g2), design=X, null.dist=nulls, lower.bound=0) # Good

## DataFrame with 1 row and 6 columns
##         gene1       gene2                 rho           p.value
##   <character> <character>           <numeric>         <numeric>
## 1          g1          g2 -0.0012001200120012 0.994500549945005
##                 FDR   limited
##           <numeric> <logical>
## 1 0.994500549945005     FALSE

nulls <- correlateNull(design=Y, iters=1e4)
correlatePairs(rbind(g3, g4), design=Y, null.dist=nulls, lower.bound=NA) # Bad

## DataFrame with 1 row and 6 columns
##         gene1       gene2       rho            p.value                FDR
##   <character> <character> <numeric>          <numeric>          <numeric>
## 1          g3          g4         1 0.0001999800019998 0.0001999800019998
##     limited
##   <logical>
## 1      TRUE

correlatePairs(rbind(g3, g4), design=Y, null.dist=nulls, lower.bound=0) # Good

## DataFrame with 1 row and 6 columns
##         gene1       gene2                rho           p.value
##   <character> <character>          <numeric>         <numeric>
## 1          g3          g4 0.0297029702970297 0.767123287671233
##                 FDR   limited
##           <numeric> <logical>
## 1 0.767123287671233     FALSE

5 Session information

sessionInfo()

## R Under development (unstable) (2018-12-07 r75787)
## Platform: x86_64-apple-darwin15.6.0 (64-bit)
## Running under: OS X El Capitan 10.11.6
## 
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRlapack.dylib
## 
## locale:
## [1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8
## 
## attached base packages:
## [1] parallel  stats4    stats     graphics  grDevices utils     datasets 
## [8] methods   base     
## 
## other attached packages:
##  [1] scran_1.11.14               SingleCellExperiment_1.5.1 
##  [3] SummarizedExperiment_1.13.0 DelayedArray_0.9.5         
##  [5] BiocParallel_1.17.3         matrixStats_0.54.0         
##  [7] Biobase_2.43.0              GenomicRanges_1.35.1       
##  [9] GenomeInfoDb_1.19.1         IRanges_2.17.3             
## [11] S4Vectors_0.21.8            BiocGenerics_0.29.1        
## [13] BiocStyle_2.11.0           
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_1.0.0               locfit_1.5-9.1          
##  [3] lattice_0.20-38          assertthat_0.2.0        
##  [5] digest_0.6.18            R6_2.3.0                
##  [7] plyr_1.8.4               dynamicTreeCut_1.63-1   
##  [9] evaluate_0.12            ggplot2_3.1.0           
## [11] pillar_1.3.1             zlibbioc_1.29.0         
## [13] rlang_0.3.0.1            lazyeval_0.2.1          
## [15] Matrix_1.2-15            rmarkdown_1.11          
## [17] BiocNeighbors_1.1.7      statmod_1.4.30          
## [19] stringr_1.3.1            igraph_1.2.2            
## [21] RCurl_1.95-4.11          munsell_0.5.0           
## [23] HDF5Array_1.11.10        vipor_0.4.5             
## [25] compiler_3.6.0           xfun_0.4                
## [27] pkgconfig_2.0.2          ggbeeswarm_0.6.0        
## [29] htmltools_0.3.6          tidyselect_0.2.5        
## [31] gridExtra_2.3            tibble_1.4.2            
## [33] GenomeInfoDbData_1.2.0   bookdown_0.9            
## [35] edgeR_3.25.2             viridisLite_0.3.0       
## [37] crayon_1.3.4             dplyr_0.7.8             
## [39] bitops_1.0-6             grid_3.6.0              
## [41] gtable_0.2.0             magrittr_1.5            
## [43] scales_1.0.0             stringi_1.2.4           
## [45] XVector_0.23.0           viridis_0.5.1           
## [47] bindrcpp_0.2.2           limma_3.39.3            
## [49] scater_1.11.5            DelayedMatrixStats_1.5.0
## [51] Rhdf5lib_1.5.1           tools_3.6.0             
## [53] beeswarm_0.2.3           glue_1.3.0              
## [55] purrr_0.2.5              yaml_2.2.0              
## [57] colorspace_1.3-2         rhdf5_2.27.4            
## [59] BiocManager_1.30.4       knitr_1.21              
## [61] bindr_0.1.1

Some thoughts on testing for correlations

2019-01-06

1 Motivation

2 Why use a modified rho?

2.1 Overview

2.2 Computational speed

2.3 Avoid tie-based inflation

3 Constructing a null distribution with `design=`

4 Statistical issues to be solved

4.1 Problems with exchangeability

4.2 Deficiencies with residuals

4.3 Motivating the use of a lower bound on the ranks

5 Session information

Some thoughts on testing for correlations

2019-01-06

1 Motivation

2 Why use a modified rho?

2.1 Overview

2.2 Computational speed

2.3 Avoid tie-based inflation

3 Constructing a null distribution with design=

4 Statistical issues to be solved

4.1 Problems with exchangeability

4.2 Deficiencies with residuals

4.3 Motivating the use of a lower bound on the ranks

5 Session information

3 Constructing a null distribution with `design=`