1 Background

The 2014 NAR paper (https://doi.org/10.1093/nar/gku351) showed that many ad hoc peak selection strategies result in conservativeness. One could consider erring on the side of conservativeness to be acceptable, especially if it allows more DB sites to pass the filter. However, this document demonstrates that the same strategies can also result in loss of type I error control. The possibility of anticonservativeness means that any increase in detection from ad hoc strategies cannot be trusted.

2 Setting up the experimental design

Consider an experimental design with two replicates in each of two conditions.

group <- rep(c("A", "B"), each=2)
nlibs <- length(group)
design <- model.matrix(~group)

We set up a simulator for a count matrix where a certain proportion (10% by default) of the sites are DB. Note that the grand mean is the same for DB and non-DB sites; otherwise, the filtering problem would be trivial.

simulateCounts <- function(nlibs, n.sites=1e5, prop.db=0.1, 
                           dispfun=function(x) { 0.1 }, 
                           n.mu=50, db.mu=rep(c(100, 0), each=2)) {
    # The NB size is the reciprocal of the dispersion; the default 'dispfun'
    # ignores its argument and returns a constant dispersion of 0.1.
    P.n <- 1/dispfun(n.sites)
    db.sites <- n.sites*prop.db
    P.db <- 1/dispfun(db.sites)

    # Null sites occupy the first 'n.sites' rows, with DB sites appended below;
    # 'db.mu' is recycled across each row of the DB submatrix.
    null.sites <- seq_len(n.sites)
    counts <- rbind(matrix(rnbinom(n.sites*nlibs, mu=n.mu, size=P.n), ncol=nlibs, byrow=TRUE),
        matrix(rnbinom(db.sites*nlibs, mu=db.mu, size=P.db), ncol=nlibs, byrow=TRUE))
    return(list(counts=counts, null=null.sites))
}
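
As a quick sanity check (a sketch using the defaults above), the average count should be around 50 for both null and DB sites, since the mean of db.mu is equal to n.mu.

out <- simulateCounts(4)
mean(rowMeans(out$counts[out$null,]))  # expect ~50, i.e., 'n.mu'
mean(rowMeans(out$counts[-out$null,])) # expect ~50, i.e., the average of 'db.mu'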

We set up a function to perform the differential analysis with edgeR to obtain p-values. We use equal library sizes here, assuming that normalization has already been performed to correct for composition biases.

library(edgeR)
detectDiff <- function(counts, design, coef=ncol(design), lib.size=rep(1e6, ncol(counts))) {
    y <- DGEList(counts, lib.size=lib.size)
    y <- estimateDisp(y, design)
    fit <- glmQLFit(y, design, robust=TRUE)
    res <- glmQLFTest(fit, coef=coef)
    return(list(common.dispersion=y$common.dispersion, PValue=res$table$PValue))
}
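
As a baseline sketch (not part of the analyses below), running an unfiltered simulation through this function should yield approximately uniform p-values for the null sites.

out <- simulateCounts(nlibs)
res <- detectDiff(out$counts, design)
mean(res$PValue[out$null] <= 0.01) # expect a value close to 0.01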

We’ll also set up a function to assess type I error control, by plotting the ratio of the observed to specified type I error rates against the specified rate.

plotAlpha <- function(pvals, ylab="Observed/specified", xlab="Specified", xlim=NULL, ...) {
    for (i in seq_along(pvals)) { 
        cur.p <- pvals[[i]]

        # Compute the observed proportion of p-values below each specified
        # threshold, for comparison to the threshold itself.
        exp <- (seq_along(cur.p) - 0.5)/length(cur.p)
        n <- findInterval(exp, sort(cur.p))
        obs <- n/length(cur.p)

        if (is.null(xlim)) { # Start where the ratio is stable, at >= 20 observations.
            xlim <- c(exp[which(n >= 20)[1]], 1)
        }
        if (i==1L) {
            plot(exp, obs/exp, log="xy", xlim=xlim, type="l", xlab=xlab, ylab=ylab, ...)
        } else {
            lines(exp, obs/exp, ...)
        }
    }
}
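
As a quick usage sketch, a set of uniformly distributed p-values should yield a ratio close to 1 across the entire range.

plotAlpha(list(runif(1e5)), ylim=c(0.5, 2))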

3 Applying the “at least 2” filter

Let’s see what happens when we apply the “at least two” filter, which ranks sites by their second-largest count across libraries and retains the top-ranked sites. First, we set up the filtering function.

set.seed(10001)
AL2Filter <- function(counts) { 
    top.al2 <- apply(counts, 1, FUN=function(x) { sort(x, decreasing=TRUE)[2] })
    rank(-top.al2, ties.method="random")
}
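
To illustrate with a toy matrix of hypothetical counts, the filter statistic is the second-largest count in each row, so a site only needs large counts in two libraries to be ranked highly.

demo <- rbind(c(5, 60, 55, 2), c(20, 20, 20, 20))
apply(demo, 1, FUN=function(x) { sort(x, decreasing=TRUE)[2] })
## [1] 55 20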

Now we run through repeated simulations and collect the results.

retained <- dispersions <- numeric(10)
null.p <- vector("list", 10)

for (it in 1:10) {
    out <- simulateCounts(nrow(design))
    keep <- AL2Filter(out$counts) <= 10000
    kept.null <- which(keep) %in% out$null
    retained[it] <- sum(kept.null)
    res <- detectDiff(out$counts[keep,], design)
    dispersions[it] <- res$common.dispersion
    null.p[[it]] <- res$PValue[kept.null]
}

There is some inflation, but the presence of correctly estimated dispersions for the genuine DB sites keeps the common dispersion close to its true value of 0.1.

summary(retained)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    4084    4089    4136    4129    4157    4190
summary(dispersions)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1051  0.1059  0.1076  0.1073  0.1083  0.1095

We observe loss of type I error control at low p-values. This is because the “at least two” filter enriches for null sites that spuriously resemble DB sites, while the dispersion inflation is too mild to compensate for this selection.

summary(sapply(null.p, FUN=function(x) { mean(x <= 1e-3) }))
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## 0.002625 0.002722 0.003385 0.003540 0.004039 0.005141
plotAlpha(null.p)

4 Applying a union filter

We repeat the analysis with a union filter, which ranks sites by their largest count in any library. Here, we retain fewer sites, which ensures that the percentage of DB sites in the retained set is higher (see below).

set.seed(20002)
UnionFilter <- function(counts) {
    top.u <- apply(counts, 1, FUN=max)
    rank(-top.u, ties.method="random") 
}
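
As a toy illustration with hypothetical counts, the union statistic is driven by a single large count, making it even more prone to selecting outliers.

demo <- rbind(c(200, 5, 4, 6), c(50, 50, 50, 50))
apply(demo, 1, FUN=max)
## [1] 200  50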

Now we run through repeated simulations and collect the results. We use a more stringent filter to obtain a higher DB percentage, which keeps the dispersion inflation low.

retained <- dispersions <- numeric(10)
null.p <- vector("list", 10)
for (it in 1:10) {
    out <- simulateCounts(nrow(design))
    keep <- UnionFilter(out$counts) <= 5000 
    kept.null <- which(keep) %in% out$null
    retained[it] <- sum(kept.null)
    res <- detectDiff(out$counts[keep,], design)
    dispersions[it] <- res$common.dispersion
    null.p[[it]] <- res$PValue[kept.null]
}

Some inflation of the common dispersion occurs, but it is still kept reasonably low overall.

summary(retained)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   388.0   440.8   448.5   443.1   454.8   473.0
summary(dispersions)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1666  0.1686  0.1691  0.1696  0.1711  0.1724

Testing again results in the loss of type I error control. Normally, the union approach enriches for outliers and inflates the dispersion estimates. However, the presence of enough genuine DB sites keeps the inflation in check, encouraging spurious rejection of the null.

summary(sapply(null.p, FUN=function(x) { mean(x <= 1e-2) }))
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.01418 0.01770 0.02385 0.02235 0.02559 0.02908
plotAlpha(null.p)

5 Applying the mean filter

Now, to demonstrate the correct approach, we use a filter on the mean count across libraries. Under the null hypothesis, the mean is roughly independent of the evidence for differences between groups, so filtering on it does not distort the p-value distribution of the retained null sites.

set.seed(30003)
MeanFilter <- function(counts) { 
    top.m <- rowMeans(counts)
    rank(-top.m, ties.method="random")
}
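
By contrast (again a toy sketch with hypothetical counts), the mean treats all libraries symmetrically, so a flat null site and a DB-like site with the same total count receive the same rank.

demo <- rbind(c(50, 50, 50, 50), c(100, 100, 0, 0))
rowMeans(demo)
## [1] 50 50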

Running these counts through edgeR.

retained <- dispersions <- numeric(10)
null.p <- vector("list", 10)
for (it in 1:10) {
    out <- simulateCounts(nrow(design))
    keep <- MeanFilter(out$counts) <= 10000
    kept.null <- which(keep) %in% out$null
    retained[it] <- sum(kept.null)
    res <- detectDiff(out$counts[keep,], design)
    dispersions[it] <- res$common.dispersion
    null.p[[it]] <- res$PValue[kept.null]
}

The dispersion estimates are close to their true value of 0.1.

summary(retained)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8479    8494    8509    8515    8537    8554
summary(dispersions)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.09713 0.09754 0.09917 0.09898 0.10021 0.10100

Testing indicates that type I error control is mostly maintained.

summary(sapply(null.p, FUN=function(x) { mean(x <= 1e-2) }))
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## 0.007301 0.008144 0.008565 0.008914 0.010027 0.010261
plotAlpha(null.p, ylim=c(0.5, 2))

6 Exploiting complex experimental designs

There are also other ways of mitigating the variance inflation that don’t involve introducing DB sites between the two conditions of interest. For example, if an orthogonal batch effect increases binding in some sites, those sites will be preferentially enriched by an “at least 2” filter. This suppresses the dispersion inflation for the other sites and encourages loss of type I error control. FDR control is irrelevant here, as there are no sites that are DB with respect to the contrast being tested.

set.seed(40004)
design.b <- model.matrix(~c(0,1,0,1)+c(0,0,1,1))
null.p <- vector("list", 10)
for (it in 1:10) {
    out <- simulateCounts(nrow(design.b), db.mu=rep(c(10, 90), 2)) # not really DB.
    keep <- AL2Filter(out$counts) <= 5000
    res <- detectDiff(out$counts[keep,], design.b, coef=3)
    kept.null <- which(keep) %in% out$null
    null.p[[it]] <- res$PValue[kept.null]
}
plotAlpha(null.p)

Alternatively, one could imagine a situation with three groups, where DB sites in the third group suppress the dispersion inflation without contributing any differences to the contrast between the first two groups.

set.seed(50005)
group.3 <- factor(rep(LETTERS[1:3], each=2))
design.3 <- model.matrix(~group.3)
null.p <- vector("list", 10)
for (it in 1:10) {
    out <- simulateCounts(nrow(design.3), db.mu=rep(c(25, 100), c(4, 2))) # not really DB.
    keep <- AL2Filter(out$counts) <= 5000
    res <- detectDiff(out$counts[keep,], design.3, coef=2)
    kept.null <- which(keep) %in% out$null
    null.p[[it]] <- res$PValue[kept.null]
}
plotAlpha(null.p)

7 Wrapping up

sessionInfo()
## R Under development (unstable) (2018-12-07 r75787)
## Platform: x86_64-apple-darwin15.6.0 (64-bit)
## Running under: OS X El Capitan 10.11.6
## 
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRlapack.dylib
## 
## locale:
## [1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] edgeR_3.25.2     limma_3.39.3     BiocStyle_2.11.0
## 
## loaded via a namespace (and not attached):
##  [1] locfit_1.5-9.1     Rcpp_1.0.0         bookdown_0.9      
##  [4] lattice_0.20-38    digest_0.6.18      grid_3.6.0        
##  [7] magrittr_1.5       evaluate_0.12      stringi_1.2.4     
## [10] rmarkdown_1.11     splines_3.6.0      statmod_1.4.30    
## [13] tools_3.6.0        stringr_1.3.1      xfun_0.4          
## [16] yaml_2.2.0         compiler_3.6.0     BiocManager_1.30.4
## [19] htmltools_0.3.6    knitr_1.21