1 Subtracting input from ChIP

One could theoretically use control data to correct for chromatin state when searching for DB. The most obvious approach is to subtract the control coverage from the ChIP coverage, the idea being to avoid detecting spurious DB caused by changes in chromatin state rather than in binding. However, this has a number of logistical and statistical issues:

  • It obviously requires sequencing of a negative control sample for each condition, which adds to the cost. This may not even be possible for complex designs that do not follow a one-way layout.
  • It assumes that the controls are accurate estimators of the background, which might not be the case. Inputs are particularly troublesome as no pulldown is performed, which yields something closer to a DNase-seq profile. In contrast, the pulldown and washes in an actual ChIP library would presumably remove much of the signal in open chromatin. It is also difficult to imagine that the input profile would accurately reflect the background of a ChIP library, given the various IP biases relating to histone density and fragment length. I am comfortable with using inputs for direct comparisons to ChIP samples, where I can accept that the log-fold change may not be accurate but is at least in the right direction. I am less comfortable with using inputs for correction, where inaccuracies in the log-fold changes may lead to spurious DB.
  • Subtraction distorts the mean-variance relationship, as the absolute size of the counts is lost. This is most relevant when both ChIP and control coverage are high, which is exactly where subtraction is meant to be most beneficial. In particular, a small relative difference between libraries at large counts is amplified in the subtracted counts (see the numeric sketch after this list). This leads to false positives (if the difference is between groups) or inflated variance (if it is within groups). Conservativeness seems to dominate, though anticonservativeness is also possible; see the simulations below.
  • Negative counts need to be coerced to zeros, resulting in an excess of zeros that is not easily handled by the NB model. Using a zero-inflated NB model is difficult, while simply filtering out low-abundance windows post-subtraction is not a solution either. This is because you would retain windows with zeros in some libraries where the other subtracted counts (e.g., for a replicate) are large, purely due to variability at high raw counts.
  • In practice, other high-abundance regions will buffer any damage to the variance estimates via the EB statistics. However, this buffering results in more false positives instead, as it prevents the variance inflation that would otherwise suppress them.
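To make the amplification concrete, here is a small numeric sketch; the values are arbitrary and purely illustrative.

chip <- c(1000, 1050) # two ChIP replicates with a ~5% relative difference
input <- 950          # high control coverage in both libraries
chip - input          # a 5% difference becomes a 2-fold difference
## [1]  50 100
pmax(chip - 1100, 0)  # a larger control gives negatives, coerced to zeros
## [1] 0 0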

2 Setting up some simulated data

This script performs a simulation to demonstrate the effects of subtracting control counts from ChIP counts in a simple case with equal baseline coverage. We set up a simulation with two ChIP replicates in each of two groups and matching input samples.

exp.type <- rep(c("ChIP", "Con", "ChIP", "Con"), each=2)
group.no <- rep(c("A", "B"), each=4)
groupings <- paste0(exp.type, group.no)
nlibs <- length(groupings)

We generate the mean vectors for DB and non-DB sites. The background is the same between groups in all cases; the only difference is that, for DB sites, genuine binding is present in group A alone (whereas non-DB sites have binding in both groups).

library(edgeR)
baseline <- 50
binding <- 50
mu.nodb <- rep(baseline, nlibs)
mu.nodb[exp.type=="ChIP"] <- baseline+binding
mu.db <- rep(baseline, nlibs)
mu.db[exp.type=="ChIP" & group.no=="A"] <- baseline+binding

Simulating counts, with an equal number of DB and non-DB sites:

set.seed(1000)
P <- 1/0.1 # NB size parameter, corresponding to a dispersion of 0.1
is.null <- 1:10000 # row indices of the true null (non-DB) sites
counts <- rbind(
    matrix(rnbinom(10000*nlibs, mu=mu.nodb, size=P), ncol=nlibs, byrow=TRUE),
    matrix(rnbinom(10000*nlibs, mu=mu.db, size=P), ncol=nlibs, byrow=TRUE)
)

3 Running without subtraction

As a point of reference, we do a vanilla analysis that compares the ChIP samples between the two groups directly.

g <- factor(groupings)
design <- model.matrix(~0 + g)
colnames(design) <- levels(g)

Using edgeR:

y.d <- DGEList(counts, lib.size=rep(1e6, nlibs))
y.d <- estimateDisp(y.d, design)
fit.d <- glmQLFit(y.d, design, robust=TRUE)
res.d <- glmQLFTest(fit.d, contrast=makeContrasts(ChIPA - ChIPB, levels=design))
summary(y.d$trended.dispersion)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.09733 0.09822 0.09960 0.09923 0.10008 0.10071

You can see that type I error is controlled for the true nulls.

nullp.d <- res.d$table$PValue[is.null]
sum(nullp.d <= 0.01)/length(nullp.d) 
## [1] 0.0098
sum(nullp.d <= 0.05)/length(nullp.d)
## [1] 0.0459

At the same thresholds, the detection rate for the true DB sites is much higher than the error rate for the non-DB sites. This indicates that power is good.

altp.d <- res.d$table$PValue[-is.null]
sum(altp.d <= 0.01)/length(altp.d)
## [1] 0.2739
sum(altp.d <= 0.05)/length(altp.d)
## [1] 0.5176

4 Running with subtraction

Now we see what happens if we subtract the control counts from the ChIP counts before testing.

subcounts <- counts
is.chip <- exp.type=="ChIP"
is.A <- group.no=="A"
subcounts[,is.chip & is.A] <- subcounts[,is.chip & is.A] - subcounts[,!is.chip & is.A]
subcounts[,is.chip & !is.A] <- subcounts[,is.chip & !is.A] - subcounts[,!is.chip & !is.A]
subcounts[subcounts < 0] <- 0
subcounts <- subcounts[,is.chip]

Setting up the new design matrix.

g2 <- factor(groupings[is.chip])
design2 <- model.matrix(~0 + g2)
colnames(design2) <- levels(g2)

Running through edgeR:

y.s <- DGEList(subcounts, lib.size=rep(1e6, length(g2)))
y.s <- estimateDisp(y.s, design2)
fit.s <- glmQLFit(y.s, design2, robust=TRUE)
res.s <- glmQLFTest(fit.s, contrast=makeContrasts(ChIPA - ChIPB, levels=design2))
summary(y.s$trended.dispersion)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.5194  0.9881  1.4736  1.5486  1.9523  3.6892

The results are now way too conservative, due to inflation of the dispersions.

nullp.s <- res.s$table$PValue[is.null]
sum(nullp.s <= 0.01)/length(nullp.s) 
## [1] 0.0016
sum(nullp.s <= 0.05)/length(nullp.s)
## [1] 0.0153

We see a concomitant reduction in power relative to the no-subtraction case.

altp.s <- res.s$table$PValue[-is.null]
sum(altp.s <= 0.01)/length(altp.s)
## [1] 0.028
sum(altp.s <= 0.05)/length(altp.s)
## [1] 0.203

Can the conservativeness upon subtraction be offset by simply relaxing the significance threshold (notwithstanding the loss of interpretability of the error rates)? No, based on the ROC curves below.

thresholds <- 1:100/1000
tp.s <- findInterval(thresholds, sort(altp.s))/length(altp.s)
fp.s <- findInterval(thresholds, sort(nullp.s))/length(nullp.s)
tp.d <- findInterval(thresholds, sort(altp.d))/length(altp.d)
fp.d <- findInterval(thresholds, sort(nullp.d))/length(nullp.d)
plot(fp.s, tp.s, col="red", type="l", xlab="FPR", ylab="TPR", 
    xlim=c(0, 0.1), ylim=c(0, 1))
lines(fp.d, tp.d, col="blue")
legend("bottomright", c("subtracted", "direct"), col=c("red", "blue"), lwd=1)
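For a numerical summary, one can also compute the area under each curve over the plotted FPR range. The trapezoidal helper below is my own addition, not part of the original analysis.

# Trapezoidal area under each ROC curve over the plotted FPR range.
trap.auc <- function(fpr, tpr) {
    sum(diff(fpr) * (head(tpr, -1) + tail(tpr, -1))/2)
}
trap.auc(fp.s, tp.s) # subtracted analysis
trap.auc(fp.d, tp.d) # direct analysis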

5 Anticonservativeness with buffering

We now add buffering, by filling most of the rows with high-abundance counts that did not undergo much subtraction. Only the first 1000 (true null) windows retain their subtracted counts.

others <- 1001:nrow(subcounts)
bufcounts <- subcounts
bufcounts[others,] <- matrix(rnbinom(length(others)*ncol(subcounts), mu=binding, size=P), length(others))

Running these through edgeR:

y.b <- DGEList(bufcounts, lib.size=rep(1e6, length(g2)))
y.b <- estimateDisp(y.b, design2)
fit.b <- glmQLFit(y.b, design2, robust=TRUE)
res.b <- glmQLFTest(fit.b, contrast=makeContrasts(ChIPA - ChIPB, levels=design2))
summary(y.b$trended.dispersion)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1116  0.1131  0.1171  0.1221  0.1237  0.6110

We see a loss of type I error control for the remaining subtracted windows, because shrinkage towards the buffered trend removes the protection previously provided by variance inflation.

nullp.b <- res.b$table$PValue[-others]
sum(nullp.b <= 0.01)/length(nullp.b) 
## [1] 0.072
sum(nullp.b <= 0.05)/length(nullp.b)
## [1] 0.135

6 Wrapping up

The other approach is to do log-subtraction, where the controls are included as part of the model. You can then ask whether the log-fold change over the control is the same in the two conditions. This avoids the statistical problems of direct subtraction, but can result in a loss of detection power. For example, consider a case where both ChIP and control coverage increase two-fold. This would not be detected after log-correction, as the fold change cancels out. However, there is genuine DB here if you assume an additive model for the raw coverage, e.g., \(x\) genuine binding plus \(y\) background in one condition and \(2(x+y)\) coverage in the other, corresponding to \(x\) against \(2x\) binding between conditions. This loss of power would be most problematic when DB coincides with changes in chromatin accessibility; such coincidence is biologically sensible, but it will not be detected if the latter cancels out the former (personal comm. from Gordon, Mike Love).
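For completeness, here is a minimal sketch of this approach on the simulated data, reusing fit.d and design from Section 3. The contrast is my own illustration rather than part of the original analysis.

# Test whether the ChIP-over-control log-fold change differs between groups.
con.l <- makeContrasts((ChIPA - ConA) - (ChIPB - ConB), levels=design)
res.l <- glmQLFTest(fit.d, contrast=con.l)
# In the two-fold example above, ChIP coverage goes from x+y to 2(x+y) and
# control coverage from y to 2y, so both groups have the same log-fold change
# of log2((x+y)/y); the contrast is zero despite binding doubling from x to 2x.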

So, in short, it is difficult to rigorously adjust for changes in chromatin state. I'd say that several hundred false positives are an acceptable cost for keeping things simple, especially if you have thousands of detected regions. Some protection is generally provided by abundance filtering, as most changes in chromatin state should be depleted after the IP step (see the sketch below). Of course, if extreme changes are likely, then control-based correction may be the lesser of two evils.
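As an illustration of such filtering, here is a minimal sketch that keeps only windows well above the background abundance; the 3-fold threshold is arbitrary and for demonstration only.

abundances <- aveLogCPM(y.d) # average log2-CPM for each window
# With lib.size=1e6, CPM is equal to the raw count, so the background
# abundance is roughly log2(baseline). Keep windows at least 3-fold above it.
keep <- abundances > log2(baseline) + log2(3)
summary(keep)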

sessionInfo()
## R Under development (unstable) (2018-12-07 r75787)
## Platform: x86_64-apple-darwin15.6.0 (64-bit)
## Running under: OS X El Capitan 10.11.6
## 
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRlapack.dylib
## 
## locale:
## [1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] edgeR_3.25.2     limma_3.39.3     BiocStyle_2.11.0
## 
## loaded via a namespace (and not attached):
##  [1] locfit_1.5-9.1     Rcpp_1.0.0         bookdown_0.9      
##  [4] lattice_0.20-38    digest_0.6.18      grid_3.6.0        
##  [7] magrittr_1.5       evaluate_0.12      stringi_1.2.4     
## [10] rmarkdown_1.11     splines_3.6.0      statmod_1.4.30    
## [13] tools_3.6.0        stringr_1.3.1      xfun_0.4          
## [16] yaml_2.2.0         compiler_3.6.0     BiocManager_1.30.4
## [19] htmltools_0.3.6    knitr_1.21