One could theoretically use control data to correct for chromatin state when searching for DB. The most obvious is to subtract the control coverage from the ChIP coverage. The idea is to avoid detecting spurious DB due to changes in chromatin state. However, this has a number of logistical and statistical issues:
This script performs a simulation to demonstrate the effects of subtracting control counts from ChIP counts in a simple case with equal baseline coverage. We set up a simulation with two ChIP replicates in each of two groups and matching input samples.
exp.type <- rep(c("ChIP", "Con", "ChIP", "Con"), each=2)
group.no <- rep(c("A", "B"), each=4)
groupings <- paste0(exp.type, group.no)
nlibs <- length(groupings)
We generate the mean vectors for DB and non-DB sites. The background is the same between groups in all cases, and the only difference is that there is genuine binding in group A for DB sites.
library(edgeR)
baseline <- 50
binding <- 50
mu.nodb <- rep(baseline, nlibs)
mu.nodb[exp.type=="ChIP"] <- baseline+binding
mu.db <- rep(baseline, nlibs)
mu.db[exp.type=="ChIP" & group.no=="A"] <- baseline+binding
Simulating counts, with an equal number of DB and non-DB sites:
set.seed(1000)
P <- 1/0.1
is.null <- 1:10000
counts <- rbind(
matrix(rnbinom(10000*nlibs, mu=mu.nodb, size=P), ncol=nlibs, byrow=TRUE),
matrix(rnbinom(10000*nlibs, mu=mu.db, size=P), ncol=nlibs, byrow=TRUE)
)
As a control, we do a vanilla analysis between the two groups directly.
g <- factor(groupings)
design <- model.matrix(~0 + g)
colnames(design) <- levels(g)
Using edgeR:
y.d <- DGEList(counts, lib.size=rep(1e6, nlibs))
y.d <- estimateDisp(y.d, design)
fit.d <- glmQLFit(y.d, design, robust=TRUE)
res.d <- glmQLFTest(fit.d, contrast=makeContrasts(ChIPA - ChIPB, levels=design))
summary(y.d$trended.dispersion)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.09733 0.09822 0.09960 0.09923 0.10008 0.10071
You can see that type I error is controlled for the true nulls.
nullp.d <- res.d$table$PValue[is.null]
sum(nullp.d <= 0.01)/length(nullp.d)
## [1] 0.0098
sum(nullp.d <= 0.05)/length(nullp.d)
## [1] 0.0459
At the same thresholds, there are more DB sites that get detected than non-DB sites. This indicates that power is good.
altp.d <- res.d$table$PValue[-is.null]
sum(altp.d <= 0.01)/length(altp.d)
## [1] 0.2739
sum(altp.d <= 0.05)/length(altp.d)
## [1] 0.5176
Now seeing what happens if we subtract counts before testing.
subcounts <- counts
is.chip <- exp.type=="ChIP"
is.A <- group.no=="A"
subcounts[,is.chip & is.A] <- subcounts[,is.chip & is.A] - subcounts[,!is.chip & is.A]
subcounts[,is.chip & !is.A] <- subcounts[,is.chip & !is.A] - subcounts[,!is.chip & !is.A]
subcounts[subcounts < 0] <- 0
subcounts <- subcounts[,is.chip]
Setting up the new design matrix.
g2 <- factor(groupings[is.chip])
design2 <- model.matrix(~0 + g2)
colnames(design2) <- levels(g2)
Running through edgeR:
y.s <- DGEList(subcounts, lib.size=rep(1e6, length(g2)))
y.s <- estimateDisp(y.s, design2)
fit.s <- glmQLFit(y.s, design2, robust=TRUE)
res.s <- glmQLFTest(fit.s, contrast=makeContrasts(ChIPA - ChIPB, levels=design2))
summary(y.s$trended.dispersion)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.5194 0.9881 1.4736 1.5486 1.9523 3.6892
The results are now way too conservative, due to inflation of the dispersions.
nullp.s <- res.s$table$PValue[is.null]
sum(nullp.s <= 0.01)/length(nullp.s)
## [1] 0.0016
sum(nullp.s <= 0.05)/length(nullp.s)
## [1] 0.0153
We see a concomitant reduction in power relative to the no-subtraction case.
altp.s <- res.s$table$PValue[-is.null]
sum(altp.s <= 0.01)/length(altp.s)
## [1] 0.028
sum(altp.s <= 0.05)/length(altp.s)
## [1] 0.203
Can the conservativeness upon subtraction be offset by simply increasing the threshold (notwithstanding the loss of interpretability of the error rates)? No, based on AUC curves.
thresholds <- 1:100/1000
tp.s <- findInterval(thresholds, sort(altp.s))/length(altp.s)
fp.s <- findInterval(thresholds, sort(nullp.s))/length(nullp.s)
tp.d <- findInterval(thresholds, sort(altp.d))/length(altp.d)
fp.d <- findInterval(thresholds, sort(nullp.d))/length(nullp.d)
plot(fp.s, tp.s, col="red", type="l", xlab="FPR", ylab="TPR",
xlim=c(0, 0.1), ylim=c(0, 1))
lines(fp.d, tp.d, col="blue")
Buffering with lots of entries that are high-abundance and did not require much subtraction.
others <- 1001:nrow(subcounts)
bufcounts <- subcounts
bufcounts[others,] <- matrix(rnbinom(length(others)*ncol(subcounts), mu=binding, size=P), length(others))
Running these through edgeR:
y.b <- DGEList(bufcounts, lib.size=rep(1e6, length(g2)))
y.b <- estimateDisp(y.b, design2)
fit.b <- glmQLFit(y.b, design2, robust=TRUE)
res.b <- glmQLFTest(fit.b, contrast=makeContrasts(ChIPA - ChIPB, levels=design2))
summary(y.b$trended.dispersion)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1116 0.1131 0.1171 0.1221 0.1237 0.6110
We see loss of type I error control, because the buffering removes the protection from variance inflation.
nullp.b <- res.b$table$PValue[-others]
sum(nullp.b <= 0.01)/length(nullp.b)
## [1] 0.072
sum(nullp.b <= 0.05)/length(nullp.b)
## [1] 0.135
The other approach is to do log-subtraction, where you include the controls as part of the model. You could then ask if log-fold change over the control is the same in the two conditions. This avoids the statistical problems with pure subtraction, but can result in loss of detection power. For example, consider a case where both ChIP and control coverage increases two-fold. This would not be detected after log-correction, as the fold change cancels out. However, there is DB here if you assume an additive model for the raw coverage, e.g., \(x\) genuine binding plus \(y\) background in one condition, \(2(x+y)\) in the other condition, resulting in \(x\) against \(2x\) binding between conditions. This loss of power would be most problematic when DB coincides with changes in chromatin accessibility. This is biologically sensible but will not be detected if the latter cancels out the former (personal comm. from Gordon, Mike Love).
So, in short, it is difficult to rigorously adjust for changes in chromatin state. I’d say that several hundred false positives is an acceptable cost for keeping things simple, especially if you have thousands of detected regions. Protection is generally provided by filtering, as most changes in state should be depleted after the IP step. Of course, if there are likely to be extreme changes, then control correction may be the lesser of two evils.
sessionInfo()
## R Under development (unstable) (2018-12-07 r75787)
## Platform: x86_64-apple-darwin15.6.0 (64-bit)
## Running under: OS X El Capitan 10.11.6
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRlapack.dylib
##
## locale:
## [1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] edgeR_3.25.2 limma_3.39.3 BiocStyle_2.11.0
##
## loaded via a namespace (and not attached):
## [1] locfit_1.5-9.1 Rcpp_1.0.0 bookdown_0.9
## [4] lattice_0.20-38 digest_0.6.18 grid_3.6.0
## [7] magrittr_1.5 evaluate_0.12 stringi_1.2.4
## [10] rmarkdown_1.11 splines_3.6.0 statmod_1.4.30
## [13] tools_3.6.0 stringr_1.3.1 xfun_0.4
## [16] yaml_2.2.0 compiler_3.6.0 BiocManager_1.30.4
## [19] htmltools_0.3.6 knitr_1.21