1 Background

The perils of aggressive QC are probably overstated but are concerning enough that we ask ourselves - why run the risk of doing QC at all, if you might throw out genuine cell types? Why not just put up with the presence of a “low quality cell” cluster? This document provides some case studies of what we can expect if we skip the QC step altogether.

2 Grun human pancreas, unfiltered

This dataset is useful as no quality control has been applied, as we can see from the fact that all cell counts are multiples of 96.

library(scRNAseq)
sce <- GrunPancreasData()
table(sce$donor, sce$sample)
##      
##       CD13+ sorted cells CD24+ CD44+ live sorted cells CD63+ sorted cells
##   D10                  0                             0                 96
##   D17                 96                            96                  0
##   D2                   0                             0                  0
##   D3                   0                             0                  0
##   D7                   0                             0                  0
##      
##       exocrine fraction, live sorted cells live sorted cells, library 1
##   D10                                    0                           96
##   D17                                    0                           96
##   D2                                    96                            0
##   D3                                    96                           96
##   D7                                     0                           96
##      
##       live sorted cells, library 2 live sorted cells, library 3
##   D10                           96                            0
##   D17                           96                            0
##   D2                             0                            0
##   D3                            96                           96
##   D7                            96                           96
##      
##       live sorted cells, library 4 TGFBR3+ sorted cells
##   D10                            0                    0
##   D17                            0                   96
##   D2                             0                    0
##   D3                            96                    0
##   D7                            96                    0

We compute some QC metrics and identify those cells that would have been removed, had we applied an outlier-based QC threshold.

library(scater)
sce <- addQCPerCell(sce)
discarded <- quickCellQC(colData(sce),
    percent_subsets="altexps_ERCC_percent", nmads=3)
sce$discarded <- discarded$discard
sce <- sce[,sce$sum > 0]

It is instructive to compare the mean-variance trends before and after QC. We see that the trend never stabilizes due to the continued prescence of dropouts - even in high-abundance genes - that introduce large variances from comparing zero and non-zero log-expression values.

library(scran)
filtered <- sce[,!sce$discarded]
filtered <- logNormCounts(filtered, use_altexps=FALSE)
dec.f <- modelGeneVar(filtered)
plot(metadata(dec.f)$mean, metadata(dec.f)$var)
curve(metadata(dec.f)$trend(x), col="dodgerblue", add=TRUE)

sce <- logNormCounts(sce, use_altexps=FALSE)
dec.u <- modelGeneVar(sce)
plot(metadata(dec.u)$mean, metadata(dec.u)$var)
curve(metadata(dec.u)$trend(x), col="dodgerblue", add=TRUE)

Similar issues are observed when we look at the PCA, where the discarded cells lie on one of the corners and effectively drive the first two PCs.

set.seed(100)
library(BiocSingular)
sce <- runPCA(sce, subset_row=dec.u$bio > 0, BSPARAM=IrlbaParam())
plotPCA(sce, colour_by="discarded")

So, the variance modelling is compromised. But perhaps that is not a major concern, as we can still identify the genes above the trend, and even the first two PCs still contain some information; it’s not a total write-off. For a better visualization, let’s have a look at the \(t\)-SNE:

set.seed(101)
sce <- runTSNE(sce, dimred="PCA")
plotTSNE(sce, colour_by="discarded")

The discarded cells effectively form their own clusters rather than being distributed within the other clusters. This is annoying as it means that we must be explicitly on the look-out for one or more artificial clusters. (The saving grace is that once we do identify them, we can just remove the offenders in one go.)

g <- buildSNNGraph(sce, use.dimred="PCA")
sce$cluster <- factor(igraph::cluster_walktrap(g)$membership)
table(sce$cluster, sce$discarded)
##     
##      FALSE TRUE
##   1    207    0
##   2    355    1
##   3     65  306
##   4    270    1
##   5     43    4
##   6    123    0
##   7     64    0
##   8     16    6
##   9     64    0
##   10    27   50
##   11     3   57
##   12    22    0
##   13    24    0
##   14     8    2

The fun really starts when we look at the marker genes. We see that one of the clusters of discarded cells is intermediate for INS and SST, which is pretty unusual. This is consistent with spurious “upregulation” of transcripts enriched in the ambient soup, which would have easily misled us if we did not know about the biology of the system. (You could even get unique spurious markers for this cluster, if the cell types contributing to the ambient soup did not survive. To be sure that upregulation is genuine, we would need to verify that it is present against the soup.)

markers <- findMarkers(sce, sce$cluster, direction="up", lfc=1)
as.data.frame(markers[[3]][1:20,1:3])
##                    Top      p.value          FDR
## MTRNR2L8__chr11      1 2.447952e-14 4.911570e-10
## UGDH-AS1__chr4       1 7.243847e-13 7.267027e-09
## KCNQ1OT1__chr11      1 1.129684e-12 7.555325e-09
## REG1A__chr2          1 1.961390e-10 9.838333e-07
## REG3A__chr2          1 2.582754e-10 1.036408e-06
## INS__chr11           1 4.409887e-07 1.474666e-03
## SST__chr3            1 1.297444e-04 3.253991e-01
## TTR__chr18           1 6.872370e-04 1.000000e+00
## MTRNR2L2__chr5       3 1.892970e-03 1.000000e+00
## MAB21L3__chr1        3 8.978784e-03 1.000000e+00
## GCG__chr2            3 1.390494e-01 1.000000e+00
## CHGB__chr20          3 1.000000e+00 1.000000e+00
## PRSS1__chr7          4 4.883271e-06 1.399685e-02
## PPY__chr17           4 5.898493e-01 1.000000e+00
## NLRP12__chr19        5 1.000000e+00 1.000000e+00
## REG1B__chr2          6 1.360128e-02 1.000000e+00
## LOC100131257__chr7   6 1.916348e-01 1.000000e+00
## LOC643406__chr20     6 8.581646e-01 1.000000e+00
## FBLIM1__chr1         6 1.000000e+00 1.000000e+00
## MALAT1__chr11        7 4.692586e-01 1.000000e+00
plotExpression(sce, x="cluster", 
    features=c("INS__chr11", "SST__chr3", "REG1A__chr2", "REG3A__chr2"),
    colour_by="discarded")

This nonsense doesn’t appear after the discarded cells are actually discarded. So in this scenario, the primary motivation for QC is to simplify the downstream interpretation, even if it might result in the loss of cell types that the QC metrics discriminate against. The amount of cross-checking required to verify that a cluster is not just full of low-quality cells is comparable to the cross-checking required to determine if a cell type was discarded by QC, so we might as well take the more stringent route to avoid false positives.

set.seed(100)
library(BiocSingular)
filtered <- runPCA(filtered, subset_row=dec.u$bio > 0, BSPARAM=IrlbaParam())

set.seed(101)
filtered <- runTSNE(filtered, dimred="PCA")

g <- buildSNNGraph(filtered, use.dimred="PCA")
filtered$cluster <- factor(igraph::cluster_walktrap(g)$membership)

plotExpression(filtered, x="cluster", 
    features=c("INS__chr11", "GCG__chr2"),
    colour_by=I(log10(filtered$sum)))

3 10X PBMC, unfiltered

This isn’t entirely unfiltered as all 10X datasets will have undergone some filtering due to cell calling. Nonetheless, we can have a look at the types of cells that we get out from CellRanger.

library(BiocFileCache)
bfc <- BiocFileCache(ask=FALSE)
out <- bfcrpath(bfc, file.path("http://cf.10xgenomics.com/samples/cell-exp",
    "3.1.0/5k_pbmc_protein_v3/5k_pbmc_protein_v3_filtered_feature_bc_matrix.tar.gz"))

exdir <- tempfile()
untar(out, exdir=exdir)

library(DropletUtils)
sce <- read10xCounts(file.path(exdir, "filtered_feature_bc_matrix"))
sce <- splitAltExps(sce, rowData(sce)$Type)
sce
## class: SingleCellExperiment 
## dim: 33538 5247 
## metadata(0):
## assays(1): counts
## rownames(33538): ENSG00000243485 ENSG00000237613 ... ENSG00000277475
##   ENSG00000268674
## rowData names(3): ID Symbol Type
## colnames: NULL
## colData names(2): Sample Barcode
## reducedDimNames(0):
## spikeNames(0):
## altExpNames(1): Antibody Capture

In this case, we only identify cells to be discarded on the basis of their mitochondrial content, assuming that the cell calling has filtered out most of the low-coverage droplets.

library(scater)
stats <- perCellQCMetrics(sce, subsets=list(Mt=grep("MT", rowData(sce)$Symbol)))
discarded <- isOutlier(stats$subsets_Mt_percent, nmads=3, type="higher")
sce$discarded <- discarded
sum(discarded)
## [1] 1080

The mean-variance trend in the unfiltered dataset keeps going up, consistent with our previous observations in the Grun pancreas data.

library(scran)
filtered <- sce[,!sce$discarded]
filtered <- logNormCounts(filtered, use_altexps=FALSE)
dec.f <- modelGeneVar(filtered)
plot(metadata(dec.f)$mean, metadata(dec.f)$var)
curve(metadata(dec.f)$trend(x), col="dodgerblue", add=TRUE)

sce <- logNormCounts(sce, use_altexps=FALSE)
dec.u <- modelGeneVar(sce)
plot(metadata(dec.u)$mean, metadata(dec.u)$var)
curve(metadata(dec.u)$trend(x), col="dodgerblue", add=TRUE)

The domination of the discarded cells in the PCA and \(t\)-SNE is even more obvious.

set.seed(100)
library(BiocSingular)
sce <- runPCA(sce, subset_row=dec.u$bio > 0, BSPARAM=IrlbaParam())
plotPCA(sce, colour_by="discarded")

set.seed(101)
sce <- runTSNE(sce, dimred="PCA")
plotTSNE(sce, colour_by="discarded")

At least the discarded cells cluster together, and fortunately, the relevant cluster is identifiable based on the strong upregulation of mitochondrial genes. (Though this does raise the question of why we didn’t just filter them out in the first place, if were going to use that metric to identify them.)

g <- buildSNNGraph(sce, use.dimred="PCA")
sce$cluster <- factor(igraph::cluster_walktrap(g)$membership)
table(sce$cluster, sce$discarded)
##     
##      FALSE TRUE
##   1    331    2
##   2    922   12
##   3   1166   12
##   4    300    2
##   5     78 1002
##   6    142    2
##   7    127    1
##   8   1024    2
##   9      1   44
##   10    49    0
##   11    27    1
rownames(sce) <- uniquifyFeatureNames(
    rowData(sce)$ID, rowData(sce)$Symbol)
markers <- findMarkers(sce, sce$cluster, direction="up", lfc=1, full.stats=TRUE)
as.data.frame(markers[[5]][1:20,1:3])
##           Top       p.value           FDR
## MT-ATP6     1  0.000000e+00  0.000000e+00
## MT-ND1      1 2.415162e-308 1.157139e-304
## MTRNR2L12   1 5.716652e-238 1.917251e-234
## MT-ND3      1 7.263951e-213 2.214713e-209
## LTB         1  1.153075e-18  2.762274e-15
## MT-CYB      2  0.000000e+00  0.000000e+00
## MT-CO1      2  0.000000e+00  0.000000e+00
## MT-CO2      2  0.000000e+00  0.000000e+00
## MT-ND5      2  0.000000e+00  0.000000e+00
## MT-ND2      2 5.754655e-201 1.608330e-197
## IL7R        2  9.418994e-09  1.858201e-05
## MT-CO3      3  0.000000e+00  0.000000e+00
## CD69        3  1.426385e-02  1.000000e+00
## MT-ND4      4 1.045994e-278 4.385070e-275
## ACAP1       4  5.223391e-11  1.167881e-07
## MALAT1      5 1.291745e-240 4.813615e-237
## RPS2        5  4.495392e-01  1.000000e+00
## RPS27       6  6.322929e-01  1.000000e+00
## RPS27A      7  6.405263e-01  1.000000e+00
## RPS12       8  1.000000e+00  1.000000e+00

You might say that the mitochondrial proportion is not a good way of identifying low-quality cells. Fair enough, but if we had blinded ourselves to it, we wouldn’t be able to identify this cluster as being low-quality based on its gene expression. There’s at least a couple of genes that make it seems kind of biologically interesting.

plotExpression(sce, x="cluster", 
    features=c("IL7R", "CD69", "LTB", "MALAT1"),
    colour_by="discarded")

4 Session information

sessionInfo()
## R version 3.6.0 Patched (2019-05-10 r76483)
## Platform: x86_64-apple-darwin17.7.0 (64-bit)
## Running under: macOS High Sierra 10.13.6
## 
## Matrix products: default
## BLAS:   /Users/luna/Software/R/R-3-6-branch-dev/lib/libRblas.dylib
## LAPACK: /Users/luna/Software/R/R-3-6-branch-dev/lib/libRlapack.dylib
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] parallel  stats4    stats     graphics  grDevices utils     datasets 
## [8] methods   base     
## 
## other attached packages:
##  [1] DropletUtils_1.5.7          BiocFileCache_1.9.1        
##  [3] dbplyr_1.4.2                BiocSingular_1.1.5         
##  [5] scran_1.13.18               scater_1.13.18             
##  [7] ggplot2_3.2.1               scRNAseq_1.99.5            
##  [9] SingleCellExperiment_1.7.9  SummarizedExperiment_1.15.8
## [11] DelayedArray_0.11.4         BiocParallel_1.19.2        
## [13] matrixStats_0.55.0          Biobase_2.45.1             
## [15] GenomicRanges_1.37.15       GenomeInfoDb_1.21.1        
## [17] IRanges_2.19.14             S4Vectors_0.23.22          
## [19] BiocGenerics_0.31.5         BiocStyle_2.13.2           
## 
## loaded via a namespace (and not attached):
##  [1] bitops_1.0-6                  bit64_0.9-7                  
##  [3] httr_1.4.1                    tools_3.6.0                  
##  [5] backports_1.1.4               R6_2.4.0                     
##  [7] irlba_2.3.3                   HDF5Array_1.13.7             
##  [9] vipor_0.4.5                   DBI_1.0.0                    
## [11] lazyeval_0.2.2                colorspace_1.4-1             
## [13] withr_2.1.2                   tidyselect_0.2.5             
## [15] gridExtra_2.3                 bit_1.1-14                   
## [17] curl_4.0                      compiler_3.6.0               
## [19] BiocNeighbors_1.3.3           labeling_0.3                 
## [21] bookdown_0.13                 scales_1.0.0                 
## [23] rappdirs_0.3.1                stringr_1.4.0                
## [25] digest_0.6.20                 R.utils_2.9.0                
## [27] rmarkdown_1.15                XVector_0.25.0               
## [29] pkgconfig_2.0.2               htmltools_0.3.6              
## [31] limma_3.41.16                 rlang_0.4.0                  
## [33] RSQLite_2.1.2                 shiny_1.3.2                  
## [35] DelayedMatrixStats_1.7.2      R.oo_1.22.0                  
## [37] dplyr_0.8.3                   RCurl_1.95-4.12              
## [39] magrittr_1.5                  GenomeInfoDbData_1.2.1       
## [41] Matrix_1.2-17                 Rhdf5lib_1.7.4               
## [43] Rcpp_1.0.2                    ggbeeswarm_0.6.0             
## [45] munsell_0.5.0                 viridis_0.5.1                
## [47] R.methodsS3_1.7.1             stringi_1.4.3                
## [49] yaml_2.2.0                    edgeR_3.27.13                
## [51] zlibbioc_1.31.0               rhdf5_2.29.3                 
## [53] Rtsne_0.15                    AnnotationHub_2.17.9         
## [55] grid_3.6.0                    blob_1.2.0                   
## [57] promises_1.0.1                dqrng_0.2.1                  
## [59] ExperimentHub_1.11.6          crayon_1.3.4                 
## [61] lattice_0.20-38               cowplot_1.0.0                
## [63] locfit_1.5-9.1                zeallot_0.1.0                
## [65] knitr_1.24                    pillar_1.4.2                 
## [67] igraph_1.2.4.1                glue_1.3.1                   
## [69] evaluate_0.14                 BiocManager_1.30.4           
## [71] vctrs_0.2.0                   httpuv_1.5.2                 
## [73] gtable_0.3.0                  purrr_0.3.2                  
## [75] assertthat_0.2.1              xfun_0.9                     
## [77] rsvd_1.0.2                    mime_0.7                     
## [79] xtable_1.8-4                  later_0.8.0                  
## [81] viridisLite_0.3.0             tibble_2.1.3                 
## [83] AnnotationDbi_1.47.1          beeswarm_0.2.3               
## [85] memoise_1.1.0                 statmod_1.4.32               
## [87] interactiveDisplayBase_1.23.0