1 Comments on dot plots

For anyone who doesn’t know what I’m talking about, I mean the Seurat-style dot plots produced by the code below:

###########################
### Setting up the data ###
###########################

library(scRNAseq)
sce <- ZeiselBrainData()

library(scran)
library(scater)
sce <- logNormCounts(sce)
markers <- pairwiseTTests(logcounts(sce), sce$level1class)
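# Take the top 2 genes from each pairwise comparison between clusters.
output <- getTopMarkers(markers[[1]], markers[[2]], n=2)

# Flatten the nested list; unique() guards against a gene appearing twice.
features <- unique(unlist(unlist(output)))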
object <- sce
group <- sce$level1class
                                                                        
#################################
### Setting up the statistics ###
#################################
                                                                        
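# Proportion of cells in each group with detected (non-zero) expression.
num <- numDetectedAcrossCells(object, ids = group,
    subset_row = features, average = TRUE)
# Average log-expression in each group.
ave <- sumCountsAcrossCells(object, ids = group, subset_row = features,
    exprs_values="logcounts", average = TRUE)
# Log-fold change of each group's average from the mean across groups.
logfc <- ave - rowMeans(ave)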

#####################
### Making a plot ###
#####################

rn <- factor(rownames(logfc))
cn <- factor(colnames(logfc))

# as.numeric() flattens a matrix column by column, so rows cycle fastest.
evals_long <- data.frame(
    Row = rep(rn, ncol(logfc)),
    Col = rep(cn, each = nrow(logfc)),
    LogFC = as.numeric(logfc),
    Proportion = as.numeric(num)  # a proportion in [0, 1], not a percentage
)

ggplot(evals_long) +
    geom_point(aes(x=Row, y=Col, colour=LogFC, size=Proportion)) +
    scale_color_gradient2(low="blue", high="red")

The idea is to use the color to capture the log-fold change from the mean (or any other statistic) while using the size to represent the proportion of cells with detectable expression. This allows both statistics to be represented compactly for each gene and cluster in a single plot. The eye is naturally drawn towards the large red circles, allowing the reader to rapidly focus on candidate markers with strong positive log-fold changes and a high proportion of detected cells. More importantly, it looks pretty.

However, the dot plot has a few deficiencies as an effective visualization tool. Many of them stem from the use of the point area to represent… well, anything, really. Humans are notoriously bad at judging areas (see Tufte, The Visual Display of Quantitative Information) so we cannot rely on two dimensions to accurately represent a one-dimensional quantity. Even worse is that the size of the point actively interferes with visualization of the log-fold change. We cannot easily see colors for small points, reducing the effectiveness of visualization for genes that are sporadically expressed but critical for cell type identity (e.g., Cd4).
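To make the area problem concrete, here is a small sketch in base R. It assumes that point area is exactly proportional to the plotted value, which is a simplification and not something checked against the plot above; the proportions are made up for illustration.

```r
# Hypothetical detection proportions for one gene in three clusters.
prop <- c(A = 0.2, B = 0.4, C = 0.8)

# If point area is proportional to the value, the diameter grows
# only with the square root of the value.
diameter <- sqrt(prop)

# Ratios relative to cluster A: the value doubles and quadruples,
# but the diameter increases by factors of about 1.41 and 2.
value_ratio <- prop / prop["A"]
diameter_ratio <- diameter / diameter["A"]

round(rbind(value = value_ratio, diameter = diameter_ratio), 2)
```

A reader who compares widths will underestimate the difference between clusters, and one who tries to compare areas has to do so by eye.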

A related problem is that the use of size implicitly introduces a secondary color scale involving the background color. For a low-abundance gene in the above example, the plot transitions from red to grey/white as expression decreases. If this secondary scale overlaps with the primary color scale, it becomes difficult to interpret, e.g., does a white region of the plot represent a lack of any expression or detectable expression with a zero log-fold change? The issue arises even if the overlap in color scales is not exact; here, any light color that is similar to the background grey would be bad enough, especially if variations in screen/printer/eye quality are taken into account.
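How close the two scales are can be checked numerically. The sketch below assumes the ggplot2 defaults that apply to the plot above: `scale_color_gradient2()` uses white as its midpoint color unless `mid=` is set, and the default theme draws the panel background in `grey92`.

```r
# Color shown at a log-fold change of zero (the default 'mid' of
# scale_color_gradient2) and the default panel background color.
zero_lfc <- col2rgb("white")[, 1]
background <- col2rgb("grey92")[, 1]

# Difference per RGB channel, on a 0-255 scale.
channel_diff <- zero_lfc - background
channel_diff

# Largest channel difference as a fraction of the full range.
max(abs(channel_diff)) / 255
```

Under these assumptions the two colors differ by 20 units per channel, less than 8% of the available range, which is easy to lose on a poor screen or printout.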

One might argue that all of these issues can be swept aside if we are only interested in identifying high-quality marker genes. In such cases, readers only need to look for the presence of big red dots and avoid interpreting relative areas or conflicts with the background color. This is a valid perspective but the presence of the additional colors becomes an unnecessary distraction; an uninteresting gene/cluster combination can manifest either as a large blue dot or a small dot dominated by background color. We are left with a confusing color scale that progresses from red (most expressed) to white (less expression but many non-zeroes) to blue (even less expression, still many non-zeroes) to white (few non-zeroes). The reader should not have to consider two different visual effects that have the same meaning.

To ensure that readers interpret the dot plot in the “correct” way, we suggest abandoning any attempt to represent the uninteresting parts of the plot faithfully. Rather, we cap the minimum log-fold change at zero and synchronize the color at zero with the background color. Any interpretation of the plot then collapses to a simple question: is it a big red dot or is it empty? There is no need to distinguish between zero log-fold changes and lack of expression, because we no longer care to do so. (Some might complain that this discards information but, as we have discussed, there was no way to visualize that information effectively in the first place.)
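A minimal sketch of this suggestion follows. It is one possible way to do it, not a definitive implementation: the matrices are toy values invented for illustration (not from the Zeisel data), `grey92` is assumed to be the panel color of ggplot2's default theme, and the grid lines are removed so that nothing but the dots stands out from the background.

```r
library(ggplot2)

# Toy statistics for illustration only: rows are genes, columns are clusters.
logfc <- matrix(c( 2.1, -0.5, -1.6,
                  -0.8,  1.4, -0.6,
                   0.1, -0.2,  0.1),
    nrow = 3, byrow = TRUE,
    dimnames = list(c("GeneA", "GeneB", "GeneC"), c("c1", "c2", "c3")))
prop <- matrix(c(0.90, 0.30, 0.10,
                 0.20, 0.80, 0.30,
                 0.05, 0.05, 0.10),
    nrow = 3, byrow = TRUE, dimnames = dimnames(logfc))

# Long format; as.numeric() flattens column by column.
df <- data.frame(
    Row = rep(rownames(logfc), ncol(logfc)),
    Col = rep(colnames(logfc), each = nrow(logfc)),
    LogFC = pmax(as.numeric(logfc), 0),  # cap the minimum at zero
    Proportion = as.numeric(prop)
)

# Use one color for both the panel background and a log-fold change of zero.
bg <- "grey92"  # assumed panel color of ggplot2's default theme
p <- ggplot(df) +
    geom_point(aes(x = Row, y = Col, colour = LogFC, size = Proportion)) +
    scale_colour_gradient(low = bg, high = "red",
        limits = c(0, max(df$LogFC))) +
    theme(panel.background = element_rect(fill = bg, colour = NA),
        panel.grid = element_blank())
p
```

With the real data, the same capping and color choices could be applied to `evals_long` in place of the toy `df`.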

2 Session information

sessionInfo()
## R Under development (unstable) (2019-10-31 r77342)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 18.04.3 LTS
## 
## Matrix products: default
## BLAS:   /home/luna/Software/R/trunk/lib/libRblas.so
## LAPACK: /home/luna/Software/R/trunk/lib/libRlapack.so
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] parallel  stats4    stats     graphics  grDevices utils     datasets 
## [8] methods   base     
## 
## other attached packages:
##  [1] scater_1.15.7               ggplot2_3.2.1              
##  [3] scran_1.15.9                scRNAseq_2.1.3             
##  [5] SingleCellExperiment_1.9.0  SummarizedExperiment_1.17.0
##  [7] DelayedArray_0.13.0         BiocParallel_1.21.0        
##  [9] matrixStats_0.55.0          Biobase_2.47.1             
## [11] GenomicRanges_1.39.1        GenomeInfoDb_1.23.0        
## [13] IRanges_2.21.2              S4Vectors_0.25.0           
## [15] BiocGenerics_0.33.0         BiocStyle_2.15.0           
## 
## loaded via a namespace (and not attached):
##  [1] bitops_1.0-6                  bit64_0.9-7                  
##  [3] httr_1.4.1                    tools_4.0.0                  
##  [5] backports_1.1.5               R6_2.4.1                     
##  [7] irlba_2.3.3                   vipor_0.4.5                  
##  [9] DBI_1.0.0                     lazyeval_0.2.2               
## [11] colorspace_1.4-1              withr_2.1.2                  
## [13] gridExtra_2.3                 tidyselect_0.2.5             
## [15] bit_1.1-14                    curl_4.2                     
## [17] compiler_4.0.0                BiocNeighbors_1.5.1          
## [19] labeling_0.3                  bookdown_0.16                
## [21] scales_1.1.0                  rappdirs_0.3.1               
## [23] stringr_1.4.0                 digest_0.6.23                
## [25] rmarkdown_1.18                XVector_0.27.0               
## [27] pkgconfig_2.0.3               htmltools_0.4.0              
## [29] dbplyr_1.4.2                  fastmap_1.0.1                
## [31] limma_3.43.0                  rlang_0.4.2                  
## [33] RSQLite_2.1.2                 shiny_1.4.0                  
## [35] DelayedMatrixStats_1.9.0      farver_2.0.1                 
## [37] dplyr_0.8.3                   RCurl_1.95-4.12              
## [39] magrittr_1.5                  BiocSingular_1.3.0           
## [41] GenomeInfoDbData_1.2.2        Matrix_1.2-18                
## [43] Rcpp_1.0.3                    ggbeeswarm_0.6.0             
## [45] munsell_0.5.0                 viridis_0.5.1                
## [47] lifecycle_0.1.0               stringi_1.4.3                
## [49] yaml_2.2.0                    edgeR_3.29.0                 
## [51] zlibbioc_1.33.0               BiocFileCache_1.11.3         
## [53] AnnotationHub_2.19.2          grid_4.0.0                   
## [55] blob_1.2.0                    promises_1.1.0               
## [57] dqrng_0.2.1                   ExperimentHub_1.13.4         
## [59] crayon_1.3.4                  lattice_0.20-38              
## [61] locfit_1.5-9.1                zeallot_0.1.0                
## [63] knitr_1.26                    pillar_1.4.2                 
## [65] igraph_1.2.4.2                glue_1.3.1                   
## [67] BiocVersion_3.11.1            evaluate_0.14                
## [69] BiocManager_1.30.10           vctrs_0.2.0                  
## [71] httpuv_1.5.2                  gtable_0.3.0                 
## [73] purrr_0.3.3                   assertthat_0.2.1             
## [75] xfun_0.11                     rsvd_1.0.2                   
## [77] mime_0.7                      xtable_1.8-4                 
## [79] later_1.0.0                   viridisLite_0.3.0            
## [81] tibble_2.1.3                  AnnotationDbi_1.49.0         
## [83] beeswarm_0.2.3                memoise_1.1.0                
## [85] statmod_1.4.32                interactiveDisplayBase_1.25.0