scran
C++ library for basic single-cell RNA-seq analyses
The libscran library takes core parts of the scran Bioconductor package (as well as other useful bits from other packages) and implements them in C++. The idea is to provide a lightweight library that can be easily embedded into other applications without including the entire R/Bioconductor runtime. For example, we can compile libscran to WebAssembly to perform single-cell analyses in the browser, or we can wrap libscran in an R package for a minimal-dependency version of the basic Bioconductor single-cell analysis stack. The library itself is compatible with any CMake-based build system and can be turned into a fully header-only library for easy deployment.
The example below demonstrates how to use libscran to run a standard analysis of single-cell RNA-seq data.
Each class represents a step in the analysis and has tunable parameters, e.g., RunPCA::set_rank to set the number of PCs. See the reference documentation for more details.
Most of the functions are motivated by the theory in the "Orchestrating Single-Cell Analysis with Bioconductor" book.
Identification and filtering of low-quality cells are performed using an outlier-based approach. The PerCellQCMetrics class will compute common QC metrics, the PerCellQCFilters class will identify filtering thresholds from the distribution of such metrics, and the FilterCells class will apply those filters to the count matrix.
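
To illustrate the outlier-based idea, here is a minimal standalone sketch that does not use libscran at all. It assumes a cell is "low quality" when a QC metric (for example, the number of detected genes) falls more than a chosen number of median absolute deviations (MADs) below the median. The function names `median_of`, `lower_outlier_threshold` and `keep_cells`, and the default of 3 MADs, are assumptions made for this example; the libscran classes may differ in how they define and transform the metrics.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <vector>

// Median of a non-empty vector (taken by value because we sort a copy).
double median_of(std::vector<double> x) {
    assert(!x.empty());
    std::sort(x.begin(), x.end());
    std::size_t n = x.size();
    return (n % 2 == 1) ? x[n / 2] : (x[n / 2 - 1] + x[n / 2]) / 2.0;
}

// Hypothetical helper: threshold = median - nmads * MAD.
// The MAD is scaled by 1.4826 so that it estimates the standard
// deviation when the metric is roughly normally distributed.
double lower_outlier_threshold(const std::vector<double>& metric, double nmads = 3.0) {
    double med = median_of(metric);
    std::vector<double> deviations;
    deviations.reserve(metric.size());
    for (double v : metric) {
        deviations.push_back(std::abs(v - med));
    }
    double mad = 1.4826 * median_of(deviations);
    return med - nmads * mad;
}

// Hypothetical helper: true for cells to keep, false for low outliers.
std::vector<bool> keep_cells(const std::vector<double>& metric, double nmads = 3.0) {
    double threshold = lower_outlier_threshold(metric, nmads);
    std::vector<bool> keep;
    keep.reserve(metric.size());
    for (double v : metric) {
        keep.push_back(v >= threshold);
    }
    return keep;
}
```

The resulting flags could then be used to subset the columns of a count matrix, which is the role that the filtering step plays in the pipeline described above.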
Log-transformed normalized expression values are computed from the count matrix, using size factors derived from the library size. This is performed using the LogNormCounts class.
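
The arithmetic behind this step can be sketched in plain C++ without libscran. The example below assumes that size factors are the per-cell library sizes scaled to have a mean of 1, and that the transformation is log2 with a pseudo-count of 1; both are common conventions rather than confirmed details of LogNormCounts. The `Matrix` alias and the function names are hypothetical.

```cpp
#include <cassert>
#include <cmath>
#include <numeric>
#include <vector>

// Dense matrix for illustration only: counts[g][c], genes in rows, cells in columns.
using Matrix = std::vector<std::vector<double>>;

// Library size factors: per-cell column sums, scaled so that their mean is 1.
// Assumes every cell has a positive library size (e.g., after QC filtering).
std::vector<double> library_size_factors(const Matrix& counts) {
    assert(!counts.empty() && !counts[0].empty());
    std::size_t ncells = counts[0].size();
    std::vector<double> sf(ncells, 0.0);
    for (const auto& gene : counts) {
        assert(gene.size() == ncells);
        for (std::size_t c = 0; c < ncells; ++c) {
            sf[c] += gene[c];
        }
    }
    double mean = std::accumulate(sf.begin(), sf.end(), 0.0) / ncells;
    for (double& s : sf) {
        assert(s > 0.0);
        s /= mean;
    }
    return sf;
}

// Divide each count by its cell's size factor, then apply log2(x + 1).
Matrix log_norm_counts(const Matrix& counts) {
    std::vector<double> sf = library_size_factors(counts);
    Matrix out = counts;
    for (auto& gene : out) {
        for (std::size_t c = 0; c < gene.size(); ++c) {
            gene[c] = std::log2(gene[c] / sf[c] + 1.0);
        }
    }
    return out;
}
```

Scaling the size factors to a mean of 1 keeps the normalized values on roughly the same scale as the original counts, so the pseudo-count has a comparable effect across datasets.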
Variance modelling and selection of highly variable genes (HVGs) are performed on the log-expression values. The ModelGeneVar class will fit a mean-dependent trend to the variances across genes, while the ChooseHVGs class will choose the top set of HVGs based on the residuals from the trend.
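
As a rough illustration of the selection step, the sketch below takes per-gene variances and already-fitted trend values, computes the residuals, and returns the indices of the genes with the largest residuals. It does not fit the trend itself, and the function names `residuals_from_trend` and `top_hvgs` are hypothetical; this is not the ChooseHVGs interface.

```cpp
#include <algorithm>
#include <cassert>
#include <numeric>
#include <vector>

// Residual for each gene = observed variance minus the fitted trend value.
std::vector<double> residuals_from_trend(const std::vector<double>& variances,
                                         const std::vector<double>& fitted) {
    assert(variances.size() == fitted.size());
    std::vector<double> res(variances.size());
    for (std::size_t g = 0; g < res.size(); ++g) {
        res[g] = variances[g] - fitted[g];
    }
    return res;
}

// Indices of the 'top' genes with the largest residuals, in decreasing order.
// Ties are broken by gene index so that the result is deterministic.
std::vector<std::size_t> top_hvgs(const std::vector<double>& residuals, std::size_t top) {
    std::vector<std::size_t> order(residuals.size());
    std::iota(order.begin(), order.end(), std::size_t(0));
    top = std::min(top, order.size());
    std::partial_sort(order.begin(), order.begin() + top, order.end(),
        [&](std::size_t a, std::size_t b) {
            if (residuals[a] != residuals[b]) {
                return residuals[a] > residuals[b];
            }
            return a < b;
        });
    order.resize(top);
    return order;
}
```

Ranking on the residual rather than the raw variance avoids simply picking the most highly expressed genes, since variance tends to change with the mean.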
Principal component analysis is used to compress and denoise the data based on the first few PCs. The RunPCA class will use an approximate PCA algorithm to efficiently compute the top PCs from the HVG-subsetted matrix. Alternatively, the BlockedPCA and MultiBatchPCA classes can be used when dealing with multiple batches.
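
For intuition about what a PC is, the following standalone sketch finds the loading vector of the first PC by power iteration on the centered data. This is a deliberately simple stand-in and not the approximate algorithm used by RunPCA; the `Matrix` alias, the function name `first_pc` and the fixed iteration count are assumptions for this example.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Dense matrix for illustration only: data[c][f], cells in rows, features in columns.
using Matrix = std::vector<std::vector<double>>;

// Unit-length loading vector of the first PC, found by power iteration.
// The matrix is taken by value because it is centered in place.
std::vector<double> first_pc(Matrix data, int iterations = 100) {
    assert(!data.empty() && !data[0].empty());
    std::size_t ncells = data.size();
    std::size_t nfeat = data[0].size();

    // Center each feature so that PCs describe variation around the mean.
    for (std::size_t f = 0; f < nfeat; ++f) {
        double mean = 0.0;
        for (const auto& cell : data) mean += cell[f];
        mean /= ncells;
        for (auto& cell : data) cell[f] -= mean;
    }

    std::vector<double> v(nfeat, 1.0 / std::sqrt(static_cast<double>(nfeat)));
    for (int it = 0; it < iterations; ++it) {
        // w = X^T (X v), which is proportional to the covariance matrix times v.
        std::vector<double> w(nfeat, 0.0);
        for (const auto& cell : data) {
            double score = 0.0;
            for (std::size_t f = 0; f < nfeat; ++f) score += cell[f] * v[f];
            for (std::size_t f = 0; f < nfeat; ++f) w[f] += score * cell[f];
        }
        double norm = 0.0;
        for (double x : w) norm += x * x;
        norm = std::sqrt(norm);
        if (norm == 0.0) break;  // no variance along v; keep the current vector
        for (std::size_t f = 0; f < nfeat; ++f) v[f] = w[f] / norm;
    }
    return v;
}
```

Per-cell PC scores are then the dot products of each centered cell with the loading vector, and it is scores of this kind that the downstream clustering step operates on.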
Clustering of cells is performed using the per-cell PC scores. We provide several flavors of graph-based clustering from a shared nearest-neighbor graph, using community detection algorithms such as multi-level (ClusterSnnGraphMultiLevel), Leiden (ClusterSnnGraphLeiden) or Walktrap clustering (ClusterSnnGraphWalktrap). Developers can also easily apply other algorithms, e.g., k-means.
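
The notion of a shared nearest-neighbor graph can be sketched as follows: two cells are connected with a weight that reflects how many of their nearest neighbors they have in common. The example below uses a Jaccard-style weight, which is only one of several possible weighting schemes and is not necessarily the default in libscran. The function name `jaccard_snn_weight` and the input layout are hypothetical, and the neighbor lists are assumed to be precomputed, free of duplicates, and to exclude the cell itself.

```cpp
#include <algorithm>
#include <cassert>
#include <iterator>
#include <vector>

// neighbors[i] holds the indices of the k nearest neighbors of cell i.
// Returns |A ∩ B| / |A ∪ B|, where A and B are the neighbor sets of
// cells i and j, each including the cell itself.
double jaccard_snn_weight(const std::vector<std::vector<int>>& neighbors,
                          std::size_t i, std::size_t j) {
    assert(i < neighbors.size() && j < neighbors.size());
    std::vector<int> a = neighbors[i];
    std::vector<int> b = neighbors[j];
    a.push_back(static_cast<int>(i));  // each cell counts as its own neighbor
    b.push_back(static_cast<int>(j));
    std::sort(a.begin(), a.end());
    std::sort(b.begin(), b.end());

    std::vector<int> shared;
    std::set_intersection(a.begin(), a.end(), b.begin(), b.end(),
                          std::back_inserter(shared));
    double n_union = static_cast<double>(a.size() + b.size() - shared.size());
    return static_cast<double>(shared.size()) / n_union;
}
```

Edges with non-zero weight would form the graph that a community detection algorithm then partitions into clusters.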
Per-cluster marker detection is performed based on pairwise comparisons between clusters. The ScoreMarkers class will aggregate the set of pairwise comparisons into a single suite of summary statistics for each cluster. Users can then rank by a statistic of interest to obtain a marker listing for each cluster.
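
The aggregation of pairwise comparisons can be illustrated for a single gene. The sketch below assumes that an effect size (for example, a log-fold change) has already been computed for every ordered pair of clusters, and collapses each cluster's comparisons into a minimum, mean and maximum. The `EffectSummary` struct and `summarize_pairwise` function are hypothetical names for this example, and the choice of summaries is an assumption rather than a description of the ScoreMarkers output.

```cpp
#include <algorithm>
#include <cassert>
#include <limits>
#include <vector>

// Summary of one cluster's comparisons against all other clusters.
struct EffectSummary {
    double min_effect;
    double mean_effect;
    double max_effect;
};

// effects[a][b] is the effect size for cluster a versus cluster b for one gene.
// Diagonal entries are ignored.
std::vector<EffectSummary> summarize_pairwise(const std::vector<std::vector<double>>& effects) {
    std::size_t n = effects.size();
    assert(n >= 2);
    std::vector<EffectSummary> out;
    out.reserve(n);
    for (std::size_t a = 0; a < n; ++a) {
        assert(effects[a].size() == n);
        double lo = std::numeric_limits<double>::infinity();
        double hi = -std::numeric_limits<double>::infinity();
        double sum = 0.0;
        for (std::size_t b = 0; b < n; ++b) {
            if (a == b) continue;
            lo = std::min(lo, effects[a][b]);
            hi = std::max(hi, effects[a][b]);
            sum += effects[a][b];
        }
        out.push_back(EffectSummary{lo, sum / static_cast<double>(n - 1), hi});
    }
    return out;
}
```

Ranking genes by the minimum favors markers that are upregulated against every other cluster, whereas ranking by the mean or maximum is more lenient towards genes that distinguish a cluster from only some of the others.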
The output of PCA is also directly compatible with UMAP and t-SNE C++ implementations. Readers are referred to the documentation for those libraries for more details.
Compile the minimal.cpp example by running the following commands at the root of the libscran directory:
Download and decompress a Matrix Market file containing a scRNA-seq count matrix:
Run the minimal pipeline:
If you're using CMake, you just need to add something like this to your CMakeLists.txt:
Then you can link to libscran to make the headers available during compilation:
Developers are responsible for linking to the igraph C library themselves, either with find_package() or FetchContent. We expect igraph versions from the 0.10 series; see tests/CMakeLists.txt for the specific version being tested.