scran
C++ library for basic single-cell RNA-seq analyses
Compute PCA after adjusting for differences between batch sizes.
#include <MultiBatchPca.hpp>
Classes
struct Defaults | Default parameter settings.
struct Results | Container for the PCA results.
In multi-batch scenarios, we may wish to compute a PCA involving data from multiple batches. However, if one batch has many more cells, it will dominate the PCA by driving the definition of the rotation vectors. This may mask interesting aspects of variation in the smaller batches. To overcome this problem, we scale each batch in inverse proportion to its size. This ensures that each batch contributes equally to the (conceptual) gene-gene covariance matrix, the eigenvectors of which are used as the rotation vectors. Cells are then projected to the subspace defined by these rotation vectors to obtain PC coordinates.
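The effect of this weighting can be sketched on a small dense matrix. This is a minimal illustration under stated assumptions, not the library's implementation: the helper name `equal_weight_covariance`, the column-major layout, and the choice to give each cell a weight of 1/N_b (where N_b is the size of its batch) are all hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper, not part of scran: weighted gene-gene covariance in
// which every batch has the same total weight regardless of its size.
// 'mat' is column-major, ngenes x ncells; 'batch' holds 0-based batch IDs.
std::vector<double> equal_weight_covariance(const std::vector<double>& mat,
                                            std::size_t ngenes,
                                            std::size_t ncells,
                                            const std::vector<int>& batch,
                                            std::size_t nbatches) {
    assert(mat.size() == ngenes * ncells);
    assert(batch.size() == ncells);

    // Count the cells in each batch.
    std::vector<double> batch_size(nbatches, 0.0);
    for (int b : batch) {
        assert(b >= 0 && static_cast<std::size_t>(b) < nbatches);
        batch_size[static_cast<std::size_t>(b)] += 1.0;
    }

    // Each cell gets weight 1/N_b, so the weights within a batch sum to 1.
    std::vector<double> weight(ncells);
    double total = 0.0;
    for (std::size_t c = 0; c < ncells; ++c) {
        weight[c] = 1.0 / batch_size[static_cast<std::size_t>(batch[c])];
        total += weight[c];
    }

    // Weighted mean of each gene.
    std::vector<double> center(ngenes, 0.0);
    for (std::size_t c = 0; c < ncells; ++c)
        for (std::size_t g = 0; g < ngenes; ++g)
            center[g] += weight[c] * mat[g + c * ngenes];
    for (double& m : center) m /= total;

    // Weighted covariance, stored as a column-major ngenes x ngenes matrix.
    std::vector<double> cov(ngenes * ngenes, 0.0);
    for (std::size_t c = 0; c < ncells; ++c) {
        for (std::size_t i = 0; i < ngenes; ++i) {
            double di = mat[i + c * ngenes] - center[i];
            for (std::size_t j = 0; j < ngenes; ++j) {
                double dj = mat[j + c * ngenes] - center[j];
                cov[i + j * ngenes] += weight[c] * di * dj / total;
            }
        }
    }
    return cov;
}
```

For one gene with three cells at 0 in one batch and a single cell at 2 in another, this sketch gives a variance of 1 instead of the unweighted (population) variance of 0.75, because the lone cell counts as much as the other three together.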
Alternatively, we can compute rotation vectors from the residuals, i.e., after centering each batch. The gene-gene covariance matrix will thus focus on variation within each batch, ensuring that the top PCs capture biological heterogeneity instead of batch effects. (This is particularly important in applications with many batches, where batch effects might otherwise displace biology from the top PCs.) However, unlike ResidualPca, the residuals are only used here for calculating the rotation vectors. We still project the input matrix to obtain the PCs, so batch effects will likely still be present (though hopefully less pronounced) and must be removed with methods like MNN correction.
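The centering step can be sketched as follows. This is an illustrative helper under assumed conventions (column-major dense storage, a hypothetical name `center_within_batches`), not the library's code. In the scheme described above, only the rotation vectors would be derived from the returned residuals; the PC coordinates would still come from projecting the original matrix.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper, not part of scran: subtract each batch's per-gene mean
// from a column-major ngenes x ncells matrix and return the residuals.
std::vector<double> center_within_batches(const std::vector<double>& mat,
                                          std::size_t ngenes,
                                          std::size_t ncells,
                                          const std::vector<int>& batch,
                                          std::size_t nbatches) {
    assert(mat.size() == ngenes * ncells);
    assert(batch.size() == ncells);

    // Accumulate per-batch sums for each gene, plus the batch sizes.
    std::vector<double> means(ngenes * nbatches, 0.0);
    std::vector<double> batch_size(nbatches, 0.0);
    for (std::size_t c = 0; c < ncells; ++c) {
        assert(batch[c] >= 0 && static_cast<std::size_t>(batch[c]) < nbatches);
        std::size_t b = static_cast<std::size_t>(batch[c]);
        batch_size[b] += 1.0;
        for (std::size_t g = 0; g < ngenes; ++g)
            means[g + b * ngenes] += mat[g + c * ngenes];
    }

    // Convert the sums to means, skipping any empty batch.
    for (std::size_t b = 0; b < nbatches; ++b) {
        if (batch_size[b] == 0) continue;
        for (std::size_t g = 0; g < ngenes; ++g)
            means[g + b * ngenes] /= batch_size[b];
    }

    // Residual = observed value minus the mean of the cell's own batch.
    std::vector<double> resid(mat.size());
    for (std::size_t c = 0; c < ncells; ++c) {
        std::size_t b = static_cast<std::size_t>(batch[c]);
        for (std::size_t g = 0; g < ngenes; ++g)
            resid[g + c * ngenes] = mat[g + c * ngenes] - means[g + b * ngenes];
    }
    return resid;
}
```

Two batches with values (1, 3) and (10, 14) for a single gene yield residuals (-1, 1) and (-2, 2): the shift between batches disappears from the residuals but remains in the original values that get projected.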
Finally, we can combine these mechanisms to compute rotation vectors from residuals with equal weighting. This gives us the benefits of both approaches as described above.
r | Number of PCs to compute. This should be no greater than the maximum number of PCs, i.e., the smaller dimension of the input matrix; otherwise, only the maximum number of PCs will be reported in the Results.
Returns: A reference to this MultiBatchPca instance.
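The cap on the number of PCs described for this parameter amounts to taking a minimum over the requested rank and the two matrix dimensions. A minimal sketch, with `effective_rank` being a hypothetical name introduced only for illustration:

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>

// Hypothetical helper, not part of scran: the number of PCs that would be
// reported for a requested rank, given the dimensions of the input matrix.
std::size_t effective_rank(std::size_t requested,
                           std::size_t ngenes,
                           std::size_t ncells) {
    return std::min(requested, std::min(ngenes, ncells));
}
```

For example, requesting 50 PCs from a matrix with 2000 genes and 30 cells would leave 30 PCs under this rule.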
s | Should genes be scaled to unit variance?
Returns: A reference to this MultiBatchPca instance.
t | Should the PC matrix be transposed on output? If true, the output matrix is column-major with cells in the columns, which is compatible with downstream libscran steps.
Returns: A reference to this MultiBatchPca instance.
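To make the transposed layout concrete, the sketch below reads one coordinate from a flat column-major buffer with PCs in the rows and cells in the columns. The accessor name `pc_coordinate` and the use of a `std::vector<double>` buffer are assumptions for illustration; the actual container used by Results is not described here.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical accessor, not part of scran: coordinate of cell 'c' on PC 'p'
// in a column-major buffer with 'npcs' rows (PCs) and one column per cell.
double pc_coordinate(const std::vector<double>& pcs,
                     std::size_t npcs,
                     std::size_t p,
                     std::size_t c) {
    assert(p < npcs);
    assert(p + c * npcs < pcs.size());
    return pcs[p + c * npcs];
}
```

In this layout all coordinates of a given cell are contiguous in memory, which suits downstream steps that iterate over cells.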
u | Whether to compute the rotation vectors from the residuals after centering each batch.
Returns: A reference to this MultiBatchPca instance.
w | Policy to use for weighting batches of different size.
Returns: A reference to this MultiBatchPca instance.
v | Parameters for the variable block weights, see variable_block_weight() for more details. Only used when the block weight policy is set to WeightPolicy::VARIABLE.
Returns: A reference to this MultiBatchPca instance.
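This page does not spell out how the variable weights are computed, so the following is only one plausible form, assuming a weight that is 0 below a lower bound on the batch size, 1 above an upper bound, and linear in between. The struct, its field names, its default values, and the function name are all hypothetical; consult variable_block_weight() for the actual behavior.

```cpp
#include <cassert>

// Hypothetical parameters; names and defaults are assumptions for this sketch.
struct SketchWeightParams {
    double lower_bound = 0;
    double upper_bound = 1000;
};

// Weight for a batch of 'size' cells: 0 below the lower bound (or for an
// empty batch), 1 above the upper bound, and a linear ramp in between.
double sketch_variable_weight(double size, const SketchWeightParams& p) {
    assert(p.lower_bound < p.upper_bound);
    if (size < p.lower_bound || size == 0) return 0;
    if (size > p.upper_bound) return 1;
    return (size - p.lower_bound) / (p.upper_bound - p.lower_bound);
}
```

Under these assumed defaults, a batch of 500 cells would receive half the weight of any batch with 1000 or more cells, so very large batches stop gaining influence beyond the upper bound.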
r | Should the rotation matrix be returned in the output?
Returns: A reference to this MultiBatchPca instance.
r | Should the center vector be returned in the output?
Returns: A reference to this MultiBatchPca instance.
r | Should the scale vector be returned in the output?
Returns: A reference to this MultiBatchPca instance.
n | Number of threads to use.
Returns: A reference to this MultiBatchPca instance.
Run the multi-batch PCA on an input gene-by-cell matrix.
T | Floating point type for the data.
IDX | Integer type for the indices.
Batch | Integer type for the batch assignments.
[in] mat | Pointer to the input matrix. Columns should contain cells while rows should contain genes.
[in] batch | Pointer to an array of length equal to the number of cells. This should contain a 0-based batch assignment for each cell, i.e., for n batches, batch identities should run from 0 to n-1 with at least one entry for each batch.
Returns: A Results object containing the PCs and the variance explained.
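The requirements on the batch array can be checked before running the PCA. The sketch below is a hypothetical standalone helper (`count_batches` is not a scran function) that assumes `int` batch IDs; it rejects negative IDs and any batch between 0 and n-1 that has no cells, and returns the batch sizes otherwise.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <vector>

// Hypothetical helper, not part of scran: verify that batch IDs are 0-based
// and that every batch from 0 to n-1 has at least one cell.
// Returns the number of cells in each batch.
std::vector<std::size_t> count_batches(const std::vector<int>& batch) {
    int max_id = -1;
    for (int b : batch) {
        if (b < 0) throw std::invalid_argument("batch IDs must be non-negative");
        if (b > max_id) max_id = b;
    }

    // One counter per batch, from 0 up to the largest observed ID.
    std::vector<std::size_t> sizes(static_cast<std::size_t>(max_id + 1), 0);
    for (int b : batch) ++sizes[static_cast<std::size_t>(b)];

    // A zero count means an ID was skipped, e.g. {0, 2} with no cells in 1.
    std::size_t total = 0;
    for (std::size_t n : sizes) {
        if (n == 0) throw std::invalid_argument("every batch needs at least one cell");
        total += n;
    }
    assert(total == batch.size());
    return sizes;
}
```

Running such a check up front turns a malformed batch vector into an explicit error before any computation starts.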
Run the multi-batch PCA on an input gene-by-cell matrix after filtering for genes of interest. We typically use the set of highly variable genes from ChooseHVGs, with the aim being to improve computational efficiency and avoid random noise by removing lowly variable genes.
T | Floating point type for the data.
IDX | Integer type for the indices.
Batch | Integer type for the batch assignments.
X | Integer type for the feature filter.
[in] mat | Pointer to the input matrix. Columns should contain cells while rows should contain genes.
[in] batch | Pointer to an array of length equal to the number of cells. This should contain a 0-based batch assignment for each cell, i.e., for n batches, batch identities should run from 0 to n-1 with at least one entry for each batch.
[in] features | Pointer to an array of length equal to the number of genes. Each entry is treated as a boolean specifying whether the corresponding gene should be used in the PCA.
Returns: A Results object containing the PCs and the variance explained.
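The interpretation of the feature filter can be sketched as a row selection. The helper below is hypothetical (`selected_genes` is not a scran function) and only shows how an integer array treated as booleans maps to the row indices that would enter the PCA; how the library subsets the matrix internally is not described here.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper, not part of scran: list the row indices of the genes
// whose filter entry is non-zero, i.e., true when treated as a boolean.
template <typename X>
std::vector<std::size_t> selected_genes(const X* features, std::size_t ngenes) {
    assert(features != nullptr || ngenes == 0);
    std::vector<std::size_t> keep;
    for (std::size_t g = 0; g < ngenes; ++g)
        if (features[g]) keep.push_back(g);
    return keep;
}
```

Because any non-zero entry counts as true, the filter array can hold 0/1 flags of any integer type.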