scran
C++ library for basic single-cell RNA-seq analyses
|
Scale multi-modal embeddings to adjust for differences in variance. More...
#include <ScaleByNeighbors.hpp>
Classes | |
struct | Defaults |
Default parameter settings. More... | |
Public Member Functions | |
ScaleByNeighbors & | set_neighbors (int n=Defaults::neighbors) |
ScaleByNeighbors & | set_approximate (int a=Defaults::approximate) |
ScaleByNeighbors & | set_num_threads (int n=Defaults::num_threads) |
std::pair< double, double > | compute_distance (int ndim, size_t ncells, const double *data) const |
template<class Search > | |
std::pair< double, double > | compute_distance (const Search *search) const |
Static Public Member Functions | |
static std::vector< double > | compute_scale (const std::vector< std::pair< double, double > > &distances) |
static double | compute_scale (const std::pair< double, double > &ref, const std::pair< double, double > &target) |
template<typename Embed > | |
static void | combine_scaled_embeddings (const std::vector< int > &ndims, size_t ncells, const std::vector< Embed > &embeddings, const std::vector< double > &scaling, double *output) |
Scale multi-modal embeddings to adjust for differences in variance.
The premise is that we have multiple embeddings for the same set of cells, usually generated from different data modalities (e.g., RNA, protein, and so on). We would like to combine these embeddings into a single embedding for downstream analyses such as clustering and t-SNE/UMAP. The easiest combining strategy is to just concatenate the matrices together into a single embedding that contains information from all modalities. However, this is complicated by the differences in the variance between modalities, whereby higher noise in one modality might drown out biological signal in another embedding.
This class implements a scaling approach to equalize noise across embeddings prior to concatenation. We compute the median distance to the $k$-th nearest neighbor in each embedding; this is used as a proxy for the noise within a subpopulation containing at least $k$ cells. We then compute a scaling factor as the ratio of the medians for a "target" embedding compared to a "reference" embedding. The idea is to scale the target embedding by this factor before concatenation of matrices.
This approach aims to remove differences in the magnitude of noise while preserving genuine differences in biological signal. Embedding-specific subpopulations are preserved in the concatenation, provided that the difference between subpopulations is greater than that within subpopulations. By contrast, a naive scaling based on the total variance would penalize embeddings with strong biological heterogeneity.
|
inline |
n | Number of neighbors used in the nearest neighbor search. This can be interpreted as the minimum size of each subpopulation. |
ScaleByNeighbors
instance.
|
inline |
a | Whether to perform an approximate neighbor search. |
ScaleByNeighbors
instance.This parameter only has an effect when an array is passed to run()
, otherwise the search uses the algorithm specified by the knncolle class.
|
inline |
n | Number of threads to use. |
ScaleByNeighbors
object.
|
inline |
ndim | Number of dimensions in the embedding. | |
ncells | Number of cells in the embedding. | |
[in] | data | Pointer to an array containing the embedding matrix. This should be stored in column-major layout where each row is a dimension and each column is a cell. |
compute_scale()
.
|
inline |
Search | Search index class, typically a knncolle::Base subclass. |
search | Search index for the embedding. |
compute_scale()
.
|
inlinestatic |
Compute the scaling factors for a group of embeddings, given the neighbor distances computed by compute_distance()
. This aims to scale each embedding so that the within-population variances are equal across embeddings. The "reference" embedding is defined as the first embedding with a non-zero RMSD; other than this requirement, the exact choice of reference has no actual impact on the relative values of the scaling factors.
distances | Vector of distances for embeddings, as computed by compute_scale() on each embedding. |
distances
, to be applied to each embedding.
|
inlinestatic |
Compute the scaling factor to be applied to a target embedding relative to a reference. This aims to scale the target so that the within-population variance is equal to that of the reference.
ref | Output of compute_distance() for the reference embedding. The first value contains the median distance while the second value contains the RMSD. |
target | Output of compute_distance() for the target embedding. |
If either of the median distances is zero, this function automatically switches to the root-mean-square-distance to the $k$-th neighbor. The scaling factor is then defined as the ratio of the RMSDs between embeddings. If the reference RMSDs is zero, this function will return zero; if the target RMSD is zero, this function will return positive infinity.
|
inlinestatic |
Combine multiple embeddings into a single embedding matrix, possibly after scaling each embedding. This is done row-wise, i.e., the coordinates are concatenated across embeddings for each column.
Embed | Pointer type to the input embeddings. |
ndims | Vector containing the number of dimensions in each embedding. | |
ncells | Number of cells in each embedding. | |
embeddings | Vector of pointers of length equal to that of ndims . Each pointer refers to an array containing an embedding matrix, which should be in column-major format with dimensions in rows and cells in columns. The number of rows should be equal to the corresponding element in ndims and the number of columns should be equal to ncells . | |
scaling | Scaling to apply to each embedding, usually from compute_scale() . This should be of length equal to that of ndims . | |
[out] | output | Pointer to the output array. This should be of length equal to the product of ncells and the sum of ndims . On completion, output is filled with the combined embeddings in column-major format. Each row corresponds to a dimension while each column corresponds to a cell. |