scran
C++ library for basic single-cell RNA-seq analyses
Loading...
Searching...
No Matches
Classes | Public Member Functions | Static Public Member Functions | List of all members
scran::ScaleByNeighbors Class Reference

Scale multi-modal embeddings to adjust for differences in variance. More...

#include <ScaleByNeighbors.hpp>

Classes

struct  Defaults
 Default parameter settings. More...
 

Public Member Functions

ScaleByNeighborsset_neighbors (int n=Defaults::neighbors)
 
ScaleByNeighborsset_approximate (int a=Defaults::approximate)
 
ScaleByNeighborsset_num_threads (int n=Defaults::num_threads)
 
std::pair< double, double > compute_distance (int ndim, size_t ncells, const double *data) const
 
template<class Search >
std::pair< double, double > compute_distance (const Search *search) const
 

Static Public Member Functions

static std::vector< double > compute_scale (const std::vector< std::pair< double, double > > &distances)
 
static double compute_scale (const std::pair< double, double > &ref, const std::pair< double, double > &target)
 
template<typename Embed >
static void combine_scaled_embeddings (const std::vector< int > &ndims, size_t ncells, const std::vector< Embed > &embeddings, const std::vector< double > &scaling, double *output)
 

Detailed Description

Scale multi-modal embeddings to adjust for differences in variance.

The premise is that we have multiple embeddings for the same set of cells, usually generated from different data modalities (e.g., RNA, protein, and so on). We would like to combine these embeddings into a single embedding for downstream analyses such as clustering and t-SNE/UMAP. The easiest combining strategy is to just concatenate the matrices together into a single embedding that contains information from all modalities. However, this is complicated by the differences in the variance between modalities, whereby higher noise in one modality might drown out biological signal in another embedding.

This class implements a scaling approach to equalize noise across embeddings prior to concatenation. We compute the median distance to the $k$-th nearest neighbor in each embedding; this is used as a proxy for the noise within a subpopulation containing at least $k$ cells. We then compute a scaling factor as the ratio of the medians for a "target" embedding compared to a "reference" embedding. The idea is to scale the target embedding by this factor before concatenation of matrices.

This approach aims to remove differences in the magnitude of noise while preserving genuine differences in biological signal. Embedding-specific subpopulations are preserved in the concatenation, provided that the difference between subpopulations is greater than that within subpopulations. By contrast, a naive scaling based on the total variance would penalize embeddings with strong biological heterogeneity.

Member Function Documentation

◆ set_neighbors()

ScaleByNeighbors & scran::ScaleByNeighbors::set_neighbors ( int  n = Defaults::neighbors)
inline
Parameters
nNumber of neighbors used in the nearest neighbor search. This can be interpreted as the minimum size of each subpopulation.
Returns
A reference to this ScaleByNeighbors instance.

◆ set_approximate()

ScaleByNeighbors & scran::ScaleByNeighbors::set_approximate ( int  a = Defaults::approximate)
inline
Parameters
aWhether to perform an approximate neighbor search.
Returns
A reference to this ScaleByNeighbors instance.

This parameter only has an effect when an array is passed to run(), otherwise the search uses the algorithm specified by the knncolle class.

◆ set_num_threads()

ScaleByNeighbors & scran::ScaleByNeighbors::set_num_threads ( int  n = Defaults::num_threads)
inline
Parameters
nNumber of threads to use.
Returns
A reference to this ScaleByNeighbors object.

◆ compute_distance() [1/2]

std::pair< double, double > scran::ScaleByNeighbors::compute_distance ( int  ndim,
size_t  ncells,
const double *  data 
) const
inline
Parameters
ndimNumber of dimensions in the embedding.
ncellsNumber of cells in the embedding.
[in]dataPointer to an array containing the embedding matrix. This should be stored in column-major layout where each row is a dimension and each column is a cell.
Returns
Pair containing the median distance to the $k$-th nearest neighbor (first) and the root-mean-squared distance across all cells (second). These values can be used in compute_scale().

◆ compute_distance() [2/2]

template<class Search >
std::pair< double, double > scran::ScaleByNeighbors::compute_distance ( const Search *  search) const
inline
Template Parameters
SearchSearch index class, typically a knncolle::Base subclass.
Parameters
searchSearch index for the embedding.
Returns
Pair containing the median distance to the $k$-th nearest neighbor (first) and the root-mean-squared distance across all cells (second). These values can be used in compute_scale().

◆ compute_scale() [1/2]

static std::vector< double > scran::ScaleByNeighbors::compute_scale ( const std::vector< std::pair< double, double > > &  distances)
inlinestatic

Compute the scaling factors for a group of embeddings, given the neighbor distances computed by compute_distance(). This aims to scale each embedding so that the within-population variances are equal across embeddings. The "reference" embedding is defined as the first embedding with a non-zero RMSD; other than this requirement, the exact choice of reference has no actual impact on the relative values of the scaling factors.

Parameters
distancesVector of distances for embeddings, as computed by compute_scale() on each embedding.
Returns
Vector of scaling factors of length equal to that of distances, to be applied to each embedding.

◆ compute_scale() [2/2]

static double scran::ScaleByNeighbors::compute_scale ( const std::pair< double, double > &  ref,
const std::pair< double, double > &  target 
)
inlinestatic

Compute the scaling factor to be applied to a target embedding relative to a reference. This aims to scale the target so that the within-population variance is equal to that of the reference.

Parameters
refOutput of compute_distance() for the reference embedding. The first value contains the median distance while the second value contains the RMSD.
targetOutput of compute_distance() for the target embedding.
Returns
A scaling factor to apply to the target embedding. Scaling all values in the target matrix by the factor will equalize the magnitude of the noise to that of the reference embedding.

If either of the median distances is zero, this function automatically switches to the root-mean-square-distance to the $k$-th neighbor. The scaling factor is then defined as the ratio of the RMSDs between embeddings. If the reference RMSDs is zero, this function will return zero; if the target RMSD is zero, this function will return positive infinity.

◆ combine_scaled_embeddings()

template<typename Embed >
static void scran::ScaleByNeighbors::combine_scaled_embeddings ( const std::vector< int > &  ndims,
size_t  ncells,
const std::vector< Embed > &  embeddings,
const std::vector< double > &  scaling,
double *  output 
)
inlinestatic

Combine multiple embeddings into a single embedding matrix, possibly after scaling each embedding. This is done row-wise, i.e., the coordinates are concatenated across embeddings for each column.

Template Parameters
EmbedPointer type to the input embeddings.
Parameters
ndimsVector containing the number of dimensions in each embedding.
ncellsNumber of cells in each embedding.
embeddingsVector of pointers of length equal to that of ndims. Each pointer refers to an array containing an embedding matrix, which should be in column-major format with dimensions in rows and cells in columns. The number of rows should be equal to the corresponding element in ndims and the number of columns should be equal to ncells.
scalingScaling to apply to each embedding, usually from compute_scale(). This should be of length equal to that of ndims.
[out]outputPointer to the output array. This should be of length equal to the product of ncells and the sum of ndims. On completion, output is filled with the combined embeddings in column-major format. Each row corresponds to a dimension while each column corresponds to a cell.

The documentation for this class was generated from the following file: