Scale multi-modal embeddings to adjust for differences in variance. More...

#include <ScaleByNeighbors.hpp>

Classes
struct	Defaults
	Default parameter settings. More...

Public Member Functions
ScaleByNeighbors &	set_neighbors (int n=Defaults::neighbors)

ScaleByNeighbors &	set_approximate (int a=Defaults::approximate)

ScaleByNeighbors &	set_num_threads (int n=Defaults::num_threads)

std::pair< double, double >	compute_distance (int ndim, size_t ncells, const double *data) const

template<class Search >
std::pair< double, double >	compute_distance (const Search *search) const

Static Public Member Functions
static std::vector< double >	compute_scale (const std::vector< std::pair< double, double > > &distances)

static double	compute_scale (const std::pair< double, double > &ref, const std::pair< double, double > &target)

template<typename Embed >
static void	combine_scaled_embeddings (const std::vector< int > &ndims, size_t ncells, const std::vector< Embed > &embeddings, const std::vector< double > &scaling, double *output)

Detailed Description

Scale multi-modal embeddings to adjust for differences in variance.

The premise is that we have multiple embeddings for the same set of cells, usually generated from different data modalities (e.g., RNA, protein, and so on). We would like to combine these embeddings into a single embedding for downstream analyses such as clustering and t-SNE/UMAP. The easiest combining strategy is to just concatenate the matrices together into a single embedding that contains information from all modalities. However, this is complicated by the differences in the variance between modalities, whereby higher noise in one modality might drown out biological signal in another embedding.

This class implements a scaling approach to equalize noise across embeddings prior to concatenation. We compute the median distance to the $k$-th nearest neighbor in each embedding; this is used as a proxy for the noise within a subpopulation containing at least $k$ cells. We then compute a scaling factor as the ratio of the medians for a "target" embedding compared to a "reference" embedding. The idea is to scale the target embedding by this factor before concatenation of matrices.

This approach aims to remove differences in the magnitude of noise while preserving genuine differences in biological signal. Embedding-specific subpopulations are preserved in the concatenation, provided that the difference between subpopulations is greater than that within subpopulations. By contrast, a naive scaling based on the total variance would penalize embeddings with strong biological heterogeneity.

Member Function Documentation

◆ set_neighbors()

ScaleByNeighbors & scran::ScaleByNeighbors::set_neighbors ( int n = Defaults::neighbors )

inline

Parameters

n	Number of neighbors used in the nearest neighbor search. This can be interpreted as the minimum size of each subpopulation.

Returns: A reference to this ScaleByNeighbors instance.

◆ set_approximate()

ScaleByNeighbors & scran::ScaleByNeighbors::set_approximate ( int a = Defaults::approximate )

inline

Parameters

a	Whether to perform an approximate neighbor search.

Returns: A reference to this ScaleByNeighbors instance.

This parameter only has an effect when an array is passed to run(), otherwise the search uses the algorithm specified by the knncolle class.

◆ set_num_threads()

ScaleByNeighbors & scran::ScaleByNeighbors::set_num_threads ( int n = Defaults::num_threads )

inline

Parameters

n	Number of threads to use.

Returns: A reference to this ScaleByNeighbors object.

◆ compute_distance() [1/2]

std::pair< double, double > scran::ScaleByNeighbors::compute_distance	(	int	ndim,
		size_t	ncells,
		const double *	data
	)		const

inline

Parameters

	ndim	Number of dimensions in the embedding.
	ncells	Number of cells in the embedding.
[in]	data	Pointer to an array containing the embedding matrix. This should be stored in column-major layout where each row is a dimension and each column is a cell.

Returns: Pair containing the median distance to the $k$-th nearest neighbor (first) and the root-mean-squared distance across all cells (second). These values can be used in compute_scale().

◆ compute_distance() [2/2]

template<class Search >

std::pair< double, double > scran::ScaleByNeighbors::compute_distance ( const Search * search ) const

inline

Template Parameters

Search Search index class, typically a knncolle::Base subclass.

Parameters

search Search index for the embedding.

Returns: Pair containing the median distance to the $k$-th nearest neighbor (first) and the root-mean-squared distance across all cells (second). These values can be used in compute_scale().

◆ compute_scale() [1/2]

static std::vector< double > scran::ScaleByNeighbors::compute_scale ( const std::vector< std::pair< double, double > > & distances )

inlinestatic

Compute the scaling factors for a group of embeddings, given the neighbor distances computed by compute_distance(). This aims to scale each embedding so that the within-population variances are equal across embeddings. The "reference" embedding is defined as the first embedding with a non-zero RMSD; other than this requirement, the exact choice of reference has no actual impact on the relative values of the scaling factors.

Parameters

distances Vector of distances for embeddings, as computed by compute_scale() on each embedding.

Returns: Vector of scaling factors of length equal to that of distances, to be applied to each embedding.

◆ compute_scale() [2/2]

static double scran::ScaleByNeighbors::compute_scale	(	const std::pair< double, double > &	ref,
		const std::pair< double, double > &	target
	)

inlinestatic

Compute the scaling factor to be applied to a target embedding relative to a reference. This aims to scale the target so that the within-population variance is equal to that of the reference.

Parameters

ref	Output of `compute_distance()` for the reference embedding. The first value contains the median distance while the second value contains the RMSD.
target	Output of `compute_distance()` for the target embedding.

Returns: A scaling factor to apply to the target embedding. Scaling all values in the target matrix by the factor will equalize the magnitude of the noise to that of the reference embedding.

If either of the median distances is zero, this function automatically switches to the root-mean-square-distance to the $k$-th neighbor. The scaling factor is then defined as the ratio of the RMSDs between embeddings. If the reference RMSDs is zero, this function will return zero; if the target RMSD is zero, this function will return positive infinity.

◆ combine_scaled_embeddings()

template<typename Embed >

static void scran::ScaleByNeighbors::combine_scaled_embeddings	(	const std::vector< int > &	ndims,
		size_t	ncells,
		const std::vector< Embed > &	embeddings,
		const std::vector< double > &	scaling,
		double *	output
	)

inlinestatic

Combine multiple embeddings into a single embedding matrix, possibly after scaling each embedding. This is done row-wise, i.e., the coordinates are concatenated across embeddings for each column.

Template Parameters

Embed Pointer type to the input embeddings.

Parameters

	ndims	Vector containing the number of dimensions in each embedding.
	ncells	Number of cells in each embedding.
	embeddings	Vector of pointers of length equal to that of `ndims`. Each pointer refers to an array containing an embedding matrix, which should be in column-major format with dimensions in rows and cells in columns. The number of rows should be equal to the corresponding element in `ndims` and the number of columns should be equal to `ncells`.
	scaling	Scaling to apply to each embedding, usually from `compute_scale()`. This should be of length equal to that of `ndims`.
[out]	output	Pointer to the output array. This should be of length equal to the product of `ncells` and the sum of `ndims`. On completion, `output` is filled with the combined embeddings in column-major format. Each row corresponds to a dimension while each column corresponds to a cell.

The documentation for this class was generated from the following file:

scran/dimensionality_reduction/ScaleByNeighbors.hpp

Classes

Public Member Functions

Static Public Member Functions

Detailed Description

Member Function Documentation

◆ set_neighbors()

◆ set_approximate()

◆ set_num_threads()

◆ compute_distance() [1/2]

◆ compute_distance() [2/2]

◆ compute_scale() [1/2]

◆ compute_scale() [2/2]

◆ combine_scaled_embeddings()