scran
C++ library for basic single-cell RNA-seq analyses
scran::DownsampleByNeighbors Class Reference

Downsample a dataset based on its neighbors.

#include <DownsampleByNeighbors.hpp>

Classes

struct  Defaults
 Default parameter settings.
 

Public Member Functions

DownsampleByNeighbors & set_num_neighbors (int k=Defaults::num_neighbors)
 
DownsampleByNeighbors & set_num_threads (int n=Defaults::num_threads)
 
DownsampleByNeighbors & set_approximate (int a=Defaults::approximate)
 
template<typename Index , typename Float >
std::vector< Index > run (const std::vector< std::vector< std::pair< Index, Float > > > &neighbors, Index *assigned) const
 
template<typename Index = int, typename Float >
std::vector< Index > run (int ndim, size_t nobs, const Float *data, Index *assigned) const
 
template<typename Index , typename Float >
std::vector< Index > run (const knncolle::Base< Index, Float > *index, Index *assigned) const
 

Detailed Description

Downsample a dataset based on its neighbors.

This class generates a deterministic downsampling of a dataset based on nearest neighbors. We identify the k-nearest neighbors of each cell and use them to define its local neighborhood. At each step, we consider the cells that do not belong to the local neighborhood of any previously retained cell, and choose the one with the fewest neighbors in the local neighborhoods of previously retained cells; ties are broken by choosing the cell with the smallest distance to its k-th neighbor (i.e., the cell in the densest region of space). The chosen cell is retained in the downsampled subset, and the process repeats until all cells have been processed.
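The selection rule described above can be sketched in standard C++ as follows. This is a simplified, quadratic-time illustration and not the library's implementation: the function name `downsample_sketch` is hypothetical, each cell is assumed to be absent from its own neighbor list, a retained cell is treated as part of its own neighborhood, and any remaining ties go to the lowest index.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical sketch of the greedy selection, not the scran implementation.
// neighbors[i] holds (index, distance) pairs sorted by increasing distance.
// On return, assigned[i] is the representative chosen for cell i.
std::vector<int> downsample_sketch(
    const std::vector<std::vector<std::pair<int, double> > >& neighbors,
    std::vector<int>& assigned)
{
    const std::size_t n = neighbors.size();
    std::vector<char> covered(n, 0); // 1 if the cell is retained or lies in a retained neighborhood
    assigned.assign(n, -1);
    std::vector<int> chosen;

    while (true) {
        bool found = false;
        std::size_t best = 0, best_count = 0;
        double best_dist = 0;

        for (std::size_t i = 0; i < n; ++i) {
            if (covered[i]) continue; // only uncovered cells are candidates

            // Count this cell's neighbors that are already covered.
            std::size_t count = 0;
            for (const auto& nb : neighbors[i]) {
                assert(nb.first >= 0 && static_cast<std::size_t>(nb.first) < n);
                count += (covered[nb.first] ? 1 : 0);
            }

            // Distance to the furthest stored (k-th) neighbor; smaller means denser.
            double dist = neighbors[i].empty() ? 0.0 : neighbors[i].back().second;

            if (!found || count < best_count || (count == best_count && dist < best_dist)) {
                found = true;
                best = i;
                best_count = count;
                best_dist = dist;
            }
        }
        if (!found) break; // every cell is covered

        // Retain the chosen cell and assign its uncovered neighbors to it.
        covered[best] = 1;
        assigned[best] = static_cast<int>(best);
        chosen.push_back(static_cast<int>(best));
        for (const auto& nb : neighbors[best]) {
            if (!covered[nb.first]) {
                covered[nb.first] = 1;
                assigned[nb.first] = static_cast<int>(best);
            }
        }
    }

    std::sort(chosen.begin(), chosen.end());
    return chosen;
}
```

For two well-separated groups of three cells each with k = 2, this sketch retains the central cell of each group, because the central cell has the smallest distance to its second neighbor.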

Each retained cell serves as a representative for up to k of its nearest neighboring cells. This approach ensures that the downsampled points are well-distributed across the dataset. Low-frequency subpopulations will always have at least a few representatives if they are sufficiently distant from other subpopulations. In contrast, random sampling does not provide strong guarantees for capture of a rare subpopulation. We also preserve the relative density across the dataset as more representatives will be generated from high-density regions. This simplifies the interpretation of analysis results generated from the subsetted dataset.

Member Function Documentation

◆ set_num_neighbors()

DownsampleByNeighbors & scran::DownsampleByNeighbors::set_num_neighbors ( int  k = Defaults::num_neighbors)
inline
Parameters
k  Number of neighbors to use for downsampling. Larger values result in more downsampling, at the cost of some speed.
Returns
A reference to this DownsampleByNeighbors object.

Note that this is only used in run() when a list of neighbors is not supplied.

◆ set_num_threads()

DownsampleByNeighbors & scran::DownsampleByNeighbors::set_num_threads ( int  n = Defaults::num_threads)
inline
Parameters
n  Number of threads to use for neighbor detection.
Returns
A reference to this DownsampleByNeighbors object.

Note that this is only used in run() when a list of neighbors is not supplied.

◆ set_approximate()

DownsampleByNeighbors & scran::DownsampleByNeighbors::set_approximate ( int  a = Defaults::approximate)
inline
Parameters
a  Whether approximate neighbor detection should be used.
Returns
A reference to this DownsampleByNeighbors object.

Note that this is only used in run() when a data matrix is supplied.

◆ run() [1/3]

template<typename Index , typename Float >
std::vector< Index > scran::DownsampleByNeighbors::run ( const std::vector< std::vector< std::pair< Index, Float > > > &  neighbors,
Index *  assigned 
) const
inline
Template Parameters
Index  Integer type for the indices.
Float  Floating point type for the distances.
Parameters
neighbors  Vector of vectors of neighbors, one per cell. Each entry of the outer vector corresponds to a cell, and each inner vector contains the index and distance of each of its nearest neighbors. It is assumed that each inner vector is sorted by increasing distance.
[out] assigned  Pointer to an array of length equal to the number of cells in neighbors. On completion, this contains the index of the representative for each cell in the original dataset. assigned may also be a null pointer, in which case the assignments are not stored.
Returns
Vector of indices of the chosen representative cells. The length of this vector depends on the dataset and the specified number of neighbors in set_num_neighbors(). Indices are sorted in increasing order.
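As an illustration of the input layout expected by this overload, the sketch below builds a neighbor list by exhaustive search. The helper name `brute_force_neighbors` is hypothetical and is not part of scran or knncolle; it also assumes that a cell is excluded from its own neighbor list and that distances are Euclidean.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical helper: exact k-nearest neighbors by exhaustive search.
// data is column-major, so coordinate d of cell c is data[c * ndim + d].
std::vector<std::vector<std::pair<int, double> > > brute_force_neighbors(
    int ndim, std::size_t nobs, const double* data, int k)
{
    assert(ndim >= 0 && k >= 0);
    std::vector<std::vector<std::pair<int, double> > > output(nobs);

    for (std::size_t i = 0; i < nobs; ++i) {
        // Store (distance, index) so that the default sort orders by distance first.
        std::vector<std::pair<double, int> > all;
        for (std::size_t j = 0; j < nobs; ++j) {
            if (i == j) continue; // assumption: a cell is not its own neighbor
            double d2 = 0;
            for (int d = 0; d < ndim; ++d) {
                double delta = data[i * ndim + d] - data[j * ndim + d];
                d2 += delta * delta;
            }
            all.emplace_back(std::sqrt(d2), static_cast<int>(j));
        }
        std::sort(all.begin(), all.end());

        // Keep the k closest as (index, distance) pairs, in increasing distance.
        std::size_t keep = std::min(static_cast<std::size_t>(k), all.size());
        for (std::size_t x = 0; x < keep; ++x) {
            output[i].emplace_back(all[x].second, all[x].first);
        }
    }
    return output;
}
```

The exhaustive search is quadratic in the number of cells, so it is only suitable for small datasets or for checking results from a faster index.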

◆ run() [2/3]

template<typename Index = int, typename Float >
std::vector< Index > scran::DownsampleByNeighbors::run ( int  ndim,
size_t  nobs,
const Float *  data,
Index *  assigned 
) const
inline
Template Parameters
Index  Integer type for the indices.
Float  Floating point type for the distances.
Parameters
ndim  Number of dimensions.
nobs  Number of observations, i.e., cells.
data  Pointer to a column-major array of dimensions (rows) by cells (columns) containing coordinates for each cell, typically in some kind of embedding.
[out] assigned  Pointer to an array of length equal to nobs. On completion, this contains the index of the representative for each cell in the original dataset. assigned may also be a null pointer, in which case the assignments are not stored.
Returns
Vector of indices of the chosen representative cells. The length of this vector depends on the dataset and the specified number of neighbors in set_num_neighbors(). Indices are sorted in increasing order.
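The column-major layout described for data means that all coordinates of one cell are contiguous in memory. A minimal sketch of preparing such a buffer from per-cell coordinate vectors is shown below; the helper names `pack_column_major` and `coordinate` are hypothetical and are used only for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: flatten per-cell coordinates into a column-major
// buffer with dimensions in rows and cells in columns.
std::vector<double> pack_column_major(
    const std::vector<std::vector<double> >& cells, int ndim)
{
    std::vector<double> buffer;
    buffer.reserve(cells.size() * static_cast<std::size_t>(ndim));
    for (const auto& cell : cells) {
        // Every cell must supply exactly ndim coordinates.
        assert(cell.size() == static_cast<std::size_t>(ndim));
        buffer.insert(buffer.end(), cell.begin(), cell.end());
    }
    return buffer;
}

// Coordinate d of cell c in a column-major buffer.
inline double coordinate(const double* data, int ndim, std::size_t c, int d) {
    return data[c * static_cast<std::size_t>(ndim) + static_cast<std::size_t>(d)];
}
```

The resulting buffer's `data()` pointer, together with ndim and the number of cells, matches the shape of the arguments listed above.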

◆ run() [3/3]

template<typename Index , typename Float >
std::vector< Index > scran::DownsampleByNeighbors::run ( const knncolle::Base< Index, Float > *  index,
Index *  assigned 
) const
inline
Template Parameters
Index  Integer type for the indices.
Float  Floating point type for the distances.
Parameters
index  Pointer to a knncolle::Base index object, containing a pre-built neighbor index for a dataset.
[out] assigned  Pointer to an array of length equal to the number of observations in index. On completion, this contains the index of the representative for each cell in the original dataset. assigned may also be a null pointer, in which case the assignments are not stored.
Returns
Vector of indices of the chosen representative cells. The length of this vector depends on the dataset and the specified number of neighbors in set_num_neighbors(). Indices are sorted in increasing order.
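Once assigned has been filled by any of the run() overloads, it can be summarized to see how many cells each representative stands for. The sketch below assumes that each retained cell is recorded as its own representative, which the text above does not state explicitly; the helper name `count_represented` is hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <vector>

// Hypothetical helper: count the cells assigned to each representative.
// Keys are representative indices; values are the number of cells mapped to them.
std::map<int, std::size_t> count_represented(const std::vector<int>& assigned) {
    std::map<int, std::size_t> counts;
    for (int rep : assigned) {
        assert(rep >= 0); // representative indices are expected to be non-negative
        ++counts[rep];
    }
    return counts;
}
```

Such counts could serve as weights when results from the downsampled subset are projected back onto the full dataset, although whether that is appropriate depends on the downstream analysis.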
