scran
C++ library for basic single-cell RNA-seq analyses
Downsample a dataset based on its neighbors.
#include <DownsampleByNeighbors.hpp>
Classes | |
struct | Defaults |
Default parameter settings. | |
Public Member Functions | |
DownsampleByNeighbors & | set_num_neighbors (int k=Defaults::num_neighbors) |
DownsampleByNeighbors & | set_num_threads (int n=Defaults::num_threads) |
DownsampleByNeighbors & | set_approximate (int a=Defaults::approximate) |
template<typename Index , typename Float > | |
std::vector< Index > | run (const std::vector< std::vector< std::pair< Index, Float > > > &neighbors, Index *assigned) const |
template<typename Index = int, typename Float > | |
std::vector< Index > | run (int ndim, size_t nobs, const Float *data, Index *assigned) const |
template<typename Index , typename Float > | |
std::vector< Index > | run (const knncolle::Base< Index, Float > *index, Index *assigned) const |
Detailed Description
This class generates a deterministic downsampling of a dataset based on nearest neighbors. To do so, we identify the k-nearest neighbors of each cell and use them to define its local neighborhood. At each step, we find the cell that does not belong to the local neighborhood of any previously retained cell and has the fewest neighbors in the local neighborhoods of previously retained cells; ties are broken using the smallest distance to the cell's k-th neighbor (i.e., the densest region of space). This cell is retained in the downsampled subset, and we repeat this process until all cells have been processed.
Each retained cell serves as a representative for up to k of its nearest neighboring cells. This approach ensures that the downsampled points are well-distributed across the dataset. Low-frequency subpopulations will always have at least a few representatives if they are sufficiently distant from other subpopulations. In contrast, random sampling does not provide strong guarantees for capture of a rare subpopulation. We also preserve the relative density across the dataset as more representatives will be generated from high-density regions. This simplifies the interpretation of analysis results generated from the subsetted dataset.
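The selection procedure described above can be sketched as a simple quadratic-time loop. This is only an illustration under stated assumptions, not the library's implementation: the function name `downsample_sketch` is hypothetical, and "fewest neighbors in the local neighborhoods of previously retained cells" is interpreted here as the number of a cell's own neighbors that have already been processed.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical helper, NOT part of scran: a quadratic-time illustration of
// greedy neighbor-based downsampling. neighbors[i] holds (index, distance)
// pairs for cell i, sorted by increasing distance.
std::vector<int> downsample_sketch(
    const std::vector<std::vector<std::pair<int, double> > >& neighbors,
    std::vector<int>* assigned = nullptr)
{
    const std::size_t ncells = neighbors.size();
    std::vector<int> representative(ncells, -1); // -1 means "not yet processed"
    std::vector<int> retained;

    while (true) {
        int best = -1;
        std::size_t best_covered = 0;
        double best_dist = 0;

        for (std::size_t i = 0; i < ncells; ++i) {
            if (representative[i] != -1) {
                continue; // retained already, or inside a retained cell's neighborhood
            }
            // Assumption: count how many of this cell's neighbors are already processed.
            std::size_t covered = 0;
            for (const auto& nb : neighbors[i]) {
                if (representative[nb.first] != -1) {
                    ++covered;
                }
            }
            // Distance to the k-th (last) neighbor; smaller means a denser region.
            double dist = neighbors[i].empty() ? 0.0 : neighbors[i].back().second;
            if (best < 0 || covered < best_covered ||
                (covered == best_covered && dist < best_dist)) {
                best = static_cast<int>(i);
                best_covered = covered;
                best_dist = dist;
            }
        }

        if (best < 0) {
            break; // every cell has been processed
        }
        representative[best] = best; // a retained cell represents itself
        retained.push_back(best);
        for (const auto& nb : neighbors[best]) {
            if (representative[nb.first] == -1) {
                representative[nb.first] = best;
            }
        }
    }

    std::sort(retained.begin(), retained.end());
    if (assigned != nullptr) {
        *assigned = representative;
    }
    return retained;
}
```

For two well-separated groups of three cells each and k = 2, this sketch retains one representative per group, chosen from the densest position in each group.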
DownsampleByNeighbors & set_num_neighbors (int k=Defaults::num_neighbors) | inline
Parameters
k | Number of neighbors to use for downsampling. Larger values result in more downsampling, at the cost of some speed.
Returns
A reference to this DownsampleByNeighbors object.
Note that this setting is only used in run() when a list of neighbors is not supplied.
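DownsampleByNeighbors & set_num_threads (int n=Defaults::num_threads) | inline
Parameters
n | Number of threads to use for neighbor detection.
Returns
A reference to this DownsampleByNeighbors object.
Note that this setting is only used in run() when a list of neighbors is not supplied.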
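DownsampleByNeighbors & set_approximate (int a=Defaults::approximate) | inline
Parameters
a | Whether approximate neighbor detection should be used.
Returns
A reference to this DownsampleByNeighbors object.
Note that this setting is only used in run() when a data matrix is supplied.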
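Because each setter returns a reference to the object it was called on, calls can be chained. The sketch below illustrates that pattern with a hypothetical stand-in class, `MockDownsampler`, which only mirrors the shape of the setters documented here. It is not the scran class, and its default values are placeholders rather than the values in `Defaults`.

```cpp
#include <cassert>

// Hypothetical stand-in, NOT scran's DownsampleByNeighbors. It only mirrors
// the setter signatures to show how returning *this enables chaining.
class MockDownsampler {
public:
    MockDownsampler& set_num_neighbors(int k) {
        num_neighbors = k;
        return *this;
    }

    MockDownsampler& set_num_threads(int n) {
        num_threads = n;
        return *this;
    }

    MockDownsampler& set_approximate(int a) {
        approximate = (a != 0);
        return *this;
    }

    // Placeholder defaults chosen for illustration only.
    int num_neighbors = 10;
    int num_threads = 1;
    bool approximate = false;
};
```

With such a class, `runner.set_num_neighbors(20).set_num_threads(4).set_approximate(1)` configures all three settings in one expression and yields a reference to `runner` itself.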
template<typename Index , typename Float >
std::vector< Index > run (const std::vector< std::vector< std::pair< Index, Float > > > &neighbors, Index *assigned) const | inline
Template Parameters
Index | Integer type for the indices.
Float | Floating point type for the distances.
Parameters
neighbors | Vector of vectors of neighbors for each cell. Each entry of the outer vector corresponds to a cell, and each inner vector contains the index of and distance to each of that cell's nearest neighbors. It is assumed that each inner vector is sorted by increasing distance.
[out] | assigned | Pointer to an array of length equal to the number of cells in neighbors. On completion, this contains the index of the representative for each cell in the original dataset. assigned may also be a null pointer, in which case the representatives are not reported.
Returns
Indices of the cells retained in the downsampled subset (see set_num_neighbors()). Indices are sorted in increasing order.
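Since this overload assumes that each inner vector is sorted by increasing distance, it can be worth validating a neighbor list before passing it in. The following is a minimal sketch of such a check. The helper name `check_neighbor_lists` is hypothetical and not part of the library, and the index range check is an additional assumption about what counts as valid input.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical helper, NOT part of scran: returns true if every neighbor
// index refers to an existing cell and every inner vector is sorted by
// increasing distance.
template<typename Index, typename Float>
bool check_neighbor_lists(
    const std::vector<std::vector<std::pair<Index, Float> > >& neighbors)
{
    const std::size_t ncells = neighbors.size();
    for (const auto& current : neighbors) {
        for (const auto& nb : current) {
            // Negative signed indices wrap to large values and are rejected too.
            if (static_cast<std::size_t>(nb.first) >= ncells) {
                return false;
            }
        }
        bool sorted = std::is_sorted(current.begin(), current.end(),
            [](const std::pair<Index, Float>& left,
               const std::pair<Index, Float>& right) {
                return left.second < right.second; // compare distances only
            });
        if (!sorted) {
            return false;
        }
    }
    return true;
}
```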
template<typename Index = int, typename Float >
std::vector< Index > run (int ndim, size_t nobs, const Float *data, Index *assigned) const | inline
Template Parameters
Index | Integer type for the indices.
Float | Floating point type for the distances.
Parameters
ndim | Number of dimensions.
nobs | Number of observations, i.e., cells.
data | Pointer to a column-major array with ndim rows (dimensions) and nobs columns (cells), containing coordinates for each cell, typically in some kind of embedding.
[out] | assigned | Pointer to an array of length equal to nobs. On completion, this contains the index of the representative for each cell in the original dataset. assigned may also be a null pointer, in which case the representatives are not reported.
Returns
Indices of the cells retained in the downsampled subset (see set_num_neighbors()). Indices are sorted in increasing order.
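To make the column-major layout concrete: coordinate `d` of cell `c` sits at `data[c * ndim + d]`. The sketch below uses that layout to build a neighbor list in the format accepted by the neighbor-list overload of run(), using an exhaustive search. The function name `brute_force_neighbors` is hypothetical, Euclidean distance is an assumption, and the brute-force search is merely a stand-in for whatever neighbor detection the library performs internally.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical helper, NOT part of scran: exhaustive k-nearest-neighbor
// search over a column-major array (ndim rows, nobs columns).
std::vector<std::vector<std::pair<int, double> > > brute_force_neighbors(
    int ndim, std::size_t nobs, const double* data, int k)
{
    std::vector<std::vector<std::pair<int, double> > > output(nobs);
    for (std::size_t i = 0; i < nobs; ++i) {
        const double* self = data + i * ndim; // start of column i

        // Store (distance, index) so that sorting orders by distance first.
        std::vector<std::pair<double, int> > candidates;
        for (std::size_t j = 0; j < nobs; ++j) {
            if (j == i) {
                continue; // a cell is not its own neighbor
            }
            const double* other = data + j * ndim;
            double sq = 0;
            for (int d = 0; d < ndim; ++d) {
                double diff = self[d] - other[d];
                sq += diff * diff;
            }
            candidates.emplace_back(std::sqrt(sq), static_cast<int>(j));
        }
        std::sort(candidates.begin(), candidates.end());

        std::size_t keep = (k > 0) ? static_cast<std::size_t>(k) : 0;
        keep = std::min(keep, candidates.size());
        for (std::size_t c = 0; c < keep; ++c) {
            // Flip to the (index, distance) order used by the neighbor lists.
            output[i].emplace_back(candidates[c].second, candidates[c].first);
        }
    }
    return output;
}
```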
template<typename Index , typename Float >
std::vector< Index > run (const knncolle::Base< Index, Float > *index, Index *assigned) const | inline
Template Parameters
Index | Integer type for the indices.
Float | Floating point type for the distances.
Parameters
index | Pointer to a knncolle::Base index object, containing a pre-built neighbor index for a dataset.
[out] | assigned | Pointer to an array of length equal to the number of cells in index. On completion, this contains the index of the representative for each cell in the original dataset. assigned may also be a null pointer, in which case the representatives are not reported.
Returns
Indices of the cells retained in the downsampled subset (see set_num_neighbors()). Indices are sorted in increasing order.
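One use of the assigned output is to count how many cells each representative stands for, for example to weight the retained cells in downstream summaries. The sketch below tallies those counts from a copy of the assigned array. The helper name `count_represented` is hypothetical, and it assumes that each retained cell is recorded as its own representative, which the documentation above does not state explicitly.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <vector>

// Hypothetical helper, NOT part of scran: maps each representative's index
// to the number of cells assigned to it. Assumes that a retained cell
// appears in `assigned` as its own representative.
std::map<int, std::size_t> count_represented(const std::vector<int>& assigned) {
    std::map<int, std::size_t> counts;
    for (int representative : assigned) {
        ++counts[representative]; // value-initialized to zero on first access
    }
    return counts;
}
```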