nclist-cpp
C++ implementation of nested containment lists
This header-only library implements the nested containment list (NCList) algorithm for interval queries. Specifically, the aim is to find all "subject" intervals that overlap a "query" interval, which is a common task when analyzing genomics data. The interface and capabilities of this library are based on the findOverlaps() function from the IRanges R package. We support overlaps with a maximum gap or minimum overlap, as well as overlaps of different types - matching starts/ends, within/extends, etc.
Given an array of starts/ends for the subject intervals, we build the nested containment list:
Then we check for overlaps with our query interval:
Check out the reference documentation for more details.
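To make the definition of an overlap concrete, here is a minimal brute-force sketch that is independent of this library. It assumes half-open intervals with exclusive ends; the function name brute_force_overlaps() is hypothetical and is not part of the library's interface. A linear scan like this is what the NCList is meant to avoid, but it is handy as a reference when checking results.

```cpp
#include <cstddef>
#include <vector>

// Naive linear scan over all subjects; a reference definition only,
// not the NCList algorithm and not this library's API.
// All intervals are half-open: [start, end).
std::vector<int> brute_force_overlaps(
    const std::vector<int>& starts,
    const std::vector<int>& ends,
    int query_start,
    int query_end)
{
    std::vector<int> matches;
    for (std::size_t i = 0; i < starts.size(); ++i) {
        // Two half-open intervals overlap when each starts before the other ends.
        if (starts[i] < query_end && query_start < ends[i]) {
            matches.push_back(static_cast<int>(i));
        }
    }
    return matches;
}
```

The returned vector holds the indices of the overlapping subject intervals, in increasing order.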
Say we only want to report overlaps where the length of the overlapping subinterval is not below some threshold. We can do so with the min_overlaps= parameter:
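As a library-independent illustration of this criterion, the sketch below computes the length of the shared subinterval for two half-open intervals and compares it to a threshold. The helper names overlap_length() and passes_min_overlap() are hypothetical, and the exact rules applied by the library should be taken from its reference documentation.

```cpp
#include <algorithm>

// Length of the subinterval shared by [s1, e1) and [s2, e2); 0 if they are disjoint.
int overlap_length(int s1, int e1, int s2, int e2) {
    return std::max(0, std::min(e1, e2) - std::max(s1, s2));
}

// True if the shared subinterval is at least min_overlap long.
// Assumes min_overlap > 0; with 0, disjoint intervals would also pass.
bool passes_min_overlap(int s1, int e1, int s2, int e2, int min_overlap) {
    return overlap_length(s1, e1, s2, e2) >= min_overlap;
}
```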
We can also detect "overlaps" where the query and subject intervals are separated by a gap that is no greater than some threshold:
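One way to picture the gap criterion, independent of this library, is the sketch below. It assumes half-open intervals and defines the gap as the distance from the end of the earlier interval to the start of the later one. The names gap_between() and within_max_gap() are hypothetical, and details such as the treatment of directly adjacent intervals are conventions that should be checked against the reference documentation.

```cpp
#include <algorithm>

// Gap between [s1, e1) and [s2, e2):
//   positive if the intervals are separated,
//   zero if they are directly adjacent,
//   negative (minus the overlap length) if they overlap.
int gap_between(int s1, int e1, int s2, int e2) {
    return std::max(s1, s2) - std::min(e1, e2);
}

// True if the intervals overlap or are separated by at most max_gap.
bool within_max_gap(int s1, int e1, int s2, int e2, int max_gap) {
    return gap_between(s1, e1, s2, e2) <= max_gap;
}
```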
In some cases, we only want to determine if any overlap exists, without concern for the identity of the overlapping subject intervals. This can be achieved with the quit_on_first= parameter, which will return upon finding the first overlapping interval for greater efficiency.
The overlaps_any() function will look for any overlaps between the query and subject intervals. However, other functions can also be used to report different types of overlaps. Perhaps we want to consider overlaps where the query interval lies "within" (i.e., is a subinterval of) a subject interval:
Or we only care about those subject intervals with the same start position:
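The "within" and "same start" conditions described above can be written as simple predicates. This is only a sketch of the definitions for half-open intervals with no gap or minimum overlap applied; the function names are hypothetical and are not the library's interface.

```cpp
// True if the query [query_start, query_end) is a subinterval of the
// subject [subject_start, subject_end).
bool query_within_subject(int query_start, int query_end, int subject_start, int subject_end) {
    return subject_start <= query_start && query_end <= subject_end;
}

// True if the query and subject begin at the same position.
bool same_start(int query_start, int subject_start) {
    return query_start == subject_start;
}
```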
And so on. This functionality is inspired by the type= argument in the findOverlaps() function from the IRanges package. Note that the interpretation of some parameters (e.g., max_gap) depends on the type of overlap, so be sure to consult the relevant documentation.
This library will also work with double-precision interval coordinates:
The subject interval indices (used to store the overlap results in matches) can also be changed from int to other integer types like std::size_t. Larger types may be preferred if there are more intervals than can be represented by int, at the cost of some increased memory usage.
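The idea of choosing the coordinate and index types independently can be sketched with a small template. This is a library-independent brute-force scan; the name find_overlaps_generic() and its template parameters Index_ and Position_ are hypothetical and only illustrate how the two types play different roles.

```cpp
#include <cstddef>
#include <vector>

// Index_ is the integer type used to count and report subject intervals.
// Position_ is the numeric type of the coordinates (e.g., int or double).
// Intervals are half-open: [start, end).
template<typename Index_, typename Position_>
std::vector<Index_> find_overlaps_generic(
    Index_ num_subjects,
    const Position_* starts,
    const Position_* ends,
    Position_ query_start,
    Position_ query_end)
{
    std::vector<Index_> matches;
    for (Index_ i = 0; i < num_subjects; ++i) {
        if (starts[i] < query_end && query_start < ends[i]) {
            matches.push_back(i);
        }
    }
    return matches;
}

// Example instantiation: find_overlaps_generic<std::size_t, double>(...)
// reports std::size_t indices for double-precision coordinates.
```

Each reported match occupies sizeof(Index_) bytes, which is where the memory cost of a wider index type comes from.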
The build_custom() function accepts subject interval coordinates in formats other than a C-style array. For example, we might have stored the coordinates in a std::deque for more efficient expansion:
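To show why a function can be agnostic to the container, here is a small library-independent sketch of a template that reads coordinates only through operator[]. It works with std::deque, std::vector, or a raw pointer alike. The function total_width() is hypothetical, and the assumption that element access via [] is all that is required should be confirmed against the build_custom() documentation.

```cpp
#include <cstddef>
#include <deque>

// Sums the widths of n half-open intervals, reading coordinates only via operator[].
// Any container or array-like object with integer elements and [] access will do.
template<class StartArray_, class EndArray_>
long long total_width(std::size_t n, const StartArray_& starts, const EndArray_& ends) {
    long long total = 0;
    for (std::size_t i = 0; i < n; ++i) {
        total += ends[i] - starts[i];
    }
    return total;
}
```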
A more interesting application involves adjusting the coordinates without allocating a new array. For example, many genomic intervals are reported with inclusive ends (e.g., in GFF and SAM files) but build() expects exclusive ends. This is accommodated in build_custom() by creating a custom class that increments the end position on the fly:
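A minimal sketch of such an adaptor is given below. The class name ExclusiveEnds is hypothetical, and the assumption that the wrapper only needs to provide read-only access through operator[] should be checked against the build_custom() documentation.

```cpp
#include <cstddef>
#include <vector>

// Wraps a vector of inclusive end positions and exposes them as exclusive ends.
// Only a reference is stored, so no second array is allocated; the wrapped
// vector must outlive the adaptor.
class ExclusiveEnds {
public:
    explicit ExclusiveEnds(const std::vector<int>& inclusive_ends) : my_ends(inclusive_ends) {}

    // Converts on the fly: an inclusive end of x is an exclusive end of x + 1.
    int operator[](std::size_t i) const {
        return my_ends[i] + 1;
    }

private:
    const std::vector<int>& my_ends;
};
```

For instance, an interval reported as 10 to 19 with an inclusive end would be seen through the adaptor as the half-open interval [10, 20).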
FetchContent
If you're using CMake, you just need to add something like this to your CMakeLists.txt:
Then you can link to nclist to make the headers available during compilation:
find_package()
You can install the library by cloning a suitable version of this repository and running the following commands:
Then you can use find_package() as usual:
If you're not using CMake, the simple approach is to just copy the files in the include/ subdirectory - either directly or with Git submodules - and include their path during compilation with, e.g., GCC's -I.