This library provides a collection of sampling algorithms, including:
- mod-sampling,
- lr-minimizers (a "context-sensitive" version of closed syncmers),
- mod-minimizers,
- "classic" minimizers,
- miniception,
- rotational-minimizers,
- decycling set based minimizers,
- closed-syncmers,
- open-syncmers,
- open-closed-syncmers.
The code was used for the experiments in the paper "The mod-minimizer: a simple and efficient sampling algorithm for long k-mers", published in WABI 2024.
To reproduce the experiments in the paper, first compile the code as explained below, then run the scripts here.
Before compiling, pull all dependencies with
git submodule update --init --recursive
Compile the code with
mkdir build
cd build
cmake ..
make
After compilation, generate a random sequence (in the following example, of 1 million nucleotides) with
./generate_random_sequence -o test.bin -n 1000000 -s 4
and evaluate the density of different methods with the tool density.
Some examples follow.
./density -i test.bin -k 63 -w 8 -a minimizer --stream
num_sampled_kmers = 222189
num_kmers = 999938
num_windows = 999931
density = 0.222203
1.77762X away from lower bound 1/w = 0.125
calculation using closed formulas:
density = 0.222222
1.77778X away from lower bound 1/w = 0.125
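The measured figures in the output above can be re-derived by hand. The following is a minimal Python sketch, not part of the library: the helper names are hypothetical, and the relations used (window length l = w + k - 1, density taken as num_sampled_kmers / num_kmers, and the reported factor as density times w) are inferred from the printed numbers rather than from the tool's source code.

```python
def expected_counts(n, k, w):
    """Number of k-mers and of windows in a sequence of n nucleotides.

    Assumes a window spans w consecutive k-mers, i.e. l = w + k - 1 bases.
    """
    l = w + k - 1
    return n - k + 1, n - l + 1


def measured_density(num_sampled_kmers, num_kmers, w):
    """Density and its ratio to the lower bound 1/w.

    Assumption: density is the fraction of k-mers that are sampled.
    """
    density = num_sampled_kmers / num_kmers
    return density, density * w


# Values taken from the "-a minimizer" run shown above.
num_kmers, num_windows = expected_counts(1_000_000, 63, 8)
density, factor = measured_density(222189, num_kmers, 8)
print(f"num_kmers = {num_kmers}")
print(f"num_windows = {num_windows}")
print(f"density = {density:.6f}")
print(f"{factor:.5f}X away from lower bound 1/w = {1 / 8}")
```

With these assumptions the sketch reproduces the counts, the density, and the factor printed for the minimizer run.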
./density -i test.bin -k 63 -w 8 -a lr-minimizer --stream
num_sampled_kmers = 176521
num_kmers = 999938
num_windows = 999931
density = 0.176532
1.41226X away from lower bound 1/w = 0.125
calculation using closed formulas:
density = 0.176471
1.41176X away from lower bound 1/w = 0.125
./density -i test.bin -k 63 -w 8 -a mod-minimizer --stream
num_sampled_kmers = 138477
num_kmers = 999938
num_windows = 999931
density = 0.138486
1.10788X away from lower bound 1/w = 0.125
calculation using closed formulas:
density = 0.138462
1.10769X away from lower bound 1/w = 0.125
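The "closed formulas" values in the three runs above are all consistent with one expression, (floor((l - t) / w) + 2) / (l - t + 2), where l = w + k - 1 and t is the length of the substring whose position drives the sampling. The sketch below is an illustration under assumptions, not the tool's code: the function name is hypothetical, and the choices of t (t = k for the minimizer, t = k - w for the lr-minimizer, t = r + ((k - r) mod w) with r = 4 for the mod-minimizer) were picked because they match the printed values.

```python
from fractions import Fraction


def closed_form_density(k, w, t):
    """Density (floor((l - t) / w) + 2) / (l - t + 2), with l = w + k - 1.

    Assumption: this is the expression behind the "closed formulas" lines;
    it is checked here only against the three values printed above.
    """
    l = w + k - 1
    return Fraction((l - t) // w + 2, l - t + 2)


k, w, r = 63, 8, 4  # r = 4 is an assumed lower bound on t
choices_of_t = {
    "minimizer": k,                     # t = k gives 2 / (w + 1)
    "lr-minimizer": k - w,              # assumed t = k - w
    "mod-minimizer": r + (k - r) % w,   # assumed t = r + ((k - r) mod w)
}

for name, t in choices_of_t.items():
    d = closed_form_density(k, w, t)
    print(f"{name}: t = {t}, density = {d} = {float(d):.6f}, "
          f"{float(d * w):.5f}X away from lower bound 1/w")
```

For k = 63 and w = 8 this gives 2/9, 3/17 and 9/65, which round to the 0.222222, 0.176471 and 0.138462 reported above.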