Blight is an associative data structure, similar to a hashtable, able to index a kmer set. It associates each kmer with a unique identifier and can identify alien kmers (kmer absent from the index) associated with the identifier -1. The kmer identifiers are in[0, N[ where N is the number of kmer in the index. Hence, one can see Blight as a Minimal Perfect Hash Function (MPHF) that handle alien kmers.
A Blight index is
- Deterministic, it produces no false positive or false negative.
- Built from a compacted de Bruijn graph
- Static, once constructed, the indexed kmer set cannot be modified
- Fast and memory-efficient even for the most extensive kmer set
To be constructed, a Blight index needs a Fasta file, whose sequence contains the kmer to index with no duplicate. An efficient way to build such a file is to construct a compacted de Bruijn graph from interest sequences. We recommend using BCALM2 (github.com/GATB/bcalm) to construct such a graph from a set of sequences in Fasta, Fastq, gzipped or not.
The bcalm2 command should look like this:
bcalm -in my_fasta_file.fa -kmer-size 31 -abundance-min 1 -out my_graph
Note that by default, bcalm2 will remove unique kmer if the parameter abundance-min is not set.
After the simple installation using:
git clone --depth 1 https://github.com/Malfoy/Blight
make -j 4
We get a snippet binary that plays with the Blight API and a benchmark binary bench_blight.
We can test the blight index's performance on our system using bench_blight on a compacted de Bruijn graph.
We can start to use the API by modifying the snippet.cpp file.
The makefile shows how to compile Blight and lz4 object files and link them to a program.
We can tune the Blight index with several parameters impacting its overall performances.
All the words of length k present in the fasta file will be indexed.
This parameter defines the number of threads used by Blight to perform queries and construct its index. The default value is one thread.
The minimizer of a kmer is its minimal word of length m (according to a hash function) Blight will use 4^m partitions to index the kmer according to their minimizer. A larger minimizer size reduces the memory fingerprint up to the point where the constant overhead of O(4^m) is too large. A larger minimizer size can also accelerate the query time by improving cache coherency. The default value is 10 for roughly 1M partitions.
Blight will use 4^s temporary files to construct the index. Allowing more files can accelerate the construction and diminish the memory used during this step. The default value is 4 for roughly 256 files.
On most Linux system
ulimit -n 2000
Will allow the use of 2000 open files.
Blight will subsample the kmer position in the index. This option will save b bits per kmer, but each query will perform at most 2^b verification to find a kmer_Set_Light. With low b values, the query time's impact is low because of cache mechanisms, but for high b values (b>5), an increment double the expected query time. The default value is 0 (no subsampling)
The snippet toy program includes all the useful functions. We will describe them in detail here.
A blight index is an object with the given constructor setting the index parameter described previously:
kmer_Set_Light(uint64_t kmer_size ,uint64_t core_Number=1, uint64_t minimizer_size=10, uint64_t file_number_exponent=4, uint64_t bits_to_save=0);
Note that another constructor allows to load an index directly from an index file containing a previous index dumped on disk (See index on disk)
kmer_Set_Light(const string& index_file);
Once created, we can construct the index from a fasta file with the construct_index function. Blight creates its temporary files in a working directory that MUST exist.
void construct_index(const string& input_file, const string& wdir = "");
Once constructed, a blight index can perform two kinds of queries from a sequence. Both functions output a vector with a value for each kmer of the query sequence. The first element of the vector is the value associated with the leftmost kmer.
Membership queries are perfomed with get_presence_query
vector<bool> get_presence_query(const string& seq);
True meaning that the kmer is present in the index, False otherwise.
Associative queries with get_hashes_query
vector<int64_t> get_hashes_query(const string& seq);
The associated indices are in the range [0, get_kmer_number()] for the kmer present in the index, -1 otherwise
Once constructed, the blight index can be saved on the disk to be reused later with the function dump_disk
void dump_disk(const string& output_file);
It can be loaded later with a load constructor
kmer_Set_Light(const string& output_file);
Blight combined with an array of values is similar to a static hashtable. As show in the snippet, an array of value can be allocated using the function get_kmer_number() that indicate the amount of kmer present in the index.
int abundance[blight_index_6.get_kmer_number()]={0};
This way, Blight will associate each indexed kmer to a position in the abundance array.
It can be useful to be able to iterate on all the kmer of the index. This can be done using an iterator, as shown in the snippet.
kmer_Set_Light_iterator it(&blight_index);
do{
//We can obtain a binary representation of the kmer as a integer
kmer kmer_binary(it.get_kmer());
//Or as a sequence
string kmer_sequence(it.get_kmer_str());
//If iterator.next() return false it mean that there is no more kmer in the index
}while(it.next());