This project aims to make DNA and RNA k-mer return time distribution analysis simple, fast, and generalizable.
Consider the DNA sequences AAAAAAAAAAAAAAAAATTTTTTTTTTTTTTTTT and ATATATATATATATATATATATATATATATATAT.
Conventional k-mer frequency-based analysis methods would treat these sequences identically.
For k=1, the number of A 1-mers is precisely equal to the number of T 1-mers in both sequences.
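To make the point concrete, here is a minimal sketch that counts 1-mers in both sequences using only the Python standard library. It is an illustration, not part of librtd; the variable names are made up, and the sequences are written as 17 A's followed by 17 T's and as 17 repeats of AT, which is an assumption chosen to match the return times quoted in this README.

```python
from collections import Counter

# Assumed lengths: 17 A's then 17 T's, and 17 repeats of "AT".
first = "A" * 17 + "T" * 17
second = "AT" * 17

# For k=1, a k-mer count is just a character count.
first_counts = Counter(first)
second_counts = Counter(second)

print(first_counts)
print(second_counts)
print(first_counts == second_counts)
```

Both counters hold the same totals for A and T, so a purely frequency-based comparison cannot tell the two sequences apart.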
However, k-mer return time methods ask the following question:
For a given k-mer, how far (in base pairs) is it from the next occurrence of a k-mer (usually the same one)?
So, going back to the example above, for k=1, the return times in the first sequence for A would be 1, 1, 1, ... since each A is one base away from the next A. For the second sequence, they would be 2, 2, 2, ... since it takes two bases to reach the next A.
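The same-k-mer case can be sketched in a few lines of Python. This is a hypothetical helper written for illustration, not librtd's API: `return_times` is an invented name, and the sequence lengths (17 A's plus 17 T's, and 17 repeats of AT) are assumed.

```python
def return_times(seq, kmer):
    """Distances (in bases) from each occurrence of kmer to its next occurrence."""
    k = len(kmer)
    # Start positions of every occurrence of the k-mer, in order.
    positions = [i for i in range(len(seq) - k + 1) if seq[i:i + k] == kmer]
    # A return time is the gap between consecutive occurrences.
    return [b - a for a, b in zip(positions, positions[1:])]


first = "A" * 17 + "T" * 17   # assumed length
second = "AT" * 17            # assumed length

print(return_times(first, "A"))
print(return_times(second, "A"))
```

The last occurrence of a k-mer has no following occurrence, so it contributes no return time in this sketch.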
In librtd, we have generalized the concept of k-mer return time to include the distance of a k-mer not only to the next occurrence of itself but also to the next occurrence of another k-mer. In the first sequence, the return times from A to T are 17, 16, ..., 2, 1. This is useful for studying the relationship between the locations of various pairs of k-mers, not just individual k-mers.
Once the k-mer return times have been calculated, librtd can automatically compute the mean and standard deviation of the return times for each k-mer, allowing the distribution to be easily summarized.
In the first example, the mean distance between A and T is 9 with a standard deviation of about 5, whereas in the second example, the mean distance is 1 with a standard deviation of 0.
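These summary figures can be checked with Python's standard `statistics` module. The return-time lists are typed in by hand from the example (17, 16, ..., 1 for the first sequence, and seventeen 1s for the second, assuming 17 A's in each). This README does not say whether librtd reports the population or the sample standard deviation, so the sketch prints both.

```python
import statistics

# A-to-T return times from the two example sequences (assumed 17 A's each).
first_times = list(range(17, 0, -1))   # 17, 16, ..., 2, 1
second_times = [1] * 17                # every A is immediately followed by a T

for name, times in [("first", first_times), ("second", second_times)]:
    mean = statistics.mean(times)
    pop_sd = statistics.pstdev(times)    # population standard deviation
    sample_sd = statistics.stdev(times)  # sample standard deviation
    print(name, mean, round(pop_sd, 3), round(sample_sd, 3))
```

For the first sequence both standard deviations round to 5; for the second, both are 0.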
This technique is useful in applications wherever alignment-free sequence analysis is used, from phylogeny to metagenomics.
Give librtd a try!