(compressed) suffix arrays over collections #28

h-2 · 2017-09-20T14:26:10Z

Another very central feature besides #27 that we require is the ability to create indexes over collections of strings/vectors. We also discussed this in March with @simongog, and it seemed like he already had some ideas for this.
We could theoretically wrap around your index structure, but it might make more sense to work on this as part of the SDSL.

@mpetri @simongog Do you have any preferences here?

@cpockrandt Can you create a proof-of-concept wrapping the data structures to show how this could work?

mpetri · 2017-09-20T21:32:56Z

We already discussed this. One of the main open issues was what the input format should look like.

Any ideas?

h-2 · 2017-09-28T16:18:35Z

> Any ideas?

Well, the main decision from my POV is:

1. treat collections as collections and store 2-dimensional positions; or
2. abstract the collection away behind a virtual concatenated string, index that, and re-compute positions on access

Option 2 is probably much easier to implement, but it likely has a strong impact on performance; this would need to be evaluated. You would need something like seqan/seqan3#104 to avoid copying the input sequences into one sequence, which further increases the performance impact (or, if you copy instead, increases the size overhead).
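To make the cost of option 2 concrete, here is a minimal sketch of the position re-computation it implies. The class `concat_position_map` and its members are hypothetical names for illustration, not SDSL or SeqAn API. It assumes the index reports positions in the virtual concatenation and recovers the (sequence, offset) pair with a binary search over the sequence start offsets.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper (not part of SDSL or SeqAn): maps positions in a
// virtual concatenation of a collection back to (sequence, offset) pairs.
class concat_position_map
{
public:
    explicit concat_position_map(std::vector<std::string> const & collection)
    {
        // starts_[i] is the offset of sequence i in the concatenation;
        // the final entry is the total length.
        starts_.reserve(collection.size() + 1);
        std::size_t offset = 0;
        for (auto const & seq : collection)
        {
            starts_.push_back(offset);
            offset += seq.size();
        }
        starts_.push_back(offset);
    }

    std::size_t total_length() const { return starts_.back(); }

    // Re-compute the 2-dimensional position on access.
    std::pair<std::size_t, std::size_t> to_pair(std::size_t concat_pos) const
    {
        assert(concat_pos < total_length());
        // Find the first start strictly greater than concat_pos;
        // the sequence just before it contains the position.
        auto it = std::upper_bound(starts_.begin(), starts_.end(), concat_pos);
        std::size_t seq_id = static_cast<std::size_t>(it - starts_.begin()) - 1;
        return {seq_id, concat_pos - starts_[seq_id]};
    }

    // The inverse direction is a single addition.
    std::size_t to_concat(std::size_t seq_id, std::size_t offset) const
    {
        return starts_[seq_id] + offset;
    }

private:
    std::vector<std::size_t> starts_;
};
```

Under these assumptions every reported hit pays a lookup that is logarithmic in the number of sequences, which is the kind of overhead that would have to be measured against option 1.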

Option 1 will be slightly more work, but has these advantages:

- faster, because positions need no transformation
- smaller, because indexing with a pair of positions might allow smaller types: in bioinformatics the sizes of the two dimensions are often known at compile time and are vastly different, so e.g. a position in a set of chromosomes can be stored as a pair of uint8_t and uint32_t, giving a packed size of only 40 bits instead of 64 bits, a significant difference
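As a rough illustration of the 40-bit figure, the sketch below stores a uint8_t sequence id and a uint32_t offset in five bytes. The type `packed_position` and its layout (byte 0 for the id, bytes 1 to 4 for the offset in little-endian order) are assumptions made for this example, not an existing SDSL or SeqAn type. It uses a byte array rather than compiler-specific packing attributes so that it stays portable.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical 2-dimensional position packed into 40 bits.
struct packed_position
{
    // Assumed layout: byte 0 holds the sequence id,
    // bytes 1..4 hold the offset in little-endian order.
    std::array<std::uint8_t, 5> bytes;

    static packed_position make(std::uint8_t seq_id, std::uint32_t offset)
    {
        packed_position p{};
        p.bytes[0] = seq_id;
        for (std::size_t i = 0; i < 4; ++i)
            p.bytes[1 + i] = static_cast<std::uint8_t>(offset >> (8 * i));
        return p;
    }

    std::uint8_t seq_id() const { return bytes[0]; }

    std::uint32_t offset() const
    {
        std::uint32_t value = 0;
        for (std::size_t i = 0; i < 4; ++i)
            value |= static_cast<std::uint32_t>(bytes[1 + i]) << (8 * i);
        return value;
    }
};

// Five bytes per position, compared with eight for a single 64-bit position.
static_assert(sizeof(packed_position) == 5, "expected a 40-bit position");
```

A plain struct with a uint8_t and a uint32_t member would typically be padded to 8 bytes, so the saving only materialises with an explicitly packed representation like this one.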

cpockrandt · 2017-09-30T00:58:30Z

@h-2 Size should not be an issue here. AFAIK, the CSA uses bit-compressed vectors anyway (so no bits are wasted).

h-2 · 2017-09-30T16:54:58Z

> @h-2 Size should not be an issue here. AFAIK, the CSA uses bit-compressed vectors anyway (so no bits are wasted).

Hm, that may be true for the BWT, but is it true for the sampled SA as well? And what about the full SA during construction?
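For reference, the size argument can be checked with a small calculation of the bits per position that a bit-compressed vector would need under each representation. The helper names `bits_for` and `compare_widths` are hypothetical, and the sketch says nothing about what SDSL actually does for the sampled SA or during construction; it only computes the minimal widths and assumes all three inputs are greater than zero.

```cpp
#include <cassert>
#include <cstdint>

// Number of bits needed to represent every value in [0, max_value].
inline unsigned bits_for(std::uint64_t max_value)
{
    unsigned bits = 1;
    while (max_value >>= 1)
        ++bits;
    return bits;
}

// Bits per stored position for the two representations discussed above.
struct width_comparison
{
    unsigned concatenated;  // one position in the virtual concatenation
    unsigned pair;          // sequence id plus offset within the sequence
};

// Hypothetical helper; all three arguments are assumed to be > 0.
inline width_comparison compare_widths(std::uint64_t num_sequences,
                                       std::uint64_t longest_sequence,
                                       std::uint64_t total_length)
{
    // Positions are 0-based, so the largest stored value is size - 1.
    return {bits_for(total_length - 1),
            bits_for(num_sequences - 1) + bits_for(longest_sequence - 1)};
}
```

With made-up round numbers of 24 sequences, a longest sequence of 250,000,000 and a total length of 3,100,000,000, this gives 32 bits for the concatenated position and 5 + 28 = 33 bits for the pair, so under bit compression the pair is not automatically the smaller of the two.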
