Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

(compressed) suffic arrays over collections #28

Open
h-2 opened this issue Sep 20, 2017 · 4 comments
Open

(compressed) suffic arrays over collections #28

h-2 opened this issue Sep 20, 2017 · 4 comments

Comments

@h-2
Copy link

h-2 commented Sep 20, 2017

Another very central feature beside #27 that we require is to be able to create indexes over collections of strings/vectors. We also discussed this in march with @simongog and it seemed like he already had some ideas for this.
We could theoretically wrap around your index structure, but it might make more sense to work on this as part of the SDSL.

@mpetri @simongog Do you have any preferences here?

@cpockrandt Can you create a proof-of-concept wrapping the data structures to show how this could work?

@mpetri
Copy link

mpetri commented Sep 20, 2017

We already discussed this. Some of the issues we had were mainly how the input format should look like.

Any ideas?

@h-2
Copy link
Author

h-2 commented Sep 28, 2017

Any ideas?

Well, the main decision from my POV is:

  1. treat collections as collections and store 2-dimensional positions; or
  2. abstract the collection away behind a virtual concatenated string, index that, re-compute positions on access

2. is probably much easier to implement, but likely has a strong impact on the performance. This would need to be evaluated... You would need something like seqan/seqan3#104 if you want to prevent copying the input sequences into one sequence. This further increases the impact on performance (or increases the size overhead if copying).

1. will be slightly more work, but has the advantages:

  • faster, because no transformation of the positions
  • smaller, because: indexing with a pair of positions might allow smaller types; in bioinformatics the sizes of the two dimensions are often known at compile time and are vastly different so e.g. storing chromosome can be achieved in a pair of uint8_t and uint32_t resulting in a packed size of only 40bits, instead of 64bits, a significant difference

@cpockrandt
Copy link
Collaborator

@h-2 Size should not be an issue here. Afaik, the CSA is using bitcompressed vectors anyway (so there are no bits wasted).

@h-2
Copy link
Author

h-2 commented Sep 30, 2017

@h-2 Size should not be an issue here. Afaik, the CSA is using bitcompressed vectors anyway (so there are no bits wasted).

Hm, that may be true for the BWT, but for the sampled SA, as well? And what about the full SA during construction?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants