non-DNA alphabet #3

Open
ekg opened this issue Aug 5, 2017 · 1 comment

Comments

@ekg

ekg commented Aug 5, 2017

Does bwt-merge work with BWTs from non-DNA alphabets?

Given that it takes SGA's BWT format as input, this would presumably require either SGA to support non-DNA alphabets as well, or another BWT format?

I'd like to build a full-text index of all Wikipedia revisions. This is something like 100 TB uncompressed, and it's not clear to me whether there is a better way to do this than an approach based on BWT merging. If there is, I'd be curious to know.

@jltsiren
Owner

jltsiren commented Aug 6, 2017

BWT-merge assumes an alphabet of size 6. Supporting a byte alphabet would require changes to the BWT encoding, the rank structure, and the optimizations in trie traversal.
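
To illustrate why alphabet size matters for a rank structure, here is a minimal sketch of a generic sampled-counts rank scheme. This is not BWT-merge's actual data structure; the function names, the block size, and the plain uncompressed layout are all assumptions made for illustration. The point is only that the number of stored counters grows linearly with the alphabet size, so a design tuned for 6 symbols does not carry over unchanged to 256.

```python
def build_rank_samples(bwt: bytes, sigma: int, block: int) -> list[list[int]]:
    """Store the cumulative count of every symbol at each block boundary.

    Hypothetical helper: symbols are assumed to be integers in [0, sigma).
    """
    samples = []
    counts = [0] * sigma
    for i, c in enumerate(bwt):
        if i % block == 0:
            samples.append(counts.copy())
        counts[c] += 1
    if not samples:  # empty input: keep one all-zero sample
        samples.append(counts.copy())
    return samples


def rank(bwt: bytes, samples: list[list[int]], block: int, c: int, i: int) -> int:
    """Number of occurrences of symbol c in bwt[0:i]."""
    b = min(i // block, len(samples) - 1)
    # Start from the nearest sample at or before i, then scan the remainder.
    return samples[b][c] + bwt[b * block:i].count(c)


def sample_counters(n: int, sigma: int, block: int) -> int:
    """Total counters stored by the scheme above for a text of length n."""
    return -(-n // block) * sigma  # ceil(n / block) * sigma


# Small example over an alphabet of size 6.
bwt = bytes([0, 1, 2, 1, 1, 5, 3, 1])
samples = build_rank_samples(bwt, sigma=6, block=4)
print(rank(bwt, samples, 4, c=1, i=8))

# Same text length and block size, alphabet 6 versus a byte alphabet.
print(sample_counters(10**9, 6, 64), sample_counters(10**9, 256, 64))
```

With the same block size, the byte alphabet stores 256/6 (about 43) times as many counters in this naive scheme, which is one reason a byte-alphabet version would need a different rank structure rather than a changed constant.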

Building a BWT for 100 terabytes of data would probably take around 10 CPU years. If you want a single FM-index, you also need enough memory for it on a single system, and the construction will take months. If you can live with p indexes, you can distribute the construction to p systems.
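
As a rough back-of-the-envelope check of these numbers, the sketch below converts the 10 CPU-year estimate into an implied per-core throughput and an idealized wall-clock time. The function names are hypothetical, terabytes are taken as 10^12 bytes, the 32 cores per system is an assumed figure, and perfectly linear scaling across cores and systems is assumed, which real construction will not achieve.

```python
SECONDS_PER_YEAR = 365.25 * 24 * 3600


def implied_throughput(total_bytes: float, cpu_years: float) -> float:
    """Bytes processed per second per core implied by a CPU-time estimate."""
    return total_bytes / (cpu_years * SECONDS_PER_YEAR)


def wall_clock_days(cpu_years: float, systems: int, cores_per_system: int) -> float:
    """Idealized wall-clock days, assuming perfectly parallel work."""
    return cpu_years * 365.25 / (systems * cores_per_system)


total_bytes = 100e12  # 100 TB, decimal terabytes (assumption)
cpu_years = 10        # estimate from the comment above

print(f"{implied_throughput(total_bytes, cpu_years) / 1e6:.2f} MB/s per core")
for p in (1, 8):      # p systems, each with an assumed 32 cores
    print(p, "systems:", round(wall_clock_days(cpu_years, p, 32), 1), "days")
```

Under these assumptions the estimate corresponds to roughly 0.32 MB/s per core, and a single 32-core system would need about 114 days, which is consistent with "months". Splitting the work into p independent indexes divides the idealized wall-clock time by p.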
