non-DNA alphabet #3

ekg · 2017-08-05T19:26:41Z

Does bwt-merge work with BWTs from non-DNA alphabets?

Given that it takes SGA's BWT format as input, this would presumably require either SGA to support non-DNA alphabets or a different BWT format?

I'd like to build a full-text index of all Wikipedia revisions. This is something like 100TB uncompressed, and it's not clear to me whether there is a better way to do this than an approach based on BWT merging. If there is, I'd be curious to know.

jltsiren · 2017-08-06T16:47:58Z

BWT-merge assumes alphabet size 6. Supporting a byte alphabet would require changes to the BWT encoding, the rank structure, and the optimizations in trie traversal.
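To illustrate why the encoding depends on the alphabet size, here is a minimal sketch of a byte-oriented run-length code. It is a hypothetical scheme written for this example, not taken from the BWT-merge source; the names `SIGMA`, `encode_run`, `decode_run`, and `max_run_per_byte` are all assumptions.

```python
# Hypothetical byte-oriented run-length code, for illustration only.
# Assumption: symbols are integers in [0, sigma), and sigma = 6 for a DNA-like alphabet.
SIGMA = 6


def max_run_per_byte(sigma: int) -> int:
    """Longest run length that fits in one byte together with the symbol."""
    return 256 // sigma


def encode_run(symbol: int, length: int, sigma: int = SIGMA) -> int:
    """Pack (symbol, run length) into a single byte as symbol + sigma * (length - 1)."""
    if not 0 <= symbol < sigma:
        raise ValueError("symbol outside the alphabet")
    if not 1 <= length <= max_run_per_byte(sigma):
        raise ValueError("run too long for a single byte")
    return symbol + sigma * (length - 1)


def decode_run(byte: int, sigma: int = SIGMA) -> tuple[int, int]:
    """Inverse of encode_run: recover (symbol, run length) from one byte."""
    return byte % sigma, byte // sigma + 1
```

Under this scheme, an alphabet of size 6 leaves room for runs of up to 42 symbols in one byte, while an alphabet of size 256 leaves room only for runs of length 1. A byte alphabet would therefore need a different encoding, and the rank structure built on top of it would have to change as well.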

Building a BWT for 100 terabytes of data would probably take around 10 CPU years. If you want a single FM-index, you also need enough memory for it on a single system, and the construction will take months. If you can live with p indexes, you can distribute the construction to p systems.
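As a rough back-of-the-envelope check of these numbers, the sketch below converts the 10 CPU-year estimate into an implied per-core throughput and a wall-clock time. The function names and the core counts are assumptions made for this example, and the calculation assumes perfect parallel scaling, which real construction will not achieve.

```python
# Back-of-the-envelope arithmetic only; assumes perfect parallel scaling.
SECONDS_PER_YEAR = 365.25 * 24 * 3600


def implied_throughput(data_bytes: float, cpu_years: float) -> float:
    """Bytes processed per CPU-second implied by a total CPU-time estimate."""
    return data_bytes / (cpu_years * SECONDS_PER_YEAR)


def wall_clock_days(cpu_years: float, systems: int, cores_per_system: int) -> float:
    """Wall-clock days if the work is split evenly over all cores of all systems."""
    return cpu_years * 365.25 / (systems * cores_per_system)


if __name__ == "__main__":
    # 100 TB and 10 CPU years come from the estimate above; 32 cores is an assumption.
    print(f"{implied_throughput(100e12, 10) / 1e6:.2f} MB per CPU-second")
    print(f"{wall_clock_days(10, systems=1, cores_per_system=32):.0f} days on one system")
    print(f"{wall_clock_days(10, systems=16, cores_per_system=32):.1f} days on 16 systems")
```

With these assumptions, the estimate corresponds to roughly 0.3 MB per CPU-second, and a single hypothetical 32-core system would need about 114 days, which is consistent with construction taking months. Splitting the data into p independent indexes divides the wall-clock time by p in this idealized model.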
