Implement Burrows-Wheeler transform #360
Comments
@TimothyStiles can you elaborate on exactly what we need here, and triage the issue both here and in the roadmap?
I'd love to take a whack at this! @TimothyStiles might have other intentions in mind, but the Burrows-Wheeler transform (BWT) can serve as an index for querying whether a subsequence exists within a sequence, for example whether a run of nucleotides occurs within a longer nucleotide sequence. It is very memory efficient, and exact matching of reads is fast. There is also some room for backtracking on mismatches to support fuzzy subsequence search, and maybe other opportunities to tune the BWT, like reporting N of the locations where the subsequence occurs. Maybe the exact case is a good first step? Do we want to report the location of the match as well? That seems important.
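To make the exact-match case concrete, here is a minimal sketch of BWT backward search that also reports match locations. It is written in Python purely for illustration and is not the project's implementation. The names `build_index` and `locate` are hypothetical, as is the choice of `$` as the sentinel. The naive suffix sort and the full occurrence table are simplifications that a real index would replace with more compact structures.

```python
def build_index(text, sentinel="$"):
    """Build a toy FM-index: (suffix_array, bwt, starts, occ)."""
    if sentinel in text:
        raise ValueError("text must not contain the sentinel")
    s = text + sentinel
    # Naive suffix sort; fine for a sketch, far too slow for genomes.
    sa = sorted(range(len(s)), key=lambda i: s[i:])
    # BWT = character preceding each sorted suffix (s[-1] is the sentinel).
    bwt = "".join(s[i - 1] for i in sa)
    # starts[c]: how many characters in s sort strictly before c.
    starts, total = {}, 0
    for c in sorted(set(s)):
        starts[c] = total
        total += s.count(c)
    # occ[c][i]: occurrences of c in bwt[:i].
    occ = {c: [0] * (len(s) + 1) for c in starts}
    for i, ch in enumerate(bwt):
        for c in occ:
            occ[c][i + 1] = occ[c][i] + (ch == c)
    return sa, bwt, starts, occ


def locate(index, pattern):
    """Return the sorted start offsets of every exact match of pattern."""
    sa, _, starts, occ = index
    lo, hi = 0, len(sa)  # half-open range of matching suffix-array rows
    for c in reversed(pattern):  # backward search: last character first
        if c not in starts:
            return []
        lo = starts[c] + occ[c][lo]
        hi = starts[c] + occ[c][hi]
        if lo >= hi:
            return []
    return sorted(sa[lo:hi])
```

With this sketch, `locate(build_index("GATTACATTA"), "TTA")` returns the offsets of both occurrences, and counting matches is just the length of that list. Keeping the whole suffix array is what makes location reporting trivial here. Sampling it would trade lookup time for memory.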
On the alignment side, I think this basically becomes bwa (https://bio-bwa.sourceforge.net). What I think could be neat is if you could somehow get 98% matching working, so it could be used for #396 in auto-annotating features. Right now that is done in plannotate with BLAST, but I'm pretty sure it is done that way because BLAST is an easy tool to just import and use. This could be a good application! For actual large-scale sequence alignment, minimap2 is probably a better route: it is tried and true with nanopore-type alignment, which IMO is the best upcoming DNA sequencing method. (It's also what I use right now.)
@TwFlem I'd love to see you take a crack at this. At a conference on my phone rn. What we're ultimately looking for is a fast BWT implementation that essentially compresses our sequence so it can be used for search, alignment, etc. BWT is common enough that LLMs can create a decent naive implementation that can be easily unit tested (I've actually done this). What I'd suggest is that we get a naive implementation going that we can merge early with high test coverage and documentation, then optimize that implementation with @matiasinsaurralde's, @soypat's, or @/josharian's (on discord) input. That way we don't get stuck on optimization early, and different people can focus on optimization and downstream applications at the same time without blocking each other.
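As a rough illustration of what a naive, easily unit-tested starting point could look like, the sketch below builds the transform from sorted rotations and inverts it by repeated sorting. It is in Python for brevity and is not the code that was merged. The function names and the `$` sentinel are assumptions made for the example.

```python
def bwt(text, sentinel="$"):
    """Naive BWT: last column of the sorted rotations of text + sentinel."""
    if sentinel in text:
        raise ValueError("text must not contain the sentinel")
    s = text + sentinel
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rotation[-1] for rotation in rotations)


def inverse_bwt(transformed, sentinel="$"):
    """Naive inverse: rebuild the sorted rotation table one column at a time."""
    table = [""] * len(transformed)
    for _ in range(len(transformed)):
        # Prepend the BWT column to every row, then re-sort the rows.
        table = sorted(c + row for c, row in zip(transformed, table))
    # The row that ends with the sentinel is the original text + sentinel.
    original = next(row for row in table if row.endswith(sentinel))
    return original[:-1]
```

A round-trip check such as `inverse_bwt(bwt(seq)) == seq` gives a simple property to unit test before any optimization work starts, and it keeps working as a regression test once the naive construction is replaced.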
* basic bitvector
* jacobsons start and refactor to uint for accurate machine words
* confident that jacobson rank is working
* reusing the incoming bitvector instead of copying everyithing for jacobson rank
* access and bounds checking
* just do uint64 for simplicity. bound checking and access
* bit vector fixes, rsa good enough, wavelet start
* Simple wavelet tree with access
* wavelet fix access, add select, fix rsa bitvector select
* got count working, but had to throw out jacobsons
* rsa fixes and refactors
* bwt locate
* extract
* doc BWT, refactor, and return a possible error during construction
* add TODO about sorting and the nullChar
* bwt examples
* wavelet tree doc
* wavelet tree explanation
* doc and note for waveletTree
* add bwt high level. move wavelet tree's some rsa bv docs
* simplify bitvector, docs for bitvector and rsaBitvector
* Cite Ben Langmead.
---------
Co-authored-by: Willow Carretero Chavez <sandiegobutterflies@gmail.com>
Co-authored-by: Timothy Stiles <tim@stiles.io>
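For readers unfamiliar with the structures named in the commit list, the following is a minimal sketch of a rank-supporting bitvector and a wavelet tree offering `access` and `rank` over a string such as a BWT. It is an illustration in Python, not the merged code. The class and method names are hypothetical, and the rank support here is a plain prefix-count array rather than Jacobson's rank or any other succinct scheme.

```python
class RankBitVector:
    """Bits plus prefix counts, so rank queries are O(1) (not succinct)."""

    def __init__(self, bits):
        self.bits = list(bits)
        self.prefix = [0]  # prefix[i] = number of 1 bits in bits[:i]
        for b in self.bits:
            self.prefix.append(self.prefix[-1] + b)

    def access(self, i):
        return self.bits[i]

    def rank1(self, i):
        return self.prefix[i]

    def rank0(self, i):
        return i - self.prefix[i]


class WaveletTree:
    """Each node splits its alphabet in half: 0 = lower half, 1 = upper half."""

    def __init__(self, text, alphabet=None):
        self.alphabet = sorted(set(text)) if alphabet is None else alphabet
        self.bv = self.left = self.right = None
        if len(self.alphabet) > 1:  # internal node; otherwise a leaf
            mid = len(self.alphabet) // 2
            self.low = set(self.alphabet[:mid])
            self.bv = RankBitVector(0 if c in self.low else 1 for c in text)
            self.left = WaveletTree(
                [c for c in text if c in self.low], self.alphabet[:mid])
            self.right = WaveletTree(
                [c for c in text if c not in self.low], self.alphabet[mid:])

    def access(self, i):
        """Return the character at position i of the original text."""
        node = self
        while node.bv is not None:
            if node.bv.access(i):
                i, node = node.bv.rank1(i), node.right
            else:
                i, node = node.bv.rank0(i), node.left
        return node.alphabet[0]

    def rank(self, c, i):
        """Return the number of occurrences of c in text[:i]."""
        node = self
        while node.bv is not None:
            if c in node.low:
                i, node = node.bv.rank0(i), node.left
            else:
                i, node = node.bv.rank1(i), node.right
        return i if node.alphabet == [c] else 0
```

In an FM-index, `rank(c, i)` over the BWT plays the role of the occurrence table used during backward search, which is why counting depends on the bitvector rank being correct.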
This issue has had no activity in the past 2 months. Marking as