Implement Burrows-Wheeler transform #360

carreter · 2023-09-23T04:38:04Z

No description provided.

carreter · 2023-09-23T05:22:14Z

@TimothyStiles can you elaborate on what exactly it is we need here and triage the issue here + in the roadmap?

TwFlem · 2023-11-09T18:04:47Z

I'd love to take a whack at this!

@TimothyStiles might have some other intentions in mind, but the Burrows-Wheeler Transform (BWT) can be used as a data structure to query whether or not a subsequence exists within a sequence.

This can be useful for determining whether or not a subsequence of nucleotides exists within a sequence of nucleotides.

It is very memory efficient, and exact matches on reads are fast.

There is also some room for backtracking on seeming mismatches for some fuzzy subsequence searching, and maybe some other opportunities to tune the BWT, like reporting the N locations where this subsequence exists.

Maybe the exact case is a good first step? Do we want to report the location of the match as well? Seems important.
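To make the exact-match case concrete, here is a minimal sketch of BWT backward search that reports match locations. It is not poly's API: `fmIndex`, `newFMIndex`, and `locate` are hypothetical names, and it assumes `$` never appears in the input and sorts before every input character. It also keeps the full suffix array and occurrence tables uncompressed, so it shows the query logic only, not the memory savings.

```go
package main

import (
	"fmt"
	"sort"
)

// fmIndex is a hypothetical, unoptimized index for illustration only. It keeps
// the full suffix array and full occurrence tables, which a real
// implementation would compress or sample.
type fmIndex struct {
	sa    []int          // suffix array of text + "$"
	first [256]int       // first[c] = number of characters in text+"$" smaller than c
	occ   map[byte][]int // occ[c][i] = occurrences of c in bwt[:i]
}

// newFMIndex assumes '$' does not occur in text and sorts before every
// character that does (true for upper-case DNA in ASCII).
func newFMIndex(text string) *fmIndex {
	t := text + "$"
	n := len(t)
	sa := make([]int, n)
	for i := range sa {
		sa[i] = i
	}
	// Naive suffix sort; fine for a sketch, too slow for genomes.
	sort.Slice(sa, func(a, b int) bool { return t[sa[a]:] < t[sa[b]:] })

	idx := &fmIndex{sa: sa, occ: make(map[byte][]int)}
	var counts [256]int
	bwt := make([]byte, n)
	for i, s := range sa {
		bwt[i] = t[(s+n-1)%n] // character preceding each sorted suffix
		counts[bwt[i]]++
	}
	sum := 0
	for c := 0; c < 256; c++ {
		idx.first[c] = sum
		sum += counts[c]
		if counts[c] == 0 {
			continue
		}
		row := make([]int, n+1)
		for i, b := range bwt {
			row[i+1] = row[i]
			if int(b) == c {
				row[i+1]++
			}
		}
		idx.occ[byte(c)] = row
	}
	return idx
}

// locate returns the sorted start positions of every exact occurrence of
// pattern, or nil if there is none. It walks the pattern right to left,
// narrowing a half-open range [lo, hi) of suffix-array rows.
func (f *fmIndex) locate(pattern string) []int {
	if pattern == "" {
		return nil
	}
	lo, hi := 0, len(f.sa)
	for i := len(pattern) - 1; i >= 0; i-- {
		c := pattern[i]
		row, ok := f.occ[c]
		if !ok {
			return nil // character never appears in the text
		}
		lo = f.first[c] + row[lo]
		hi = f.first[c] + row[hi]
		if lo >= hi {
			return nil
		}
	}
	hits := append([]int(nil), f.sa[lo:hi]...)
	sort.Ints(hits)
	return hits
}

func main() {
	idx := newFMIndex("GATTACATTAC")
	fmt.Println(idx.locate("TTA")) // [2 7]
}
```

Counting matches is just `hi - lo` at the end of the loop, so a count-only query never has to touch the suffix array.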

Koeng101 · 2023-11-09T18:18:25Z

> There is also some room for backtracking on seeming mismatches for some fuzzy subsequence searching, and maybe some other opportunities to tune the BWT, like reporting the N locations where this subsequence exists.

On the alignment side, I think this basically becomes bwa (https://bio-bwa.sourceforge.net). What I think could be neat is if you could somehow get 98% matching working, so it could be used for #396 in auto-annotating features. Right now that is done in plannotate with BLAST, but I'm pretty sure it is done that way because BLAST is an easy tool to just import and use. This could be a good application!

For actual large-scale sequence alignment, minimap2 is probably a better route - tried and true with nanopore-type alignment, which IMO is the best upcoming DNA sequencing method. (Also it's what I use right now)

TimothyStiles · 2023-11-09T22:06:01Z

> @TimothyStiles might have some other intentions in mind, but the Burrows-Wheeler Transform (BWT) can be used as a data structure to query whether or not a subsequence exists within a sequence.
>
> This can be useful for determining whether or not a subsequence of nucleotides exists within a sequence of nucleotides.
>
> It is very memory efficient, and exact matches on reads are fast.
>
> There is also some room for backtracking on seeming mismatches for some fuzzy subsequence searching, and maybe some other opportunities to tune the BWT, like reporting the N locations where this subsequence exists.
>
> Maybe the exact case is a good first step? Do we want to report the location of the match as well? Seems important.

@TwFlem I'd love to see you take a crack at this. At a conference on my phone rn.

What we're ultimately looking for is a fast BWT implementation that essentially compresses our sequence to be used for search, alignment, etc.

BWT is common enough that LLMs can create a decent naive implementation that can be easily unit tested (I've actually done this).

What I'd suggest is that we get a naive implementation going that we can merge early with high test coverage and documentation, then optimize that implementation with @matiasinsaurralde's, @soypat's, or @/josharian's (on discord) input. That way we don't get stuck on optimization early, and different people can focus on optimization and downstream applications at the same time without blocking each other.
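As a rough illustration of what such a naive, easily tested starting point could look like, here is a sketch that builds the BWT by sorting all rotations and inverts it by repeated sorting. The names `naiveBWT` and `inverseBWT` and the `$` sentinel are assumptions for this example, not anything taken from poly. Both functions are deliberately simple and slow, which makes them useful as a reference to test an optimized version against.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// sentinel is assumed not to occur in the input and to sort before every
// input character (true for ASCII letters).
const sentinel = "$"

// naiveBWT sorts every rotation of text+sentinel and returns the last column.
func naiveBWT(text string) string {
	t := text + sentinel
	rotations := make([]string, len(t))
	for i := 0; i < len(t); i++ {
		rotations[i] = t[i:] + t[:i]
	}
	sort.Strings(rotations)
	var b strings.Builder
	for _, r := range rotations {
		b.WriteByte(r[len(r)-1])
	}
	return b.String()
}

// inverseBWT rebuilds the sorted rotation table one column at a time by
// prepending the BWT column and re-sorting, then returns the row that ends
// with the sentinel (minus the sentinel).
func inverseBWT(bwt string) string {
	table := make([]string, len(bwt))
	for round := 0; round < len(bwt); round++ {
		for i := range table {
			table[i] = string(bwt[i]) + table[i]
		}
		sort.Strings(table)
	}
	for _, row := range table {
		if strings.HasSuffix(row, sentinel) {
			return strings.TrimSuffix(row, sentinel)
		}
	}
	return ""
}

func main() {
	fmt.Println(naiveBWT("banana"))              // annb$aa
	fmt.Println(inverseBWT(naiveBWT("GATTACA"))) // GATTACA
}
```

A round-trip check like the second line in `main` is the kind of property that is cheap to put under unit test before any optimization work starts.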

* basic bitvector
* jacobsons start and refactor to uint for accurate machine words
* confident that jacobson rank is working
* reusing the incoming bitvector instead of copying everything for jacobson rank
* access and bounds checking
* just do uint64 for simplicity. bound checking and access
* bit vector fixes, rsa good enough, wavelet start
* Simple wavelet tree with access
* wavelet fix access, add select, fix rsa bitvector select
* got count working, but had to throw out jacobsons
* rsa fixes and refactors
* bwt locate
* extract
* doc BWT, refactor, and return a possible error during construction
* add TODO about sorting and the nullChar
* bwt examples
* wavelet tree doc
* wavelet tree explanation
* doc and note for waveletTree
* add bwt high level. move wavelet tree's some rsa bv docs
* simplify bitvector, docs for bitvector and rsaBitvector
* Cite Ben Langmead.

---------

Co-authored-by: Willow Carretero Chavez <sandiegobutterflies@gmail.com>
Co-authored-by: Timothy Stiles <tim@stiles.io>
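For readers unfamiliar with the bitvector and rank steps named in the list above, the following is a minimal sketch of a rank-supporting bitvector over `uint64` words. It is not the structure that was merged: `rankBitvector`, `newRankBitvector`, `access`, and `rank1` are hypothetical names, and it uses one running popcount per word rather than Jacobson's two-level scheme. It does no bounds checking, so out-of-range indices will panic.

```go
package main

import (
	"fmt"
	"math/bits"
)

// rankBitvector is a hypothetical sketch: bits packed into uint64 words plus
// a running popcount per word, so rank needs one lookup and one popcount.
type rankBitvector struct {
	words  []uint64
	prefix []int // prefix[w] = number of set bits in words[:w]
	n      int   // number of valid bits
}

func newRankBitvector(in []bool) *rankBitvector {
	words := make([]uint64, (len(in)+63)/64)
	for i, b := range in {
		if b {
			words[i/64] |= uint64(1) << uint(i%64)
		}
	}
	prefix := make([]int, len(words)+1)
	for w, word := range words {
		prefix[w+1] = prefix[w] + bits.OnesCount64(word)
	}
	return &rankBitvector{words: words, prefix: prefix, n: len(in)}
}

// access reports whether bit i is set, for 0 <= i < n.
func (bv *rankBitvector) access(i int) bool {
	return (bv.words[i/64]>>uint(i%64))&1 == 1
}

// rank1 returns the number of set bits in positions [0, i), for 0 <= i <= n.
func (bv *rankBitvector) rank1(i int) int {
	w, off := i/64, uint(i%64)
	r := bv.prefix[w]
	if off > 0 {
		// Count only the low `off` bits of the partially covered word.
		r += bits.OnesCount64(bv.words[w] & (uint64(1)<<off - 1))
	}
	return r
}

func main() {
	bv := newRankBitvector([]bool{true, false, true, true, false})
	fmt.Println(bv.rank1(3)) // 2
}
```

A wavelet tree stacks bitvectors like this one, one per level, so that rank over an arbitrary alphabet reduces to a few binary rank queries.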

github-actions · 2024-02-06T18:28:16Z

This issue has had no activity in the past 2 months. Marking as stale.

TimothyStiles added this to poly development roadmap Feb 14, 2023

carreter converted this from a draft issue Sep 23, 2023

carreter changed the title ~~Implement BurrowsWheeler~~ Implement Burrows-Wheeler transform Sep 23, 2023

carreter added this to the v1.0 milestone Sep 23, 2023

carreter added the needs-triage label Sep 23, 2023

TimothyStiles added the good first issue and proposal labels and removed the needs-triage label Nov 29, 2023

TwFlem mentioned this issue Dec 6, 2023

#360 Bwt #411

Merged

6 tasks

TimothyStiles assigned TwFlem Dec 8, 2023

github-actions bot added the stale label Feb 6, 2024

TimothyStiles closed this as completed Feb 6, 2024

github-project-automation bot moved this from Todo to Done in poly development roadmap Feb 6, 2024
