feature: Allow to alignment between protein or nucleotide sequences #16

Closed
Tracked by #35
notalfredo opened this issue Dec 31, 2022 · 7 comments
Assignees
Labels
lvl-1-easy Easy-ranking issue p1-low Priority 1: Generally no one plans to work on the task, but it would be nice if someone decides to. t-feature-request Type: Idea/request of an enhancement towards a library/framework

Comments

@notalfredo
Member

This can be done with the Needleman–Wunsch algorithm. As the title mentions, it's an algorithm that lets you align protein or nucleotide sequences. It will live in its own file, following the project's convention.
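For reference, a minimal sketch of Needleman–Wunsch global alignment in Rust. The function name, the scoring constants, and the return type are illustrative assumptions, not this project's API; the scoring scheme (match +1, mismatch -1, linear gap -1) is just one common choice.

```rust
// Illustrative scoring values (assumed, not taken from the project).
const MATCH: i32 = 1;
const MISMATCH: i32 = -1;
const GAP: i32 = -1;

/// Globally aligns `a` and `b`.
/// Returns the alignment score and both sequences with '-' marking gaps.
fn needleman_wunsch(a: &str, b: &str) -> (i32, String, String) {
    let a: Vec<char> = a.chars().collect();
    let b: Vec<char> = b.chars().collect();
    let (n, m) = (a.len(), b.len());

    // score[i][j] = best score for aligning a[..i] with b[..j].
    let mut score = vec![vec![0i32; m + 1]; n + 1];
    for i in 1..=n {
        score[i][0] = i as i32 * GAP;
    }
    for j in 1..=m {
        score[0][j] = j as i32 * GAP;
    }
    for i in 1..=n {
        for j in 1..=m {
            let sub = if a[i - 1] == b[j - 1] { MATCH } else { MISMATCH };
            let diag = score[i - 1][j - 1] + sub;
            let up = score[i - 1][j] + GAP;
            let left = score[i][j - 1] + GAP;
            score[i][j] = diag.max(up).max(left);
        }
    }

    // Trace back from the bottom-right corner to recover one optimal alignment.
    let (mut i, mut j) = (n, m);
    let (mut out_a, mut out_b) = (Vec::new(), Vec::new());
    while i > 0 || j > 0 {
        if i > 0 && j > 0 {
            let sub = if a[i - 1] == b[j - 1] { MATCH } else { MISMATCH };
            if score[i][j] == score[i - 1][j - 1] + sub {
                out_a.push(a[i - 1]);
                out_b.push(b[j - 1]);
                i -= 1;
                j -= 1;
                continue;
            }
        }
        if i > 0 && score[i][j] == score[i - 1][j] + GAP {
            // Gap in `b`: consume a character of `a` only.
            out_a.push(a[i - 1]);
            out_b.push('-');
            i -= 1;
        } else {
            // Gap in `a`: consume a character of `b` only.
            out_a.push('-');
            out_b.push(b[j - 1]);
            j -= 1;
        }
    }
    out_a.reverse();
    out_b.reverse();
    (score[n][m], out_a.into_iter().collect(), out_b.into_iter().collect())
}
```

With these assumed scores, aligning "ACGT" with "AGT" yields a score of 2 and the pair "ACGT" / "A-GT". The full matrix costs O(n·m) memory, which is needed here only because of the traceback.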

@notalfredo notalfredo added lvl-1-easy Easy-ranking issue p1-low Priority 1: Generally no one plans to work on the task, but it would be nice if someone decides to. t-feature-request Type: Idea/request of an enhancement towards a library/framework labels Dec 31, 2022
@notalfredo notalfredo changed the title feature: Allow to alignment between protein or nucleotide sequences #12 feature: Allow to alignment between protein or nucleotide sequences Dec 31, 2022
@neoncitylights
Contributor

neoncitylights commented Dec 31, 2022

Thanks for submitting! It's an interesting idea, and it's definitely a use case for the Levenshtein distance algorithm. Is this algorithm purely for biology?

From the perspective of a library user (not developer), the Hamming and Levenshtein distance algorithms have a wide range of applications. That includes biology, but not solely biology. Ideally, a library should only ship what will be used. Those two algorithms are (at least right now) the main focus, whereas Needleman–Wunsch is biology-focused.

I do like the idea though, and I think it'd make more sense if we turned this repository into a monorepo of related crates. We can do this by using a "Cargo workspace" (https://doc.rust-lang.org/book/ch14-03-cargo-workspaces.html). If you look at the GitHub repository for the serde crate, there are multiple crates in there, such as the main serde crate, serde_derive, and serde_derive_internals (a crate used purely internally by the developers).

I think we can do something like this, except with all crates living in a /crates directory. So we could have something like:

  • differ: Library for just the pure distance/similarity algorithms
  • needleman_wunsch: Library for the Needleman Wunsch algorithm, which can have differ as a dependency (if it needs it)
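A rough sketch of what that layout could look like in Cargo terms. The paths, version, and edition below are assumptions for illustration; only the crate names come from the list above.

```toml
# --- Cargo.toml at the repository root (hypothetical) ---
[workspace]
resolver = "2"
members = ["crates/differ", "crates/needleman_wunsch"]

# --- crates/needleman_wunsch/Cargo.toml (hypothetical) ---
[package]
name = "needleman_wunsch"
version = "0.1.0"
edition = "2021"

[dependencies]
# Path dependency on the sibling crate inside the same workspace.
differ = { path = "../differ" }
```

All members share one Cargo.lock and one target directory, so the sibling crate is built once for the whole workspace.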

And in the future, a workspace would also make room for a crate like semantic_differ (example/placeholder name). I think you remember us talking about this: it would do semantic-style diffing, computing the difference between two words in a linguistic manner. E.g. "were" and "was" are technically 1/4 similar, but they're just two differences. Another is "person" and "people", which would get a low-ish score even though they're semantically similar; the word just became plural.
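To make the numbers concrete, here is a small standalone sketch of a purely character-based similarity. It assumes similarity is defined as 1 minus the Levenshtein distance divided by the longer word's length; the function names are hypothetical and are not the differ crate's API.

```rust
/// Levenshtein distance using two rolling rows.
fn levenshtein(a: &str, b: &str) -> usize {
    let a: Vec<char> = a.chars().collect();
    let b: Vec<char> = b.chars().collect();
    // prev[j] = distance between the processed prefix of `a` and b[..j].
    let mut prev: Vec<usize> = (0..=b.len()).collect();
    let mut curr: Vec<usize> = vec![0; b.len() + 1];
    for i in 1..=a.len() {
        curr[0] = i;
        for j in 1..=b.len() {
            let cost: usize = if a[i - 1] == b[j - 1] { 0 } else { 1 };
            curr[j] = (prev[j - 1] + cost) // substitution (or match)
                .min(prev[j] + 1)          // deletion
                .min(curr[j - 1] + 1);     // insertion
        }
        std::mem::swap(&mut prev, &mut curr);
    }
    prev[b.len()]
}

/// Normalized similarity in [0, 1] (assumed definition, see above).
fn similarity(a: &str, b: &str) -> f64 {
    let max_len = a.chars().count().max(b.chars().count());
    if max_len == 0 {
        return 1.0;
    }
    1.0 - levenshtein(a, b) as f64 / max_len as f64
}
```

Under this definition, "were" vs "was" has distance 3 and similarity 0.25, and "person" vs "people" has distance 4, so both pairs score low despite being closely related words.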

@neoncitylights
Contributor

If this is something you're interested in, then we should first create an issue to set up the repository as a monorepo, and then we can create a crate for the Needleman–Wunsch algorithm.

@notalfredo
Member Author

As of right now there are two algorithms I would like to implement in bio_diff:

  • Needleman–Wunsch algorithm
  • Smith–Waterman algorithm

Both have to do with aligning protein or nucleotide sequences. Each algorithm will have its own file, similar to how differ is structured. I am thinking of reusing the same enums and structs, except that I plan on implementing the memory optimizations from issue #25 from the start.
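The contents of issue #25 are not shown in this thread, so the following is only an assumption about what such an optimization might look like: when just the score is needed, the dynamic-programming table can be reduced to two rows. The sketch below applies that idea to Smith–Waterman local alignment; the function name and scoring values are illustrative.

```rust
/// Best local alignment score between `a` and `b` (Smith–Waterman),
/// keeping only two rows of the table: O(len(b)) memory instead of O(n*m).
fn smith_waterman_score(a: &str, b: &str) -> i32 {
    // Illustrative scoring values (assumed).
    const MATCH: i32 = 2;
    const MISMATCH: i32 = -1;
    const GAP: i32 = -1;

    let a: Vec<char> = a.chars().collect();
    let b: Vec<char> = b.chars().collect();
    let mut prev = vec![0i32; b.len() + 1];
    let mut curr = vec![0i32; b.len() + 1];
    let mut best: i32 = 0;

    for i in 1..=a.len() {
        curr[0] = 0;
        for j in 1..=b.len() {
            let sub = if a[i - 1] == b[j - 1] { MATCH } else { MISMATCH };
            // Unlike global alignment, a cell never drops below zero,
            // which lets an alignment restart anywhere.
            let cell = (prev[j - 1] + sub)
                .max(prev[j] + GAP)
                .max(curr[j - 1] + GAP)
                .max(0);
            curr[j] = cell;
            best = best.max(cell);
        }
        std::mem::swap(&mut prev, &mut curr);
    }
    best
}
```

The trade-off is that two rows are not enough to recover the aligned strings themselves; producing the actual alignment needs either the full table or a divide-and-conquer scheme such as Hirschberg's.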

@neoncitylights
Contributor

> I am thinking of reusing the same enums and structs, except that I plan on implementing the memory optimizations from issue #25 from the start.

By the way, I mentioned earlier you can have a crate as a dependency for another crate :) So, you can have bio_diff depend on the differ crate. By doing this, you won't have to re-implement anything.

@notalfredo
Member Author

> > I am thinking of reusing the same enums and structs, except that I plan on implementing the memory optimizations from issue #25 from the start.
>
> By the way, I mentioned earlier you can have a crate as a dependency for another crate :) So, you can have bio_diff depend on the differ crate. By doing this, you won't have to re-implement anything.

If a crate depends on another crate, does this have any performance downsides? Also, if bio_diff depends on differ, does the user just have access to bio_diff, or also to differ?

@neoncitylights
Contributor

neoncitylights commented Jan 2, 2023

> If a crate depends on another crate, does this have any performance downsides?

No performance downsides here. Think of it this way: if bio_diff didn't depend on differ, both libraries would have to duplicate code, and that duplication would be the real cost for a user who pulls in both. It'd also be a burden on the developer to maintain duplicate code.

> Also, if bio_diff depends on differ, does the user just have access to bio_diff, or also to differ?

They'd just have access to bio_diff, but the user can specify differ as an explicit dependency. Cargo has a dependency resolver for the situation where a project has common dependencies: compatible versions of a shared crate are unified into a single copy, which helps keep the binary size down, so this is not a worry. :) There's an official page on this, which is a longer read, if you want to learn more about the internal details: https://doc.rust-lang.org/cargo/reference/resolver.html

notalfredo added a commit that referenced this issue Jan 5, 2023
@neoncitylights
Contributor

neoncitylights commented Jul 1, 2023

Declining for now; see #42 and #43. This can be written inside a separate repository.

@neoncitylights neoncitylights closed this as not planned Jul 1, 2023