Poly n Plannotate #396

Closed
Koeng101 opened this issue Nov 7, 2023 · 5 comments
Labels
enhancement New feature or request help wanted Extra attention is needed intermediate Will take some time to fix

Comments

@Koeng101
Contributor

Koeng101 commented Nov 7, 2023

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8262757/

I'd like to get the plannotate auto annotation suite working with poly.

Here's a link to the code: https://github.com/mmcguffi/pLannotate/tree/master

Basically, this would let us auto-annotate plasmids. A very useful task!

@Koeng101 Koeng101 added enhancement New feature or request help wanted Extra attention is needed intermediate Will take some time to fix labels Nov 7, 2023
@abondrn
Contributor

abondrn commented Nov 7, 2023

Would be happy to take this on! Here are two ways we could approach it:

  1. Closely integrate with the plannotate batch CLI, which takes FASTA files and produces output files (GenBank or CSV). This has the obvious benefit of being quicker to implement, test, and review. Because it calls out to Python via the CLI, future updates from plannotate would not have to be ported to Go in order to be used. This is what I am leaning towards.
  2. Faithfully port the core logic, which uses several CLI tools (blastn, diamond, infernal) to query several databases that are distributed with plannotate (found here), then aggregates the results with pandas. This option would result in fast annotation, and adding local alignment search natively to poly would unlock future functionality such as CRISPR gRNA design.

In short, both options require calling out to external CLI tooling. Option 2 avoids the Python dependency, but requires additional work as a result.
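A minimal sketch of what the option 1 wrapper could look like, shown in Python for brevity (a Go version would use `os/exec` the same way). The `-i`, `--output`, and `--csv` flags are taken from my reading of the pLannotate README and should be checked against `plannotate batch --help`; the function names `build_plannotate_command` and `annotate` are hypothetical.

```python
import subprocess
from pathlib import Path


def build_plannotate_command(fasta_path, output_dir, csv=True):
    """Build the argument list for one `plannotate batch` run.

    Assumption: the flags -i, --output and --csv behave as described in the
    pLannotate README. Verify with `plannotate batch --help`.
    """
    cmd = ["plannotate", "batch", "-i", str(fasta_path), "--output", str(output_dir)]
    if csv:
        cmd.append("--csv")  # request a CSV report alongside the default output
    return cmd


def annotate(fasta_path, output_dir):
    """Run pLannotate as a subprocess and return the files it created."""
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)
    before = set(out.iterdir())
    # check=True raises CalledProcessError if plannotate exits non-zero.
    subprocess.run(build_plannotate_command(fasta_path, out), check=True)
    return sorted(set(out.iterdir()) - before)
```

Keeping the command construction separate from the subprocess call means the wrapper can be unit tested without pLannotate installed.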

@abondrn
Contributor

abondrn commented Nov 7, 2023

One callout: plannotate is distributed under the GNU GPL v3 license. This shouldn't impact option 1, as we do not plan to distribute poly with pLannotate, but it will impact users that may want to use poly + plannotate. Option 2 may be impacted, as poly may become a derivative work, which we don't want if we want to keep using the MIT license.

@Koeng101
Contributor Author

Koeng101 commented Nov 7, 2023

> One callout: plannotate is distributed under the GNU GPL v3 license. This shouldn't impact option 1, as we do not plan to distribute poly with pLannotate, but it will impact users that may want to use poly + plannotate. Option 2 may be impacted, as poly may become a derivative work, which we don't want if we want to keep using the MIT license.

One bit here: DNA cannot be copyrighted. The most important thing that they've made, in my opinion, is the sweet, sweet database of part features. We should be able to use the raw sequences without infringing on any copyright. Translation to a whole new language means it probably isn't a derivative work on the code level.

I suppose I should be more specific with the desire here: I would like the abilities of plannotate, regardless of implementation. So option 2, though I don't think we have to care much about faithfully reproducing the core logic! We just need 98% matching to the full sequence, i.e., Table 1.

If the goal is to write no new code, you can probably just select down the possible matches using mash, then do a Needleman-Wunsch alignment using align. I've found it's really, really slow, but it could work. There is a reason BLAST is a thing.
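A rough sketch of that filter-then-align idea, in Python for illustration only. It is not poly's `mash` or `align` API: exact k-mer Jaccard similarity stands in for the MinHash estimate that mash computes, and the Needleman-Wunsch scorer uses a simple match/mismatch/gap scheme. The function names, `k`, the scoring values, and the `min_jaccard` cutoff are all assumptions, and the query is assumed to be a candidate region of roughly feature length rather than a whole plasmid.

```python
def kmers(seq, k=8):
    """Return the set of all k-mers in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard(a, b, k=8):
    """Exact k-mer Jaccard similarity (mash estimates this with MinHash)."""
    ka, kb = kmers(a, k), kmers(b, k)
    if not ka or not kb:
        return 0.0
    return len(ka & kb) / len(ka | kb)


def needleman_wunsch_score(a, b, match=1, mismatch=-1, gap=-1):
    """Global alignment score, keeping only one row of the DP table."""
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]


def best_feature(query, features, k=8, min_jaccard=0.1):
    """Cheap k-mer prefilter first, then align only the survivors."""
    candidates = {
        name: seq for name, seq in features.items()
        if jaccard(query, seq, k) >= min_jaccard
    }
    if not candidates:
        return None
    return max(candidates, key=lambda n: needleman_wunsch_score(query, candidates[n]))
```

The alignment step is O(n·m) per candidate, which is where the slowness comes from; the prefilter only helps by shrinking the number of candidates that reach it.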

The other option would be getting blast or minimap2 or the like integrated into Poly. I've been looking at doing this with biowasm, but there are some annoying points around getting that to work (been posting my work on discord). It could also be done with cgo, but again, that is also annoying.

I would love this to be done and am very willing to help!

@Koeng101
Contributor Author

Koeng101 commented Nov 7, 2023

I've never really used DIAMOND, but I'd also be fine with just looking for perfect amino acid matches right now. The nucleotide matching is the important part for a version 1, IMO.
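For illustration, perfect amino acid matching could be as simple as translating all six reading frames and doing exact substring search. This is a Python sketch under assumptions, not how DIAMOND or pLannotate work: it uses the standard genetic code, ignores plasmid circularity, and the names `six_frames` and `exact_protein_hits` are made up for the example.

```python
from itertools import product

# Standard genetic code in NCBI order (bases cycle T, C, A, G).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def translate(dna):
    """Translate complete codons only; unknown codons become 'X'."""
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3))


def six_frames(dna):
    """Yield (strand, offset, protein) for all six reading frames."""
    dna = dna.upper()
    reverse_complement = dna.translate(COMPLEMENT)[::-1]
    for strand, seq in (("+", dna), ("-", reverse_complement)):
        for offset in range(3):
            yield strand, offset, translate(seq[offset:])


def exact_protein_hits(dna, proteins):
    """Return (name, strand, nucleotide_start) for every exact protein match.

    For '-' strand hits the start is a coordinate on the reverse complement.
    """
    hits = []
    for strand, offset, frame in six_frames(dna):
        for name, protein in proteins.items():
            start = frame.find(protein)
            while start != -1:
                hits.append((name, strand, offset + 3 * start))
                start = frame.find(protein, start + 1)
    return hits
```

For example, `exact_protein_hits("CCATGAAAGGGTAAT", {"demo": "MKG"})` finds the match in the forward frame starting at nucleotide index 2.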

@TimothyStiles
Collaborator

I'm not sure about the approach, but this may be a good thing to start as an external project?
