WIP: add protein database generation workflow #5

Open
wants to merge 9 commits into
base: main

Contributor

@bluegenes bluegenes commented Apr 15, 2024

We don't have wort for protein sigs (yet), so this PR uses a new plugin, sourmash_plugin_directsketch, to download and sketch proteomes, checking the md5sum along the way. When no proteome is found, we download the genome, predict proteins with prodigal, then sketch the predicted proteins.
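For reference, the md5sum check mentioned above amounts to something like the following. This is a minimal standard-library sketch of the idea, not the plugin's actual implementation; `md5_of_file` and `download_is_intact` are hypothetical names used only for illustration.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in fixed-size chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # iter() with a sentinel stops once read() returns empty bytes (end of file)
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def download_is_intact(path, expected_md5):
    """Compare a downloaded file against its published checksum (hypothetical helper)."""
    return md5_of_file(path) == expected_md5.strip().lower()
```

A download whose digest does not match would be treated as failed and retried or routed to the genome fallback.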

Steps to this workflow:

  • check the old database to find missing (new) accessions
  • download and sketch proteomes
  • download genomes for failed proteome downloads
  • predict proteins from those genomes with prodigal
  • sketch the prodigal proteomes
  • cat the 3 databases together (prior release, direct downloads, prodigal proteomes)
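
The bookkeeping behind the first and third steps can be sketched roughly as below. This assumes accessions are plain strings and that the list of failed proteome downloads is already available; the function names and the accessions in the usage note are made up for illustration and are not taken from the workflow itself.

```python
def find_new_accessions(release_accessions, existing_db_accessions):
    """Accessions in the new release that the prior database does not contain."""
    existing = set(existing_db_accessions)
    # Preserve the release order so downstream outputs are reproducible
    return [acc for acc in release_accessions if acc not in existing]


def split_by_download_result(new_accessions, failed_proteome_downloads):
    """Separate accessions sketched directly from those needing the prodigal fallback."""
    failed = set(failed_proteome_downloads)
    direct = [acc for acc in new_accessions if acc not in failed]
    needs_prodigal = [acc for acc in new_accessions if acc in failed]
    return direct, needs_prodigal
```

For example, with a release of `["GCF_A.1", "GCF_B.1", "GCA_C.1"]` (placeholder accessions), a prior database holding `GCF_A.1`, and a failed proteome download for `GCA_C.1`, only `GCF_B.1` is sketched directly and `GCA_C.1` goes through genome download and prodigal.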

To avoid repeating steps, this workflow uses the taxonomy/metadata parsing done in the gtdb-rs214.genomic workflow.
