recommended way to build a database from a genbank lineage not already built and released by sourmash devs? #2395

Open
taylorreiter opened this issue Dec 5, 2022 · 2 comments

Comments

@taylorreiter
Contributor

Hello! I'm interested in building a database for the GenBank lineages that have not already been built and released by the sourmash devs: invertebrate, plant, vertebrate_mammalian, and vertebrate_other. What is the current recommended best practice for doing this? I've been wading through the issues to try to piece together the best strategy, and it's a bit confusing.

I saw in #2015 that sourmash/database-releases records the build process for databases. However, for the 2022.03 genbank databases, there is a missing file, '../entire.mar29.csv', that is referenced in the Snakefile. What is this file?

I also saw in #2015 that sourmash/database-releases takes advantage of genomes downloaded by wort. Would it be straightforward for me to make my own implementation of wort to get sigs for e.g. plant?

Or am I thinking about this totally the wrong way? I took a look at sourmash/database-examples and I felt like this question was out of the scope of the examples there, but I'm happy to be told I'm wrong on that front!

Future-proofing for anyone who sees this issue and is also interested in these types of databases: I am not aware of anyone having benchmarked false positives for things like sourmash gather caused by repetitive elements or contamination in eukaryotic genomes. Historically we've gotten around this by using a database built from CDS regions or RNA. Masked genomes could also potentially work.

@ctb
Contributor

ctb commented Dec 6, 2022

> I also saw in #2015 that sourmash/database-releases takes advantage of genomes downloaded by wort. Would it be straightforward for me to make my own implementation of wort to get sigs for e.g. plant?

No, almost certainly not ;). But @luizirber might disagree?


The way I'd recommend doing it is to start with something like genome_updater and assembly_summary.txt - see sourmash-bio/databases#5 for some discussion.

I'll write more when I have a chance to investigate, but it should be something like:

  • get the genomes downloaded somehow (genome_updater)
  • get the genome "metadata" (assembly_summary.txt)
  • use our script (linked below) to build a lineages file for taxonomy
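
As a rough illustration of the metadata step only (this is not the lineages script linked below), here is a minimal Python sketch that reads the text of an assembly_summary.txt file and derives a genomic FASTA URL for each assembly. It assumes the usual NCBI layout: tab-separated rows, a `#`-prefixed header line, and an `ftp_path` column whose last path component is also the file name prefix. The function names and the sample row (accession, taxid, organism) are made up for the example.

```python
# Hypothetical sample in assembly_summary.txt style; the row is invented.
SAMPLE = (
    "#   See the NCBI README for a description of the columns\n"
    "# assembly_accession\ttaxid\torganism_name\tftp_path\n"
    "GCA_000000000.1\t12345\tExample plantus\t"
    "https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/000/000/GCA_000000000.1_EXAMPLE\n"
)


def parse_assembly_summary(text):
    """Return one dict per assembly row, keyed by column name."""
    header = None
    rows = []
    for line in text.splitlines():
        if line.startswith("#"):
            # Assumption: the comment line that contains tabs is the header.
            if "\t" in line:
                header = [col.strip() for col in line.lstrip("#").split("\t")]
            continue
        if not line.strip():
            continue
        if header is None:
            raise ValueError("data row found before a header line")
        rows.append(dict(zip(header, line.split("\t"))))
    return rows


def genomic_fasta_url(row):
    """Build the *_genomic.fna.gz URL from a row's ftp_path, or None if absent."""
    ftp_path = row.get("ftp_path", "").rstrip("/")
    if ftp_path in ("", "na"):
        return None
    # The directory name doubles as the file name prefix.
    prefix = ftp_path.rsplit("/", 1)[-1]
    return ftp_path + "/" + prefix + "_genomic.fna.gz"


if __name__ == "__main__":
    for row in parse_assembly_summary(SAMPLE):
        print(row["assembly_accession"], row["taxid"], genomic_fasta_url(row))
```

The `taxid` column is what a lineages file would eventually be keyed on; turning it into a full lineage needs the NCBI taxonomy dump, which this sketch does not attempt.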

From a question @bluegenes asked recently on slack, here are some additional resources:

#2015

https://github.com/ctb/2022-assembly-summary-to-lineages


The file ./entire.mar29.csv is in /home/ctbrown/scratch/fromfile on farm and is the manifest of all the wort sigs, to be used with sourmash sig check as in the Snakefile. See entire.2022-04-26.sqlmf in that same directory for a SQLite-based manifest that should work with current sourmash; it is documented a bit in #1965.

@taylorreiter
Contributor Author

Thanks @ctb! I'll wait a couple of days to see if @luizirber has anything to say re: wort for eukaryotic sigs, and if not I'll explore the route you've outlined above.

My end goal is covers for the databases, so WIP here: https://github.com/Arcadia-Science/build-cover-dbs
