recommended way to build a database from a genbank lineage not already built and released by sourmash devs? #2395

taylorreiter · 2022-12-05T17:46:26Z

Hello! I'm interested in building databases for the GenBank lineages that the sourmash devs have not already built and released: invertebrate, plant, vertebrate_mammalian, vertebrate_other. What is the current recommended best practice for doing this? I've been wading through the issues to try to piece together the best strategy, and it's a bit confusing.

I saw in #2015 that sourmash/database-releases records the build process for databases. However, for the 2022.03 genbank databases, there is a missing file, '../entire.mar29.csv', that is referenced in the Snakefile. What is this file?

I also saw in #2015 that sourmash/database-releases takes advantage of genomes downloaded by wort. Would it be straightforward for me to make my own implementation of wort to get sigs for e.g. plant?

Or am I thinking about this totally the wrong way? I took a look at sourmash/database-examples and I felt like this question was out of the scope of the examples there, but I'm happy to be told I'm wrong on that front!

Future proofing for anyone who sees this issue and is also interested in these types of databases: I am not aware of anyone having benchmarked false positives in things like sourmash gather that arise from repetitive elements or contamination in eukaryotic genomes. Historically we've gotten around this by using a database built from CDS regions or RNA; masked genomes could also potentially work.

ctb · 2022-12-06T15:29:54Z

> I also saw in #2015 that sourmash/database-releases takes advantage of genomes downloaded by wort. Would it be straightforward for me to make my own implementation of wort to get sigs for e.g. plant?

No, almost certainly not ;). But @luizirber might disagree?

The way I'd recommend doing it is to start with something like genome_updater and assembly_summary.txt - see sourmash-bio/databases#5 for some discussion.

I'll write more when I have a chance to investigate, but it should be something like:

- get the genomes downloaded somehow (genome_updater)
- get the genome "metadata" (assembly_summary.txt)
- use our script (below) to build lineages file for taxonomy
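As a rough illustration of the "metadata" step, here is a minimal sketch of pulling accession, taxid, and organism name out of an `assembly_summary.txt` file. It is not the linked script and does not build full lineages; it only assumes the file is tab-separated, that the last `#` comment line before the data holds the column names, and that the columns `assembly_accession`, `taxid`, and `organism_name` are present. The function names and the example record (a made-up accession with only four columns) are hypothetical.

```python
import io


def read_assembly_summary(handle):
    """Yield one dict per assembly from an assembly_summary.txt-style handle."""
    header = None
    for line in handle:
        line = line.rstrip("\n")
        if not line:
            continue
        if line.startswith("#"):
            # Assumption: the last comment line before the data lists the columns.
            header = line.lstrip("#").strip().split("\t")
            continue
        if header is None:
            raise ValueError("no '#' header line found before the data rows")
        yield dict(zip(header, line.split("\t")))


def summary_to_rows(handle):
    """Keep only the fields needed to look up taxonomy later."""
    return [
        {
            "ident": rec["assembly_accession"],
            "taxid": rec["taxid"],
            "organism_name": rec["organism_name"],
        }
        for rec in read_assembly_summary(handle)
    ]


# Made-up, trimmed-down example; a real file has many more columns.
example = (
    "#   See README for column descriptions\n"
    "# assembly_accession\ttaxid\torganism_name\tftp_path\n"
    "GCA_000000000.1\t3702\tArabidopsis thaliana\thttps://example.invalid/GCA_000000000.1\n"
)
rows = summary_to_rows(io.StringIO(example))
```

The resulting rows still need a taxid-to-lineage lookup before they can serve as a sourmash lineages file; that part is what the linked repository is for.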

From a question @bluegenes asked recently on slack, here are some additional resources:

#2015

https://github.com/ctb/2022-assembly-summary-to-lineages

The file ./entire.mar29.csv is in /home/ctbrown/scratch/fromfile on farm and is the manifest of all the wort sigs, to be used with sourmash sig check as in the Snakefile. See entire.2022-04-26.sqlmf in that same directory for a SQLite-based manifest that should work with current sourmash; it is documented a bit in #1965.
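To make the role of the manifest concrete, here is a small sketch of the underlying idea: cross-referencing a list of wanted accessions against the signatures a manifest already knows about. This is not how `sourmash sig check` is implemented and is no substitute for it. It assumes a CSV manifest whose leading `#` line is a version comment and whose `name` column starts with the accession; the function names and the example rows are hypothetical.

```python
import csv
import io


def manifest_idents(handle):
    """Collect the first whitespace-delimited token of each signature name."""
    # Assumption: comment lines (such as a version header) start with '#'.
    lines = (line for line in handle if not line.startswith("#"))
    idents = set()
    for row in csv.DictReader(lines):
        name = (row.get("name") or "").strip()
        if name:
            idents.add(name.split()[0])
    return idents


def split_by_manifest(wanted, handle):
    """Return (found, missing) accessions relative to the manifest."""
    have = manifest_idents(handle)
    found = sorted(i for i in wanted if i in have)
    missing = sorted(i for i in wanted if i not in have)
    return found, missing


# Made-up manifest with only a few columns, for illustration.
example_manifest = (
    "# SOURMASH-MANIFEST-VERSION: 1.0\n"
    "internal_location,ksize,moltype,name\n"
    "sigs/a.sig,31,DNA,GCA_000000000.1 Arabidopsis thaliana\n"
)
found, missing = split_by_manifest(
    ["GCA_000000000.1", "GCA_999999999.1"], io.StringIO(example_manifest)
)
```

Anything that lands in `missing` would be a genome with no existing signature, which would then have to be downloaded and sketched separately.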

taylorreiter · 2022-12-07T13:22:31Z

Thanks @ctb! I'll wait a couple of days to see if @luizirber has anything to say re: wort for eukaryotic sigs, and if not I'll explore the route you've outlined above.

My end goal is covers for the databases, so WIP here: https://github.com/Arcadia-Science/build-cover-dbs
