Hello! I'm interested in building a database for the lineages not already built and released by sourmash devs that are on genbank: invertebrate, plant, vertebrate_mammalian, vertebrate_other. What is the current recommended best practice to do this? I've been wading through the issues to try and piece together the best strategy and it's a bit confusing.
I saw in #2015 that sourmash/database-releases records the build process for databases. However, for the 2022.03 genbank databases, there is a missing file, '../entire.mar29.csv', that is referenced in the Snakefile. What is this file?
I also saw in #2015 that sourmash/database-releases takes advantage of genomes downloaded by wort. Would it be straightforward for me to make my own implementation of wort to get sigs for e.g. plant?
Or am I thinking about this totally the wrong way? I took a look at sourmash/database-examples and I felt like this question was out of the scope of the examples there, but I'm happy to be told I'm wrong on that front!
Future-proofing for anyone who sees this issue and is also interested in these types of databases: I am not aware that anyone has done benchmarking for false positives for things like sourmash gather due to repetitive elements or contamination in eukaryotic genomes. Historically we've gotten around this by using a database built from CDS regions or RNA. Masked genomes could also potentially work.
> I also saw in #2015 that sourmash/database-releases takes advantage of genomes downloaded by wort. Would it be straightforward for me to make my own implementation of wort to get sigs for e.g. plant?
No, almost certainly not ;). But @luizirber might disagree?
The way I'd recommend doing it is to start with something like genome_updater and assembly_summary.txt - see sourmash-bio/databases#5 for some discussion.
I'll write more when I have a chance to investigate, but it should be something like:
get the genomes downloaded somehow (genome_updater)
get the genome "metadata" (assembly_summary.txt)
use our script (below) to build lineages file for taxonomy
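As a rough illustration of the "metadata" step above, here is a minimal sketch of reading assembly_summary.txt in Python. It assumes the usual NCBI layout (tab-separated, `#`-prefixed comment lines, with the last comment line holding the column names) and the column names `assembly_accession`, `taxid`, `organism_name`, and `ftp_path`. The function name and the sample row are made up for illustration, and this is not the lineages script mentioned above; it only pulls out the fields such a script would need.

```python
import io


def read_assembly_summary(handle):
    """Yield one dict per assembly from an assembly_summary.txt-style file."""
    header = None
    for line in handle:
        line = line.rstrip("\n")
        if not line:
            continue
        if line.startswith("#"):
            # Assumption: the last comment line before the data rows
            # holds the tab-separated column names.
            header = line.lstrip("#").strip().split("\t")
            continue
        if header is None:
            raise ValueError("no header line found before data rows")
        yield dict(zip(header, line.split("\t")))


# Hypothetical two-column-header sample; the accession and URL are placeholders.
SAMPLE = (
    "#   See the README for a description of the columns in this file.\n"
    "# assembly_accession\ttaxid\torganism_name\tftp_path\n"
    "GCA_000000000.1\t3702\tArabidopsis thaliana\thttps://example.org/GCA_000000000.1\n"
)

records = list(read_assembly_summary(io.StringIO(SAMPLE)))
for rec in records:
    print(rec["assembly_accession"], rec["taxid"], rec["organism_name"])
```

On a real file you would pass `open("assembly_summary.txt")` instead of the in-memory sample. The taxid still has to be resolved to a full lineage separately.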
From a question @bluegenes asked recently on slack, here are some additional resources:
The file ./entire.mar29.csv is in /home/ctbrown/scratch/fromfile on farm and is the manifest of all the wort sigs, to be used with sourmash sig check as in the Snakefile. See entire.2022-04-26.sqlmf in that same directory for a SQLite-based manifest that should work with current sourmash; it is documented a bit in #1965.
Thanks @ctb! I'll wait a couple days to see if @luizirber has anything to say re: wort for eukaryotic sigs, and if not I'll explore the route you've outlined below.
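To make the role of the manifest concrete, here is a small Python sketch of the kind of check it enables: given a list of accessions you want, report which ones have no signature in a CSV manifest. This is not how sourmash sig check is implemented, and it does not replace it. It assumes a CSV manifest with optional `#` comment lines at the top and a `name` column whose first space-separated token is the accession. The function names and sample rows are hypothetical.

```python
import csv
import io


def idents_in_manifest(handle):
    """Return the set of identifiers found in a CSV manifest's 'name' column."""
    # Assumption: comment lines (e.g. a version line) start with '#'.
    rows = (line for line in handle if not line.startswith("#"))
    reader = csv.DictReader(rows)
    # Assumption: the identifier is the first space-separated token of 'name'.
    return {row["name"].split(" ")[0] for row in reader if row["name"]}


def missing_idents(wanted, handle):
    """Return the wanted identifiers that have no entry in the manifest."""
    return sorted(set(wanted) - idents_in_manifest(handle))


# Hypothetical manifest with a single signature; all values are placeholders.
SAMPLE_MANIFEST = (
    "# SOURMASH-MANIFEST-VERSION: 1.0\n"
    "internal_location,md5,name\n"
    "sigs/a.sig,abc123,GCA_000000001.1 Example organism A\n"
)

wanted = ["GCA_000000001.1", "GCA_000000002.1"]
missing = missing_idents(wanted, io.StringIO(SAMPLE_MANIFEST))
print(missing)
```

Anything reported as missing would need to be downloaded and sketched before it could go into a database.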