sourmash database construction - current status and future thoughts #2015

ctb · 2022-05-01T13:45:58Z

With the merge of StandaloneManifestIndex #1891 and sourmash sig check #1907, and the advent of sourmash sketch fromfile #1885 and associated scripts in https://github.com/ctb/2022-sourmash-sketchfrom/, a new day is dawning for sourmash database construction 🎉

tl;dr `sig check` and `sketch fromfile`

The two key commands introduced in sourmash v4.4.0 are `sourmash sig check` and `sourmash sketch fromfile`. `sig check` identifies and retrieves signatures that match lists of identifiers, while `sketch fromfile` coordinates the bulk construction of sketches.

Some functional examples of sourmash sketch fromfile are here. This includes examples of databases with private identifiers and also databases with NCBI-formatted identifiers.
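As a rough illustration of the input side of `sketch fromfile`, here is a minimal Python sketch that writes a CSV with the `name`, `genome_filename`, and `protein_filename` columns. The helper name `write_fromfile_csv`, the records, and all file paths are hypothetical and only meant to show the shape of the file, with one NCBI-style identifier and one private identifier.

```python
import csv
import io

# Column names read by `sourmash sketch fromfile` from its input CSV.
FROMFILE_COLUMNS = ["name", "genome_filename", "protein_filename"]


def write_fromfile_csv(records, out):
    """Write (name, genome_path, protein_path) tuples as a fromfile-style CSV.

    A path that is not available (None) is written as an empty field.
    """
    writer = csv.writer(out)
    writer.writerow(FROMFILE_COLUMNS)
    for name, genome_path, protein_path in records:
        writer.writerow([name, genome_path or "", protein_path or ""])


# Hypothetical records; identifiers and paths are made up for illustration.
records = [
    (
        "GCA_000000000.1 Example organism",
        "genomes/GCA_000000000.1.fna.gz",
        "proteins/GCA_000000000.1.faa.gz",
    ),
    ("lab_isolate_7 Example isolate", "genomes/lab_isolate_7.fna.gz", None),
]

# Write to an in-memory buffer; a real run would open a file instead.
buf = io.StringIO()
write_fromfile_csv(records, buf)
print(buf.getvalue())
```

The resulting file would then be passed to `sourmash sketch fromfile` along with the desired sketch parameters and an output location.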

sig check is used to find and extract signatures from wort collections based on identifier prefix matching, and can also be used to verify that all desired identifiers are in a database:

sourmash sig check <wort collection> \
    --picklist <picklist> \
    -o missing.csv \
    --save-manifest matching.csv
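To act on the output of the command above, one option is to read `missing.csv` and report which identifiers were not found. The sketch below assumes that `missing.csv` has a header row followed by one row per unmatched picklist entry, and that the identifier column is named `ident`. Both the column name and the helper `read_missing` are assumptions for illustration.

```python
import csv
import io


def read_missing(fp, column="ident"):
    """Return the values of `column` from a missing-entries CSV.

    `column` is an assumption; use whatever column the picklist used.
    """
    return [row[column] for row in csv.DictReader(fp)]


# Stand-in for open("missing.csv", newline=""); the contents are made up.
example = io.StringIO("ident\nGCA_000000001.1\nGCA_000000002.1\n")

missing = read_missing(example)
if missing:
    print(f"{len(missing)} identifiers not found:", ", ".join(missing))
else:
    print("all identifiers were found")
```

An empty result would indicate that every desired identifier is present in the collection.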

This issue is a consolidation of unresolved elements from previous issues, as well as a reference point for closure of previous issues and PRs.

Relevant issues:

Building genbank/refseq databases from assembly_summary.txt databases#7 - using assembly summary files
upgrade sourmash_databases #970 - upgrading sourmash_databases
split database construction and release processes; provide database catalogs #1569 - previous summary issue
rework database construction and release process to use manifests #1652 - use manifests!

standardized database build and release process

This is now in a new repository, https://github.com/sourmash-bio/database-releases.

The idea is that for each new release, the scripts and/or workflow for building that release will be placed in a new directory in database-releases, and then a new release of that repository will be made along with the database update. This will provide a Zenodo DOI for each database script update.

constructing Genbank databases (DNA)

We do this from wort-genomes, using collection manifests (to track the genomes) and assembly summary files (to identify signature names to put in the collections).

See https://github.com/sourmash-bio/database-releases/tree/main/genbank-2022.03
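As a rough sketch of the assembly-summary step (not the scripts in the directory linked above), the following Python reads an NCBI-style assembly summary file and writes a one-column picklist of accessions. It assumes the file is tab-separated, that header lines start with `#`, and that the accession is the first column. The function names and the example rows are hypothetical.

```python
import csv
import io


def accessions_from_assembly_summary(fp):
    """Yield the first column (assumed to be the accession) of each data row."""
    for line in fp:
        # Skip header/comment lines and blank lines.
        if line.startswith("#") or not line.strip():
            continue
        yield line.rstrip("\n").split("\t")[0]


def write_picklist(accessions, out, column="ident"):
    """Write accessions as a single-column CSV; return the number written."""
    writer = csv.writer(out)
    writer.writerow([column])
    count = 0
    for accession in accessions:
        writer.writerow([accession])
        count += 1
    return count


# Made-up, truncated rows standing in for a real assembly summary file.
example_summary = io.StringIO(
    "#   See the NCBI README for a description of the columns in this file.\n"
    "# assembly_accession\tbioproject\torganism_name\n"
    "GCA_000000001.1\tPRJNA000001\tExample organism one\n"
    "GCA_000000002.1\tPRJNA000002\tExample organism two\n"
)

picklist = io.StringIO()
n_written = write_picklist(
    accessions_from_assembly_summary(example_summary), picklist
)
print(f"wrote {n_written} identifiers")
```

Saved as, say, `picklist.csv`, such a file could be given to sourmash as `--picklist picklist.csv:ident:ident`, which matches on the identifier at the start of each signature name.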

constructing GTDB databases (DNA)

We do this from wort-genomes, using collection manifests (to track the genomes) and assembly summary files (to identify signature names to put in the collections).

See https://github.com/sourmash-bio/database-releases/tree/main/gtdb-rs207.genomic-reps and https://github.com/sourmash-bio/database-releases/tree/main/gtdb-rs207.genomic.

building taxonomy CSVs

Genbank

I updated the scripts from the previous lineage stuff to use the assembly summary files; scripts here: https://github.com/ctb/2022-assembly-summary-to-lineages.

GTDB

@taylorreiter provided an R script for producing taxonomy spreadsheets from GTDB's taxonomy TSVs, here: #1941 (comment)
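For readers who prefer Python, the same kind of conversion can be sketched as follows. This is not the R script referenced above. It assumes each input line holds an accession (optionally prefixed with `RS_` or `GB_`), a tab, and a seven-rank lineage separated by semicolons, and it writes the `ident` plus rank columns used in sourmash lineage spreadsheets. Rank prefixes such as `d__` are kept as-is, and the example line is made up.

```python
import csv
import io

# Rank columns, in the order they appear in a GTDB lineage string.
RANKS = ["superkingdom", "phylum", "class", "order", "family", "genus", "species"]


def gtdb_row_to_lineage(ident, lineage):
    """Convert one accession + lineage string into a dict of CSV fields."""
    # Drop the RS_/GB_ source prefix if present (assumed input convention).
    if ident.startswith(("RS_", "GB_")):
        ident = ident[3:]
    names = lineage.split(";")
    if len(names) != len(RANKS):
        raise ValueError(f"expected {len(RANKS)} ranks for {ident}")
    return dict(zip(["ident"] + RANKS, [ident] + names))


def convert_gtdb_taxonomy(tsv_fp, csv_fp):
    """Read a two-column taxonomy TSV and write a lineage CSV; return row count."""
    reader = csv.reader(tsv_fp, delimiter="\t")
    writer = csv.DictWriter(csv_fp, fieldnames=["ident"] + RANKS)
    writer.writeheader()
    count = 0
    for row in reader:
        if not row:
            continue
        writer.writerow(gtdb_row_to_lineage(row[0], row[1]))
        count += 1
    return count


# Made-up example line in the assumed input format.
example_tsv = io.StringIO(
    "RS_GCF_000000000.1\td__Bacteria;p__Pseudomonadota;c__Gammaproteobacteria;"
    "o__Enterobacterales;f__Enterobacteriaceae;g__Escherichia;s__Escherichia coli\n"
)
lineages_csv = io.StringIO()
n_rows = convert_gtdb_taxonomy(example_tsv, lineages_csv)
print(lineages_csv.getvalue())
```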

Follow-up questions

Questions:

do we want to update old databases?

probably not worth the trouble, but it should be as simple as sourmash sig cat OLD_DB -o NEW_DB.

can we / should we provide metadata for databases?

see #1847

planning for protein databases

Soon-ish we will be releasing protein databases. Building these is more difficult because the protein sequences are not in wort-genomes. I think the plan is to use the sourmash sketch fromfile approach together with a custom workflow for building and/or retrieving the protein files (cc @bluegenes).

Other future things to think about

#991 - using BD Bags, and/or the datasets tool, and/or supporting incremental download of data releases.

Consider just providing the .zip files, together with workflows for constructing the other files as needed? First raised in #1511.

Provide a list of available databases in a computationally accessible format (along with, presumably, tools for retrieving them?) - #1005


ctb · 2022-05-01T14:04:19Z

more -

should we build RefSeq representative databases, per sourmash-bio/databases#13?

provide better database benchmarks - #2014

upgrading wort manifests - #1965

providing links in taxonomy databases - #1969

building less redundant databases with minimum set covers - #1852

old IMG databases that we actually could upgrade - #385

bluegenes · 2022-05-05T17:06:37Z

Sketching protein databases using fromfile took ~15 hours and ~1 GB of RAM once all faa.gz files were available. Since not all genomes have protein fastas available, the workflow requires checking for empty/missing proteome files, and using prodigal to generate protein fastas from genomes as necessary.
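The empty/missing check could look something like the sketch below. It is not the actual workflow. It treats a proteome as unusable when the file is absent, cannot be read as gzip, or decompresses to nothing, which matters because a gzipped empty file still has a nonzero size on disk. The function names and file names are hypothetical.

```python
import gzip
import os
import tempfile


def proteome_is_usable(path):
    """Return True if `path` exists and decompresses to at least one byte."""
    if not os.path.exists(path):
        return False
    try:
        with gzip.open(path, "rb") as fp:
            return fp.read(1) != b""
    except OSError:
        # Not a valid gzip file (or unreadable); treat it as unusable.
        return False


def split_by_proteome(paths):
    """Partition paths into (usable, needs_prediction) lists."""
    usable, needs_prediction = [], []
    for path in paths:
        (usable if proteome_is_usable(path) else needs_prediction).append(path)
    return usable, needs_prediction


# Small demonstration with temporary, made-up files.
with tempfile.TemporaryDirectory() as tmpdir:
    good = os.path.join(tmpdir, "good.faa.gz")
    empty = os.path.join(tmpdir, "empty.faa.gz")
    absent = os.path.join(tmpdir, "absent.faa.gz")
    with gzip.open(good, "wt") as fp:
        fp.write(">protein1\nMKV\n")
    with gzip.open(empty, "wt") as fp:
        pass  # valid gzip file with no content
    usable, needs_prediction = split_by_proteome([good, empty, absent])
    print(len(usable), "usable;", len(needs_prediction), "need gene prediction")
```

Genomes in the second list would then go through prodigal before sketching.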

note - suppressed record issue: #2037

ctb mentioned this issue May 4, 2022

How do I build a lineage spreadsheet for GTDB taxonomy and signatures? #1095

Open

taylorreiter mentioned this issue Dec 5, 2022

recommended way to build a database from a genbank lineage not already built and released by sourmash devs? #2395

Closed
