
rework database construction and release process to use manifests #1652

Closed
ctb opened this issue Jul 3, 2021 · 10 comments · Fixed by #1907
Comments

@ctb
Contributor

ctb commented Jul 3, 2021

Our latest database release is pretty nice, but life is also getting much more complicated ;). The process @bluegenes (mostly) and I are using to build/release GTDB looks something like this:

  • get latest GTDB release spreadsheet
  • find DNA signatures in wort, as available (tessa)
  • build protein signatures as needed, and also build DNA signatures that aren't in wort (tessa)
  • construct new .zip collections from those signatures (tessa)
  • build SBTs and LCA databases, along with catalogs

With sourmash 4.2.0, we can now start using picklists with sourmash sig cat to construct the zipfile collections, and manifests are automatically produced from that point on. Future improvements such as lazy signature loading using manifests/manifests-of-manifests can also make the actual disk I/O etc much simpler when selecting from large collections.
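Picklist arguments in this workflow take the form `filename:colname:coltype` (e.g. `gtdb-rs202.idents.csv:accession:identprefix`, as used further down this thread). A tiny hypothetical helper for splitting that form apart, not part of sourmash itself:

```python
def parse_picklist_arg(arg):
    """Split a sourmash-style picklist argument 'file.csv:column:coltype'.

    rsplit is used so that colons in the filename (rare, but possible)
    don't break the parse.
    """
    filename, column, coltype = arg.rsplit(":", 2)
    return filename, column, coltype
```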


Separately, @luizirber has a different database building process that builds the "genbank microbial" databases, based (I think) mostly on the wort output as well as the assembly_report file.

This is all getting to be a lot to manage, and partly as a result we haven't produced a new genbank microbial database in a while.

I chatted briefly with tessa about the idea of starting to use manifests as a starting point for building databases.

The basic idea goes something like this -

  • produce manifests for existing directories of signatures (e.g. all of wort, protein databases, custom "patch update" directories, etc)
  • build some kind of custom script that takes in a list of directories + manifests for files underneath them and builds a zip file
  • have this script do some pre-scanning so that we can say "we want this list of signatures" and it will quickly tell us which ones are missing
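A minimal sketch of that pre-scan step, assuming a hypothetical layout where each directory holds a `manifest.csv` with a `name` column (real sourmash manifests carry more fields):

```python
import csv
import os

def prescan(directories, wanted):
    """Check each directory's manifest for the signatures we want.

    Returns the sorted list of wanted names that no manifest mentions,
    without opening any signature files at all.
    """
    found = set()
    for d in directories:
        manifest = os.path.join(d, "manifest.csv")
        with open(manifest, newline="") as fp:
            for row in csv.DictReader(fp):
                found.add(row["name"])
    return sorted(set(wanted) - found)
```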

ctb commented Jul 5, 2021

I've been working on this on and off, as part of #1654 and #1619 and the scripts in #1641 (comment). I think the ManifestOfManifests that I prototyped in #1619 might be the right approach - we would have it manage a sqlite database that contained the file locations and their last-scanned mtime, and then update that as we go.
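A minimal sketch of that mtime-tracking idea, with a hypothetical one-table schema (the real ManifestOfManifests prototype in #1619 does more):

```python
import os
import sqlite3

def open_mom(db_path):
    """Open (or create) a MoM database tracking scanned manifest locations."""
    db = sqlite3.connect(db_path)
    db.execute("""CREATE TABLE IF NOT EXISTS locations
                  (path TEXT PRIMARY KEY, mtime REAL)""")
    return db

def needs_rescan(db, path):
    """True if 'path' is new or was modified since its last recorded scan."""
    row = db.execute("SELECT mtime FROM locations WHERE path = ?",
                     (path,)).fetchone()
    return row is None or os.path.getmtime(path) > row[0]

def mark_scanned(db, path):
    """Record the current mtime for 'path' after (re)loading its manifest."""
    db.execute("INSERT OR REPLACE INTO locations VALUES (?, ?)",
               (path, os.path.getmtime(path)))
    db.commit()
```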

I attach an early draft script that does the file finding and the manifest reloading/updating, but it doesn't actually update the database.

update-sqlite3-mom-dirs.py.txt

Other than fleshing out the ManifestOfManifests class, I think the main feature to add here would be something that let you add manifests from a small list of files to the database, without having to rescan the entire directory, which is going to be quite slow for large collections...


ctb commented Jul 6, 2021

(come to think of it, this is an excellent situation where plugins might come in handy. Most people using sourmash will probably not be constructing databases with hundreds of thousands of files!)


ctb commented Jul 6, 2021

In chatting with @bluegenes, we broke it down into two different issues:

  • scanning large directories for new files
  • scanning existing files to see if they've been updated

The latter is pretty straightforward, but the former is going to be pretty slow.

Tessa suggested that we chunk the signatures into (let's say) 20k signatures, and store them in zipfiles or directories. That seems pretty workable - 50 such files would be a million sigs! - but we'd need some infrastructure around that, too...
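The chunking idea can be sketched roughly like this (a hypothetical helper; the real chunks would be built from wort's signature files):

```python
import os
import zipfile

def chunk_into_zips(sig_paths, chunk_size, prefix):
    """Pack signature files into numbered zipfiles of at most chunk_size each.

    Returns the list of zipfile names written, e.g. prefix.000.zip, ...
    """
    written = []
    for i in range(0, len(sig_paths), chunk_size):
        zip_name = f"{prefix}.{i // chunk_size:03d}.zip"
        with zipfile.ZipFile(zip_name, "w") as zf:
            for p in sig_paths[i:i + chunk_size]:
                zf.write(p, arcname=os.path.basename(p))
        written.append(zip_name)
    return written
```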


ctb commented Jul 7, 2021

More conversations with @bluegenes - I think our first attempt to improve database construction will end up with:

  • collecting wort files into chunks of ~10k signature files, in zip files;
  • building a manifest-of-manifests (MoM) for wort based on those files, and keeping it updated;
  • building MoMs for ad hoc additions;
  • writing a script that, given a list of MoMs and a picklist of identifiers, tells you (a) whether all the identifiers are found and, if they are, (b) gives you the MoMs and indices that load those locations and lets you do whatever (e.g. create a specific zipfile collection).
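The lookup script in that last bullet could work roughly like this, assuming a hypothetical schema where each MoM sqlite database has a `manifest` table with `ident` and `location` columns:

```python
import sqlite3

def find_in_moms(mom_paths, idents):
    """Look up identifiers across several MoM sqlite databases.

    Returns (hits, missing): hits is a list of (db_path, ident, location)
    tuples, missing is the sorted list of idents found nowhere.
    """
    wanted = set(idents)
    hits = []
    for db_path in mom_paths:
        db = sqlite3.connect(db_path)
        for ident, loc in db.execute("SELECT ident, location FROM manifest"):
            if ident in wanted:
                hits.append((db_path, ident, loc))
        db.close()
    missing = wanted - {ident for _, ident, _ in hits}
    return hits, sorted(missing)
```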


ctb commented Jul 9, 2021

🎉 well, that was easy! https://github.com/ctb/2021-sourmash-mom


ctb commented Jul 10, 2021

OK, got it all working, it seems? Some of the output numbers are incorrect so I'll fix that :)

tl;dr ~1 minute and ~1 GB to get my grubby little paws on all the GTDB genome signatures for RS202.

I loaded all of the signatures from /group/ctbrowngrp/irber/data/wort-data/wort-genomes/sigs into 120 chunked zipfiles, each containing 10k accessions/30k signatures, under 2021-sourmash-mom/wort-genomes.zips/. (292 GB currently.)

Using the Manifest Of Manifests codebase I then created manifests-of-manifests (MoMs or moms) containing the combined manifests of all the zip files, as well as a (much) smaller collection of signatures that @bluegenes created to round out things that wort didn't have.

% ./create-mom.py wort-genomes.zips.db wort-genomes.zips/
% ./create-mom.py tessa.db tessa.sigs/

This produced two sqlite databases that are not terribly large:

-rw-r--r-- 1 ctbrown ctbrown  44K Jul 10 07:22 tessa.db
-r--r--r-- 1 ctbrown ctbrown 804M Jul 10 07:13 wort-genomes.zips.db

and then I grabbed the latest set of GTDB accessions:

% gunzip -c /group/ctbrowngrp/gtdb/gtdb-rs202.metadata.csv.gz | csvtk cut -f accession | cut -c 4- > gtdb-rs202.idents.csv

(and then had to unmangle the column header, but whatever).

Finally, I asked for all matching signatures across all mom databases (in this case, I didn't actually extract them, as that would have taken an hour or two :).

% /usr/bin/time -v ./mom-extract-sigs.py --picklist gtdb-rs202.idents.csv:accession:identprefix wort-genomes.zips.db tessa.db
picking column 'accession' of type 'identprefix' from 'gtdb-rs202.idents.csv'
loaded 258406 distinct values into picklist.
Loading MoM sqlite database wort-genomes.zips.db...
wort-genomes.zips.db contains 3617967 rows total. Selecting ksize/moltype/picklist...
...776310 matches remaining for 'wort-genomes.zips.db' (50.6s)
Loading MoM sqlite database tessa.db...
tessa.db contains 201 rows total. Selecting ksize/moltype/picklist...
...201 matches remaining for 'tessa.db' (0.0s)
---
loaded 776511 rows total from 2 databases.
for given picklist, found 258406 matches to 258406 distinct values
There are 201 distinct rows across all MoMs.
No output options; exiting.
        Command being timed: "./mom-extract-sigs.py --picklist gtdb-rs202.idents.csv:accession:identprefix wort-genomes.zips.db tessa.db"
        User time (seconds): 42.93
        System time (seconds): 12.92
        Percent of CPU this job got: 100%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 0:55.57
        Average shared text size (kbytes): 0
        Average unshared data size (kbytes): 0
        Average stack size (kbytes): 0
        Average total size (kbytes): 0
        Maximum resident set size (kbytes): 1122912
        Average resident set size (kbytes): 0
        Major (requiring I/O) page faults: 0
        Minor (reclaiming a frame) page faults: 640557
        Voluntary context switches: 1379
        Involuntary context switches: 309667
        Swaps: 0
        File system inputs: 3292944
        File system outputs: 1894776
        Socket messages sent: 0
        Socket messages received: 0
        Signals delivered: 0
        Page size (bytes): 4096
        Exit status: 0


ctb commented Jul 12, 2021

One mildly neat realization coming out of #1664 is that for this kind of manifest stuff, the size of the underlying data doesn't matter - we have about the same number of signatures for the SRA as we do for genbank genomes, so all of the manifest stuff will work just fine. It's only the actual search that will be slower for the SRA data because it's so much bigger than the genbank genomes.


ctb commented Jul 14, 2021

Trying out the NCBI assembly_summary.txt files (ref sourmash-bio/databases#7), it all seems pretty straightforward --

csvtk cut -f 1 -t assembly_summary.txt > idents.csv
...
./mom-extract-sigs.py -k 31 --dna --picklist ../genbank_build/idents.csv:ident:ident wort-genomes.zips.db  \
    --save-unmatched=../genbank_build/xxx.csv

which gave

picking column 'ident' of type 'ident' from '../genbank_build/idents.csv'
loaded 390 distinct values into picklist.
Loading MoM sqlite database wort-genomes.zips.db...
wort-genomes.zips.db contains 3617967 rows total. Running select......
...355 matches remaining for 'wort-genomes.zips.db' (12.1s)
---
loaded 355 rows total from 1 databases.
Wrote 35 unmatched values from picklist to '../genbank_build/xxx.csv'
for given picklist, found 355 matches to 390 distinct values
WARNING: 35 missing picklist values.
There are 355 distinct rows across all MoMs.
No output options; exiting.

note the added feature,

Wrote 35 unmatched values from picklist to '../genbank_build/xxx.csv'

which will be important for automation :)
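The save-unmatched behavior amounts to something like this (a hypothetical sketch, not the actual mom-extract-sigs.py code):

```python
import csv

def save_unmatched(picklist_values, matched, out_csv, colname="ident"):
    """Write picklist values with no matching signature to a CSV.

    The output can feed straight back into a signature-building step,
    which is what makes this useful for automation.
    """
    unmatched = sorted(set(picklist_values) - set(matched))
    with open(out_csv, "w", newline="") as fp:
        writer = csv.writer(fp)
        writer.writerow([colname])
        writer.writerows([v] for v in unmatched)
    return len(unmatched)
```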


ctb commented Jul 14, 2021

all of genbank => 88k missing signatures from wort, it seems.

% ./mom-extract-sigs.py --picklist ../genbank_build/gb.idents.csv:ident:identprefix -k 31 --save-unmatched ../genbank_build/gb.nomatch.csv wort-genomes.zips.db
picking column 'ident' of type 'identprefix' from '../genbank_build/gb.idents.csv'
loaded 1033057 distinct values into picklist.
Loading MoM sqlite database wort-genomes.zips.db...
wort-genomes.zips.db contains 3617967 rows total. Running select......
...947203 matches remaining for 'wort-genomes.zips.db' (13.8s)
---
loaded 947203 rows total from 1 databases.
Wrote 88439 unmatched values from picklist to '../genbank_build/gb.nomatch.csv'
for given picklist, found 944618 matches to 1033057 distinct values
WARNING: 88439 missing picklist values.
There are 947203 distinct rows across all MoMs.
No output options; exiting.


ctb commented Mar 30, 2022

Closed by #1907 which, combined with #1891, makes it very straightforward to build databases out of wort!
