
what databases do we want to provide, and how? #1511

Closed · ctb opened this issue May 10, 2021 · 28 comments

@ctb (Contributor) commented May 10, 2021

Re: the sourmash 4.1 release in particular - see #1481 and #1391.

We now effectively have three types of databases that we support and could distribute -

  • SBTs
  • LCAs
  • Zipfile collections

and we can supply GenBank, RefSeq, GTDB genomic reps (~35k genomes), and/or GTDB in full (~300k genomes);

and we can supply DNA and/or protein;

and for LCA databases we can provide a taxonomy, too - NCBI or GTDB.

😅

A few misc thoughts -

  • the SBTs are too big for anything but the GTDB genomic reps;
  • same with the LCAs;
  • since the GTDB genomic reps are "representative" subsets of GenBank, I think we could safely provide SBTs and LCAs only for them;
  • we should provide LCA databases with both NCBI and GTDB taxonomies;
  • the zipfile collections will presumably be more useful when greyhound is fully available, but they are already useful;
  • we should provide protein databases at some point, but they are less critical than updated DNA.

Current status: GTDB zipfile collections are available on OSF (under Google Drive) for the latest RS202, courtesy of @bluegenes.

@ctb (Contributor, Author) commented May 10, 2021

If I build .sbt.zip files for the SBTs, should I use the default sourmash index parameters?

@ctb (Contributor, Author) commented May 10, 2021

ref GTDB databases

@luizirber (Member):

Connected to #985 (comment): I think we should only distribute the Zipfile collections, but make them compatible with SBTs (same internal structure, like putting signatures inside a signatures/ path inside the Zipfile).

With the Zipfile it is possible to create a local index (SBT or LCA). If it is a gigantic collection (where building the SBT is prohibitive), we can also add the SBT description (the .sbt.json file) but not the internal nodes, which adds very little space overhead to the Zipfile collection. A prepare command (or something similar) could then build/save the internal nodes.
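For a quick sanity check of the layout, something like this works (just a plain zipfile listing; the signatures/ path and embedded .sbt.json are the proposal above, not necessarily what current files contain):

# list the internal layout of a zipfile collection; the filename is one
# built later in this thread. Under the proposal you'd expect entries
# under a signatures/ path plus a small .sbt.json description, with no
# internal nodes stored.
unzip -l gtdb-rs202.genomic-reps.k31.zip | head -20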

LCA/RevIndex/greyhound is more complicated, because

  • our serialized format contains an implicit representation of the signatures (they need to be rebuilt in memory during loading);
  • it is expensive to build (in memory consumption), because the whole hash -> [dataset] mapping has to be kept in memory.
    Even in this case the Zipfile collection is useful, because it can serve as a disk-backed solution to avoid rebuilding the sigs during loading.

At this point LCA/RevIndex/greyhound is better for situations where a lot of memory is available, speed is paramount AND many queries will be executed against the index (https://greyhound.sourmash.bio is one example). For other cases (low memory, one query against an index, gigantic collections) it has too much initialization overhead from loading/deserializing the index into memory. One possible solution here is a no-copy serialization format that can be mmapped from disk, which would avoid the deserialization overhead and possibly limit memory consumption a bit.

@ctb (Contributor, Author) commented May 10, 2021

> Connected to #985 (comment): I think we should only distribute the Zipfile collections, but make them compatible with SBTs (same internal structure, like putting signatures inside a signatures/ path inside the Zipfile).

Yes, this is currently what we are doing 🎉.

> With the Zipfile it is possible to create a local index (SBT or LCA). If it is a gigantic collection (where building the SBT is prohibitive), we can also add the SBT description (the .sbt.json file) but not the internal nodes, which adds very little space overhead to the Zipfile collection. A prepare command (or something similar) could then build/save the internal nodes.

I like this idea; is it feasible as of v4.1? I seem to recall this functionality is already present somewhere.

@ctb (Contributor, Author) commented May 11, 2021

gotta say, the new zipfile format sure makes some things easy -

# build a .sbt.zip SBT from each zipfile collection, one per ksize
for i in 21 31 51
do
   sourmash index -k $i gtdb-rs202.genomic-reps.k$i.sbt.zip gtdb-rs202.genomic-reps.k$i.zip
done

I think they're a good starting point for database building - create the zipfile collections, then build everything else!

@ctb (Contributor, Author) commented May 11, 2021

For k=31, this takes about 10 minutes and 4 GB of RAM for about 48,000 genomic sigs, and produces a .sbt.zip file that's about 2.6 GB in size. Pretty good!

        Command being timed: "sourmash index -k 31 ctb/gtdb-rs202.genomic-reps.k31.sbt.zip gtdb-rs202.genomic-reps.k31.zip"
        User time (seconds): 526.16
        System time (seconds): 4.46
        Percent of CPU this job got: 99%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 8:53.64
        Average shared text size (kbytes): 0
        Average unshared data size (kbytes): 0
        Average stack size (kbytes): 0
        Average total size (kbytes): 0
        Maximum resident set size (kbytes): 4027756
        Average resident set size (kbytes): 0
        Major (requiring I/O) page faults: 0
        Minor (reclaiming a frame) page faults: 2157298
        Voluntary context switches: 3215
        Involuntary context switches: 52993
        Swaps: 0
        File system inputs: 2662384
        File system outputs: 5391976
        Socket messages sent: 0
        Socket messages received: 0
        Signals delivered: 0
        Page size (bytes): 4096
        Exit status: 0

@luizirber (Member):

> I think they're a good starting point for database building - create the zipfile collections, then build everything else!

Yup, that's what I've been thinking too. Especially if we lay out the zipfile collection in a way that can be reused as storage in the SBT.

(On the Rust side this can also become an implementation of the From trait, taking a collection/LinearIndex and building an SBT or RevIndex out of it)

@ctb (Contributor, Author) commented May 12, 2021

ugh, I frequently forget that GTDB covers only bacteria and archaea. What about making a merged NCBI+GTDB taxonomy for a database that includes non-bacterial, non-archaeal genomes? Or can we enable this by providing multiple lineage files? (Yes - see the sketch below.)
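A minimal version of the "multiple lineage files" route is just to merge two lineage spreadsheets into one CSV before indexing. A sketch, assuming both CSVs share the same column layout; ncbi.taxonomy.csv is a hypothetical filename:

# keep the header row from one spreadsheet, then append data rows from both
head -1 gtdb-rs202.taxonomy.v2.csv > merged.taxonomy.csv
tail -n +2 gtdb-rs202.taxonomy.v2.csv >> merged.taxonomy.csv
tail -n +2 ncbi.taxonomy.csv >> merged.taxonomy.csv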

@ctb (Contributor, Author) commented May 12, 2021

actually, "query merged ncbi and gtdb taxonomies" is an ...interesting selling point...

@ctb (Contributor, Author) commented May 15, 2021

I think we should provide taxonomy CSVs with each database update, too.

@ctb (Contributor, Author) commented May 18, 2021

note to self:

# upload the SBTs to Google Drive via rclone (run for each $i)
rclone copy -P gtdb-rs202.genomic.$i.sbt.zip gdrive-ucd:sourmash_databases/gtdb-rs202

@ctb (Contributor, Author) commented May 21, 2021

As a random aside, when running:

# build an LCA database (.lca.json.gz) per ksize, using the GTDB taxonomy spreadsheet
for i in 21 31 51
do
   sourmash lca index ../../gtdb-rs202.taxonomy.v2.csv gtdb-rs202.genomic.k$i.lca.json.gz gtdb-rs202.genomic.k$i.sbt.zip --scaled=10000 --require-taxonomy --fail-on --split-identifier -k $i
done

I get the following:

== This is sourmash version 4.0.1.dev37+gad03e1ec. ==
== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==

Building LCA database with ksize=51 scaled=10000 moltype=DNA.
examining spreadsheet headers...
** assuming column 'ident' is identifiers in spreadsheet
258406 distinct identities in spreadsheet out of 258406 rows.
47894 distinct lineages in spreadsheet out of 258406 rows.
... loaded 1 signatures.
loaded 19317972 hashes at ksize=51 scaled=10000
47894 assigned lineages out of 47894 distinct lineages in spreadsheet.
252808 identifiers used out of 258406 distinct identifiers in spreadsheet.
saving to LCA DB: gtdb-rs202.genomic.k51.lca.json.gz
WARNING: 1 duplicate signatures.

WARNING: no signatures for 5598 spreadsheet rows.
WARNING: 5598 unused identifiers.

The unused identifiers are a bit weird - @bluegenes any thoughts? I can generate a list of them if you want. I'll get that started, actually.
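Something like this should generate the list (a sketch - it assumes identifiers sit in the first spreadsheet column with a version suffix to strip, and uses the catalog files described later in this thread):

# identifiers in the taxonomy spreadsheet (drop header, strip version suffix)
cut -d, -f1 gtdb-rs202.taxonomy.v2.csv | tail -n +2 | sed 's/\.[0-9]*$//' | sort -u > spreadsheet-ids.txt
# identifiers actually present in the database catalog
grep -oE 'GC[AF]_[0-9]+' gtdb-rs202.genomic.k31.sbt.zip.catalog.txt | sort -u > catalog-ids.txt
# spreadsheet identifiers with no matching signature
comm -23 spreadsheet-ids.txt catalog-ids.txt > unused-ids.txt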

@bluegenes (Contributor) commented May 21, 2021

> The unused identifiers are a bit weird - @bluegenes any thoughts? I can generate a list of them if you want. I'll get that started, actually.

I wonder if they might be empty sigfiles -- i.e., some issue happened during download, etc.? I tried to catch all those issues for the proteins, but I only downloaded 67 sigs for the genomic databases -- the rest were pre-downloaded/sketched via wort (which was awesome). Thoughts @luizirber?

@ctb (Contributor, Author) commented May 21, 2021

here are some examples - all in the full 280k, none in the genomic-reps.

GCF_011602145
GCF_902386785
GCF_004634925
GCF_003076315
GCF_009898645
GCF_900988605
GCF_001986975
GCF_000727805
GCF_000827085
GCF_003477765
GCF_002447315
GCF_001831835
GCF_900049285
GCA_003672855
GCF_001328635
GCF_001988895
GCF_010994095
GCF_009708795
GCF_013296465
GCA_013330625
GCF_009762655
GCA_014634665

@ctb (Contributor, Author) commented May 21, 2021

note to self: resolve issues between https://osf.io/t3fqa/ and https://osf.io/wxf9z/

@bluegenes (Contributor):

> here are some examples - all in the full 280k, none in the genomic-reps.
> [identifier list quoted above]

I ran sourmash sig describe on some of these sigs, and they were not empty.

Note that all genome sigs generated with wort can be found here: /group/ctbrowngrp/irber/data/wort-data/wort-genomes/sigs/

@ctb (Contributor, Author) commented Jun 3, 2021

I dug into the catalog produced by sourmash sig describe -- it looks like GCF_011602145 is in the .zip but not in the .sbt.zip file!?

(sourmash-dev) ctbrown@farm:/tmp/ctb-gtdb$ grep GCF_011602145 gtdb-rs202.genomic.k31.zip.catalog.txt
signature: GCF_011602145.1 Acinetobacter baumannii strain=BP9590, ASM1160214v1
(sourmash-dev) ctbrown@farm:/tmp/ctb-gtdb$ grep GCF_011602145 gtdb-rs202.genomic.k31.sbt.zip.catalog.txt

@ctb (Contributor, Author) commented Jun 4, 2021

(figured out the issue with the missing signatures - they were content duplicates. see #1568 and #1171)

@ctb (Contributor, Author) commented Jun 5, 2021

With the latest SBTs and the branch in #1568, when constructing LCA databases I now get:

47894 assigned lineages out of 47894 distinct lineages in spreadsheet.
248924 identifiers used out of 258406 distinct identifiers in spreadsheet.

the former is good; the latter is a WTF, so I'll need to look into it, I guess!

@ctb (Contributor, Author) commented Jun 6, 2021

OK, the missing identifiers are because of signatures with duplicate md5sums. See #1573.
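For the record, duplicate md5sums in a collection can be spotted straight from sig describe output (a sketch - it assumes the "md5:" line format of current sourmash versions):

# print md5sums that occur more than once in the zipfile collection
sourmash sig describe gtdb-rs202.genomic.k31.zip | grep '^md5:' | sort | uniq -d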

@ctb (Contributor, Author) commented Jun 7, 2021

Leaving this here for posterity -

% /usr/bin/time -v sourmash search --containment tests/test-data/47.fa.sig \
     /group/ctbrowngrp/gtdb/databases/ctb2/gtdb-rs202.genomic.k31.lca.json.gz

== This is sourmash version 4.1.2.dev22+g2af87b6e. ==
== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==

select query k=31 automatically.
loaded query: NC_009665.1 Shewanella baltica... (k=31, DNA)
loading from /group/ctbrowngrp/gtdb/databases/ctb2/gtdb-rs202.genomic.k31.lca.json.gz...
loaded 1 databases.

23 matches; showing first 3:
similarity   match
----------   -----
100.0%       GCF_000017325.1 Shewanella baltica OS185 strain=OS185, AS...
 64.6%       GCA_003516645.1 Shewanella baltica, ASM351664v1
 57.0%       GCF_000018765.1 Shewanella baltica OS195 strain=OS195, AS...
        Command being timed: "sourmash search --containment tests/test-data/47.fa.sig /group/ctbrowngrp/gtdb/databases/ctb2/gtdb-rs202.genomic.k31.lca.json.gz"
        User time (seconds): 163.24
        System time (seconds): 28.23
        Percent of CPU this job got: 98%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 3:13.71
        Average shared text size (kbytes): 0
        Average unshared data size (kbytes): 0
        Average stack size (kbytes): 0
        Average total size (kbytes): 0
        Maximum resident set size (kbytes): 11205592
        Average resident set size (kbytes): 0
        Major (requiring I/O) page faults: 11
        Minor (reclaiming a frame) page faults: 13297264
        Voluntary context switches: 2563
        Involuntary context switches: 1944
        Swaps: 0
        File system inputs: 606960
        File system outputs: 336
        Socket messages sent: 0
        Socket messages received: 0
        Signals delivered: 0
        Page size (bytes): 4096
        Exit status: 0

3 minutes and 11 GB of RAM to do a full "pangenome" search of GTDB 280k genomes with an LCA database.

Something like 95% of the time is spent loading the JSON into memory, so there's clear room for improvement there...

@luizirber (Member):

> Something like 95% of the time is spent loading the JSON into memory, so there's clear room for improvement there...

This should also be split into "time loading the JSON" and "time reconstructing signatures", which is non-trivial once you have 280k of them =]

@ctb (Contributor, Author) commented Jun 7, 2021 via email

@ctb (Contributor, Author) commented Jun 24, 2021

note: starting /group/ctbrowngrp/gtdb/databases/ctb/Snakefile to construct release databases (.sbt.zip, LCA database, and associated manifests). Not yet in version control 😰

@ctb (Contributor, Author) commented Oct 20, 2021

I did the following to convert @luizirber's genbank .sbt.zip files into regular ol' zipfiles, then upload them to Google Drive:

conda activate sourmash-dev

# flatten each .sbt.zip into a plain zipfile collection of signatures
for i in /path/to/genbank*x1e5*k31*.sbt.zip
do
   sourmash sig cat $i -o $(basename $i .sbt.zip).zip
done

# rename things here, then put the filenames in list.txt and run:

conda activate rclone
for i in $(cat list.txt)
do
   rclone copy -P $i gdrive-ucd:sourmash_databases/genbank_07_2020
done

Note: the genbank .zip files and taxonomy CSV are now available on farm under /group/ctbrowngrp/genbank:

% ls -lh /group/ctbrowngrp/genbank
total 24G
-rw-r--r-- 1 ntpierce ctbrowngrp  89M Jul 22 13:38 all_genbank_lineages.20200727.csv
-r--r--r-- 1 ctbrown  ctbrown     79M Oct 20 07:03 genbank-2020.07-archaea-k31.zip
-r--r--r-- 1 ctbrown  ctbrown     22G Oct 20 08:44 genbank-2020.07-bacteria-k31.zip
-r--r--r-- 1 ctbrown  ctbrown    1.6G Oct 20 08:52 genbank-2020.07-fungi-k31.zip
-r--r--r-- 1 ctbrown  ctbrown    316M Oct 20 08:53 genbank-2020.07-protozoa-k31.zip
-r--r--r-- 1 ctbrown  ctbrown     34M Oct 20 08:53 genbank-2020.07-viral-k31.zip

@ctb (Contributor, Author) commented Mar 31, 2022

Now uploading genbank builds from assembly_summary files as of 3/28/22 to https://drive.google.com/drive/folders/1Jk5z4fQtsyqyJWCcNmtn4WyE2jZsejrZ

@ctb (Contributor, Author) commented May 1, 2022

closing in favor of #2015.

ctb closed this as completed May 1, 2022