
what databases do we want to provide, and how? #1511

Closed · ctb opened this issue May 10, 2021 · 28 comments

@ctb (Contributor) commented May 10, 2021

Re: the sourmash 4.1 release in particular - see #1481 and #1391.

We now effectively have three types of databases that we support and could distribute -

  • SBTs
  • LCAs
  • Zipfile collections

and we can supply GenBank, RefSeq, GTDB genomic reps (~35k genomes), and/or GTDB in full (~300k genomes);

and we can supply DNA and/or protein;

and for LCA databases we can provide a taxonomy, too - NCBI or GTDB.

😅

A few misc thoughts -

  • the SBTs are too big for anything but the GTDB genomic reps;
  • same with the LCAs;
  • since the GTDB genomic reps are "representative" subsets of GenBank, I think we could safely provide SBTs and LCAs only for them;
  • we should provide LCA databases with both NCBI and GTDB taxonomies;
  • the zipfile collections will presumably be more useful when greyhound is fully available, but they are already useful;
  • we should provide protein databases at some point, but they are less critical than updated DNA.

Current status: GTDB zipfile collections are available on OSF (under Google Drive) for the latest RS202, courtesy of @bluegenes.

@ctb (Contributor, Author) commented May 10, 2021

If I build .sbt.zip files for the SBTs, should I use the default sourmash index parameters?

@ctb (Contributor, Author) commented May 10, 2021

ref GTDB databases

@luizirber (Member):

Connected to #985 (comment): I think we should only distribute the Zipfile collections, but make them compatible with SBTs (same internal structure, like putting signatures inside a signatures/ path inside the Zipfile).

With the Zipfile it is possible to create a local index (SBT or LCA). If it is a gigantic collection (where building the SBT is prohibitive), we can also add the SBT description (the .sbt.json file) but not the internal nodes, which adds very little space overhead to the Zipfile collection. A prepare command (or something similar) could then build/save the internal nodes.
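For a quick sanity check of the layout, something like this works (just a plain zipfile listing; the signatures/ path and embedded .sbt.json are the proposal above, not necessarily what current files contain):

# list the internal layout of a zipfile collection; the filename is one
# built later in this thread. Under the proposal you'd expect entries
# under a signatures/ path plus a small .sbt.json description, with no
# internal nodes stored.
unzip -l gtdb-rs202.genomic-reps.k31.zip | head -20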

LCA/RevIndex/greyhound is more complicated, because

  • our serialized format contains an implicit representation of the signatures (they need to be rebuilt in memory during loading);
  • it is expensive to build (in memory consumption), because the whole hash -> [dataset] mapping has to be kept in memory.
    Even in this case the Zipfile collection is useful, because it can serve as a disk-backed solution to avoid rebuilding the sigs during loading.

At this point LCA/RevIndex/greyhound is better for situations where a lot of memory is available, speed is paramount AND many queries will be executed against the index (https://greyhound.sourmash.bio is one example). For other cases (low memory, one query against an index, gigantic collections) it has too much initialization overhead from loading/deserializing the index into memory. One possible solution here is a no-copy serialization format that can be mmapped from disk, which would avoid the deserialization overhead and possibly limit memory consumption a bit.

@ctb (Contributor, Author) commented May 10, 2021

> Connected to #985 (comment): I think we should only distribute the Zipfile collections, but make them compatible with SBTs (same internal structure, like putting signatures inside a signatures/ path inside the Zipfile).

Yes, this is currently what we are doing 🎉.

> With the Zipfile it is possible to create a local index (SBT or LCA). If it is a gigantic collection (where building the SBT is prohibitive), we can also add the SBT description (the .sbt.json file) but not the internal nodes, which adds very little space overhead to the Zipfile collection. A prepare command (or something similar) could then build/save the internal nodes.

I like this idea; is it feasible as of v4.1? I seem to recall this functionality is already present somewhere.

@ctb (Contributor, Author) commented May 11, 2021

gotta say, the new zipfile format sure makes some things easy -

# build a .sbt.zip SBT from each zipfile collection, one per ksize
for i in 21 31 51
do
   sourmash index -k $i gtdb-rs202.genomic-reps.k$i.sbt.zip gtdb-rs202.genomic-reps.k$i.zip
done

I think they're a good starting point for database building - create the zipfile collections, then build everything else!

@ctb (Contributor, Author) commented May 11, 2021

For k=31, this takes about 10 minutes and 4 GB of RAM for about 48,000 genomic sigs, and produces a .sbt.zip file that's about 2.6 GB in size. Pretty good!

        Command being timed: "sourmash index -k 31 ctb/gtdb-rs202.genomic-reps.k31.sbt.zip gtdb-rs202.genomic-reps.k31.zip"
        User time (seconds): 526.16
        System time (seconds): 4.46
        Percent of CPU this job got: 99%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 8:53.64
        Average shared text size (kbytes): 0
        Average unshared data size (kbytes): 0
        Average stack size (kbytes): 0
        Average total size (kbytes): 0
        Maximum resident set size (kbytes): 4027756
        Average resident set size (kbytes): 0
        Major (requiring I/O) page faults: 0
        Minor (reclaiming a frame) page faults: 2157298
        Voluntary context switches: 3215
        Involuntary context switches: 52993
        Swaps: 0
        File system inputs: 2662384
        File system outputs: 5391976
        Socket messages sent: 0
        Socket messages received: 0
        Signals delivered: 0
        Page size (bytes): 4096
        Exit status: 0

@luizirber (Member):

> I think they're a good starting point for database building - create the zipfile collections, then build everything else!

Yup, that's what I've been thinking too. Especially if we lay out the zipfile collection in a way that can be reused as storage in the SBT.

(On the Rust side this can also become an implementation of the From trait, taking a collection/LinearIndex and building an SBT or RevIndex out of it)

@ctb (Contributor, Author) commented May 12, 2021

ugh, I frequently forget that GTDB covers only bacteria and archaea. What about making a merged NCBI+GTDB taxonomy for a database that includes non-bacterial, non-archaeal genomes? Or can we enable this by providing multiple lineage files? (Yes - see the sketch below.)
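A minimal version of the "multiple lineage files" route is just to merge two lineage spreadsheets into one CSV before indexing. A sketch, assuming both CSVs share the same column layout; ncbi.taxonomy.csv is a hypothetical filename:

# keep the header row from one spreadsheet, then append data rows from both
head -1 gtdb-rs202.taxonomy.v2.csv > merged.taxonomy.csv
tail -n +2 gtdb-rs202.taxonomy.v2.csv >> merged.taxonomy.csv
tail -n +2 ncbi.taxonomy.csv >> merged.taxonomy.csv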

@ctb (Contributor, Author) commented May 12, 2021

actually, "query merged ncbi and gtdb taxonomies" is an ...interesting selling point...

@ctb (Contributor, Author) commented May 15, 2021

I think we should provide taxonomy CSVs with each database update, too.

@ctb (Contributor, Author) commented May 18, 2021

note to self:

# upload the SBTs to Google Drive via rclone (run for each $i)
rclone copy -P gtdb-rs202.genomic.$i.sbt.zip gdrive-ucd:sourmash_databases/gtdb-rs202

@ctb (Contributor, Author) commented May 21, 2021

As a random aside, when running:

# build an LCA database (.lca.json.gz) per ksize, using the GTDB taxonomy spreadsheet
for i in 21 31 51
do
   sourmash lca index ../../gtdb-rs202.taxonomy.v2.csv gtdb-rs202.genomic.k$i.lca.json.gz gtdb-rs202.genomic.k$i.sbt.zip --scaled=10000 --require-taxonomy --fail-on --split-identifier -k $i
done

I get the following:

== This is sourmash version 4.0.1.dev37+gad03e1ec. ==
== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==

Building LCA database with ksize=51 scaled=10000 moltype=DNA.
examining spreadsheet headers...
** assuming column 'ident' is identifiers in spreadsheet
258406 distinct identities in spreadsheet out of 258406 rows.
47894 distinct lineages in spreadsheet out of 258406 rows.
... loaded 1 signatures.
loaded 19317972 hashes at ksize=51 scaled=10000
47894 assigned lineages out of 47894 distinct lineages in spreadsheet.
252808 identifiers used out of 258406 distinct identifiers in spreadsheet.
saving to LCA DB: gtdb-rs202.genomic.k51.lca.json.gz
WARNING: 1 duplicate signatures.

WARNING: no signatures for 5598 spreadsheet rows.
WARNING: 5598 unused identifiers.

The unused identifiers are a bit weird - @bluegenes any thoughts? I can generate a list of them if you want. I'll get that started, actually.
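Something like this should generate the list (a sketch - it assumes identifiers sit in the first spreadsheet column with a version suffix to strip, and uses the catalog files described later in this thread):

# identifiers in the taxonomy spreadsheet (drop header, strip version suffix)
cut -d, -f1 gtdb-rs202.taxonomy.v2.csv | tail -n +2 | sed 's/\.[0-9]*$//' | sort -u > spreadsheet-ids.txt
# identifiers actually present in the database catalog
grep -oE 'GC[AF]_[0-9]+' gtdb-rs202.genomic.k31.sbt.zip.catalog.txt | sort -u > catalog-ids.txt
# spreadsheet identifiers with no matching signature
comm -23 spreadsheet-ids.txt catalog-ids.txt > unused-ids.txt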

@bluegenes (Contributor) commented May 21, 2021

> The unused identifiers are a bit weird - @bluegenes any thoughts? I can generate a list of them if you want. I'll get that started, actually.

I wonder if they might be empty sigfiles -- i.e., some issue happened during download, etc.? I tried to catch all those issues for the proteins, but I only downloaded 67 sigs for the genomic databases -- the rest were pre-downloaded/sketched via wort (which was awesome). Thoughts @luizirber?

@ctb (Contributor, Author) commented May 21, 2021

here are some examples - all in the full 280k, none in the genomic-reps.

GCF_011602145
GCF_902386785
GCF_004634925
GCF_003076315
GCF_009898645
GCF_900988605
GCF_001986975
GCF_000727805
GCF_000827085
GCF_003477765
GCF_002447315
GCF_001831835
GCF_900049285
GCA_003672855
GCF_001328635
GCF_001988895
GCF_010994095
GCF_009708795
GCF_013296465
GCA_013330625
GCF_009762655
GCA_014634665

@ctb (Contributor, Author) commented May 21, 2021

note to self: resolve issues between https://osf.io/t3fqa/ and https://osf.io/wxf9z/

@bluegenes (Contributor):

> here are some examples - all in the full 280k, none in the genomic-reps.
> [identifier list quoted above]

I ran sourmash sig describe on some of these sigs, and they were not empty.

Note that all genome sigs generated with wort can be found here: /group/ctbrowngrp/irber/data/wort-data/wort-genomes/sigs/

@ctb (Contributor, Author) commented Jun 3, 2021

I dug into the catalog produced by sourmash sig describe -- it looks like GCF_011602145 is in the .zip but not in the .sbt.zip file!?

(sourmash-dev) ctbrown@farm:/tmp/ctb-gtdb$ grep GCF_011602145 gtdb-rs202.genomic.k31.zip.catalog.txt
signature: GCF_011602145.1 Acinetobacter baumannii strain=BP9590, ASM1160214v1
(sourmash-dev) ctbrown@farm:/tmp/ctb-gtdb$ grep GCF_011602145 gtdb-rs202.genomic.k31.sbt.zip.catalog.txt

@ctb (Contributor, Author) commented Jun 4, 2021

(figured out the issue with the missing signatures - they were content duplicates. see #1568 and #1171)

@ctb (Contributor, Author) commented Jun 5, 2021

With the latest SBTs and the branch in #1568, when constructing LCA databases I now get:

47894 assigned lineages out of 47894 distinct lineages in spreadsheet.
248924 identifiers used out of 258406 distinct identifiers in spreadsheet.

the former is good; the latter is a WTF, so I'll need to look into it, I guess!

@ctb (Contributor, Author) commented Jun 6, 2021

OK, the missing identifiers are because of signatures with duplicate md5sums. See #1573.
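For the record, duplicate md5sums in a collection can be spotted straight from sig describe output (a sketch - it assumes the "md5:" line format of current sourmash versions):

# print md5sums that occur more than once in the zipfile collection
sourmash sig describe gtdb-rs202.genomic.k31.zip | grep '^md5:' | sort | uniq -d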

@ctb (Contributor, Author) commented Jun 7, 2021

Leaving this here for posterity -

% /usr/bin/time -v sourmash search --containment tests/test-data/47.fa.sig \
     /group/ctbrowngrp/gtdb/databases/ctb2/gtdb-rs202.genomic.k31.lca.json.gz

== This is sourmash version 4.1.2.dev22+g2af87b6e. ==
== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==

select query k=31 automatically.
loaded query: NC_009665.1 Shewanella baltica... (k=31, DNA)
loading from /group/ctbrowngrp/gtdb/databases/ctb2/gtdb-rs202.genomic.k31.lca.json.gz...
loaded 1 databases.

23 matches; showing first 3:
similarity   match
----------   -----
100.0%       GCF_000017325.1 Shewanella baltica OS185 strain=OS185, AS...
 64.6%       GCA_003516645.1 Shewanella baltica, ASM351664v1
 57.0%       GCF_000018765.1 Shewanella baltica OS195 strain=OS195, AS...
        Command being timed: "sourmash search --containment tests/test-data/47.fa.sig /group/ctbrowngrp/gtdb/databases/ctb2/gtdb-rs202.genomic.k31.lca.json.gz"
        User time (seconds): 163.24
        System time (seconds): 28.23
        Percent of CPU this job got: 98%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 3:13.71
        Average shared text size (kbytes): 0
        Average unshared data size (kbytes): 0
        Average stack size (kbytes): 0
        Average total size (kbytes): 0
        Maximum resident set size (kbytes): 11205592
        Average resident set size (kbytes): 0
        Major (requiring I/O) page faults: 11
        Minor (reclaiming a frame) page faults: 13297264
        Voluntary context switches: 2563
        Involuntary context switches: 1944
        Swaps: 0
        File system inputs: 606960
        File system outputs: 336
        Socket messages sent: 0
        Socket messages received: 0
        Signals delivered: 0
        Page size (bytes): 4096
        Exit status: 0

3 minutes and 11 GB of RAM to do a full "pangenome" search of GTDB 280k genomes with an LCA database.

Something like 95% of the time is spent loading the JSON into memory, so there's clear room for improvement there...

@luizirber (Member):

> Something like 95% of the time is spent loading the JSON into memory, so there's clear room for improvement there...

This should also be split into "time loading the JSON" and "time reconstructing signatures", which is non-trivial once you have 280k of them =]

@ctb (Contributor, Author) commented Jun 7, 2021 via email

@ctb (Contributor, Author) commented Jun 24, 2021

note: starting /group/ctbrowngrp/gtdb/databases/ctb/Snakefile to construct release databases (.sbt.zip, LCA database, and associated manifests). Not yet in version control 😰

@ctb (Contributor, Author) commented Oct 20, 2021

I did the following to convert @luizirber's genbank .sbt.zip files into regular ol' zipfiles, then upload them to Google Drive:

conda activate sourmash-dev

# flatten each .sbt.zip into a plain zipfile collection of signatures
for i in /path/to/genbank*x1e5*k31*.sbt.zip
do
   sourmash sig cat $i -o $(basename $i .sbt.zip).zip
done

# rename things here, then put the filenames in list.txt and run:

conda activate rclone
for i in $(cat list.txt)
do
   rclone copy -P $i gdrive-ucd:sourmash_databases/genbank_07_2020
done

Note: the genbank .zip files and taxonomy CSV are now available on farm under /group/ctbrowngrp/genbank:

% ls -lh /group/ctbrowngrp/genbank
total 24G
-rw-r--r-- 1 ntpierce ctbrowngrp  89M Jul 22 13:38 all_genbank_lineages.20200727.csv
-r--r--r-- 1 ctbrown  ctbrown     79M Oct 20 07:03 genbank-2020.07-archaea-k31.zip
-r--r--r-- 1 ctbrown  ctbrown     22G Oct 20 08:44 genbank-2020.07-bacteria-k31.zip
-r--r--r-- 1 ctbrown  ctbrown    1.6G Oct 20 08:52 genbank-2020.07-fungi-k31.zip
-r--r--r-- 1 ctbrown  ctbrown    316M Oct 20 08:53 genbank-2020.07-protozoa-k31.zip
-r--r--r-- 1 ctbrown  ctbrown     34M Oct 20 08:53 genbank-2020.07-viral-k31.zip

@ctb (Contributor, Author) commented Mar 31, 2022

Now uploading genbank builds from assembly_summary files as of 3/28/22 to https://drive.google.com/drive/folders/1Jk5z4fQtsyqyJWCcNmtn4WyE2jZsejrZ

@ctb (Contributor, Author) commented May 1, 2022

closing in favor of #2015.

ctb closed this as completed May 1, 2022