what databases do we want to provide, and how? #1511
if I build .sbt.zip for the SBTs, should I use default …
ref GTDB databases
Connected to #985 (comment): I think we should only distribute the Zipfile collections, but make them compatible with SBTs (same internal structure, like putting signatures inside a …). With the Zipfile it is possible to create a local index (SBT or LCA). If it is a gigantic collection (where building the SBT is prohibitive), we can also add the SBT description. (The LCA/RevIndex/greyhound case is more complicated, because …)
At this point LCA/RevIndex/greyhound is better for situations where a lot of memory is available, speed is paramount, AND many queries will be executed against the index (https://greyhound.sourmash.bio is one example). For other cases (low memory, one query against an index, gigantic collections) it has too much initialization overhead, due to loading/deserializing the index into memory. One possible solution here is using a no-copy serialization format that can be …
Yes, this is currently what we are doing 🎉.
I like this idea; is it feasible as of v4.1? ISTR this functionality is already present somewhere.
gotta say, the new zipfile format sure makes some things easy -
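For instance (a hedged sketch; the collection and query file names here are hypothetical), a zipfile collection can be used directly as a search target, with no other index built first:

```
# search a query signature directly against the zipfile collection
sourmash search query-genome.sig gtdb-rs202.genomic.k31.zip -k 31

# the same collection also works as a gather target
sourmash gather metagenome.sig gtdb-rs202.genomic.k31.zip -k 31
```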
I think they're a good starting point for database building - create the zipfile collections, then build everything else!
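As a sketch of that workflow (file names and parameters are assumptions, not the actual build commands), both an SBT and an LCA database can be built from the same zipfile collection:

```
# build an SBT index from the zipfile collection
sourmash index -k 31 gtdb-rs202.genomic.k31.sbt.zip gtdb-rs202.genomic.k31.zip

# build an LCA database from the same collection plus a lineage CSV
sourmash lca index gtdb-rs202.taxonomy.csv gtdb-rs202.genomic.k31.lca.json.gz \
    gtdb-rs202.genomic.k31.zip -k 31 --scaled=10000 --split-identifiers
```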
for k=31, it takes about 10 minutes and 4 GB of RAM for about 48,000 genomic sigs, and produces a …
Yup, that's what I've been thinking too. Especially if we lay out the zipfile collection in a way that can be reused as storage in the SBT. (On the Rust side this can also become an implementation of the …)
ugh, I frequently forget that GTDB is only for bac/arc. What about making a merged NCBI+GTDB taxonomy for a database that includes non-bac, non-arc genomes? Or can we enable this by providing multiple lineage files? (yes.)
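One minimal way to get a merged taxonomy (a sketch, assuming both lineage spreadsheets share the same column layout; file names are hypothetical):

```
# concatenate GTDB and NCBI lineage spreadsheets, keeping a single header row
head -1 gtdb-rs202.taxonomy.csv > merged.taxonomy.csv
tail -n +2 gtdb-rs202.taxonomy.csv >> merged.taxonomy.csv
tail -n +2 ncbi-euks.taxonomy.csv >> merged.taxonomy.csv
```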
actually, "query merged ncbi and gtdb taxonomies" is an ...interesting selling point... |
I think we should provide taxonomy CSVs with each database update, too. |
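For reference, the lineage spreadsheets are plain CSVs keyed on signature identifier, roughly like this (the row below is illustrative, not a real database entry):

```
ident,superkingdom,phylum,class,order,family,genus,species
GCF_000005845.2,d__Bacteria,p__Proteobacteria,c__Gammaproteobacteria,o__Enterobacterales,f__Enterobacteriaceae,g__Escherichia,s__Escherichia coli
```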
note to self: …
As a random aside, when running:

```
for i in 21 31 51
do
    sourmash lca index ../../gtdb-rs202.taxonomy.v2.csv \
        gtdb-rs202.genomic.k$i.lca.json.gz gtdb-rs202.genomic.k$i.sbt.zip \
        --scaled=10000 --require-taxonomy --fail-on --split-identifier -k $i
done
```

I get the following: …
The unused identifiers are a bit weird - @bluegenes any thoughts? I can generate a list of them if you want. I'll get that started, actually.
I wonder if they might be empty sigfiles -- aka some issue happened during download, etc.? I tried to catch all those issues for proteins, but I really only downloaded 67 sigs for the …
here are some examples - all in the full 280k, none in the genomic-reps.
note to self: resolve issues between https://osf.io/t3fqa/ and https://osf.io/wxf9z/
I ran … Note that all genome sigs were generated with …
I dug into the catalog produced by …
with the latest SBTs & the branch in #1568, when constructing LCA databases I now get: …
the former is good, the latter is WTF, so I'll need to look into the latter, I guess!
OK, the missing identifiers are because of signatures with duplicate md5sums. See #1573. |
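A quick way to spot those duplicates (a sketch; assumes the plain-text `sourmash sig describe` output includes an `md5:` line per signature, and the collection file name is hypothetical):

```
# list md5sums that appear more than once in the collection
sourmash sig describe gtdb-rs202.genomic.k31.zip | grep '^md5:' | sort | uniq -d
```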
Leaving this here for posterity -
3 minutes and 11 GB of RAM to do a full "pangenome" search of GTDB 280k genomes with an LCA database. Something like 95% of the time is spent loading the JSON into memory, so clear room for improvement there...
This should also be split into "time loading the JSON" and "time reconstructing signatures", which is non-trivial once you have 280k of them =] |
On Mon, Jun 07, 2021 at 11:02:07AM -0700, Luiz Irber wrote:
> Something like 95% of the time is spent loading the JSON into memory, so clear room for improvement there...
> This should also be split into "time loading the JSON" and "time reconstructing signatures", which is non-trivial once you have 280k of them =]

well, the search routine doesn't reconstruct all of the signatures, so...?
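For the record, the kind of measurement quoted above can be reproduced with GNU time (a sketch; query and database file names are hypothetical):

```
# report elapsed time and peak memory for a containment search against the LCA database
/usr/bin/time -v sourmash search --containment query-genome.sig \
    gtdb-rs202.genomic.k31.lca.json.gz -k 31
```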
note: starting …
I did the following to convert @luizirber genbank SBT.zip files into regular ol' zipfiles, then upload them to google drive:
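The exact commands aren't shown above, but one plausible way to do the conversion (a sketch; assumes `sourmash sig cat` can write a `.zip` collection directly, as in recent 4.x versions, and file names are hypothetical):

```
# pull all signatures out of each .sbt.zip index and re-save them
# as plain zipfile collections
for k in 21 31 51
do
    sourmash sig cat genbank-k${k}.sbt.zip -o genbank-k${k}.zip
done
```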
Note, the genbank .zip files and taxonomy CSV are now available on farm under …
Now uploading genbank builds from …
closing in favor of #2015.
re sourmash 4.1 release in particular #1481 #1391
we now effectively have three types of databases we are supporting and could distribute - zipfile collections, SBTs (.sbt.zip), and LCA databases (.lca.json.gz) -

- we can supply genbank, refseq, GTDB genomic reps (~35k genomes), and/or GTDB all (~300k genomes)
- we can supply DNA and/or protein
- for LCA databases we can provide a taxonomy, too - NCBI or GTDB.
😅
A few misc thoughts -
Current status: GTDB zipfile collections are available on OSF (under Google Drive) for the latest RS202, courtesy of @bluegenes.