Basic usability problems with genbank SBT #716

Closed
ctb opened this issue Aug 24, 2019 · 6 comments

@ctb
Contributor

ctb commented Aug 24, 2019

This is going to be a bit of a meta issue, but it is also a separate UX issue.

With the genbank-d2 update, it's become quite hard to actually use the genbank SBT. A few issues -

  • it's too big to unpack on Jetstream - it's in the ~30 GB range
  • it's too slow to search on my laptop, even with sbt gather.

I know there are various things being worked on that could help with these issues, and I want to collect them here.

@ctb
Contributor Author

ctb commented Aug 24, 2019

One note is that 'sourmash gather -n' could truncate the search at that point. Right now it does not, which makes running sbt gather on many files kind of annoying.

(CTB note: fixed in #1042)
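
To illustrate what truncating the search would mean, here is a toy sketch. It assumes a greedy "pick the best remaining match" loop over plain Python sets; it is not sourmash's implementation, and `toy_gather`, `max_results`, and the database layout are hypothetical names made up for the example.

```python
def toy_gather(query, database, max_results=None):
    """Greedy containment search over plain Python sets (illustrative only).

    query: set of hashes; database: dict mapping a name to a set of hashes.
    max_results is a hypothetical cap on the number of matches reported.
    """
    remaining = set(query)
    results = []
    while remaining and database:
        if max_results is not None and len(results) >= max_results:
            break  # truncate here instead of searching until nothing is left
        # Pick the entry that covers the most not-yet-explained hashes.
        name, hashes = max(database.items(),
                           key=lambda item: len(remaining & item[1]))
        overlap = remaining & hashes
        if not overlap:
            break  # nothing in the database matches what is left
        results.append((name, len(overlap)))
        remaining -= overlap
    return results


if __name__ == "__main__":
    db = {"a": {1, 2, 3}, "b": {3, 4}, "c": {5}}
    print(toy_gather({1, 2, 3, 4, 5}, db, max_results=1))
```

The point of the cap is that each extra match costs another pass over the database, so stopping after the first few results saves most of the work when only the top matches are needed.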

@ctb
Contributor Author

ctb commented Aug 24, 2019

Supporting direct compressed sbt.json archives (.zip or .tar.gz?) would be a big step. See #648.

@luizirber
Member

Biggest issue with #648 is that... it's kind of slow (but I think there are some ZIP tricks that can be played to make it faster).

The size of the SBT is due to changes to internal node sizes, but we can revert that. Clustering the SBT will also help (#710 (comment)), because we can compress the internal nodes better.

Oh, and I don't think we are compressing the internal nodes, which is also relevant to #648 (store the nodes compressed inside the ZIP, and use the ZIP format just for single-file distribution).
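
A rough sketch of that last idea, using only the Python standard library: each node is gzip-compressed on its own, and the ZIP is written in "stored" mode so it acts purely as a single-file container. This is not the code from #648; `pack_nodes`, `load_node`, and the node names are hypothetical.

```python
import gzip
import io
import zipfile


def pack_nodes(nodes, archive):
    """Write each node as a gzip-compressed member of a stored (uncompressed) ZIP.

    nodes: dict mapping a node name to its raw bytes.
    archive: a path or a seekable binary file object.
    """
    with zipfile.ZipFile(archive, "w", compression=zipfile.ZIP_STORED) as zf:
        for name, payload in nodes.items():
            # Compression happens per node; the ZIP layer adds none of its own.
            zf.writestr(name + ".gz", gzip.compress(payload))


def load_node(archive, name):
    """Read one node back without unpacking the rest of the archive."""
    with zipfile.ZipFile(archive, "r") as zf:
        return gzip.decompress(zf.read(name + ".gz"))


if __name__ == "__main__":
    buf = io.BytesIO()
    pack_nodes({"internal.0": b"\x00" * 1000, "leaf.1": b"abc"}, buf)
    print(load_node(buf, "leaf.1"))
```

Because every member can be read on its own, a search only has to decompress the nodes it actually visits, and nothing needs to be unpacked to disk first.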

@ctb
Contributor Author

ctb commented Oct 22, 2019

ref #646

@ctb
Contributor Author

ctb commented May 4, 2020

With #799 and #648 now merged, many of these issues should be resolved - will regenerate databases and see what happens :). #925 is the next likely candidate for performance improvements on SBTs and large databases, specifically.

ctb closed this as completed Jul 18, 2020
@ctb
Contributor Author

ctb commented Jul 18, 2020

the remaining usability problems are mostly around NCBI taxonomy issues... see sourmash-bio/databases#8
