sourmash index running out of memory #1284

Closed
ctb opened this issue Jan 18, 2021 · 1 comment


ctb commented Jan 18, 2021

A user reports via e-mail that they have a custom database of about 300k genomes and that their sourmash index job is being killed, presumably because it is running out of memory. I responded with the following suggestions:

What version of sourmash are you using?

This is likely due to running out of memory. A few thoughts:

  • First, try increasing the --scaled value with which you are indexing the database. For example, you can run sourmash index --scaled=10000 … and it should reduce the memory usage.
  • Second, try decreasing the size of the Bloom filters you are using for sourmash index; maybe try sourmash index -x 1e4, I think. On the downside, this will decrease the speed of database search ;(.
  • Third, you can split the signatures up. It’s not that much slower to search 6 databases of 50k signatures each, and the build process should be much faster and need less memory for those smaller databases. Just put them all on the command line for sourmash search, gather, etc.
  • Fourth, for running sourmash gather on large databases, we have a stop-gap approach that is faster and doesn’t require indexing the database. It’s implemented in genome-grist, here: https://github.com/dib-lab/genome-grist/blob/latest/genome_grist/conf/Snakefile#L506 and here: https://github.com/dib-lab/genome-grist/blob/latest/genome_grist/conf/Snakefile#L513. Happy to explain further if you’re interested!
  • Finally, if you are using sourmash gather, we have some new approaches coming along that you can try out that don’t require building an LCA or SBT index. They won’t be out officially for a few months, but if interested you can track them in issue #1226, other makeshift strategies for large scale database search - the "greyhound" issue.
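
To make the third suggestion concrete, here is a minimal Python sketch of splitting a signature list into batches and building one sourmash index command per batch. The file names, the db_part prefix, and both helper functions are hypothetical and only for illustration. The commands are constructed as argument lists and not executed, and any options from the suggestions above (such as --scaled) would still need to be added.

```python
from typing import List, Sequence


def split_into_batches(paths: Sequence[str], batch_size: int) -> List[List[str]]:
    """Split a sequence of signature paths into consecutive batches."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    return [list(paths[i:i + batch_size]) for i in range(0, len(paths), batch_size)]


def build_index_commands(
    paths: Sequence[str], batch_size: int, prefix: str = "db_part"
) -> List[List[str]]:
    """Build one `sourmash index <dbname> <signatures...>` argument list per batch.

    The database names (prefix plus batch number) are an assumption made
    for this sketch, not a sourmash convention.
    """
    commands = []
    for n, batch in enumerate(split_into_batches(paths, batch_size)):
        commands.append(["sourmash", "index", f"{prefix}_{n}", *batch])
    return commands


if __name__ == "__main__":
    # Hypothetical signature file names standing in for a real collection.
    sigs = [f"genome_{i}.sig" for i in range(12)]
    for cmd in build_index_commands(sigs, batch_size=4):
        # Print the database name and how many signatures go into it.
        print(cmd[2], len(cmd) - 3)
```

Each resulting database can then be listed on the command line for sourmash search or gather, as described in the third suggestion. With very large batches the argument list may exceed the operating system's command-line length limit, so this is a sketch of the batching logic rather than a ready-made build script.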

ctb commented May 4, 2022

We have several new database formats (including Zip files and on-disk SQLite dbs) that should solve this problem; documentation added in #2025, available at https://sourmash.readthedocs.io/en/latest/databases-advanced.html.

ctb closed this as completed May 4, 2022