`sourmash index` running out of memory #1284

ctb · 2021-01-18T18:29:57Z

a user reports via e-mail that they have a custom database of about 300k genomes, and their `sourmash index` job is being killed, presumably because it runs out of memory. I responded with the following suggestions:

what version of sourmash are you using?

this is likely due to running out of memory; a few thoughts -

first, try increasing the `--scaled` value you use when indexing the database. For example, you can do `sourmash index --scaled=10000 …` and it should reduce the memory usage, since a larger scaled value keeps fewer hashes per sketch.

second, try decreasing the size of the bloom filters you are using for `sourmash index` - maybe try `sourmash index -x 1e4`, I think. On the downside, this will slow down database search ;(.

third, you can split the signatures up. It’s not that much slower to search 6 databases of 50k signatures each, and the build process should be much faster/smaller for those smaller databases. Just put all of the databases on the command line for `sourmash search`, `sourmash gather`, etc.

fourth, for running `sourmash gather` on large databases, we have a stop-gap approach that is faster and doesn’t require indexing the database. It’s implemented in genome-grist, here - https://github.com/dib-lab/genome-grist/blob/latest/genome_grist/conf/Snakefile#L506 and here - https://github.com/dib-lab/genome-grist/blob/latest/genome_grist/conf/Snakefile#L513. Happy to explain further if you’re interested!

finally, if you are using `sourmash gather`, we have some new approaches coming along that you can try out that don’t require building an LCA or SBT index. They won’t be out officially for a few months, but if interested you can track them in the "greyhound" issue on makeshift strategies for large-scale database search, #1226.
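
for the third suggestion above (splitting the signatures up), here is a minimal sketch of one way to do the split. It assumes the signatures are individual `*.sig` files in one directory; the directory names, the `batch_NNN.txt` naming, and both helper functions are hypothetical and not part of sourmash. It only writes one text file of signature paths per batch, and does not call sourmash itself.

```python
from pathlib import Path


def split_into_batches(paths, batch_size=50_000):
    """Yield consecutive slices of `paths`, each with at most `batch_size` items."""
    for start in range(0, len(paths), batch_size):
        yield paths[start:start + batch_size]


def write_batch_lists(sig_dir, out_dir, batch_size=50_000):
    """Write one text file per batch, listing one signature path per line.

    Assumes (hypothetically) that every signature is a separate *.sig file
    directly inside `sig_dir`. Returns the list files that were written.
    """
    sig_paths = sorted(str(p) for p in Path(sig_dir).glob("*.sig"))
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)

    list_files = []
    for i, batch in enumerate(split_into_batches(sig_paths, batch_size)):
        list_file = out / f"batch_{i:03d}.txt"
        list_file.write_text("\n".join(batch) + "\n")
        list_files.append(list_file)
    return list_files


if __name__ == "__main__":
    # Hypothetical directory names; adjust to your own layout.
    for list_file in write_batch_lists("signatures", "batches"):
        print(list_file)
```

with about 300k signatures and the default batch size this gives 6 lists of 50k paths each. Each list can then go to its own `sourmash index` run; I believe recent versions accept a file of paths via `--from-file`, but treat that as an assumption and check `sourmash index --help` for your version.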


ctb · 2022-05-04T14:57:38Z

We have several new database formats (including Zip files and on-disk SQLite dbs) that should solve this problem; documentation added in #2025, available at https://sourmash.readthedocs.io/en/latest/databases-advanced.html.

ctb closed this as completed May 4, 2022
