optimizing gather on large databases - some thoughts in re benchmarking #1957

Open
ctb opened this issue Apr 16, 2022 · 2 comments

Comments

@ctb
Contributor

ctb commented Apr 16, 2022

from benchmarking on the 1.2m-signature genbank database:

even fairly simple environmental metagenomes such as SRR1976948 are matching with a lot of redundancy.

we're looking at ways to deal with this at a database construction level (e.g. #1852) but that's fraught with peril - I like having complete databases. my guess is that there is something that can be done in the prefetch / counter-gather stage.

my first guess was that there are a lot of perfect duplicates in the sketches, but a quick check on md5s rules that out - of the 723k prefetch matches in SRR12324253, only 33k are perfect dups at a scaled of 1000:

```shell
% csvtk cut -f match_md5 SRR12324253.prefetch.csv | wc -l
723080
% csvtk cut -f match_md5 SRR12324253.prefetch.csv | sort | uniq | wc -l
689292
```
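The same check, done in Python with only the standard library (assuming the prefetch CSV has a `match_md5` column, as sourmash prefetch output does; `count_duplicate_matches` is a hypothetical helper name):

```python
import csv
from collections import Counter

def count_duplicate_matches(prefetch_csv):
    """Count total vs. distinct match_md5 values in a prefetch CSV."""
    md5s = Counter()
    with open(prefetch_csv, newline="") as fp:
        for row in csv.DictReader(fp):
            md5s[row["match_md5"]] += 1
    total = sum(md5s.values())
    distinct = len(md5s)
    return total, distinct, total - distinct
```

(Note that the `wc -l` counts above include the CSV header line, so they are each one higher than the true row counts; `DictReader` skips the header.)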

we are seeing a lot of memory being used in the counter-gather stage (compare memory usage for prefetch against genbank, which is stable at 40.2 GB, with gather, which is sometimes double that or more). that is probably closely correlated with this redundancy!

I'm wondering if there are some tricks we could use in the counter gather class? can we find properly enclosing sets and ignore them? (e.g. situations where we have matches A and B, and A entirely encloses B, so we can ignore B?) dunno. but might help a lot with memory.
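As a toy sketch of the "enclosing sets" idea (plain Python sets standing in for sketches at a fixed scaled; `prune_enclosed` is a hypothetical helper, not part of the actual counter-gather class): if match A's hash set is a superset of match B's, every hash B could claim is already covered by A, so B can be dropped up front.

```python
def prune_enclosed(matches):
    """Drop matches whose hash sets are entirely enclosed by another match.

    `matches` maps match name -> set of hashes. Returns the names to keep,
    largest match first.
    """
    # Visit largest-first so potential enclosers are seen before enclosees.
    ordered = sorted(matches, key=lambda name: len(matches[name]), reverse=True)
    keep = []
    for name in ordered:
        hashes = matches[name]
        if any(hashes <= matches[kept] for kept in keep):
            continue  # enclosed by (or identical to) an already-kept match
        keep.append(name)
    return keep
```

The subset checks are O(n²) over the matches in the worst case, so a real implementation would want to limit comparisons, e.g. to matches that share at least one hash.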

@luizirber
Member

> I'm wondering if there are some tricks we could use in the counter gather class? can we find properly enclosing sets and ignore them? (e.g. situations where we have matches A and B, and A entirely encloses B, so we can ignore B?) dunno. but might help a lot with memory.

In #1943 I started splitting the RevIndex into the linear part (which can still do all the parallel loading/processing) and the inverted index/colors part. Noticeable differences:

  • when building a Counter for a query: linear has to go through all indexed sigs, while RevIndex benefits from the color mapping.
  • when performing gather: linear needs both the match and the sig for each indexed dataset still in the Counter, forcing a lot of loading from disk if not already loaded (or, conversely, keeping a lot in memory). RevIndex can again benefit from the color mapping and remove hashes quickly (without loading all the indexed sigs).

The SBT index is pretty good for calculating the initial Counter (because it can skip indexed sigs that don't match earlier), but still terrible for the gather part. LinearIndex is terrible for both (even if parallelized), and RevIndex is great for both (at the cost of memory consumption).

So I'm considering making prefetch return a reduced RevIndex containing only the hashes matching the query (and, for each hash, the associated color / list of datasets containing it), instead of returning something like a LinearIndex (even if we don't explicitly call it that, a list of matching sigs is effectively equivalent). A RevIndex with 600k sigs is still massive, but it will be much smaller if we consider only the hashes in the query.

This is probably doable in Python with the SqliteIndex in place of RevIndex, I think?
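To make the shape of that reduced structure concrete, here is a toy Python sketch (hypothetical names, with dicts and sets in place of the actual RevIndex/colors machinery): keep only a mapping from each query hash to the datasets containing it, and gather then never needs to load a sketch again.

```python
from collections import Counter, defaultdict

def build_reduced_index(query_hashes, datasets):
    """Map each query hash -> list of ids of datasets containing it.

    `datasets` maps dataset id -> set of hashes; only hashes present
    in the query are retained.
    """
    index = defaultdict(list)
    for ds_id, hashes in datasets.items():
        for h in query_hashes & hashes:
            index[h].append(ds_id)
    return index

def greedy_gather(query_hashes, index):
    """Repeatedly pick the dataset covering the most remaining hashes."""
    remaining = set(query_hashes)
    order = []
    while remaining:
        counts = Counter(ds for h in remaining for ds in index.get(h, ()))
        if not counts:
            break  # leftover hashes match no dataset
        best, n = counts.most_common(1)[0]
        order.append((best, n))
        remaining -= {h for h in remaining if best in index.get(h, ())}
    return order
```

This is only the data-structure shape, not the real implementation; the actual colors scheme deduplicates identical dataset lists, which is where the memory win over a naive per-hash list comes from.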

@ctb
Contributor Author

ctb commented Apr 17, 2022

a thought unrelated to the previous comment - it's surprising to me that the standing memory for prefetch on genbank is 40 GB! I wonder how much space the manifests are taking up? the 'bacteria' one is pretty big.

I know that for the wort CSV manifest covering 4.5m signatures, memory consumption was in the range of 6 GB, so the manifest for genbank bacteria should only be about 2 GB... but I wonder.
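The back-of-the-envelope scaling behind that estimate, assuming manifest memory grows linearly with signature count:

```python
# linear scaling: 6 GB for 4.5m wort signatures -> per-signature cost,
# then scale to the 1.2m-signature genbank database discussed above.
wort_sigs = 4_500_000
wort_mem_gb = 6.0
genbank_sigs = 1_200_000

estimate_gb = wort_mem_gb / wort_sigs * genbank_sigs
print(f"{estimate_gb:.1f} GB")  # ~1.6 GB, i.e. "only about 2 GB"
```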
