-
Notifications
You must be signed in to change notification settings - Fork 80
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
optimizing gather on large databases - some thoughts in re benchmarking #1957
Comments
In #1943 I started splitting the RevIndex into the linear part (that still can do all the parallel loading/processing) and the inverted index/colors part. Noticeable differences are
The SBT index is pretty good for calculating the initial So I'm considering making This is probably doable in Python with the |
thought unrelated to previous comment - It's surprising to me that the standing memory for prefetch on genbank is 40 GB! I wonder how much space the manifests are taking up? The 'bacteria' one is pretty big. I know for the wort CSV manifest for 4.5m signatures, the memory consumption was in the range of 6 GB, so the manifest for genbank bacteria should be only about 2 GB... but I wonder. |
from benchmarking on 1.2m signature genbank,
even fairly simple environmental metagenomes such as SRR1976948 are matching to lots of redundancy.
we're looking at ways to deal with this at a database construction level (e.g. #1852) but that's fraught with peril - I like having complete databases. my guess is that there is something that can be done in the prefetch / counter-gather stage.
my first guess was that there are a lot of perfect duplicates in the sketches, but a quick check on md5 rules that out - of the 723k prefetch matches in SRR12324253, only 33k of them are perfect dups at a scaled of 1000:
we are seeing a lot of memory being used in the counter-gather stage (compare memory usage for prefetch against genbank, stable at 40.2GB, vs gather, which is sometimes double or more). that is probably closely correlated with this redundancy!
I'm wondering if there are some tricks we could use in the counter gather class? can we find properly enclosing sets and ignore them? (e.g. situations where we have matches A and B, and A entirely encloses B, so we can ignore B?) dunno. but might help a lot with memory.
The text was updated successfully, but these errors were encountered: