optimizing gather on large databases - some thoughts in re benchmarking #1957

Open
ctb opened this issue Apr 16, 2022 · 2 comments

Comments

@ctb
Contributor

ctb commented Apr 16, 2022

from benchmarking on the 1.2m-signature genbank database:

even fairly simple environmental metagenomes such as SRR1976948 are matching with a lot of redundancy.

we're looking at ways to deal with this at a database construction level (e.g. #1852) but that's fraught with peril - I like having complete databases. my guess is that there is something that can be done in the prefetch / counter-gather stage.

my first guess was that there are a lot of perfect duplicates in the sketches, but a quick check on md5s rules that out - of the 723k prefetch matches in SRR12324253, only 33k are perfect dups at a scaled of 1000:

```shell
% csvtk cut -f match_md5 SRR12324253.prefetch.csv | wc -l
723080
% csvtk cut -f match_md5 SRR12324253.prefetch.csv | sort | uniq | wc -l
689292
```
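The same check, done in Python with only the standard library (assuming the prefetch CSV has a `match_md5` column, as sourmash prefetch output does; `count_duplicate_matches` is a hypothetical helper name):

```python
import csv
from collections import Counter

def count_duplicate_matches(prefetch_csv):
    """Count total vs. distinct match_md5 values in a prefetch CSV."""
    md5s = Counter()
    with open(prefetch_csv, newline="") as fp:
        for row in csv.DictReader(fp):
            md5s[row["match_md5"]] += 1
    total = sum(md5s.values())
    distinct = len(md5s)
    return total, distinct, total - distinct
```

(Note that the `wc -l` counts above include the CSV header line, so they are each one higher than the true row counts; `DictReader` skips the header.)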

we are seeing a lot of memory being used in the counter-gather stage (compare memory usage for prefetch against genbank, which is stable at 40.2 GB, with gather, which is sometimes double that or more). that is probably closely correlated with this redundancy!

I'm wondering if there are some tricks we could use in the counter gather class? can we find properly enclosing sets and ignore them? (e.g. situations where we have matches A and B, and A entirely encloses B, so we can ignore B?) dunno. but might help a lot with memory.
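As a toy sketch of the "enclosing sets" idea (plain Python sets standing in for sketches at a fixed scaled; `prune_enclosed` is a hypothetical helper, not part of the actual counter-gather class): if match A's hash set is a superset of match B's, every hash B could claim is already covered by A, so B can be dropped up front.

```python
def prune_enclosed(matches):
    """Drop matches whose hash sets are entirely enclosed by another match.

    `matches` maps match name -> set of hashes. Returns the names to keep,
    largest match first.
    """
    # Visit largest-first so potential enclosers are seen before enclosees.
    ordered = sorted(matches, key=lambda name: len(matches[name]), reverse=True)
    keep = []
    for name in ordered:
        hashes = matches[name]
        if any(hashes <= matches[kept] for kept in keep):
            continue  # enclosed by (or identical to) an already-kept match
        keep.append(name)
    return keep
```

The subset checks are O(n²) over the matches in the worst case, so a real implementation would want to limit comparisons, e.g. to matches that share at least one hash.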

@luizirber
Member

> I'm wondering if there are some tricks we could use in the counter gather class? can we find properly enclosing sets and ignore them? (e.g. situations where we have matches A and B, and A entirely encloses B, so we can ignore B?) dunno. but might help a lot with memory.

In #1943 I started splitting the RevIndex into the linear part (which can still do all the parallel loading/processing) and the inverted index/colors part. Noticeable differences:

  • when building a Counter for a query: linear has to go through all indexed sigs, while RevIndex benefits from the color mapping.
  • when performing gather: linear needs both the match and the sig for each indexed dataset still in the Counter, forcing a lot of loading from disk if not already loaded (or, conversely, keeping a lot in memory). RevIndex can again benefit from the color mapping and remove hashes quickly (without loading all the indexed sigs).

The SBT index is pretty good for calculating the initial Counter (because it can skip indexed sigs that don't match earlier), but still terrible for the gather part. LinearIndex is terrible for both (even if parallelized), and RevIndex is great for both (at the cost of memory consumption).

So I'm considering making prefetch return a reduced RevIndex containing only the hashes matching the query (and, for each hash, the associated color / list of datasets containing it), instead of returning something like a LinearIndex (even if we don't explicitly call it that, a list of matching sigs is effectively equivalent). A RevIndex with 600k sigs is still massive, but it will be much smaller if we consider only the hashes in the query.

This is probably doable in Python with the SqliteIndex in place of RevIndex, I think?
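To make the shape of that reduced structure concrete, here is a toy Python sketch (hypothetical names, with dicts and sets in place of the actual RevIndex/colors machinery): keep only a mapping from each query hash to the datasets containing it, and gather then never needs to load a sketch again.

```python
from collections import Counter, defaultdict

def build_reduced_index(query_hashes, datasets):
    """Map each query hash -> list of ids of datasets containing it.

    `datasets` maps dataset id -> set of hashes; only hashes present
    in the query are retained.
    """
    index = defaultdict(list)
    for ds_id, hashes in datasets.items():
        for h in query_hashes & hashes:
            index[h].append(ds_id)
    return index

def greedy_gather(query_hashes, index):
    """Repeatedly pick the dataset covering the most remaining hashes."""
    remaining = set(query_hashes)
    order = []
    while remaining:
        counts = Counter(ds for h in remaining for ds in index.get(h, ()))
        if not counts:
            break  # leftover hashes match no dataset
        best, n = counts.most_common(1)[0]
        order.append((best, n))
        remaining -= {h for h in remaining if best in index.get(h, ())}
    return order
```

This is only the data-structure shape, not the real implementation; the actual colors scheme deduplicates identical dataset lists, which is where the memory win over a naive per-hash list comes from.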

@ctb
Contributor Author

ctb commented Apr 17, 2022

a thought unrelated to the previous comment - it's surprising to me that the standing memory for prefetch on genbank is 40 GB! I wonder how much space the manifests are taking up? the 'bacteria' one is pretty big.

I know that for the wort CSV manifest covering 4.5m signatures, memory consumption was in the range of 6 GB, so the manifest for genbank bacteria should only be about 2 GB... but I wonder.
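The back-of-the-envelope scaling behind that estimate, assuming manifest memory grows linearly with signature count:

```python
# linear scaling: 6 GB for 4.5m wort signatures -> per-signature cost,
# then scale to the 1.2m-signature genbank database discussed above.
wort_sigs = 4_500_000
wort_mem_gb = 6.0
genbank_sigs = 1_200_000

estimate_gb = wort_mem_gb / wort_sigs * genbank_sigs
print(f"{estimate_gb:.1f} GB")  # ~1.6 GB, i.e. "only about 2 GB"
```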
