investigate new sketch type, ArgMinHash-based FracMinHash #2039

Open
ctb opened this issue May 6, 2022 · 10 comments

Comments

@ctb
Contributor

ctb commented May 6, 2022

per #267 (comment)

@dkoslicki says:

In the FracMinHash sketches, does sourmash save the k-mers that led to the hash values, or just the hash values? If the former, then you could do something like this in order to compute multiple sketch sizes. Not trying to sound like a broken record, just curious :) Though this may not be an issue after the rust-ification of sourmash given the ridiculous speed.

and me:

  • compute a "founder" FracMinHash sketch for k=51, storing the k-mer sequences as well.
  • then, descendant sketches (no longer FracMinHash bottom sketches, but still comparable as long as they share the same founder ksize) can be computed for any k < 51.
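To make those two steps concrete, here's a minimal Python sketch - the hash function is a stand-in (sourmash uses MurmurHash in practice), and canonicalization against the reverse complement is omitted for brevity:

```python
# Founder FracMinHash: keep the winning k-mer strings, not just hash values,
# so smaller-k sketches can be derived later by truncation.
import hashlib

MAX_HASH = 2**64

def hash_kmer(kmer: str) -> int:
    # stand-in 64-bit hash; sourmash itself uses MurmurHash
    return int.from_bytes(hashlib.sha1(kmer.encode()).digest()[:8], "big")

def founder_sketch(seq: str, ksize: int = 51, scaled: int = 1000) -> set:
    """FracMinHash that stores the selected k-mer sequences themselves."""
    threshold = MAX_HASH // scaled
    kmers = (seq[i:i + ksize] for i in range(len(seq) - ksize + 1))
    return {km for km in kmers if hash_kmer(km) < threshold}

def derive_sketch(founder: set, ksize: int) -> set:
    """Derive a smaller-ksize sketch by truncating the founding k-mers."""
    return {km[:ksize] for km in founder}

def containment(a: set, b: set) -> float:
    return len(a & b) / len(a) if a else 0.0
```

Two founder sketches built with the same founding ksize and scaled stay comparable at any derived k, since each derived k-mer is a deterministic prefix of a founding k-mer.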

I've been looking for an alternative to MinHash and FracMinHash that would let us generalize the sourmash code base while offering similar time/space/operations.

The key thing this would offer over FracMinHash is the ability to sketch at one k-mer size K and then compare at any ksize < K. This would result in pretty significant space savings when distributing multiple databases.

You could imagine something like this also working for creating multiple molecule types from one original sketch - store "founder" DNA or protein k-mers, and calculate derivative sketches for shorter k-mer sizes on demand.

Would we call this ArgFracHash or something? naming naming naming...

@dkoslicki
Collaborator

ArgFracHash is a great name! ArgMinHash is the name I went for when applying it to plain ol' MinHash, so it would at least be consistent.

@ctb
Contributor Author

ctb commented May 6, 2022

thinking a bit more -

  • we choose a set of founding k-mers for founding ksize, based on scaled parameter (i.e. k-mers that hash to values below _max_hash for that scaled).
  • for each sketch, we then store the founding-ksize k-mer bases, in canonical text/DNA format (the lexicographic minimum of the k-mer and its reverse complement)
  • to construct a sketch for a given ksize < founding ksize, we then truncate each founding k-mer to that size, and we don't do any further _max_hash filtering.

Then in order to downsample for any ksize, we would downsample the founding k-mers and recalculate at that ksize.
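One illustrative reading of those steps in Python - again with a stand-in hash, and with the exact canonicalization and downsampling semantics still to be worked out:

```python
# Canonical founding k-mers, truncation without further max_hash filtering,
# and downsampling applied to the founding k-mers rather than derived ones.
import hashlib

MAX_HASH = 2**64
_COMP = str.maketrans("ACGT", "TGCA")

def canonical(kmer: str) -> str:
    """Lexicographic minimum of a k-mer and its reverse complement."""
    rc = kmer.translate(_COMP)[::-1]
    return min(kmer, rc)

def hash_kmer(kmer: str) -> int:
    # stand-in 64-bit hash (sourmash uses MurmurHash)
    return int.from_bytes(hashlib.sha1(kmer.encode()).digest()[:8], "big")

def truncate_sketch(founders: set, ksize: int) -> set:
    # derive a ksize < founding-ksize sketch: truncate each founding
    # k-mer and re-canonicalize; no further max_hash filtering happens
    return {canonical(km[:ksize]) for km in founders}

def downsample(founders: set, new_scaled: int) -> set:
    # downsample on the *founding* k-mers; derived ksizes are then
    # recalculated from the surviving founders
    threshold = MAX_HASH // new_scaled
    return {km for km in founders if hash_kmer(km) < threshold}
```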

@ctb
Contributor Author

ctb commented May 6, 2022

ref #616

@mr-eyes
Member

mr-eyes commented May 7, 2022

Hi @dkoslicki,

I have read the manuscript, and it's very nice to be able to compress/query information across multiple k-mer sizes in that way. I have a question - please correct me if I'm misunderstanding something.

Given that CMash uses a KTST, with search and insertion costs ranging from O(log n) to O(n): how is CMash expected to perform (in time and space) on large-scale data? In other words, I want to query many large SRA samples and construct a database of large genomes/metagenomes with a small scaled value.

@dkoslicki
Collaborator

The ternary search tree was just a convenient way for us to efficiently do prefix lookups, so it's not vital to the idea of truncating larger k-mers to get a sketch with smaller k-mer sizes. Nonetheless, since we are aiming to classify bacterial genomes in a metagenomic sample (so: hash the database of genomes, stream the metagenome through) and the (Frac)MinHash sketches of bacterial genomes are relatively small, we've found in practice that it's relatively performant. If I recall correctly, I built a KTST database with something like 100K bacterial genomes, and the database ended up being on the order of 10GB. The query performance is OK (given that we did zero optimization and it's all in Python), but highly parallelizable, so we took the approach of just throwing more cores at it.

So it really depends on your use case: if you are looking to do all pairwise containment estimations between a collection of metagenomic samples for a range of k-mer sizes, the TST wouldn't be of much use. If you want to query against a (relatively static) collection of sketches, then a TST is a handy way to compress it while still allowing prefix searches.
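For concreteness, here's a toy TST supporting the prefix lookups described above - a minimal sketch, not CMash's actual KTST implementation:

```python
# Toy ternary search tree: store full-length k-mers once, then answer
# "does any stored k-mer start with this prefix?" for any prefix length.
class TSTNode:
    __slots__ = ("ch", "lo", "eq", "hi", "end")
    def __init__(self, ch):
        self.ch = ch
        self.lo = self.eq = self.hi = None
        self.end = False

class TST:
    def __init__(self):
        self.root = None

    def insert(self, word):
        self.root = self._insert(self.root, word, 0)

    def _insert(self, node, word, i):
        ch = word[i]
        if node is None:
            node = TSTNode(ch)
        if ch < node.ch:
            node.lo = self._insert(node.lo, word, i)
        elif ch > node.ch:
            node.hi = self._insert(node.hi, word, i)
        elif i + 1 < len(word):
            node.eq = self._insert(node.eq, word, i + 1)
        else:
            node.end = True
        return node

    def has_prefix(self, prefix):
        """True if any inserted string starts with `prefix`."""
        node, i = self.root, 0
        while node is not None:
            ch = prefix[i]
            if ch < node.ch:
                node = node.lo
            elif ch > node.ch:
                node = node.hi
            else:
                i += 1
                if i == len(prefix):
                    return True
                node = node.eq
        return False
```

Because every full-length k-mer is stored once, a single tree answers "is this k'-mer present?" for any k' up to the founding ksize via `has_prefix`.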

Do you mind saying a few more words about your envisioned use case? I.e. will your database have both genomes and metagenomes in it? Will your goal be to find entries in your database that make up your sample, or are you doing lookups (I have an unlabeled genome, find me its label by comparing to database entries)? etc.

@ctb
Contributor Author

ctb commented May 7, 2022

> Do you mind saying a few more words about your envisioned use case? I.e. will your database have both genomes and metagenomes in it? Will your goal be to find entries in your database that make up your sample, or are you doing lookups (I have an unlabeled genome, find me its label by comparing to database entries)? etc.

the single most obvious use case I see is distributing/storing just one zip file for all ksizes, for GTDB or Genbank. It wouldn't help with our indexed search collections and many of our other use cases, but the fast database search that we're working towards (tl;dr more cores) would be able to make use of it, and the single-core version is already pretty fast per #1958.

@mr-eyes
Member

mr-eyes commented May 7, 2022

> The ternary search tree was just a convenient way for us to efficiently do prefix lookups, so it's not vital to the idea of truncating larger k-mers to get a sketch with smaller k-mer sizes. Nonetheless, since we are aiming to classify bacterial genomes in a metagenomic sample (so: hash the database of genomes, stream the metagenome through) and the (Frac)MinHash sketches of bacterial genomes are relatively small, we've found in practice that it's relatively performant. If I recall correctly, I built a KTST database with something like 100K bacterial genomes, and the database ended up being on the order of 10GB. The query performance is OK (given that we did zero optimization and it's all in Python), but highly parallelizable, so we took the approach of just throwing more cores at it.

Thanks @dkoslicki for the detailed explanation.

> the single most obvious use case I see is distributing/storing just one zip file for all ksizes, for GTDB or Genbank. It wouldn't help with our indexed search collections and many of our other use cases, but the fast database search that we're working towards (tl;dr more cores) would be able to make use of it, and the single-core version is already pretty fast per #1958.

This is specifically the use case in my mind. Having Genbank/GTDB as ArgFracHash sketches and being able to query them using multiple k-mer sizes (as in sourmash prefetch/gather).

@dkoslicki
Collaborator

> the single most obvious use case I see is distributing/storing just one zip file for all ksizes, for GTDB or Genbank. It wouldn't help with our indexed search collections and many of our other use cases, but the fast database search that we're working towards (tl;dr more cores) would be able to make use of it, and the single-core version is already pretty fast per #1958.

Agreed. And side note: I chose a TST merely since it was convenient at the time (benchmarked some prefix-query-capable data structures that were available on GitHub with Python bindings, and just picked one). I'm sure there is a more efficient way to both compress and allow for prefix searches, if someone had motivation to think about it for a bit (or code it up efficiently).

> This is specifically the use case in my mind. Having Genbank/GTDB as ArgFracHash sketches and being able to query them using multiple k-mer sizes (as in sourmash prefetch/gather).

Perfect, then a TST will be your friend!

@ctb
Contributor Author

ctb commented May 9, 2022

I am probably missing something, but I was thinking of dynamically generating specific k sketches for each search, using an approach akin to our current downsampling approach. I don't see a role for TSTs in that.

(probably at this point I should just code up the thing I'm thinking in Python.)

Another use case that is blindingly obvious (assuming we can get good space savings) - SRA-scale metagenome sketching and storage. Right now we store 3x sketches for k=21, k=31, and k=51, for every SRA metagenome; and to search at a different ksize, we need to revisit the raw data. If we could sketch at k=51 and get not only DNA but translated amino acid sketches out of it, that would be very enabling.
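A rough illustration of that idea, assuming the standard genetic code and reading frame 0 only - real translated sketches would need to handle all six frames and canonical k-mers:

```python
# From one stored k=51 DNA founder k-mer, derive smaller-k DNA k-mers by
# truncation and a protein k-mer by translating the first reading frame.
# The 64-character table encodes the standard genetic code in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    b1 + b2 + b3: AMINO[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}

def derive_dna_kmer(founder_kmer: str, ksize: int) -> str:
    """Truncate a founding DNA k-mer to a smaller ksize."""
    assert ksize <= len(founder_kmer)
    return founder_kmer[:ksize]

def translate_kmer(founder_kmer: str) -> str:
    """Translate frame 0 of a DNA k-mer; k=51 yields a 17-aa peptide."""
    return "".join(CODON_TABLE[founder_kmer[i:i + 3]]
                   for i in range(0, len(founder_kmer) - 2, 3))
```

With this, one k=51 founder sketch could stand in for the k=21/k=31/k=51 DNA sketches plus a translated protein sketch, rather than storing each separately.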

@dkoslicki
Collaborator

@ctb the TST is just a convenient way to store a collection of sketches (or a sketch with low scale/tons of k-mers), so would indeed let you calculate different k sizes on the fly. If you have a single sketch, TSTs aren't of much use, but do give space/speed improvement when you have a bunch of them that you want to search against. Likely we're thinking along the same lines, just talking about it in different ways :)
