-
Notifications
You must be signed in to change notification settings - Fork 80
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
investigate new sketch type, ArgMinHash
-based FracMinHash
#2039
Comments
ArgFracHash is great name! ArgMinHash is the name I went for when applying it to plain ol' MinHash, so it would at least be consistent |
thinking a bit more -
Then in order to downsample for any ksize, we would downsample the founding k-mers and recalculate at that ksize. |
ref #616 |
Hi @dkoslicki, I have read the manuscript, and it's very nice being able to compress/query the multiple kmer size information in that way. I have a question, and please, correct me if I am not getting something correctly. Given that CMash uses KTST, which operates from |
The ternary search tree was just a convenient way for us to efficiently do prefix lookups, so it's not vital for the idea of truncating larger k-mers to get a sketch with smaller k-mer sizes. Nonetheless, since we are aiming for classifying bacterial genomes in a metagenomic sample (so hash the database of genomes, stream the metagenome through) and the (Frac)MinHash sketch of bacterial genomes are relatively small, we've found in practice that its relatively performant. If I recall correctly, I built a KTST database with something like 100K bacterial genomes and the database ended up being on the order of 10GB. The query performance is ok (given that we did zero optimization and it's all in python), but highly parallelizable, so we took the approach of just throwing more cores at it. So it really depends on your use case: if you are looking to do all pairwise containment estimations between a collection of metagenomic samples for a range of k-mer sizes, the TST wouldn't be of much use. If you want to query against a (relatively static) collection of sketches, than a TST is a handy way to compress it in a way that still allows prefix searches. Do you mind saying a few more words about your envisioned use case? I.e. will your database have both genomes and metagenomes in it? Will your goal be to find entries in your database that make up your sample, or are you doing lookups (I have an unlabeled genome, find me its label by comparing to database entries)? etc. |
the single most obvious use case I see is distributing/storing just one zip file for all ksizes, for GTDB or Genbank. It wouldn't help with our indexed search collections and many of our other use cases, but the fast database search that we're working towards (tl;dr more cores) would be able to make use of it, and the single-core version is already pretty fast per #1958. |
Thanks @dkoslicki for the detailed explanation.
This is specifically the use case in my mind. Having Genbank/GTDB as ArgFracHash sketches and being able to query them using multiple k-mer sizes (as in sourmash |
Agreed. And side note: I chose a TST merely since it was convenient at the time (benchmarked some prefix-query-capable data structures that were available on GitHub with Python bindings, and just picked one). I'm sure there is a more efficient way to both compress and allow for prefix searches, if someone had motivation to think about it for a bit (or code it up efficiently).
Perfect, then a TST will be your friend! |
I am probably missing something, but I was thinking of dynamically generating specific k sketches for each search, using an approach akin to our current downsampling approach. I don't see a role for TSTs in that. (probably at this point I should just code up the thing I'm thinking in Python.) Another use case that is blindingly obvious (assuming we can get good space savings) - SRA-scale metagenome sketching and storage. Right now we store 3x sketches for k=21, k=31, and k=51, for every SRA metagenome; and to search at a different ksize, we need to revisit the raw data. If we could sketch at k=51 and get not only DNA but translated amino acid sketches out of it, that would be very enabling. |
@ctb the TST is just a convenient way to store a collection of sketches (or a sketch with low scale/tons of k-mers), so would indeed let you calculate different k sizes on the fly. If you have a single sketch, TSTs aren't of much use, but do give space/speed improvement when you have a bunch of them that you want to search against. Likely we're thinking along the same lines, just talking about it in different ways :) |
per #267 (comment)
@dkoslicki says:
and me:
The key thing this would offer over FracMinHash is the ability to sketch at one k-mer size K and then compare at any ksize < K. This would result in pretty significant space savings when distributing multiple databases.
You could imagine that something like this could also work for creating multiple molecule types from one original sketch - store a "founder" DNA or protein k-mer, calculate derivative sketches for shorter k-mer sizes upon demand.
Would we call this ArgFracHash or something? naming naming naming...
The text was updated successfully, but these errors were encountered: