adjust default thresholding for prefetch and gather? #2360

Open · ctb opened this issue Nov 12, 2022 · 11 comments

@ctb (Contributor) commented Nov 12, 2022:

based on some of the results from the new long-read benchmarking paper by Portik et al. (link - not yet updated), @bluegenes and I were discussing ways to decrease the false positive rate for prefetch/gather when using very low thresholds.

right now, the "best" way to lower the threshold-bp is to set it to 0 and then filter afterwards. but this incurs really heavy time/memory penalties - e.g. for the nanopore Q20 Zymo dataset, we end up matching 80% of the database with threshold-bp=0!

on slack, I wrote:

> hot take is that we might want to provide an alternate thresholding mechanism that is number of hashes. 1 hash causes problems, 2 hashes should be much better. my intuition is that it should be ~irrespective of the scaled factor unless scaled is very small (such that k-mers overlap).

this might also be important for virus/phage matches.

the UX is a bit hard to fathom - I feel like --threshold-bp provides some kind of useful intuition! - but I can see a few different options that might work -

  • provide --threshold-n-hashes or something, that would support thresholds of 2-3;
  • even for very low threshold-bp (e.g. --threshold-bp=0) support num-hashes filtering of 2 or 3, unless --disable-hash-thresholding is specified. non-intuitive, tho.
  • ???
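
For concreteness, here's the arithmetic connecting the two threshold styles (a back-of-envelope sketch, not sourmash code): under FracMinHash, each retained hash stands in for roughly `scaled` bp of sequence, so a hash-count threshold and threshold-bp are related by `threshold_bp ≈ n_hashes * scaled` - which is why a fixed hash floor should be ~scaled-independent.

```python
# Back-of-envelope sketch (not sourmash internals): each retained FracMinHash
# hash represents ~scaled bp, so a hash-count floor maps onto a threshold-bp.
scaled = 1000
for n_hashes in (1, 2, 3):
    print(f"{n_hashes} hash(es) at scaled={scaled} ~ threshold-bp of {n_hashes * scaled}")
```
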
@bluegenes (Contributor) commented:

+1 to the idea of having the default 2 or 3 hash filtering in place, unless disabled

@ctb (Author) commented Nov 12, 2022:

my intuition suggests that, as long as the hashes are entirely independent (i.e. come from non-overlapping k-mers), this should be very reliable. @dkoslicki, we'll pass you the long-read paper results when they are posted to bioRxiv - I think there's some interesting implications for classification, and maybe even minimizer design, that might emerge from thinking this through.

@taylorreiter (Contributor) commented:

Oooh I also love this idea! Matches my intuition from digging around in charcoal results too, where we frequently have 2-3 matches from small contigs.

@ctb (Author) commented Nov 15, 2022:

I put together some notebooks on detection sensitivity for 100bp reads and for 10kb reads.

As expected, genomes can be detected very sensitively at very, very low coverage.

I think it will be easy to show that detection of multiple independent k-mers for large k will be incredibly specific, too. But I haven't done that yet.
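
A rough version of that sensitivity calculation (my sketch of the model, not the notebook code): assume FracMinHash keeps each distinct k-mer independently with probability 1/scaled, so the number of hashes observed from B distinct covered bases is ~Binomial(B, 1/scaled).

```python
# Detection-sensitivity sketch (assumed model, not the notebook code):
# hashes observed from B distinct covered bases ~ Binomial(B, 1/scaled).
scaled = 1000
p = 1 / scaled
for covered_bp in (1_000, 3_000, 10_000):
    p0 = (1 - p) ** covered_bp                         # P(0 hashes)
    p1 = covered_bp * p * (1 - p) ** (covered_bp - 1)  # P(exactly 1 hash)
    print(f"{covered_bp} bp covered: P(>= 2 hashes) ~ {1 - p0 - p1:.2f}")
```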

@ctb (Author) commented Nov 16, 2022:

long read paper:

Evaluation of taxonomic classification and profiling methods for long-read shotgun metagenomic sequencing datasets: https://www.biorxiv.org/content/10.1101/2022.01.31.478527v2

Relevant Figure 4 showing that sourmash has weirdly good performance 😆 -

[screenshot of Figure 4]

@ctb (Author) commented Nov 16, 2022:

theory: for optimal sensitivity/specificity tradeoff we want to choose the smallest k-mer size such that no one k-mer is expected to appear more than once per genome.

calculation of said limit is left as an exercise to the reader, but it should be as simple as:

int(math.log(avg_genome_size, 4) + 1)

might want to pick an odd k-mer size 🤔
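
To spell out the missing step (under the random-genome assumption): a genome of size G contains ~G k-mers, and any particular k-mer occurs G/4^k times in expectation, so requiring G/4^k <= 1 gives k >= log4(G). Plugging in a 5 Mb genome (my example number):

```python
import math

avg_genome_size = 5_000_000  # hypothetical 5 Mb bacterial genome
k = int(math.log(avg_genome_size, 4) + 1)
print(k)  # 12; round up to 13 for an odd k-mer size
```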

@dkoslicki (Collaborator) commented Nov 16, 2022:

Just catching up on this

> @bluegenes and I were discussing ways to decrease the false positive rate for prefetch/gather when using very low thresholds.

Funnily enough, a math PhD student and I have been working on this exact thing this past year! The basic idea boils down to: if you set an ANI threshold (say, 0.95) by which you decide if something "matches" to your reference, you can use the stats from Mahmudur and Tessa's paper to effectively answer the question "have I seen enough k-mers such that I can claim this organism is in the sample?" This boils down to a quantile regression, so relatively straightforward. We're in the final stages of validating it on real data (annoyingly, there are normalization issues since "relative abundance" can be calculated in many different ways), but happy to discuss further if you're interested! In essence, it calculates the --threshold-bp per reference genome, based on the individual reference genomes' sketch/k-mer stats.
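
A minimal sketch of the flavor of that calculation, assuming just the point estimate ANI ≈ containment^(1/k) (the real version replaces this with a quantile-regression confidence bound; the function and numbers here are illustrative):

```python
# Minimal sketch only: derive a per-genome threshold-bp from an ANI cutoff
# using the point estimate ANI ~= containment**(1/ksize). The actual method
# replaces this point estimate with a quantile-regression confidence bound.
def threshold_bp_for_ani(genome_size_bp, ani=0.95, ksize=31):
    min_containment = ani ** ksize  # containment implied by the ANI cutoff
    return int(min_containment * genome_size_bp)

print(threshold_bp_for_ani(5_000_000))  # ~1,019,000 bp for a 5 Mb genome
```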

> theory: for optimal sensitivity/specificity tradeoff we want to choose the smallest k-mer size such that no one k-mer is expected to appear more than once per genome.

Assuming genomes are random strings, correct? Otherwise, repeat lengths are going to cause an issue.

@ctb (Author) commented Nov 17, 2022:

> Just catching up on this
>
> > @bluegenes and I were discussing ways to decrease the false positive rate for prefetch/gather when using very low thresholds.
>
> Funnily enough, a math PhD student and I have been working on this exact thing this past year! The basic idea boils down to: if you set an ANI threshold (say, 0.95) by which you decide if something "matches" to your reference, you can use the stats from Mahmudur and Tessa's paper to effectively answer the question "have I seen enough k-mers such that I can claim this organism is in the sample?" This boils down to a quantile regression, so relatively straightforward. We're in the final stages of validating it on real data (annoyingly, there are normalization issues since "relative abundance" can be calculated in many different ways), but happy to discuss further if you're interested! In essence, it calculates the --threshold-bp per reference genome, based on the individual reference genomes' sketch/k-mer stats.

Awesome! I suspect we are coming at similar questions from very different directions - would be good to talk when you're ready, or I can send you stuff in a little bit!

> > theory: for optimal sensitivity/specificity tradeoff we want to choose the smallest k-mer size such that no one k-mer is expected to appear more than once per genome.
>
> Assuming genomes are random strings, correct? Otherwise, repeat lengths are going to cause an issue.

Mmmh, I'm tempted to break this down in a slightly different way than I think you're thinking -

  • first: for a given query (set of hashes), do we find all possible matches? (prefetch)
  • second: did we do a good enough job of weeding out statistical false positives? (k-mer size selection, thresholding, etc.)
  • third: are we picking the "right" best match? (minimum metagenome cover/gather, currently)
  • fourth: are we finding the "right" taxonomy? (gather to tax)

My overactive imagination is guessing that you're trying to tackle (2) and (3) at the same time?

Anyway, repeats only matter for (3), not for (2). :)

More soon. I appear to be fixated on this at the moment.

@ctb (Author) commented Nov 18, 2022:

@dkoslicki you might be interested in this line of thinking too - unicity distances.

ctb/2022-sourmash-sens-spec#1

@ctb (Author) commented Dec 21, 2022:

one of my (our!) conclusions from digging into the reasons why sourmash performs well in the above paper is that even low thresholds for combinatorial collections of hashes are really effective. we're also coming to some really interesting taxonomic conclusions that I have to dive into before I can communicate them effectively and concisely.

but, a conversation on lab slack made me connect the question of prefetch/gather thresholding in general to the more specific question of what thresholding to use for decontamination with charcoal. it is interesting to remember that we dropped single-hash / LCA-style methods completely from charcoal in favor of requiring 3 hashes to call the taxonomy of a contig - in brief, because LCA-style methods were far too prone to false positives! I have to think about how to integrate that into my new understanding of things...

links:

anyway, there is no real big surprise here now, other than to say that taxonomies, genome contamination, and taxonomy quality issues loom large when you start looking at outliers in genome identification.

@ctb (Author) commented Jun 26, 2024:

Answer to a question in the Element/Matrix group:

> I have a question regarding the gather and search commands. I am using the default values for the threshold for gather and search, but it is not clear to me how the default values of 50kb and 0.08 were determined and how they relate to the confidence I can have that the top result is the correct match to the query signature. Will you please clarify this for me?

We have some discussion of the 'gather' threshold here, #2360, which also relates to the threshold used for search --containment and prefetch (which are the same underneath). In brief, we see very few false positives at scaled=1000 and threshold-bp=3000 for these commands. Simulations, theory, and benchmarking all suggest that at k=31 you have essentially zero false positives; the biggest problem we find is aberrant similarities due to e.g. contamination in MAGs.

The 50kb thus relates more to sensitivity and speed - 50kb is very low for comparing two bacterial genomes, or a metagenome against a bacterial genome, since you will detect overlaps as small as 1% of an E. coli genome (50 kb of ~5 Mb). You should raise it if you don't care to detect that small an overlap! Lowering it may be necessary for smaller genomes like viruses, but we haven't worked out the parameters for viruses yet.

Last but not least, lowering the threshold for gather makes it much slower.

So there's no inherent reason we chose 50kb, it just matched the domain of life we were exploring (bacteria/archaea) and was a good default for speed reasons.
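
To make the arithmetic behind those defaults explicit (my sketch - threshold-bp divided by scaled gives the minimum number of overlapping hashes):

```python
# Arithmetic behind the defaults discussed above (sketch, not sourmash internals):
scaled = 1000
ecoli_bp = 5_000_000  # approximate E. coli genome size
for threshold_bp, label in ((3_000, "low-FP floor"), (50_000, "gather default")):
    n_hashes = threshold_bp // scaled
    print(f"{label}: >= {n_hashes} hashes, {threshold_bp / ecoli_bp:.1%} of E. coli")
```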

In terms of Jaccard, there's no real reason for the default. You can set it to 0 if you like without impacting speed much, and there's no empirical reason not to. It's just a question of how small a similarity you want to detect. The two metrics you can use to assess specificity are (1) mash distance (for which you should read the mash paper!) or (2) ANI, which relates to the biology you want to study.

I also should say I don't find Jaccard itself to be very useful, in general. ANI is a much more intuitively useful measure and you can match to genomes at 90% ANI or above with k-mer based analysis tools like sourmash.
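
For a sense of scale (my numbers, using the Mash-distance estimate D = -(1/k) * ln(2J / (1 + J)) from the mash paper, with ANI ≈ 1 - D):

```python
import math

# Worked example: mapping Jaccard to the Mash distance estimate and ~ANI,
# D = -(1/k) * ln(2J / (1 + J)), ANI ~= 1 - D. Numbers are illustrative.
ksize = 31
for jaccard in (0.08, 0.2, 0.5):
    d = -math.log(2 * jaccard / (1 + jaccard)) / ksize
    print(f"Jaccard {jaccard:.2f} -> mash distance {d:.4f}, ANI ~ {1 - d:.3f}")
```

By that estimate, the default Jaccard threshold of 0.08 already corresponds to roughly 94% ANI at k=31.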

You might also be interested in YACHT, which provides a statistical framework for all of this - https://academic.oup.com/bioinformatics/article/40/2/btae047/7588873.

Whenever we've done any kind of detailed validation on a small scale, we've found that shared k-mers strongly imply shared nucleotide content; this is the basis of lots of software, including ganon and sylph. How much is significant depends on what question you're asking.

Hope that helps :). Apologies for all the words!
