
info long reads mapping - bbtools parameters and inquiring for suggestions #89

Open
KristinaGagalova opened this issue Oct 23, 2024 · 2 comments


@KristinaGagalova

Hi,
I am circling back to BioBloom Tools, hoping to use it for contamination screening.
Could you please provide more info on whether it's possible to use it for long reads, and what the FPR and k-mer size would be in that case? Do you have a protocol or set of parameters to recommend?
Thanks
Kristina

@JustinChu
Collaborator

JustinChu commented Oct 28, 2024

I did some testing on long reads in 2020, with results recorded on a BCGSC Jira ticket. The results were pretty good then, and with modern Nanopore accuracy having improved since, I expect it to work even better now. For HiFi reads it should be easier still, since you could probably keep -k fairly large, or even benefit from increasing it to improve performance.

I think the main ingredient was that I used the binomial score, which is the default scoring math for miBFs. I was considering making it the default scoring method for its robustness, but didn't want to break any workflows expecting the older scoring methods.

So basically I think I tried something like:

  • -S binomial with -s from 40 to 100. Here -s becomes the minimum -10*log10(FPR) threshold for a match, so the scale works like MAPQ in aligners like BWA. For example, -s 10 means you accept a 10% chance that a match is a false positive, and -s 60 means you accept roughly 1 in 10^6 reads being false-positive hits.
  • -k 19 to compensate for the error rate. You might get away with something smaller, but with reads of your length you may not need to keep k that low. Starting with the default -k 25 might even be fine if you already have filters.
  • DUST filtering (--dust). I'm not an expert on DUST, but I integrated an off-the-shelf implementation of it to deal with low-complexity, often repetitive sequences. I'm not sure about its effect on performance, though; another strategy is repeat-masking your genome FASTA file before filter creation, which might achieve something similar without using DUST.

If you get good results, post them here and I'll try to add something to the README about using long reads. If you find it too slow with whatever parameters work best from a sensitivity and specificity perspective, I think there is room for some easy optimizations, based on a quick look at the code.

Now that I think about it, I'll at least try to put something in the README about binomial scoring when I have time.

@lcoombe
Member

lcoombe commented Oct 28, 2024

Thanks so much for all that info, @JustinChu!

I found the JIRA ticket I think you're referring to, and yes, it looks like the parameters you mentioned above match what you suggested back in 2020:

With the current master branch and the upcoming release (2.3.3) of BBT suggested ranges for options for long reads:

-k: 18 - 25 (k 25 may be useful because that is what the standard pipeline uses, though a smaller k may be more sensitive)
-D (dust, low complexity filter)
-S: "binomial" (New scoring method)
-s: 60 - 100 (60 is 1/10^6 FPR, 100 is 1/10^10 FPR)

But as you say, the k-mer sizes you suggested were based on the technology back then.
