[MRG] update gather to calculate fraction of match that was in original query #938
Codecov Report
@@ Coverage Diff @@
## master #938 +/- ##
==========================================
+ Coverage 91.52% 91.78% +0.26%
==========================================
Files 70 70
Lines 4965 4954 -11
==========================================
+ Hits 4544 4547 +3
+ Misses 421 407 -14
Co-Authored-By: Luiz Irber <luizirber@users.noreply.github.com>
Kudos, SonarCloud Quality Gate passed! 0 Bugs, no coverage information.
doc/classifying-signatures.md
Outdated
* unlike `sourmash search`, `sourmash gather` cannot easily be
  parallelized on a per-signature level because it is doing a greedy
  iterative search across all the databases at each step.
kind of... we could do each database in parallel, and the same logic for that also applies per DB. It's just really annoying to do it properly in parallel in Python =]
good point, it's mostly Python that's the problem :). removed in 0d79fa0
In benchmarking work for his thesis, @luizirber pointed out that we were missing the fraction of the found match that was in the original query. This adds that calculation into the gather CSV output.
The PR also adds specific test code for gather output, and improves the documentation as to exactly what gather is doing.
TODO:
Example output
Example output (reformatted) for a situation where we create a synthetic "metagenome" signature from a bunch of overlapping genomes: `f_match` slowly decreases as we remove things iteratively from the metagenome based on matches, while `f_match_orig` stays at 1.0, because the original query contains the entire match. (The `pretty_csv` alias is from here, and csvtk is here - conda installable!)
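To make the difference between the two columns concrete, here is a minimal sketch of the greedy iteration using plain Python sets of integers as stand-ins for hash sketches. This is not sourmash's implementation: the function name `gather_sketch`, the toy genomes, and the exact definition of each fraction (overlap divided by the size of the match) are assumptions made for illustration only.

```python
def gather_sketch(query, genomes):
    """Toy greedy gather over sets of hashes (hypothetical helper, not sourmash code)."""
    orig_query = set(query)      # never modified
    remaining = set(query)       # shrinks as matches are removed
    candidates = dict(genomes)
    results = []
    while remaining and candidates:
        # Pick the genome with the largest overlap with what is left of the query.
        name = max(candidates, key=lambda n: len(candidates[n] & remaining))
        match = candidates.pop(name)
        overlap = match & remaining
        if not overlap:
            break
        results.append({
            "name": name,
            # Fraction of the match found in the *remaining* query.
            "f_match": len(overlap) / len(match),
            # Fraction of the match found in the *original* query.
            "f_match_orig": len(match & orig_query) / len(match),
        })
        # Remove everything this match explains before the next round.
        remaining -= match
    return results


# Three overlapping toy "genomes"; the synthetic metagenome is their union.
genomes = {
    "A": set(range(0, 12)),
    "B": set(range(8, 21)),
    "C": set(range(16, 24)),
}
metagenome = set().union(*genomes.values())

rows = gather_sketch(metagenome, genomes)
for row in rows:
    print(f"{row['name']}  f_match={row['f_match']:.3f}  "
          f"f_match_orig={row['f_match_orig']:.3f}")
```

In this toy setup every genome is fully contained in the synthetic metagenome, so `f_match_orig` is 1.0 on every row, while `f_match` drops from round to round because earlier matches have already removed the shared hashes.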
`make test` Did it pass the tests?
`make coverage` Is the new code covered?
[...] without a major version increment. Changing file formats also requires a major version number increment.
[...] changes were made?