Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[MRG] update gather to calculate fraction of match that was in original query #938

Merged
merged 11 commits into from
Apr 15, 2020

Conversation

ctb
Copy link
Contributor

@ctb ctb commented Apr 7, 2020

In benchmarking work for his thesis, @luizirber pointed out that we were missing the fraction of the found match that was in the original query. This adds that calculation into the gather CSV output.

The PR also adds specific test code for gather output, and improves the documentation as to exactly what gather is doing.

TODO:

  • update documentation

Example output

Example output (reformatted) for a situation where we create a synthetic "metagenome" signature from a bunch of overlapping genomes -- f_match slowly decreases, as we remove things iteratively from the metagenome based on matches, while f_match_orig stays at 1.0, because the original query contains the entire match.

% sourmash gather tests/test-data/gather/combined.sig \
       tests/test-data/gather/GCF*.sig -o xyz.csv
...
% cat xyz.csv | csvtk cut -f f_unique_to_query,f_match_orig,f_match,f_orig_query | pretty_csv
f_unique_to_query      f_match_orig  f_match                f_orig_query
0.3321964529331514     1.0           1.0                    0.3321964529331514
0.13096862210095497    1.0           1.0                    0.13096862210095497
0.11527967257844475    1.0           0.898936170212766      0.12824010914051842
0.10709413369713507    1.0           1.0                    0.10709413369713507
0.10368349249658936    1.0           0.3134020618556701     0.33083219645293316
0.06275579809004093    1.0           0.4842105263157895     0.1296043656207367
0.05184174624829468    1.0           0.16101694915254236    0.3219645293315143
0.04024556616643929    1.0           0.1257995735607676     0.3199181446111869
0.0286493860845839     1.0           0.09190371991247265    0.3117326057298772
0.021145975443383355   1.0           0.07259953161592506    0.291268758526603
0.0047748976807639835  1.0           0.014861995753715499   0.32128240109140516
0.001364256480218281   1.0           0.0044943820224719105  0.3035470668485675

(pretty_csv alias is

function pretty_csv {
    perl -pe 's/((?<=,)|(?<=^)),/ ,/g;' "$@" | column -t -s, | less  -F -S -X -K
}

from here, and csvtk is here - conda installable!)

  • Is it mergeable?
  • make test Did it pass the tests?
  • make coverage Is the new code covered?
  • Did it change the command-line interface? Only additions are allowed
    without a major version increment. Changing file formats also requires a
    major version number increment.
  • Was a spellchecker run on the source code and documentation after
    changes were made?

@codecov
Copy link

codecov bot commented Apr 7, 2020

Codecov Report

Merging #938 into master will increase coverage by 0.26%.
The diff coverage is 100.00%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #938      +/-   ##
==========================================
+ Coverage   91.52%   91.78%   +0.26%     
==========================================
  Files          70       70              
  Lines        4965     4954      -11     
==========================================
+ Hits         4544     4547       +3     
+ Misses        421      407      -14     
Impacted Files Coverage Δ
sourmash/commands.py 84.40% <ø> (-0.24%) ⬇️
sourmash/search.py 93.63% <100.00%> (+0.43%) ⬆️
sourmash/sbt_storage.py 85.39% <0.00%> (-0.98%) ⬇️
sourmash/compare.py 100.00% <0.00%> (ø)
sourmash/cli/gather.py 100.00% <0.00%> (ø)
sourmash/cli/compare.py 100.00% <0.00%> (ø)
sourmash/cli/compute.py 100.00% <0.00%> (ø)
sourmash/sbt.py 86.79% <0.00%> (+0.29%) ⬆️
sourmash/lca/command_index.py 92.20% <0.00%> (+0.43%) ⬆️
... and 2 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update d658969...51d1b6e. Read the comment docs.

@ctb
Copy link
Contributor Author

ctb commented Apr 7, 2020

Worst Graphic Ever, but Better Than No Graphic?
gather

sourmash/commands.py Outdated Show resolved Hide resolved
sourmash/search.py Outdated Show resolved Hide resolved
@sonarcloud
Copy link

sonarcloud bot commented Apr 13, 2020

Kudos, SonarCloud Quality Gate passed!

Bug A 0 Bugs
Vulnerability A 0 Vulnerabilities (and Security Hotspot 0 Security Hotspots to review)
Code Smell A 1 Code Smell

No Coverage information No Coverage information
0.0% 0.0% Duplication

@ctb ctb changed the title [WIP] update gather to calculate fraction of match that was in original query [MRG] update gather to calculate fraction of match that was in original query Apr 13, 2020
Comment on lines 158 to 160
* unlike `sourmash search`, `sourmash gather` cannot easily be
parallelized on a per-signature level because it is doing a greedy
iterative search across all the databases at each step.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

kind of... we could do each database in parallel, and the same logic for that also applies per DB. It's just really annoying to do it properly in parallel in Python =]

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good point, it's mostly Python that's the problem :). removed in 0d79fa0