[MRG] update gather to calculate fraction of match that was in original query #938

ctb · 2020-04-07T15:58:51Z

In benchmarking work for his thesis, @luizirber pointed out that we were missing the fraction of the found match that was in the original query. This adds that calculation into the gather CSV output.

The PR also adds specific test code for gather output, and improves the documentation as to exactly what gather is doing.

TODO:

update documentation

Example output

Example output (reformatted) for a situation where we create a synthetic "metagenome" signature from a bunch of overlapping genomes -- f_match slowly decreases, as we remove things iteratively from the metagenome based on matches, while f_match_orig stays at 1.0, because the original query contains the entire match.

% sourmash gather tests/test-data/gather/combined.sig \
       tests/test-data/gather/GCF*.sig -o xyz.csv
...
% cat xyz.csv | csvtk cut -f f_unique_to_query,f_match_orig,f_match,f_orig_query | pretty_csv
f_unique_to_query      f_match_orig  f_match                f_orig_query
0.3321964529331514     1.0           1.0                    0.3321964529331514
0.13096862210095497    1.0           1.0                    0.13096862210095497
0.11527967257844475    1.0           0.898936170212766      0.12824010914051842
0.10709413369713507    1.0           1.0                    0.10709413369713507
0.10368349249658936    1.0           0.3134020618556701     0.33083219645293316
0.06275579809004093    1.0           0.4842105263157895     0.1296043656207367
0.05184174624829468    1.0           0.16101694915254236    0.3219645293315143
0.04024556616643929    1.0           0.1257995735607676     0.3199181446111869
0.0286493860845839     1.0           0.09190371991247265    0.3117326057298772
0.021145975443383355   1.0           0.07259953161592506    0.291268758526603
0.0047748976807639835  1.0           0.014861995753715499   0.32128240109140516
0.001364256480218281   1.0           0.0044943820224719105  0.3035470668485675

(pretty_csv alias is

function pretty_csv {
    perl -pe 's/((?<=,)|(?<=^)),/ ,/g;' "$@" | column -t -s, | less  -F -S -X -K
}

from here, and csvtk is here - conda installable!)

Is it mergeable?
make test Did it pass the tests?
make coverage Is the new code covered?
Did it change the command-line interface? Only additions are allowed
without a major version increment. Changing file formats also requires a
major version number increment.
Was a spellchecker run on the source code and documentation after
changes were made?

codecov · 2020-04-07T16:06:03Z

Codecov Report

Merging #938 into master will increase coverage by 0.26%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##           master     #938      +/-   ##
==========================================
+ Coverage   91.52%   91.78%   +0.26%     
==========================================
  Files          70       70              
  Lines        4965     4954      -11     
==========================================
+ Hits         4544     4547       +3     
+ Misses        421      407      -14

Impacted Files	Coverage Δ
sourmash/commands.py	`84.40% <ø> (-0.24%)`	⬇️
sourmash/search.py	`93.63% <100.00%> (+0.43%)`	⬆️
sourmash/sbt_storage.py	`85.39% <0.00%> (-0.98%)`	⬇️
sourmash/compare.py	`100.00% <0.00%> (ø)`
sourmash/cli/gather.py	`100.00% <0.00%> (ø)`
sourmash/cli/compare.py	`100.00% <0.00%> (ø)`
sourmash/cli/compute.py	`100.00% <0.00%> (ø)`
sourmash/sbt.py	`86.79% <0.00%> (+0.29%)`	⬆️
sourmash/lca/command_index.py	`92.20% <0.00%> (+0.43%)`	⬆️
... and 2 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update d658969...51d1b6e. Read the comment docs.

ctb · 2020-04-07T16:18:02Z

Worst Graphic Ever, but Better Than No Graphic?

sourmash/commands.py

sourmash/search.py

Co-Authored-By: Luiz Irber <luizirber@users.noreply.github.com>

sonarcloud · 2020-04-13T14:07:06Z

Kudos, SonarCloud Quality Gate passed!

0 Bugs
0 Vulnerabilities (and 0 Security Hotspots to review)
1 Code Smell

No Coverage information
0.0% Duplication

luizirber · 2020-04-13T17:51:06Z

doc/classifying-signatures.md

+* unlike `sourmash search`, `sourmash gather` cannot easily be
+  parallelized on a per-signature level because it is doing a greedy
+  iterative search across all the databases at each step.


kind of... we could do each database in parallel, and the same logic for that also applies per DB. It's just really annoying to do it properly in parallel in Python =]

good point, it's mostly Python that's the problem :). removed in 0d79fa0

doc/classifying-signatures.md

calculate fraction of match that was in original query, for gather

ecf1e56

luizirber reviewed Apr 7, 2020

View reviewed changes

sourmash/commands.py Outdated Show resolved Hide resolved

luizirber reviewed Apr 7, 2020

View reviewed changes

sourmash/search.py Outdated Show resolved Hide resolved

ctb and others added 6 commits April 10, 2020 07:06

Update sourmash/search.py

26db2af

Co-Authored-By: Luiz Irber <luizirber@users.noreply.github.com>

restore column order; fix downsampling in contained_by

b02177c

add explicit tests for the various f_ columns in gather csv output

5a030ef

fix test for py27

0372a6f

fix comments, simplify code

cb3ffe6

update gather docs

2e021b5

ctb changed the title ~~[WIP] update gather to calculate fraction of match that was in original query~~ [MRG] update gather to calculate fraction of match that was in original query Apr 13, 2020

luizirber reviewed Apr 13, 2020

View reviewed changes

doc/classifying-signatures.md Outdated Show resolved Hide resolved

update docs to be more correct and clearer

0d79fa0

ctb mentioned this pull request Apr 14, 2020

Gather performance improvements discussion #838

Closed

luizirber approved these changes Apr 15, 2020

View reviewed changes

ctb added 3 commits April 15, 2020 13:54

Merge branch 'master' into update/gather_outputs

b4389ae

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[MRG] update gather to calculate fraction of match that was in original query #938

[MRG] update gather to calculate fraction of match that was in original query #938

ctb commented Apr 7, 2020 •

edited

Loading

codecov bot commented Apr 7, 2020 •

edited

Loading

ctb commented Apr 7, 2020

sonarcloud bot commented Apr 13, 2020

luizirber Apr 13, 2020

ctb Apr 14, 2020

[MRG] update gather to calculate fraction of match that was in original query #938

[MRG] update gather to calculate fraction of match that was in original query #938

Conversation

ctb commented Apr 7, 2020 • edited Loading

Example output

codecov bot commented Apr 7, 2020 • edited Loading

Codecov Report

ctb commented Apr 7, 2020

sonarcloud bot commented Apr 13, 2020

luizirber Apr 13, 2020

Choose a reason for hiding this comment

ctb Apr 14, 2020

Choose a reason for hiding this comment

ctb commented Apr 7, 2020 •

edited

Loading

codecov bot commented Apr 7, 2020 •

edited

Loading