Possible inconsistency in sbt_gather output #275

claczny · 2017-06-07T09:34:17Z

Hi,

first of all, I'd like to thank you all for this nice tool.

I am playing around with sourmash from http://2017-ucsc-metagenomics.readthedocs.io/en/latest/sourmash.html, i.e., I "pip installed" it as described there, as I wanted to have the sbt_gather function.
For now, I just wanted to get things running, so I used the default parameters (-k 31 --scaled 2000 --track-abundance) for building the reference signatures, followed by sbt_gather of MYQUERY (representing a de novo assembled, bacterial isolate genome).
While doing so, I observed an unexpected behavior and was wondering what is going on there.

Here comes the output to stdout:

# running sourmash subcommand: sbt_gather
loaded query: MYQUERY (k=31, DNA)

overlap    p_query p_genome
-------    ------- --------
5.0 Mbp   98.7%     96.2%      SOMEREF
4.9 Mbp   96.2%      1.1%      SOMEOTHERREF
3.3 Mbp   65.9%      0.1%      YETANOTHERREF
found less than 4.0 kbp in common. => exiting

found 3 matches total;
the recovered matches hit 100.0% of the query

Followed by the output (-o MYQUERY.hits):

(sourmash_env)  $ less MYQUERY.hits
intersect_bp,f_orig_query,f_found_genome,name
4980000.0,0.9617612977983777,0.9873116574147502,SOMEREF.gz
4854000.0,0.01087378640776699,0.9623314829500397,SOMEOTHERREF.gz
3324000.0,0.0011966493817311527,0.6590007930214116,YETANOTHERREF.gz

Concretely, I am confused by the column order, as it seems switched between stdout and -o: (98.7% 96.2% vs. 0.9617[...],0.987[..]), albeit the column headers' order seems to be consistent.

TIA for your support.

Best,

Cedric

P.S. I am unsure on how to interpret the recovered matches hit 100.0% of the query when having multiple, here 3, (partially) matching references. Might this be an indication of some genetic exchange, thinking, MYQUERY "includes" the majority from SOMEREF, but also some "parts" from SOMEOTHERREF?

[EDIT] Along the lines of the P.S., how is the line 4.9 Mbp 96.2% 1.1% SOMEOTHERREF to be understood? How can the overlap be so large, yet SOMEOTHERREF is only found with a fraction of 1.1%? For completeness, my index only includes signatures from bacterial genomes.


ctb · 2017-06-07T10:20:15Z

First things first - could you update to the latest master branch? https://sourmash.readthedocs.io/en/latest/tutorials.html The columns and output are much more stable now :). On your P.S. and overlap, you are correct. I'm trying to think about what to do - see #266 (comment) for a more detailed explanation. thanks for asking!

claczny · 2017-06-07T11:12:27Z

Did the update and can confirm that the output is now more stable :) Thx!

stdout:

loaded query: MYQUERY.sig (k=31, DNA)
loaded SBT MYINDEX.sbt.json

overlap     p_query p_match
---------   ------- --------
5.0 Mbp      98.7%   96.2%      SOMEREF.gz
4.9 Mbp      96.2%    1.1%      SOMEOTHERREF.gz
3.3 Mbp      65.9%    0.1%      YETANOTHERREF.gz
found less than 4.0 kbp in common. => exiting

found 3 matches total;
the recovered matches hit 100.0% of the query

-o :

(sourmash_env)  $ cat MYQUERY.gather.hits
intersect_bp,f_orig_query,f_match,f_unique_to_query,name,filename,md5
4980000.0,0.9873116574147502,0.9617612977983777,0.9873116574147502,SOMEREF.gz,MYINDEX.sbt.json,SOMEMD5
4854000.0,0.9623314829500397,0.01087378640776699,0.011102299762093577,SOMEOTHERREF.gz,MYINDEX,SOMEOTHERMD5
3324000.0,0.6590007930214116,0.0011966493817311527,0.0011895321173671688,YETANOTHERREF.gz,MYINDEX.sbt.json,YETANOTHERMD5

Thanks to the explanation in #266, the f_match_unique makes perfect sense. I remain unsure about the values in f_match, though. Specifically, does 0.011102299762093577 in f_match of the second hit (SOMEOTHERREF) refer to the part that has not been covered by the first hit (SOMEREF)? More generally, I'm thinking along the line of "fraction of match covered by query not covered by any preceding reference". Put differently, the 0.011102299762093577 of MYQUERY that is unique to the second match (SOMEOTHERREF) covers 0.01087378640776699 of SOMEOTHERREF, while SOMEOTHERREF covers 0.9623314829500397 of MYQUERY?

ctb · 2017-06-07T11:22:23Z

Man, we really don't make the output easy to interpret, do we... :)

see code for gather which (after perennial confusion on my part) I tried to simplify and sanify so that it's readable!

here, 'intersect_mins' is with respect to what has not yet been matched, as you say. The column f_match should sum to 1.0 (p_match to 100%) if everything is known.
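To make the bookkeeping concrete, here is a minimal sketch of a greedy decomposition on plain Python sets. It is not the sourmash implementation: the function name `gather_sketch`, the toy hash sets, and the column definitions are assumptions chosen to be consistent with the numbers posted in this thread (`f_orig_query` against the original query, `f_match` and `f_unique_to_query` against only the part of the query not yet matched).

```python
def gather_sketch(query, references, threshold=1):
    """Greedily split `query` (a set of hashes) across `references`.

    Hypothetical helper for illustration only; column definitions are
    assumptions inferred from the output shown in this thread.
    """
    orig_query = frozenset(query)
    remaining = set(query)
    results = []
    while remaining and references:
        # Best match = reference sharing the most hashes with the
        # part of the query that has not been matched yet.
        name, match = max(references.items(),
                          key=lambda item: len(remaining & item[1]))
        new_hashes = remaining & match
        if len(new_hashes) < threshold:
            break
        results.append({
            "name": name,
            # overlap with the original query, ignoring earlier matches
            "f_orig_query": len(orig_query & match) / len(orig_query),
            # fraction of the match covered by the still-unmatched query
            "f_match": len(new_hashes) / len(match),
            # fraction of the original query explained only by this match
            "f_unique_to_query": len(new_hashes) / len(orig_query),
        })
        # Subtract what was just explained before looking for the next match.
        remaining -= new_hashes
    return results


# Toy data: A is a close relative; B is a much larger genome that shares
# most of the query but adds only 5 hashes that A did not already explain.
query = set(range(100))
references = {
    "A": set(range(0, 95)) | set(range(1000, 1005)),
    "B": set(range(10, 100)) | set(range(2000, 2910)),
}
results = gather_sketch(query, references)
for row in results:
    print(row)
```

In this toy case B overlaps 90% of the original query but receives a small `f_match`, because only the hashes left over after A are credited to it. That is the same pattern as the SOMEOTHERREF line above.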

I'm thinking about adding a --details flag so that each match can be explained with more info... hmm. Anyway, thoughts on what you'd like to see for your use case(s) would be VERY welcome.

ctb · 2017-06-07T11:41:46Z

Note, we are working on taxonomic summarization of the output too, so that we could e.g. group by species rather than splitting strains, but that functionality will be somewhat distinct from gather b/c it depends on NCBI metadata that not all databases will have. gather works equally well on private databases without lineage info. See #195 for that.

claczny · 2017-06-07T11:44:03Z

Maybe it's just me being "slow" today ;)
It's WIP after all, so totally understand that and greatly appreciate your quick support!

I found https://github.com/dib-lab/sourmash/blob/f76938edf1125c7aee61b9039dcf30add5b215f8/sourmash_lib/commands.py#L991 to be interesting in this respect, i.e., already found hashes are subtracted iteratively.

I'll keep thinking about features that might come in handy and let you know.

For now, there is another point that confused me. When I search instead of gather, it is much slower and the similarity values do not match those in gather:

310 matches; showing first 3:
similarity   match
----------   -----
 77.8%       SOMEREF.gz
 74.4%       SOMEOTHERREF.gz
 73.8%       ANEWREF.gz

In particular, ANEWREF was missing from the top hits in gather.
Could you please shed some light here too, especially, why are the similarity values "low"? :)

[EDIT] Not sure whether this should be a separate issue or if you'd close the current issue and we simply continue on the similarity issue herein.

ctb · 2017-06-07T11:55:44Z

On Wed, Jun 07, 2017 at 04:44:06AM -0700, Cedric Laczny wrote:

> Maybe it's just me being "slow" today ;)

hah unlikely!

> It's WIP after all, so totally understand that and greatly appreciate your quick support!

welcome & thanks for the engagement!

> I found https://github.com/dib-lab/sourmash/blob/f76938edf1125c7aee61b9039dcf30add5b215f8/sourmash_lib/commands.py#L991 to be interesting in this respect, i.e., already found hashes are *subtracted* iteratively.

yep. The idea is to greedily partition a metagenome into its constituent "best matches".

> For now, there is another point that confused me. When I `search` instead of `gather`, it is much slower and the similarity values do not match those in `gather`:
>
> ```
> 310 matches; showing first 3:
> similarity   match
> ----------   -----
>  77.8%       SOMEREF.gz
>  74.4%       SOMEOTHERREF.gz
>  73.8%       ANEWREF.gz
> ```
>
> In particular, ANEWREF was *missing* from the top hits in `gather`.
> Could you please shed some light here too, especially, why are the similarity values "low"? :)

gather matches are meant to be disjoint; 'search' is doing a straightforward Jaccard search and reporting anything that matches well; it should be identical to what the mash software does. If you use --best-only you will recover the speed of gather, but you will lose most of the 310 matches. (Also note that 'search' on SBTs is not necessarily 100% accurate yet; see #224 for discussion and proposed fix.)

If you add --containment you should get similarity values that match those in gather for the top hit. Briefly, when matching, 'gather' does not take into account all of the k-mers that are in SOMEREF but are not in the query, while 'search' does. The latter behavior is true Jaccard similarity; the former is what we call 'containment', and it is what gather uses.

`--containment` is theoretically guaranteed to return the best match in SBTs (barring plain ol' bugs), but it will not be able to distinguish between e.g. a plasmid that is part of a big genome and the plasmid itself, if both the full genome and the plasmid are in the database.
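The difference between the two measures can be shown on toy sets of hashes. This is a rough sketch, not sourmash code: `jaccard` and `containment` are hypothetical helpers written for this example, and the "plasmid" and "genome" sets are made up.

```python
def jaccard(a, b):
    """Shared hashes divided by all hashes seen in either set."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def containment(query, match):
    """Shared hashes divided by the size of the query only."""
    return len(query & match) / len(query) if query else 0.0


# Made-up data: the genome carries the plasmid plus many other hashes.
plasmid = set(range(0, 50))
genome = set(range(0, 1000))
query = set(range(0, 50))  # the query is exactly the plasmid

print("jaccard    :", jaccard(query, plasmid), jaccard(query, genome))
print("containment:", containment(query, plasmid), containment(query, genome))
```

Jaccard penalizes the genome for the 950 hashes that are absent from the query, so it separates the two references. Containment ignores those hashes, so both references score the same, which is the plasmid caveat mentioned above.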

claczny · 2017-06-07T14:02:53Z

That further clarifies many things, but not all unfortunately :)
I had a look at the issue you mentioned, but, TBH, did not readily see what the "inaccuracy" could be and how the proposed fix looks.

I gave search another try and can confirm that --best-only speeds up things greatly.
However, the results look unexpected to me.
When I do a search, this is what I get on stdout:

310 matches; showing first 3:
similarity   match
----------   -----
 77.8%       SOMEREF.gz
 74.4%       SOMEOTHERREF.gz
 73.8%       REF.gz

When I do a search --best-only, the stdout looks as follows:

(truncated search because of --best-only; only trust top result
6 matches; showing first 3:
similarity   match
----------   -----
 77.8%       SOMEREF.gz
 74.4%       SOMEOTHERREF.gz
 63.0%       AGAINANEWREF.gz

Which is not what I would expect, naively. Specifically, I would expect the top hits to be the same :)

ctb · 2017-06-07T15:09:14Z

On Wed, Jun 07, 2017 at 07:02:54AM -0700, Cedric Laczny wrote:

> That further clarifies many things, but not all unfortunately :)
> I had a look at the issue you mentioned, but, TBH, did not readily see what the "inaccuracy" could be and how the proposed fix looks.

Right, it's pretty technical - let's just say that we can only guarantee best containment with our current approach, and we need to adjust to guarantee best similarity!

> I gave `search` another try and can confirm that `--best-only` speeds up things greatly.
> However, the results look unexpected to me.
> When I do a `search`, this is what I get on stdout:
>
> ```
> 310 matches; showing first 3:
> similarity   match
> ----------   -----
>  77.8%       SOMEREF.gz
>  74.4%       SOMEOTHERREF.gz
>  73.8%       REF.gz
> ```
>
> When I do a `search --best-only`, the stdout looks as follows:
>
> ```
> (truncated search because of --best-only; only trust top result
> 6 matches; showing first 3:
> similarity   match
> ----------   -----
>  77.8%       SOMEREF.gz
>  74.4%       SOMEOTHERREF.gz
>  63.0%       AGAINANEWREF.gz
> ```
>
> Which is not what I would expect, naively. Specifically, I would expect the top hits to be the same :)

only the very top result is guaranteed (see message starting with 'truncated search...' for my attempt to make this clear). I think in the future I will simply limit output to 1 result...! thanks again!

claczny · 2017-06-07T15:13:30Z

> only the very top result is guaranteed (see message starting with 'truncated search...' for my attempt to make this clear).

doooooh sorry to have missed that.
That's actually pretty much what I need right now: a quick way to find the most similar reference. The SBT will help greatly here as it will avoid going over all references, as it would have to do with mash.

Thank you very much @ctb!

ctb · 2017-06-07T15:23:15Z

On Wed, Jun 07, 2017 at 08:13:30AM -0700, Cedric Laczny wrote:

> > only the very top result is guaranteed (see message starting with 'truncated search...' for my attempt to make this clear).
>
> *doooooh* sorry to have missed that.
> That's actually pretty much what I need right now: a quick way to find the most similar reference. The SBT will help greatly here as it will avoid going over all references, as it would have to do with `mash`.
>
> Thank you very much @ctb!

welcome! and glad to hear it! suggest that you use both 'search' and 'search --containment' until we get that bug fixed, just to be 100% certain. (The former should give you correct results 99.9% of the time, the latter should give you correct results 100% of the time.)

claczny closed this as completed Jun 7, 2017

This was referenced Jun 9, 2017

gather should report all matches with equal containment #278

Closed

limit 'search --best-only' output to a single result #281

Closed
