Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Possible inconsistency in sbt_gather output #275

Closed
claczny opened this issue Jun 7, 2017 · 10 comments
Closed

Possible inconsistency in sbt_gather output #275

claczny opened this issue Jun 7, 2017 · 10 comments

Comments

@claczny
Copy link

claczny commented Jun 7, 2017

Hi,

first of all, I'd like to thank you all for this nice tool.

I am playing around with sourmash from http://2017-ucsc-metagenomics.readthedocs.io/en/latest/sourmash.html, i.e., I "pip installed" it as described there as I wanted to have the sbt_gatherfunction.
For now, I just wanted to get things running, so I used the default parameters (-k 31 --scaled 2000 --track-abundance) for building the reference signatures, followed by sbt_gather of MYQUERY (representing a de novo assembled, bacterial isolate genome).
While doing so, I observed an unexpected behavior and was wondering what is going on there.

Here comes the output to stdout:

# running sourmash subcommand: sbt_gather
loaded query: MYQUERY (k=31, DNA)

overlap    p_query p_genome
-------    ------- --------
5.0 Mbp   98.7%     96.2%      SOMEREF
4.9 Mbp   96.2%      1.1%      SOMEOTHERREF
3.3 Mbp   65.9%      0.1%      YETANOTHERREF
found less than 4.0 kbp in common. => exiting

found 3 matches total;
the recovered matches hit 100.0% of the query

Followed by the output (-o MYQUERY.hits):

(sourmash_env)  $ less MYQUERY.hits
intersect_bp,f_orig_query,f_found_genome,name
4980000.0,0.9617612977983777,0.9873116574147502,SOMEREF.gz
4854000.0,0.01087378640776699,0.9623314829500397,SOMEOTHERREF.gz
3324000.0,0.0011966493817311527,0.6590007930214116,YETANOTHERREF.gz

Concretely, I am confused by the column order, as it seems switched between stdout and -o: (98.7% 96.2% vs. 0.9617[...],0.987[..]), albeit the column headers' order seems to be consistent.

TIA for your support.

Best,

Cedric

P.S. I am unsure on how to interpret the recovered matches hit 100.0% of the query when having multiple, here 3, (partially) matching references. Might this be an indication of some genetic exchange, thinking, MYQUERY "includes" the majority from SOMEREF, but also some "parts" from SOMEOTHERREF?

[EDIT] Along the lines of the P.S., how is the line 4.9 Mbp 96.2% 1.1% SOMEOTHERREF to be understood`? How can the overlap be so large, yet SOMEOTHERREF is only found with a fraction of 1.1%? For completeness, my index only includes signatures from bacterial genomes.

@ctb
Copy link
Contributor

ctb commented Jun 7, 2017 via email

@claczny
Copy link
Author

claczny commented Jun 7, 2017

Did the update and can confirm that the output is now more stable :) Thx!

stdout:

loaded query: MYQUERY.sig (k=31, DNA)
loaded SBT MYINDEX.sbt.json

overlap     p_query p_match
---------   ------- --------
5.0 Mbp      98.7%   96.2%      SOMEREF.gz
4.9 Mbp      96.2%    1.1%      SOMEOTHERREF.gz
3.3 Mbp      65.9%    0.1%      YETANOTHERREF.gz
found less than 4.0 kbp in common. => exiting

found 3 matches total;
the recovered matches hit 100.0% of the query

-o :

(sourmash_env)  $ cat MYQUERY.gather.hits
intersect_bp,f_orig_query,f_match,f_unique_to_query,name,filename,md5
4980000.0,0.9873116574147502,0.9617612977983777,0.9873116574147502,SOMEREF.gz,MYINDEX.sbt.json,SOMEMD5
4854000.0,0.9623314829500397,0.01087378640776699,0.011102299762093577,SOMEOTHERREF.gz,MYINDEX,SOMEOTHERMD5
3324000.0,0.6590007930214116,0.0011966493817311527,0.0011895321173671688,YETANOTHERREF.gz,MYINDEX.sbt.json,YETANOTHERMD5

Thanks to the explanation in #266, the f_match_unique makes perfect sense. I remain unsure about the values in f_match, though. Specifically, does 0.011102299762093577 in f_match of the second hit (SOMEOTHERREF) refer to the part that has not been covered by the first hit (SOMEREF)? More generally, I'm thinking along the line of "fraction of match covered by query not covered by any preceding reference". Put differently, the 0.011102299762093577 of MYQUERY that is unique to the second match (SOMEOTHERREF) covers 0.01087378640776699 of SOMEOTHERREF, while SOMEOTHERREF covers 0.9623314829500397 of MYQUERY?

@ctb
Copy link
Contributor

ctb commented Jun 7, 2017

Man, we really don't make the output easy to interpret, do we... :)

see code for gather which (after perennial confusion on my part) I tried to simplify and sanify so that it's readable!

here, 'intersect_mins' is with respect to what has not yet been matched, as you say. The column f_match should sum to 1.0 (p_match to 100%) if everything is known.

I'm thinking about adding a --details flag so that each match can be explained with more info... hmm. Anyway, thoughts on what you'd like to see for your use case(s) would be VERY welcome.

@ctb
Copy link
Contributor

ctb commented Jun 7, 2017

Note, we are working on taxonomic summarization of the output too, so that we could e.g. group by species rather than splitting strains, but that functionality will be somewhat distinct from gather b/c it depends on NCBI metadata that not all databases will have. gather works equally well on private databases without lineage info. See #195 for that.

@claczny
Copy link
Author

claczny commented Jun 7, 2017

Maybe it's just me being "slow" today ;)
It's WIP after all, so totally understand that and greatly appreciate your quick support!

I found https://github.com/dib-lab/sourmash/blob/f76938edf1125c7aee61b9039dcf30add5b215f8/sourmash_lib/commands.py#L991 to be interesting in this respect, i.e., already found hashes are subtracted iteratively.

I'll keep thinking about features that might come in handy and let you know.

For now, there is another point that confused me. When I search instead of gather, it is much slower and the similarity values do not match those in gather:

310 matches; showing first 3:
similarity   match
----------   -----
 77.8%       SOMEREF.gz
 74.4%       SOMEOTHERREF.gz
 73.8%       ANEWREF.gz

In particular, ANEWREF was missing from the top hits in gather.
Could you please shed some light here too, especially, why are the similarity values "low"? :)

[EDIT] Not sure whether this should be a separate issue or if you'd close the current issue and we simply continue on the similarity issue herein.

@ctb
Copy link
Contributor

ctb commented Jun 7, 2017 via email

@claczny
Copy link
Author

claczny commented Jun 7, 2017

That further clarifies many things, but not all unfortunately :)
I had a look at the issue you mentioned, but, TBH, did not readily see what the "inaccuracy" could be and how the proposed fix looks.

I gave search another try and can confirm that --best-only speeds up things greatly.
However, the results look unexpected to me.
When I do a search, this is what I get on stdout:

310 matches; showing first 3:
similarity   match
----------   -----
 77.8%      SOMEREF.gz
 74.4%       SOMEOTHERREF.gz
 73.8%       REF.gz

When I do a search --best-only, the stdout looks as follows:

(truncated search because of --best-only; only trust top result
6 matches; showing first 3:
similarity   match
----------   -----
 77.8%       SOMEREF.gz
 74.4%       SOMEOTHERREF.gz
 63.0%       AGAINANEWREF.gz

Which is not what I would expect, naively. Specifically, I would expect the top hits to be the same :)

@ctb
Copy link
Contributor

ctb commented Jun 7, 2017 via email

@claczny
Copy link
Author

claczny commented Jun 7, 2017

only the very top result is guaranteed (see message starting with 'truncated search...' for my attempt to make this clear).

doooooh sorry to have missed that.
That's actually pretty much what I need right now: a quick way to find the most similar reference. The SBT will help greatly here as it will avoid going over all references, as it would have to do withmash.

Thank you very much @ctb!

@claczny claczny closed this as completed Jun 7, 2017
@ctb
Copy link
Contributor

ctb commented Jun 7, 2017 via email

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants