gather does not break ties in any consistent manner #1366

Closed
klamens opened this issue Mar 4, 2021 · 9 comments

@klamens

klamens commented Mar 4, 2021

Hi,

I'm using sourmash (3.5) to investigate plasmids in assemblies.
Currently, I think the 'gather' function is best suited to this goal: I want to find multiple plasmids if they are contained, but k-mers that are shared among multiple plasmids should be assigned to only one plasmid.

However, I'm having a problem. I have two plasmids (NC_011078 (larger), NZ_WYDM02000026 (smaller)) and for scaling x100, the small one is fully contained in the larger one.

sourmash compare --containment signatures_catted.fasta.sig

0-NC_011078.fasta.gz    [1. 1.]
1-NZ_WYDM02000026    [0.053 1.   ]
min similarity in matrix: 0.053

However, when I gather the smaller plasmid against the file containing both the smaller and the larger plasmid, it returns the larger plasmid.

sourmash gather -k 21 --threshold-bp 0 NZ_WYDM02000026.fasta.sig signatures_catted.fasta.sig

overlap     p_query p_match

2.5 kbp      100.0%    5.3%    NC_011078.fasta.gz

I guess that in the gather function a tie (identical shared hashes) is somehow resolved in favour of the larger plasmid (either because a random one or the alphabetically first one is selected as the winner). In case of a tie, I would like the plasmid with the fewest k-mers (and hence the largest containment score) to win the hashes. What do you think of this, and would it be possible as an (optional) feature?
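
To make the proposed rule concrete, here is a minimal sketch using plain Python sets as stand-ins for the hash sketches. It is not sourmash code: `pick_best_match` and the plasmid names are hypothetical, and the sizes are toy data. The candidate with the largest overlap wins; among equal overlaps, the candidate with the fewest hashes wins.

```python
def pick_best_match(query, candidates):
    """Return the name of the best candidate for `query`, or None.

    `query` is a set of hashes; `candidates` maps name -> set of hashes.
    Hypothetical helper for illustration only, not part of sourmash.
    """
    if not candidates:
        return None

    def sort_key(name):
        hashes = candidates[name]
        overlap = len(query & hashes)
        # Largest overlap first; on a tie, fewest hashes (highest
        # containment of the match); the name keeps the result deterministic.
        return (-overlap, len(hashes), name)

    return min(candidates, key=sort_key)


# Toy data: both candidates share all 25 query hashes.
query = set(range(25))
candidates = {
    "large_plasmid": set(range(470)),
    "small_plasmid": set(range(25)),
}
print(pick_best_match(query, candidates))  # small_plasmid
```

With only the overlap as the criterion, the two candidates above are indistinguishable, which is the situation described in this report.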

@ctb
Contributor

ctb commented Mar 4, 2021

yes, I think what is happening is pretty much what you say: the containment of the two matches is equal, and sourmash does not "tie break" equal matches, it just picks the first one it finds!

@bluegenes off the top of my head it seems like the max containment approach #1343 would not solve this, right?

#278 is probably the right approach; also see #707 for more motivation.

@ctb
Contributor

ctb commented Mar 4, 2021

the implementation challenge until recently has been that doing this for really large collections of signatures is hard. however, we have some forthcoming solutions that help with this. it's probably not a next-week kind of feature tho, sorry :(

@klamens
Author

klamens commented Mar 4, 2021

Thanks for the swift answer!

#1343 would not solve it because I only want to know the containment of plasmids in my assembly (not the reverse).

#278 would not help at all because my smaller plasmid would already have a much larger containment score (if it were to win the hashes), but the winner of the hashes is not decided by containment score; it is decided by the number of shared hashes.

I will probably use the following (suboptimal) workaround:
- First, do a 'sourmash search' to select only those plasmids with sufficient containment (this should help me get rid of the large plasmids)
- Then, do a 'sourmash gather' on the plasmids that were sufficiently contained in the previous step
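
The two steps above can be sketched with plain Python sets standing in for the hash sketches. This is only an illustration of the logic, not the sourmash CLI or API: `containment`, `filter_then_gather`, the plasmid names, and the 0.8 threshold are all assumptions.

```python
def containment(a, b):
    """Fraction of the hashes in `a` that are also present in `b`."""
    return len(a & b) / len(a) if a else 0.0


def filter_then_gather(assembly, plasmids, min_containment=0.8):
    """Hypothetical two-step workflow on sets of hashes.

    Step 1: keep plasmids that are sufficiently contained in the assembly.
    Step 2: greedily assign the assembly's hashes to the survivors.
    """
    survivors = {
        name: hashes
        for name, hashes in plasmids.items()
        if containment(hashes, assembly) >= min_containment
    }

    remaining = set(assembly)
    results = []
    while survivors:
        # Pick the survivor covering the most unassigned hashes;
        # iterating over sorted names makes ties deterministic.
        name = max(sorted(survivors), key=lambda n: len(remaining & survivors[n]))
        overlap = remaining & survivors.pop(name)
        if not overlap:
            break
        results.append((name, len(overlap)))
        remaining -= overlap
    return results


# Toy data: the large plasmid shares 25 hashes with the assembly but is
# mostly absent from it, so step 1 removes it before the greedy step.
assembly = set(range(100))
plasmids = {
    "large": set(range(25)) | set(range(1000, 1445)),
    "small": set(range(25)),
    "other": set(range(50, 90)),
}
print(filter_then_gather(assembly, plasmids))
```

The pre-filter removes poorly contained large plasmids, so they can no longer win shared hashes from a fully contained smaller one.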

ctb changed the title from "gather anomaly" to "gather does not break ties in any consistent manner" on Mar 6, 2021
@ctb
Contributor

ctb commented Mar 6, 2021

hi @klamens I think the information provided in the CSV file output by sourmash prefetch in #1370 would meet your needs - let me know if you need changes or additions!

@klamens
Author

klamens commented Mar 8, 2021

It looks good to me. But if I only want the 'match containment', how does it differ from sourmash search?

@ctb
Contributor

ctb commented Mar 8, 2021

Thanks for taking a look!

I think you need match containment and match bp to do what you want (which is tie break), and sourmash search doesn't provide match bp.

I'd be in favor of upgrading sourmash search to produce more useful CSV output, actually. But I'd have to look into that more, and because of our versioning approach, it would be hard to change it substantially until the next major release of sourmash (5.0) which isn't close. I hope sourmash prefetch from #1370 will be available in sourmash 4.1, which could be relatively soon (a few weeks to a few months).

Actually, now that I think of it, sourmash search would still suffer from the same problem (not showing ties properly); sourmash prefetch is a bit raw-er in terms of output and would explicitly show all relevant matches, including ties, so that you can post-process like I think you need to.
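
As a rough illustration of that post-processing, the sketch below ranks matches by overlap and breaks ties by match size. The column names `match_name`, `intersect_bp`, and `match_bp` are assumptions made for this example and should be checked against the header of the real CSV; the rows are toy data.

```python
import csv
import io


def rank_matches(csv_text):
    """Sort match rows by overlap (descending), then by match size (ascending).

    Hypothetical helper: the column names are assumed, not taken from the
    actual prefetch output.
    """
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    return sorted(
        rows,
        key=lambda r: (-float(r["intersect_bp"]), float(r["match_bp"]), r["match_name"]),
    )


# Toy CSV with two matches that tie on overlap.
toy_csv = """match_name,intersect_bp,match_bp
large_plasmid,2500,47000
small_plasmid,2500,2500
"""

ranked = rank_matches(toy_csv)
print([row["match_name"] for row in ranked])  # ['small_plasmid', 'large_plasmid']
```

Because every tied match is listed explicitly, the tie-break policy lives in the post-processing step and can be changed without touching the search itself.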

@ctb
Contributor

ctb commented May 8, 2021

#1370 will indeed provide a first-cut solution to this.

@ctb
Contributor

ctb commented Jun 7, 2021

similar issue over at marbl/Mash#159

@ctb
Contributor

ctb commented Sep 23, 2021

Fixed in #1370, I think, and internal functionality discussed in #1615.

ctb closed this as completed on Sep 23, 2021