Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

A potential bug? search not reporting matches w/identical md5s #2805

Open
chunyuma opened this issue Oct 9, 2023 · 2 comments
Open

A potential bug? search not reporting matches w/identical md5s #2805

chunyuma opened this issue Oct 9, 2023 · 2 comments

Comments

@chunyuma
Copy link

chunyuma commented Oct 9, 2023

Hi @ctb,

I built a test database where I duplicated a genome but given two different names 48246288 and 482462:

internal_location,md5,md5short,ksize,moltype,num,scaled,n_hashes,with_abundance,name,filename
signatures/e1215ff0aa1e9c23122889d108385737.sig.gz,e1215ff0aa1e9c23122889d108385737,e1215ff0,31,DNA,0,100,79324,1,48246288,../all_genomes/482462.fna
signatures/e1215ff0aa1e9c23122889d108385737.sig.gz_0,e1215ff0aa1e9c23122889d108385737,e1215ff0,31,DNA,0,100,79324,1,482462,../all_genomes/482462.fna

When I ran

sourmash search signatures/e1215ff0aa1e9c23122889d108385737.sig.gz_0 test_k31_scale100.zip --ignore-abundance --containment --estimate-ani-ci -t 0 -n 0 -o test.csv

to find the most similar genomes for 482462, I expect it would return both 48246288 and 482462but only 48246288:

$ cat test.csv | cut -d',' -f 4
name
48246288
2877941
3003260
1336794
2511995
2653858
749
716541
1100841

However, when I ran

sourmash search signatures/e1215ff0aa1e9c23122889d108385737.sig.gz test_k31_scale100.zip --ignore-abundance --containment --estimate-ani-ci -t 0 -n 0 -o test1.csv 

for 48246288 , it also returns 48246288 only:

$  cat test1.csv | cut -d',' -f 4
name
48246288
2877941
3003260
1336794
2511995
2653858
749
716541
1100841

Is it a bug? Or it is normal because these two genomes are the same and haves the same MD5 hash, so only one is reported?

@ctb
Copy link
Contributor

ctb commented Oct 10, 2023

I'll look into it! It should be reporting both matches. Thank you for providing so much info!

@ctb ctb changed the title A potential bug? A potential bug? search not reporting matches w/identical md5s Oct 10, 2023
@ctb
Copy link
Contributor

ctb commented Aug 16, 2024

Also found in #3284 by @agombolay!

This is caused by the following code, which intentionally removes duplicate sketches from consideration:

for db in databases:
search_iter = db.search(query, **kwargs)
for score, match, filename in search_iter:
md5 = match.md5sum()
if md5 not in found_md5:
results.append((score, match, filename))
found_md5.add(md5)

As I wrote in #3284,

... search is collapsing identical sketches and only reporting one result per distinct sketch while multisearch is reporting one result per matching database entry. The latter behavior is IMO better because it reports all matching names, while the former behavior will let you know that there's at least one match, but not report all of the matches.

Note that the new multisearch in the branchwater plugin is much faster and returns the correct set of results.


This behavior exists throughout search - also in the cos similarity code:

for score, match, filename in search_iter:
md5 = match.md5sum()
if md5 not in found_md5:
results.append((score, match, filename))
found_md5.add(md5)


If we want to fix it, the quickest fix would be to supply an additional flag to search that says not to collapse identical sketches.

An alternative (or addition) would be to follow through on some parts of the refactoring suggested in #2002

Thoughts? My sense is that multisearch in the branchwater plugin provides an adequate solution to the search problem, but it's easy to add the flag if that is of interest!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants