what do we do about identical signatures when saving? #1501

Open
ctb opened this issue May 5, 2021 · 5 comments

ctb commented May 5, 2021

from @bluegenes' comment on #1497:

hm, your testing made me realize that this has an interesting (and perhaps undesirable) consequence -- true duplicated sigs will not be caught/will be treated separately. This doesn't seem like too much of an issue? If we have a true duplicated sig in a database, it would return identical results and gather would randomly choose one to return, right?

Presumably (eventually), if we're selecting by name, we can choose to select just one of the duplicated sigs, if all metadata match?

worth discussing!

my current hot take is that truly identical signatures (hashes + metadata identical) should generally not be saved, but I think there are performance nuances to discuss around tracking such things in really large collections of signatures. Hence this issue!
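
A minimal sketch of the "don't save truly identical signatures" idea, to make the tracking cost concrete. It is plain Python, not sourmash code: `signature_key` and `save_unique` are hypothetical helpers, records are simple `(name, filename, hashes)` tuples, and the digest here is an illustrative stand-in rather than sourmash's own md5sum calculation.

```python
import hashlib


def signature_key(name, filename, hashes):
    """Hypothetical identity key: metadata plus a digest of the sorted hashes."""
    digest = hashlib.md5()
    for value in sorted(hashes):
        digest.update(str(value).encode("ascii"))
        digest.update(b",")  # separator so [1, 23] and [12, 3] differ
    return (name, filename, digest.hexdigest())


def save_unique(records):
    """Keep the first copy of each (metadata + hashes) combination."""
    seen = set()  # one small tuple per distinct signature
    kept = []
    for name, filename, hashes in records:
        key = signature_key(name, filename, hashes)
        if key in seen:
            continue  # hashes and metadata both identical: skip
        seen.add(key)
        kept.append((name, filename, hashes))
    return kept
```

The `seen` set grows by one entry per distinct signature, so memory use scales linearly with the size of the collection; that is the kind of cost that would need weighing for really large collections.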


ctb commented May 8, 2021

(what I did in #1497 is have them append _0, _1, etc. to the filenames when the filename was based on the md5sum.)
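
A rough sketch of that suffixing scheme, for illustration only. It is not the code from #1497: `unique_names` is a hypothetical helper, the `.sig.gz` extension is an assumption, and it also assumes the first signature keeps the plain md5sum name while later copies get `_0`, `_1`, and so on.

```python
def unique_names(md5sums, ext=".sig.gz"):
    """Return one filename per md5sum, suffixing repeats with _0, _1, ..."""
    used = set()
    names = []
    for md5 in md5sums:
        name = f"{md5}{ext}"
        i = 0
        # Bump the suffix until the name is free.
        while name in used:
            name = f"{md5}_{i}{ext}"
            i += 1
        used.add(name)
        names.append(name)
    return names
```

With this approach every duplicate is still written out, so nothing is lost, but identical content is stored more than once.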

@bluegenes

from #1574 (comment),

It would be nice if we came up with some strategies for handling duplicated md5sums downstream (e.g. report them as alternative match results?). Seems particularly important for taxonomy functionality.
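
One way the "alternative match results" idea could look, as a sketch in plain Python rather than sourmash code. `group_equivalent` is a hypothetical helper, and match rows are assumed to be simple `(name, md5sum)` pairs.

```python
from collections import defaultdict


def group_equivalent(rows):
    """Map each md5sum shared by more than one name to those names."""
    by_md5 = defaultdict(list)
    for name, md5 in rows:
        if name not in by_md5[md5]:
            by_md5[md5].append(name)
    # Only md5sums with several distinct names are "alternatives".
    return {md5: names for md5, names in by_md5.items() if len(names) > 1}
```

A taxonomy step could then check whether all names in a group resolve to the same lineage before reporting a single result.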


ctb commented Jun 26, 2021

relevant: #1573


ctb commented Sep 23, 2023

this has also become a persistent problem because of manifests; see #2774 and #2749


bluegenes commented Oct 6, 2023

From sourmash-bio/sourmash_plugin_branchwater#136 (comment)

@bluegenes:
Hmm, the optimal solution might be to use the manifest to load the same file for duplicates (not storing the duplicates at all). What are the challenges associated with that solution?

@ctb:
ooh! I like it ;). Not ready to commit to it on sourmash yet, but hot take is it's a leading contender!

Thinking through some of the challenges:
For zips used as query:

  • We need to use the manifest to make sure we provide results from each row
  • We should avoid storing true duplicates (same name, md5sum) in the manifest, as they would now result in duplicated results
  • How to handle collections without manifests? Probably: continue to write multiple signatures, b/c we would lose sigs if not. And/or: always write manifests 5.0/6.0?

For zips used as the database:

  • challenge remains as to what results to provide if search/gather matches signatures w/ identical md5sums. Which signature name do we provide as a result? Or can we provide a separate 'equivalent results' csv / printout that allows the user to investigate?
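
The manifest-based idea above could be sketched roughly as follows. This is an illustration in plain Python, not the sourmash manifest API: `build_manifest` is a hypothetical helper, the row keys (`name`, `md5`, `internal_location`) and the `signatures/<md5>.sig.gz` path layout are assumptions, and inputs are simple `(name, md5sum)` pairs.

```python
def build_manifest(sigs):
    """Return (manifest rows, stored file paths) for (name, md5sum) pairs.

    Each md5sum is stored once; rows with the same md5sum but different
    names point at the same file. True duplicates (same name and md5sum)
    get no extra row.
    """
    location_for = {}  # md5sum -> path of the single stored copy
    seen_rows = set()
    rows = []
    for name, md5 in sigs:
        if (name, md5) in seen_rows:
            continue  # true duplicate: would only duplicate results
        seen_rows.add((name, md5))
        location = location_for.setdefault(md5, f"signatures/{md5}.sig.gz")
        rows.append({"name": name, "md5": md5, "internal_location": location})
    return rows, sorted(location_for.values())
```

For a query zip, iterating over the rows rather than the stored files would give one result per row. It does not address collections without manifests, where the names attached to shared files would be lost.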
