Deal with duplicated MD5 in storages #994

Merged
merged 4 commits into master on May 27, 2020

Conversation

@luizirber luizirber commented May 22, 2020

#648 unveiled a pre-existing bug: since indices save data to a location based on md5sum, what should happen with files that potentially have the same MD5 but different content?

This PR adds two test files that match this case: both contain empty k=21 scaled minhash sketches, and so have the same MD5 (despite coming from different datasets).
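
To make the collision concrete, here is a minimal sketch using sourmash's Python API (the dataset names are illustrative, not the actual test files added in this PR):

```python
import sourmash

# Two empty scaled MinHash sketches with the same ksize: no hashes are
# ever added, so the MD5 (computed over ksize + hash values) is identical.
mh_a = sourmash.MinHash(n=0, ksize=21, scaled=1000)
mh_b = sourmash.MinHash(n=0, ksize=21, scaled=1000)

sig_a = sourmash.SourmashSignature(mh_a, name="dataset_a")
sig_b = sourmash.SourmashSignature(mh_b, name="dataset_b")

# The name is not part of the MD5, so the signatures collide:
assert sig_a.md5sum() == sig_b.md5sum()
```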

Note that FSStorage didn't check for duplicates, so it just overwrote the previous entry. This actually turns the tree into a DAG =]
ZipStorage, on the other hand, triggers an error if entries are duplicated.

Solutions?

  • Expand md5sum to take more data into account (for now it hashes the ksize plus each individual hash value). This breaks all previously generated signatures.
  • Storage.save takes a path argument and returns the path where the content was actually written. For FSStorage and ZipStorage these are the same, but for IPFSStorage it returns the IPFS hash instead, so we can just change the path for the other storages too (MD5 + "_1", and so on) and avoid creating duplicated entries. Since Storage.load takes whatever Storage.save returned when reading from a Storage, this is backwards-compatible, and it is also correct, since it avoids the tree-to-DAG problem (see the sketch after this list).
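
As a rough sketch of the second option for a simplified FSStorage-like backend (illustrative only; the actual changes live in sourmash/sbt_storage.py):

```python
import os


class FSStorage:
    """Sketch of option 2: disambiguate duplicated paths with an _n suffix."""

    def __init__(self, location):
        self.location = location

    def save(self, path, content):
        # Return the path actually used; callers must keep it and pass
        # it back to load() later.
        newpath = path
        n = 0
        fullpath = os.path.join(self.location, newpath)
        while os.path.exists(fullpath):
            with open(fullpath, "rb") as f:
                if f.read() == content:
                    return newpath  # identical content: reuse this entry
            # Same MD5-based path, different content: try md5_1, md5_2, ...
            n += 1
            newpath = "{}_{}".format(path, n)
            fullpath = os.path.join(self.location, newpath)
        with open(fullpath, "wb") as f:
            f.write(content)
        return newpath

    def load(self, path):
        with open(os.path.join(self.location, path), "rb") as f:
            return f.read()
```

Because callers always keep whatever save() returns and hand it back to load(), renamed entries stay reachable, just as IPFSStorage callers already keep the returned IPFS hash.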

Checklist

  • Is it mergeable?
  • make test Did it pass the tests?
  • make coverage Is the new code covered?
  • Did it change the command-line interface? Only additions are allowed
    without a major version increment. Changing file formats also requires a
    major version number increment.
  • Was a spellchecker run on the source code and documentation after
    changes were made?

@luizirber luizirber changed the title from "Deal with duplicated md5sum" to "[WIP] Deal with duplicated md5sum" on May 22, 2020

codecov bot commented May 22, 2020

Codecov Report

Merging #994 into master will decrease coverage by 0.01%.
The diff coverage is 94.28%.

@@            Coverage Diff             @@
##           master     #994      +/-   ##
==========================================
- Coverage   92.23%   92.23%   -0.01%     
==========================================
  Files          72       72              
  Lines        5368     5393      +25     
==========================================
+ Hits         4951     4974      +23     
- Misses        417      419       +2     
Impacted Files Coverage Δ
sourmash/sbt_storage.py 92.11% <94.28%> (-0.02%) ⬇️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update ce48fe6...dc00792.

@luizirber
Member Author

I implemented the _n logic for ZipStorage, and checked that we are getting the right number of nodes saved. Now FSStorage is failing, because it only saves 2 nodes, not the 3 required (1 internal, 2 leaves).
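
A hypothetical check along these lines (assuming node entries in the zip share a ".sbt." prefix; the filename is illustrative):

```python
import zipfile

# After saving an SBT built from the two colliding signatures, the
# storage should hold 3 nodes (1 internal + 2 leaves), not 2.
with zipfile.ZipFile("tree.sbt.zip") as zf:
    nodes = [entry for entry in zf.namelist() if ".sbt." in entry]

assert len(nodes) == 3, nodes
```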

@luizirber
Member Author

Fixed FSStorage too, using the same logic as ZipStorage.

@luizirber luizirber changed the title from "[WIP] Deal with duplicated md5sum" to "Deal with duplicated MD5 in storages" on May 23, 2020
@luizirber
Member Author

Ready for review and merge @ctb

Possibly release this as 3.3.1 too, since this bug is annoying when it happens (it triggers after the SBT is built, during saving)?
