Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[MRG] Fix SBT duplicates bug #1568

Merged
merged 25 commits into from
Jun 7, 2021
Merged

[MRG] Fix SBT duplicates bug #1568

merged 25 commits into from
Jun 7, 2021

Conversation

ctb
Copy link
Contributor

@ctb ctb commented Jun 4, 2021

For Zip SBTs, signatures with identical md5sums were being saved correctly as different signatures, but the updated signature filenames resulting from duplicate md5sums were not being communicated back to the tree properly. This meant that the names of a number of signatures (but not their content) were being lost, triggering the behavior seen over in #1511 (comment) as well as a number of apparent duplicate signatures per #1171.

Fixes #1171.

TODO:

  • add basic tests that trigger the problem!
  • add tests where we see what happens if the SBT already exists ;)

@codecov
Copy link

codecov bot commented Jun 4, 2021

Codecov Report

Merging #1568 (a52e1a7) into latest (01de852) will increase coverage by 0.17%.
The diff coverage is 92.42%.

Impacted file tree graph

@@            Coverage Diff             @@
##           latest    #1568      +/-   ##
==========================================
+ Coverage   80.96%   81.14%   +0.17%     
==========================================
  Files         102      102              
  Lines       10299    10322      +23     
  Branches     1165     1169       +4     
==========================================
+ Hits         8339     8376      +37     
+ Misses       1751     1741      -10     
+ Partials      209      205       -4     
Flag Coverage Δ
python 89.27% <92.42%> (+0.24%) ⬆️
rust 66.59% <ø> (ø)

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
src/sourmash/sbt_storage.py 89.90% <91.83%> (+7.00%) ⬆️
src/sourmash/sbt.py 81.04% <94.11%> (+0.25%) ⬆️
src/sourmash/sourmash_args.py 94.24% <0.00%> (+0.02%) ⬆️
src/sourmash/lca/lca_utils.py 87.90% <0.00%> (+0.09%) ⬆️
src/sourmash/utils.py 78.94% <0.00%> (+1.75%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 01de852...a52e1a7. Read the comment docs.

@ctb ctb changed the title [DEBUG] Trying various ways to track down SBT duplicates bug [WIP] Fix SBT duplicates bug Jun 4, 2021
@ctb ctb changed the title [WIP] Fix SBT duplicates bug [MRG] Fix SBT duplicates bug Jun 4, 2021
@ctb
Copy link
Contributor Author

ctb commented Jun 4, 2021

The tests are failing because of the need to support --append for zip SBTs; the revised code is saving the index as .sbt.json_0. I am thinking of adding a new function to SBT storage that is save_exact, which will replace the contents of the file if needed without generating a new filename.

@ctb
Copy link
Contributor Author

ctb commented Jun 4, 2021

OK, @luizirber, I think this is ready for a look.

src/sourmash/sbt.py Show resolved Hide resolved
@ctb
Copy link
Contributor Author

ctb commented Jun 4, 2021

@luizirber there's one failing test under python 3.9 on linux - ipfsstorage. I'm unclear on whether it's something I broke, tho.

@luizirber
Copy link
Member

luizirber commented Jun 4, 2021

@luizirber there's one failing test under python 3.9 on linux - ipfsstorage. I'm unclear on whether it's something I broke, tho.

It's something that was broken: list_sbts is not implemented for IPFSStorage (but wasn't tested/checked/needed before, because IPFS-backed SBTs usually are a single JSON anyway)
https://github.com/dib-lab/sourmash/pull/1568/checks?check_run_id=2748963142#step:12:135

(actually implementing it is interesting because it leads to using an IPFS address as we use .zip files nowadays, hmm...)

Co-authored-by: Luiz Irber <luizirber@users.noreply.github.com>
@ctb
Copy link
Contributor Author

ctb commented Jun 4, 2021

@luizirber there's one failing test under python 3.9 on linux - ipfsstorage. I'm unclear on whether it's something I broke, tho.

It's something that was broken: list_sbts is not implemented for IPFSStorage (but wasn't tested/checked/needed before, because IPFS-backed SBTs usually are a single JSON anyway)
https://github.com/dib-lab/sourmash/pull/1568/checks?check_run_id=2748963142#step:12:135

(actually implementing it is interesting because it leads to using an IPFS address as we use .zip files nowadays, hmm...)

OK. Any thoughts on what to do about it, or why it's acting up now? I can dig into it later, just wondering if you know off the top of your head.

@ctb
Copy link
Contributor Author

ctb commented Jun 5, 2021

yay w00t! fixed!

@ctb
Copy link
Contributor Author

ctb commented Jun 5, 2021

(ready for review and merge @luizirber)

Co-authored-by: Luiz Irber <luizirber@users.noreply.github.com>
@luizirber
Copy link
Member

OK. Any thoughts on what to do about it, or why it's acting up now? I can dig into it later, just wondering if you know off the top of your head.

Trying to think a bit conceptually about what is happening when saving/loading:

  • a location is always required. It can be a JSON file (with the SBT description, for FSStorage and IPFSStorage) or a Zip file containing the SBT JSON (for ZipStorage).
  • storage is not required, but can be passed to replace the storage defined in the location. For example, we can use an SBT with a FSStorage in location, but actually use a ZipStorage to load/save data.

Part of the confusion in using this API is that we mostly depend on the extension of the location to decide what to do. I'll try to come up with a table later to map behaviors...

@ctb
Copy link
Contributor Author

ctb commented Jun 7, 2021

OK. Any thoughts on what to do about it, or why it's acting up now? I can dig into it later, just wondering if you know off the top of your head.

Trying to think a bit conceptually about what is happening when saving/loading:

* a `location` is always required. It can be a JSON file (with the SBT description, for `FSStorage` and `IPFSStorage`) or a Zip file containing the SBT JSON (for `ZipStorage`).

* `storage` is not required, but can be passed to **replace** the storage defined in the `location`. For example, we can use an SBT with a `FSStorage` in `location`, but actually use a `ZipStorage` to load/save data.

Part of the confusion in using this API is that we mostly depend on the extension of the location to decide what to do. I'll try to come up with a table later to map behaviors...

Right - and I have a much better understanding of the code's behavior now 😄 , which is nice!

@ctb ctb merged commit bb723a4 into latest Jun 7, 2021
@ctb ctb deleted the debug/sbt_dups branch June 7, 2021 16:46
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

duplication of signatures seen in large SBT databases
2 participants