Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Re-design zarr checksum to not rely on "full path"s (and become a proper tree checksum) #931

Closed
yarikoptic opened this issue Feb 25, 2022 · 6 comments · Fixed by dandi/dandi-schema#120
Assignees
Milestone

Comments

@yarikoptic
Copy link
Member

Realization (for me) came in dandi/dandi-cli#923 (comment) that for zarr checksum a "full" path from the top of zarr asset is used for computing that file/path contribution to the checksum.
Examples of that effect are in https://github.com/dandi/dandischema/blob/master/dandischema/digests/tests/test_zarr.py showing such "full paths" used in ZarrChecksum . And looking at the code https://github.com/dandi/dandischema/blob/master/dandischema/digests/zarr.py (because design file I found primarily has only TODOs on those aspects) it seems that zarr checksumming relies on a full list of paths and doesn't rely on hierarchical (nested) structure of a checksum of higher directory to be aggregation of checksums of immediate subdirectories + files.

IMHO such design is "suboptimal"

  • conceptually in that the checksum of a subdirectory (within a zarr directory) for some reason depends not just on its content but on where in the zarr it resides
  • in "implementation" , as my confusion/question in Minimize/optimize Zarr digestion when uploading dandi-cli#923 (comment) showed, because requires knowing the top directory from which to "compute checksum of the subcomponent", so more complicated API.
  • "memory demand" -- full relative paths should be stored, thus replicating leading prefixes over and over again for no real reason.

IMHO ideally underlying checksum datastructure should be a tree, with only immediate children names contributing to checksum computation.

Attn @dchiquito @satra

@satra
Copy link
Member

satra commented Feb 27, 2022

@yarikoptic - as i understand the code, it's a generic checksum system at a given level of a tree. it doesn't walk anything or even digest a file. that would be a utility function written on top of those classes.

to use this code locally, it would have to be combined with a walk and a digest of a file. it would be useful to add such a utility function as well.

and as far as i know, @dchiquito implementation on the api side is a tree. i.e. a checksum at a level of tree is a digest of the structure underneath it.

@yarikoptic
Copy link
Member Author

@satra
Yes, underneath it, but with full paths from top of the tree (at least according to @jwodder ), is that right @dchiquito ?

@dchiquito dchiquito added this to the Zarr milestone Feb 28, 2022
@dchiquito dchiquito transferred this issue from dandi/dandi-schema Feb 28, 2022
@dchiquito
Copy link
Contributor

@yarikoptic you are correct about the current design, it does use full paths in the checksum files. I did it that way so that you could identify the exact object described in the .checksum file entry, but if you were able to obtain the .checksum file then you already have the containing directory, so the information is technically redundant.

I don't have an objection to removing that information, but it would require a recalculation of every existing zarr checksum, or some kind of versioned hash scheme. The current scheme is dandi:dandi-zarr-checksum. We would need to either alter the definition of that scheme or introduce a new one. @AlmightyYakob if we altered the checksum definition as Yarik proposes, the recalculation script could be used to redo all the checksums, correct?

@jjnesbitt
Copy link
Member

@AlmightyYakob if we altered the checksum definition as Yarik proposes, the recalculation script could be used to redo all the checksums, correct?

Yes, we could just run it on all affected dandisets/zarrs.

@yarikoptic
Copy link
Member Author

since we do not have yet (much) of dandisets with zarrs in main deployment, may be indeed we could take a shortcut and just redefine/recompute dandi:dandi-zarr-checksum instead of coming up with a new (e.g., dandi:dandi-zarr-checksum2 ;)). we would just need to rush/coordinate change of compute in dandi-cli as well, and declare prior state of the dandiverse (both dandischema, dandi-cli?) as "bad", right?

@dchiquito
Copy link
Contributor

I'm addressing this in #937

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants