Re-design zarr checksum to not rely on "full path"s (and become a proper tree checksum) #931

yarikoptic · 2022-02-25T19:47:30Z

Realization (for me) came in dandi/dandi-cli#923 (comment) that for zarr checksum a "full" path from the top of zarr asset is used for computing that file/path contribution to the checksum.
Examples of that effect are in https://github.com/dandi/dandischema/blob/master/dandischema/digests/tests/test_zarr.py showing such "full paths" used in ZarrChecksum . And looking at the code https://github.com/dandi/dandischema/blob/master/dandischema/digests/zarr.py (because design file I found primarily has only TODOs on those aspects) it seems that zarr checksumming relies on a full list of paths and doesn't rely on hierarchical (nested) structure of a checksum of higher directory to be aggregation of checksums of immediate subdirectories + files.

IMHO such design is "suboptimal"

conceptually in that the checksum of a subdirectory (within a zarr directory) for some reason depends not just on its content but on where in the zarr it resides
in "implementation" , as my confusion/question in Minimize/optimize Zarr digestion when uploading dandi-cli#923 (comment) showed, because requires knowing the top directory from which to "compute checksum of the subcomponent", so more complicated API.
"memory demand" -- full relative paths should be stored, thus replicating leading prefixes over and over again for no real reason.

IMHO ideally underlying checksum datastructure should be a tree, with only immediate children names contributing to checksum computation.

Attn @dchiquito @satra

satra · 2022-02-27T14:42:19Z

@yarikoptic - as i understand the code, it's a generic checksum system at a given level of a tree. it doesn't walk anything or even digest a file. that would be a utility function written on top of those classes.

to use this code locally, it would have to be combined with a walk and a digest of a file. it would be useful to add such a utility function as well.

and as far as i know, @dchiquito implementation on the api side is a tree. i.e. a checksum at a level of tree is a digest of the structure underneath it.

yarikoptic · 2022-02-27T22:45:41Z

@satra
Yes, underneath it, but with full paths from top of the tree (at least according to @jwodder ), is that right @dchiquito ?

dchiquito · 2022-02-28T19:51:50Z

@yarikoptic you are correct about the current design, it does use full paths in the checksum files. I did it that way so that you could identify the exact object described in the .checksum file entry, but if you were able to obtain the .checksum file then you already have the containing directory, so the information is technically redundant.

I don't have an objection to removing that information, but it would require a recalculation of every existing zarr checksum, or some kind of versioned hash scheme. The current scheme is dandi:dandi-zarr-checksum. We would need to either alter the definition of that scheme or introduce a new one. @AlmightyYakob if we altered the checksum definition as Yarik proposes, the recalculation script could be used to redo all the checksums, correct?

jjnesbitt · 2022-02-28T19:53:10Z

@AlmightyYakob if we altered the checksum definition as Yarik proposes, the recalculation script could be used to redo all the checksums, correct?

Yes, we could just run it on all affected dandisets/zarrs.

yarikoptic · 2022-02-28T21:19:15Z

since we do not have yet (much) of dandisets with zarrs in main deployment, may be indeed we could take a shortcut and just redefine/recompute dandi:dandi-zarr-checksum instead of coming up with a new (e.g., dandi:dandi-zarr-checksum2 ;)). we would just need to rush/coordinate change of compute in dandi-cli as well, and declare prior state of the dandiverse (both dandischema, dandi-cli?) as "bad", right?

dchiquito · 2022-03-02T14:44:31Z

I'm addressing this in #937

yarikoptic mentioned this issue Feb 25, 2022

Minimize/optimize Zarr digestion when uploading dandi/dandi-cli#923

Merged

dchiquito added this to the Zarr milestone Feb 28, 2022

dchiquito transferred this issue from dandi/dandi-schema Feb 28, 2022

dchiquito mentioned this issue Mar 2, 2022

Update zarr checksum file format #937

Closed

dchiquito self-assigned this Mar 2, 2022

waxlamp mentioned this issue Mar 7, 2022

Update zarr checksums dandi/dandi-schema#120

Merged

dchiquito closed this as completed in dandi/dandi-schema#120 Mar 16, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Re-design zarr checksum to not rely on "full path"s (and become a proper tree checksum) #931

Re-design zarr checksum to not rely on "full path"s (and become a proper tree checksum) #931

yarikoptic commented Feb 25, 2022

satra commented Feb 27, 2022

yarikoptic commented Feb 27, 2022

dchiquito commented Feb 28, 2022

jjnesbitt commented Feb 28, 2022

yarikoptic commented Feb 28, 2022

dchiquito commented Mar 2, 2022

Re-design zarr checksum to not rely on "full path"s (and become a proper tree checksum) #931

Re-design zarr checksum to not rely on "full path"s (and become a proper tree checksum) #931

Comments

yarikoptic commented Feb 25, 2022

satra commented Feb 27, 2022

yarikoptic commented Feb 27, 2022

dchiquito commented Feb 28, 2022

jjnesbitt commented Feb 28, 2022

yarikoptic commented Feb 28, 2022

dchiquito commented Mar 2, 2022