-
Notifications
You must be signed in to change notification settings - Fork 10
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Re-design zarr checksum to not rely on "full path"s (and become a proper tree checksum) #931
Comments
@yarikoptic - as i understand the code, it's a generic checksum system at a given level of a tree. it doesn't walk anything or even digest a file. that would be a utility function written on top of those classes. to use this code locally, it would have to be combined with a walk and a digest of a file. it would be useful to add such a utility function as well. and as far as i know, @dchiquito implementation on the api side is a tree. i.e. a checksum at a level of tree is a digest of the structure underneath it. |
@satra |
@yarikoptic you are correct about the current design, it does use full paths in the checksum files. I did it that way so that you could identify the exact object described in the .checksum file entry, but if you were able to obtain the .checksum file then you already have the containing directory, so the information is technically redundant. I don't have an objection to removing that information, but it would require a recalculation of every existing zarr checksum, or some kind of versioned hash scheme. The current scheme is |
Yes, we could just run it on all affected dandisets/zarrs. |
since we do not have yet (much) of dandisets with zarrs in main deployment, may be indeed we could take a shortcut and just redefine/recompute |
I'm addressing this in #937 |
Realization (for me) came in dandi/dandi-cli#923 (comment) that for zarr checksum a "full" path from the top of zarr asset is used for computing that file/path contribution to the checksum.
Examples of that effect are in https://github.com/dandi/dandischema/blob/master/dandischema/digests/tests/test_zarr.py showing such "full paths" used in ZarrChecksum . And looking at the code https://github.com/dandi/dandischema/blob/master/dandischema/digests/zarr.py (because design file I found primarily has only TODOs on those aspects) it seems that zarr checksumming relies on a full list of paths and doesn't rely on hierarchical (nested) structure of a checksum of higher directory to be aggregation of checksums of immediate subdirectories + files.
IMHO such design is "suboptimal"
IMHO ideally underlying checksum datastructure should be a tree, with only immediate children names contributing to checksum computation.
Attn @dchiquito @satra
The text was updated successfully, but these errors were encountered: