
Update zarr checksum file format #937

Closed
dchiquito opened this issue Mar 2, 2022 · 9 comments · Fixed by dandi/dandi-schema#120

@dchiquito
Contributor

The zarr .checksum file format has some deficiencies that should be addressed ASAP, before too many zarrs are uploaded.

@satra
Member

satra commented Mar 2, 2022

@dchiquito and @yarikoptic - while we are updating this format, can we consider a digest of the form <md5>-<#files_in_tree>. this would make it somewhat analogous to our multipart uploader for normal objects.

@dchiquito
Contributor Author

@satra are you proposing that a file's digest is simply <md5> and a directory's digest is <md5>-<number_of_children>, where number_of_children is the number of files and directories that the directory contains?

I personally don't see what value that gives us. The multipart upload checksum form is <md5>-<number_of_parts>, as opposed to <md5> for normal uploads. This is useful because it indicates which checksumming algorithm was used on a file.
For zarr checksums, there is only one checksumming algorithm for files, and one for directories.
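For context, the multipart form referenced here follows the S3-style multipart ETag: the md5 of the concatenated per-part md5 digests, suffixed with the part count. A minimal sketch (illustrative only, not dandi code):

```python
import hashlib

def multipart_etag(part_bytes: list[bytes]) -> str:
    """S3-style multipart ETag: md5 over the concatenated binary md5
    digests of each part, suffixed with -<number_of_parts>."""
    if len(part_bytes) == 1:
        # A single-part upload gets a plain md5, with no suffix --
        # which is why the suffix reveals the algorithm used.
        return hashlib.md5(part_bytes[0]).hexdigest()
    digests = b"".join(hashlib.md5(p).digest() for p in part_bytes)
    return f"{hashlib.md5(digests).hexdigest()}-{len(part_bytes)}"
```

The presence or absence of the `-<number_of_parts>` suffix is what distinguishes the two checksumming algorithms for ordinary uploads.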

@satra
Member

satra commented Mar 2, 2022

a directory's digest is <md5>-<number_of_children>, where number_of_children is the number of files and directories that the directory contains?

a directory digest is the md5 that we are computing, which is simply an md5 of the json, but the -<part> would be the total number of leaves in the entire tree of that node.

what value that gives us.

it gives us a quick representation of the set of components that make up the eventual checksum. just like multipart count simply tells us the set of parts that went into computing the etag.
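The scheme described above — a directory digest that is the md5 of a JSON listing of child digests, suffixed with the total number of leaf files under that node — could be sketched as follows. The JSON layout and function names here are hypothetical; the actual .checksum format is defined in dandi-schema:

```python
import hashlib
import json

def file_digest(data: bytes) -> tuple[str, int]:
    """A file contributes its plain md5 and counts as one leaf."""
    return hashlib.md5(data).hexdigest(), 1

def directory_digest(children: dict[str, tuple[str, int]]) -> tuple[str, int]:
    """Directory digest per the proposal: md5 over a canonical JSON
    listing of (name, digest) pairs, suffixed with the total leaf count
    of the whole subtree. Illustrative layout, not the real format."""
    listing = json.dumps(
        sorted((name, digest) for name, (digest, _) in children.items())
    ).encode()
    leaves = sum(count for _, count in children.values())
    return f"{hashlib.md5(listing).hexdigest()}-{leaves}", leaves
```

Because each directory entry carries its subtree's leaf count, the suffix on the root digest is the total number of files in the zarr.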

@satra
Member

satra commented Mar 2, 2022

btw, the checksumming algorithm used is whatever we define the schema property to be and fix the algorithm to be in dandischema. the checksum itself doesn't tell us anything about the algorithm at this point. (we did not go to multihash for this). and we do anticipate that future versions of the archive may have other algorithms that are used (e.g., based on blake or others).

@dchiquito
Contributor Author

the -<part> would be the total number of leaves in the entire tree of that node.

So a zarr containing 1,000,000 files would have a checksum <md5>-1000000? I suppose that might be useful. I will amend dandi/dandi-schema#120

@yarikoptic
Member

greedy me, echoing the "size" in the original description: maybe it should be <md5>-{#files}--{size}? Some rationale: for dandi-etag, since we decide how to chunk the multipart upload, we can deduce the size "order" merely from seeing -#parts, but #files for a zarr alone gives us no idea of the size. Anecdotal support: git-annex has the file size as part of its key for the content, so it is {backend-type-e.g.MD5}-s{size}--{digest}[{.extension-if-backendwithE}]. We found having that -s{size} handy on a number of occasions, e.g. not needing to make any remote calls to inquire about the size of the thing -- speed ups! So far we have not encountered evil malicious actors knowingly adding a wrong size ;)

@dchiquito
Contributor Author

The way the API would be set up once #925 is addressed, the API would include the checksum and size whenever a zarr file/directory is queried. I don't see a lot of value in also embedding that data into the checksum, but I have no moral objection.

@satra any objection to <md5>-<number_of_files>--<size>?

@satra
Member

satra commented Mar 2, 2022

any objection to <md5>-<number_of_files>--<size>?

no moral objection
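The agreed format could be composed and parsed along these lines (a sketch with hypothetical function names, not the dandi implementation):

```python
def format_zarr_checksum(md5: str, file_count: int, size: int) -> str:
    """Compose the agreed <md5>-<number_of_files>--<size> string."""
    return f"{md5}-{file_count}--{size}"

def parse_zarr_checksum(value: str) -> tuple[str, int, int]:
    """Split the checksum back into (md5, number_of_files, size).
    Splitting on the double '--' first keeps the single '-'
    separator before the file count unambiguous."""
    head, size = value.rsplit("--", 1)
    md5, file_count = head.rsplit("-", 1)
    return md5, int(file_count), int(size)
```

The double-dash separator before the size mirrors git-annex's `--` usage and makes the string parseable without knowing the md5 length.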

@jwodder
Member

jwodder commented Mar 24, 2022

@dchiquito Should this issue have been closed? The /api/zarr/{zarr_id}.zarr/{path} responses still don't include file size, and the new checksum algorithm doesn't seem to have been adopted by dandi-archive yet.
