Add design doc for Zarr upload process simplification #1415

Merged: 6 commits merged into master from zarr-overhaul-dd on Jan 12, 2023

Conversation

@waxlamp (Member) commented Dec 21, 2022

@satra, @yarikoptic, and @jwodder, please take a look at this design doc (rendered).

This simpler design should solve the remaining performance problems with Zarr upload without compromising correctness, requiring only minor changes in how the archive and the CLI coordinate (see the doc itself for details).

@jwodder (Member) commented Dec 21, 2022

Questions so far:

  • How is the wait for the Zarr to reach "COMPLETE" handled when uploading files to a pre-existing Zarr? Does requesting presigned upload URLs on a pre-existing Zarr cause it to enter a non-COMPLETE state?
  • The new procedure states "dandi-cli uses these URLs to upload the files using S3's Content-MD5 header to verify the uploaded file's integrity". Is this header provided by the Archive server when requesting presigned URLs or by the client when uploading to a presigned URL? Either way, if a malicious or buggy client uploads data that does not match the expected digest, what effect will this have on the upload?

@waxlamp (Member, Author) commented Dec 22, 2022

Questions so far:

  • How is the wait for the Zarr to reach "COMPLETE" handled when uploading files to a pre-existing Zarr? Does requesting presigned upload URLs on a pre-existing Zarr cause it to enter a non-COMPLETE state?

Good question. We would want the Zarr to be in a non-COMPLETE state before new files can be uploaded to it (or existing files removed). We could either make this explicit by requiring the client to send a POST request to "open" the Zarr again, or (as you suggested) have it happen upon a request for a new presigned URL. I think I like the former option better, because deleting Zarr chunks would require a step like that anyway.

  • The new procedure states "dandi-cli uses these URLs to upload the files using S3's Content-MD5 header to verify the uploaded file's integrity". Is this header provided by the Archive server when requesting presigned URLs or by the client when uploading to a presigned URL? Either way, if a malicious or buggy client uploads data that does not match the expected digest, what effect will this have on the upload?

The client would supply the header from its own computation of the MD5 checksum.

Using an incorrect digest results in a failed S3 upload. A buggy client that ignores this failure would omit that file from the uploaded Zarr, leading to non-matching Zarr checksums in the last steps of the upload. (It is relatively straightforward to recover from this situation, as sketched out in the Error Handling section.)
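For concreteness, here is a minimal sketch of the client side of such an upload, assuming the presigned URL was generated to accept a Content-MD5 header; the helper name and the use of requests are illustrative, not actual dandi-cli code:

```python
import base64
import hashlib

import requests


def upload_entry(presigned_url: str, path: str) -> None:
    """Upload one Zarr entry through a presigned URL, letting S3 verify
    integrity via Content-MD5 (illustrative helper, not dandi-cli code)."""
    with open(path, "rb") as f:
        data = f.read()
    # S3 expects the Content-MD5 header as the base64-encoded binary digest,
    # not the hex digest.
    content_md5 = base64.b64encode(hashlib.md5(data).digest()).decode()
    resp = requests.put(presigned_url, data=data, headers={"Content-MD5": content_md5})
    # If the body's MD5 does not match the header, S3 rejects the PUT with a
    # BadDigest error, so the corrupt object never lands in the bucket.
    resp.raise_for_status()
```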

@jwodder (Member) commented Jan 3, 2023

To be clear, after the Zarr is finalized, the client is supposed to compare the server's Zarr checksum with the local Zarr checksum, and if they don't match, then what? Is the client supposed to notify the server, and is this what leads to the "asking S3 for a list of all objects stored at the Zarr prefix in question" error handling step? Wouldn't it be better for the client to send the locally-computed Zarr checksum to the server in the "finalize" request in case, say, the client gets killed while waiting for finalization? Either way, it seems that a buggy or malicious client could easily report a checksum mismatch where there is none, leading to wasted cycles in communicating with S3.

@jjnesbitt (Member) commented

Wouldn't it be better for the client to send the locally-computed Zarr checksum to the server in the "finalize" request in case, say, the client gets killed while waiting for finalization? Either way, it seems that a buggy or malicious client could easily report a checksum mismatch where there is none, leading to wasted cycles in communicating with S3.

I actually like this idea, as it removes the need to "hold on" to a locally computed zarr checksum and makes the error case explicit where it was previously implicit. It would require an additional status on the zarr archive to indicate this ("CHECKSUM MISMATCH" or something), which I think is okay.
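A sketch of what that finalize call might look like from the client's side (the endpoint path, request body, and status names here are assumptions for illustration, not a settled API):

```python
import requests

API_URL = "https://api.dandiarchive.org/api"  # illustrative base URL


def finalize_zarr(zarr_id: str, local_checksum: str, token: str) -> str:
    """Send the locally computed Zarr checksum along with the finalize
    request, so the server can compare checksums itself even if the client
    dies while waiting (request body and status values are hypothetical)."""
    resp = requests.post(
        f"{API_URL}/zarr/{zarr_id}/finalize/",
        json={"checksum": local_checksum},  # hypothetical request field
        headers={"Authorization": f"token {token}"},
    )
    resp.raise_for_status()
    # The server would transition the Zarr to COMPLETE on a match, or to a
    # new CHECKSUM_MISMATCH-style status on a mismatch.
    return resp.json()["status"]
```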

@jwodder (Member) commented Jan 3, 2023

Another thing: How exactly does this handle upload cancellation? Say the client requests a presigned URL for a Zarr entry, but before the actual upload takes place, the user hits Ctrl-C. Then, the user locally deletes the Zarr entry and starts another upload. When the client fetches the list of entries already in the Zarr from the server, I assume that the cancelled entry will not be in the list, so the client won't do anything about it, but will the server's checksumming procedure assume that the entry was uploaded, leading to a checksum mismatch and fallback to the "Error handling" procedure?

@jjnesbitt (Member) commented

Another thing: How exactly does this handle upload cancellation? Say the client requests a presigned URL for a Zarr entry, but before the actual upload takes place, the user hits Ctrl-C. Then, the user locally deletes the Zarr entry and starts another upload. When the client fetches the list of entries already in the Zarr from the server, I assume that the cancelled entry will not be in the list, so the client won't do anything about it, but will the server's checksumming procedure assume that the entry was uploaded, leading to a checksum mismatch and fallback to the "Error handling" procedure?

The server doesn't store any information about zarr entries; it simply reads all of the entries from the storage medium (S3) during the checksumming process and produces an output. So in your scenario, the original entry for which the presigned URL was requested would not be counted in the final checksum.

Generally speaking, in this design there is no concept of "upload cancelling". So if you uploaded a bunch of files that you didn't actually want included in the zarr, you'd need to use the DELETE endpoint to remove them. However, if you merely requested presigned URLs for entries but didn't upload to them, no explicit action is needed to "cancel" those uploads.
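For example, cleaning up entries that did get uploaded might look like this (the endpoint shape and payload are assumptions based on the description above):

```python
import requests


def delete_zarr_entries(api_url: str, zarr_id: str, paths: list[str], token: str) -> None:
    """Remove entries that were uploaded but are no longer wanted; merely
    requesting presigned URLs without uploading needs no cleanup at all."""
    resp = requests.delete(
        f"{api_url}/zarr/{zarr_id}/files/",  # hypothetical endpoint path
        json=[{"path": p} for p in paths],
        headers={"Authorization": f"token {token}"},
    )
    resp.raise_for_status()
```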

@jwodder (Member) commented Jan 3, 2023

@AlmightyYakob

The server doesn't store any information about zarr entries, it simply reads all of the entries from the storage medium (S3) during the checksumming processing and produces an output.

The doc implies that reading from S3 would only be done as part of error handling when there's a checksum mismatch. If the server always fetches all checksums from S3, what is done differently for error handling?

@jjnesbitt (Member) commented

The doc implies that reading from S3 would only be done as part of error handling when there's a checksum mismatch. If the server always fetches all checksums from S3, what is done differently for error handling?

I think, for brevity, the diagram doesn't show what the worker itself is doing to compute the zarr checksum, since it's meant to be a view of the process from the client's perspective. But yes, the zarr checksum task and the file listing endpoint both start by listing objects from S3 (prefixed to the zarr location in the bucket).

The zarr file listing endpoint (used for error handling) processes the returned object list so that paths appear relative to the zarr root, and returns that list of objects.
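Roughly, that listing step could look like the following boto3 sketch (bucket and prefix handling are assumptions; the prefix-stripping is the point):

```python
import boto3


def list_zarr_entries(bucket: str, zarr_prefix: str):
    """Yield (path-relative-to-zarr-root, ETag) pairs for every object under
    the Zarr's S3 prefix, paginating through the full listing. For
    single-part uploads the ETag is the object's MD5 digest."""
    s3 = boto3.client("s3")
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=zarr_prefix):
        for obj in page.get("Contents", []):
            # Strip the prefix so paths appear relative to the zarr root.
            yield obj["Key"][len(zarr_prefix):].lstrip("/"), obj["ETag"].strip('"')
```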

The zarr checksumming task lists all the objects within the prefix from S3, and adds them to a "tree", which is then processed bottom-up.
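In simplified form, the bottom-up pass might look like this (an illustrative reduction over a flat, non-empty listing; the real task uses the dandi Zarr checksum serialization rather than this ad-hoc name:digest scheme):

```python
import hashlib
from collections import defaultdict


def tree_checksum(objects: dict[str, str]) -> str:
    """Collapse a flat {relative_path: md5} listing into a single root digest
    by hashing each directory's sorted child digests, deepest directory first
    (assumes a non-empty listing)."""
    dirs: dict[str, list[tuple[str, str]]] = defaultdict(list)
    for path, digest in objects.items():
        parent, _, name = path.rpartition("/")
        dirs[parent].append((name, digest))
    while True:
        # "" is the zarr root; every other directory is strictly deeper, so
        # the root is always collapsed last.
        deepest = max(dirs, key=lambda p: 0 if p == "" else p.count("/") + 1)
        listing = ";".join(f"{n}:{d}" for n, d in sorted(dirs.pop(deepest)))
        digest = hashlib.md5(listing.encode()).hexdigest()
        if deepest == "":
            return digest
        parent, _, name = deepest.rpartition("/")
        dirs[parent].append((name, digest))
```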

@jwodder (Member) commented Jan 12, 2023

I think the document should somewhere address uploading to a pre-existing Zarr as described in this comment.

@jjnesbitt (Member) commented

I think the document should somewhere address uploading to a pre-existing Zarr as described in this comment.

@jwodder This has been done in 078b544. The upload process for an existing zarr is merely a subset of the upload process for a new zarr, so I've just added a note that describes this.

@satra (Member) commented Jan 12, 2023

@AlmightyYakob and @jwodder - in the new schema, steps 4 and 5 do not seem to be based on batches. Do we know what the performance penalty is for moving to a single-file request-upload option?

@jjnesbitt (Member) commented

@AlmightyYakob and @jwodder - in the new schema, steps 4 and 5 do not seem to be based on batches. Do we know what the performance penalty is for moving to a single-file request-upload option?

This step is still done in bulk; it's just no longer limited to the concept of an "upload batch" (where no more presigned URLs can be requested until the current batch is fully uploaded). So you can request as few or as many presigned upload URLs as you'd like from the API (up to the server limit, currently 255 per request).
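Concretely, a client can just chunk its entry list into requests of at most 255 entries (a sketch; the endpoint path and payload fields are assumptions, and the 255 limit comes from this comment):

```python
from itertools import islice

import requests

BATCH_LIMIT = 255  # current server cap on presigned URLs per request


def request_upload_urls(api_url: str, zarr_id: str, entries: list[dict], token: str) -> list[dict]:
    """Request presigned upload URLs for all entries, at most BATCH_LIMIT per
    call; nothing blocks further requests while earlier uploads are in flight."""
    urls: list[dict] = []
    it = iter(entries)
    while chunk := list(islice(it, BATCH_LIMIT)):
        resp = requests.post(
            f"{api_url}/zarr/{zarr_id}/files/",  # hypothetical endpoint path
            json=chunk,  # e.g. [{"path": "0/0/0", "base64md5": "..."}]
            headers={"Authorization": f"token {token}"},
        )
        resp.raise_for_status()
        urls.extend(resp.json())
    return urls
```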

@satra (Member) commented Jan 12, 2023

Thanks @AlmightyYakob - I didn't see that mentioned in the doc; may be useful to add.

@jjnesbitt merged commit 522ebb8 into master on Jan 12, 2023
@jjnesbitt deleted the zarr-overhaul-dd branch on January 12, 2023
@dandibot commented

🚀 PR was released in v0.3.11 🚀
