Add design doc for Zarr upload process simplification #1415
Conversation
Questions so far:
Good question. We would want the Zarr in a non-COMPLETE state before new files can be uploaded to it (or existing files removed). We could either make this explicit by requiring the client to send a
The client would supply the header from its own computation of the MD5 checksum. Using an incorrect digest results in a failed S3 upload. A buggy client that ignores this failure would omit that file from the uploaded Zarr, leading to non-matching Zarr checksums in the last steps of the upload. (It is relatively straightforward to recover from this situation, as sketched out in the "Error handling" section.)
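To make the failure mode concrete, here is a minimal sketch of how a client could compute the `Content-MD5` value S3 expects (a base64-encoded binary MD5 digest, not the hex form); the commented-out upload call and variable names are illustrative, not the actual CLI API:

```python
import base64
import hashlib


def content_md5(data: bytes) -> str:
    """Return the base64-encoded MD5 digest that S3 expects in the
    Content-MD5 header (a real client would stream the file in chunks
    rather than holding it in memory)."""
    return base64.b64encode(hashlib.md5(data).digest()).decode()


# Hypothetical upload step: S3 rejects the PUT with a BadDigest error if
# the header does not match the bytes actually sent, so a client that
# silently ignores the failure leaves the file out of the Zarr, which
# surfaces later as a Zarr checksum mismatch.
#
# requests.put(presigned_url, data=data,
#              headers={"Content-MD5": content_md5(data)})
```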
To be clear, after the Zarr is finalized, the client is supposed to compare the server's Zarr checksum with the local Zarr checksum, and if they don't match, then what? Is the client supposed to notify the server, and is this what leads to the "asking S3 for a list of all objects stored at the Zarr prefix in question" error handling step? Wouldn't it be better for the client to send the locally-computed Zarr checksum to the server in the "finalize" request in case, say, the client gets killed while waiting for finalization? Either way, it seems that a buggy or malicious client could easily report a checksum mismatch where there is none, leading to wasted cycles in communicating with S3.
I actually like this idea, as it removes the need to "hold on" to a locally computed Zarr checksum, and it also makes the error case explicit, where previously it was implicit. It would require an additional status on the Zarr archive to indicate this ("CHECKSUM MISMATCH" or something), which I think is okay.
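A minimal sketch of the proposed flow, assuming hypothetical status names (the thread only suggests "CHECKSUM MISMATCH"; the actual state names and finalize endpoint are not specified here): the client submits its locally computed checksum with the finalize request, and the server records the mismatch itself rather than relying on the client to notice and report it.

```python
from enum import Enum


class ZarrStatus(Enum):
    """Illustrative Zarr archive states; names are assumptions."""
    PENDING = "Pending"
    COMPLETE = "Complete"
    MISMATCH = "Checksum mismatch"  # the proposed new explicit error state


def finalize(server_checksum: str, client_checksum: str) -> ZarrStatus:
    """Server-side comparison at finalize time: the client sends its
    local checksum in the finalize request, so a mismatch is recorded
    even if the client dies while waiting for finalization."""
    if server_checksum == client_checksum:
        return ZarrStatus.COMPLETE
    return ZarrStatus.MISMATCH
```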
Another thing: How exactly does this handle upload cancellation? Say the client requests a presigned URL for a Zarr entry, but before the actual upload takes place, the user hits Ctrl-C. Then, the user locally deletes the Zarr entry and starts another upload. When the client fetches the list of entries already in the Zarr from the server, I assume that the cancelled entry will not be in the list, so the client won't do anything about it, but will the server's checksumming procedure assume that the entry was uploaded, leading to a checksum mismatch and fallback to the "Error handling" procedure?
The server doesn't store any information about zarr entries; it simply reads all of the entries from the storage medium (S3) during the checksumming process and produces an output. So in your scenario, the original entry for which the presigned URL was requested would not be counted in the final checksum. Generally speaking, in this design there is no concept of "upload cancelling", so if you uploaded a bunch of files that you didn't actually want included in the zarr, you'd need to use the
@AlmightyYakob
The doc implies that reading from S3 would only be done as part of error handling when there's a checksum mismatch. If the server always fetches all checksums from S3, what is done differently for error handling? |
I think for brevity, the diagram doesn't show what the worker itself does to compute the zarr checksum, since it's meant to be a view of the process from the client's perspective. But yes, the zarr checksum task and the file-listing endpoint both start by listing objects from S3 (prefixed to the zarr location in the bucket). The zarr file-listing endpoint (used for error handling) processes the returned object list so that paths appear relative to the zarr root, and returns that list of objects. The zarr checksumming task lists all the objects within the prefix from S3 and adds them to a "tree", which is then processed bottom-up.
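The bottom-up tree pass described above can be sketched as follows. This is a simplified illustration of the shape of the computation (group the flat S3 listing into a directory tree, digest leaves first, then fold child digests into parent digests); the real dandi-archive checksum format serializes entries differently, so the exact digests here are not comparable to it:

```python
import hashlib


def zarr_checksum(files: dict[str, str]) -> str:
    """files maps zarr-relative paths (as recovered from the S3 listing)
    to per-file MD5 hex digests.  Returns one aggregate hex digest
    computed bottom-up over the implied directory tree."""

    def digest(prefix: str) -> str:
        entries: list[tuple[str, str]] = []
        subdirs: set[str] = set()
        for path, md5 in files.items():
            if not path.startswith(prefix):
                continue
            rest = path[len(prefix):]
            if "/" in rest:
                # Deeper entry: remember its immediate subdirectory.
                subdirs.add(rest.split("/", 1)[0])
            else:
                entries.append((rest, md5))
        # Recurse into subdirectories, folding their digests in as entries.
        for d in subdirs:
            entries.append((d, digest(prefix + d + "/")))
        h = hashlib.md5()
        for name, dgst in sorted(entries):
            h.update(f"{name}:{dgst};".encode())
        return h.hexdigest()

    return digest("")
```

Because each directory digest depends only on its (sorted) children, adding, removing, or renaming any file anywhere under the prefix changes the root digest.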
I think the document should somewhere address uploading to a pre-existing Zarr as described in this comment. |
@jwodder This has been done in 078b544. The upload process for an existing zarr is merely a subset of the upload process for a new zarr, so I've just added a note that describes this. |
@AlmightyYakob and @jwodder - in the new schema, steps 4 and 5 do not seem to be based on batches. do we know what the performance penalty is for moving to a single-file request-upload option?
This step is still done in bulk; it's just no longer limited to the concept of an "upload batch" (where an "upload batch" means not being able to request any more presigned URLs until the current "batch" is fully uploaded). So you could request as few or as many (up to the server limit, currently 255 per request) presigned upload URLs as you'd like from the API.
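A small sketch of what this means for a client, assuming a hypothetical `client.request_upload_urls` call (the endpoint name is illustrative): chunking exists only to respect the per-request cap, and nothing forces the client to wait for one chunk's uploads to finish before requesting the next set of URLs, which is the lockstep behavior the old "upload batch" imposed.

```python
from typing import Iterator

MAX_PER_REQUEST = 255  # server-side cap on presigned URLs per request


def batched(items: list, size: int) -> Iterator[list]:
    """Yield successive chunks of at most `size` items."""
    for i in range(0, len(items), size):
        yield items[i : i + size]


# Hypothetical client loop (names are assumptions, not the actual API):
# for chunk in batched(entries, MAX_PER_REQUEST):
#     urls = client.request_upload_urls(zarr_id, chunk)
#     start_uploads(chunk, urls)  # uploads may overlap across chunks
```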
thanks @AlmightyYakob - i didn't see that mentioned in the doc. may be useful to add. |
🚀 PR was released in |
@satra, @yarikoptic, and @jwodder, please take a look at this design doc (rendered).
This simpler design should solve the remaining performance problems with Zarr upload without compromising correctness, only requiring minor changes in how the archive and the CLI coordinate to accomplish it (see the doc itself for details).