-
Notifications
You must be signed in to change notification settings - Fork 10
Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
Merge pull request #1415 from dandi/zarr-overhaul-dd
- Loading branch information
Showing
1 changed file
with
202 additions
and
0 deletions.
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,202 @@ | ||
# Zarr Upload Process Simplification | ||
|
||
Due to the size of Zarr archives, and specifically the number of small files | ||
they comprise, our existing design for the Zarr upload process fails to adhere | ||
to some implicit assumptions that lead to poor performance, both of Zarr uploads | ||
themselves, and the overall system. This proposal simplifies the process to gain | ||
high performance while improving data integrity guaraentees and preserving | ||
algorithmic correctness. | ||
|
||
The proposed improvement revolves around eliminating the upload batching | ||
bottleneck and replacing its functions with other, more performant actions | ||
elsewhere in the process. | ||
|
||
## Current Process | ||
|
||
To summarize the current state of Zarr upload, refer to this sequence diagram: | ||
|
||
```mermaid | ||
sequenceDiagram | ||
autonumber | ||
participant S3 | ||
participant Client as dandi-cli | ||
participant Server | ||
participant Worker | ||
Client->>+Server: Create a Zarr Archive | ||
Server-->>-Client: PENDING Zarr Archive | ||
loop for each batch of <= 255 files | ||
Client->>+Server: Create an Upload w/ list of file paths/etags | ||
Server-->>-Client: An Upload with a set of signed URLs | ||
loop for each file | ||
Client->>+S3: Upload individual file using signed URL | ||
end | ||
Client->>+Server: Finalize Upload (verify file presence and etag matching) 🐢 🐢 🐢 | ||
Server-->>-Client: PENDING Zarr Archive | ||
end | ||
Client->>+Server: Ingest Zarr Archive | ||
rect rgb(179, 209, 95) | ||
Server->>+Worker: Compute tree checksum for Zarr Archive | ||
end | ||
loop | ||
Client->>+Server: Check for COMPLETE Zarr Archive | ||
Server-->>-Client: PENDING/INGESTING/COMPLETE Zarr Archive | ||
end | ||
``` | ||
|
||
The process is as follows. | ||
|
||
(Steps 1 and 2): `dandi-cli` asks the server to create a new Zarr archive, which is put into the `PENDING` state. | ||
|
||
(Steps 3 and 4): For each batch of (maxiumum) 255 Zarr chunk files the client wants to upload, `dandi-cli` asks the server to create an `Upload`, supplying the list of file paths and associated etags, and receiving a list of signed upload URLs. | ||
|
||
(Step 5): `dandi-cli` uses these URLs to upload the files in that batch. | ||
|
||
(Steps 6 and 7): Then, `dandi-cli` asks the server to finalize the batch, and the server does so, matching etags and verifiying that all files were uploaded. *This step is very costly, due to the server's need to contact S3 to verify these conditions.* | ||
|
||
(Step 8): When all batches are uploaded, `dandi-cli` signals the server to ingest the Zarr archive. | ||
|
||
(Step 9): The server kicks off an asynchronous task to compute the treewise Zarr checksum. | ||
|
||
(Steps 10 and 11): Meanwhile, `dandi-cli` sits in a loop checking for the Zarr archive to reach a `COMPLETE` state. | ||
|
||
### Performance Problem: Slow and Serial Upload Finalization | ||
|
||
The major performance problem with the existing design is the *slow* and | ||
*serial* upload batch finalize operation (step 6). The slowness comes from | ||
sending a `HEAD` request for every S3 object in the upload batch to verify its | ||
existence and check its etag against that of the file the client sent. | ||
|
||
Furthermore, the need to finalize a batch before starting a new one makes the | ||
verification of existence and checksum an essentially serial process. Together, | ||
this means that Zarr archives with a large number of files will consume quite a | ||
bit of time. | ||
|
||
The proposed design solves these flaws while still providing data integrity and | ||
existence guarantees. | ||
|
||
## Proposed Process | ||
|
||
The proposed process is largely the same as the existing process; in the | ||
following description, changed steps will be listed in **bold**. | ||
|
||
```mermaid | ||
sequenceDiagram | ||
autonumber | ||
participant S3 | ||
participant Client as dandi-cli | ||
participant Server | ||
participant Worker | ||
Client->>+Client: Compute zarr checksum | ||
Client->>+Server: Create a Zarr Archive | ||
Server-->>-Client: PENDING Zarr Archive | ||
loop for each file | ||
Client->>+Server: Request signed URLs | ||
Server-->>-Client: A list of signed URLs | ||
Client->>+S3: Upload individual file using signed URL | ||
end | ||
Client->>+Server: Finalize Zarr Archive (w/ local checksum) ⚡⚡⚡ | ||
Server-->>-Client: PENDING Zarr Archive | ||
rect rgb(179, 209, 95) | ||
Server->>+Worker: Compute tree checksum for Zarr Archive, <br/> compare against provided checksum. | ||
end | ||
loop | ||
Client->>+Server: Check for COMPLETE Zarr Archive | ||
Server-->>-Client: PENDING/INGESTING/COMPLETE/MISMATCH Zarr Archive | ||
end | ||
``` | ||
|
||
(Step 1): **`dandi-cli` computes the Zarr checksum locally**. (Note that this step can actually be taken any time before step 7; it is listed here arbitrarily.) | ||
|
||
(Steps 2 and 3): `dandi-cli` asks the server to create a new Zarr archive, which is put into the `PENDING` state. | ||
|
||
(Steps 4 and 5): **`dandi-cli` will request a presigned upload URL from the server for each Zarr chunk file**. | ||
Important notes: | ||
* For an existing zarr archive, this is where the upload process begins, as requesting a signed url for upload will always place the zarr archive into a `PENDING` state. | ||
* While there is no longer an explicit concept of an "upload batch", there is still a maximum number of presigned upload URLs that can be returned from a single request. This number is currently 255. | ||
|
||
(Step 6): `dandi-cli` uses these URLs to upload the files **using S3's `Content-MD5` header to verify the uploaded file's integrity**. **Instead of finalizing a batch (since there is no longer a batch concept), `dandi-cli` repeats these steps until all files are uploaded (repeating steps 4, 5, and 6).** (Note that `dandi-cli`'s actual strategy here may be more nuanced than a simple loop as depicted above; instead, it might maintain a queue of files and a set of files "in flight", replenishing them according to some dynamic batching strategy, etc. In any such strategy, some combination of steps 4, 5, and 6 will repeat until all files are uploaded.) | ||
|
||
(Steps 7 and 8): **`dandi-cli` asks the server to finalize the Zarr, providing the locally computed zarr checksum to compare against**. | ||
|
||
(Step 9): The server kicks off an asynchronous task to compute the treewise Zarr checksum, comparing it against the checksum provided in step 7 and updating the status of the Zarr once it's finished. | ||
|
||
(Steps 10 and 11): Meanwhile, `dandi-cli` sits in a loop checking for the Zarr archive to reach either the `COMPLETE` state, or if there was an error, the `MISMATCH` state. | ||
|
||
### Benefits | ||
|
||
Note that in this approach: | ||
|
||
1. S3 takes on the burden of verifying that the file arrived in the bucket | ||
correctly, saving on long-latency `HEAD` requests to S3 after the fact. | ||
2. Batches of uploaded files do not need to be serially verified before a new | ||
batch of uploads can begin. | ||
3. A final check for full Zarr archive integrity is performed by comparing | ||
checksums rather than individual files, enabling a fast and low-latency check | ||
before the Zarr archive is put into the `COMPLETE` state. | ||
|
||
Additionally, the proposed process is dramatically simpler than the existing | ||
one, which means fewer errors can creep into the codebase and runtime execution, | ||
and it is less prone to performance issues by avoiding the need for, e.g., | ||
database locks etc. | ||
|
||
The required changes to the software are: | ||
|
||
- In `dandi-archive`: | ||
- the logic for verifying upload batches will be removed; | ||
- the logic for computing checksums will be generalized to work on both S3 | ||
objects and local files; and | ||
- that logic will be refactored into a shared Python library. | ||
- In `dandi-cli`: | ||
- the local Zarr checksum needs to be computed (using the shared Python | ||
library for doing so); | ||
- (optionally) a new strategy for managing a set of signed URLs needs to be | ||
implemented (though the current strategy of managing statically sized | ||
batches will work); and | ||
- a comparison between the server-computed and locally-computed checksums | ||
needs to be made. | ||
|
||
## Error Handling | ||
|
||
In the overwhelming majority of cases, the proposed process will result in a | ||
successful Zarr upload. This is due to a combination of using S3's data | ||
integrity guarantees (to verify that all uploaded bytes arrived properly) and | ||
verifying a full checksum on the client and server's version of the uploaded | ||
Zarr (to guarantee that all required files were uploaded to the correct | ||
locations). | ||
|
||
In the rare case that something does go wrong (indicated by a non-matching Zarr | ||
checksum in the final step of the process), the algorithm for recovery involves | ||
the client asking the API for the list of all objects currently stored in the | ||
Zarr in question, then comparing that list with the local list of Zarr chunk files | ||
for both existence and matching checksum. Any files with an incorrect checksum can | ||
be removed from S3 and then reuploaded; any missing files can simply be uploaded. | ||
|
||
A variant of this algorithm can be used for resuming an interrupted upload | ||
process: skip the checksum comparison and only look for files that have not yet | ||
been uploaded. Other variants can be used to, e.g., upload a locally changed | ||
Zarr, etc. | ||
|
||
It is worth repeating: the need for this error handling mechanism is generally | ||
expected to almost never be needed; but it is possible to run this algorithm to | ||
correct any errors that do arise. | ||
|
||
## Execution | ||
|
||
Only small changes are needed to the `dandi-cli` codebase to bring it in line | ||
with this proposal. Some of the work in `dandi-archive` has already been done, | ||
and the rest is on par with the changes needed for `dandi-cli`. If all concerns | ||
with the plan are allayed, then it should not be difficult to execute the plan | ||
and gain significant performance for Zarr upload. |