# Zarr Upload Process Simplification

Due to the size of Zarr archives, and specifically the number of small files
they comprise, our existing design for the Zarr upload process violates some
implicit assumptions, leading to poor performance both for Zarr uploads
themselves and for the overall system. This proposal simplifies the process to
achieve high performance while improving data integrity guarantees and
preserving algorithmic correctness.

The proposed improvement revolves around eliminating the upload batching
bottleneck and replacing its functions with other, more performant actions
elsewhere in the process.

## Current Process

To summarize the current state of Zarr upload, refer to this sequence diagram:

```mermaid
sequenceDiagram
autonumber
participant S3
participant Client as dandi-cli
participant Server
participant Worker
Client->>+Server: Create a Zarr Archive
Server-->>-Client: PENDING Zarr Archive
loop for each batch of <= 255 files
Client->>+Server: Create an Upload w/ list of file paths/etags
Server-->>-Client: An Upload with a set of signed URLs
loop for each file
Client->>+S3: Upload individual file using signed URL
end
Client->>+Server: Finalize Upload (verify file presence and etag matching) 🐢 🐢 🐢
Server-->>-Client: PENDING Zarr Archive
end
Client->>+Server: Ingest Zarr Archive
rect rgb(179, 209, 95)
Server->>+Worker: Compute tree checksum for Zarr Archive
end
loop
Client->>+Server: Check for COMPLETE Zarr Archive
Server-->>-Client: PENDING/INGESTING/COMPLETE Zarr Archive
end
```

The process is as follows.

(Steps 1 and 2): `dandi-cli` asks the server to create a new Zarr archive, which is put into the `PENDING` state.

(Steps 3 and 4): For each batch of (at most) 255 Zarr chunk files the client wants to upload, `dandi-cli` asks the server to create an `Upload`, supplying the list of file paths and associated etags and receiving a list of signed upload URLs.

(Step 5): `dandi-cli` uses these URLs to upload the files in that batch.

(Steps 6 and 7): Then, `dandi-cli` asks the server to finalize the batch, and the server does so, matching etags and verifying that all files were uploaded. *This step is very costly, due to the server's need to contact S3 to verify these conditions.*
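
To make the cost concrete, the sketch below shows roughly what this per-object verification amounts to, assuming `boto3` and a hypothetical record of `{key, etag}` pairs for the batch; it illustrates the pattern, not the actual `dandi-archive` code.

```python
import boto3
from botocore.exceptions import ClientError

s3 = boto3.client("s3")

def verify_batch(bucket: str, files: list[dict]) -> list[str]:
    """Return the keys whose object is missing or whose ETag does not match.

    One HEAD request is issued per object in the batch, which is where the
    slowness of the finalize step comes from.
    """
    bad_keys = []
    for f in files:  # f is a hypothetical {"key": ..., "etag": ...} record
        try:
            head = s3.head_object(Bucket=bucket, Key=f["key"])
        except ClientError:
            bad_keys.append(f["key"])  # object never arrived
            continue
        if head["ETag"].strip('"') != f["etag"]:
            bad_keys.append(f["key"])  # bytes differ from what the client declared
    return bad_keys
```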

(Step 8): When all batches are uploaded, `dandi-cli` signals the server to ingest the Zarr archive.

(Step 9): The server kicks off an asynchronous task to compute the treewise Zarr checksum.

(Steps 10 and 11): Meanwhile, `dandi-cli` sits in a loop checking for the Zarr archive to reach a `COMPLETE` state.

### Performance Problem: Slow and Serial Upload Finalization

The major performance problem with the existing design is the *slow* and
*serial* upload batch finalize operation (step 6). The slowness comes from
sending a `HEAD` request for every S3 object in the upload batch to verify its
existence and check its etag against that of the file the client sent.

Furthermore, the need to finalize a batch before starting a new one makes the
verification of existence and checksums an essentially serial process. Together,
these factors mean that uploading a Zarr archive with a large number of files
takes a substantial amount of time.

The proposed design solves these flaws while still providing data integrity and
existence guarantees.

## Proposed Process

The proposed process is largely the same as the existing process; in the
following description, changed steps will be listed in **bold**.

```mermaid
sequenceDiagram
autonumber
participant S3
participant Client as dandi-cli
participant Server
participant Worker
Client->>+Client: Compute zarr checksum
Client->>+Server: Create a Zarr Archive
Server-->>-Client: PENDING Zarr Archive
loop for each file
Client->>+Server: Request signed URLs
Server-->>-Client: A list of signed URLs
Client->>+S3: Upload individual file using signed URL
end
Client->>+Server: Finalize Zarr Archive (w/ local checksum) ⚡⚡⚡
Server-->>-Client: PENDING Zarr Archive
rect rgb(179, 209, 95)
Server->>+Worker: Compute tree checksum for Zarr Archive, <br/> compare against provided checksum.
end
loop
Client->>+Server: Check for COMPLETE Zarr Archive
Server-->>-Client: PENDING/INGESTING/COMPLETE/MISMATCH Zarr Archive
end
```

(Step 1): **`dandi-cli` computes the Zarr checksum locally**. (Note that this step can actually be taken any time before step 7; it is listed here arbitrarily.)
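
For illustration, a minimal local treewise checksum might look like the sketch below; the actual algorithm lives in the shared checksum library discussed later and differs in detail, so treat this only as a picture of the bottom-up aggregation.

```python
import hashlib
from pathlib import Path

def file_md5(path: Path) -> str:
    """MD5 of one chunk file, read in blocks to bound memory use."""
    h = hashlib.md5()
    with path.open("rb") as f:
        for block in iter(lambda: f.read(1 << 20), b""):
            h.update(block)
    return h.hexdigest()

def tree_checksum(directory: Path) -> str:
    """Combine each child's digest (file or subdirectory) into one digest."""
    parts = []
    for child in sorted(directory.iterdir(), key=lambda p: p.name):
        digest = tree_checksum(child) if child.is_dir() else file_md5(child)
        parts.append(f"{child.name}:{digest}")
    return hashlib.md5("\n".join(parts).encode()).hexdigest()

# local_checksum = tree_checksum(Path("my_data.zarr"))
```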

(Steps 2 and 3): `dandi-cli` asks the server to create a new Zarr archive, which is put into the `PENDING` state.

(Steps 4 and 5): **`dandi-cli` will request a presigned upload URL from the server for each Zarr chunk file**.
Important notes:
* For an existing Zarr archive, this is where the upload process begins, as requesting a signed URL for upload always places the Zarr archive back into the `PENDING` state.
* While there is no longer an explicit concept of an "upload batch", there is still a maximum number of presigned upload URLs that can be returned from a single request. This number is currently 255.

(Step 6): `dandi-cli` uploads the files to these URLs, **using S3's `Content-MD5` header to verify each uploaded file's integrity**. **Instead of finalizing a batch (since the batch concept no longer exists), `dandi-cli` repeats these steps until all files are uploaded (repeating steps 4, 5, and 6).** (Note that `dandi-cli`'s actual strategy here may be more nuanced than the simple loop depicted above; it might instead maintain a queue of files and a set of files "in flight", replenishing them according to some dynamic batching strategy, etc. In any such strategy, some combination of steps 4, 5, and 6 repeats until all files are uploaded.)
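
A rough sketch of this client-side loop is shown below. The endpoint path, request payload, and response fields are hypothetical stand-ins for the real API, and the presigned URLs are assumed to have been generated so that a `Content-MD5` header is accepted; the point is only that S3 itself rejects any PUT whose body does not match the declared MD5.

```python
import base64
import hashlib
from pathlib import Path

import requests

API = "https://api.example.org/api/zarr"  # hypothetical base URL
MAX_URLS_PER_REQUEST = 255

def upload_zarr_files(zarr_id: str, root: Path, paths: list[str]) -> None:
    """Request presigned URLs in groups of at most 255 and upload each file,
    letting S3 verify integrity via the Content-MD5 header."""
    for start in range(0, len(paths), MAX_URLS_PER_REQUEST):
        group = paths[start : start + MAX_URLS_PER_REQUEST]
        # Hypothetical endpoint returning one presigned URL per requested path.
        resp = requests.post(
            f"{API}/{zarr_id}/files/", json=[{"path": p} for p in group]
        )
        resp.raise_for_status()
        for entry in resp.json():  # assumed shape: {"path": ..., "upload_url": ...}
            data = (root / entry["path"]).read_bytes()
            md5_b64 = base64.b64encode(hashlib.md5(data).digest()).decode()
            # S3 rejects the PUT if the body's MD5 does not match this header.
            put = requests.put(
                entry["upload_url"], data=data, headers={"Content-MD5": md5_b64}
            )
            put.raise_for_status()
```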

(Steps 7 and 8): **`dandi-cli` asks the server to finalize the Zarr, providing the locally computed Zarr checksum to compare against**.

(Step 9): The server kicks off an asynchronous task to compute the treewise Zarr checksum, comparing it against the checksum provided in step 7 and updating the status of the Zarr once it's finished.
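
A sketch of what that task could look like, assuming Celery and using a flat digest over the S3 listing as a stand-in for the real treewise algorithm (for single-part uploads the ETag is the object's MD5); persisting the resulting state on the Zarr record is elided.

```python
import hashlib

import boto3
from celery import shared_task

@shared_task
def ingest_zarr_archive(bucket: str, prefix: str, client_checksum: str) -> str:
    """Recompute a digest from the objects under the Zarr's S3 prefix and
    compare it with the checksum dandi-cli supplied at finalize time."""
    s3 = boto3.client("s3")
    h = hashlib.md5()
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            etag = obj["ETag"].strip('"')  # MD5 for single-part uploads
            h.update(f'{obj["Key"]}:{etag}\n'.encode())
    server_checksum = h.hexdigest()
    # The real task would also persist COMPLETE/MISMATCH on the Zarr record.
    return "COMPLETE" if server_checksum == client_checksum else "MISMATCH"
```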

(Steps 10 and 11): Meanwhile, `dandi-cli` sits in a loop checking for the Zarr archive to reach either the `COMPLETE` state, or if there was an error, the `MISMATCH` state.
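
The polling loop itself is straightforward; a minimal version (with a hypothetical status endpoint and response field) might be:

```python
import time

import requests

def wait_for_zarr(api: str, zarr_id: str, timeout: float = 600, interval: float = 5) -> str:
    """Poll until the Zarr leaves its transient states or the timeout expires."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        status = requests.get(f"{api}/{zarr_id}/").json()["status"]  # hypothetical field
        if status in ("COMPLETE", "MISMATCH"):
            return status
        time.sleep(interval)
    raise TimeoutError(f"Zarr {zarr_id} was still ingesting after {timeout} seconds")
```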

### Benefits

Note that in this approach:

1. S3 takes on the burden of verifying that the file arrived in the bucket
correctly, saving on long-latency `HEAD` requests to S3 after the fact.
2. Batches of uploaded files do not need to be serially verified before a new
batch of uploads can begin.
3. A final check for full Zarr archive integrity is performed by comparing
checksums rather than individual files, enabling a fast and low-latency check
before the Zarr archive is put into the `COMPLETE` state.

Additionally, the proposed process is dramatically simpler than the existing
one, which means fewer errors can creep into the codebase and its runtime
behavior, and it is less prone to performance issues because it avoids the need
for, e.g., database locks.

The required changes to the software are:

- In `dandi-archive`:
- the logic for verifying upload batches will be removed;
- the logic for computing checksums will be generalized to work on both S3
objects and local files (see the sketch after this list); and
- that logic will be refactored into a shared Python library.
- In `dandi-cli`:
- the local Zarr checksum needs to be computed (using the shared Python
library for doing so);
- (optionally) a new strategy for managing a set of signed URLs needs to be
implemented (though the current strategy of managing statically sized
batches will work); and
- a comparison between the server-computed and locally-computed checksums
needs to be made.
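
One way to picture that generalization: define a small protocol that both sides can implement, so the same checksum code consumes either a local directory (in `dandi-cli`) or an S3 listing (in `dandi-archive`). The names below are illustrative, and the flat digest again stands in for the real treewise algorithm.

```python
import hashlib
from pathlib import Path
from typing import Iterator, Protocol

class ZarrFileSource(Protocol):
    """Anything that can enumerate chunk files as (relative_path, md5_hexdigest)."""
    def iter_files(self) -> Iterator[tuple[str, str]]: ...

class LocalDirectorySource:
    """dandi-cli side: walk a local Zarr directory on disk."""
    def __init__(self, root: Path) -> None:
        self.root = root

    def iter_files(self) -> Iterator[tuple[str, str]]:
        for path in sorted(p for p in self.root.rglob("*") if p.is_file()):
            yield str(path.relative_to(self.root)), hashlib.md5(path.read_bytes()).hexdigest()

def compute_checksum(source: ZarrFileSource) -> str:
    """Checksum over any source; an S3-backed source on the dandi-archive side
    would implement the same protocol from bucket listings."""
    h = hashlib.md5()
    for rel_path, digest in source.iter_files():
        h.update(f"{rel_path}:{digest}\n".encode())
    return h.hexdigest()
```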

## Error Handling

In the overwhelming majority of cases, the proposed process will result in a
successful Zarr upload. This is due to a combination of S3's data integrity
guarantees (verifying that all uploaded bytes arrived intact) and a full
checksum comparison between the client's and server's versions of the uploaded
Zarr (guaranteeing that all required files were uploaded to the correct
locations).

In the rare case that something does go wrong (indicated by a non-matching Zarr
checksum in the final step of the process), the algorithm for recovery involves
the client asking the API for the list of all objects currently stored in the
Zarr in question, then comparing that list with the local list of Zarr chunk files
for both existence and matching checksum. Any files with an incorrect checksum can
be removed from S3 and then reuploaded; any missing files can simply be uploaded.
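
A sketch of that comparison, assuming the server's listing has been fetched as a `{path: md5}` mapping (for single-part uploads the S3 ETag is the MD5); the actual listing endpoint and the delete/re-upload calls are omitted.

```python
import hashlib
from pathlib import Path

def plan_recovery(local_root: Path, remote: dict[str, str]) -> tuple[list[str], list[str]]:
    """Return (mismatched, missing): paths whose remote checksum is wrong and
    paths absent from the remote listing. Mismatched files are deleted and
    re-uploaded; missing files are simply uploaded."""
    mismatched, missing = [], []
    for path in (p for p in local_root.rglob("*") if p.is_file()):
        rel = str(path.relative_to(local_root))
        local_md5 = hashlib.md5(path.read_bytes()).hexdigest()
        if rel not in remote:
            missing.append(rel)
        elif remote[rel] != local_md5:
            mismatched.append(rel)
    return mismatched, missing
```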

A variant of this algorithm can be used for resuming an interrupted upload
process: skip the checksum comparison and only look for files that have not yet
been uploaded. Other variants can be used to, e.g., upload a locally changed
Zarr, etc.

It is worth repeating: this error handling mechanism is expected to be needed
only rarely, but it can be run at any time to correct whatever errors do arise.

## Execution

Only small changes are needed to the `dandi-cli` codebase to bring it in line
with this proposal. Some of the work in `dandi-archive` has already been done,
and the rest is on par with the changes needed for `dandi-cli`. If all concerns
with the plan are allayed, it should not be difficult to execute the plan and
realize a significant performance gain for Zarr upload.
