From 5498fd9f3f8b89ef23caed189092d476cd88efe6 Mon Sep 17 00:00:00 2001
From: Roni Choudhury
Date: Mon, 19 Dec 2022 13:19:59 -0500
Subject: [PATCH 1/6] Add design doc
---
 doc/design/zarr-performance-redesign.md | 206 ++++++++++++++++++++++++
 1 file changed, 206 insertions(+)
 create mode 100644 doc/design/zarr-performance-redesign.md

diff --git a/doc/design/zarr-performance-redesign.md b/doc/design/zarr-performance-redesign.md
new file mode 100644
index 000000000..e016cf277
--- /dev/null
+++ b/doc/design/zarr-performance-redesign.md
@@ -0,0 +1,206 @@
+# Zarr Upload Process Simplification
+
+Due to the size of Zarr archives, and specifically the number of small files
+they comprise, our existing design for the Zarr upload process violates some
+implicit assumptions, leading to poor performance both of Zarr uploads
+themselves and of the overall system. This proposal simplifies the process to
+achieve high performance while improving data integrity guarantees and
+preserving algorithmic correctness.
+
+The proposed improvement revolves around eliminating the upload batching
+bottleneck and replacing its functions with other, more performant actions
+elsewhere in the process.
+
+## Current Process
+
+To summarize the current state of Zarr upload, refer to this sequence diagram:
+
+```mermaid
+sequenceDiagram
+    autonumber
+
+    participant S3
+    participant Client as dandi-cli
+    participant Server
+    participant Worker
+
+    Client->>+Server: Create a Zarr Archive
+    Server-->>-Client: PENDING Zarr Archive
+
+    loop for each batch of <= 255 files
+        Client->>+Server: Create an Upload w/ list of file paths/etags
+        Server-->>-Client: An Upload with a set of signed URLs
+        loop for each file
+            Client->>+S3: Upload individual file using signed URL
+        end
+        Client->>+Server: Finalize Upload (verify file presence and etag matching) 🐢 🐢 🐢
+        Server-->>-Client: PENDING Zarr Archive
+    end
+
+    Client->>+Server: Ingest Zarr Archive
+
+    rect rgb(179, 209, 95)
+        Server->>+Worker: Compute tree checksum for Zarr Archive
+    end
+
+    loop
+        Client->>+Server: Check for COMPLETE Zarr Archive
+        Server-->>-Client: PENDING/INGESTING/COMPLETE Zarr Archive
+    end
+```
+
+The process is as follows. `dandi-cli` asks the server to create a new Zarr
+archive, which is put into the `PENDING` state (steps 1 and 2). For each batch
+of (maximum) 255 Zarr chunk files the client wants to upload, `dandi-cli` asks
+the server to create an `Upload`, supplying the list of file paths and
+associated etags, and receiving a list of signed upload URLs (steps 3 and 4).
+`dandi-cli` uses these URLs to upload the files in that batch (step 5). Then,
+`dandi-cli` asks the server to finalize the batch, and the server does so,
+matching etags and verifying that all files were uploaded (steps 6 and 7).
+*This step is very costly, due to the server's need to contact S3 to verify
+these conditions.* When all batches are uploaded, `dandi-cli` signals the server
+to ingest the Zarr archive (step 8). The server kicks off an asynchronous task
+to compute the treewise Zarr checksum (step 9). Meanwhile, `dandi-cli` sits in a
+loop checking for the Zarr archive to reach a `COMPLETE` state (steps 10 and
+11).
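To make the batching bottleneck concrete, the control flow above can be sketched in Python. This is a minimal illustrative sketch, not `dandi-cli`'s actual code: `create_upload`, `put_file`, and `finalize_upload` are hypothetical stand-ins for the HTTP calls to the server and S3; only the 255-file batch limit comes from the protocol itself.

```python
from itertools import islice

BATCH_SIZE = 255  # server-imposed maximum number of files per Upload

def batches(paths, size=BATCH_SIZE):
    """Yield successive batches of at most `size` file paths."""
    it = iter(paths)
    while batch := list(islice(it, size)):
        yield batch

def upload_zarr_batched(paths, create_upload, put_file, finalize_upload):
    """Drive the current (batched) protocol.

    create_upload(batch) returns signed URLs (steps 3-4), put_file(url, path)
    uploads one chunk file (step 5), and finalize_upload() is the slow call
    during which the server HEADs every object in the batch (steps 6-7).
    """
    for batch in batches(paths):
        urls = create_upload(batch)
        for url, path in zip(urls, batch):
            put_file(url, path)
        finalize_upload()  # serial bottleneck: blocks the next batch
```

Because `finalize_upload` must complete before the next batch's `create_upload`, an archive with 100,000 chunk files incurs roughly 400 serial finalize round trips.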
+
+### Performance Problem: Slow and Serial Upload Finalization
+
+The major performance problem with the existing design is the *slow* and
+*serial* upload batch finalize operation (step 6). The slowness comes from
+sending a `HEAD` request for every S3 object in the upload batch to verify its
+existence and check its etag against that of the file the client sent.
+
+Furthermore, the need to finalize a batch before starting a new one makes the
+verification of existence and checksum an essentially serial process. Together,
+this means that uploading a Zarr archive with a large number of files takes an
+unacceptably long time.
+
+The proposed design solves these flaws while still providing data integrity and
+existence guarantees.
+
+## Proposed Process
+
+The proposed process is largely the same as the existing process; in the
+following description, changed steps will be listed in **bold**.
+
+```mermaid
+sequenceDiagram
+    autonumber
+
+    participant S3
+    participant Client as dandi-cli
+    participant Server
+    participant Worker
+
+    Client->>+Client: Compute zarr checksum
+
+    Client->>+Server: Create a Zarr Archive
+    Server-->>-Client: PENDING Zarr Archive
+
+    loop for each file
+        Client->>+Server: Request signed URL
+        Server-->>-Client: A signed URL
+        Client->>+S3: Upload individual file using signed URL
+    end
+
+    Client->>+Server: Finalize Zarr Archive ⚡⚡⚡
+    Server-->>-Client: PENDING Zarr Archive
+
+    rect rgb(179, 209, 95)
+        Server->>+Worker: Compute tree checksum for Zarr Archive
+    end
+
+    loop
+        Client->>+Server: Check for COMPLETE Zarr Archive
+        Server-->>-Client: PENDING/INGESTING/COMPLETE Zarr Archive
+    end
+
+    Client->>+Client: Verify zarr checksum with local
+```
+
+**`dandi-cli` computes the Zarr checksum locally** (step 1). (Note that this
+step can actually be taken any time before step 12; it is listed here
+arbitrarily.) `dandi-cli` asks the server to create a new Zarr archive, which is
+put into the `PENDING` state (steps 2 and 3).
**`dandi-cli` will request a
+presigned upload URL from the server for each Zarr chunk file.** (setps 4 and
+5). `dandi-cli` uses these URLs to upload the files **using S3's `Content-MD5`
+header to verify the uploaded file's integrity** (step 6). **Instead of
+finalizing a batch (since there is no longer a batch concept), `dandi-cli`
+repeats these steps until all files are uploaded (repeating steps 4, 5, and
+6).** (Note that `dandi-cli`'s actual strategy here may be more nuanced than a
+simple loop as depicted above; instead, it might maintain a queue of files and a
+set of files "in flight", replenishing them according to some dynamic batching
+strategy, etc. In any such strategy, some combination of steps 4, 5, and 6 will
+repeat until all files are uploaded.) **`dandi-cli` asks the server to finalize
+the Zarr** (steps 7 and 8). The server kicks off an asynchronous task to compute
+the treewise Zarr checksum (step 9). Meanwhile, `dandi-cli` sits in a loop
+checking for the Zarr archive to reach a `COMPLETE` state (steps 10 and 11).
+**`dandi-cli` compares its locally computed checksum (from step 1) with the
+checksum recorded in the Zarr archive** (step 12).
+
+### Benefits
+
+Note that in this approach:
+
+1. S3 takes on the burden of verifying that the file arrived in the bucket
+   correctly, saving on long-latency `HEAD` requests to S3 after the fact.
+2. Batches of uploaded files do not need to be serially verified before a new
+   batch of uploads can begin.
+3. A final check for full Zarr archive integrity is performed by comparing
+   checksums rather than individual files, enabling a fast, low-cost check
+   before the Zarr archive is put into the `COMPLETE` state.
+
+Additionally, the proposed process is dramatically simpler than the existing
+one, which leaves fewer places for errors to creep into the codebase and its
+runtime behavior, and it avoids performance hazards such as database locks.
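The `Content-MD5` integrity check that replaces the server-side `HEAD` verification can be sketched as follows. S3 expects the header to be the base64-encoded (not hex) MD5 digest of the request body and refuses the `PUT` with a `BadDigest` error if the uploaded bytes do not match it; the `put` callable below is a stand-in for an HTTP client call such as `requests.put`, not part of the actual `dandi-cli` API.

```python
import base64
import hashlib

def content_md5(data: bytes) -> str:
    """Base64-encoded MD5 digest of the payload, as S3 expects in Content-MD5."""
    return base64.b64encode(hashlib.md5(data).digest()).decode("ascii")

def upload_chunk(put, signed_url: str, data: bytes):
    """Upload one Zarr chunk file through its presigned URL.

    If the bytes arriving at S3 do not hash to the declared MD5, S3 refuses
    to store the object, so no post-hoc HEAD verification is needed.
    """
    return put(signed_url, data=data, headers={"Content-MD5": content_md5(data)})
```

For example, `content_md5(b"")` yields `"1B2M2Y8AsgTpgAmY7PhCfg=="`, the well-known base64 form of the empty MD5 digest.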
+ +The required changes to the software are: + +- In `dandi-archive`: + - the logic for verifying upload batches will be removed; + - the logic for computing checksums will be generalized to work on both S3 + objects and local files; and + - that logic will be refactored into a shared Python library. +- In `dandi-cli`: + - the local Zarr checksum needs to be computed (using the shared Python + library for doing so); + - (optionally) a new strategy for managing a set of signed URLs needs to be + implemented (though the current strategy of managing statically sized + batches will work); and + - a comparison between the server-computed and locally-computed checksums + needs to be made. + +## Error Handling + +In the overwhelming majority of cases, the proposed process will result in a +successful Zarr upload. This is due to a combination of using S3's data +integrity guarantees (to verify that all uploaded bytes arrived properly) and +verifying a full checksum on the client and server's version of the uploaded +Zarr (to guarantee that all required files were uploaded to the correct +locations). + +In the rare case that something does go wrong (indicated by a non-matching Zarr +checksum in the final step of the process), the algorithm for recovery involves +asking S3 for a list of all objects stored at the Zarr prefix in question, then +comparing that list with the local list of Zarr chunk files for both existence +and matching checksum. Any files with an incorrect checksum can be removed from +S3 and then reuploaded; any missing files can simply be uploaded. + +A variant of this algorithm can be used for resuming an interrupted upload +process: skip the checksum comparison and only look for files that have not yet +been uploaded. Other variants can be used to, e.g., upload a locally changed +Zarr, etc. 
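The recovery algorithm described above amounts to a set comparison between the local chunk listing and the remote object listing. A minimal sketch, assuming both sides are available as `path -> checksum` mappings (obtaining them from disk and from S3 is elided):

```python
def plan_recovery(local: dict, remote: dict):
    """Return (missing, mismatched) relative to the local Zarr.

    `missing` are chunk paths absent from S3, which simply need uploading;
    `mismatched` are present but with a wrong checksum, and must be deleted
    from S3 and then re-uploaded.
    """
    missing = sorted(set(local) - set(remote))
    mismatched = sorted(p for p in local.keys() & remote.keys()
                        if local[p] != remote[p])
    return missing, mismatched

def plan_resume(local: dict, remote: dict):
    """Variant for resuming an interrupted upload: skip the checksum
    comparison and upload only the files not yet present remotely."""
    return sorted(set(local) - set(remote))
```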
It is worth repeating: this error handling mechanism is expected to be needed
+only rarely; but it is possible to run this algorithm to correct any errors
+that do arise.
+
+## Execution
+
+Only small changes are needed to the `dandi-cli` codebase to bring it in line
+with this proposal. Some of the work in `dandi-archive` has already been done,
+and the rest is on par with the changes needed for `dandi-cli`. If all concerns
+with the plan are allayed, then it should not be difficult to execute the plan
+and realize a significant performance gain for Zarr upload.

From e148c8045f072499e02519bcc9b3857c9834061e Mon Sep 17 00:00:00 2001
From: Jacob Nesbitt
Date: Wed, 11 Jan 2023 12:45:59 -0500
Subject: [PATCH 2/6] Update description of zarr upload error case
---
 doc/design/zarr-performance-redesign.md | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/doc/design/zarr-performance-redesign.md b/doc/design/zarr-performance-redesign.md
index e016cf277..ebccaf43f 100644
--- a/doc/design/zarr-performance-redesign.md
+++ b/doc/design/zarr-performance-redesign.md
@@ -183,10 +183,10 @@ locations).
 
 In the rare case that something does go wrong (indicated by a non-matching Zarr
 checksum in the final step of the process), the algorithm for recovery involves
-asking S3 for a list of all objects stored at the Zarr prefix in question, then
-comparing that list with the local list of Zarr chunk files for both existence
-and matching checksum. Any files with an incorrect checksum can be removed from
-S3 and then reuploaded; any missing files can simply be uploaded.
+the client asking the API for the list of all objects currently stored in the
+Zarr in question, then comparing that list with the local list of Zarr chunk files
+for both existence and matching checksum. Any files with an incorrect checksum can
+be removed from S3 and then reuploaded; any missing files can simply be uploaded.
A variant of this algorithm can be used for resuming an interrupted upload process: skip the checksum comparison and only look for files that have not yet From 244528a11390e56c12e8d6fcf41f9d7b002090a5 Mon Sep 17 00:00:00 2001 From: Jacob Nesbitt Date: Wed, 11 Jan 2023 13:33:17 -0500 Subject: [PATCH 3/6] Update zarr upload finalization --- doc/design/zarr-performance-redesign.md | 20 ++++++++------------ 1 file changed, 8 insertions(+), 12 deletions(-) diff --git a/doc/design/zarr-performance-redesign.md b/doc/design/zarr-performance-redesign.md index ebccaf43f..89eadc1df 100644 --- a/doc/design/zarr-performance-redesign.md +++ b/doc/design/zarr-performance-redesign.md @@ -104,26 +104,24 @@ sequenceDiagram Client->>+S3: Upload individual file using signed URL end - Client->>+Server: Finalize Zarr Archive ⚡⚡⚡ + Client->>+Server: Finalize Zarr Archive (w/ local checksum) ⚡⚡⚡ Server-->>-Client: PENDING Zarr Archive rect rgb(179, 209, 95) - Server->>+Worker: Compute tree checksum for Zarr Archive + Server->>+Worker: Compute tree checksum for Zarr Archive,
compare against provided checksum. end loop Client->>+Server: Check for COMPLETE Zarr Archive - Server-->>-Client: PENDING/INGESTING/COMPLETE Zarr Archive + Server-->>-Client: PENDING/INGESTING/COMPLETE/MISMATCH Zarr Archive end - - Client->>+Client: Verify zarr checksum with local ``` **`dandi-cli` computes the Zarr checksum locally** (step 1). (Note that this -step can actually be taken any time before step 12; it is listed here +step can actually be taken any time before step 7; it is listed here arbitrarily.) `dandi-cli` asks the server to create a new Zarr archive, which is put into the `PENDING` state (steps 2 and 3). **`dandi-cli` will request a -presigned upload URL from the server for each Zarr chunk file.** (setps 4 and +presigned upload URL from the server for each Zarr chunk file.** (steps 4 and 5). `dandi-cli` uses these URLs to upload the files **using S3's `Content-MD5` header to verify the uploaded file's integrity** (step 6). **Instead of finalizing a batch (since there is no longer a batch concept), `dandi-cli` @@ -133,11 +131,9 @@ simple loop as depicted above; instead, it might maintain a queue of files and a set of files "in flight", replenishing them according to some dynamic batching strategy, etc. In any such strategy, some combination of steps 4, 5, and 6 will repeat until all files are uploaded.) **`dandi-cli` asks the server to finalize -the Zarr** (steps 7 and 8). The server kicks off an asynchronous task to compute -the treewise Zarr checksum (step 9). Meanwhile, `dandi-cli` sits in a loop -checking for the Zarr archive to reach a `COMPLETE` state (steps 10 and 11). -**`dandi-cli` compares its locally computed checksum (from step 1) with the -checksum recorded in the Zarr archive** (step 12). +the Zarr, providing the locally computed zarr checksum to compare against** (steps 7 and 8). 
The server kicks off an asynchronous task to compute
+the treewise Zarr checksum, comparing it against the checksum provided in step 7 and updating the status of the Zarr once it's finished (step 9). Meanwhile, `dandi-cli` sits in a loop
+checking for the Zarr archive to reach either the `COMPLETE` state, or if there was an error, the `MISMATCH` state (steps 10 and 11).
 
 ### Benefits
 
From 078b544875c0cf03860f02f455f36fd04e8af4d1 Mon Sep 17 00:00:00 2001
From: Jacob Nesbitt
Date: Thu, 12 Jan 2023 12:52:28 -0500
Subject: [PATCH 4/6] Explicitly mention existing zarrs
---
 doc/design/zarr-performance-redesign.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/doc/design/zarr-performance-redesign.md b/doc/design/zarr-performance-redesign.md
index 89eadc1df..c23a120a8 100644
--- a/doc/design/zarr-performance-redesign.md
+++ b/doc/design/zarr-performance-redesign.md
@@ -121,8 +121,8 @@
 step can actually be taken any time before step 7; it is listed here
 arbitrarily.) `dandi-cli` asks the server to create a new Zarr archive, which is
 put into the `PENDING` state (steps 2 and 3). **`dandi-cli` will request a
-presigned upload URL from the server for each Zarr chunk file.** (steps 4 and
-5). `dandi-cli` uses these URLs to upload the files **using S3's `Content-MD5`
+presigned upload URL from the server for each Zarr chunk file** (steps 4 and
+5). Note: For an existing zarr archive, this is where the upload process begins, as requesting a signed url for upload will always place the zarr archive into a `PENDING` state. `dandi-cli` uses these URLs to upload the files **using S3's `Content-MD5`
 header to verify the uploaded file's integrity** (step 6).
**Instead of
finalizing a batch (since there is no longer a batch concept), `dandi-cli`
repeats these steps until all files are uploaded (repeating steps 4, 5, and

From 2a0bc5e0acc027a908d37393b255cb00e12ef81e Mon Sep 17 00:00:00 2001
From: Jacob Nesbitt
Date: Thu, 12 Jan 2023 13:03:02 -0500
Subject: [PATCH 5/6] Reformat process descriptions
---
 doc/design/zarr-performance-redesign.md | 59 ++++++++++++-------------
 1 file changed, 28 insertions(+), 31 deletions(-)

diff --git a/doc/design/zarr-performance-redesign.md b/doc/design/zarr-performance-redesign.md
index c23a120a8..4a72dd4eb 100644
--- a/doc/design/zarr-performance-redesign.md
+++ b/doc/design/zarr-performance-redesign.md
@@ -49,20 +49,21 @@ sequenceDiagram
 end
 ```
 
-The process is as follows. `dandi-cli` asks the server to create a new Zarr
-archive, which is put into the `PENDING` state (steps 1 and 2). For each batch
-of (maximum) 255 Zarr chunk files the client wants to upload, `dandi-cli` asks
-the server to create an `Upload`, supplying the list of file paths and
-associated etags, and receiving a list of signed upload URLs (steps 3 and 4).
-`dandi-cli` uses these URLs to upload the files in that batch (step 5). Then,
-`dandi-cli` asks the server to finalize the batch, and the server does so,
-matching etags and verifying that all files were uploaded (steps 6 and 7).
-*This step is very costly, due to the server's need to contact S3 to verify
-these conditions.* When all batches are uploaded, `dandi-cli` signals the server
-to ingest the Zarr archive (step 8). The server kicks off an asynchronous task
-to compute the treewise Zarr checksum (step 9). Meanwhile, `dandi-cli` sits in a
-loop checking for the Zarr archive to reach a `COMPLETE` state (steps 10 and
-11).
+The process is as follows.
+
+(Steps 1 and 2): `dandi-cli` asks the server to create a new Zarr archive, which is put into the `PENDING` state.
(Steps 3 and 4): For each batch of (maximum) 255 Zarr chunk files the client wants to upload, `dandi-cli` asks the server to create an `Upload`, supplying the list of file paths and associated etags, and receiving a list of signed upload URLs.
+
+(Step 5): `dandi-cli` uses these URLs to upload the files in that batch.
+
+(Steps 6 and 7): Then, `dandi-cli` asks the server to finalize the batch, and the server does so, matching etags and verifying that all files were uploaded. *This step is very costly, due to the server's need to contact S3 to verify these conditions.*
+
+(Step 8): When all batches are uploaded, `dandi-cli` signals the server to ingest the Zarr archive.
+
+(Step 9): The server kicks off an asynchronous task to compute the treewise Zarr checksum.
+
+(Steps 10 and 11): Meanwhile, `dandi-cli` sits in a loop checking for the Zarr archive to reach a `COMPLETE` state.
 
 ### Performance Problem: Slow and Serial Upload Finalization
 
@@ -117,23 +118,19 @@ sequenceDiagram
 end
 ```
 
-**`dandi-cli` computes the Zarr checksum locally** (step 1). (Note that this
-step can actually be taken any time before step 7; it is listed here
-arbitrarily.) `dandi-cli` asks the server to create a new Zarr archive, which is
-put into the `PENDING` state (steps 2 and 3). **`dandi-cli` will request a
-presigned upload URL from the server for each Zarr chunk file** (steps 4 and
-5). Note: For an existing zarr archive, this is where the upload process begins, as requesting a signed url for upload will always place the zarr archive into a `PENDING` state. `dandi-cli` uses these URLs to upload the files **using S3's `Content-MD5`
-header to verify the uploaded file's integrity** (step 6).
**Instead of -finalizing a batch (since there is no longer a batch concept), `dandi-cli` -repeats these steps until all files are uploaded (repeating steps 4, 5, and -6).** (Note that `dandi-cli`'s actual strategy here may be more nuanced than a -simple loop as depicted above; instead, it might maintain a queue of files and a -set of files "in flight", replenishing them according to some dynamic batching -strategy, etc. In any such strategy, some combination of steps 4, 5, and 6 will -repeat until all files are uploaded.) **`dandi-cli` asks the server to finalize -the Zarr, providing the locally computed zarr checksum to compare against** (steps 7 and 8). The server kicks off an asynchronous task to compute -the treewise Zarr checksum, comparing it against the checksum provided in step 7 and updating the status of the Zarr once it's finished (step 9). Meanwhile, `dandi-cli` sits in a loop -checking for the Zarr archive to reach either the `COMPLETE` state, or if there was an error, the `MISMATCH` state (steps 10 and 11). +(Step 1): **`dandi-cli` computes the Zarr checksum locally**. (Note that this step can actually be taken any time before step 7; it is listed here arbitrarily.) + +(Steps 2 and 3): `dandi-cli` asks the server to create a new Zarr archive, which is put into the `PENDING` state. + +(Steps 4 and 5): **`dandi-cli` will request a presigned upload URL from the server for each Zarr chunk file**. (Note: For an existing zarr archive, this is where the upload process begins, as requesting a signed url for upload will always place the zarr archive into a `PENDING` state). + +(Step 6): `dandi-cli` uses these URLs to upload the files **using S3's `Content-MD5` header to verify the uploaded file's integrity**. 
**Instead of finalizing a batch (since there is no longer a batch concept), `dandi-cli` repeats these steps until all files are uploaded (repeating steps 4, 5, and 6).** (Note that `dandi-cli`'s actual strategy here may be more nuanced than a simple loop as depicted above; instead, it might maintain a queue of files and a set of files "in flight", replenishing them according to some dynamic batching strategy, etc. In any such strategy, some combination of steps 4, 5, and 6 will repeat until all files are uploaded.)
+
+(Steps 7 and 8): **`dandi-cli` asks the server to finalize the Zarr, providing the locally computed Zarr checksum to compare against**.
+
+(Step 9): The server kicks off an asynchronous task to compute the treewise Zarr checksum, comparing it against the checksum provided in step 7 and updating the status of the Zarr once it's finished.
+
+(Steps 10 and 11): Meanwhile, `dandi-cli` sits in a loop checking for the Zarr archive to reach either the `COMPLETE` state or, if there was an error, the `MISMATCH` state.
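The wait in steps 10 and 11 is a plain polling loop over the Zarr's status field; a minimal sketch, where `get_status` is a hypothetical stand-in for the relevant GET request and the state names are taken from the diagram above:

```python
import time

TERMINAL_STATES = {"COMPLETE", "MISMATCH"}

def wait_for_ingest(get_status, poll_interval=2.0, max_wait=600.0):
    """Poll until the Zarr archive reaches a terminal state.

    Returns 'COMPLETE' on success, or 'MISMATCH' if the server-computed
    checksum disagreed with the one supplied at finalization; raises
    TimeoutError if neither happens within `max_wait` seconds.
    """
    deadline = time.monotonic() + max_wait
    while time.monotonic() < deadline:
        status = get_status()  # e.g. PENDING, INGESTING, COMPLETE, MISMATCH
        if status in TERMINAL_STATES:
            return status
        time.sleep(poll_interval)
    raise TimeoutError("Zarr ingest did not reach a terminal state")
```

On a `MISMATCH` result, the client would fall back to the error-handling algorithm described later in this document.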
### Benefits From 1f121a927eeec15e1b5e52a887dcbe22b727a351 Mon Sep 17 00:00:00 2001 From: Jacob Nesbitt Date: Thu, 12 Jan 2023 17:06:58 -0500 Subject: [PATCH 6/6] Clarify presigned URL requests --- doc/design/zarr-performance-redesign.md | 9 ++++++--- 1 file changed, 6 insertions(+), 3 deletions(-) diff --git a/doc/design/zarr-performance-redesign.md b/doc/design/zarr-performance-redesign.md index 4a72dd4eb..a733a1804 100644 --- a/doc/design/zarr-performance-redesign.md +++ b/doc/design/zarr-performance-redesign.md @@ -100,8 +100,8 @@ sequenceDiagram Server-->>-Client: PENDING Zarr Archive loop for each file - Client->>+Server: Request signed URL - Server-->>-Client: A signed URL + Client->>+Server: Request signed URLs + Server-->>-Client: A list of signed URLs Client->>+S3: Upload individual file using signed URL end @@ -122,7 +122,10 @@ sequenceDiagram (Steps 2 and 3): `dandi-cli` asks the server to create a new Zarr archive, which is put into the `PENDING` state. -(Steps 4 and 5): **`dandi-cli` will request a presigned upload URL from the server for each Zarr chunk file**. (Note: For an existing zarr archive, this is where the upload process begins, as requesting a signed url for upload will always place the zarr archive into a `PENDING` state). +(Steps 4 and 5): **`dandi-cli` will request a presigned upload URL from the server for each Zarr chunk file**. +Important notes: +* For an existing zarr archive, this is where the upload process begins, as requesting a signed url for upload will always place the zarr archive into a `PENDING` state. +* While there is no longer an explicit concept of an "upload batch", there is still a maximum number of presigned upload URLs that can be returned from a single request. This number is currently 255. (Step 6): `dandi-cli` uses these URLs to upload the files **using S3's `Content-MD5` header to verify the uploaded file's integrity**. 
**Instead of finalizing a batch (since there is no longer a batch concept), `dandi-cli` repeats these steps until all files are uploaded (repeating steps 4, 5, and 6).** (Note that `dandi-cli`'s actual strategy here may be more nuanced than a simple loop as depicted above; instead, it might maintain a queue of files and a set of files "in flight", replenishing them according to some dynamic batching strategy, etc. In any such strategy, some combination of steps 4, 5, and 6 will repeat until all files are uploaded.)