Allow user to review checksums before multipart upload completes #327
Conversation
 * WARNING: experimental/unstable
 * See `aws_s3_upload_review_fn`
 */
aws_s3_meta_request_upload_review_fn *upload_review_callback;
If we're going to support single-part uploads with the same functionality, how can we cancel the PutObject?
For a single-part upload there is only one request, and if we invoke the callback as we read through the body, that body has already been sent to the server and the request has basically finished. We would need to delete the object. Would that be overkill?
Just bringing it up because the naming is upload_review. If we're only going to support MPU for this, we should name it mpu.
My thinking was: to make it work with single-part, we'd need to very carefully fire this callback as part of the input_stream_read(), so we could prevent the final bytes from reaching the HTTP connection.
Anything's possible with computers... but yeah, it would be pretty complex, which is why I'm not tackling it in this first pass.
Or just do the checksum calculation when we initially stream data from the user, which would be way simpler, but we'd need benchmarks to prove that it doesn't degrade performance.
Codecov Report

@@            Coverage Diff             @@
##              main     #327      +/-   ##
==========================================
+ Coverage    88.87%   88.91%   +0.03%
==========================================
  Files           17       17
  Lines         4943     4969      +26
==========================================
+ Hits          4393     4418      +25
- Misses         550      551       +1
remove TODOs about naming
Also ASSERT if an error wasn't properly raised, and add a debug run to CI to try and catch any mistakes
Issue:
It is very hard for the user to do end-to-end checksum verification for multipart uploads.
A user might buffer data in memory before passing it into the S3 client. The S3 client will compute checksums on whatever data is passed in, but what if the buffered data suffers a bit flip beforehand? The S3 client would simply compute the checksum of the corrupted data.
It's easy if the object is uploaded in 1 request via PutObject. The user can simply compare their own checksum with the appropriate response header (e.g. x-amz-checksum-crc32).
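A minimal sketch of that comparison for CRC32, assuming aws-c-common and aws-c-checksums are available (the helper name and parameters below are illustrative, not part of this PR):

```c
#include <aws/checksums/crc.h>
#include <aws/common/byte_buf.h>
#include <aws/common/byte_order.h>
#include <aws/common/encoding.h>
#include <stdbool.h>
#include <string.h>

/* Illustrative helper: the x-amz-checksum-crc32 header is the base64 encoding
 * of the big-endian CRC32 of the object. Decode the header and compare it to
 * a CRC32 computed over the bytes you intended to upload. */
static bool s_putobject_crc32_matches(
    struct aws_allocator *allocator,
    const uint8_t *data,
    size_t data_len,
    struct aws_byte_cursor header_value /* value of x-amz-checksum-crc32 */) {

    uint32_t my_crc_be = aws_hton32(aws_checksums_crc32(data, (int)data_len, 0));

    struct aws_byte_buf decoded;
    aws_byte_buf_init(&decoded, allocator, 16); /* 4 decoded bytes fit easily */

    bool matches = false;
    if (aws_base64_decode(&header_value, &decoded) == AWS_OP_SUCCESS) {
        matches = decoded.len == sizeof(my_crc_be) &&
                  memcmp(decoded.buffer, &my_crc_be, sizeof(my_crc_be)) == 0;
    }
    aws_byte_buf_clean_up(&decoded);
    return matches;
}
```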
But it's hard for multipart upload, where the object's final checksum is calculated differently. The algorithm is described here. The algorithm is basically: take the checksum of all the parts' checksums concatenated together, then append "-14" (if there were 14 parts).
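A rough sketch of that composite calculation for CRC32, using aws-c-checksums' aws_checksums_crc32() (the part struct is hypothetical, and the parts are assumed to already be split exactly the way aws-c-s3 splits them; error handling omitted):

```c
#include <aws/checksums/crc.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical view of one already-split part. */
struct my_part {
    const uint8_t *data;
    size_t len;
};

static uint32_t s_composite_crc32(const struct my_part *parts, size_t part_count) {
    /* 1) CRC32 of each part, stored big-endian, concatenated together. */
    uint8_t *part_crcs = malloc(part_count * 4);
    for (size_t i = 0; i < part_count; ++i) {
        uint32_t crc = aws_checksums_crc32(parts[i].data, (int)parts[i].len, 0);
        part_crcs[i * 4 + 0] = (uint8_t)(crc >> 24);
        part_crcs[i * 4 + 1] = (uint8_t)(crc >> 16);
        part_crcs[i * 4 + 2] = (uint8_t)(crc >> 8);
        part_crcs[i * 4 + 3] = (uint8_t)(crc);
    }

    /* 2) CRC32 over the concatenated per-part CRCs. S3 reports this value
     *    base64-encoded, with a "-<part count>" suffix (e.g. "-14"). */
    uint32_t composite = aws_checksums_crc32(part_crcs, (int)(part_count * 4), 0);
    free(part_crcs);
    return composite;
}
```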
Currently, to verify a multipart upload, a user would need to reproduce the part-splitting logic of aws-c-s3 (which might change in the future), and reproduce S3's final checksum algorithm (I'm not sure whether that could ever change?).
Description of Changes:
Add an upload_review_callback allowing users to "review" the part boundaries and part checksums calculated by aws-c-s3, and cancel the multipart upload if they don't agree.
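A rough sketch of how this might be wired up. The exact callback parameters and the fields of the review struct aren't shown in this snippet, so the names below (review->part_count, review->part_array[i].checksum, the user-side helper) are illustrative only, and it's assumed here that returning a raised error from the callback is what cancels the upload:

```c
#include <aws/s3/s3_client.h>
#include <stdbool.h>

/* Hypothetical user-side check: does my own checksum for part `part_index`
 * match the one aws-c-s3 calculated? Implementation is up to the user. */
static bool s_my_checksum_matches(size_t part_index, struct aws_byte_cursor s3_checksum) {
    (void)part_index;
    (void)s3_checksum;
    return true; /* placeholder */
}

/* Review callback: iterate the parts aws-c-s3 plans to complete the MPU with,
 * and bail out if any checksum disagrees with the user's own bookkeeping. */
static int s_on_upload_review(
    struct aws_s3_meta_request *meta_request,
    const struct aws_s3_upload_review *review,
    void *user_data) {

    (void)meta_request;
    (void)user_data;

    for (size_t i = 0; i < review->part_count; ++i) {
        if (!s_my_checksum_matches(i, review->part_array[i].checksum)) {
            return aws_raise_error(AWS_ERROR_UNKNOWN); /* assumed to cancel the upload */
        }
    }
    return AWS_OP_SUCCESS;
}

/* Registering it on the meta-request options: */
struct aws_s3_meta_request_options options = {
    .type = AWS_S3_META_REQUEST_TYPE_PUT_OBJECT,
    /* ... message, signing config, etc. ... */
    .upload_review_callback = s_on_upload_review,
};
```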
Design Considerations:
This design isn't perfect, so it's marked "experimental/unstable" for now.
If you're using CRC32, there's a trick where you can combine per-part checksums to get the whole-object checksum. The customer asking for this feature is taking that approach, so we're providing this to unblock them.
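For illustration, zlib exposes that combine operation directly via crc32_combine(). This sketch assumes the per-part CRCs and lengths come from the user's own bookkeeping (or from this callback), and that there is at least one part:

```c
#include <stddef.h>
#include <zlib.h>

/* Combine per-part CRC32s into the whole-object CRC32 without re-reading the
 * data. crc32_combine(crcA, crcB, lenB) returns the CRC of A concatenated
 * with B. Assumes part_count >= 1. */
static unsigned long s_whole_object_crc32(
    const unsigned long *part_crcs, const size_t *part_lens, size_t part_count) {

    unsigned long combined = part_crcs[0];
    for (size_t i = 1; i < part_count; ++i) {
        combined = crc32_combine(combined, part_crcs[i], (z_off_t)part_lens[i]);
    }
    return combined;
}
```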
But other algorithms (e.g. SHA-256) don't allow this trick. The user would need to re-read all their data, calculating a checksum per part based on the part boundaries provided by this callback, plus a whole-object checksum to compare against their original. Or the user would need to reproduce aws-c-s3's part-splitting logic as they stream the data the first time; then, from this callback, they could confirm that their part-splitting logic matches ours. But if aws-c-s3's logic ever changed, they'd need to rewrite their code to adjust.
My first instinct was to let the user provide the per-part checksums, via a callback that occurred after streaming each part's data. But I realized that, to truly guarantee end-to-end integrity, the user would still need to anticipate the part boundaries and reproduce aws-c-s3's part-splitting logic. Or the user would need to keep each part's data buffered until that checksum callback occurred and re-scan it then. Also, a per-part callback based on recent reads would be hard to fit into a possible future where we read data for multiple parts in parallel.
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.