Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow user to review checksums before multipart upload completes #327

Merged
merged 12 commits into from
Jul 8, 2023

Conversation

graebm
Copy link
Contributor

@graebm graebm commented Jul 6, 2023

Issue:
It is very hard for the user to do end-to-end checksum verification for multipart uploads.

A user might buffer data in memory before passing it into the S3 client. The S3 client will compute checksums on whatever data is passed in, but what if this buffered data suffers a bit flip before hand? The S3 client would simply compute the checksum of the corrupted data.

It's easy if the object is uploaded in 1 request via PutObject. The user can simply compare their own checksum with the appropriate response header (e.g. x-amz-checksum-crc32).

But it's hard for multipart upload, where the object's final checksum is calculated differently. The algorithm is described here. The algorithm is basically: checksum of all parts' checksums concatenated together, then add "-14" (if there were 14 parts).

Currently, to verify multipart upload a user would need to reproduce the part-splitting logic of aws-c-s3 (which might change in the future), and reproduce S3's final checksum algorithm (I'm not sure if this could ever change?).

Description of Changes:
Add a review_upload_callback allowing users to "review" the part boundaries and part checksums calculated by aws-c-s3, and cancel the multipart upload if they don't agree.

Design Considerations:
This design isn't perfect, so it's marked "experimental/unstable" for now.

If you're using CRC32, there's a trick where you can combine part-part checksums to get the whole-object checksum. The customer asking for this feature is taking this approach, so we're providing this to unblock them.

But other algorithms (e.g. SHA-256) don't allow this trick. The user would need to re-review all their data, calculating a checksum per part based on the part-boundaries provide by this callback, plus a whole-object checksum to compare against their original. Or the user would need to reproduce aws-c-s3's part-splitting logic as they stream data the first time, them from this callback they could confirm that their part-splitting logic matches ours. But if aws-c-s3's logic ever changed, they'd need to re-write their code to adjust.

My first instinct was to let the user provide the per-part checksums, via a callback that occurred after streaming data from each part. But I realized that, to truly guarantee end-to-end integrity, the user would still need to anticipate the part boundaries and reproduce aws-c-s3's part splitting logic. Or the user would need to keep each part's data buffered until this checksum callback occurred and re-scan it then. Also, this part-part callback based on recent reads would be hard to fit in a possible future where we read data for multiple parts in parallel.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

* WARNING: experimental/unstable
* See `aws_s3_upload_review_fn`
*/
aws_s3_meta_request_upload_review_fn *upload_review_callback;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we going to support single part for the same functionality, how can we cancel the put?

For a single part, there will only be request, and if we invoke the callback as we read through the body, which has already been sent to the server and request has basically finished. We will need to delete the object. Will that be an overkill?

Just bring it up as the naming is upload_review. If we only gonna support MPU for this, we should name it as mpu

Copy link
Contributor Author

@graebm graebm Jul 6, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

my think was: to make it work with singlepart, we'd need to very carefully fire this callback as part of the input_stream_read(), so we could prevent the final bytes from reaching the HTTP connection

anything's possible with computers ... but yeah it would be pretty complex is why I'm not tackling it in this 1st pass

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or just do checksum calculation when we initially stream data from the user, which would be way simpler, but we'd need benchmarks to prove that it doesn't degrade performance

@codecov-commenter
Copy link

codecov-commenter commented Jul 7, 2023

Codecov Report

Merging #327 (316886e) into main (7c34328) will increase coverage by 0.03%.
The diff coverage is 100.00%.

❗ Current head 316886e differs from pull request most recent head 548a10d. Consider uploading reports for the commit 548a10d to get more accurate results

Additional details and impacted files

Impacted file tree graph

@@            Coverage Diff             @@
##             main     #327      +/-   ##
==========================================
+ Coverage   88.87%   88.91%   +0.03%     
==========================================
  Files          17       17              
  Lines        4943     4969      +26     
==========================================
+ Hits         4393     4418      +25     
- Misses        550      551       +1     
Impacted Files Coverage Δ
source/s3_auto_ranged_put.c 92.44% <100.00%> (+0.28%) ⬆️
source/s3_meta_request.c 93.34% <100.00%> (+<0.01%) ⬆️

... and 1 file with indirect coverage changes

source/s3_auto_ranged_put.c Show resolved Hide resolved
Also ASSERT if an error wasn't properly raised, and add a debug run to CI to try and catch any mistakes
@graebm graebm merged commit bc9c8b2 into main Jul 8, 2023
@graebm graebm deleted the checksum-review branch July 8, 2023 03:48
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants