Allow user to review checksums before multipart upload completes #327
Conversation
 * WARNING: experimental/unstable
 * See `aws_s3_upload_review_fn`
 */
aws_s3_meta_request_upload_review_fn *upload_review_callback;
If we're going to support single-part uploads with the same functionality, how can we cancel the PutObject?
For a single-part upload there is only one request, and if we invoke the callback as we read through the body, that body has already been sent to the server and the request has basically finished. We would need to delete the object. Would that be overkill?
Just bringing it up because the naming is upload_review. If we're only going to support MPU for this, we should name it mpu.
My thinking was: to make it work with single-part, we'd need to very carefully fire this callback as part of the input_stream_read(), so we could prevent the final bytes from reaching the HTTP connection.
Anything's possible with computers... but yeah, it would be pretty complex, which is why I'm not tackling it in this first pass.
Or just do the checksum calculation when we initially stream data from the user, which would be way simpler, but we'd need benchmarks to prove that it doesn't degrade performance.
Codecov Report

@@            Coverage Diff             @@
##              main     #327      +/-   ##
==========================================
+ Coverage    88.87%   88.91%   +0.03%
==========================================
  Files           17       17
  Lines         4943     4969      +26
==========================================
+ Hits          4393     4418      +25
- Misses         550      551       +1
remove TODOs about naming
Also ASSERT if an error wasn't properly raised, and add a debug run to CI to try and catch any mistakes
Issue:
It is very hard for the user to do end-to-end checksum verification for multipart uploads.
A user might buffer data in memory before passing it into the S3 client. The S3 client will compute checksums on whatever data is passed in, but what if the buffered data suffers a bit flip beforehand? The S3 client would simply compute the checksum of the corrupted data.
It's easy if the object is uploaded in 1 request via PutObject. The user can simply compare their own checksum with the appropriate response header (e.g. x-amz-checksum-crc32).
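A minimal sketch of that comparison for CRC32, assuming aws-c-common and aws-c-checksums are available (the helper name and parameters below are illustrative, not part of this PR):

```c
#include <aws/checksums/crc.h>
#include <aws/common/byte_buf.h>
#include <aws/common/byte_order.h>
#include <aws/common/encoding.h>
#include <stdbool.h>
#include <string.h>

/* Illustrative helper: the x-amz-checksum-crc32 header is the base64 encoding
 * of the big-endian CRC32 of the object. Decode the header and compare it to
 * a CRC32 computed over the bytes you intended to upload. */
static bool s_putobject_crc32_matches(
    struct aws_allocator *allocator,
    const uint8_t *data,
    size_t data_len,
    struct aws_byte_cursor header_value /* value of x-amz-checksum-crc32 */) {

    uint32_t my_crc_be = aws_hton32(aws_checksums_crc32(data, (int)data_len, 0));

    struct aws_byte_buf decoded;
    aws_byte_buf_init(&decoded, allocator, 16); /* 4 decoded bytes fit easily */

    bool matches = false;
    if (aws_base64_decode(&header_value, &decoded) == AWS_OP_SUCCESS) {
        matches = decoded.len == sizeof(my_crc_be) &&
                  memcmp(decoded.buffer, &my_crc_be, sizeof(my_crc_be)) == 0;
    }
    aws_byte_buf_clean_up(&decoded);
    return matches;
}
```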
But it's hard for multipart upload, where the object's final checksum is calculated differently. The algorithm is described here. The algorithm is basically: take the checksum of all the parts' checksums concatenated together, then append "-14" (if there were 14 parts).
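A rough sketch of that composite calculation for CRC32, using aws-c-checksums' aws_checksums_crc32() (the part struct is hypothetical, and the parts are assumed to already be split exactly the way aws-c-s3 splits them; error handling omitted):

```c
#include <aws/checksums/crc.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical view of one already-split part. */
struct my_part {
    const uint8_t *data;
    size_t len;
};

static uint32_t s_composite_crc32(const struct my_part *parts, size_t part_count) {
    /* 1) CRC32 of each part, stored big-endian, concatenated together. */
    uint8_t *part_crcs = malloc(part_count * 4);
    for (size_t i = 0; i < part_count; ++i) {
        uint32_t crc = aws_checksums_crc32(parts[i].data, (int)parts[i].len, 0);
        part_crcs[i * 4 + 0] = (uint8_t)(crc >> 24);
        part_crcs[i * 4 + 1] = (uint8_t)(crc >> 16);
        part_crcs[i * 4 + 2] = (uint8_t)(crc >> 8);
        part_crcs[i * 4 + 3] = (uint8_t)(crc);
    }

    /* 2) CRC32 over the concatenated per-part CRCs. S3 reports this value
     *    base64-encoded, with a "-<part count>" suffix (e.g. "-14"). */
    uint32_t composite = aws_checksums_crc32(part_crcs, (int)(part_count * 4), 0);
    free(part_crcs);
    return composite;
}
```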
Currently, to verify a multipart upload, a user would need to reproduce the part-splitting logic of aws-c-s3 (which might change in the future), and reproduce S3's final checksum algorithm (I'm not sure whether that could ever change?).
Description of Changes:
Add an upload_review_callback allowing users to "review" the part boundaries and part checksums calculated by aws-c-s3, and cancel the multipart upload if they don't agree.
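A rough sketch of how this might be wired up. The exact callback parameters and the fields of the review struct aren't shown in this snippet, so the names below (review->part_count, review->part_array[i].checksum, the user-side helper) are illustrative only, and it's assumed here that returning a raised error from the callback is what cancels the upload:

```c
#include <aws/s3/s3_client.h>
#include <stdbool.h>

/* Hypothetical user-side check: does my own checksum for part `part_index`
 * match the one aws-c-s3 calculated? Implementation is up to the user. */
static bool s_my_checksum_matches(size_t part_index, struct aws_byte_cursor s3_checksum) {
    (void)part_index;
    (void)s3_checksum;
    return true; /* placeholder */
}

/* Review callback: iterate the parts aws-c-s3 plans to complete the MPU with,
 * and bail out if any checksum disagrees with the user's own bookkeeping. */
static int s_on_upload_review(
    struct aws_s3_meta_request *meta_request,
    const struct aws_s3_upload_review *review,
    void *user_data) {

    (void)meta_request;
    (void)user_data;

    for (size_t i = 0; i < review->part_count; ++i) {
        if (!s_my_checksum_matches(i, review->part_array[i].checksum)) {
            return aws_raise_error(AWS_ERROR_UNKNOWN); /* assumed to cancel the upload */
        }
    }
    return AWS_OP_SUCCESS;
}

/* Registering it on the meta-request options: */
struct aws_s3_meta_request_options options = {
    .type = AWS_S3_META_REQUEST_TYPE_PUT_OBJECT,
    /* ... message, signing config, etc. ... */
    .upload_review_callback = s_on_upload_review,
};
```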
Design Considerations:
This design isn't perfect, so it's marked "experimental/unstable" for now.
If you're using CRC32, there's a trick where you can combine per-part checksums to get the whole-object checksum. The customer asking for this feature is taking that approach, so we're providing this to unblock them.
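For illustration, zlib exposes that combine operation directly via crc32_combine(). This sketch assumes the per-part CRCs and lengths come from the user's own bookkeeping (or from this callback), and that there is at least one part:

```c
#include <stddef.h>
#include <zlib.h>

/* Combine per-part CRC32s into the whole-object CRC32 without re-reading the
 * data. crc32_combine(crcA, crcB, lenB) returns the CRC of A concatenated
 * with B. Assumes part_count >= 1. */
static unsigned long s_whole_object_crc32(
    const unsigned long *part_crcs, const size_t *part_lens, size_t part_count) {

    unsigned long combined = part_crcs[0];
    for (size_t i = 1; i < part_count; ++i) {
        combined = crc32_combine(combined, part_crcs[i], (z_off_t)part_lens[i]);
    }
    return combined;
}
```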
But other algorithms (e.g. SHA-256) don't allow this trick. The user would need to re-read all their data, calculating a checksum per part based on the part boundaries provided by this callback, plus a whole-object checksum to compare against their original. Or the user would need to reproduce aws-c-s3's part-splitting logic as they stream the data the first time; then, from this callback, they could confirm that their part-splitting logic matches ours. But if aws-c-s3's logic ever changed, they'd need to rewrite their code to adjust.
My first instinct was to let the user provide the per-part checksums, via a callback that occurred after streaming each part's data. But I realized that, to truly guarantee end-to-end integrity, the user would still need to anticipate the part boundaries and reproduce aws-c-s3's part-splitting logic. Or the user would need to keep each part's data buffered until that checksum callback occurred and re-scan it then. Also, a per-part callback based on recent reads would be hard to fit into a possible future where we read data for multiple parts in parallel.
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.