S3 sync and cp commands should have a flag to show the local (and remote) file hashes #6631
Comments
Hi @ITmaze, thanks for reaching out. Have you looked into the S3 documentation on using the Content-MD5 header? This premium support article gives a good high-level summary of using the Content-MD5 header to verify the integrity of an object uploaded to S3: https://aws.amazon.com/premiumsupport/knowledge-center/data-integrity-s3/ And this CLI documentation goes into further detail: https://docs.aws.amazon.com/cli/latest/topic/s3-faq.html#cli-aws-help-s3-faq There was also some discussion on this topic here in another issue: #2585
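For reference, a minimal sketch of the Content-MD5 approach those links describe, assuming a single (non-multipart) upload; the bucket, key, and file names are placeholders:

```bash
# Sketch only: bucket, key and file names are placeholders.
# Compute the base64-encoded MD5 of the local file and pass it to S3,
# which rejects the upload if the bytes it receives don't hash to that value.
md5_b64=$(openssl md5 -binary ./backup.tar | base64)

aws s3api put-object \
    --bucket my-example-bucket \
    --key backups/backup.tar \
    --body ./backup.tar \
    --content-md5 "$md5_b64"
```

If the check fails, S3 should return a BadDigest error rather than storing the object.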
Hi @tim-finnigan, thank you. I have seen those pages, but even using the information contained within them gives me spurious results, to the point of raising a case with AWS support, who also point me at the same documents and essentially tell me to RTFM. I created this feature request when it occurred to me that all of this edge-case detection and multi-step hash construction is unnecessary, since the CLI already does the correct hash calculation for each case - that's how it verifies that the upload was complete. I'm just asking for a way to surface both sides of that process, both the source and the target files, so I can check whether the source and target files are the same without needing to upload another 2TB of data.
Hi @ITmaze, thanks for following up. I understand your point about wanting to ensure that your upload was successful. But the documentation mentioned earlier notes that the CLI will retry validating uploads up to 5 times and then exit if unsuccessful. And in regard to the request to provide hash data, this was addressed in this comment from #2585:
But there is an older open feature request that mentions these topics: #599. I'm going to close this because of the overlap with #599, but please leave a comment there if you want to mention anything else regarding this request. You could also consider posting in the new re:Post forums to get more input from the S3 community.
Hi @tim-finnigan, we seem to have misunderstood each other. I'm not talking about adding any hashes anywhere. The ETag already contains the information we're looking for. It's visible in raw XML when you use the debug flag for an upload. What I'm asking for is to DISPLAY the hashes for both source and target, since they already exist within the code and are actively used to verify that the upload was completed. What I'm trying to determine is whether, after the upload has completed, the file system is the same as the S3 bucket. You assert that it retries validation up to 5 times and then exits. I'm trying to determine if the files I'm looking at locally are the same as those that are stored remotely, using the calculation that's already built into the CLI. We're literally talking about adding a flag and two printf statements.
Hi @ITmaze, thanks for clarifying that, and sorry if I misunderstood. I'm saying that, based on the documentation, you can assume successfully uploaded files should match your local files. Have you looked into using s3api to get the ETag? Here is an example: https://docs.aws.amazon.com/cli/latest/reference/s3api/head-object.html#examples
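A minimal example along the lines of that head-object documentation, again with placeholder bucket and key names:

```bash
# Bucket and key are placeholders; --query pulls out just the ETag field.
aws s3api head-object \
    --bucket my-example-bucket \
    --key backups/backup.tar \
    --query ETag \
    --output text
```

Note the value comes back wrapped in quotes, and for objects uploaded as multipart it typically ends in -N, where N is the number of parts.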
Hi @tim-finnigan, at the time of upload, sure, perhaps. What about an hour later? How do I check the hash of the local file against that of the uploaded one without writing a whole process that does the exact same thing as the CLI does? Not only that, if the CLI behavior changes - if the default part size changes, for example - any code I write has to accommodate that. On top of that, from a resource perspective, I've now wasted a week on this matter. You've spent time on it, the AWS support engineers have spent time on it, and between us we've collectively spent several thousand dollars on a problem that recurs for anyone doing more than casual uploading of objects to S3. Sorry to be blunt, but given the numerous posts on this matter, going back YEARS, this feature request is in my professional opinion a no-brainer, and I say that with 40 years of software development experience. I'm not sure what the pushback is being driven by, but it doesn't make any sense to me in any way.
Hi @ITmaze, sorry to hear about your frustration. We can discuss this more to try and get on the same page. I want to highlight this ETag documentation: https://docs.aws.amazon.com/AmazonS3/latest/API/RESTCommonResponseHeaders.html, specifically:
So the ETag can’t be considered a reliable way to verify the integrity of uploads. But that’s what the Content-MD5 header is for. I think what you’re asking for may be more closely aligned with this open feature request: aws/aws-sdk#89
Hi @tim-finnigan, that's requesting the exact same thing, but in the API. I'm pointing out that all this has been done INSIDE the CLI ALREADY! All that has to happen is to print it out.
Hi @ITmaze, just wanted to help clarify a few points. Multipart uploads are generally used for s3 sync and cp. The default chunk size is 8MB and the minimum is 5MB. (source) The AWS CLI will calculate and auto-populate the Content-MD5 header for both standard and multipart uploads (standard uploads using the PutObject API and multipart using the UploadPart API). And that is what is recommended in the API documentation:
But the overall validation happens server-side, using a calculation involving the combined hashes. The CLI does not verify the whole, assembled file. Generally speaking, the CLI isn't doing anything special here, just what S3 provides. (For more information on the multipart upload process please refer to this documentation.) And another thing worth highlighting from the ETag description mentioned before is:
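To make the multipart mechanics described above concrete: in practice the ETag S3 reports for a multipart upload is the MD5 of the concatenated binary MD5 digests of the parts, suffixed with the part count. That is observed behavior rather than a documented guarantee, which is presumably why the documentation warns against relying on the ETag. A rough bash sketch of reproducing it locally, assuming the CLI's default 8 MB chunk size and threshold (no multipart_chunksize override) and the md5sum and xxd utilities:

```bash
#!/usr/bin/env bash
# Rough sketch: reproduce the ETag S3 would report for a file uploaded by the
# AWS CLI with the default 8 MB chunk size. Observed behavior, not a guarantee.
set -euo pipefail

file="$1"
chunk=$((8 * 1024 * 1024))
size=$(wc -c < "$file")

# Files below the multipart threshold (assumed here to equal the chunk size)
# go up as a single PutObject, so the ETag is just the plain MD5.
if (( size < chunk )); then
    md5sum "$file" | awk '{print $1}'
    exit 0
fi

parts=$(( (size + chunk - 1) / chunk ))
digests=$(mktemp)

# Concatenate the binary MD5 digest of each 8 MB part...
for ((i = 0; i < parts; i++)); do
    dd if="$file" bs="$chunk" skip="$i" count=1 2>/dev/null \
        | md5sum | awk '{print $1}' | xxd -r -p >> "$digests"
done

# ...then MD5 that concatenation and append the part count.
echo "$(md5sum "$digests" | awk '{print $1}')-$parts"
rm -f "$digests"
```

If a transfer used a non-default multipart_chunksize, the same calculation applies with that part size instead.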
Is your feature request related to a problem? Please describe.
When you use aws s3 sync to copy a local directory to S3, the CLI calculates each object's hash locally before sending it together with the object to S3 - either as a single object or as a multipart upload. After the upload has succeeded, the hash is stored as an ETag on the object. You can retrieve the ETag from the object by adding the --debug flag and manually extracting it from the XML, but you cannot get the CLI to output the hash for the local file.

Describe the solution you'd like
Ultimately it would be extremely helpful if you could compare the hash of a local file with that of the remote object using the same method the AWS CLI itself uses. If the two don't match, you could then remove the object from S3 and try again.
Describe alternatives you've considered
Right now all you can do is attempt to calculate the hash locally. There are a few scripts that purport to calculate the value correctly, for example one for OSX (with a Linux version below it) at https://gist.github.com/emersonf/7413337, which appears to work for some files but not for others. It's unclear if this is due to a failed upload or a failed hash calculation. The hashes that differ are for some, but not all, files that are 1.6 MB and smaller.
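For what it's worth, a hedged example of the kind of comparison being described, assuming a local script along the lines of the sketch earlier in the thread (here called s3-etag.sh, a hypothetical name) and placeholder bucket, key, and file paths:

```bash
# Placeholders throughout: adjust the bucket, key and local path.
local_etag=$(./s3-etag.sh ./archive/part-0001.bin)   # hypothetical local-ETag script
remote_etag=$(aws s3api head-object \
    --bucket my-example-bucket \
    --key archive/part-0001.bin \
    --query ETag --output text | tr -d '"')

if [ "$local_etag" = "$remote_etag" ]; then
    echo "match: $local_etag"
else
    echo "MISMATCH: local=$local_etag remote=$remote_etag"
fi
```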
Additional context
I've uploaded 2TB of data in files as large as 9.5 GB; the process froze several times over the three days it took. Restarting it multiple times eventually finished the job, but I'm left wondering whether the upload is actually complete and correct.