This repository has been archived by the owner on Nov 5, 2024. It is now read-only.

Expose MD5 hash of objects in AWS S3 via APIs #89

Closed
augustoproiete opened this issue Jun 3, 2021 · 16 comments
Labels
feature-request, s3


@augustoproiete

augustoproiete commented Jun 3, 2021

Currently the only ways to know the MD5 hash of an object in S3 are:

  • Calculate it yourself upfront on the client-side before uploading, and store the hash as a metadata property (so that you can retrieve it later via the API)
  • Download the object from S3 to storage where you can run code to calculate the MD5 hash, then discard the downloaded copy

It's documented that the ETag is present in response headers and may or may not be an MD5 digest of the object data depending on how it was created and how it's encrypted. In the cases where ETag is not an MD5 digest, it is generated server-side and cannot be calculated on the client-side for comparison.
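
For illustration, here's a minimal sketch of the first workaround using the AWS SDK for .NET (the bucket, key, and `content-md5` metadata key are placeholder names of my choosing; .NET 6+ assumed):

```csharp
using System;
using System.IO;
using System.Security.Cryptography;
using Amazon.S3;
using Amazon.S3.Model;

var client = new AmazonS3Client();
const string bucket = "my-bucket";      // placeholder
const string key = "backup.bin";        // placeholder
const string filePath = "backup.bin";   // local file to upload

// 1. Hash the file locally before upload.
string md5Hex;
using (var md5 = MD5.Create())
using (var stream = File.OpenRead(filePath))
    md5Hex = Convert.ToHexString(md5.ComputeHash(stream)).ToLowerInvariant();

// 2. Store the hash as user-defined metadata alongside the object.
var put = new PutObjectRequest { BucketName = bucket, Key = key, FilePath = filePath };
put.Metadata.Add("content-md5", md5Hex); // stored as x-amz-meta-content-md5
await client.PutObjectAsync(put);

// 3. Later: read the hash back without downloading the object.
var head = await client.GetObjectMetadataAsync(bucket, key);
Console.WriteLine(head.Metadata["x-amz-meta-content-md5"]);
```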

Describe the Feature

AWS S3 should automatically calculate the MD5 digest of objects and expose it as a property via the API, similar to how ContentLength is exposed today.

If a user wants to know the size of an object, they can easily call GetObjectMetadataAsync and inspect the ContentLength. Knowing the MD5 hash of an object should be just as easy, since it's something Amazon S3 can readily calculate.

For instance, when calling ListObjectsV2 or GetObjectMetadataAsync, the caller should be able to retrieve the MD5 hash of the object(s) stored in S3.

For multipart uploads, the MD5 hash should be calculated after all parts of the object are uploaded, and after Amazon S3 assembles these parts and creates the final object. The MD5 hash should be a digest of the final object.

The property should always contain the MD5 hash of the contents of the object uploaded before any encryption occurs server-side.
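
To make the ask concrete, a hypothetical sketch against the .NET SDK. The ContentLength lookup works today; ServerContentMD5 is purely illustrative and does not exist:

```csharp
using System;
using Amazon.S3;

var client = new AmazonS3Client();
var metadata = await client.GetObjectMetadataAsync("my-bucket", "backup.bin");
Console.WriteLine(metadata.Headers.ContentLength);   // available today
// Console.WriteLine(metadata.ServerContentMD5);     // proposed property; does not exist
```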

Is your Feature Request related to a problem?

No

Proposed Solution

Expose a property Server-Content-MD5 that contains the MD5 digest of the full contents of the object after it has been uploaded to an AWS S3 bucket.

To address security concerns I propose:

  • Generating the MD5 hash should be opt-in per bucket and controlled via bucket policies
  • Accessing the Server-Content-MD5 property of objects that are encrypted server-side should require a special permission assigned via IAM

Describe alternatives you've considered

  • Calculating the MD5 on the client side and storing it as a metadata property during upload
  • Downloading the object and calculating the MD5 myself
  • Using the ETag when objects are not encrypted (see the sketch below)
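
A minimal sketch of the ETag alternative, assuming the object was uploaded in a single PUT without SSE-KMS or SSE-C (the only case where the ETag is the plain hex MD5 of the content; bucket and key are placeholders):

```csharp
using System;
using System.IO;
using System.Security.Cryptography;
using Amazon.S3;

var client = new AmazonS3Client();
var head = await client.GetObjectMetadataAsync("my-bucket", "backup.bin");
string etag = head.ETag.Trim('"');

// Hash the local copy for comparison.
using var md5 = MD5.Create();
using var stream = File.OpenRead("backup.bin");
string localHex = Convert.ToHexString(md5.ComputeHash(stream)).ToLowerInvariant();

// Multipart ETags carry a "-<partCount>" suffix and are not plain MD5 digests,
// so only compare when the suffix is absent.
if (!etag.Contains('-') && etag == localHex)
    Console.WriteLine("Local content matches the S3 object.");
```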

Additional Context

The main goal is to obtain the MD5 hash of an object in an S3 bucket without having to download the object out of S3. This is helpful when trying to determine, for example, if an object stored in an on-prem environment has the same contents as an object stored in S3.

Environment

N/A


This is a 🚀 Feature Request

@ashishdhingra ashishdhingra transferred this issue from aws/aws-sdk-net Jun 3, 2021
@ashishdhingra ashishdhingra added the feature-request, s3, and service-api labels Jun 3, 2021
@ashishdhingra

P48205542

@sqnfsa

sqnfsa commented Nov 1, 2021

We have the exact same use case at our organisation. We are dealing with files that are up to 100 GB and need to be integrity-checked after download.

We have looked at several custom ETag implementations to verify the integrity of a file post-download, but they rely on guessing the nature of the upload (multipart, single PUT, etc.) and whether encryption was used during upload.
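
For context, these custom ETag checks recompute S3's multipart ETag locally: MD5 each part, MD5 the concatenated binary part digests, and append a "-<partCount>" suffix. A rough sketch in C# (.NET 6+), with the commonly used 8 MB part size as the guess that makes the approach fragile:

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Security.Cryptography;

Console.WriteLine(MultipartETag("bigfile.bin", 8 * 1024 * 1024)); // 8 MB guess

// Recompute S3's multipart ETag for a local file under a guessed part size:
// MD5 each part, MD5 the concatenated binary digests, append "-<partCount>".
static string MultipartETag(string path, int partSize)
{
    var partDigests = new List<byte[]>();
    using var stream = File.OpenRead(path);
    var buffer = new byte[partSize];
    int filled;
    while ((filled = ReadFull(stream, buffer)) > 0)
        partDigests.Add(MD5.HashData(buffer.AsSpan(0, filled)));
    byte[] concatenated = partDigests.SelectMany(d => d).ToArray();
    return Convert.ToHexString(MD5.HashData(concatenated)).ToLowerInvariant()
           + "-" + partDigests.Count;
}

// Stream.Read may return short counts, so keep reading until the buffer
// is full or the stream ends.
static int ReadFull(Stream stream, byte[] buffer)
{
    int filled = 0, read;
    while (filled < buffer.Length &&
           (read = stream.Read(buffer, filled, buffer.Length - filled)) > 0)
        filled += read;
    return filled;
}
```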

Object Lambda may be an option, i.e. computing the MD5 on the fly and then storing the value in the object's custom metadata, but that would require long-running Lambda functions churning through heaps of data, potentially becoming cost-ineffective for us.

@augustoproiete's suggestion sounds a lot cleaner and more cost-effective.

Cheers!

@vasaf

vasaf commented Jan 2, 2022

This is needed

@brunomsantiago

brunomsantiago commented Mar 4, 2022

I haven't tested, but there are new checksum options on AWS:

https://aws.amazon.com/blogs/aws/new-additional-checksum-algorithms-for-amazon-s3/
https://www.youtube.com/watch?v=Xt6Lv4LrBQE

EDIT: However, according to this the issue persists (checksums work differently for multipart uploads and will change if you copy the object afterwards).

@jasonivers

This is a year old now. Is there any information on the status? It's still 'Open', so I'm assuming it hasn't been completed, but is there an intention to implement this feature request, and if so, is there an estimate on how long until that happens?

@scotthulluk

+1

@janosgats

+1

@alexrzem

alexrzem commented Oct 5, 2022

+1

@juanmarti81

juanmarti81 commented Nov 4, 2022

+1

In our case, the files are created by MediaConvert and the ETag is not an MD5. We have lots of files, all bigger than 20 GB, which makes it impossible to download and hash each one.

@KrzysztofPilarski

+1

@LKirk-LandG

+1

@mdavis-xyz

Because of the distributed nature of S3, and the way MD5 can't be parallelised into chunks, I suspect this feature won't be implemented as described (for multipart uploads).

You could achieve the same outcome of verifying integrity if you just hash each corresponding part in the local file.

The real problem is that after a multipart upload is completed, it doesn't appear possible to figure out the size of each part. You can get the part count from a HeadObject or GetObjectAttributes call, but you can't find the size of each part.

You can try to guess the part size to match common defaults (5 MB, 8 MB, 10 MB, 1 GB?). But if the hashes don't match, there's no way to know whether you guessed the part size wrong or the content genuinely differs from what you expect.

Note that parts don't have to be a power of 2 in size, and even the non-final parts can differ from one another in size.

The GetObjectAttributes API docs say it will return a list of part sizes and part hashes. But when I tried, those fields were missing. I suspect they are omitted for completed multipart uploads and present for in-progress multipart uploads.

So the most feasible solution is for S3 to return the part size and hash info in the GetObjectAttributes API for completed multipart uploads, as per the existing documentation.
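
For reference, the documented call looks like this in the .NET SDK (bucket and key are placeholders); as noted, the part-level fields may come back empty for completed uploads:

```csharp
using System;
using System.Collections.Generic;
using Amazon.S3;
using Amazon.S3.Model;

var client = new AmazonS3Client();
var attrs = await client.GetObjectAttributesAsync(new GetObjectAttributesRequest
{
    BucketName = "my-bucket",   // placeholder
    Key = "big-file.bin",       // placeholder
    ObjectAttributes = new List<ObjectAttributes>
    {
        ObjectAttributes.ObjectParts,
        ObjectAttributes.Checksum,
    },
});

Console.WriteLine($"Parts: {attrs.ObjectParts?.TotalPartsCount}");
// Documented to carry per-part size and checksum; may be empty in practice.
foreach (var part in attrs.ObjectParts?.Parts ?? new List<ObjectPart>())
    Console.WriteLine($"#{part.PartNumber}: {part.Size} bytes, sha256={part.ChecksumSHA256}");
```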

@brunomsantiago

brunomsantiago commented Nov 30, 2022

I don't know hashing in depth, but @mdavis-xyz's post made me curious about calculating hashes in parallel, and I landed on an interesting post on Stack Overflow, which I quote:

don't use MD5, it is no longer considered to be a secure cryptographic hash function; rather prefer at least SHA-256, or use BLAKE2, which holds speed records. There are also parallel hashes like ParallelHash of SHA-3, and BLAKE3. These can speed up the calculation whenever parallelization provides benefits.

The main point is that there are other options besides MD5.

If other hash functions would make it possible/easier/cheaper to implement a hash as described in @augustoproiete's original post, I would be happy with that. Probably most people requesting MD5 would be too.

@mdavis-xyz

MD5 is good enough for some use cases. If you're trying to get a fingerprint of two files to compare them, and you're assuming the only corruption would be errors over the wire or on the hard drive, or maybe two lines in a text file swapped by a legitimate author (i.e. nothing malicious), then it's suitable. In fact it's better than most others because it's faster.

The weakness in MD5 is if you're checking whether a file has been modified by a malicious and powerful adversary. E.g. when torrenting a Linux ISO from untrustworthy peers.

Note that earlier this year Amazon announced that you can choose to use SHA-256 and other hashes, per object, if you want.

You still end up with the same hash-of-hashes for multi-part uploads.
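
For reference, the per-object opt-in looks like this with the .NET SDK (bucket and key are placeholders):

```csharp
using Amazon.S3;
using Amazon.S3.Model;

var client = new AmazonS3Client();
await client.PutObjectAsync(new PutObjectRequest
{
    BucketName = "my-bucket",   // placeholder
    Key = "backup.bin",         // placeholder
    FilePath = "backup.bin",
    // S3 computes, stores, and later returns this checksum for the object.
    ChecksumAlgorithm = ChecksumAlgorithm.SHA256,
});
```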

@ashishdhingra

@augustoproiete S3 has moved away from supporting MD5. Our SDKs now support flexible checksums. This S3 blog shares the various ways to validate and retrieve checksums. Closing this issue in favor of flexible checksums.
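
For anyone landing here, retrieving a flexible checksum without downloading the object looks like this with the .NET SDK, assuming the object was uploaded with a checksum algorithm (bucket and key are placeholders):

```csharp
using System;
using System.Collections.Generic;
using Amazon.S3;
using Amazon.S3.Model;

var client = new AmazonS3Client();
var attrs = await client.GetObjectAttributesAsync(new GetObjectAttributesRequest
{
    BucketName = "my-bucket",   // placeholder
    Key = "backup.bin",         // placeholder
    ObjectAttributes = new List<ObjectAttributes> { ObjectAttributes.Checksum },
});
// Base64-encoded SHA-256; for multipart uploads it is a checksum-of-checksums.
Console.WriteLine(attrs.Checksum?.ChecksumSHA256);
```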

@ashishdhingra ashishdhingra closed this as not planned Feb 15, 2023
@github-actions

This issue is now closed.

Comments on closed issues are hard for our team to see.
If you need more assistance, please either tag a team member or open a new issue that references this one.
If you wish to keep having a conversation with other community members under this issue, feel free to do so.
