This repository has been archived by the owner on Nov 5, 2024. It is now read-only.

Expose MD5 hash of objects in AWS S3 via APIs #89

Closed
augustoproiete opened this issue Jun 3, 2021 · 16 comments
Labels
feature-request, s3


@augustoproiete

augustoproiete commented Jun 3, 2021

Currently the only ways to know the MD5 hash of an object in S3 are:

  • Calculate it yourself upfront on the client-side before uploading, and store the hash as a metadata property (so that you can retrieve it later via the API)
  • Download the object from S3 to storage where you can run code to calculate the MD5 hash, then discard the downloaded copy

It's documented that the ETag is present in response headers and may or may not be an MD5 digest of the object data depending on how it was created and how it's encrypted. In the cases where ETag is not an MD5 digest, it is generated server-side and cannot be calculated on the client-side for comparison.
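
For illustration, here's a minimal sketch of the first workaround using the AWS SDK for .NET (the bucket, key, and `content-md5` metadata key are placeholder names of my choosing; .NET 6+ assumed):

```csharp
using System;
using System.IO;
using System.Security.Cryptography;
using Amazon.S3;
using Amazon.S3.Model;

var client = new AmazonS3Client();
const string bucket = "my-bucket";      // placeholder
const string key = "backup.bin";        // placeholder
const string filePath = "backup.bin";   // local file to upload

// 1. Hash the file locally before upload.
string md5Hex;
using (var md5 = MD5.Create())
using (var stream = File.OpenRead(filePath))
    md5Hex = Convert.ToHexString(md5.ComputeHash(stream)).ToLowerInvariant();

// 2. Store the hash as user-defined metadata alongside the object.
var put = new PutObjectRequest { BucketName = bucket, Key = key, FilePath = filePath };
put.Metadata.Add("content-md5", md5Hex); // stored as x-amz-meta-content-md5
await client.PutObjectAsync(put);

// 3. Later: read the hash back without downloading the object.
var head = await client.GetObjectMetadataAsync(bucket, key);
Console.WriteLine(head.Metadata["x-amz-meta-content-md5"]);
```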

Describe the Feature

AWS S3 should automatically calculate the MD5 digest of objects and expose it as a property via the API, similar to how ContentLength is exposed today.

If a user wants to know the size of an object, they can easily call GetObjectMetadataAsync and inspect the ContentLength. Knowing the MD5 hash of an object should be just as easy, since it's something Amazon S3 can readily calculate.

For instance, when calling ListObjectsV2 or GetObjectMetadataAsync, the caller should be able to retrieve the MD5 hash of the object(s) stored in S3.

For multipart uploads, the MD5 hash should be calculated after all parts of the object are uploaded, and after Amazon S3 assembles these parts and creates the final object. The MD5 hash should be a digest of the final object.

The property should always contain the MD5 hash of the contents of the object uploaded before any encryption occurs server-side.
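
To make the ask concrete, a hypothetical sketch against the .NET SDK. The ContentLength lookup works today; ServerContentMD5 is purely illustrative and does not exist:

```csharp
using System;
using Amazon.S3;

var client = new AmazonS3Client();
var metadata = await client.GetObjectMetadataAsync("my-bucket", "backup.bin");
Console.WriteLine(metadata.Headers.ContentLength);   // available today
// Console.WriteLine(metadata.ServerContentMD5);     // proposed property; does not exist
```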

Is your Feature Request related to a problem?

No

Proposed Solution

Expose a property Server-Content-MD5 that contains the MD5 digest of the full contents of the object after it has been uploaded to an AWS S3 bucket.

To address security concerns I propose:

  • Generating the MD5 hash should be opt-in per bucket and controlled via bucket policies
  • Accessing the Server-Content-MD5 property of objects that are encrypted server-side should require a special permission assigned via IAM

Describe alternatives you've considered

  • Calculating the MD5 on the client side and storing it as a metadata property during upload
  • Downloading the object and calculating the MD5 myself
  • Using the ETag when objects are not encrypted (see the sketch below)
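
A minimal sketch of the ETag alternative, assuming the object was uploaded in a single PUT without SSE-KMS or SSE-C (the only case where the ETag is the plain hex MD5 of the content; bucket and key are placeholders):

```csharp
using System;
using System.IO;
using System.Security.Cryptography;
using Amazon.S3;

var client = new AmazonS3Client();
var head = await client.GetObjectMetadataAsync("my-bucket", "backup.bin");
string etag = head.ETag.Trim('"');

// Hash the local copy for comparison.
using var md5 = MD5.Create();
using var stream = File.OpenRead("backup.bin");
string localHex = Convert.ToHexString(md5.ComputeHash(stream)).ToLowerInvariant();

// Multipart ETags carry a "-<partCount>" suffix and are not plain MD5 digests,
// so only compare when the suffix is absent.
if (!etag.Contains('-') && etag == localHex)
    Console.WriteLine("Local content matches the S3 object.");
```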

Additional Context

The main goal is to obtain the MD5 hash of an object in an S3 bucket without having to download the object out of S3. This is helpful when trying to determine, for example, if an object stored in an on-prem environment has the same contents as an object stored in S3.

Environment

N/A


This is a 🚀 Feature Request

@ashishdhingra ashishdhingra transferred this issue from aws/aws-sdk-net Jun 3, 2021
@ashishdhingra ashishdhingra added the feature-request, s3, and service-api labels Jun 3, 2021
@ashishdhingra

P48205542

@sqnfsa

sqnfsa commented Nov 1, 2021

We have the exact same use case at our organisation. We are dealing with files that are up to 100 GB and need to be integrity-checked after download.

We have looked at several custom ETag implementations to verify the integrity of a file post-download, but they rely on guessing the nature of the upload (multipart, single PUT, etc.) and whether encryption was used during upload.
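
For context, these custom ETag checks recompute S3's multipart ETag locally: MD5 each part, MD5 the concatenated binary part digests, and append a "-<partCount>" suffix. A rough sketch in C# (.NET 6+), with the commonly used 8 MB part size as the guess that makes the approach fragile:

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Security.Cryptography;

Console.WriteLine(MultipartETag("bigfile.bin", 8 * 1024 * 1024)); // 8 MB guess

// Recompute S3's multipart ETag for a local file under a guessed part size:
// MD5 each part, MD5 the concatenated binary digests, append "-<partCount>".
static string MultipartETag(string path, int partSize)
{
    var partDigests = new List<byte[]>();
    using var stream = File.OpenRead(path);
    var buffer = new byte[partSize];
    int filled;
    while ((filled = ReadFull(stream, buffer)) > 0)
        partDigests.Add(MD5.HashData(buffer.AsSpan(0, filled)));
    byte[] concatenated = partDigests.SelectMany(d => d).ToArray();
    return Convert.ToHexString(MD5.HashData(concatenated)).ToLowerInvariant()
           + "-" + partDigests.Count;
}

// Stream.Read may return short counts, so keep reading until the buffer
// is full or the stream ends.
static int ReadFull(Stream stream, byte[] buffer)
{
    int filled = 0, read;
    while (filled < buffer.Length &&
           (read = stream.Read(buffer, filled, buffer.Length - filled)) > 0)
        filled += read;
    return filled;
}
```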

Object Lambda may be an option, i.e. computing the MD5 on the fly and then storing the value in the object's custom metadata, but that would require long-running Lambda functions churning through heaps of data, potentially becoming cost-ineffective for us.

@augustoproiete's suggestion sounds a lot cleaner and more cost-effective.

Cheers!

@vasaf

vasaf commented Jan 2, 2022

This is needed

@brunomsantiago

brunomsantiago commented Mar 4, 2022

I haven't tested, but there are new checksum options on AWS:

https://aws.amazon.com/blogs/aws/new-additional-checksum-algorithms-for-amazon-s3/
https://www.youtube.com/watch?v=Xt6Lv4LrBQE

EDIT: However, according to this the issue persists (checksums work differently for multipart uploads and will change if you copy the object afterwards).

@jasonivers

This is a year old now. Is there any information on the status? It's still 'Open', so I'm assuming it hasn't been completed, but is there an intention to implement this feature request, and if so, is there an estimate on how long until that happens?

@scotthulluk

+1

@janosgats

+1

@alexrzem

alexrzem commented Oct 5, 2022

+1

@juanmarti81

juanmarti81 commented Nov 4, 2022

+1

In our case, the files are created by MediaConvert and the ETag is not an MD5. We have lots of files, all bigger than 20 GB, which makes it impossible to download and hash each one.

@KrzysztofPilarski

+1

@LKirk-LandG

+1

@mdavis-xyz

Because of the distributed nature of S3, and the way MD5 can't be parallelised into chunks, I suspect this feature won't be implemented as described (for multipart uploads).

You could achieve the same outcome of verifying integrity if you just hash each corresponding part in the local file.

The real problem is that after a multipart upload is completed, it doesn't appear possible to figure out the size of each part. You can get the part count from a HeadObject or GetObjectAttributes call, but you can't find the size of each part.

You can try to guess the part size to match common defaults (5 MB, 8 MB, 10 MB, 1 GB?). But if the hashes don't match, there's no way to know whether you guessed the part size wrong or the content genuinely differs from what you expect.

Note that parts don't have to be a power of 2 in size, and even the non-final parts can differ from one another in size.

The GetObjectAttributes API docs say it will return a list of part sizes and part hashes. But when I tried, those fields were missing. I suspect they are omitted for completed multipart uploads and present for in-progress multipart uploads.

So the most feasible solution is for S3 to return the part size and hash info in the GetObjectAttributes API for completed multipart uploads, as per the existing documentation.
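
For reference, the documented call looks like this in the .NET SDK (bucket and key are placeholders); as noted, the part-level fields may come back empty for completed uploads:

```csharp
using System;
using System.Collections.Generic;
using Amazon.S3;
using Amazon.S3.Model;

var client = new AmazonS3Client();
var attrs = await client.GetObjectAttributesAsync(new GetObjectAttributesRequest
{
    BucketName = "my-bucket",   // placeholder
    Key = "big-file.bin",       // placeholder
    ObjectAttributes = new List<ObjectAttributes>
    {
        ObjectAttributes.ObjectParts,
        ObjectAttributes.Checksum,
    },
});

Console.WriteLine($"Parts: {attrs.ObjectParts?.TotalPartsCount}");
// Documented to carry per-part size and checksum; may be empty in practice.
foreach (var part in attrs.ObjectParts?.Parts ?? new List<ObjectPart>())
    Console.WriteLine($"#{part.PartNumber}: {part.Size} bytes, sha256={part.ChecksumSHA256}");
```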

@brunomsantiago

brunomsantiago commented Nov 30, 2022

I don't know hashing in depth, but @mdavis-xyz's post made me curious about calculating hashes in parallel, and I landed on an interesting post on Stack Overflow, which I quote:

don't use MD5, it is no longer considered to be a secure cryptographic hash function; rather prefer at least SHA-256, or use BLAKE2, which holds speed records. There are also parallel hashes like ParallelHash of SHA-3, and BLAKE3. These can speed up the calculation whenever parallelization provides benefits.

The main point is that there are other options besides MD5.

If other hash functions would make it possible/easier/cheaper to implement a hash as described in @augustoproiete's original post, I would be happy with that. Probably most people requesting MD5 would be too.

@mdavis-xyz

MD5 is good enough for some use cases. If you're trying to get a fingerprint of two files to compare them, and you're assuming the only corruption would be errors over the wire or on the hard drive, or maybe two lines in a text file swapped by a legitimate author (i.e. nothing malicious), then it's suitable. In fact it's better than most others because it's faster.

The weakness in MD5 is if you're checking whether a file has been modified by a malicious and powerful adversary. E.g. when torrenting a Linux ISO from untrustworthy peers.

Note that earlier this year Amazon announced that you can choose to use SHA-256 and other hashes, per object, if you want.

You still end up with the same hash-of-hashes for multi-part uploads.
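
For reference, the per-object opt-in looks like this with the .NET SDK (bucket and key are placeholders):

```csharp
using Amazon.S3;
using Amazon.S3.Model;

var client = new AmazonS3Client();
await client.PutObjectAsync(new PutObjectRequest
{
    BucketName = "my-bucket",   // placeholder
    Key = "backup.bin",         // placeholder
    FilePath = "backup.bin",
    // S3 computes, stores, and later returns this checksum for the object.
    ChecksumAlgorithm = ChecksumAlgorithm.SHA256,
});
```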

@ashishdhingra

@augustoproiete S3 has moved away from supporting MD5. Our SDKs now support flexible checksums. This S3 blog shares the various ways to validate and retrieve checksums. Closing this issue in favor of flexible checksums.
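
For anyone landing here, retrieving a flexible checksum without downloading the object looks like this with the .NET SDK, assuming the object was uploaded with a checksum algorithm (bucket and key are placeholders):

```csharp
using System;
using System.Collections.Generic;
using Amazon.S3;
using Amazon.S3.Model;

var client = new AmazonS3Client();
var attrs = await client.GetObjectAttributesAsync(new GetObjectAttributesRequest
{
    BucketName = "my-bucket",   // placeholder
    Key = "backup.bin",         // placeholder
    ObjectAttributes = new List<ObjectAttributes> { ObjectAttributes.Checksum },
});
// Base64-encoded SHA-256; for multipart uploads it is a checksum-of-checksums.
Console.WriteLine(attrs.Checksum?.ChecksumSHA256);
```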

@ashishdhingra ashishdhingra closed this as not planned Feb 15, 2023
@github-actions

This issue is now closed.

Comments on closed issues are hard for our team to see.
If you need more assistance, please either tag a team member or open a new issue that references this one.
If you wish to keep having a conversation with other community members under this issue, feel free to do so.
