Expose MD5 hash of objects in AWS S3 via APIs #89
Comments
We have the exact same use case at our organisation. We are dealing with files that are up to 100 GB and need to be integrity-checked after download. We have looked at several custom ETag implementations to verify the integrity of a file post-download, but they rely on guessing the nature of the upload (multipart, PUT object, etc.) and whether encryption was used during upload. Object Lambda may be an option, i.e. computing the MD5 on the fly and then storing the value in the object's custom metadata, but that would require long-running Lambda functions churning through heaps of data, potentially becoming cost-ineffective for us. @augustoproiete's suggestion sounds a lot cleaner and more cost-effective. Cheers!
This is needed
I haven't tested it, but there are new checksum options on AWS: https://aws.amazon.com/blogs/aws/new-additional-checksum-algorithms-for-amazon-s3/ EDIT: However, according to this, the issue persists (it works differently for multipart uploads, and the checksum will change if you copy the object afterwards).
This is a year old now. Is there any information on the status? It's still 'Open', so I'm assuming it hasn't been completed, but is there an intention to implement this feature request, and if so, is there an estimate on how long until that happens? |
+1 |
+1 |
+1 |
+1 In our case, the files are created by MediaConvert and the ETag is not an MD5. We have lots of files, all bigger than 20 GB, which makes it impossible to download each one and generate the hash ourselves.
+1 |
+1 |
Because of the distributed nature of S3, and the way MD5 can't be parallelised into chunks, I suspect this feature won't be implemented as described (for multipart uploads).

You could achieve the same outcome of verifying integrity if you just hash each corresponding part in the local file. The real problem is that after a multipart upload is completed, it doesn't appear possible to figure out the size of each part. You can get the part count from a head or getObjectAttributes call, but you can't find the size of each part.

You can try to guess the part size to match common defaults (8 MB, 5 MB, 10 MB, 1 GB?). But there's no way to know if you guessed the part size wrong and the content still matches what you expect. Note that parts don't have to be a power of 2 in size, and even the non-last ones can differ in size.

The getObjectAttributes API docs say they'll return a list of part sizes and part hashes. But when I tried, those fields were missing. I suspect they are omitted for completed multipart uploads and present for in-progress multipart uploads.

So the most feasible solution is for S3 to return the part size and hash info in the getObjectAttributes API, as per the existing documentation, for completed multipart uploads.
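The part-size guessing described above can be sketched in a few lines of Python. This is a minimal illustration, not a complete verifier: it assumes the well-known multipart ETag construction (MD5 of the concatenated per-part MD5 digests, suffixed with `-N` for the part count), and the candidate part sizes are arbitrary guesses at common defaults, not values obtained from S3.

```python
import hashlib


def multipart_etag(data: bytes, part_size: int) -> str:
    """Compute an S3-style multipart ETag: MD5 of the concatenated
    per-part MD5 digests, suffixed with the number of parts."""
    parts = [data[i:i + part_size] for i in range(0, len(data), part_size)]
    digests = b"".join(hashlib.md5(p).digest() for p in parts)
    return f"{hashlib.md5(digests).hexdigest()}-{len(parts)}"


def matches_any_part_size(data: bytes, etag: str,
                          candidates_mib=(5, 8, 10, 16, 64)) -> bool:
    """Try common part sizes (in MiB) until one reproduces the ETag.
    A miss proves nothing: the uploader may have used a part size
    that is not in the candidate list, or unequal part sizes."""
    mib = 1024 * 1024
    return any(multipart_etag(data, c * mib) == etag for c in candidates_mib)
```

This is exactly the weakness the comment points out: a successful match confirms integrity, but a failed match is ambiguous between corruption and a wrong part-size guess.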
I don't know hashing in depth, but @mdavis-xyz's post made me curious about calculating hashes in parallel, and I landed on an interesting post on Stack Overflow. The main point is that there are alternatives to MD5. If another hash function would make it possible/easier/cheaper to implement a hash as described in the original post by @augustoproiete, I would be happy with that. Probably most people requesting MD5 would be too.
MD5 is good enough for some use cases. If you're trying to get a fingerprint of two files to compare them, and you're assuming the only corruption would be some errors over the wire or a hard drive error, or maybe two lines in a text file swapped by a legitimate author (i.e. nothing malicious), then it's suitable. In fact it's better than most others because it's faster.

The weakness in MD5 matters if you're checking whether a file has been modified by a malicious and powerful adversary, e.g. when torrenting a Linux ISO from untrustworthy peers.

Note that earlier this year Amazon announced that you can choose to use SHA-256 and other hashes, per object, if you want. You still end up with the same hash-of-hashes for multipart uploads.
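The "same hash-of-hashes" point can be made concrete. The sketch below assumes the composite-checksum construction described in AWS's announcement for multipart uploads: the stronger algorithm (here SHA-256) is applied to the concatenated per-part digests rather than to the whole object, and the reported value carries a part-count suffix mirroring the ETag convention. Details like the base64 encoding and suffix format are my reading of the docs, not verified against a live bucket.

```python
import base64
import hashlib


def composite_sha256(parts: list[bytes]) -> str:
    """S3-style composite checksum: SHA-256 over the concatenated
    per-part SHA-256 digests, base64-encoded, with a part-count
    suffix. Structurally the same hash-of-hashes as the multipart
    MD5 ETag, so swapping the algorithm alone does not yield a
    digest of the assembled object."""
    digests = b"".join(hashlib.sha256(p).digest() for p in parts)
    outer = base64.b64encode(hashlib.sha256(digests).digest()).decode()
    return f"{outer}-{len(parts)}"
```

Splitting the same bytes into different parts produces different composite checksums, which is why the part-boundary problem from the earlier comment survives the algorithm change.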
@augustoproiete S3 has moved away from supporting MD5. Our SDKs now support flexible checksums. This S3 blog shares the various ways to validate and retrieve checksums. Closing this issue in favor of flexible checksums. |
This issue is now closed. Comments on closed issues are hard for our team to see. |
Currently the only ways to know the MD5 hash of an object in S3 are:
It's documented that the `ETag` is present in response headers and may or may not be an MD5 digest of the object data, depending on how it was created and how it's encrypted. In the cases where `ETag` is not an MD5 digest, it is generated server-side and cannot be calculated on the client-side for comparison.

Describe the Feature
AWS S3 should automatically calculate the MD5 digest of objects and expose it as a property via the API, similar to how `ContentLength` is exposed today.

If a user wants to know the size of an object, they can easily call `GetObjectMetadataAsync` and inspect the `ContentLength`. Knowing the MD5 hash of an object should be just as easy, as it's something that can be easily calculated by Amazon S3.

For instance, when calling `ListObjectsV2` or `GetObjectMetadataAsync`, the caller should be able to retrieve the MD5 hash of the object(s) stored in S3.

For multipart uploads, the MD5 hash should be calculated after all parts of the object are uploaded, and after Amazon S3 assembles these parts and creates the final object. The MD5 hash should be a digest of the final object.
The property should always contain the MD5 hash of the contents of the object uploaded before any encryption occurs server-side.
Is your Feature Request related to a problem?
No
Proposed Solution
Expose a property `Server-Content-MD5` that contains the MD5 digest of the full contents of the object after it has been uploaded to an AWS S3 bucket.

To address security concerns I propose:

- Reading the `Server-Content-MD5` property of objects that are encrypted server-side should require a special permission assigned via IAM

Describe alternatives you've considered

Using the `ETag` when objects are not encrypted

Additional Context
The main goal is to obtain the MD5 hash of an object in an S3 bucket without having to download the object out of S3. This is helpful when trying to determine, for example, if an object stored in an on-prem environment has the same contents as an object stored in S3.
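The local half of that on-prem comparison is straightforward today: a streaming MD5 that handles arbitrarily large files in constant memory. The sketch below is illustrative; the server-side value it would be compared against (`Server-Content-MD5`) is the name proposed in this issue, not an existing S3 field.

```python
import hashlib


def file_md5(path: str, chunk_size: int = 1024 * 1024) -> str:
    """Stream a file through MD5 in fixed-size chunks, so even
    very large files (e.g. 100 GB media objects) never need to
    be loaded into memory at once."""
    md5 = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            md5.update(chunk)
    return md5.hexdigest()
```

With the requested feature, comparing this hex digest against the object's server-side property would replace downloading the whole object just to hash it.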
Environment
N/A
This is a 🚀 Feature Request