For cloud storage client library, consider adding integrity checking functionality to compute checksums for downloads and uploads #660
Labels
api: storage
Issues related to the Cloud Storage API.
type: feature request
‘Nice-to-have’ improvement, new feature or different behavior or design.
Context: Cloud Storage already lets a client supply a known checksum when uploading an object [1].
However, for downloads and uploads where the MD5 or CRC32C checksum is not known beforehand, clients are responsible for integrity checking. For downloads, this is mentioned on the same docs page [1].
The same strategy could be used for uploads where the checksum isn't known beforehand.
Suggestion: From a reliability standpoint, a storage client library should take care of integrity checks for all local-to-cloud uploads and cloud-to-local downloads. We ought to do this to protect users from data corruption when they don't provide the file's checksum up front. Other client libraries didn't have this functionality at their inception, but it has since been added, e.g. in the Python storage library ([2], [3]) and the dotnet storage library ([4], [5]).
As mentioned in the Python library issue [2], MD5 sums were used there because a fast CRC32C implementation (the C extension of the crcmod module) is not guaranteed to be available on all Python installations. Since this is a C++ library and we can ensure that a fast CRC32C implementation is available, we should use CRC32C, which is guaranteed to be available for all objects (including composite objects), rather than MD5.
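To make the suggestion concrete, here is a minimal sketch of the kind of check the library could run while streaming data. `Crc32cAccumulator`, `EncodeCrc32cBase64`, and `MatchesServerCrc32c` are hypothetical names for illustration, not existing library APIs. The bitwise loop is a simple reference implementation, not the fast one a real implementation would use. It assumes the service reports the CRC32C as the base64 encoding of the 4-byte big-endian value, as described in [1].

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical helper: accumulates a CRC32C (Castagnoli) checksum over
// chunks as they are downloaded or uploaded.
class Crc32cAccumulator {
 public:
  void Update(char const* data, std::size_t size) {
    assert(data != nullptr || size == 0);
    for (std::size_t i = 0; i != size; ++i) {
      state_ ^= static_cast<unsigned char>(data[i]);
      for (int bit = 0; bit != 8; ++bit) {
        // 0x82F63B78 is the bit-reflected Castagnoli polynomial.
        state_ = (state_ >> 1) ^ (0x82F63B78u & (0u - (state_ & 1u)));
      }
    }
  }
  void Update(std::string const& chunk) { Update(chunk.data(), chunk.size()); }

  // Checksum of all the bytes seen so far.
  std::uint32_t Value() const { return state_ ^ 0xFFFFFFFFu; }

 private:
  std::uint32_t state_ = 0xFFFFFFFFu;
};

// Encodes the checksum as base64 of its 4 bytes in big-endian order.
std::string EncodeCrc32cBase64(std::uint32_t crc) {
  static char const kAlphabet[] =
      "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";
  std::uint32_t const b[4] = {(crc >> 24) & 0xFFu, (crc >> 16) & 0xFFu,
                              (crc >> 8) & 0xFFu, crc & 0xFFu};
  std::string out;
  out += kAlphabet[b[0] >> 2];
  out += kAlphabet[((b[0] & 0x03u) << 4) | (b[1] >> 4)];
  out += kAlphabet[((b[1] & 0x0Fu) << 2) | (b[2] >> 6)];
  out += kAlphabet[b[2] & 0x3Fu];
  out += kAlphabet[b[3] >> 2];
  out += kAlphabet[(b[3] & 0x03u) << 4];
  out += "==";  // 4 input bytes always need two padding characters.
  return out;
}

// Returns true if the locally computed checksum matches the value the
// service reported for the object (assumed to be base64, big-endian).
bool MatchesServerCrc32c(Crc32cAccumulator const& local,
                         std::string const& server_crc32c) {
  return EncodeCrc32cBase64(local.Value()) == server_crc32c;
}
```

A download would call `Update()` on each chunk as it arrives and `MatchesServerCrc32c()` once the transfer completes, reporting an error on mismatch. An upload could do the same against the checksum in the object metadata returned by the service. As a sanity check, the standard input "123456789" yields 0xE3069283.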
[1] https://cloud.google.com/storage/docs/hashes-etags
[2] googleapis/google-resumable-media-python#22
[3] googleapis/google-cloud-python#4133
[4] googleapis/google-cloud-dotnet#395
[5] jskeet/google-cloud-dotnet@aaa2aa0