Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

For cloud storage client library, consider adding integrity checking functionality to compute checksums for downloads and uploads #660

Closed
houglum opened this issue Jun 5, 2018 · 1 comment
Assignees
Labels
api: storage Issues related to the Cloud Storage API. type: feature request ‘Nice-to-have’ improvement, new feature or different behavior or design.

Comments

@houglum
Copy link
Contributor

houglum commented Jun 5, 2018

Context: Cloud Storage already offers the ability to supply a known checksum when uploading an object [1]:

Cloud Storage supports server-side validation for uploads, but client-side validation is also a common approach. If your application has already computed the object’s MD5 or CRC32c prior to starting the upload, you can supply it with the upload request, and Cloud Storage will only create the object if the hash you provided matches the value we calculated.

But for downloads and uploads where the MD5 or CRC32C checksum(s) is not known beforehand, clients have to take responsibility for integrity checking. W.r.t downloads, this is mentioned in the same docs page:

A download integrity check can be performed by hashing downloaded data on the fly and comparing your results to server-supplied hashes. You should discard downloaded data with incorrect hash values, and [...].

The same strategy could be used for uploads where the checksum(s) isn't known beforehand.

Suggestion: From a reliability standpoint, it makes sense that a storage client library should take care of integrity checks for all local-to-cloud uploads and cloud-to-local downloads. We ought to be doing this to protect users from data corruption (in the event that a user doesn't provide us with the file's checksum up-front). While other client libraries didn't have this functionality at their inception, it's since been added, e.g. in the Python storage library ([2], [3]) and the dotnet storage library ([4], [5]). As mentioned in the Python library issue [2], the MD5 sums were used there because a fast CRC32C implementation (the C extension of the crcmod module) is not guaranteed to be available on all Python installations; since this is a C++ library and we can ensure we have a fast implementation of it, we should use CRC32C (guaranteed to be available for all objects, including composite objects), rather than MD5 checksumming.

  • Note: The Python library actually deletes a corrupted download file, while I think the C#/dotnet library just throws an error without deleting it. Deleting the file seems more appealing and doesn't seem hard to tack on to this functionality, but removing the corrupted file means we lose some ability to debug the corruption (e.g. what's the nature of the corruption? Did we just miss a byte at the end? etc.). We should think about the pros and cons of both approaches if we decide to add this integrity checking functionality.

[1] https://cloud.google.com/storage/docs/hashes-etags
[2] googleapis/google-resumable-media-python#22
[3] googleapis/google-cloud-python#4133
[4] googleapis/google-cloud-dotnet#395
[5] jskeet/google-cloud-dotnet@aaa2aa0

@sduskis sduskis added api: storage Issues related to the Cloud Storage API. type: feature request ‘Nice-to-have’ improvement, new feature or different behavior or design. labels Jun 6, 2018
@coryan
Copy link
Contributor

coryan commented Nov 26, 2018

I think this is now implemented, all uploads and downloads can include the CRC32C checksum and/or MD5 checksums. For uploads, if the integrity is not known in advance (or cannot be computed before the upload), we raise an exception when the integrity check does not match whatever is computed by GCS.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
api: storage Issues related to the Cloud Storage API. type: feature request ‘Nice-to-have’ improvement, new feature or different behavior or design.
Projects
None yet
Development

No branches or pull requests

3 participants