
gsutil cp -Z always force adds Cache-control: no-transform and Content-Encoding: gzip. Breaks http protocol #480

Open
nojvek opened this issue Oct 19, 2017 · 36 comments

nojvek commented Oct 19, 2017

The curl client should be able to receive the unzipped version, but GCS always returns Content-Encoding: gzip. This breaks the HTTP/1.1 protocol, since the client never sent an "Accept-Encoding: gzip, deflate, br" header.

$ gsutil -m -h "Cache-Control: public,max-age=31536000" cp -Z foo.txt gs://somebucket/foo.txt
Copying file://foo.txt [Content-Type=text/plain]...
- [1/1 files][   42.0 B/   12.0 B] 100% Done
Operation completed over 1 objects/12.0 B.

$ curl -v somebucket.io/foo.txt
> GET /foo1.txt HTTP/1.1
> User-Agent: curl/7.37.0
> Host: somebucket.io
> Accept: */*
> 
< HTTP/1.1 200 OK
< X-GUploader-UploadID: ...
< Date: Thu, 19 Oct 2017 18:04:05 GMT
< Expires: Fri, 19 Oct 2018 18:04:05 GMT
< Last-Modified: Thu, 19 Oct 2017 18:03:47 GMT
< ETag: "c35fdf2f0c2dcadc46333b0709c87e64"
< x-goog-generation: 1508436227151587
< x-goog-metageneration: 1
< x-goog-stored-content-encoding: gzip
< x-goog-stored-content-length: 42
< Content-Type: text/plain
< Content-Encoding: gzip
< x-goog-hash: crc32c=V/9tDw==
< x-goog-hash: md5=w1/fLwwtytxGMzsHCch+ZA==
< x-goog-storage-class: MULTI_REGIONAL
< Accept-Ranges: bytes
< Content-Length: 42
< Access-Control-Allow-Origin: *
* Server UploadServer is not blacklisted
< Server: UploadServer
< Age: 2681
< Cache-Control: public,max-age=31536000,no-transform
< (gzip-compressed binary response body)

Seems to be happening here

gsutil/gslib/copy_helper.py

Lines 1741 to 1759 in e8154ba

if (gzip_exts == GZIP_ALL_FILES or
    (gzip_exts and len(fname_parts) > 1 and fname_parts[-1] in gzip_exts)):
  upload_url, upload_size = _CompressFileForUpload(
      src_url, src_obj_filestream, src_obj_size, logger)
  upload_stream = open(upload_url.object_name, 'rb')
  dst_obj_metadata.contentEncoding = 'gzip'
  # If we're sending an object with gzip encoding, it's possible it also
  # has an incompressible content type. Google Cloud Storage will remove
  # the top layer of compression when serving the object, which would cause
  # the served content not to match the CRC32C/MD5 hashes stored and make
  # integrity checking impossible. Therefore we set cache control to
  # no-transform to ensure it is served in its original form. The caveat is
  # that to read this object, other clients must then support
  # accept-encoding:gzip.
  if not dst_obj_metadata.cacheControl:
    dst_obj_metadata.cacheControl = 'no-transform'
  elif 'no-transform' not in dst_obj_metadata.cacheControl.lower():
    dst_obj_metadata.cacheControl += ',no-transform'
  zipped_file = True

@nojvek nojvek changed the title gsutil cp -Z always force adds Cache-control: no-transform. Breaks Http protocol gsutil cp -Z always force adds Cache-control: no-transform and Content-Encoding: gzip. Breaks Http protocol Oct 19, 2017
@nojvek nojvek changed the title gsutil cp -Z always force adds Cache-control: no-transform and Content-Encoding: gzip. Breaks Http protocol gsutil cp -Z always force adds Cache-control: no-transform and Content-Encoding: gzip. Breaks http protocol Oct 19, 2017
nojvek (Author) commented Oct 19, 2017

@houglum ^

houglum (Collaborator) commented Oct 19, 2017

This behavior (ignoring the Accept-Encoding header) is documented here:
https://cloud.google.com/storage/docs/transcoding#decompressive_transcoding

If the Cache-Control metadata field for the object is set to no-transform, the object is served as a compressed object in all subsequent requests, regardless of any Accept-Encoding request headers.

...although it also seems like it would be helpful for us to mention this (along with the fact that we apply the no-transform cache-control directive) in the docs for the -z option.

nojvek (Author) commented Oct 19, 2017

If the request for the object includes an Accept-Encoding: gzip header, the object is served as-is in that specific request, along with a Content-Encoding: gzip response header.
If the Cache-Control metadata field for the object is set to no-transform, the object is served as a compressed object in all subsequent requests, regardless of any Accept-Encoding request headers.

Basically what I am saying is: there should be a way to turn off the no-transform that -Z/-z adds. It's too aggressive and breaks clients that don't understand gzip. I understand that no-transform is used for integrity checking, but the official gsutil client can always request with Accept-Encoding: gzip and do an integrity check.

The no-transform added by -z looks like implementation logic leaking out as a side effect. It essentially makes -z an unwise choice in a production environment, because it breaks the HTTP protocol between server and client.
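
(For illustration, a minimal sketch of what nojvek suggests a gzip-aware client can already do; the bucket/object names are placeholders. The client asks for the stored bytes with Accept-Encoding: gzip and compares them against the stored MD5 from the x-goog-hash header.)

# Sketch only; somebucket/foo.txt are placeholders.
curl -s -H "Accept-Encoding: gzip" -o foo.txt.gz https://storage.googleapis.com/somebucket/foo.txt
# Base64-encoded MD5 of the received bytes, for comparison with "x-goog-hash: md5=..."
openssl md5 -binary foo.txt.gz | base64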

houglum (Collaborator) commented Oct 23, 2017

+@thobrla for comment, as he added this in 439573e and likely has more context.

thobrla (Contributor) commented Oct 26, 2017

If we remove no-transform, it's possible that integrity checking will be impossible for doubly compressed objects, since GCS may remove a layer of compression prior to sending the object even when Accept-Encoding:gzip is provided. This would in turn cause the stored MD5 not to match an MD5 computed on the received bytes regardless of the headers provided by the client.

So if we add such an option to drop no-transform, we're back in the situation we were in before 439573e where certain files uploaded by gsutil cannot then be downloaded by gsutil, and this seems worse than not being downloadable by a different client.

To put it differently, I cannot see a way to author a fully compatible solution with GCS's current behavior.

As a workaround, you can remove cache-control: no-transform on such objects using the gsutil setmeta command. Would that work for your use case?
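
(For illustration, a sketch of that workaround; the bucket/object and the Cache-Control value are placeholders. Setting Cache-Control to a value without no-transform re-enables decompressive transcoding for clients that don't accept gzip.)

# Sketch only; replace the value and object path with your own.
gsutil setmeta -h "Cache-Control:public, max-age=3600" gs://somebucket/foo.txt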

nojvek (Author) commented Oct 26, 2017 via email

thobrla (Contributor) commented Oct 26, 2017

Take a look at the second paragraph of Using gzip on compressed objects: if you upload gzipped content and request it gzipped, but GCS considers the content type incompressible, it will remove the encoding regardless of your request and will then serve bytes that do not match the MD5 stored in the object's metadata. I think there is a core issue with the service here: GCS does not publish the content types it considers incompressible, and that list is also subject to change.

I agree there are serious side effects to using no-transform as an approach; we settled on it as a compromise in gsutil because most modern clients can accept gzip encoding. Unless that issue in the GCS service is addressed, I don't think we will be able to arrive at a clean solution.
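
(To make the "doubly compressed" case concrete, a sketch using placeholder file and bucket names:)

# report.csv.gz is already gzip data; cp -Z wraps it in a second gzip layer.
gzip -k report.csv
gsutil cp -Z report.csv.gz gs://somebucket/report.csv.gz
# The object is now stored gzip-on-gzip. If GCS strips one layer when serving
# (e.g. because it treats the content type as incompressible), the received
# bytes no longer match the stored MD5/CRC32C and gsutil's integrity check fails.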

nojvek (Author) commented Oct 31, 2017

So you're essentially saying GCS will tamper with my data when storing it, based on an undocumented process that even some of the Google Cloud team doesn't know about.

Do you know where I can file an issue for the root cause? This seems like bad design on so many levels. I would expect GCS to just be a dumb store of bytes and to follow the Content-Encoding: gzip HTTP spec.

lotten (Contributor) commented Nov 1, 2017

Just to be clear: GCS will never touch the stored data; this is exclusively about the encoding used when sending it over the wire.

mikeheme commented Nov 8, 2017

@thobrla For my use case, the behavior of always getting gzipped content when the object's Cache-Control is set to 'no-transform' works fine for optimized web serving.

However, I did some extra tinkering and removed "no-transform" from Cache-Control as you suggested, but that causes another issue. The expected behavior should be that the server respects the "Accept-Encoding" header: if gzip is included, it should return gzipped content (no decompressive transcoding; serve the file as stored), and if no "Accept-Encoding: gzip" request header is included, it should DO decompressive transcoding as documented (right?). In both cases it should respond with the corresponding "Content-Encoding" header. BUT it appears that GCS ignores the "Accept-Encoding" request header and always does decompressive transcoding.

If the request for the object includes an Accept-Encoding: gzip header, the object is served as-is in that specific request, along with a Content-Encoding: gzip response header.

For example, the following file has the following metadata in GCS:
https://storage.googleapis.com/cedar-league-184821.appspot.com/1/2017/11c/CT-jpg-CE-gzip-CC-empty.jpg
Content-Type: image/jpeg
Content-Encoding: gzip
Cache-Control:

When requesting the file with the request header "Accept-Encoding: gzip", the server doesn't respond with a "Content-Encoding: gzip" header and the image is NOT compressed/gzipped; in other words, it forces decompressive transcoding incorrectly. Notice the header "Warning: 214 UploadServer gunzipped", which I suppose is how Google informs clients that it actually performed decompression.

With -H "Accept-Encoding: gzip":

# removed unnecessary lines to save space
curl -v -H 'Accept-Encoding: gzip' "https://storage.googleapis.com/cedar-league-184821.appspot.com/1/2017/11c/CT-jpg-CE-gzip-CC-empty.jpg" > should_be_compressed_image.jpg

> GET /cedar-league-184821.appspot.com/1/2017/11c/FullSizeRender-1.jpg HTTP/1.1
> Host: storage.googleapis.com
> User-Agent: curl/7.54.0
> Accept: */*
> Accept-encoding: gzip
>
< HTTP/1.1 200 OK
< X-GUploader-UploadID: AEnB2UpLqw1hkA5MngrLr70nY3nBRZTAmG_432r5LaRipy7nKN4vVzoWlCSoW2220v1tER_10RQ-jMFF7h3tndchkWwVVT46nA
< x-goog-generation: 1510118544142300
< x-goog-metageneration: 2
< x-goog-stored-content-encoding: gzip
< x-goog-stored-content-length: 315737
< Content-Type: image/jpeg
< Content-Language: en
< x-goog-hash: crc32c=/tU2vQ==
< x-goog-hash: md5=BkUrq+p+go4s4q1dvK4O3w==
< x-goog-storage-class: STANDARD
< Warning: 214 UploadServer gunzipped
< Content-Length: 319480
< Server: UploadServer
< Cache-Control: public, max-age=3600
< Age: 1808

Without -H "Accept-Encoding: gzip":

curl -v "https://storage.googleapis.com/cedar-league-184821.appspot.com/1/2017/11c/CT-jpg-CE-gzip-CC-empty.jpg" > should_be_decompressed_image.jpg
> GET /cedar-league-184821.appspot.com/1/2017/11c/CT-jpg-CE-gzip-CC-empty.jpg HTTP/1.1
> Host: storage.googleapis.com
> User-Agent: curl/7.54.0
> Accept: */*
>
< HTTP/1.1 200 OK
< X-GUploader-UploadID: AEnB2Uq6Ztc3zslZqP8uoBlZPzIS1l92hGTfJjdwIhP3t5V1j2ll6sFRhj3vlqVndnlgcKZM82fjskm1tNWd5N9i1V1E-qsk6sswtoqxB8V0_PJ_lA11fB4
< x-goog-generation: 1510130246520360
< x-goog-metageneration: 1
< x-goog-stored-content-encoding: gzip
< x-goog-stored-content-length: 315737
< Content-Type: image/jpeg
< Content-Language: en
< x-goog-hash: crc32c=/tU2vQ==
< x-goog-hash: md5=BkUrq+p+go4s4q1dvK4O3w==
< x-goog-storage-class: STANDARD
< Warning: 214 UploadServer gunzipped
< Content-Length: 319480
< Server: UploadServer
< Cache-Control: public, max-age=3600
< Age: 1056

thobrla (Contributor) commented Nov 9, 2017

Thanks for the detailed reproduction of the issue. I'm discussing with the GCS team internally, and we may be able to stop the service from removing a layer of compression when Accept-Encoding: gzip is present. If the fix works, it will remove the need for gsutil to add Cache-Control: no-transform.

I'll let you know when I have more details.

nojvek (Author) commented Nov 9, 2017 via email

mikeheme commented Nov 9, 2017

awesome @thobrla! thanks!

mikeheme commented Dec 8, 2017

@thobrla any updates on this?

thobrla (Contributor) commented Dec 8, 2017

Update: the work to stop the GCS service from unnecessarily removing a layer of compression is understood, but it is a larger effort than the GCS team originally thought. Part of that work is complete, but finishing the remainder isn't on the team's priorities for the near future.

Until that changes, we'll have to live with this behavior in clients. I think the Cache-Control behavior of gsutil is the best default (given that it can be disabled with setmeta if necessary).

Leaving this issue open to track it in case the GCS service implements the fix.

nojvek (Author) commented Dec 8, 2017 via email

sdkks commented Dec 14, 2017

When I remove the header with 'setmeta' by passing only -h "Cache-Control", on the client side I see that Google's CDN (which uses this bucket as a backend) also sends the same header with a null value. By default, when the header is not set, I used to see 'public, max-age=3600'. I'm guessing we need to track this with the GCS team, too...
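
(For reference, the setmeta form sdkks appears to describe; a header passed with no value is removed from the object. Bucket/object names are placeholders:)

gsutil setmeta -h "Cache-Control" gs://somebucket/foo.txt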

dalbani commented Jan 6, 2018

Hi, I discovered this bug report by way of googleapis/google-cloud-python#4227 and googleapis/google-resumable-media-python#34.
I was getting a strange "Checksum mismatch while downloading" message when downloading GCS blobs using the official Python library.
(Although the issue is supposed to be fixed, it still doesn't work for me, by the way.)
But regardless of the Python-specific issue, I am curious what you think of the following request logs:

Retrieving a blob with the Python library:

> GET /download/storage/v1/b/xxx/o/binary%2F00da00d2ddc203a245753a8c1276c0d398341abd?alt=media HTTP/1.1
> Host: www.googleapis.com
> Connection: keep-alive
> accept-encoding: gzip
> Accept: */*
> User-Agent: python-requests/2.18.4
> authorization: Bearer xxx
< HTTP/1.1 200 OK
< X-GUploader-UploadID: xxx
< Content-Type: image/jpeg
< Content-Disposition: attachment
< ETag: W/COmetKubk9gCEAE=
< Vary: Origin
< Vary: X-Origin
< X-Goog-Generation: 1513588173639529
< X-Goog-Hash: crc32c=fkoHfw==,md5=Lbe8pGpkq2fctqveModTlw==
< X-Goog-Metageneration: 1
< X-Goog-Storage-Class: REGIONAL
< Cache-Control: no-cache, no-store, max-age=0, must-revalidate
< Pragma: no-cache
< Expires: Mon, 01 Jan 1990 00:00:00 GMT
< Date: Sat, 06 Jan 2018 20:02:10 GMT
< Warning: 214 UploadServer gunzipped
< Content-Length: 368869
< Server: UploadServer
< Alt-Svc: hq=":443"; ma=2592000; quic=51303431; quic=51303339; quic=51303338; quic=51303337; quic=51303335,quic=":443"; ma=2592000; v="41,39,38,37,35"

See that Warning: 214 UploadServer gunzipped header in the response.
But the problem here is that the blob was specifically uploaded with Cache-Control: no-transform.
Here are the details of the blob:

{
  "kind": "storage#object", 
  "contentType": "image/jpeg", 
  "name": "binary/00da00d2ddc203a245753a8c1276c0d398341abd", 
  "timeCreated": "2017-12-18T09:09:33.635Z", 
  "generation": "1513588173639529", 
  "md5Hash": "Lbe8pGpkq2fctqveModTlw==", 
  "bucket": "xxx", 
  "updated": "2017-12-18T09:09:33.635Z", 
  "contentEncoding": "gzip", 
  "crc32c": "fkoHfw==", 
  "metageneration": "1", 
  "mediaLink": "https://www.googleapis.com/download/storage/v1/b/xxx/o/binary%2F00da00d2ddc203a245753a8c1276c0d398341abd?generation=1513588173639529&alt=media", 
  "storageClass": "REGIONAL", 
  "timeStorageClassUpdated": "2017-12-18T09:09:33.635Z", 
  "cacheControl": "no-transform", 
  "etag": "COmetKubk9gCEAE=", 
  "id": "xxx/binary/00da00d2ddc203a245753a8c1276c0d398341abd/1513588173639529", 
  "selfLink": "https://www.googleapis.com/storage/v1/b/xxx/o/binary%2F00da00d2ddc203a245753a8c1276c0d398341abd", 
  "size": "368849"
}

And, sure enough, retrieving the blob using the public URL works as expected according to the documentation:

$ curl -v -O https://storage.googleapis.com/xxx/binary/00da00d2ddc203a245753a8c1276c0d398341abd
> GET /xxx/binary/00da00d2ddc203a245753a8c1276c0d398341abd HTTP/1.1
> Host: storage.googleapis.com
> User-Agent: curl/7.47.0
> Accept: */*
< HTTP/1.1 200 OK
< X-GUploader-UploadID: xxx
< Date: Sat, 06 Jan 2018 20:03:39 GMT
< Cache-Control: no-transform
< Expires: Sun, 06 Jan 2019 20:03:39 GMT
< Last-Modified: Mon, 18 Dec 2017 09:09:33 GMT
< ETag: "2db7bca46a64ab67dcb6abde32875397"
< x-goog-generation: 1513588173639529
< x-goog-metageneration: 2
< x-goog-stored-content-encoding: gzip
< x-goog-stored-content-length: 368849
< Content-Type: image/jpeg
< Content-Encoding: gzip
< x-goog-hash: crc32c=fkoHfw==
< x-goog-hash: md5=Lbe8pGpkq2fctqveModTlw==
< x-goog-storage-class: REGIONAL
< Accept-Ranges: bytes
< Server: UploadServer
< Alt-Svc: hq=":443"; ma=2592000; quic=51303431; quic=51303339; quic=51303338; quic=51303337; quic=51303335,quic=":443"; ma=2592000; v="41,39,38,37,35"
< Transfer-Encoding: chunked

Out of curiosity, I tried playing with the Accept-Encoding header, but that made no difference.
Setting an Accept-Encoding: gzip header in the request to the public URL returns the same expected result.
And when disabling the Accept-Encoding: gzip header in the code of the Python library, "www.googleapis.com/download/storage/..." still insists on returning decompressed content.

So is there some UploadServer black magic going on here when making a request via "www.googleapis.com/download/storage/..."?!

dalbani commented Jan 6, 2018

Black magic seems to be the appropriate term, because it looks like the content type of the uploaded blob has an effect on this unexpected decompression.
I could for example determine that the following content types trigger the "bug":

  • application/x-rar-compressed
  • image/{jpeg,png,gif} (but not image/xyz...)
  • video/mpeg

Could someone from Google report on that? Thanks.

thobrla (Contributor) commented Jan 11, 2018

@dalbani : see the documentation at https://cloud.google.com/storage/docs/transcoding#gzip-gzip on Google Cloud Storage's current behavior regarding compressible content-types. Per my comments above, the work to stop GCS from removing a layer of compression isn't currently prioritized.

nojvek (Author) commented Jan 11, 2018 via email

dalbani commented Jan 11, 2018

@thobrla: thanks for your response, but I had already looked at that documentation,
especially where it says that the Cache-Control: no-transform header should force GCS to never gunzip the data.
Yet it obviously does gunzip it, as shown in my logs above.
Recap: GCS doesn't behave as documented for a particular HTTP endpoint, as far as I could test.

thobrla (Contributor) commented Jan 11, 2018

@dalbani Thanks for the report. I can't reproduce this issue, though - I tried out your scenario with a Content-Type: image/jpeg, Content-Encoding: gzip, Cache-Control: no-transform object and did not see an unzipped response. Can you construct curl requests (with auth headers and bucket name omitted) that create an object reproducing this issue?

nojvek (Author) commented Feb 21, 2018

I am seeing buggy behaviour too, where setting Cache-Control via setmeta overrides the gzipping functionality.

gsutil cp -Z foo.min.js gs://cdn-bucket/foo.min.js

accept-ranges:bytes
access-control-allow-origin:*
alt-svc:clear
cache-control:no-transform <---- undesired
content-encoding:gzip <----- correct
content-language:en
content-length:7074
content-type:application/javascript
date:Wed, 21 Feb 2018 01:21:27 GMT

After gsutil setmeta -h "Cache-Control: public,max-age=31536000" gs://cdn-bucket/foo.min.js

accept-ranges:bytes
access-control-allow-origin:*
age:127807
alt-svc:clear
cache-control:public,max-age=31536000
content-language:en
content-length:31684 <------- No content encoding gzip :(
content-type:text/css
date:Mon, 19 Feb 2018 14:01:02 GMT
etag:"691cfcaa0eb97e1f3c7d4b1687b37834"
expires:Tue, 19 Feb 2019 14:01:02 GMT
last-modified:Tue, 24 Oct 2017 00:48:44 GMT
server:UploadServer
status:200

So @thobrla, it seems your recommendation to run setmeta afterwards does not work.

dalbani commented Mar 2, 2018

@thobrla
Any library should be able to create a "problematic" blob, but I've created an all-in-one script to show the behaviour I was talking about: https://gist.github.com/dalbani/ae837a0f00b395f875c74646eda5bfac.
It shows the difference between retrieving a blob via https://www.googleapis.com/download/storage/... and via storage.googleapis.com, including the strange effect of some content types.

TL;DR: the so-called "UploadServer" treats some content types differently than others when downloading resources via https://www.googleapis.com/download/storage/....

For example, let's say I run the script with an empty, gzip'ed 32x32 JPEG file:

$ convert -size 32x32 xc:white /tmp/32x32.jpg
$ cat /tmp/32x32.jpg | gzip -9 > /tmp/32x32.jpg.gz
$ ls -l /tmp/32x32.jpg*
-rw-rw-r-- 1 me me 165 Mar  2 23:08 /tmp/32x32.jpg
-rw-rw-r-- 1 me me 137 Mar  2 23:08 /tmp/32x32.jpg.gz

If I run my script with ./test-gcs.sh 32x32.jpg /tmp/32x32.jpg.gz image/jpeg, I get the following output:

{
  "kind": "storage#object",
  "id": "xyz/32x32.jpg/1520033236514750",
  "selfLink": "https://www.googleapis.com/storage/v1/b/xyz/o/32x32.jpg",
  "name": "32x32.jpg",
  "bucket": "xyz",
  "generation": "1520033236514750",
  "metageneration": "1",
  "contentType": "image/jpeg",
  "timeCreated": "2018-03-02T23:27:16.513Z",
  "updated": "2018-03-02T23:27:16.513Z",
  "storageClass": "REGIONAL",
  "timeStorageClassUpdated": "2018-03-02T23:27:16.513Z",
  "size": "137",
  "md5Hash": "rV9N/0RX6QgkCjpDIi2Lyw==",
  "mediaLink": "https://www.googleapis.com/download/storage/v1/b/xyz/o/32x32.jpg?generation=1520033236514750&alt=media",
  "contentEncoding": "gzip",
  "cacheControl": "no-transform",
  "acl": [
    ...,
    {
      "kind": "storage#objectAccessControl",
      "id": "xyz/32x32.jpg/1520033236514750/allUsers",
      "selfLink": "https://www.googleapis.com/storage/v1/b/xyz/o/32x32.jpg/acl/allUsers",
      "bucket": "xyz",
      "object": "32x32.jpg",
      "generation": "1520033236514750",
      "entity": "allUsers",
      "role": "READER",
      "etag": "CL7v74jlztkCEAE="
    }
  ],
  "owner": {
    "entity": "..."
  },
  "crc32c": "ttHcwA==",
  "etag": "CL7v74jlztkCEAE="
}

==> https://www.googleapis.com/download/storage/v1/b/xyz/o/32x32.jpg?generation=1520033236514750&alt=media <== (No "Accept-Encoding" header)
HTTP/1.1 200 OK
X-GUploader-UploadID: AEnB2Uq-q5SJGTO2-LXhvc-e9LX6AVG-i4TAtNVBs9oI-OtywZ-oyUGrb_EHAT8qUbXC6lDKB9NR-1Oy_odHur7Ndx6Kq45XDg
Content-Type: image/jpeg
Content-Disposition: attachment
ETag: W/CL7v74jlztkCEAE=
Vary: Origin
Vary: X-Origin
X-Goog-Generation: 1520033236514750
X-Goog-Hash: crc32c=ttHcwA==,md5=rV9N/0RX6QgkCjpDIi2Lyw==
X-Goog-Metageneration: 1
X-Goog-Storage-Class: REGIONAL
Cache-Control: no-cache, no-store, max-age=0, must-revalidate
Pragma: no-cache
Expires: Mon, 01 Jan 1990 00:00:00 GMT
Date: Fri, 02 Mar 2018 23:27:16 GMT
Warning: 214 UploadServer gunzipped
Content-Length: 165
Server: UploadServer

165

==> https://www.googleapis.com/download/storage/v1/b/xyz/o/32x32.jpg?generation=1520033236514750&alt=media <== ("Accept-Encoding: gzip" header)
HTTP/1.1 200 OK
X-GUploader-UploadID: AEnB2UqElToltlztZmVJ5kHTg7-MHRNwxHru-o1ta1kfylxKEQ66zZ8JU36gsz0nqgA8Jrmx86B7MJpUJ1EjVsfIWHOve-3Q4w
Content-Type: image/jpeg
Content-Disposition: attachment
Vary: X-Origin
X-Goog-Generation: 1520033236514750
X-Goog-Hash: crc32c=ttHcwA==,md5=rV9N/0RX6QgkCjpDIi2Lyw==
X-Goog-Metageneration: 1
X-Goog-Storage-Class: REGIONAL
Cache-Control: no-cache, no-store, max-age=0, must-revalidate
Pragma: no-cache
Expires: Mon, 01 Jan 1990 00:00:00 GMT
Date: Fri, 02 Mar 2018 23:27:16 GMT
Server: UploadServer
Accept-Ranges: none
Vary: Origin,Accept-Encoding
Transfer-Encoding: chunked

165

==> https://storage.googleapis.com/xyz/32x32.jpg <==
HTTP/1.1 200 OK
X-GUploader-UploadID: AEnB2Ur42wt38EuxIxQo6FPFRMCYav2YPQpvxw1GFE-6vq4jpuQNbgU1r5vXN_JjzYzoRCwNgtwldZUF9JenNyYPE0oDaW9_Vg
Date: Fri, 02 Mar 2018 23:27:16 GMT
Cache-Control: no-transform
Expires: Sat, 02 Mar 2019 23:27:16 GMT
Last-Modified: Fri, 02 Mar 2018 23:27:16 GMT
ETag: "ad5f4dff4457e908240a3a43222d8bcb"
x-goog-generation: 1520033236514750
x-goog-metageneration: 1
x-goog-stored-content-encoding: gzip
x-goog-stored-content-length: 137
Content-Type: image/jpeg
Content-Encoding: gzip
x-goog-hash: crc32c=ttHcwA==
x-goog-hash: md5=rV9N/0RX6QgkCjpDIi2Lyw==
x-goog-storage-class: REGIONAL
Accept-Ranges: bytes
Server: UploadServer
Transfer-Encoding: chunked

137

See that both requests to https://www.googleapis.com/download/... return gunzip'ed data, without mentioning the encoding (and with Warning: 214 UploadServer gunzipped only when no Accept-Encoding: gzip header was present in the request?!).
Summary: the responses are 165, 165 and 137 bytes respectively.

This transparent gunzip causes problems with, for example, the Python library, which sends an Accept-Encoding: gzip header and thus falls into the second case.

...
google.resumable_media.common.DataCorruption: Checksum mismatch while downloading:

  https://www.googleapis.com/download/storage/v1/b/xyz/o/32x32.jpg?alt=media

The X-Goog-Hash header indicated an MD5 checksum of:

  rV9N/0RX6QgkCjpDIi2Lyw==

but the actual MD5 checksum of the downloaded contents was:

  Dpx7jzPpJiEyPwovSJL/fA==

Now, let's compare with the output of the same command but with a different content type, e.g. ./test-gcs.sh 32x32.jpg /tmp/32x32.jpg.gz image/xyz:

{
  "kind": "storage#object",
  "id": "xyz/32x32.jpg/1520033806814916",
  "selfLink": "https://www.googleapis.com/storage/v1/b/xyz/o/32x32.jpg",
  "name": "32x32.jpg",
  "bucket": "xyz",
  "generation": "1520033806814916",
  "metageneration": "1",
  "contentType": "image/xyz",
  "timeCreated": "2018-03-02T23:36:46.813Z",
  "updated": "2018-03-02T23:36:46.813Z",
  "storageClass": "REGIONAL",
  "timeStorageClassUpdated": "2018-03-02T23:36:46.813Z",
  "size": "137",
  "md5Hash": "rV9N/0RX6QgkCjpDIi2Lyw==",
  "mediaLink": "https://www.googleapis.com/download/storage/v1/b/xyz/o/32x32.jpg?generation=1520033806814916&alt=media",
  "contentEncoding": "gzip",
  "cacheControl": "no-transform",
  "acl": [
    ...
    {
      "kind": "storage#objectAccessControl",
      "id": "xyz/32x32.jpg/1520033806814916/allUsers",
      "selfLink": "https://www.googleapis.com/storage/v1/b/xyz/o/32x32.jpg/acl/allUsers",
      "bucket": "xyz",
      "object": "32x32.jpg",
      "generation": "1520033806814916",
      "entity": "allUsers",
      "role": "READER",
      "etag": "CMSd6JjnztkCEAE="
    }
  ],
  "owner": {
    "entity": "..."
  },
  "crc32c": "ttHcwA==",
  "etag": "CMSd6JjnztkCEAE="
}

==> https://www.googleapis.com/download/storage/v1/b/xyz/o/32x32.jpg?generation=1520033806814916&alt=media <== (No "Accept-Encoding" header)
HTTP/1.1 200 OK
X-GUploader-UploadID: AEnB2Uq8B_5VWTlAYbs6pPWsBb2Ap7DJ2gyVi_ZUWA6noZ0m7dflv9hn1siBbwQGRkOk0g6CMw_j1eOlRmzpJoBylLX-FupKEA
Content-Type: image/xyz
Content-Disposition: attachment
ETag: W/CMSd6JjnztkCEAE=
Vary: Origin
Vary: X-Origin
X-Goog-Generation: 1520033806814916
X-Goog-Hash: crc32c=ttHcwA==,md5=rV9N/0RX6QgkCjpDIi2Lyw==
X-Goog-Metageneration: 1
X-Goog-Storage-Class: REGIONAL
Cache-Control: no-cache, no-store, max-age=0, must-revalidate
Pragma: no-cache
Expires: Mon, 01 Jan 1990 00:00:00 GMT
Date: Fri, 02 Mar 2018 23:36:46 GMT
Warning: 214 UploadServer gunzipped
Content-Length: 165
Server: UploadServer

165

==> https://www.googleapis.com/download/storage/v1/b/xyz/o/32x32.jpg?generation=1520033806814916&alt=media <== ("Accept-Encoding: gzip" header)
HTTP/1.1 200 OK
X-GUploader-UploadID: AEnB2Uq63b0MPQcdUOVj4GxXRJqlXifJTg_6xhUjZe8KVKb6hsRxXGo1VbmIUraY2EjQ6WpMtdhJysQE8AyorbF_QkelHoGcx6wq4vsyX9WNBlPTGoqisMY
Content-Type: image/xyz
Content-Disposition: attachment
Content-Encoding: gzip
ETag: CMSd6JjnztkCEAE=
Vary: Origin
Vary: X-Origin
X-Goog-Generation: 1520033806814916
X-Goog-Hash: crc32c=ttHcwA==,md5=rV9N/0RX6QgkCjpDIi2Lyw==
X-Goog-Metageneration: 1
X-Goog-Storage-Class: REGIONAL
Cache-Control: no-cache, no-store, max-age=0, must-revalidate
Pragma: no-cache
Expires: Mon, 01 Jan 1990 00:00:00 GMT
Date: Fri, 02 Mar 2018 23:36:46 GMT
Server: UploadServer
Transfer-Encoding: chunked

137

== https://storage.googleapis.com/xyz/32x32.jpg ==
HTTP/1.1 200 OK
X-GUploader-UploadID: AEnB2UqIGXYH1gXIkhJAYRHfGjzqYExKxvN3MunX9IE9iZdJDq0Uwkn-CFRxwXjsccCuFKwO5izsEdtEc2R3u5DzrrRVg2ikJA
Date: Fri, 02 Mar 2018 23:36:47 GMT
Cache-Control: no-transform
Expires: Sat, 02 Mar 2019 23:36:47 GMT
Last-Modified: Fri, 02 Mar 2018 23:36:46 GMT
ETag: "ad5f4dff4457e908240a3a43222d8bcb"
x-goog-generation: 1520033806814916
x-goog-metageneration: 1
x-goog-stored-content-encoding: gzip
x-goog-stored-content-length: 137
Content-Type: image/xyz
Content-Encoding: gzip
x-goog-hash: crc32c=ttHcwA==
x-goog-hash: md5=rV9N/0RX6QgkCjpDIi2Lyw==
x-goog-storage-class: REGIONAL
Accept-Ranges: bytes
Server: UploadServer
Transfer-Encoding: chunked

137

Summary: the responses are 165, 137 (not 165 as above!) and 137 bytes respectively.
In the second request, no automatic gunzip took place in "UploadServer", purely because of the content type?!

And here the Python library has no problem downloading the blob and checking the MD5 checksum.

Although this post is very long, I hope it is clear enough for you to pinpoint and eventually fix the issue.
Thanks for your attention!

thobrla (Contributor) commented Mar 3, 2018

Thanks @dalbani - I think at this point the problem is well understood and we're waiting for the Cloud Storage team to prioritize a fix (but to my knowledge it's not currently prioritized).

yonran commented May 30, 2018

The Object Transcoding documentation and gsutil cp documentation should probably be modified to indicate that gsutil cp -z disables decompressive transcoding.

acoulton commented:
I would go further than @yonran and say the documentation should definitely be modified: this is a really frustrating omission. Also, for publishing static web assets, it's really frustrating to have no flag/option to disable this behaviour. I never need to download these files again with gsutil, so the checksum concern isn't an issue for me; I just want to gzip them on the way up and have GCS then serve them to clients in line with the documentation.

starsandskies (Contributor) commented:
Closing this issue - documentation is now updated on cloud.google.com, and I'm backfilling the source files here in GitHub to match.

acoulton commented:
@starsandskies great that the docs have been updated - thanks for that. I'm not sure it's valid to close this issue, though.

When uploading more than a few files - e.g. for web assets / static sites - it is extremely inefficient to have to run gsutil -m cp -z -r $dir gs://$bucket/$path and then a separate recursive gsutil -m setmeta -h "Cache-Control:public, max-age=.." over gs://$bucket/$path afterwards to fix the cache header.

That adds a fair amount of time overhead and, more importantly, creates the risk of objects existing in the bucket in an inconsistent / unexpected state if the second setmeta command fails for any reason.

If we specify gsutil cp -r -z -h "Cache-Control:public, max-age=..." then at the very least gsutil should emit a runtime warning that our explicit -h value has been ignored / overwritten. But it would be much better if gsutil respected an explicit command-line value in preference to the default. Or, if that's really not possible for backward-compatibility reasons, then an explicit flag to disable this behaviour.

FWIW, although the docs are now clearer, I think it's still not obvious that at present the Cache-Control:no-transform completely overwrites any Cache-Control header set on the command line.
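
(As an illustration of the two approaches acoulton describes, with placeholder bucket, path and extensions; the first, single-step form is the one whose Cache-Control value gsutil currently modifies:)

# Desired single step (gsutil currently adds no-transform to the given Cache-Control):
gsutil -m -h "Cache-Control:public, max-age=31536000" cp -r -z js,css,html ./dist gs://cdn-bucket/assets
# Current two-step workaround:
gsutil -m cp -r -z js,css,html ./dist gs://cdn-bucket/assets
gsutil -m setmeta -h "Cache-Control:public, max-age=31536000" "gs://cdn-bucket/assets/**"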

starsandskies (Contributor) commented:
I definitely think there are improvements to the tool that could be made (and, FWIW, the push to fix the underlying behavior that necessitates the -z behavior saw renewed interest at the end of 2020). I've no objection to re-opening this (I assume you're able to, though let me know if not - I'm not a GitHub expert by any stretch of the imagination), but this thread has gotten quite long and meandering. I'd recommend taking the relevant points and making a fresh issue that cuts away the excess.

acoulton commented:
@starsandskies thanks for the response - no, I can't reopen; only core contributors/admins can reopen issues on GitHub.

I couldn't see a branch / pull request relevant to the underlying behaviour that necessitates the -z behaviour. Do you mean server-side work on GCS, as mentioned up the thread, or is there an issue / pull request open for that elsewhere that I could reference / add to?

I'm happy to make a new issue, though I think the issue description and first couple of comments here (e.g. #480 (comment)) capture the problem, and IMO there's an advantage to keeping this issue alive since there are already people watching it.

But if you'd prefer a new issue I'll open one and reference this.

starsandskies (Contributor) commented:
Ah, in that case, I'll reopen this.

To answer your question, my understanding is that what blocks a true fix is on the server side, and that it affects other tools as well, such as the client libraries (see, for example, googleapis/nodejs-storage#709).

@starsandskies starsandskies reopened this Jan 29, 2021
acoulton commented Feb 2, 2021

@starsandskies thanks :)

Yes, I see the problem on that nodejs-storage issue. I think, though, that it breaks into two use cases:

  • using client libraries / gsutil to download files that have already been uploaded, where I can see decompressive transcoding is a problem for validating the checksums. I appreciate that's probably blocked on a server-side fix.

  • using gsutil to upload files to a one-way bucket used for e.g. static website / asset hosting, where end clients access the files over HTTP, so checksum validation on download is not a problem but the forced override of cache headers at upload time is.

AFAICS the second use case was working without any problems until the gsutil behaviour was changed to fix the first case.

The key thing is that it's obviously still valid to have gzipped files in the bucket with decompressive transcoding enabled - nothing stops you setting your own Cache-Control header after the initial upload. That obviously fixes use case 2 but breaks use case 1. That being the case, I don't think there's any good reason why gsutil should silently prevent you from doing it in a single call, even if the default behaviour stays as it is now.

MrTrustworthy commented:
Since we just stumbled upon this issue when trying to move to GCP/GCS for our CDN assets, and this thread was very helpful in figuring out why, I wanted to leave a piece of feedback from the user side on this topic.

There are many responses (I assume from maintainers/developers of GCP/gsutil) suggesting that adding the no-transform setting by default, with no way to disable it, is the best possible option. Example:

So if we add such an option to drop no-transform, we're back in the situation we were in before 439573e where certain files uploaded by gsutil cannot then be downloaded by gsutil, and this seems worse than not being downloadable by a different client.

I just want to say that, as a user of GCP, I harshly disagree with that assessment.

From my perspective, the actual state of things is that gsutil is simply bugged and won't allow you to download files if they were uploaded via cp -z/-Z. This is not nice, but acceptable - tools have bugs sometimes, and they need to be prioritised and fixed. But instead, the cp behaviour was modified in a way that breaks CDN users and is very hard to detect in the first place.

To an outside user of GCP, it seems like the respective team isn't interested in fixing the bug, so it's hiding the issue behind a different and harder-to-notice issue, just so it's technically not broken on their end by their own definition. As a user of GCP, I don't care whether gsutil technically works correctly; I care whether my entire GCP setup works correctly - and it currently doesn't.

To be clear: the default behaviour of gsutil cp -z/-Z, when used for CDN purposes (which is probably the main reason people use -z/-Z in the first place), is to silently break the HTTP spec. After uploading our assets, our pages suddenly delivered compressed assets even to clients that didn't support them. This is simply wrong. If the CDN were automatically configured to send the correct (406) response in those cases, it would be somewhat acceptable - but it isn't.

In my personal view, gsutil should simply be allowed to break when trying to download compressed files, maybe for now with a nice error message explaining why. If the download bug is severe enough, then a fix should be prioritised accordingly. But silently breaking the HTTP spec for CDN users to hide the download bug is not acceptable, IMHO.

frankyn (Member) commented Nov 19, 2021

Short update: as of the end of this week, the Cloud Storage API will always respect Accept-Encoding: gzip. The underlying issue was that GCS would decompress data even when decompression was not requested.

The change was rolled back, so we will need to follow up again when we have an update. Apologies, I jinxed it.
