feat: safely resume interrupted downloads #294

cojenco · 2022-01-18T19:11:17Z

If a retryable error occurs mid-download, the retry resumes sending data to the stream from offset_of_last_byte_received rather than from the beginning of the file. This resolves the data integrity issues that occur when a retry restarts the download.

For interrupted downloads, safely resume by reading from offset_of_last_byte_received using a ranged GET request, and include the object generation as a URL query parameter to make sure the same object content is requested.
Adds support for download instances to track information such as object_generation and bytes downloaded.
Adds tests.
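As a rough illustration of the first point, a resumed request could be assembled along these lines. This is a minimal sketch, not the code in this PR: `build_resume_request` and its parameters are hypothetical names, and the example URL is made up. It only shows an open-ended `Range` header combined with a `generation` query parameter.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def build_resume_request(media_url, bytes_downloaded, object_generation=None):
    """Return (url, headers) for a ranged GET that resumes a download.

    Hypothetical helper for illustration; not part of the library's API.
    """
    headers = {}
    if bytes_downloaded > 0:
        # Open-ended range: everything from the first byte not yet received.
        headers["Range"] = f"bytes={bytes_downloaded}-"

    scheme, netloc, path, query, fragment = urlsplit(media_url)
    params = parse_qsl(query, keep_blank_values=True)
    has_generation = any(key == "generation" for key, _ in params)
    if object_generation is not None and not has_generation:
        # Pin the retry to the same object generation as the first attempt.
        params.append(("generation", str(object_generation)))

    url = urlunsplit((scheme, netloc, path, urlencode(params), fragment))
    return url, headers


url, headers = build_resume_request(
    "https://storage.googleapis.com/download/storage/v1/b/bucket/o/obj?alt=media",
    bytes_downloaded=4,
    object_generation=1641234567890,
)
print(url)
print(headers)
```

Pinning the generation matters because an object overwritten between attempts would otherwise let the retry splice bytes from two different versions into one stream.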

Fixes #284

tritone

Couple thoughts, generally looking really good to me!

General comment, can you give some more details about how you tested this out with the emulator?

google/resumable_media/_helpers.py

tritone · 2022-01-22T19:00:26Z

google/resumable_media/requests/download.py

+        # data corruption for that byte range alone.
+        if self._expected_checksum is None and self._checksum_object is None:
+            # `_get_expected_checksum()` may return None even if a checksum was
+            # requested, in which case it will emit an info log _MISSING_CHECKSUM.


What causes this case to happen? Transcoding?

This is due to retried requests being range requests. For range requests, as noted here, there's no way to detect data corruption for that byte range alone.

Therefore, here we retrieve the expected checksum/checksum object only once for the initial download request. Then we calculate and validate the checksum when the download completes.
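A minimal sketch of that idea, assuming the expected checksum is a base64-encoded MD5 digest captured from the initial response: one hash object is kept for the whole download and fed the bytes from every attempt, and the comparison runs only once at the end. `RunningChecksum` is a hypothetical name, not a class in this library.

```python
import base64
import hashlib


class RunningChecksum:
    """Accumulates a checksum across the initial request and any ranged retries."""

    def __init__(self, expected_md5_b64):
        # Captured once, from the initial (non-range) response.
        self.expected = expected_md5_b64
        self._hash = hashlib.md5()

    def update(self, chunk):
        # Called for every chunk written to the stream, on any attempt.
        self._hash.update(chunk)

    def validate(self):
        # Called once, when the whole download has completed.
        actual = base64.b64encode(self._hash.digest()).decode("utf-8")
        if actual != self.expected:
            raise ValueError(
                f"checksum mismatch: expected {self.expected}, got {actual}"
            )
        return actual


payload = b"This is a simple text file.\n"
expected = base64.b64encode(hashlib.md5(payload).digest()).decode("utf-8")

checker = RunningChecksum(expected)
checker.update(payload[:4])   # bytes received before the interruption
checker.update(payload[4:])   # bytes received by the ranged retry
checker.validate()
```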

tritone · 2022-01-22T19:02:59Z

google/resumable_media/requests/download.py

@@ -160,13 +171,48 @@ def consume(
        if self._stream is not None:
            request_kwargs["stream"] = True

+        # Assign object generation if generation is specified in the media url.


Would this happen via a user specifying a generation on the object? Were we not respecting this previously?

Yep, this would happen via a user specifying a generation on the object. Previously, we respected that only through download.media_url.

A property download._object_generation is added. It records the object generation from either (1) generation query param from the media_url, or (2) the object generation from the initial response header. This specific line of code does (1) and retrieves it from the media_url

P.S. It's tricky because only limited information is passed from python-storage to resumable-media-python. A resumable-media-python download instance knows the specified object generation only from its media_url, and the "object" itself isn't retained in a download.
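The two sources for the generation described above could be resolved roughly as follows. This is a sketch under assumptions: the function names are hypothetical, and the response header name `x-goog-generation` is assumed here rather than taken from this thread.

```python
from urllib.parse import parse_qs, urlsplit

# Assumed header name for the object generation in the response.
GENERATION_HEADER = "x-goog-generation"


def generation_from_url(media_url):
    """Case (1): the user pinned a generation in the media URL."""
    values = parse_qs(urlsplit(media_url).query).get("generation")
    return int(values[0]) if values else None


def generation_from_headers(headers):
    """Case (2): read the generation from the initial response headers."""
    value = headers.get(GENERATION_HEADER)
    return int(value) if value is not None else None


def resolve_generation(media_url, response_headers):
    """Prefer the generation from the URL, else fall back to the response."""
    generation = generation_from_url(media_url)
    if generation is None:
        generation = generation_from_headers(response_headers)
    return generation


print(resolve_generation("https://example.com/o/obj?alt=media&generation=42", {}))
print(resolve_generation("https://example.com/o/obj?alt=media", {"x-goog-generation": "7"}))
```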

google/resumable_media/requests/download.py

google/resumable_media/_helpers.py

tritone · 2022-01-22T19:10:47Z

google/resumable_media/requests/download.py


            self._process_response(result)

+            # With decompressive transcoding, GCS serves back the whole file regardless of the range request,


Wondering if this should be highlighted as a shortcoming in the decompressive transcoding docs, since not being able to resume a download may be costly.

It's mentioned in the last section of the decompressive transcoding docs. I agree we can add notes on how retries may be impacted in this case.

cojenco · 2022-01-24T18:28:22Z

Couple thoughts, generally looking really good to me!

General comment, can you give some more details about how you tested this out with the emulator?

Thanks for the review! I've added data integrity checks and test cases to the retry conformance tests (open PR). The changes in this PR are tested against the testbench using the above-mentioned tests.

Before the changes, the conformance tests fail as shown below. With the changes made in this PR, the conformance tests pass when run locally.

  File "/tmpfs/src/github/python-storage/tests/conformance/test_conformance.py", line 93, in blob_download_as_bytes
    assert stored_contents == payload
AssertionError: assert b'ThisThisThi... text file.\n' == b'This is a s... text file.\n'
  At index 4 diff: b'T' != b' '
  Full diff:
  - b'This is a simple text file.\n'
  ?       ^
  + b'ThisThisThis is a simple text file.\n'
  ?       ^^  +++++++

=========================== short test summary info ============================
FAILED tests/conformance/test_conformance.py::test-S8-storage.objects.get-blob_download_to_filename-0
FAILED tests/conformance/test_conformance.py::test-S8-storage.objects.get-client_download_blob_to_file-0
FAILED tests/conformance/test_conformance.py::test-S8-storage.objects.get-blob_download_as_bytes-0
FAILED tests/conformance/test_conformance.py::test-S8-storage.objects.get-blobreader_read-0
FAILED tests/conformance/test_conformance.py::test-S8-storage.objects.get-blob_download_as_text-0
5 failed, 555 passed, 5 skipped, 7 warnings in 287.55s (0:04:47)
nox > Command py.test -n auto --quiet tests/conformance failed with exit code 1
nox > Session conftest_retry-3.8 failed.
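The duplicated prefix in the diff above (`ThisThisThis`) is what one would expect when each retry restarts from byte 0 while the stream still holds the bytes written by earlier attempts. The following self-contained simulation is only a sketch of that failure mode: `FlakySource` is a hypothetical stand-in for the server, and the choice of two failures after 4 bytes each is an assumption made to match the log.

```python
import io

PAYLOAD = b"This is a simple text file.\n"


class FlakySource:
    """Hypothetical server stand-in: fails after 4 bytes on the first `failures` reads."""

    def __init__(self, payload, failures):
        self.payload = payload
        self.failures = failures

    def read_from(self, offset, stream):
        if self.failures > 0:
            self.failures -= 1
            stream.write(self.payload[offset:offset + 4])
            raise ConnectionError("simulated mid-download failure")
        stream.write(self.payload[offset:])


def download(source, stream, resume):
    """Retry until the source succeeds; optionally resume from the bytes already written."""
    start = stream.tell()
    while True:
        # Without resume, every attempt asks for the object from byte 0.
        offset = stream.tell() - start if resume else 0
        try:
            source.read_from(offset, stream)
            return
        except ConnectionError:
            continue


naive = io.BytesIO()
download(FlakySource(PAYLOAD, failures=2), naive, resume=False)
print(naive.getvalue())

resumed = io.BytesIO()
download(FlakySource(PAYLOAD, failures=2), resumed, resume=True)
print(resumed.getvalue())
```

The first stream ends up with the corrupted content seen in the assertion error, while the second matches the original payload.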

tritone · 2022-02-10T23:08:56Z

This is looking really good in general. Based on offline discussion I would recommend moving the decompressive transcoding feature to a TODO and moving ahead with the rest of this PR. There may be some details that take a while to resolve for transcoding and it's important that we still move ahead with the rest of this PR which is a major fix to retry logic for downloads.

andrewsg

LGTM pending @tritone comment resolutions. Thank you!

cojenco · 2022-02-11T02:09:10Z

Thanks Chris and Andrew! I've moved the transcoding feature, tracking in #303

cojenco added 11 commits November 17, 2021 17:50

fix: add offset of last byte received to retry a streaming download

3189ac4

add helper method _parse_generation_header

4a37b82

add some object generation related logic

2b14f44

revise object genertion helper method

b1895f9

fix url variable scope

c04d8ba

handle special cases with decompressive transcoding

fee2696

fix helper method

1e81eb7

move to _helpers

5dea6ab

add support to safely resume interrupted raw downloads

8b09434

add unit tests

c29efb2

add more tests

4967a4e

product-auto-label bot added the api: storage Issues related to the googleapis/google-resumable-media-python API. label Jan 18, 2022

Merge branch 'main' into midstream-retries

c4591d8

parthea mentioned this pull request Jan 18, 2022

Fix: Reset stream on retries #285

Closed

cojenco marked this pull request as ready for review January 19, 2022 17:37

cojenco requested review from a team as code owners January 19, 2022 17:37

tritone reviewed Jan 22, 2022

cojenco added 2 commits January 27, 2022 21:45

update helper method per comments

e28d237

address comments on handling stream seek error

d579918

cojenco added 2 commits February 10, 2022 16:52

address comments on moving transcoding feature

9a04788

Merge branch 'main' into midstream-retries

188b484

andrewsg approved these changes Feb 11, 2022

tritone approved these changes Feb 11, 2022

cojenco added the owlbot:run Add this label to trigger the Owlbot post processor. label Feb 11, 2022

gcf-owl-bot bot removed the owlbot:run Add this label to trigger the Owlbot post processor. label Feb 11, 2022

cojenco merged commit b363329 into googleapis:main Feb 11, 2022

release-please bot mentioned this pull request Feb 11, 2022

chore(main): release 2.3.0 #304

Merged

This was referenced Aug 9, 2024

Bump up google-cloud-storage version to fix data corruption issue apache/beam#32135

Merged

[Bug]: [Python SDK] Data Corruption on GCS read in 2.53.0 - 2.58.0 SDKs. apache/beam#32169

Closed
