fix(bigquerystorage): resume reader connection on EOS internal error (#9994)
Conversation
Closing this PR per "Custom retry logic in GAPIC-generated clients" email thread.
We were able to reproduce this error in Python with the long-running integration tests. It surfaces as:
It appears to match the related Java code:
It's infeasible for the backend to change the status of `EOS on DATA` internal errors, so instead we check the error message to see if it's an error that is resumable. We don't want to try to resume on *all* internal errors, so inspecting the message is the best we can do.
Force-pushed from d893035 to 50ecc98.
Have one minor refactoring suggestion, otherwise it generally LGTM. We still need to resolve a merge conflict, though.
BTW, what is the best way of reproducing this locally? Is it even feasible?
I haven't figured out a way to reproduce it locally, at least not without a lot of time and network bandwidth. We get the error pretty consistently with the following sample code in our internal nightly CI:

```python
# Copyright 2019 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import concurrent.futures

from google.cloud import bigquery_storage_v1beta1
import pytest


@pytest.mark.parametrize(
    "format_",
    (
        bigquery_storage_v1beta1.enums.DataFormat.AVRO,
        bigquery_storage_v1beta1.enums.DataFormat.ARROW,
    ),
)
def test_long_running_read(bqstorage_client, project_id, format_):
    table_ref = bigquery_storage_v1beta1.types.TableReference()
    table_ref.project_id = "bigquery-public-data"
    table_ref.dataset_id = "samples"
    table_ref.table_id = "wikipedia"
    session = bqstorage_client.create_read_session(
        table_ref,
        "projects/{}".format(project_id),
        requested_streams=5,
        format_=format_,
    )
    assert len(session.streams) == 5

    def count_rows(stream):
        read_position = bigquery_storage_v1beta1.types.StreamPosition(
            stream=stream
        )
        reader = bqstorage_client.read_rows(read_position)
        row_count = 0
        for page in reader.rows(session).pages:
            row_count += page.num_items
        return row_count

    with concurrent.futures.ThreadPoolExecutor() as pool:
        row_count = sum(pool.map(count_rows, session.streams))

    assert row_count == 313797035
```
OK good, I'll try to reproduce it locally tomorrow, but if it doesn't work, I'll approve and rely on the next nightly run (the code itself LGTM now).

Edit: Since the test takes 90 minutes on a GCE instance, and the error is only reproduced roughly half of the time, it would be inefficient to test the fix in action locally (it would require multiple runs). Since the fix is not going to production immediately, we can still revert it should the nightly runs reveal that the issue persists.
Approving under the assumption that the fix is not being shipped to production immediately, and that we can revert it in case the error occurs again in one of the nightly runs.
This fixes the same error as https://issuetracker.google.com/143292803 but for Python.
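Conceptually, the fix makes the reader reopen its stream at the last delivered offset whenever an error is classified as resumable. A library-free sketch of that resume loop follows; the function name, the stub exception, and the `max_resumes` cap are all hypothetical (the real reader tracks a `StreamPosition` offset and uses its own error classification):

```python
class ResumableError(Exception):
    """Stand-in for a transient stream error (e.g. EOS on DATA)."""


def read_rows_with_resume(open_stream, max_resumes=10):
    """Yield rows, reopening the stream at the current offset on transient errors.

    ``open_stream(offset)`` must return an iterator over the rows
    starting at ``offset``; non-resumable errors propagate unchanged.
    """
    offset = 0
    resumes = 0
    while True:
        try:
            for row in open_stream(offset):
                offset += 1  # advance past each successfully delivered row
                yield row
            return  # stream finished cleanly
        except ResumableError:
            resumes += 1
            if resumes > max_resumes:
                raise
            # Loop around and reopen the stream starting at `offset`.
```

Because the offset advances only after a row is yielded, a reconnect never skips or duplicates rows, which is why the long-running row-count test above can still assert an exact total.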