fix(bigquerystorage): resume reader connection on EOS internal error (#9994)
Conversation
Closing this PR per "Custom retry logic in GAPIC-generated clients" email thread.
We were able to reproduce this error in Python with the long-running integration tests. It surfaces as:
It appears to match the related Java code:
It's infeasible for the backend to change the status of `EOS on DATA` internal errors, so instead we check the error message to see if it's an error that is resumable. We don't want to try to resume on *all* internal errors, so inspecting the message is the best we can do.
Force-pushed from d893035 to 50ecc98.
Have one minor refactoring suggestion, otherwise it generally LGTM. We still need to resolve a merge conflict, though.
BTW, what is the best way of reproducing this locally? Is it even feasible?
I haven't figured out a way to reproduce it locally, at least not without a lot of time and network bandwidth. We get the error pretty consistently with the following sample code in our internal nightly CI:

```python
# Copyright 2019 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import concurrent.futures

from google.cloud import bigquery_storage_v1beta1
import pytest


@pytest.mark.parametrize(
    "format_",
    (
        bigquery_storage_v1beta1.enums.DataFormat.AVRO,
        bigquery_storage_v1beta1.enums.DataFormat.ARROW,
    ),
)
def test_long_running_read(bqstorage_client, project_id, format_):
    table_ref = bigquery_storage_v1beta1.types.TableReference()
    table_ref.project_id = "bigquery-public-data"
    table_ref.dataset_id = "samples"
    table_ref.table_id = "wikipedia"
    session = bqstorage_client.create_read_session(
        table_ref,
        "projects/{}".format(project_id),
        requested_streams=5,
        format_=format_,
    )
    assert len(session.streams) == 5

    def count_rows(stream):
        read_position = bigquery_storage_v1beta1.types.StreamPosition(
            stream=stream
        )
        reader = bqstorage_client.read_rows(read_position)
        row_count = 0
        for page in reader.rows(session).pages:
            row_count += page.num_items
        return row_count

    with concurrent.futures.ThreadPoolExecutor() as pool:
        row_count = sum(pool.map(count_rows, session.streams))

    assert row_count == 313797035
```
OK good, I'll try to reproduce it locally tomorrow, but if it doesn't work, I'll approve and rely on the next nightly run (the code itself LGTM now).

Edit: Since the test takes 90 minutes on a GCE instance, and the error is only reproduced roughly half of the time, it would be inefficient to test the fix in action locally (it would require multiple runs). Since the fix is not going to production immediately, we can still revert it should the nightly runs reveal that the issue persists.
Approving under the assumption that the fix is not being shipped to production immediately, and that we can revert it in case the error occurs again in one of the nightly runs.
This fixes the same error as https://issuetracker.google.com/143292803 but for Python.
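Conceptually, the fix makes the reader reopen its stream at the last delivered offset whenever an error is classified as resumable. A library-free sketch of that resume loop follows; the function name, the stub exception, and the `max_resumes` cap are all hypothetical (the real reader tracks a `StreamPosition` offset and uses its own error classification):

```python
class ResumableError(Exception):
    """Stand-in for a transient stream error (e.g. EOS on DATA)."""


def read_rows_with_resume(open_stream, max_resumes=10):
    """Yield rows, reopening the stream at the current offset on transient errors.

    ``open_stream(offset)`` must return an iterator over the rows
    starting at ``offset``; non-resumable errors propagate unchanged.
    """
    offset = 0
    resumes = 0
    while True:
        try:
            for row in open_stream(offset):
                offset += 1  # advance past each successfully delivered row
                yield row
            return  # stream finished cleanly
        except ResumableError:
            resumes += 1
            if resumes > max_resumes:
                raise
            # Loop around and reopen the stream starting at `offset`.
```

Because the offset advances only after a row is yielded, a reconnect never skips or duplicates rows, which is why the long-running row-count test above can still assert an exact total.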