
fix: retry RESOURCE_EXHAUSTED errors in read_rows #366

Merged
2 commits merged into googleapis:main on Jan 12, 2022

Conversation

esert-g (Contributor) commented Dec 31, 2021

The BigQuery Storage Read API will start returning retryable RESOURCE_EXHAUSTED errors in 2022 when certain concurrency limits are hit, so this PR adds code to handle them.

Tested with unit tests and system tests. System tests ran successfully on a test project that intentionally returns retryable RESOURCE_EXHAUSTED errors.
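
For context, a minimal sketch of what RetryInfo-driven retrying looks like. This is illustrative, not this PR's implementation; it assumes google-api-core >= 2.2.0 (where `ResourceExhausted` exposes parsed error details via `details`), and `retry_delay_seconds` / `read_rows_with_retry` are hypothetical names:

```python
# Illustrative sketch only, not this PR's code. Assumes google-api-core
# >= 2.2.0, where ResourceExhausted exposes parsed details via `details`.
import time

from google.api_core import exceptions
from google.rpc import error_details_pb2


def retry_delay_seconds(exc):
    """Return the server-suggested retry delay, or None if not retryable."""
    for detail in exc.details or []:
        if isinstance(detail, error_details_pb2.RetryInfo):
            delay = detail.retry_delay  # a protobuf Duration
            return delay.seconds + delay.nanos / 1e9
    return None


def read_rows_with_retry(client, name, offset=0, **read_rows_kwargs):
    """Call client.read_rows, sleeping and retrying on retryable quota errors."""
    while True:
        try:
            return client.read_rows(name, offset=offset, **read_rows_kwargs)
        except exceptions.ResourceExhausted as exc:
            delay = retry_delay_seconds(exc)
            if delay is None:
                raise  # No RetryInfo attached: treat as non-retryable.
            time.sleep(delay)
```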

@esert-g esert-g requested a review from a team December 31, 2021 00:30
@esert-g esert-g requested a review from a team as a code owner December 31, 2021 00:30
@product-auto-label product-auto-label bot added the `api: bigquerystorage` label Dec 31, 2021
@esert-g esert-g force-pushed the resource_exhausted branch 5 times, most recently from 167e3fc to 96caee6 on January 5, 2022 23:11
@tswast tswast self-requested a review January 5, 2022 23:54
@esert-g esert-g force-pushed the resource_exhausted branch 4 times, most recently from 96caee6 to f6cbd60 on January 6, 2022 21:25
google/cloud/bigquery_storage_v1/reader.py (outdated)
@@ -81,14 +83,12 @@ class ReadRowsStream(object):
         method to parse all messages into a :class:`pandas.DataFrame`.
     """

-    def __init__(self, wrapped, client, name, offset, read_rows_kwargs):
+    def __init__(
+        self, client, name, offset, read_rows_kwargs, retry_delay_callback=None
Contributor:

Technically a breaking change, though I don't expect anyone to be constructing a ReadRowsStream directly, except perhaps in unit tests.

We should have had this before, but please add a comment to this class's docstring like the following:

    This object should not be created directly, but is returned by other
    methods in this library.

(Pulled from https://github.com/googleapis/python-pubsub/blob/main/google/cloud/pubsub_v1/futures.py)

Contributor:

Also, it's not clear to me why we need to make this breaking change to begin with.

Contributor (Author):

Added the comment.

I removed the wrapped argument because ReadRowsStream assumed it would always be given a valid stream. However, a RESOURCE_EXHAUSTED error can be the very first thing returned by the BigQuery Storage API server. If we kept the old API, we would need to handle RESOURCE_EXHAUSTED errors in two different places; with this change we only need to handle them in one place.
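
To illustrate the "one place" point, here is a rough sketch (hypothetical shape, not the PR's exact code) of a stream that creates its own wrapped stream, so the first connection and every reconnect share the same handling; it reuses the `retry_delay_seconds` helper from the earlier sketch:

```python
import time

from google.api_core import exceptions


class ReadRowsStream(object):
    def __init__(self, client, name, offset, read_rows_kwargs,
                 retry_delay_callback=None):
        self._client = client
        self._name = name
        self._offset = offset
        self._read_rows_kwargs = read_rows_kwargs
        self._retry_delay_callback = retry_delay_callback
        self._wrapped = None

    def _reconnect(self):
        # Single choke point: a retryable RESOURCE_EXHAUSTED is handled here
        # whether it occurs on the first connection or on a mid-stream retry.
        while True:
            try:
                self._wrapped = self._client.read_rows(
                    self._name, offset=self._offset, **self._read_rows_kwargs
                )
                return
            except exceptions.ResourceExhausted as exc:
                delay = retry_delay_seconds(exc)  # helper from earlier sketch
                if delay is None:
                    raise
                if self._retry_delay_callback:
                    self._retry_delay_callback(delay)
                time.sleep(delay)
```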

@@ -106,6 +106,12 @@ def __init__(self, wrapped, client, name, offset, read_rows_kwargs):
             read_rows_kwargs (dict):
                 Keyword arguments to use when reconnecting to a ReadRows
                 stream.
+            retry_delay_callback (Optional[Callable[[float], None]]):
Contributor:

At first glance, I'm a bit confused why this parameter is necessary. Who needs a notification that we're going to sleep?

Contributor:

Is it just for unit testing? In that case, please remove this parameter and use the freezegun library instead.

Contributor (Author):

Users of the library may choose to provide this callback to be aware of delayed retries. When users are aware of delayed retry attempts, they can adjust their autoscaling algorithms. Apache Beam already does this with the Java SDK; my plan is to do the same with its Python SDK.
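
For example, a consumer could feed the delays into a scaling signal. This is a hypothetical usage sketch, assuming `read_rows` forwards the callback as this PR's client changes suggest; `client`, `stream_name`, `session`, and `process` are placeholders:

```python
# The callback receives the delay in seconds (a float), matching the
# Optional[Callable[[float], None]] signature documented above.
backoff_seconds_total = 0.0

def on_retry_delay(delay):
    global backoff_seconds_total
    backoff_seconds_total += delay  # e.g. export to an autoscaler as a signal

reader = client.read_rows(stream_name, retry_delay_callback=on_retry_delay)
for row in reader.rows(session):
    process(row)
```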

     def _reconnect(self):
         """Reconnect to the ReadRows stream using the most recent offset."""
-        self._wrapped = self._client.read_rows(
+        return self._client.read_rows(
Contributor:

Won't this create a new ReadRowsStream?

Contributor (Author):

Yes, this function doesn't do anything differently than before. It just returns the stream instead of assigning it to a member variable.

@esert-g esert-g force-pushed the resource_exhausted branch 2 times, most recently from d2dddeb to 869914a on January 6, 2022 23:04
@esert-g esert-g requested a review from tswast January 6, 2022 23:05
@esert-g esert-g force-pushed the resource_exhausted branch 2 times, most recently from 41169fb to b207dc1 on January 7, 2022 23:24
@@ -123,19 +130,12 @@ def read_rows(
             ValueError: If the parameters are invalid.
         """
         gapic_client = super(BigQueryReadClient, self)
-        stream = gapic_client.read_rows(
Contributor:

Note: A side effect of this change is that some non-retryable errors won't happen until the user starts iterating through rows.

I would consider this a breaking change even more so than the ReadRowsStream constructor change, as users who were expecting an exception will now get it later on.

Contributor (Author):

I changed the code so that read_rows is called during ReadRowsStream construction, and any exception besides a retryable RESOURCE_EXHAUSTED is propagated.

Comment on lines 240 to 243

 # Don't reconnect on DeadlineException. This allows user-specified timeouts
 # to be respected.
-mock_gapic_client.read_rows.assert_not_called()
+mock_gapic_client.read_rows.assert_called_once()
Contributor:

The comment is out of date now. I believe this is the breaking-change side effect I mentioned earlier.

I'd prefer we don't break this behavior and instead find a way to implement the reconnect logic in BigQueryReadClient.read_rows and ReadRowsStream. Perhaps a helper function could be created to keep it DRY?

Contributor (Author):

Removed comment. As mentioned above, behavior is the same as before.

@esert-g esert-g force-pushed the resource_exhausted branch 3 times, most recently from a69b1ee to 3b97341 on January 12, 2022 02:08
@esert-g esert-g requested a review from tswast January 12, 2022 02:35
Comment on lines -214 to -216

-# Don't reconnect on DeadlineException. This allows user-specified timeouts
-# to be respected.
-mock_gapic_client.read_rows.assert_not_called()
Contributor:

This was a rather important comment. I'd like to make sure we're still testing this behavior somehow (no reconnect on DeadlineException).

Contributor:

https://docs.python.org/3/library/unittest.mock.html#unittest.mock.Mock.reset_mock before line 211 may be helpful here, though possibly unnecessary if we move _reconnect out of the constructor.
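
A sketch of that suggestion; `make_stream` and `consume_until_deadline` are hypothetical test helpers standing in for the real test's setup and iteration:

```python
from unittest import mock

mock_gapic_client = mock.Mock()
stream = make_stream(mock_gapic_client)  # hypothetical setup; calls read_rows once

# Forget the construction-time call so later assertions only count
# calls made while iterating.
mock_gapic_client.read_rows.reset_mock()

consume_until_deadline(stream)  # hypothetical: iterate until DeadlineException
# No reconnect on DeadlineException, so user-specified timeouts are respected.
mock_gapic_client.read_rows.assert_not_called()
```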

Contributor (Author):

read_rows will always be called once, because that's what throws the DeadlineException. I updated the comment, so I hope the intention is clear now.

        self._client = client
        self._name = name
        self._offset = offset
        self._read_rows_kwargs = read_rows_kwargs
        self._retry_delay_callback = retry_delay_callback
        self._reconnect()
Contributor:

Doing heavy work during construction/initialization time is a bit of an OO anti-pattern. I'd prefer we find another way to do this.

Contributor (Author):

Removed this from construction; the reconnect call is now an additional statement at each place where a ReadRowsStream was previously constructed.
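
So a call site would presumably look something like this. A simplified sketch of BigQueryReadClient.read_rows under that pattern (signature trimmed; the exact call-site shape is an assumption):

```python
def read_rows(self, name, offset=0, retry_delay_callback=None, **read_rows_kwargs):
    # __init__ stays free of network I/O; the initial connection happens
    # explicitly right after construction.
    gapic_client = super(BigQueryReadClient, self)
    stream = reader.ReadRowsStream(
        gapic_client,
        name,
        offset,
        read_rows_kwargs,
        retry_delay_callback=retry_delay_callback,
    )
    stream._reconnect()  # connect now; may retry RESOURCE_EXHAUSTED per RetryInfo
    return stream
```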

Comment on lines 186 to 192
# ResourceExhausted errors are only retried if a valid
# RetryInfo is provided with the error.
# ResourceExhausted doesn't seem to have details/_details
# fields by default when it is generated by Python 3.6 unit
# tests, so we have to work around that.
# TODO: to remove this logic when we require
# google-api-core >= 2.2.0
Contributor:

Suggested change:

-# ResourceExhausted errors are only retried if a valid
-# RetryInfo is provided with the error.
-# ResourceExhausted doesn't seem to have details/_details
-# fields by default when it is generated by Python 3.6 unit
-# tests, so we have to work around that.
-# TODO: to remove this logic when we require
-# google-api-core >= 2.2.0
+# ResourceExhausted errors are only retried if a valid
+# RetryInfo is provided with the error.
+#
+# TODO: Remove hasattr logic when we require google-api-core >= 2.2.0.
+# ResourceExhausted added details/_details in google-api-core 2.2.0.

Contributor (Author):

Updated the comment as suggested.
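
For reference, a sketch of what the hasattr workaround could look like (assumed shape with a hypothetical `_retry_info` helper, not the PR's exact code):

```python
from google.rpc import error_details_pb2


def _retry_info(exc):
    """Extract RetryInfo from an error's details, tolerating older api-core."""
    # Older google-api-core versions may not define details/_details on
    # ResourceExhausted (see the TODO above), hence the hasattr checks.
    if hasattr(exc, "details"):
        details = exc.details
    elif hasattr(exc, "_details"):
        details = exc._details
    else:
        details = None
    for detail in details or []:
        if isinstance(detail, error_details_pb2.RetryInfo):
            return detail
    return None
```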

@tswast tswast added the `automerge` and `kokoro:force-run` labels Jan 12, 2022
@tswast tswast changed the title from "Retryable RESOURCE_EXHAUSTED handling" to "fix: retry RESOURCE_EXHAUSTED errors in read_rows" Jan 12, 2022
@yoshi-kokoro yoshi-kokoro removed the `kokoro:force-run` label Jan 12, 2022
@tswast tswast merged commit 33757d8 into googleapis:main Jan 12, 2022
@gcf-merge-on-green gcf-merge-on-green bot removed the `automerge` label Jan 12, 2022