Attempt retry hook for flaky regression test cases. #535
Changes Unknown when pulling 09431a8 on dhermes:retry-flaky-regression-cases into GoogleCloudPlatform:master.
class RetryTestsMetaclass(type):

    NUM_RETRIES = 2
    FLAKY_ERROR_CLASSES = (AssertionError,)
I think that for time-based errors (e.g. exceeding QPS for a certain API method) we should bake the retry logic into the library itself, providing default values for max retries and for the error codes and messages to retry on, but allowing the user to override all of those.
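A rough sketch of what "defaults that the user can override" might look like as a decorator. Everything here is hypothetical: `ApiError`, `retry_on`, and the default values are illustrative names and numbers, not part of the library under discussion.

```python
import functools
import time

# Hypothetical defaults; none of these names or values come from the library.
DEFAULT_MAX_RETRIES = 3
DEFAULT_RETRY_CODES = (429, 503)


class ApiError(Exception):
    """Stand-in for an API error that carries an HTTP status code."""

    def __init__(self, code, message=''):
        super().__init__(message)
        self.code = code


def retry_on(max_retries=DEFAULT_MAX_RETRIES, codes=DEFAULT_RETRY_CODES,
             delay=0.0):
    """Retry the wrapped call when it raises ApiError with a listed code."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(max_retries + 1):
                try:
                    return func(*args, **kwargs)
                except ApiError as exc:
                    # Give up on non-retryable codes or after the last attempt.
                    if exc.code not in codes or attempt == max_retries:
                        raise
                    # Exponential backoff; a delay of 0.0 retries immediately.
                    time.sleep(delay * (2 ** attempt))
        return wrapper
    return decorator


# Usage: the caller overrides both the retry count and the retryable codes.
calls = []


@retry_on(max_retries=2, codes=(429,))
def flaky_call():
    calls.append(1)
    if len(calls) < 3:
        raise ApiError(429, 'rate limited')
    return 'ok'
```

With these overrides the call is attempted up to three times, and any error code outside `codes` is re-raised immediately.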
@silvolu AFAIK the regression test issues are not due to QPS issues, just intermittent failures or sometimes due to eventual-ness of consistency (though this "shouldn't" happen since usually this is for queries on data inserted weeks prior via
Oh sorry, got confused with gcloud-ruby (and didn't read this issue with enough attention).
09431a8 to 381e1f9
Changes Unknown when pulling 381e1f9 on dhermes:retry-flaky-regression-cases into GoogleCloudPlatform:master.
Another one that failed the first time but not on retry:
Fixes googleapis#531.

To "test" that this works, feel free to add a test case like:

    x = 0

    def test_retry(self):
        # Feel free to vary 3 higher and higher, should always be
        # NUM_RETRIES in the final error message.
        if self.x < 3:
            self.x += 1
            self.assertEqual(self.x, object())  # Fails
        else:
            self.assertTrue(True)
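For readers without the diff at hand, here is one way a retry metaclass along these lines could be wired up. Only the class name and the two constants appear in this thread; the `_retry` helper, the wrapping logic, and the `TestRetry` case are assumptions made for illustration, written for Python 3.

```python
import functools
import unittest

# Values quoted from the diff under review.
NUM_RETRIES = 2
FLAKY_ERROR_CLASSES = (AssertionError,)


def _retry(method):
    """Hypothetical helper: re-run a test method on a flaky error."""
    @functools.wraps(method)
    def wrapped(self, *args, **kwargs):
        for attempt in range(NUM_RETRIES + 1):
            try:
                return method(self, *args, **kwargs)
            except FLAKY_ERROR_CLASSES:
                # Re-raise once the retry budget is used up.
                if attempt == NUM_RETRIES:
                    raise
    return wrapped


class RetryTestsMetaclass(type):
    """Wrap every callable attribute named test* so it is retried."""

    def __new__(mcs, name, bases, attrs):
        for key, value in list(attrs.items()):
            if key.startswith('test') and callable(value):
                attrs[key] = _retry(value)
        return super().__new__(mcs, name, bases, attrs)


class TestRetry(unittest.TestCase, metaclass=RetryTestsMetaclass):
    x = 0

    def test_retry(self):
        # Fails on the first two attempts and passes on the third,
        # which is exactly the budget NUM_RETRIES = 2 allows.
        if self.x < 2:
            self.x += 1
            self.assertEqual(self.x, object())
        else:
            self.assertTrue(True)
```

Raising the threshold in `test_retry` above `NUM_RETRIES` makes the final attempt fail too, so the error surfaces as usual.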
381e1f9 to a060dc7
@tseaver It seems the last contention here was in allowing
I'll defer to your report that it failed on the first pass, but passed soon after (although I think we likely have something smelly in the testcase, that would be a different issue).
@tseaver I'm looking into the smelliness. The second failure referenced in the bug is due to a non-transactional There are also three test cases which rely on Shall I make these transactional and remove

Also, looking through history it seems this 404 on storage key delete (during module cleanup) occurs pretty often. Our delete code and / or library code seems to have an issue with data staleness. That also may be smelly and maybe we don't need any retries?

The piece I seem to remember failing is in

    query = datastore.Query(kind='Character', ancestor=datastore.Key('Book', 'GoT'),
    [('appearances', '>=', 20)])
    expected_matches = 6
    entities = list(query.fetch(limit=7))
    len(entities) == 6

but I checked all the failed builds and it's in none of those (it still may have occurred, but in a build we retried).
Also removing AssertionError from list of retry classes.
@tseaver I made the datastore
LGTM
@tseaver I did some "soul-searching" on this and realized:
I'm going to submit a PR (#562) with just the transactional puts and then figure out what to do about the rest. Adding the "don't merge" label for now.
This is to address flaky test failures. See googleapis#535 for more discussion.
@jgeewax Do we have a contact on the We are seeing a non-trivial number of failures of This failure is because,
for key in bucket:
... occurs, hence a fresh API request to list the objects)
Summoning @thobrla, expert on all things storage!
Here and available for questions. In this particular case, the behavior you are seeing is expected because listing in Google Cloud Storage is eventually consistent. Thus, recently deleted objects may still be returned from the list call. In terms of test cleanup, you should treat a 404 as "already deleted" and continue. You can model this after the gsutil integration test teardown code: https://github.com/GoogleCloudPlatform/gsutil/blob/master/gslib/tests/testcase/integration_testcase.py#L108
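The "treat a 404 as already deleted" advice can be sketched roughly as below. `NotFound`, `FakeObject`, and `delete_all` are hypothetical stand-ins so that the snippet runs on its own; they are not the storage library's actual classes or the gsutil teardown code.

```python
class NotFound(Exception):
    """Hypothetical stand-in for the error raised on an HTTP 404."""


class FakeObject:
    """Hypothetical stand-in for a storage object returned by a listing."""

    def __init__(self, name, already_deleted=False):
        self.name = name
        self.already_deleted = already_deleted
        self.deleted = False

    def delete(self):
        # A stale listing can return objects that no longer exist.
        if self.already_deleted:
            raise NotFound(self.name)
        self.deleted = True


def delete_all(objects):
    """Delete every object, treating a 404 as 'already deleted'.

    Returns the names that were skipped because they were already gone.
    """
    skipped = []
    for obj in objects:
        try:
            obj.delete()
        except NotFound:
            # Listing is eventually consistent, so this is expected.
            skipped.append(obj.name)
    return skipped
```

Only the not-found case is swallowed; any other error still propagates, so genuine cleanup failures remain visible.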
@thobrla Thanks for the quick reply. This gives a simple-to-implement fix for our flaky regression test, but doesn't make
Fixes googleapis#531. See googleapis#535 for context.
Thanks for the help @thobrla! I filed #564 to remove the Closing out this now-defunct PR. @tseaver I'll keep around the metaclass branch for a while in case we decide to put it back. Thanks for pushing back about the "something smelly" in our test failures! I'd guess we would need to address