
Attempt retry hook for flaky regression test cases. #535

Closed

Conversation

@dhermes (Contributor) commented Jan 12, 2015

Fixes #531.

To "test" that this works, feel free to add a test case like:

    x = 0

    def test_retry(self):
        # Feel free to raise the 3 higher and higher; the final error
        # message should always report NUM_RETRIES.
        if self.x < 3:
            self.x += 1
            self.assertEqual(self.x, object())  # Fails
        else:
            self.assertTrue(True)

@coveralls

Coverage Status

Changes Unknown when pulling 09431a8 on dhermes:retry-flaky-regression-cases into GoogleCloudPlatform:master.

@dhermes added the api: datastore, testing, and api: storage labels on Jan 12, 2015
    class RetryTestsMetaclass(type):

        NUM_RETRIES = 2
        FLAKY_ERROR_CLASSES = (AssertionError,)
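For context, a minimal sketch of how a metaclass like this might wrap test methods: NUM_RETRIES and FLAKY_ERROR_CLASSES come from the diff above, but the wrapping logic below is an illustrative guess, not the PR's actual implementation.

    import functools

    class RetryTestsMetaclass(type):
        # Names taken from the diff above; the wrapping is only an
        # illustrative guess at the hook's mechanics.
        NUM_RETRIES = 2
        FLAKY_ERROR_CLASSES = (AssertionError,)

        def __new__(mcs, name, bases, attrs):
            # Wrap every test method so flaky errors are retried.
            for attr_name, value in list(attrs.items()):
                if attr_name.startswith('test') and callable(value):
                    attrs[attr_name] = mcs._wrap_with_retries(value)
            return super(RetryTestsMetaclass, mcs).__new__(
                mcs, name, bases, attrs)

        @classmethod
        def _wrap_with_retries(mcs, test_method):
            @functools.wraps(test_method)
            def wrapped(self):
                for attempt in range(mcs.NUM_RETRIES + 1):
                    try:
                        return test_method(self)
                    except mcs.FLAKY_ERROR_CLASSES:
                        # Re-raise only once the retries are exhausted.
                        if attempt == mcs.NUM_RETRIES:
                            raise
            return wrapped

A regression test class would then opt in by declaring this as its metaclass (e.g. __metaclass__ = RetryTestsMetaclass under Python 2, which is what gcloud-python targeted at the time).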


@silvolu (Contributor) commented Jan 14, 2015

I think that for time-based errors (e.g. exceeding QPS for a certain API method) we should bake the retry logic into the library itself, providing default values for max retries and for the error codes and messages to retry on, but allowing the user to override all of those.
This is partially done in apitools.
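As a rough sketch of that idea (the function name, parameters, and backoff policy here are all illustrative assumptions, not an existing gcloud or apitools API):

    import time

    def call_with_retries(request_fn, max_retries=3,
                          retryable_codes=(429, 500, 503)):
        # Defaults that a caller could override, per the suggestion above.
        for attempt in range(max_retries + 1):
            response = request_fn()
            if response.status not in retryable_codes:
                return response
            if attempt < max_retries:
                time.sleep(2 ** attempt)  # back off: 1s, 2s, 4s, ...
        return response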

@dhermes (Contributor, author) commented Jan 14, 2015

@silvolu AFAIK the regression test issues are not due to QPS limits, just intermittent failures, or sometimes eventual consistency (though this "shouldn't" happen, since usually this is for queries on data inserted weeks prior via populate_datastore.py).

@silvolu (Contributor) commented Jan 14, 2015

Oh sorry, I got confused with gcloud-ruby (and didn't read this issue with enough attention).

@dhermes force-pushed the retry-flaky-regression-cases branch from 09431a8 to 381e1f9 on January 15, 2015
@googlebot added the cla: yes label on Jan 15, 2015
@coveralls

Coverage Status

Changes Unknown when pulling 381e1f9 on dhermes:retry-flaky-regression-cases into GoogleCloudPlatform:master.

@dhermes (Contributor, author) commented Jan 15, 2015

Another one that failed the first time but not on retry:
https://travis-ci.org/GoogleCloudPlatform/gcloud-python/builds/47141980

I took a snapshot of the failure for posterity:
[screenshot of the failure: screen_shot_056]

@dhermes (Contributor, author) commented Jan 15, 2015

I didn't think we'd ever see it, but two consecutive merges resulted in a 429 Too Many Requests and caused the one I merged to fail. (I somehow lost the screenshot after restarting the build, boo.)

@silvolu was right that it could occur in our regression tests :)

@dhermes force-pushed the retry-flaky-regression-cases branch from 381e1f9 to a060dc7 on January 15, 2015
@dhermes (Contributor, author) commented Jan 17, 2015

@tseaver It seems the last contention here was in allowing AssertionError as one of the retryable errors. Are you still having that hangup?

@tseaver (Contributor) commented Jan 17, 2015

I'll defer to your report that it failed on the first pass but passed soon after (although I think we likely have something smelly in the test case; that would be a different issue).

@dhermes (Contributor, author) commented Jan 17, 2015

@tseaver I'm looking into the smelliness.

The second failure referenced in the bug is due to a non-transactional put() followed by a query. I found 3 or 4 more instances of it in old failed builds.

There are also three test cases which rely on _generic_test_post, which also has a non-transactional put() followed by a single-key lookup.

Shall I make these transactional and remove AssertionError from the retry exception list?
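(For reference, a transactional put looks roughly like the sketch below; the spelling is the modern google-cloud-datastore API, not the 2015 gcloud one, and the kind/property names are made up for illustration.)

    from google.cloud import datastore

    client = datastore.Client()
    with client.transaction():
        # The put is committed atomically when the block exits.
        entity = datastore.Entity(key=client.key('Post'))
        entity['title'] = 'Example'
        client.put(entity)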


Also, looking through history, it seems this 404 on storage key delete (during module cleanup) occurs pretty often. Our delete code and/or library code seems to have an issue with data staleness.

That may also be smelly, and maybe we don't need any retries?


The piece I seem to remember failing is in test_query_simple_filter, essentially:

    query = datastore.Query(
        kind='Character',
        ancestor=datastore.Key('Book', 'GoT'),
        filters=[('appearances', '>=', 20)])
    expected_matches = 6
    entities = list(query.fetch(limit=7))
    assert len(entities) == expected_matches

but I checked all the failed builds and it's in none of those (it still may have occurred, but in a build we retried).

Also removing AssertionError from the list of retry classes.
@coveralls

Coverage Status

Coverage remained the same when pulling 280b7ac on dhermes:retry-flaky-regression-cases into edfd5e2 on GoogleCloudPlatform:master.

@dhermes (Contributor, author) commented Jan 19, 2015

@tseaver I made the datastore put()s transactional and removed AssertionError. I double-checked for funny business in Bucket.delete(): it turns out a new Bucket instance is created and all the objects (Keys) are re-retrieved. So it's still a mystery why we have all these 404s in tearDownModule.

@tseaver (Contributor) commented Jan 22, 2015

LGTM

@dhermes (Contributor, author) commented Jan 22, 2015

@tseaver I did some "soul-searching" on this and realized:

  1. The datastore flakes are "our fault" in that we didn't have transactional puts. (I'm curious whether this ever happens for gcloud-node. Maybe @silvolu knows? I took a look at their last 10 failures and didn't find any evidence.)
  2. The retry won't fix tearDownModule, which is the last remaining flake.
  3. The storage failure seems to be an API-level issue that manifests in Bucket.delete(). We should loop in storage eng. to see what is expected. (Essentially, ls is a lie, and it causes delete to fail.)

I'm going to submit a PR (#562) with just the transactional puts and then figure out what to do about the rest.

Adding the do not merge label for now.

@dhermes added the do not merge label on Jan 22, 2015
dhermes added a commit to dhermes/google-cloud-python that referenced this pull request Jan 22, 2015
This is to address flaky test failures.

See googleapis#535 for more discussion.
@dhermes (Contributor, author) commented Jan 22, 2015

@jgeewax Do we have a contact on the storage team to ask questions?

We are seeing a non-trivial number of failures of Connection.delete_bucket(force=True):
https://travis-ci.org/GoogleCloudPlatform/gcloud-python/builds/46346592#L329
https://travis-ci.org/GoogleCloudPlatform/gcloud-python/builds/42891293#L261


This failure occurs because Connection.delete_bucket():

  1. Loops through a list of objects in the bucket.
  2. This creates a new _KeyIterator, which gets a list of all the objects from the API. Since __iter__ does not bind anything to the Bucket object, a new iterator is returned (and hence a fresh API request to list the objects is made) every time

    for key in bucket:
        ...

    occurs.
  3. Tries to delete each retrieved key, some of which no longer exist (hence the 404). This seems to indicate that the `/bucket-name/o/` request returned stale data (hence the desire to talk to eng.).
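Condensed, the failing pattern is roughly this (a sketch reconstructed from the description above, not the actual source):

    def delete_bucket_force(bucket):
        # Iterating the bucket issues a fresh "list objects" request,
        # which can return stale data.
        for key in bucket:
            key.delete()      # 404s when a listed object is already gone
        bucket.delete()       # would 409 Conflict if objects remained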

@craigcitro (Contributor) commented:

summoning @thobrla, expert on all things storage!

@thobrla commented Jan 22, 2015

Here and available for questions. In this particular case, the behavior you are seeing is expected because listing in Google Cloud Storage is eventually consistent. Thus, recently deleted objects may still be returned from the list call. In terms of test cleanup, you should treat a 404 as "already deleted" and continue.

You can model this after the gsutil integration test teardown code: https://github.com/GoogleCloudPlatform/gsutil/blob/master/gslib/tests/testcase/integration_testcase.py#L108
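Adapted to these tests, the cleanup might look like the sketch below (NotFound is a stand-in for however the library maps a 404; the actual exception class here is an assumption):

    class NotFound(Exception):
        """Stand-in for the library's mapped 404 error."""

    def tear_down_module(bucket):
        # Per the advice above: treat a 404 on delete as "already
        # deleted" and continue, since listing is eventually consistent.
        for key in bucket:
            try:
                key.delete()
            except NotFound:
                pass
        bucket.delete()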

@dhermes (Contributor, author) commented Jan 22, 2015

@thobrla Thanks for the quick reply.

This gives a simple-to-implement fix for our flaky regression test, but doesn't make Bucket.delete() very happy. If we don't do the "list files", deleting a non-empty bucket gives a 409 Conflict.

  1. Is it better to ignore a 404 on object delete or to check that the object exists before deleting? (I'm inclined to do whichever requires only one HTTP request, but maybe I'm missing something.)
  2. Is there a way to send batch object deletes with the JSON API? (I know it is/was possible with the XML API. I've not seen anyone actually use the "discovery" batch feature.)
  3. Is there a way to tell the JSON API to also delete all the objects in addition to the bucket? (i.e. let the backend worry about it)
  4. Is it worth providing Bucket.delete() at all, given the potential headache / unexpected behavior? (If listing objects is eventually consistent, we may still miss new objects.)

dhermes added a commit to dhermes/google-cloud-python that referenced this pull request Jan 22, 2015
dhermes added a commit to dhermes/google-cloud-python that referenced this pull request Jan 22, 2015
@thobrla commented Jan 22, 2015

  1. Ignoring the 404 is fine. Ultimately, you won't get true atomicity either way (see my response to item 4 below).
  2. The JSON API does support batch deletes, but it is a performance optimization at best and won't solve this problem.
  3. No, there is no API-level support to delete a bucket with objects in it.
  4. I'd recommend isolating this to the test code. If you provide this functionality, it must be made extremely clear that this is a non-atomic operation that is not expected to succeed if there are concurrent writers to the bucket. The danger here is providing the illusion of atomicity to the client, especially because that illusion is likely to work at small scale and then fail cryptically at large scale.

@dhermes (Contributor, author) commented Jan 22, 2015

Thanks for the help @thobrla!

I filed #564 to remove the Bucket.delete() method (due to the global namespace of buckets, it's a very serious operation to make so easy to execute). I also submitted PR #563 to address the flaky regression tests.


Closing out this now-defunct PR. @tseaver I'll keep the metaclass branch around for a while in case we decide to bring it back. Thanks for pushing back about the "something smelly" in our test failures!

I'd guess we'll still need to address 429 Too Many Requests (so maybe RetryTestsMetaclass will come back), but as @silvolu pointed out, it may be best to handle such failures in the library.

@dhermes closed this on Jan 22, 2015
@dhermes deleted the retry-flaky-regression-cases branch on January 22, 2015 at 21:12
@dhermes mentioned this pull request on May 17, 2016
parthea pushed a commit that referenced this pull request Aug 15, 2023
* feat: added GitIntegrationSettings to the Agent

PiperOrigin-RevId: 546946304

Source-Link: googleapis/googleapis@5cfc6d1

Source-Link: googleapis/googleapis-gen@734b6e5
Copy-Tag: eyJwIjoiLmdpdGh1Yi8uT3dsQm90LnlhbWwiLCJoIjoiNzM0YjZlNTNmZGY5NzZiMzNkMWNhYjJjN2I1YmNlMDk5OWU5N2ZjZCJ9

* 🦉 Updates from OwlBot post-processor

See https://github.com/googleapis/repo-automation-bots/blob/main/packages/owl-bot/README.md

---------

Co-authored-by: Owl Bot <gcf-owl-bot[bot]@users.noreply.github.com>
vchudnov-g pushed a commit that referenced this pull request Sep 20, 2023
* chore(python): drop python 3.6

Source-Link: googleapis/synthtool@4f89b13
Post-Processor: gcr.io/cloud-devrel-public-resources/owlbot-python:latest@sha256:e7bb19d47c13839fe8c147e50e02e8b6cf5da8edd1af8b82208cd6f66cc2829c

* add api_description to .repo-metadata.json

* require python 3.7+ in setup.py

* remove python 3.6 sample configs

* remove require check for python 3.6

* 🦉 Updates from OwlBot post-processor

See https://github.com/googleapis/repo-automation-bots/blob/main/packages/owl-bot/README.md

Co-authored-by: Owl Bot <gcf-owl-bot[bot]@users.noreply.github.com>
Co-authored-by: Anthonios Partheniou <partheniou@google.com>
parthea pushed a commit that referenced this pull request Oct 21, 2023
* feat: added overrides_by_request_protocol to backend.proto
feat: added field proto_reference_documentation_uri to proto reference documentation.
feat: added SERVICE_NOT_VISIBLE and GCP_SUSPENDED into error reason

PiperOrigin-RevId: 517437454

Source-Link: googleapis/googleapis@ecb1cf0

Source-Link: googleapis/googleapis-gen@8731b8f
Copy-Tag: eyJwIjoiLmdpdGh1Yi8uT3dsQm90LnlhbWwiLCJoIjoiODczMWI4ZmQyMDQ0YTkzYzMyMzA5ZDYzM2RmYzJlODM2YmYxM2NiZiJ9

* 🦉 Updates from OwlBot post-processor

See https://github.com/googleapis/repo-automation-bots/blob/main/packages/owl-bot/README.md

* docs: Fix formatting of request arg in docstring

chore: Update gapic-generator-python to v1.9.1
PiperOrigin-RevId: 518604533

Source-Link: googleapis/googleapis@8a085ae

Source-Link: googleapis/googleapis-gen@b2ab4b0
Copy-Tag: eyJwIjoiLmdpdGh1Yi8uT3dsQm90LnlhbWwiLCJoIjoiYjJhYjRiMGEwYWUyOTA3ZTgxMmMyMDkxOThhNzRlMDg5OGFmY2IwNCJ9

* 🦉 Updates from OwlBot post-processor

See https://github.com/googleapis/repo-automation-bots/blob/main/packages/owl-bot/README.md

---------

Co-authored-by: Owl Bot <gcf-owl-bot[bot]@users.noreply.github.com>
parthea pushed a commit that referenced this pull request Oct 21, 2023
Removing ambiguous message.
Labels: api: datastore · api: storage · cla: yes · do not merge · testing
Projects: none yet

Successfully merging this pull request may close these issues: Build in regression test case retries (#531)

7 participants