
Attempt retry hook for flaky regression test cases. #535

Closed

Conversation

@dhermes (Contributor) commented Jan 12, 2015

Fixes #531.

To "test" that this works, feel free to add a test case like:

    x = 0

    def test_retry(self):
        # Feel free to raise the 3 higher and higher; the final error
        # message should always report NUM_RETRIES.
        if self.x < 3:
            self.x += 1
            self.assertEqual(self.x, object())  # Fails
        else:
            self.assertTrue(True)

@coveralls

Coverage Status

Changes Unknown when pulling 09431a8 on dhermes:retry-flaky-regression-cases into GoogleCloudPlatform:master.

@dhermes added the api: datastore, testing, and api: storage labels on Jan 12, 2015
    class RetryTestsMetaclass(type):

        NUM_RETRIES = 2
        FLAKY_ERROR_CLASSES = (AssertionError,)
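For context, a minimal sketch of how a metaclass like this might wrap test methods: NUM_RETRIES and FLAKY_ERROR_CLASSES come from the diff above, but the wrapping logic below is an illustrative guess, not the PR's actual implementation.

    import functools

    class RetryTestsMetaclass(type):
        # Names taken from the diff above; the wrapping is only an
        # illustrative guess at the hook's mechanics.
        NUM_RETRIES = 2
        FLAKY_ERROR_CLASSES = (AssertionError,)

        def __new__(mcs, name, bases, attrs):
            # Wrap every test method so flaky errors are retried.
            for attr_name, value in list(attrs.items()):
                if attr_name.startswith('test') and callable(value):
                    attrs[attr_name] = mcs._wrap_with_retries(value)
            return super(RetryTestsMetaclass, mcs).__new__(
                mcs, name, bases, attrs)

        @classmethod
        def _wrap_with_retries(mcs, test_method):
            @functools.wraps(test_method)
            def wrapped(self):
                for attempt in range(mcs.NUM_RETRIES + 1):
                    try:
                        return test_method(self)
                    except mcs.FLAKY_ERROR_CLASSES:
                        # Re-raise only once the retries are exhausted.
                        if attempt == mcs.NUM_RETRIES:
                            raise
            return wrapped

A regression test class would then opt in by declaring this as its metaclass (e.g. __metaclass__ = RetryTestsMetaclass under Python 2, which is what gcloud-python targeted at the time).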


@silvolu (Contributor) commented Jan 14, 2015

I think that for time-based errors (e.g. exceeding QPS for a certain API method) we should bake the retry logic into the library itself, providing default values for max retries and for the error codes and messages to retry on, but allowing the user to override all of those.
This is partially done in apitools.
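As a rough sketch of that idea (the function name, parameters, and backoff policy here are all illustrative assumptions, not an existing gcloud or apitools API):

    import time

    def call_with_retries(request_fn, max_retries=3,
                          retryable_codes=(429, 500, 503)):
        # Defaults that a caller could override, per the suggestion above.
        for attempt in range(max_retries + 1):
            response = request_fn()
            if response.status not in retryable_codes:
                return response
            if attempt < max_retries:
                time.sleep(2 ** attempt)  # back off: 1s, 2s, 4s, ...
        return response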

@dhermes (Contributor, author) commented Jan 14, 2015

@silvolu AFAIK the regression test issues are not due to QPS limits, just intermittent failures, or sometimes eventual consistency (though this "shouldn't" happen, since usually this is for queries on data inserted weeks prior via populate_datastore.py).

@silvolu (Contributor) commented Jan 14, 2015

Oh sorry, I got confused with gcloud-ruby (and didn't read this issue with enough attention).

@dhermes force-pushed the retry-flaky-regression-cases branch from 09431a8 to 381e1f9 on January 15, 2015
@googlebot added the cla: yes label on Jan 15, 2015
@coveralls

Coverage Status

Changes Unknown when pulling 381e1f9 on dhermes:retry-flaky-regression-cases into GoogleCloudPlatform:master.

@dhermes (Contributor, author) commented Jan 15, 2015

Another one that failed the first time but not on retry:
https://travis-ci.org/GoogleCloudPlatform/gcloud-python/builds/47141980

I took a snapshot of the failure for posterity:
[screenshot of the failure: screen_shot_056]

@dhermes (Contributor, author) commented Jan 15, 2015

I didn't think we'd ever see it, but two consecutive merges resulted in a 429 Too Many Requests and caused the one I merged to fail. (I somehow lost the screenshot after restarting the build, boo.)

@silvolu was right that it could occur in our regression tests :)

@dhermes force-pushed the retry-flaky-regression-cases branch from 381e1f9 to a060dc7 on January 15, 2015
@dhermes (Contributor, author) commented Jan 17, 2015

@tseaver It seems the last contention here was in allowing AssertionError as one of the retryable errors. Are you still having that hangup?

@tseaver (Contributor) commented Jan 17, 2015

I'll defer to your report that it failed on the first pass but passed soon after (although I think we likely have something smelly in the test case; that would be a different issue).

@dhermes (Contributor, author) commented Jan 17, 2015

@tseaver I'm looking into the smelliness.

The second failure referenced in the bug is due to a non-transactional put() followed by a query. I found 3 or 4 more instances of it in old failed builds.

There are also three test cases which rely on _generic_test_post, which also has a non-transactional put() followed by a single-key lookup.

Shall I make these transactional and remove AssertionError from the retry exception list?
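(For reference, a transactional put looks roughly like the sketch below; the spelling is the modern google-cloud-datastore API, not the 2015 gcloud one, and the kind/property names are made up for illustration.)

    from google.cloud import datastore

    client = datastore.Client()
    with client.transaction():
        # The put is committed atomically when the block exits.
        entity = datastore.Entity(key=client.key('Post'))
        entity['title'] = 'Example'
        client.put(entity)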


Also, looking through history, it seems this 404 on storage key delete (during module cleanup) occurs pretty often. Our delete code and/or library code seems to have an issue with data staleness.

That may also be smelly, and maybe we don't need any retries?


The piece I seem to remember failing is in test_query_simple_filter, essentially:

    query = datastore.Query(
        kind='Character',
        ancestor=datastore.Key('Book', 'GoT'),
        filters=[('appearances', '>=', 20)])
    expected_matches = 6
    entities = list(query.fetch(limit=7))
    assert len(entities) == expected_matches

but I checked all the failed builds and it's in none of those (it still may have occurred, but in a build we retried).

Also removing AssertionError from the list of retry classes.
@coveralls

Coverage Status

Coverage remained the same when pulling 280b7ac on dhermes:retry-flaky-regression-cases into edfd5e2 on GoogleCloudPlatform:master.

@dhermes (Contributor, author) commented Jan 19, 2015

@tseaver I made the datastore put()s transactional and removed AssertionError. I double-checked for funny business in Bucket.delete(): it turns out a new Bucket instance is created and all the objects (Keys) are re-retrieved. So it's still a mystery why we have all these 404s in tearDownModule.

@tseaver (Contributor) commented Jan 22, 2015

LGTM

@dhermes (Contributor, author) commented Jan 22, 2015

@tseaver I did some "soul-searching" on this and realized:

  1. The datastore flakes are "our fault" in that we didn't have transactional puts. (I'm curious whether this ever happens for gcloud-node. Maybe @silvolu knows? I took a look at their last 10 failures and didn't find any evidence.)
  2. The retry won't fix tearDownModule, which is the last remaining flake.
  3. The storage failure seems to be an API-level issue that manifests in Bucket.delete(). We should loop in storage eng. to see what is expected. (Essentially, ls is a lie, and it causes delete to fail.)

I'm going to submit a PR (#562) with just the transactional puts and then figure out what to do about the rest.

Adding the do not merge label for now.

@dhermes added the do not merge label on Jan 22, 2015
dhermes added a commit to dhermes/google-cloud-python that referenced this pull request Jan 22, 2015
This is to address flaky test failures.

See googleapis#535 for more discussion.
@dhermes (Contributor, author) commented Jan 22, 2015

@jgeewax Do we have a contact on the storage team to ask questions?

We are seeing a non-trivial number of failures of Connection.delete_bucket(force=True):
https://travis-ci.org/GoogleCloudPlatform/gcloud-python/builds/46346592#L329
https://travis-ci.org/GoogleCloudPlatform/gcloud-python/builds/42891293#L261


This failure occurs because Connection.delete_bucket():

  1. Loops through a list of objects in the bucket.
  2. This creates a new _KeyIterator, which gets a list of all the objects from the API. Since __iter__ does not bind anything to the Bucket object, a new iterator is returned (and hence a fresh API request to list the objects is made) every time

    for key in bucket:
        ...

    occurs.
  3. Tries to delete each retrieved key, some of which no longer exist (hence the 404). This seems to indicate that the `/bucket-name/o/` request returned stale data (hence the desire to talk to eng.).
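Condensed, the failing pattern is roughly this (a sketch reconstructed from the description above, not the actual source):

    def delete_bucket_force(bucket):
        # Iterating the bucket issues a fresh "list objects" request,
        # which can return stale data.
        for key in bucket:
            key.delete()      # 404s when a listed object is already gone
        bucket.delete()       # would 409 Conflict if objects remained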

@craigcitro (Contributor) commented:

summoning @thobrla, expert on all things storage!

@thobrla commented Jan 22, 2015

Here and available for questions. In this particular case, the behavior you are seeing is expected because listing in Google Cloud Storage is eventually consistent. Thus, recently deleted objects may still be returned from the list call. In terms of test cleanup, you should treat a 404 as "already deleted" and continue.

You can model this after the gsutil integration test teardown code: https://github.com/GoogleCloudPlatform/gsutil/blob/master/gslib/tests/testcase/integration_testcase.py#L108
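Adapted to these tests, the cleanup might look like the sketch below (NotFound is a stand-in for however the library maps a 404; the actual exception class here is an assumption):

    class NotFound(Exception):
        """Stand-in for the library's mapped 404 error."""

    def tear_down_module(bucket):
        # Per the advice above: treat a 404 on delete as "already
        # deleted" and continue, since listing is eventually consistent.
        for key in bucket:
            try:
                key.delete()
            except NotFound:
                pass
        bucket.delete()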

@dhermes (Contributor, author) commented Jan 22, 2015

@thobrla Thanks for the quick reply.

This gives a simple-to-implement fix for our flaky regression test, but doesn't make Bucket.delete() very happy. If we don't do the "list files", deleting a non-empty bucket gives a 409 Conflict.

  1. Is it better to ignore a 404 on object delete or to check that the object exists before deleting? (I'm inclined to do whichever requires only one HTTP request, but maybe I'm missing something.)
  2. Is there a way to send batch object deletes with the JSON API? (I know it is/was possible with the XML API. I've not seen anyone actually use the "discovery" batch feature.)
  3. Is there a way to tell the JSON API to also delete all the objects in addition to the bucket? (i.e. let the backend worry about it)
  4. Is it worth providing Bucket.delete() at all, given the potential headache / unexpected behavior? (If listing objects is eventually consistent, we may still miss new objects.)

dhermes added a commit to dhermes/google-cloud-python that referenced this pull request Jan 22, 2015
dhermes added a commit to dhermes/google-cloud-python that referenced this pull request Jan 22, 2015
@thobrla commented Jan 22, 2015

  1. Ignoring the 404 is fine. Ultimately, you won't get true atomicity either way (see my response to item 4 below).
  2. The JSON API does support batch deletes, but it is a performance optimization at best and won't solve this problem.
  3. No, there is no API-level support to delete a bucket with objects in it.
  4. I'd recommend isolating this to the test code. If you provide this functionality, it must be made extremely clear that this is a non-atomic operation that is not expected to succeed if there are concurrent writers to the bucket. The danger here is providing the illusion of atomicity to the client, especially because that illusion is likely to work at small scale and then fail cryptically at large scale.

@dhermes (Contributor, author) commented Jan 22, 2015

Thanks for the help @thobrla!

I filed #564 to remove the Bucket.delete() method (due to the global namespace of buckets, it's a very serious operation to make so easy to execute). I also submitted PR #563 to address the flaky regression tests.


Closing out this now-defunct PR. @tseaver I'll keep the metaclass branch around for a while in case we decide to bring it back. Thanks for pushing back about the "something smelly" in our test failures!

I'd guess we'll still need to address 429 Too Many Requests (so maybe RetryTestsMetaclass will come back), but as @silvolu pointed out, it may be best to handle such failures in the library.

@dhermes closed this on Jan 22, 2015
@dhermes deleted the retry-flaky-regression-cases branch on January 22, 2015 at 21:12
@dhermes mentioned this pull request on May 17, 2016
parthea pushed a commit that referenced this pull request Aug 15, 2023
* feat: added GitIntegrationSettings to the Agent

PiperOrigin-RevId: 546946304

Source-Link: googleapis/googleapis@5cfc6d1

Source-Link: googleapis/googleapis-gen@734b6e5
Copy-Tag: eyJwIjoiLmdpdGh1Yi8uT3dsQm90LnlhbWwiLCJoIjoiNzM0YjZlNTNmZGY5NzZiMzNkMWNhYjJjN2I1YmNlMDk5OWU5N2ZjZCJ9

* 🦉 Updates from OwlBot post-processor

See https://github.com/googleapis/repo-automation-bots/blob/main/packages/owl-bot/README.md

---------

Co-authored-by: Owl Bot <gcf-owl-bot[bot]@users.noreply.github.com>
vchudnov-g pushed a commit that referenced this pull request Sep 20, 2023
* chore(python): drop python 3.6

Source-Link: googleapis/synthtool@4f89b13
Post-Processor: gcr.io/cloud-devrel-public-resources/owlbot-python:latest@sha256:e7bb19d47c13839fe8c147e50e02e8b6cf5da8edd1af8b82208cd6f66cc2829c

* add api_description to .repo-metadata.json

* require python 3.7+ in setup.py

* remove python 3.6 sample configs

* remove require check for python 3.6

* 🦉 Updates from OwlBot post-processor

See https://github.com/googleapis/repo-automation-bots/blob/main/packages/owl-bot/README.md

Co-authored-by: Owl Bot <gcf-owl-bot[bot]@users.noreply.github.com>
Co-authored-by: Anthonios Partheniou <partheniou@google.com>
parthea pushed a commit that referenced this pull request Oct 21, 2023
* feat: added overrides_by_request_protocol to backend.proto
feat: added field proto_reference_documentation_uri to proto reference documentation.
feat: added SERVICE_NOT_VISIBLE and GCP_SUSPENDED into error reason

PiperOrigin-RevId: 517437454

Source-Link: googleapis/googleapis@ecb1cf0

Source-Link: googleapis/googleapis-gen@8731b8f
Copy-Tag: eyJwIjoiLmdpdGh1Yi8uT3dsQm90LnlhbWwiLCJoIjoiODczMWI4ZmQyMDQ0YTkzYzMyMzA5ZDYzM2RmYzJlODM2YmYxM2NiZiJ9

* 🦉 Updates from OwlBot post-processor

See https://github.com/googleapis/repo-automation-bots/blob/main/packages/owl-bot/README.md

* docs: Fix formatting of request arg in docstring

chore: Update gapic-generator-python to v1.9.1
PiperOrigin-RevId: 518604533

Source-Link: googleapis/googleapis@8a085ae

Source-Link: googleapis/googleapis-gen@b2ab4b0
Copy-Tag: eyJwIjoiLmdpdGh1Yi8uT3dsQm90LnlhbWwiLCJoIjoiYjJhYjRiMGEwYWUyOTA3ZTgxMmMyMDkxOThhNzRlMDg5OGFmY2IwNCJ9

* 🦉 Updates from OwlBot post-processor

See https://github.com/googleapis/repo-automation-bots/blob/main/packages/owl-bot/README.md

---------

Co-authored-by: Owl Bot <gcf-owl-bot[bot]@users.noreply.github.com>
parthea pushed a commit that referenced this pull request Oct 21, 2023
Removing ambiguous message.
Labels: api: datastore · api: storage · cla: yes · do not merge · testing
Projects: none yet

Successfully merging this pull request may close these issues: Build in regression test case retries (#531)

7 participants