
Reduce flakiness of Monitoring system tests. #2095

Merged: tseaver merged 1 commit into googleapis:master from tseaver:2092-2093-monitoring-flaky-system-tests on Aug 18, 2016
Conversation

@tseaver (Contributor) commented Aug 11, 2016

  • Add retries for 503 ServiceUnavailable (eventual) on creation.
  • Add retries for 404 NotFound (eventual consistency) on deletion.
  • Ensure that parent groups get cleaned up even if child group deletion fails.

Towards: #2092, #2093.

@tseaver added the testing, api: monitoring, and flaky labels on Aug 11, 2016
@googlebot added the cla: yes label on Aug 11, 2016
# Imports shown for context; exact module paths at the time may differ.
from google.cloud.exceptions import NotFound, ServiceUnavailable
from retry import RetryErrors
from system_test_utils import unique_resource_id

retry_404 = RetryErrors(NotFound)  # deletes racing a just-created resource
retry_503 = RetryErrors(ServiceUnavailable)  # transient backend errors
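For context, `RetryErrors` is the system-test helper that wraps a callable and retries it while a given exception type is raised. A minimal sketch of such a helper, assuming simple exponential backoff (the repository's actual implementation may differ in signature and policy):

```python
import time


class RetryErrors(object):
    """Retry a callable while it raises a given exception type."""

    def __init__(self, exception, max_tries=4, delay=1, backoff=2):
        self.exception = exception
        self.max_tries = max_tries  # total attempts, including the first
        self.delay = delay          # initial sleep between attempts, seconds
        self.backoff = backoff      # multiplier applied after each failure

    def __call__(self, to_wrap):
        def wrapped(*args, **kwargs):
            tries_remaining = self.max_tries
            delay = self.delay
            while True:
                try:
                    return to_wrap(*args, **kwargs)
                except self.exception:
                    tries_remaining -= 1
                    if tries_remaining <= 0:
                        raise
                    time.sleep(delay)
                    delay *= self.backoff
        return wrapped
```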


@dhermes (Contributor) commented Aug 11, 2016

No real concerns, but I hesitate to give the LGTM because I don't understand why any of the errors occur

@dhermes (Contributor) commented Aug 11, 2016

Crazy we've got one service using "503 Service Unavailable" to mean "409 Conflict" and another using "403 Forbidden" to mean "429 Too Many Requests".

@tseaver (Contributor, Author) commented Aug 11, 2016

"We don't need no stinking RFCs."

@dhermes (Contributor) commented Aug 11, 2016

@rimey Can you chime in here (and/or on #2092 and #2093) about write/delete conflicts for groups (seen as a 503), as well as eventual consistency of a delete after a create (seen as a 404)?

@rimey (Contributor) commented Aug 11, 2016

@dhermes This is a matter of debugging each particular mysterious error, starting with experimentation to isolate the circumstances under which it occurs. If @supriyagarg has time to work on this, and some reproducible error remains mysterious, she and I can work together internally to try to trace it to the underlying cause. A good first step would be to create some issues.

@dhermes (Contributor) commented Aug 11, 2016

@rimey are #2092 and #2093 isolated enough?

@rimey (Contributor) commented Aug 11, 2016

The issues are good. supriyagarg@ is willing to work on reproducing them in isolation.

@rimey (Contributor) commented Aug 12, 2016

@dhermes Have you seen a delete return 404 after a create?

@dhermes (Contributor) commented Aug 12, 2016

@rimey I have not but maybe @tseaver has

@tseaver (Contributor, Author) commented Aug 17, 2016

@rimey That is why I added the retries for descriptor / group deletes. I assume that the newly-created entity hasn't yet propagated to the host / layer handling the deletions.
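Illustratively, the delete retry wraps the call like this, using the `retry_404` helper from the diff above (`group` stands in for any just-created resource):

```python
# The group was created moments ago; the replica handling the delete may
# not have seen the create yet, so a 404 here is retried until the create
# has propagated.
retry_404(group.delete)()
```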

@tseaver (Contributor, Author) commented Aug 17, 2016

ISTM we should just merge these retries before the PR bitrots, unless @rimey or @supriyagarg need us to hold off for their own testing.

@rimey (Contributor) commented Aug 17, 2016

I'd like to have a clear record of what errors have been observed before retries are added everywhere. #2092 reports a 503 on deleting a group. #2093 reports a 503 on creating a group. This is not completely consistent with the discussion above. Where did you definitely see 404s, and did you see errors for metric descriptors?

@dhermes (Contributor) commented Aug 17, 2016

> I'd like to have a clear record of what errors have been observed before retries are added everywhere

Ditto

@dhermes (Contributor) commented Aug 17, 2016

> ISTM we should just merge these retries before the PR bitrots

I agree. @rimey and @supriyagarg, can we put a fixed end time to get this resolved before merging? Let's say EOD Monday, August 22?

@tseaver (Contributor, Author) commented Aug 17, 2016

@rimey I saw the 404s for metric and descriptor delete operations when running the tests on my local machine; I have not seen those failures on Travis.

@rimey (Contributor) commented Aug 17, 2016

We failed to reproduce any of these transient errors internally. Nevertheless, we have clear reports of 503s on creation and on deletion of groups in #2093 and #2092, respectively. Thank you for those. We would be grateful for any additional reports of other transient errors.

@tseaver You mention "404s for metric and descriptor delete operations" in the comment above. Do you mean metric descriptor delete operations?

I urge you not to add retries except where you have actually observed the error code in question on that type of request.

We currently believe that 503 is the appropriate error code for the errors reported in #2093 and #2092. We will be updating the error message to be less misleading.

@tseaver (Contributor, Author) commented Aug 17, 2016

@rimey yes, I saw 404s explicitly here for custom metric descriptors and here for groups.

@dhermes (Contributor) commented Aug 17, 2016

> We currently believe that 503 is the appropriate error code for the errors

A 5xx error signals to a developer that the backend has failed to handle the request. The fact that the API can successfully return an error with useful information means the backend is working just fine; the operation simply failed for some reason.

UPDATE: Your service, your call, but it will confound more developers than just me.

@rimey (Contributor) commented Aug 17, 2016

@dhermes That is correct. In particular, the 503 is signaling that the operation failed due to a transient condition and can be retried as-is.
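In test terms, that means the create call can be retried as-is; a sketch using the `retry_503` helper from the diff above (`group` stands in for any Monitoring resource handle):

```python
# A create that occasionally fails with a transient 503 is simply
# retried until it succeeds (or the retry budget is exhausted).
retry_503(group.create)()
```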

@tseaver (Contributor, Author) commented Aug 18, 2016

@rimey Are you asking me to back out the retry_404 bits of this PR until we see the error show up on Travis?

@rimey (Contributor) commented Aug 18, 2016

@tseaver No. If it happened anywhere, we know it can happen.

I'll leave the details of this PR up to you and @dhermes, but I want to make one more (admittedly unhelpful) comment: While it's correct by definition to retry on 503, retrying on 404 is generally questionable. We presume that the 404 is because the resource doesn't exist yet, but it could also be because it has been deleted. Nevertheless, I'm okay with retrying on 404 in situations like this where you have good reason to presume that it's because the resource doesn't exist yet.

@tseaver (Contributor, Author) commented Aug 18, 2016

@rimey I'm in violent agreement that we don't want to sprinkle retry_404 everywhere: in the case of our system tests, we know the entity was just created, and are trying to tear it down for cleanup (or to test the delete method directly), so retrying seems correct.
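A sketch of that teardown pattern, with hypothetical names (the real test fixtures differ): children are deleted before parents, and a failure on one resource does not prevent cleanup of the rest.

```python
def tear_down_groups(groups_to_delete):
    """Delete child groups before their parents, attempting every delete."""
    failures = []
    for group in reversed(groups_to_delete):  # children were created last
        try:
            # Retry the 404 that a just-created group can return.
            retry_404(group.delete)()
        except Exception as error:
            # Record and continue, so parent groups still get cleaned up.
            failures.append((group, error))
    if failures:
        raise RuntimeError('cleanup failed: %r' % (failures,))
```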

@tseaver merged commit d968ae4 into googleapis:master on Aug 18, 2016
@tseaver deleted the 2092-2093-monitoring-flaky-system-tests branch on August 18, 2016 at 18:22
@rimey (Contributor) commented Aug 18, 2016

For the record, we changed the message for this particular 503 from "Write collision, please retry." to "The service is currently unavailable, please retry." The change is expected to roll out next week.

@dhermes (Contributor) commented Aug 18, 2016

Thanks for the heads up
