Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Flaky: TestGameServerUnhealthyAfterDeletingPod #968

Conversation

markmandel
Copy link
Member

1 minute may not be enough time. Let's find out.

@markmandel markmandel added kind/cleanup Refactoring code, fixing up documentation, etc area/tests Unit tests, e2e tests, anything to make sure things don't break labels Jul 31, 2019
@agones-bot
Copy link
Collaborator

Build Failed 😱

Build Id: 25ab8109-adaa-4353-9ed4-e28891d88cbc

To get permission to view the Cloud Build view, join the agones-discuss Google Group.

@markmandel markmandel force-pushed the flaky/TestGameServerUnhealthyAfterDeletingPod branch from 5b4591c to 278e713 Compare July 31, 2019 01:46
@roberthbailey
Copy link
Member

Is it easy to run this test say 100 times before and after this change to a) see if you can reproduce the error and b) verify that this fixes it?

Also, I'm always a bit leery of changes to tests that just twiddle timeouts as they can mask actual issues that should be investigated.

@markmandel
Copy link
Member Author

I can do that locally for sure!

In this case, I figured that the gameserver being removed is an event that either happens, or it doesn't.

Given that the default terminationGracePeriodSeconds is 30 seconds, and controller resync occurs every 30 seconds - I'm wondering if there is a chance that sometimes the pod takes ~ 30 seconds to actually be deleted, and then if the event gets missed by the controller for some reason, then it'll take another 30 seconds to resync the event and ping that it is missing - which hits the 60 second timeout on this test -- but otherwise it would still actually pass. So I figured a bump in timeout may be appropriate here.

If my theory is off, the test will remain being flaky, in which case I'm totally wrong, and something else is afoot.

@roberthbailey
Copy link
Member

Your theory makes sense. It would be awesome to have some supporting evidence (e.g. 2/100 tests failed before, now 0/100 fail).

@markmandel
Copy link
Member Author

Love it! 👍

https://varunksaini.com/blog/run-go-test-multiple-times/

Makes it very easy to do. I will do this! (just not right now, I have to run off)

@agones-bot
Copy link
Collaborator

Build Succeeded 👏

Build Id: 38ace15c-b585-44ec-b37a-10a26776b4a2

The following development artifacts have been built, and will exist for the next 30 days:

A preview of the website (the last 30 builds are retained):

To install this version:

  • git fetch https://github.com/GoogleCloudPlatform/agones.git pull/968/head:pr_968 && git checkout pr_968
  • helm install install/helm/agones --namespace agones-system --name agones --set agones.image.tag=0.12.0-278e713

@roberthbailey
Copy link
Member

Since this only changes test code, I'm ok merging it between the 0.12.0 RC and 0.12.0 release.

@markmandel
Copy link
Member Author

I have run it around 50 time so far, without an error - but hit the default go test timeout of 10 minutes 😕 will rerun it with more soon.

1 minute may not be enough time. Let's find out.
@markmandel
Copy link
Member Author

Ran 100 times, with the 3m timeout, no errors.

For reference (i.e. in case I need to find it again), here is the command I used:
go test -mod=vendor -v -timeout=1h -count=100 agones.dev/agones/test/e2e -run ^TestGameServerUnhealthyAfterDeletingPod$

Running it 100 times with the 1m timeout, also passed... so now I have no idea.

I figured this was a good first educated guess at a solution.

@markmandel markmandel force-pushed the flaky/TestGameServerUnhealthyAfterDeletingPod branch from 278e713 to 51db834 Compare August 13, 2019 04:09
@agones-bot
Copy link
Collaborator

Build Succeeded 👏

Build Id: 7e149715-d334-4920-9166-131b8e9d7acb

The following development artifacts have been built, and will exist for the next 30 days:

A preview of the website (the last 30 builds are retained):

To install this version:

  • git fetch https://github.com/GoogleCloudPlatform/agones.git pull/968/head:pr_968 && git checkout pr_968
  • helm install install/helm/agones --namespace agones-system --name agones --set agones.image.tag=1.0.0-51db834

@agones-bot
Copy link
Collaborator

Build Failed 😱

Build Id: 01d7c27d-1476-474a-829b-f8805b6a2ab4

To get permission to view the Cloud Build view, join the agones-discuss Google Group.

@agones-bot
Copy link
Collaborator

Build Failed 😱

Build Id: 983aef01-343b-4864-a798-f7c8709d9c47

To get permission to view the Cloud Build view, join the agones-discuss Google Group.

@agones-bot
Copy link
Collaborator

Build Failed 😱

Build Id: 36cd084e-4730-4051-b967-1b66015eae7b

To get permission to view the Cloud Build view, join the agones-discuss Google Group.

@agones-bot
Copy link
Collaborator

Build Succeeded 👏

Build Id: 4677894f-7a16-4159-a7a5-300e7c322700

The following development artifacts have been built, and will exist for the next 30 days:

A preview of the website (the last 30 builds are retained):

To install this version:

  • git fetch https://github.com/GoogleCloudPlatform/agones.git pull/968/head:pr_968 && git checkout pr_968
  • helm install install/helm/agones --namespace agones-system --name agones --set agones.image.tag=1.0.0-ce8c8e2

@markmandel markmandel merged commit a19b418 into googleforgames:master Aug 13, 2019
@markmandel markmandel added this to the 1.0 milestone Aug 13, 2019
@markmandel markmandel deleted the flaky/TestGameServerUnhealthyAfterDeletingPod branch August 14, 2019 23:17
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/tests Unit tests, e2e tests, anything to make sure things don't break kind/cleanup Refactoring code, fixing up documentation, etc
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants