Flaky: TestUnhealthyGameServersWithoutFreePorts #1376

Closed
drichardson opened this issue Feb 28, 2020 · 4 comments · Fixed by #1480
Labels: area/tests (Unit tests, e2e tests, anything to make sure things don't break), kind/bug (These are bugs.)

Comments

@drichardson
Contributor

What happened:

TestUnhealthyGameServersWithoutFreePorts failed after an unrelated documentation change commit: 3bef111

--- FAIL: TestUnhealthyGameServersWithoutFreePorts (153.06s)
    gameserver_test.go:202: 
        	Error Trace:	gameserver_test.go:202
        	Error:      	Received unexpected error:
        	            	timed out waiting for the condition
        	            	waiting for GameServer to be Unhealthy default/udp-serverhswrq
        	            	agones.dev/agones/test/e2e/framework.(*Framework).WaitForGameServerState
        	            		/go/src/agones.dev/agones/test/e2e/framework/framework.go:162
        	            	agones.dev/agones/test/e2e.TestUnhealthyGameServersWithoutFreePorts
        	            		/go/src/agones.dev/agones/test/e2e/gameserver_test.go:201
        	            	testing.tRunner
        	            		/usr/local/go/src/testing/testing.go:909
        	            	runtime.goexit
        	            		/usr/local/go/src/runtime/asm_amd64.s:1357
        	Test:       	TestUnhealthyGameServersWithoutFreePorts

What you expected to happen:

Test passes.

How to reproduce it (as minimally and precisely as possible):

Don't know how. It's another flaky e2e test.

Anything else we need to know?:
no

Environment:
From CI using Google Cloud Build

@aLekSer
Collaborator

aLekSer commented Mar 4, 2020

@akremsa
Contributor

akremsa commented Apr 16, 2020

I ran this particular test on its own more than 200 times and saw no errors.
It seems the failure can only be reproduced when the test runs together with other tests.

@aLekSer
Collaborator

aLekSer commented Apr 16, 2020

First of all, on gcloud-test-cluster this test creates 6 GameServers (why? It should be 4, as the gameserver NodePool is of size 4).
In this test we use the Static PortPolicy with port 7515, which lies inside the range (7000-8000) that is available to other GameServers. The test then creates node_count GameServers (the for range nodes.Items loop in the test) with the same port, then creates one more and waits until it becomes Unhealthy. But what happens if a GameServer in a separate test has already used this port?

I assume that in this scenario the GameServer from the parallel test could free the port before GameServer number node_count+1 is created. Since the port is now free, that last GameServer could become Ready instead, and the Unhealthy one could be GameServer number node_count - 1 (or any other from the first for loop).
We also need to use CreateGameServerAndWaitUntilReady() for all of the first node_count GameServers.
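
The race described above can be sketched with a small simulation. This is only an illustrative model, not Agones or Kubernetes code: the `node` type and the `place`, `run`, and `newNodes` helpers are hypothetical, and picking the first node with a free port is a simplification of real scheduling. It assumes 4 nodes and the static port 7515 from the test.

```go
package main

import "fmt"

// node models a cluster node; usedPorts records host ports already bound on it.
type node struct {
	name      string
	usedPorts map[int]bool
}

// newNodes builds n empty nodes (hypothetical helper for this sketch).
func newNodes(n int) []*node {
	nodes := make([]*node, 0, n)
	for i := 0; i < n; i++ {
		nodes = append(nodes, &node{name: fmt.Sprintf("node-%d", i), usedPorts: map[int]bool{}})
	}
	return nodes
}

// place binds hostPort on the first node where it is still free.
// It reports false when every node already uses the port, which is the
// situation in which the test expects a GameServer to become Unhealthy.
func place(nodes []*node, hostPort int) bool {
	for _, n := range nodes {
		if !n.usedPorts[hostPort] {
			n.usedPorts[hostPort] = true
			return true
		}
	}
	return false
}

// run creates count GameServers on the same static port and returns the
// 0-based indexes of those that could not be placed.
func run(nodes []*node, hostPort, count int) []int {
	var unplaced []int
	for i := 0; i < count; i++ {
		if !place(nodes, hostPort) {
			unplaced = append(unplaced, i)
		}
	}
	return unplaced
}

func main() {
	const port = 7515

	// Case 1: nobody else uses the port. Only the extra GameServer fails.
	fmt.Println("isolated run, unplaced:", run(newNodes(4), port, 5))

	// Case 2: a GameServer from a parallel test holds the port on one node.
	nodes := newNodes(4)
	nodes[0].usedPorts[port] = true
	first := run(nodes, port, 4) // one of the first node_count fails

	// The parallel test finishes and frees the port before the last create.
	delete(nodes[0].usedPorts, port)
	last := run(nodes, port, 1) // the extra GameServer now fits

	fmt.Println("parallel run, unplaced in loop:", first, "unplaced extra:", last)
}
```

In the second case the GameServer the test is waiting on never becomes Unhealthy, which would match the timeout seen in the report.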

So there are two solutions:

  1. Use a port outside the range (below 7000)
  2. Remove t.Parallel() from this test
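
For the first option, a test could guard against picking a colliding port. The following is a minimal sketch under the assumption that the dynamic range is 7000-8000 as mentioned above; `portRange` and `checkStaticPort` are hypothetical names, not part of the Agones test framework.

```go
package main

import "fmt"

// portRange is an inclusive range of host ports that are handed out dynamically.
type portRange struct{ min, max int }

// contains reports whether port falls inside the range.
func (r portRange) contains(port int) bool {
	return port >= r.min && port <= r.max
}

// checkStaticPort returns an error when a statically chosen port is invalid
// or could collide with ports allocated dynamically to other GameServers.
func checkStaticPort(port int, dynamic portRange) error {
	if port < 1 || port > 65535 {
		return fmt.Errorf("port %d is not a valid port number", port)
	}
	if dynamic.contains(port) {
		return fmt.Errorf("port %d lies inside the dynamic range %d-%d", port, dynamic.min, dynamic.max)
	}
	return nil
}

func main() {
	dynamic := portRange{min: 7000, max: 8000} // range assumed from the comment above

	for _, p := range []int{7515, 6999} {
		if err := checkStaticPort(p, dynamic); err != nil {
			fmt.Println("reject:", err)
			continue
		}
		fmt.Println("ok:", p)
	}
}
```

With these inputs, 7515 is rejected and 6999 is accepted.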

@aLekSer
Copy link
Collaborator

aLekSer commented Apr 16, 2020

To try to reproduce the scenario described in the comment above, you can do the following (first change name to generateName in ./examples/gameserver.yaml):

kubectl apply -f ./examples/simple-udp/fleet.yaml
kubectl get gs

Copy a port and change HostPort in ./examples/gameserver.yaml to one of the HostPort values that are in use.

$ kubectl get gs | grep ${PORT} | wc -l
1
$ for i in {0..4} ; do kubectl apply -f ./examples/gameserver.yaml ; done
$ kubectl delete fleet simple-udp

Wait for the port to be deallocated.

$ kubectl get gs | grep ${PORT} | wc -l
3
$ kubectl apply -f ./examples/gameserver.yaml 

The last GameServer could now end up in a Ready state, but it can also be Unhealthy. Usually the pod is started for the second-to-last GS (it stays in the Unhealthy state although its pod is now running) and the last one, N+1, becomes Unhealthy. In theory, though, the second-to-last and the last GameServer could each take the port freed by simple-udp with similar probability.

@markmandel markmandel added this to the 1.6.0 milestone May 19, 2020
@markmandel markmandel added the area/tests Unit tests, e2e tests, anything to make sure things don't break label May 19, 2020