Agones fails to start the pod after updating cpu limits to 1000m #1184
Can you explain what "crashing of the pods" means? Are you able to reproduce with our simple-udp example? Having a repro case would also be useful.
It should happen with the simple-udp; I can later try to reproduce it in minikube and create a full example. I rephrased "crashing of the pods" as "pods are not able to start": basically the behaviour is that new gameserver pods never start, they are immediately terminated. I will try to create a simple-udp example to help debug the issue.
@topochan I was able to reproduce the issue - GameServers get stuck in a Creating state after the CPU limits change.
Errors in the Agones controller logs:
Some additional logs which might be useful:
I assume that we end up overcommitting nodes in the Kubernetes cluster (I used a standard setup). One more option to reproduce a similar situation is to edit the Fleet, which results in only 10 Ready replicas and nodes being overcommitted:
From the GCloud console these 2 pods (out of 12) are unschedulable:
In the similar case where we configure the Fleet by updating the image tag (the subject of this issue) we have a bigger problem: pods get terminated, new ones are started, and memory consumption of the Agones controller starts to rise linearly.
What is the resulting pod config that is trying and failing to start?
@markmandel There are two cases I mentioned above:
And then updating the next line in a Ready Fleet (2 of 2 GameServers Ready)
produces an infinite loop of GameServerSet creations and terminations:
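A minimal sketch of the kind of Fleet manifest involved, assuming the simple-udp example (the image tag and names here are illustrative, not the reporter's exact manifest):

```yaml
# Hypothetical sketch based on the simple-udp example. Re-applying a Fleet
# whose cpu limit is written as "1000m" after the API server has stored it
# canonically as "1" is the scenario discussed in this thread.
apiVersion: agones.dev/v1
kind: Fleet
metadata:
  name: simple-udp
spec:
  replicas: 2
  template:
    spec:
      template:
        spec:
          containers:
          - name: simple-udp
            image: gcr.io/agones-images/udp-server:0.15  # tag is an assumption
            resources:
              limits:
                cpu: "1000m"  # stored by the server as "1"
```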
My conclusion on the root cause of the issue:
Changing the scheduling parameter from Packed to Distributed does not help.
One more detail about the problem: we are actually creating more GameServerSets than there should be (2 at most). I assume we should create one, but according to the GameServer prefixes we have 6 at a time:
The bug reproduces even with 1 replica in a Fleet.
It might be related to kubernetes/kubernetes#66450
And we end up in a situation with infinite creation of GameServerSets. An equivalent full Containers spec without the issue:
This is exactly the issue. Funny enough, on creation "1000m" becomes "1" cpu, but not when we update. Checking kubernetes/kubernetes#66450 and the comment you linked, it looks like the same behaviour. The question is whether this weird pod state is related to Kubernetes or to Agones (probably Kubernetes, since the limits are in the pod template section).
I was going to ask this next - glad to see it works. That is a weird one. We've started testing on 1.13, and are planning on moving to it in the next release. Do we know if this also happens on 1.13? (I'm also assuming we are all using the 1.12 version of kubectl as well?)
I was testing on the most recent master today, which uses 1.13:
First: the issue itself is about Kubernetes changing the scale of resource Requests. I added additional debugging and found out that in the Fleet controller we receive different strings but equal values in filterGameServerSetByActive():
So we need to use the resource Cmp() function:
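The "different strings but equal values" situation can be sketched without the Kubernetes client libraries. `cpuMillis` below is a hypothetical, simplified stand-in for the value-level comparison that `resource.Quantity`'s `Cmp()` performs; only `Cmp()` itself is the real API referred to above.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// cpuMillis is a hypothetical, simplified stand-in for Kubernetes'
// resource.Quantity parsing: it converts a CPU quantity string to
// millicores, so "1000m" and "1" compare as equal values.
func cpuMillis(q string) (int64, error) {
	if strings.HasSuffix(q, "m") {
		return strconv.ParseInt(strings.TrimSuffix(q, "m"), 10, 64)
	}
	cores, err := strconv.ParseFloat(q, 64)
	if err != nil {
		return 0, err
	}
	return int64(cores * 1000), nil
}

func main() {
	applied, stored := "1000m", "1" // what we apply vs. what the API server stores

	// String comparison (the buggy path): these look different...
	fmt.Println("strings equal:", applied == stored) // false

	// ...but as quantities they are the same value, which is what
	// resource.Quantity.Cmp reports (0 means equal).
	a, _ := cpuMillis(applied)
	s, _ := cpuMillis(stored)
	fmt.Println("values equal:", a == s) // true
}
```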
I tested my proposed solution with the change described above.
What happened:
After updating the deployment we saw that the pods were not able to start; the update changed the cpu limit from 1500m to 1000m.
What you expected to happen:
Nothing crashes.
How to reproduce it (as minimally and precisely as possible):
Deploy a fleet with the gameserver cpu limits set to 1000m; it will work without any issue. Then update the fleet with another image tag, label, or any other change and
kubectl apply -f
the manifest again: you will see all the pods unable to start.
Anything else we need to know?:
Looks like the conversion from 1000m to 1 in the cpu limit happens when the fleet is created, but not when the fleet is updated.
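A sketch of that create-time canonicalization, assuming a simplified model of what the API server does (`canonicalCPU` is hypothetical, not a Kubernetes API):

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// canonicalCPU is a hypothetical sketch of the canonicalization applied
// when a Fleet is created: quantities that are a whole number of cores
// are rendered without the "m" suffix, so "1000m" is stored as "1"
// while "1500m" stays "1500m".
func canonicalCPU(q string) string {
	var millis int64
	if strings.HasSuffix(q, "m") {
		millis, _ = strconv.ParseInt(strings.TrimSuffix(q, "m"), 10, 64)
	} else {
		cores, _ := strconv.ParseFloat(q, 64)
		millis = int64(cores * 1000)
	}
	if millis%1000 == 0 {
		return strconv.FormatInt(millis/1000, 10) // whole cores: "1000m" -> "1"
	}
	return strconv.FormatInt(millis, 10) + "m" // fractional: stays in millicores
}

func main() {
	fmt.Println(canonicalCPU("1000m")) // "1": what the server stores on create
	fmt.Println(canonicalCPU("1500m")) // "1500m": unchanged
	// On update, the manifest still says "1000m"; comparing that string to
	// the stored "1" makes the controller think the spec changed.
}
```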
Environment:
Kubernetes version (use kubectl version):
Server Version: version.Info{Major:"1", Minor:"12", GitVersion:"v1.12.10", GitCommit:"e3c134023df5dea457638b614ee17ef234dc34a6", GitTreeState:"clean", BuildDate:"2019-07-08T03:40:54Z", GoVersion:"go1.10.8", Compiler:"gc", Platform:"linux/amd64"}