
Gameserver is not removed when node hosting gameserver pod is shutdown #1102

Closed
djsell opened this issue Oct 6, 2019 · 12 comments
Labels: area/user-experience (Pertaining to developers trying to use Agones, e.g. SDK, installation, etc), kind/bug (These are bugs)
Milestone: 1.4.0

Comments


djsell commented Oct 6, 2019

What happened:
After a node hosting a game server shuts down, the game server is not removed from the game server list. A fleet will not replace the missing game server.

What you expected to happen:
The gameserver should be removed once the pod is no longer running, and the fleet should start a new gameserver up.

How to reproduce it (as minimally and precisely as possible):
I'm using GKE.
1. Create a cluster with node size 1.
2. Create a fleet that is running a game server (I'm using a fleet autoscaler with bufferSize: 1, minReplicas: 1, maxReplicas: 5; see the sketch after this list).
3. Resize the cluster to 0.
4. The GS will still be listed, even though no pod is listed.
5. Resize the cluster to 1.
6. The old GS will still be listed, no pod is running, and no new pod is started.
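For reference, a FleetAutoscaler with those Buffer values might look like the sketch below (the fleet and autoscaler names are illustrative, not taken from this report):

apiVersion: autoscaling.agones.dev/v1
kind: FleetAutoscaler
metadata:
  name: ncc-fleet-autoscaler   # illustrative name
spec:
  fleetName: ncc-fleet         # assumed fleet name (the gameservers below are named ncc-fleet-*)
  policy:
    type: Buffer
    buffer:
      bufferSize: 1
      minReplicas: 1
      maxReplicas: 5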

Anything else we need to know?:

Environment:

  • Agones version: chart: agones-1.0.0 app-version: 1.0.0
  • Kubernetes version (use kubectl version):
    Client Version: version.Info{Major:"1", Minor:"15", GitVersion:"v1.15.1", GitCommit:"4485c6f18cee9a5d3c3b4e523bd27972b1b53892", GitTreeState:"clean", BuildDate:"2019-07-18T14:25:20Z", GoVersion:"go1.12.7", Compiler:"gc", Platform:"darwin/amd64"}
    Server Version: version.Info{Major:"1", Minor:"14+", GitVersion:"v1.14.3-gke.11", GitCommit:"cde86d2e1416a0c6c4bb964e1a13e8fa0a83a616", GitTreeState:"clean", BuildDate:"2019-08-12T20:57:47Z", GoVersion:"go1.12.5b4", Compiler:"gc", Platform:"linux/amd64"}
  • Cloud provider or hardware configuration: GKE
  • Install method (yaml/helm): helm
  • Troubleshooting guide log(s):
  • Others:

Cluster size 1:

daniels-mbp:~ dsell$ kubectl --namespace ncc get gs
NAME                    STATE   ADDRESS         PORT   NODE                                       AGE
ncc-fleet-b6zb5-ckp8c   Ready   35.232.204.72   7359   gke-cluster-2-default-pool-8f79228e-h24w   11h
daniels-mbp:~ dsell$ kubectl --namespace ncc get po
NAME                                   READY   STATUS    RESTARTS   AGE
director-deployment-7bbb5fb7fd-kthld   1/1     Running   0          20h
ncc-fleet-b6zb5-ckp8c                  2/2     Running   0          11h

Resize cluster to 0:

daniels-mbp:~ dsell$ kubectl --namespace ncc get gs
NAME                    STATE   ADDRESS         PORT   NODE                                       AGE
ncc-fleet-b6zb5-ckp8c   Ready   35.232.204.72   7359   gke-cluster-2-default-pool-8f79228e-h24w   20h
daniels-mbp:~ dsell$ kubectl --namespace ncc get po
NAME                                   READY   STATUS    RESTARTS   AGE
director-deployment-7bbb5fb7fd-zzxrk   0/1     Pending   0          61s

Resize cluster to 1:

daniels-mbp:~ dsell$ kubectl --namespace ncc get gs
NAME                    STATE   ADDRESS         PORT   NODE                                       AGE
ncc-fleet-b6zb5-ckp8c   Ready   35.232.204.72   7359   gke-cluster-2-default-pool-8f79228e-h24w   21h
daniels-mbp:~ dsell$ kubectl --namespace ncc get po
NAME                                   READY   STATUS    RESTARTS   AGE
director-deployment-7bbb5fb7fd-zzxrk   1/1     Running   0          8m25s
djsell added the kind/bug label on Oct 6, 2019
@Omegastick

I'm getting this too. Running Kubernetes on GKE with Agones v0.12.0. I have to delete the gameservers manually every time I 'turn on' (scale from 0 to 1) my development cluster.

@roberthbailey
Member

I've also found that if you (accidentally or purposefully) delete a pod backing a gameserver (e.g. kubectl delete pod <NAME> instead of kubectl delete gs <NAME>), you get into this situation as well.

Last time I did it, I asked @markmandel if this was expected behavior and (if I'm remembering correctly) he said that if someone is manually tinkering with resources under the gameserver it's ok to let people shoot themselves in the foot. And that the system should eventually self heal (but it sounds like that may not be happening for you).

On the other hand, if scaling the cluster (rather than manually deleting resources) can also get you into this situation, then I think that might raise the priority for addressing it.

Another question: how long did you wait with the pod in the pending state? I would expect the pod going pending to cause the gameserver to go unhealthy, at which point, if the game server is running as part of a fleet, the fleet controller would replace it with a fresh game server.

It's possible that the check for game servers no longer running (in the controller since the sidecar can no longer provide health checks) doesn't catch this edge case.
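(For context: the health checking mentioned here is configured per GameServer under spec.health. Below is a minimal sketch with illustrative values; setting disabled to true turns health checking off entirely, in which case the controller will not mark the GameServer Unhealthy.)

apiVersion: agones.dev/v1
kind: GameServer
metadata:
  name: example-gameserver          # illustrative
spec:
  health:
    disabled: false                 # true disables health checking entirely
    initialDelaySeconds: 5
    periodSeconds: 5
    failureThreshold: 3
  ports:
    - name: default
      containerPort: 7654
  template:
    spec:
      containers:
        - name: simple-udp
          image: gcr.io/agones-images/udp-server:0.17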

@Omegastick

@roberthbailey In my case, the system has been left in that state over the weekend, and come Monday hasn't repaired itself. I always delete the gameservers shortly after scaling the cluster back up though, so maybe it only self-heals after scaling back up.

It's worth noting that if I do kubectl get pods while the cluster has no nodes, rather than returning pods in a 'pending' state, it doesn't return any pods at all. Not sure if that is expected behaviour or not.


djsell commented Oct 8, 2019

Just to be clear, that pending pod has nothing to do with agones or my game servers. It is just another k8s Deployment in the same namespace. It is 0/1 pending because the cluster is at size 0 (no nodes available to run the pod).


djsell commented Oct 8, 2019

And to recover my game server state after a scale to 0 and back to 1, I just normally run kubectl --namespace ncc delete gs --all

Also, I should mention that I get in this situation not because I actually am scaling my cluster to 0, but because I have a cheap dev cluster of size 1 using GKE with preemptible instances. Every now and then the instance will be terminated and afterwards I need to manually delete non-existent game servers in order for the fleet to recover.

While I think this specific case is unlikely to happen for production setups, I do worry that some other events could lead a production system to get into this condition.


aLekSer commented Oct 14, 2019

@djsell
Have you set the health check to disabled? (https://agones.dev/site/docs/reference/agones_crd_api_reference/#Health)
The first step was to run make gcloud-test-cluster. I ran the following scenario: I did not touch the agones-system node pool, then I scaled the default node pool down to 0. All gameservers, whether in a fleet or not, became Unhealthy, and I saw this state:

simple-udp-cc8m6             Unhealthy   35.203.152.215   7170   gke-test-cluster-default-4c43097e-l3tr   5m                                                                                             
simple-udp-lvprx             Starting                                                                     45s                                                                                            
simple-udp-mj6n5-89bb6       Starting                                                                     4m    

After scaling the default node pool back up to 3 nodes:

simple-udp-cc8m6             Unhealthy   35.203.152.215   7170   gke-test-cluster-default-4c43097e-l3tr   57m                                                                                            
simple-udp-lvprx             Unhealthy   35.203.152.215   7182   gke-test-cluster-default-4c43097e-5krk   52m                                                                                            
simple-udp-mj6n5-k5mck       Ready       35.203.152.215   7112   gke-test-cluster-default-4c43097e-5krk   49m                                                                                            
simple-udp-mj6n5-zc9rm       Ready       35.203.152.215   7467   gke-test-cluster-default-4c43097e-5krk   49m

So the gameservers in the simple-udp fleet (simple-udp/fleet.yaml) were restarted: simple-udp-mj6n5-k5mck and simple-udp-mj6n5-zc9rm in the listing above.


aLekSer commented Oct 14, 2019

@djsell Daniel,

Were you scaling all the nodes in your cluster to 0, which also removed the agones-controller?

If I scale both the agones-system and default (gameserver) node pools to 0, then I am able to reproduce the issue. However, the agones-controller is not running at that point, so we need to look at how other projects work around a similar issue when the CRD's pods and its controller are both down.

So it is probably better to file a separate issue: after a restart, agones-controller does not update the statuses of Fleets and GameServers.

After restarting the controller, the state of the fleets is not accurate either; CURRENT is not reset to 0:

# kubectl get gs
simple-udp-mj6n5-bg4gd       Ready       34.83.22.194     7279   gke-test-cluster-default-4c43097e-c9d3   36m                                                                                            
simple-udp-mj6n5-z5t5s       Ready       34.83.22.194     7801   gke-test-cluster-default-4c43097e-c9d3   36m                                        
# kubectl get fleets
NAME                        SCHEDULING   DESIRED   CURRENT   ALLOCATED   READY   AGE
fleet-example               Packed       2         2         0           2       49m
simi2wqple-udp              Packed       2         2         0           2       1h
simple-udp                  Packed       2         2         0           2       1h


aLekSer commented Oct 14, 2019

I was thinking that we need an OwnerReference between the GameServer and its backing pod; then I assume the Kubernetes garbage collector should automatically remove the GameServer CRD:
https://kubernetes.io/docs/concepts/workloads/controllers/garbage-collection/
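For reference, an ownerReferences block has the shape sketched below (names and uid are illustrative). Note that garbage collection deletes the dependent when its owner is removed: with the Pod owned by the GameServer (which is how Agones sets it up today, as far as I can tell), deleting the GameServer garbage-collects the Pod, not the reverse. Removing the GameServer when its Pod disappears would need the reference in the other direction, or a controller that notices the missing Pod.

apiVersion: v1
kind: Pod
metadata:
  name: simple-udp-example                 # illustrative Pod name
  ownerReferences:
    - apiVersion: agones.dev/v1
      kind: GameServer
      name: simple-udp-example             # the owning GameServer
      uid: 11111111-2222-3333-4444-555555555555   # illustrative uid
      controller: true
      blockOwnerDeletion: true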


djsell commented Oct 14, 2019

@aLekSer Thanks for checking up on this. No, I did not disable the health check.

Yes, I am using the same cluster for agones-system and my game server namespace, so scaling down also takes the agones-controller to 0.

So it is probably better to file a separate issue: after a restart, agones-controller does not update the statuses of Fleets and GameServers.

I believe this is correct.

markmandel added the area/user-experience label on Oct 23, 2019

aLekSer commented Feb 19, 2020

I will retest this; I think the recently added missingPodsController should handle this properly as well.


aLekSer commented Feb 19, 2020

Testing on the current GKE cluster configuration from make gcloud-test-cluster (defined in cluster.yml.jinja): 6 nodes, 1 for agones-system and 4 in the default node pool (for GameServers).
As a first step I resized agones-system to 0 nodes, then the default node pool to 0 nodes:

kubectl get pods
No resources found.
$ kubectl get gs
NAME                     STATE   ADDRESS          PORT   NODE                                     AGE                                                                                                   
simple-udp-lm95l-tk6kb   Ready   35.203.147.246   7290   gke-test-cluster-default-6990460b-dkrr   7m15s                                                                                                 
simple-udp-lm95l-vj24q   Ready   35.203.147.246   7383   gke-test-cluster-default-6990460b-dkrr   7m15s                                                                                                 
$ kubectl get pods --namespace agones-system
NAME                                READY   STATUS        RESTARTS   AGE
agones-allocator-7d4fc49475-bsclx   0/1     Pending       0          36s
agones-allocator-7d4fc49475-kgqtp   0/1     Terminating   0          3m54s
agones-allocator-7d4fc49475-mms4p   0/1     Pending       0          35s
agones-allocator-7d4fc49475-qpljk   0/1     Pending       0          36s
agones-controller-75657bc95-8kf42   0/1     Pending       0          36s
agones-ping-6685656c5d-tx54b        0/1     Pending       0          36s
agones-ping-6685656c5d-w7g8k        0/1     Pending       0          35s

Then I resized agones-system back to 1.
After the controller restarted, I see:

kubectl get gs
NAME                     STATE   ADDRESS          PORT   NODE                                     AGE                                                                                                   
simple-udp-lm95l-tk6kb   Ready   35.203.147.246   7290   gke-test-cluster-default-6990460b-dkrr   13m                                                                                                   
simple-udp-lm95l-vj24q   Ready   35.203.147.246   7383   gke-test-cluster-default-6990460b-dkrr   13m                                                                                                   
$ kubectl get gs
NAME                     STATE      ADDRESS   PORT   NODE   AGE
simple-udp-lm95l-qj529   Starting                           1s
simple-udp-lm95l-xvw2x   Starting                           1s
$ kubectl get pods
NAME                     READY   STATUS    RESTARTS   AGE
simple-udp-lm95l-qj529   0/2     Pending   0          7s
simple-udp-lm95l-xvw2x   0/2     Pending   0          7s

Events for one of the gameservers:

kubectl get events | grep simple-udp-lm95l-tk6kb
18m         Normal    PortAllocation                                                                                               gameserver/simple-udp-lm95l-tk6kb                   Port allocated   
18m         Normal    Creating                                                                                                     gameserver/simple-udp-lm95l-tk6kb                   Pod simple-udp-lm95l-tk6kb created
18m         Normal    Scheduled                                                                                                    pod/simple-udp-lm95l-tk6kb                          Successfully assigned default/simple-udp-lm95l-tk6kb to gke-test-cluster-default-6990460b-dkrr
18m         Normal    Scheduled                                                                                                    gameserver/simple-udp-lm95l-tk6kb                   Address and port populated
18m         Normal    Pulled                                                                                                       pod/simple-udp-lm95l-tk6kb                          Container image "gcr.io/agones-images/udp-server:0.17" already present on machine
18m         Normal    Created                                                                                                      pod/simple-udp-lm95l-tk6kb                          Created container simple-udp
18m         Normal    Started                                                                                                      pod/simple-udp-lm95l-tk6kb                          Started container simple-udp
18m         Normal    Pulled                                                                                                       pod/simple-udp-lm95l-tk6kb                          Container image "gcr.io/agones-images/agones-sdk:1.4.0-25339b5" already present on machine
18m         Normal    Created                                                                                                      pod/simple-udp-lm95l-tk6kb                          Created container agones-gameserver-sidecar
18m         Normal    Started                                                                                                      pod/simple-udp-lm95l-tk6kb                          Started container agones-gameserver-sidecar
18m         Normal    RequestReady                                                                                                 gameserver/simple-udp-lm95l-tk6kb                   SDK state change 
18m         Normal    Ready                                                                                                        gameserver/simple-udp-lm95l-tk6kb                   SDK.Ready() complete
12m         Normal    Killing                                                                                                      pod/simple-udp-lm95l-tk6kb                          Stopping container simple-udp
12m         Normal    Killing                                                                                                      pod/simple-udp-lm95l-tk6kb                          Stopping container agones-gameserver-sidecar
5m33s       Warning   Unhealthy                                                                                                    gameserver/simple-udp-lm95l-tk6kb                   Pod is missing   
5m33s       Normal    Shutdown                                                                                                     gameserver/simple-udp-lm95l-tk6kb                   Deletion started 
18m         Normal    SuccessfulCreate                                                                                             gameserverset/simple-udp-lm95l                      Created gameserver: simple-udp-lm95l-tk6kb
5m33s       Normal    SuccessfulDelete                                                                                             gameserverset/simple-udp-lm95l                      Deleted gameserver in state Unhealthy: simple-udp-lm95l-tk6kb

@markmandel I think we can close this ticket now.

@markmandel
Member

Awesome. Closing ticket!

markmandel added this to the 1.4.0 milestone on Feb 19, 2020