
Gameserver is not removed when node hosting gameserver pod is shutdown #1102

Closed
djsell opened this issue Oct 6, 2019 · 12 comments
Labels: area/user-experience (Pertaining to developers trying to use Agones, e.g. SDK, installation, etc), kind/bug (These are bugs)
Milestone: 1.4.0

Comments


djsell commented Oct 6, 2019

What happened:
After a node hosting a game server shuts down, the game server is not removed from the game server list. A fleet will not replace the missing game server.

What you expected to happen:
The gameserver should be removed once the pod is no longer running, and the fleet should start a new gameserver up.

How to reproduce it (as minimally and precisely as possible):
I'm using GKE.
1. Create a cluster with node size 1.
2. Create a fleet that is running a game server (I'm using a fleet autoscaler with bufferSize: 1, minReplicas: 1, maxReplicas: 5; see the sketch after this list).
3. Resize the cluster to 0.
4. The GS will still be listed, even though no pod is listed.
5. Resize the cluster to 1.
6. The old GS will still be listed, no pod is running, and no new pod is started.
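For reference, a FleetAutoscaler with those Buffer values might look like the sketch below (the fleet and autoscaler names are illustrative, not taken from this report):

apiVersion: autoscaling.agones.dev/v1
kind: FleetAutoscaler
metadata:
  name: ncc-fleet-autoscaler   # illustrative name
spec:
  fleetName: ncc-fleet         # assumed fleet name (the gameservers below are named ncc-fleet-*)
  policy:
    type: Buffer
    buffer:
      bufferSize: 1
      minReplicas: 1
      maxReplicas: 5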

Anything else we need to know?:

Environment:

  • Agones version: chart: agones-1.0.0 app-version: 1.0.0
  • Kubernetes version (use kubectl version):
    Client Version: version.Info{Major:"1", Minor:"15", GitVersion:"v1.15.1", GitCommit:"4485c6f18cee9a5d3c3b4e523bd27972b1b53892", GitTreeState:"clean", BuildDate:"2019-07-18T14:25:20Z", GoVersion:"go1.12.7", Compiler:"gc", Platform:"darwin/amd64"}
    Server Version: version.Info{Major:"1", Minor:"14+", GitVersion:"v1.14.3-gke.11", GitCommit:"cde86d2e1416a0c6c4bb964e1a13e8fa0a83a616", GitTreeState:"clean", BuildDate:"2019-08-12T20:57:47Z", GoVersion:"go1.12.5b4", Compiler:"gc", Platform:"linux/amd64"}
  • Cloud provider or hardware configuration: GKE
  • Install method (yaml/helm): helm
  • Troubleshooting guide log(s):
  • Others:

Cluster size 1:

daniels-mbp:~ dsell$ kubectl --namespace ncc get gs
NAME                    STATE   ADDRESS         PORT   NODE                                       AGE
ncc-fleet-b6zb5-ckp8c   Ready   35.232.204.72   7359   gke-cluster-2-default-pool-8f79228e-h24w   11h
daniels-mbp:~ dsell$ kubectl --namespace ncc get po
NAME                                   READY   STATUS    RESTARTS   AGE
director-deployment-7bbb5fb7fd-kthld   1/1     Running   0          20h
ncc-fleet-b6zb5-ckp8c                  2/2     Running   0          11h

Resize cluster to 0:

daniels-mbp:~ dsell$ kubectl --namespace ncc get gs
NAME                    STATE   ADDRESS         PORT   NODE                                       AGE
ncc-fleet-b6zb5-ckp8c   Ready   35.232.204.72   7359   gke-cluster-2-default-pool-8f79228e-h24w   20h
daniels-mbp:~ dsell$ kubectl --namespace ncc get po
NAME                                   READY   STATUS    RESTARTS   AGE
director-deployment-7bbb5fb7fd-zzxrk   0/1     Pending   0          61s

Resize cluster to 1:

daniels-mbp:~ dsell$ kubectl --namespace ncc get gs
NAME                    STATE   ADDRESS         PORT   NODE                                       AGE
ncc-fleet-b6zb5-ckp8c   Ready   35.232.204.72   7359   gke-cluster-2-default-pool-8f79228e-h24w   21h
daniels-mbp:~ dsell$ kubectl --namespace ncc get po
NAME                                   READY   STATUS    RESTARTS   AGE
director-deployment-7bbb5fb7fd-zzxrk   1/1     Running   0          8m25s
djsell added the kind/bug label on Oct 6, 2019
@Omegastick

I'm getting this too. Running Kubernetes on GKE with Agones v0.12.0. I have to delete the gameservers manually every time I 'turn on' (scale from 0 to 1) my development cluster.

@roberthbailey
Member

I've also found that if you (accidentally or purposefully) delete a pod backing a gameserver (e.g. kubectl delete pod <NAME> instead of kubectl delete gs <NAME>), you get into this situation as well.

Last time I did it, I asked @markmandel if this was expected behavior and (if I'm remembering correctly) he said that if someone is manually tinkering with resources under the gameserver it's ok to let people shoot themselves in the foot. And that the system should eventually self heal (but it sounds like that may not be happening for you).

On the other hand, if scaling the cluster (rather than manually deleting resources) can also get you into this situation, then I think that might raise the priority for addressing it.

Another question: how long did you wait with the pod in the pending state? I would expect the pod going pending to cause the gameserver to go unhealthy, at which point, if the game server is running as part of a fleet, the fleet controller would replace it with a fresh game server.

It's possible that the check for game servers no longer running (in the controller since the sidecar can no longer provide health checks) doesn't catch this edge case.
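(For context: the health checking mentioned here is configured per GameServer under spec.health. Below is a minimal sketch with illustrative values; setting disabled to true turns health checking off entirely, in which case the controller will not mark the GameServer Unhealthy.)

apiVersion: agones.dev/v1
kind: GameServer
metadata:
  name: example-gameserver          # illustrative
spec:
  health:
    disabled: false                 # true disables health checking entirely
    initialDelaySeconds: 5
    periodSeconds: 5
    failureThreshold: 3
  ports:
    - name: default
      containerPort: 7654
  template:
    spec:
      containers:
        - name: simple-udp
          image: gcr.io/agones-images/udp-server:0.17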

@Omegastick

@roberthbailey In my case, the system has been left in that state over the weekend, and come Monday hasn't repaired itself. I always delete the gameservers shortly after scaling the cluster back up though, so maybe it only self-heals after scaling back up.

It's worth noting that if I do kubectl get pods while the cluster has no nodes, rather than returning pods in a 'pending' state, it doesn't return any pods at all. Not sure if that is expected behaviour or not.


djsell commented Oct 8, 2019

Just to be clear, that pending pod has nothing to do with agones or my game servers. It is just another k8s Deployment in the same namespace. It is 0/1 pending because the cluster is at size 0 (no nodes available to run the pod).


djsell commented Oct 8, 2019

And to recover my game server state after a scale to 0 and back to 1, I just normally run kubectl --namespace ncc delete gs --all

Also, I should mention that I get in this situation not because I actually am scaling my cluster to 0, but because I have a cheap dev cluster of size 1 using GKE with preemptible instances. Every now and then the instance will be terminated and afterwards I need to manually delete non-existent game servers in order for the fleet to recover.

While I think this specific case is unlikely to happen for production setups, I do worry that some other events could lead a production system to get into this condition.


aLekSer commented Oct 14, 2019

@djsell
Have you set the health check to disabled? (https://agones.dev/site/docs/reference/agones_crd_api_reference/#Health)
The first step was to run make gcloud-test-cluster. I ran the following scenario: I did not touch the agones-system node pool, then I scaled the default node pool down to 0. All gameservers, whether in a fleet or not, became Unhealthy, and I saw this state:

simple-udp-cc8m6             Unhealthy   35.203.152.215   7170   gke-test-cluster-default-4c43097e-l3tr   5m                                                                                             
simple-udp-lvprx             Starting                                                                     45s                                                                                            
simple-udp-mj6n5-89bb6       Starting                                                                     4m    

After scaling the default node pool back up to 3 nodes:

simple-udp-cc8m6             Unhealthy   35.203.152.215   7170   gke-test-cluster-default-4c43097e-l3tr   57m                                                                                            
simple-udp-lvprx             Unhealthy   35.203.152.215   7182   gke-test-cluster-default-4c43097e-5krk   52m                                                                                            
simple-udp-mj6n5-k5mck       Ready       35.203.152.215   7112   gke-test-cluster-default-4c43097e-5krk   49m                                                                                            
simple-udp-mj6n5-zc9rm       Ready       35.203.152.215   7467   gke-test-cluster-default-4c43097e-5krk   49m

So the gameservers in the simple-udp fleet (simple-udp/fleet.yaml) were restarted: simple-udp-mj6n5-k5mck and simple-udp-mj6n5-zc9rm in the listing above.


aLekSer commented Oct 14, 2019

@djsell Daniel,

Were you scaling all the nodes in your cluster to 0, which also removed the agones-controller?

If I scale both the agones-system and default (gameserver) node pools to 0, then I am able to reproduce the issue. However, the agones-controller is not running at that point, so we need to look at how other projects work around a similar issue when the CRD's pods and its controller are both down.

So it is probably better to file a separate issue: after a restart, agones-controller does not update the statuses of Fleets and GameServers.

After restarting the controller, the state of the fleets is not accurate either; CURRENT is not reset to 0:

# kubectl get gs
simple-udp-mj6n5-bg4gd       Ready       34.83.22.194     7279   gke-test-cluster-default-4c43097e-c9d3   36m                                                                                            
simple-udp-mj6n5-z5t5s       Ready       34.83.22.194     7801   gke-test-cluster-default-4c43097e-c9d3   36m                                        
# kubectl get fleets
NAME                        SCHEDULING   DESIRED   CURRENT   ALLOCATED   READY   AGE
fleet-example               Packed       2         2         0           2       49m
simi2wqple-udp              Packed       2         2         0           2       1h
simple-udp                  Packed       2         2         0           2       1h


aLekSer commented Oct 14, 2019

I was thinking that we need an OwnerReference between the GameServer and its backing pod; then I assume the Kubernetes garbage collector should automatically remove the GameServer CRD:
https://kubernetes.io/docs/concepts/workloads/controllers/garbage-collection/
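For reference, an ownerReferences block has the shape sketched below (names and uid are illustrative). Note that garbage collection deletes the dependent when its owner is removed: with the Pod owned by the GameServer (which is how Agones sets it up today, as far as I can tell), deleting the GameServer garbage-collects the Pod, not the reverse. Removing the GameServer when its Pod disappears would need the reference in the other direction, or a controller that notices the missing Pod.

apiVersion: v1
kind: Pod
metadata:
  name: simple-udp-example                 # illustrative Pod name
  ownerReferences:
    - apiVersion: agones.dev/v1
      kind: GameServer
      name: simple-udp-example             # the owning GameServer
      uid: 11111111-2222-3333-4444-555555555555   # illustrative uid
      controller: true
      blockOwnerDeletion: true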


djsell commented Oct 14, 2019

@aLekSer Thanks for checking up on this. No, I did not disable the health check.

Yes, I am using the same cluster for agones-system and my game server namespace, so scaling down also takes the agones-controller to 0.

So it is probably better to file a separate issue: after a restart, agones-controller does not update the statuses of Fleets and GameServers.

I believe this is correct.

markmandel added the area/user-experience label on Oct 23, 2019

aLekSer commented Feb 19, 2020

I will retest this; I think the recently added missingPodsController should handle this properly as well.


aLekSer commented Feb 19, 2020

Testing on the current GKE cluster configuration from make gcloud-test-cluster (defined in cluster.yml.jinja): 6 nodes, 1 for agones-system and 4 in the default node pool (for GameServers).
As a first step I resized agones-system to 0 nodes, then the default node pool to 0 nodes:

kubectl get pods
No resources found.
$ kubectl get gs
NAME                     STATE   ADDRESS          PORT   NODE                                     AGE                                                                                                   
simple-udp-lm95l-tk6kb   Ready   35.203.147.246   7290   gke-test-cluster-default-6990460b-dkrr   7m15s                                                                                                 
simple-udp-lm95l-vj24q   Ready   35.203.147.246   7383   gke-test-cluster-default-6990460b-dkrr   7m15s                                                                                                 
$ kubectl get pods --namespace agones-system
NAME                                READY   STATUS        RESTARTS   AGE
agones-allocator-7d4fc49475-bsclx   0/1     Pending       0          36s
agones-allocator-7d4fc49475-kgqtp   0/1     Terminating   0          3m54s
agones-allocator-7d4fc49475-mms4p   0/1     Pending       0          35s
agones-allocator-7d4fc49475-qpljk   0/1     Pending       0          36s
agones-controller-75657bc95-8kf42   0/1     Pending       0          36s
agones-ping-6685656c5d-tx54b        0/1     Pending       0          36s
agones-ping-6685656c5d-w7g8k        0/1     Pending       0          35s

Then I resized agones-system back to 1.
After the controller restarted, I see:

kubectl get gs
NAME                     STATE   ADDRESS          PORT   NODE                                     AGE                                                                                                   
simple-udp-lm95l-tk6kb   Ready   35.203.147.246   7290   gke-test-cluster-default-6990460b-dkrr   13m                                                                                                   
simple-udp-lm95l-vj24q   Ready   35.203.147.246   7383   gke-test-cluster-default-6990460b-dkrr   13m                                                                                                   
$ kubectl get gs
NAME                     STATE      ADDRESS   PORT   NODE   AGE
simple-udp-lm95l-qj529   Starting                           1s
simple-udp-lm95l-xvw2x   Starting                           1s
$ kubectl get pods
NAME                     READY   STATUS    RESTARTS   AGE
simple-udp-lm95l-qj529   0/2     Pending   0          7s
simple-udp-lm95l-xvw2x   0/2     Pending   0          7s

Events for one of the gameservers:

kubectl get events | grep simple-udp-lm95l-tk6kb
18m         Normal    PortAllocation                                                                                               gameserver/simple-udp-lm95l-tk6kb                   Port allocated   
18m         Normal    Creating                                                                                                     gameserver/simple-udp-lm95l-tk6kb                   Pod simple-udp-lm95l-tk6kb created
18m         Normal    Scheduled                                                                                                    pod/simple-udp-lm95l-tk6kb                          Successfully assigned default/simple-udp-lm95l-tk6kb to gke-test-cluster-default-6990460b-dkrr
18m         Normal    Scheduled                                                                                                    gameserver/simple-udp-lm95l-tk6kb                   Address and port populated
18m         Normal    Pulled                                                                                                       pod/simple-udp-lm95l-tk6kb                          Container image "gcr.io/agones-images/udp-server:0.17" already present on machine
18m         Normal    Created                                                                                                      pod/simple-udp-lm95l-tk6kb                          Created container simple-udp
18m         Normal    Started                                                                                                      pod/simple-udp-lm95l-tk6kb                          Started container simple-udp
18m         Normal    Pulled                                                                                                       pod/simple-udp-lm95l-tk6kb                          Container image "gcr.io/agones-images/agones-sdk:1.4.0-25339b5" already present on machine
18m         Normal    Created                                                                                                      pod/simple-udp-lm95l-tk6kb                          Created container agones-gameserver-sidecar
18m         Normal    Started                                                                                                      pod/simple-udp-lm95l-tk6kb                          Started container agones-gameserver-sidecar
18m         Normal    RequestReady                                                                                                 gameserver/simple-udp-lm95l-tk6kb                   SDK state change 
18m         Normal    Ready                                                                                                        gameserver/simple-udp-lm95l-tk6kb                   SDK.Ready() complete
12m         Normal    Killing                                                                                                      pod/simple-udp-lm95l-tk6kb                          Stopping container simple-udp
12m         Normal    Killing                                                                                                      pod/simple-udp-lm95l-tk6kb                          Stopping container agones-gameserver-sidecar
5m33s       Warning   Unhealthy                                                                                                    gameserver/simple-udp-lm95l-tk6kb                   Pod is missing   
5m33s       Normal    Shutdown                                                                                                     gameserver/simple-udp-lm95l-tk6kb                   Deletion started 
18m         Normal    SuccessfulCreate                                                                                             gameserverset/simple-udp-lm95l                      Created gameserver: simple-udp-lm95l-tk6kb
5m33s       Normal    SuccessfulDelete                                                                                             gameserverset/simple-udp-lm95l                      Deleted gameserver in state Unhealthy: simple-udp-lm95l-tk6kb

@markmandel I think we can close this ticket now.

@markmandel
Member

Awesome. Closing ticket!

markmandel added this to the 1.4.0 milestone on Feb 19, 2020