Unable to scale fleet after an update when there are allocated game servers from both old and new version #3287

Closed
runtarm opened this issue Jul 26, 2023 · 3 comments · Fixed by #3292
Labels: kind/bug (These are bugs.)

@runtarm

runtarm commented Jul 26, 2023

What happened: The fleet gets stuck; no new game servers are created.

What you expected to happen: The fleet scales up successfully.

How to reproduce it (as minimally and precisely as possible):

  1. Create a fleet with 10 replicas.
  2. Allocate a game server.
  3. Update the fleet (e.g. change port).
  4. Allocate 3 game servers.
  5. Scale the fleet down to 2 replicas.
  6. Then scale the fleet back to 10 replicas.

The fleet will get stuck in this state:

$ kubectl get fleet
NAME                 SCHEDULING   DESIRED   CURRENT   ALLOCATED   READY   AGE
fleet-foo            Packed       10        4         4           0       15m

The GameServerSets (gss) will look like this:

$ kubectl get gss --sort-by=.metadata.creationTimestamp
NAME                       SCHEDULING   DESIRED   CURRENT   ALLOCATED   READY   AGE
fleet-foo-lkh2r            Packed       0         1         1           0       17m
fleet-foo-rxz92            Packed       1         3         3           0       15m (active)

Notice the state of the active gss: its .Spec.Replicas (DESIRED) has dropped below both .Status.Replicas (CURRENT) and .Status.AllocatedReplicas (ALLOCATED), and it stays that way forever.

The fact that .Spec.Replicas < .Status.Replicas causes the fleet rolling update logic to always skip the active gss (see pkg/fleets/controller.go#L459), which also means .Spec.Replicas never changes.

The .Status.Replicas won't change either, since it cannot drop below .Status.AllocatedReplicas. Therefore, the active gss cannot get out of the stuck state on its own unless the allocated game servers quit.

The code that causes .Spec.Replicas to become lower than .Status.AllocatedReplicas is probably this particular line: pkg/fleets/controller.go#L471.
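
To make the deadlock concrete, here is a small, self-contained Go sketch that plugs in the numbers from the active gss above. It is illustrative only: the type and conditions are simplified paraphrases of the controller lines referenced above, not the actual Agones source.

```go
package main

import "fmt"

// gsSet mirrors just the three counters discussed above; the real
// GameServerSet type in Agones is much larger.
type gsSet struct {
	SpecReplicas      int32 // .Spec.Replicas (DESIRED)
	StatusReplicas    int32 // .Status.Replicas (CURRENT)
	AllocatedReplicas int32 // .Status.AllocatedReplicas (ALLOCATED)
}

func main() {
	// The active gss from the `kubectl get gss` output above.
	active := gsSet{SpecReplicas: 1, StatusReplicas: 3, AllocatedReplicas: 3}

	// A guard in the spirit of pkg/fleets/controller.go#L459: while the
	// active set still has more replicas than desired, the rolling update
	// waits, so Spec.Replicas is never raised.
	if active.SpecReplicas < active.StatusReplicas {
		fmt.Println("rolling update skips the active gss; Spec.Replicas stays at", active.SpecReplicas)
	}

	// But Status.Replicas cannot drop below AllocatedReplicas, because a
	// scale-down never deletes allocated game servers. With ALLOCATED=3,
	// CURRENT is pinned at 3, so the condition above holds forever.
	if active.StatusReplicas <= active.AllocatedReplicas {
		fmt.Println("Status.Replicas is floored at", active.AllocatedReplicas, "- deadlock until allocated servers exit")
	}
}
```

Both branches fire at once, which is exactly the permanent stall shown in the kubectl output.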

Anything else we need to know?:

Possibly related to #2617 and #2574; the fixes there don't solve this corner case.

Environment:

  • Agones version: 1.30
  • Kubernetes version (use kubectl version): 1.23
  • Cloud provider or hardware configuration: GKE
  • Install method (yaml/helm): yaml
  • Troubleshooting guide log(s):
  • Others:
@runtarm runtarm added the kind/bug These are bugs. label Jul 26, 2023
@markmandel
Member

Ooh tricky! Thanks for the detailed replication steps!


@markmandel markmandel self-assigned this Jul 26, 2023
@markmandel
Member

Got a fix that seems to be working; now I just need to tidy it up and make sure I didn't break anything else:
https://github.com/markmandel/agones/blob/bug/fleet-scaling-alloc/pkg/fleets/controller.go#L388-L396

markmandel added a commit to markmandel/agones that referenced this issue Jul 28, 2023
Fixes bug wherein if a set of Allocations occurred across two or
more GameServerSets that had yet to be deleted for a RollingUpdate
(because of Allocated GameServers), and a scale down operation moved
the Fleet replica count to below the current number of Allocated
GameServers -- scaling back up would not move above the current number
of Allocated GameServers.

Or to put it another way, the current Fleet update logic didn't consider
old GameServerSets with Allocated GameServers but a 0 value for
`Spec.Replicas` as a complete rollout when scaling back up, so the logic
went back into rolling update logic, and it all went sideways.

This short circuits that scenario up front.

Close googleforgames#3287
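
For readers following along, here is a minimal Go sketch of the kind of up-front short circuit the commit message describes. It is an assumption-laden paraphrase with illustrative names, not the merged Agones code; see the linked branch and #3292 for the real fix.

```go
package main

import "fmt"

// gsSetCounts is a simplified stand-in for the counters on a GameServerSet.
type gsSetCounts struct {
	SpecReplicas      int32
	StatusReplicas    int32
	AllocatedReplicas int32
}

// rolloutComplete reports whether every old ("rest") GameServerSet has been
// drained down to nothing but its Allocated game servers: pinned at
// Spec.Replicas == 0, with only allocated servers remaining.
func rolloutComplete(rest []gsSetCounts) bool {
	for _, s := range rest {
		if s.SpecReplicas != 0 || s.StatusReplicas > s.AllocatedReplicas {
			return false
		}
	}
	return true
}

func main() {
	// The old gss from the issue: 0 desired, 1 current, 1 allocated.
	rest := []gsSetCounts{{SpecReplicas: 0, StatusReplicas: 1, AllocatedReplicas: 1}}
	if rolloutComplete(rest) {
		fmt.Println("treat the rollout as complete: scale the active gss directly, skipping the rolling-update path")
	}
}
```

With a check like this done up front, scaling back up adjusts the active gss directly instead of re-entering the rolling-update logic that was skipping it.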
markmandel added a commit to markmandel/agones that referenced this issue Jul 28, 2023
markmandel added a commit that referenced this issue Aug 2, 2023
Co-authored-by: Mengye (Max) Gong <8364575+gongmax@users.noreply.github.com>