Unable to scale fleet after an update when there are allocated game servers from both old and new version #3287

Closed
runtarm opened this issue Jul 26, 2023 · 3 comments · Fixed by #3292
Labels: kind/bug (These are bugs.)

@runtarm

runtarm commented Jul 26, 2023

What happened: The fleet gets stuck; no new game servers are created.

What you expected to happen: The fleet scales up successfully.

How to reproduce it (as minimally and precisely as possible):

  1. Create a fleet with 10 replicas.
  2. Allocate a game server.
  3. Update the fleet (e.g. change port).
  4. Allocate 3 game servers.
  5. Scale the fleet down to 2 replicas.
  6. Then scale the fleet back to 10 replicas.

The fleet will get stuck in this state:

$ kubectl get fleet
NAME                 SCHEDULING   DESIRED   CURRENT   ALLOCATED   READY   AGE
fleet-foo            Packed       10        4         4           0       15m

The GameServerSets (gss) will look like this:

$ kubectl get gss --sort-by=.metadata.creationTimestamp
NAME                       SCHEDULING   DESIRED   CURRENT   ALLOCATED   READY   AGE
fleet-foo-lkh2r            Packed       0         1         1           0       17m
fleet-foo-rxz92            Packed       1         3         3           0       15m (active)

Notice the state of the active gss: its .Spec.Replicas (DESIRED) has dropped below both .Status.Replicas (CURRENT) and .Status.AllocatedReplicas (ALLOCATED), and it stays that way forever.

The fact that .Spec.Replicas < .Status.Replicas causes the fleet rolling update logic to always skip the active gss (see pkg/fleets/controller.go#L459), which also means .Spec.Replicas never changes.

The .Status.Replicas won't change either, since it cannot drop below .Status.AllocatedReplicas. Therefore, the active gss cannot get out of the stuck state on its own unless the allocated game servers quit.

The code that causes .Spec.Replicas to become lower than .Status.AllocatedReplicas is probably this particular line: pkg/fleets/controller.go#L471.
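
To make the deadlock concrete, here is a small, self-contained Go sketch that plugs in the numbers from the active gss above. It is illustrative only: the type and conditions are simplified paraphrases of the controller lines referenced above, not the actual Agones source.

```go
package main

import "fmt"

// gsSet mirrors just the three counters discussed above; the real
// GameServerSet type in Agones is much larger.
type gsSet struct {
	SpecReplicas      int32 // .Spec.Replicas (DESIRED)
	StatusReplicas    int32 // .Status.Replicas (CURRENT)
	AllocatedReplicas int32 // .Status.AllocatedReplicas (ALLOCATED)
}

func main() {
	// The active gss from the `kubectl get gss` output above.
	active := gsSet{SpecReplicas: 1, StatusReplicas: 3, AllocatedReplicas: 3}

	// A guard in the spirit of pkg/fleets/controller.go#L459: while the
	// active set still has more replicas than desired, the rolling update
	// waits, so Spec.Replicas is never raised.
	if active.SpecReplicas < active.StatusReplicas {
		fmt.Println("rolling update skips the active gss; Spec.Replicas stays at", active.SpecReplicas)
	}

	// But Status.Replicas cannot drop below AllocatedReplicas, because a
	// scale-down never deletes allocated game servers. With ALLOCATED=3,
	// CURRENT is pinned at 3, so the condition above holds forever.
	if active.StatusReplicas <= active.AllocatedReplicas {
		fmt.Println("Status.Replicas is floored at", active.AllocatedReplicas, "- deadlock until allocated servers exit")
	}
}
```

Both branches fire at once, which is exactly the permanent stall shown in the kubectl output.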

Anything else we need to know?:

Possibly related to #2617 and #2574; the fixes there don't solve this corner case.

Environment:

  • Agones version: 1.30
  • Kubernetes version (use kubectl version): 1.23
  • Cloud provider or hardware configuration: GKE
  • Install method (yaml/helm): yaml
  • Troubleshooting guide log(s):
  • Others:
@runtarm runtarm added the kind/bug These are bugs. label Jul 26, 2023
@markmandel
Member

Ooh tricky! Thanks for the detailed replication steps!


@markmandel markmandel self-assigned this Jul 26, 2023
@markmandel
Member

Got a fix that seems to be working; now I just need to tidy it up and make sure I didn't break anything else:
https://github.com/markmandel/agones/blob/bug/fleet-scaling-alloc/pkg/fleets/controller.go#L388-L396

markmandel added a commit to markmandel/agones that referenced this issue Jul 28, 2023
Fixes bug wherein if a set of Allocations occurred across two or
more GameServerSets that had yet to be deleted for a RollingUpdate
(because of Allocated GameServers), and a scale down operation moved
the Fleet replica count to below the current number of Allocated
GameServers -- scaling back up would not move above the current number
of Allocated GameServers.

Or to put it another way, the current Fleet update logic didn't consider
old GameServerSets with Allocated GameServers but a 0 value for
`Spec.Replicas` as a complete rollout when scaling back up, so the logic
went back into rolling update logic, and it all went sideways.

This short circuits that scenario up front.

Close googleforgames#3287
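
For readers following along, here is a minimal Go sketch of the kind of up-front short circuit the commit message describes. It is an assumption-laden paraphrase with illustrative names, not the merged Agones code; see the linked branch and #3292 for the real fix.

```go
package main

import "fmt"

// gsSetCounts is a simplified stand-in for the counters on a GameServerSet.
type gsSetCounts struct {
	SpecReplicas      int32
	StatusReplicas    int32
	AllocatedReplicas int32
}

// rolloutComplete reports whether every old ("rest") GameServerSet has been
// drained down to nothing but its Allocated game servers: pinned at
// Spec.Replicas == 0, with only allocated servers remaining.
func rolloutComplete(rest []gsSetCounts) bool {
	for _, s := range rest {
		if s.SpecReplicas != 0 || s.StatusReplicas > s.AllocatedReplicas {
			return false
		}
	}
	return true
}

func main() {
	// The old gss from the issue: 0 desired, 1 current, 1 allocated.
	rest := []gsSetCounts{{SpecReplicas: 0, StatusReplicas: 1, AllocatedReplicas: 1}}
	if rolloutComplete(rest) {
		fmt.Println("treat the rollout as complete: scale the active gss directly, skipping the rolling-update path")
	}
}
```

With a check like this done up front, scaling back up adjusts the active gss directly instead of re-entering the rolling-update logic that was skipping it.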
markmandel added a commit to markmandel/agones that referenced this issue Jul 28, 2023
markmandel added a commit that referenced this issue Aug 2, 2023
Co-authored-by: Mengye (Max) Gong <8364575+gongmax@users.noreply.github.com>