Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GameServers in allocated state can respond to fleet updating #2682

Closed
keisni opened this issue Jul 26, 2022 · 13 comments · Fixed by #3131
Closed

GameServers in allocated state can respond to fleet updating #2682

keisni opened this issue Jul 26, 2022 · 13 comments · Fixed by #3131
Assignees
Labels
kind/design Proposal discussing new features / fixes and how they should be implemented kind/feature New features for Agones

Comments

@keisni
Copy link

keisni commented Jul 26, 2022

Is your feature request related to a problem? Please describe.
a question about fleet update. it seems that the controller leaves GameServers in allocated state alone in the process of fleet updating, whether rolling or recreating. but some of our gameservers have long ttls (they are not specific to battles, but players can enter and leave at any time). so i`m wondering how to tell these gameservers to start the shutting down procedure (start by telling players to leave game or just kicking them out).

Describe the solution you'd like
We could set an annotation / label (I think an annotation would be better?) on an Allocated GameServer that is left behind after a Fleet update - then the WatchGameServer SDK could pick it up?

Describe alternatives you've considered
N/A

Additional context
N/A

@keisni keisni added the kind/feature New features for Agones label Jul 26, 2022
@markmandel
Copy link
Member

markmandel commented Aug 2, 2022

Super interesting problem! We were discussing this a bit internally and came up with some other user journeys. Would love feedback on the below design.


Objective

In a usage of Agones in which Allocated GameServers are utilised for a long time, such as a Lobby GameServer, or potentially a session based GameServer that is being reused multiple times in a row, it would be useful if that GameServer process could be notified in some way if and when a Fleet has been updated such that an Allocated GameServer, may want to actively perform a graceful shutdown and release its resources such that it can be replaced by a new version.

Requirements

There are two scenarios of which to consider in which we might want to notify a GameServer that a Fleet has been updated that will need to have a solution:

Fleet Update (Rolling/Recreate)

This is a standard Fleet Rolling Update. When a rolling update starts, there should be an ability to notify any Allocated GameServers within that Fleet that a rolling update has begun.

Out of scope: Notifying GameServers when a rolling update has finished.

Canary Deploy Through Multiple Fleets

If the user is updating a Fleet through a “Two (or more) Fleets Strategy” to canary a new version there is no rolling update, but instead the current Fleet is manually scaled down, while the replacement Fleet is manually scaled up.

In this scenario, the only indication that a Fleet is being replaced, is that the Fleet replica count could drop below the number of Allocated GameServers. In this scenario, the correct number to notify would equal:

Total Allocated GameServers - Fleet Replica Count

As an example - if a Fleet has 3 Allocated GameServers, and the Fleet's replica count is modified to be 1, two of the Allocated GameServers should be notified that they may wish to gracefully shut down.

It is worth noting that the GameServers notified will need to follow the scheduling setting of the Fleet to maintain the appropriate optimisation of resources.

Other Notes

This solution must also be backward compatible.

Ideally, if not utilised, this solution should not put any extra load in the Kubernetes control plane.

Ideally, the solution should be extensible, if other notification strategies are required over time.

Background

Design ideas

The general idea is that on either a Fleet update that update the GameServer template, or a Fleet update that causes the replica count to drop below the number of currently Allocated GameServers - the Agones system should apply a user-provided set of labels and/or annotations to the Allocated GameServers.

This will serve two purposes:

  1. The GameServer SDK WatchGameServer command could be utilised to pickup this annotation and/or label change to indicate the Fleet system change, and the gameserver binary could react accordingly.
  2. This could also be used to proactively update GameServer labels, to effect change in allocation strategy - such as removing the GameServers from being allocated while they perform their graceful shutdown.

Example yaml configuration:

apiVersion: "agones.dev/v1"
kind: Fleet
metadata:
  name: simple-game-server
spec:
  replicas: 2
  allocationOverflow: # applied to the GameServerSet's number Allocated GameServers that are over the desired replicas
    labels:
      mykey: myvalue
      version: "" # empty an existing label value
    annotations:
      otherkey: setthisvalue
  template:
    spec:
      ports:
        - name: default
          containerPort: 7654
      template:
        spec:
          containers:
            - name: simple-game-server
              image: gcr.io/agones-images/simple-game-server:0.13

This will work the same across Fleet resizing and Rolling Updates, in that the implementation will respond to the underlying GameServerSet being shrunk in either scenario to a value smaller than the number of Allocated GameServers.

The benefit of the above configuration is that if it is not applied, then no extra overhead exists for a Fleet. The user has to actively opt-in, and also provide what labels and/or annotations make sense for their Agones implementation.

This should also be developed behind a feature flag, I would suggest: FleetAllocatenOverflow.

Outstanding Questions:

  • Any objections to allocationOverflow as a name? Is it relatively intuitive to what it does?

Alternatives considered

Rather than (or in addition to) updating labels, we could instead send a SIGTERM signal, much like is done on Pod shutdown. This would require the end user to actively utilise terminationGracePeriodSeconds to specify how to keep the Allocated GameServer running before being shut down by Kubernetes.

This has several advantages:

  1. If the end user is already doing this to account for node shutdown, there is no repetition of work.
  2. This would automatically remove the GameServer from the allocation pool, since it skips any GameServer that has a DeletionTimeStamp value.

But there are the following disadvantages:

  1. All GameServer binaries would be forced to respond to the SIGTERM signal as well as implement a terminationGracePeriodSeconds configuration, otherwise they would always be shut down. With labels and annotations, they can be safely ignored.
  2. With SIGTERM there is no indication of why the GameServer is being asked to gracefully shutdown (it could be a node shutdown!) - whereas with a label/annotation, the end user can pass along whatever information they wish.

That all being said, if we get user feedback to have an option of deleting Allocated GameServers on Fleet update (to create a SIGTERM signal), we could add it at a later date with the above design.

@markmandel markmandel added the kind/design Proposal discussing new features / fixes and how they should be implemented label Aug 2, 2022
@JJhuk
Copy link
Contributor

JJhuk commented Aug 3, 2022

I had similar concerns. https://agones.slack.com/archives/C9DGM5DS8/p1647513867843959

Now, inevitably, when the match server makes an assignment request to agones, by entering the version in the selector,
Ask 'Please assign the latest version as priority', then
In the multi server, I wrote a code that watches the player.count and shuts down if there is no user for a while (30 minutes).

The first thing I thought about before was how to use SIGTERM suggested by @markmandel. But, as mark said, using SIGTERM means you don't know "what" reason SIGTERM was called for, and the game server must support it. (SIGTERM comes and it is possible to check the reason by checking the annotation, but it looks unnatural.)

The 'onUpdate' method, on the other hand, seems to have a more natural flow.

@markmandel
Copy link
Member

Community feedback from my livestream when I wrote the above design:

Brittan J Thomas
4 days ago
RE: "onUpdate.replica" does "replicaOverflow" make more sense?

replacing the key replica even with just overflow - I think that would work well. How do people feel?

@roberthbailey
Copy link
Member

@zmerlynn - This might interest you, since it's similar to disruption cases you've been exploring.

@markmandel
Copy link
Member

Updated the design a little bit.

Was looking at this again, and realised that I don't think we need two different configs for Fleet scaling and Fleet updates - they are essentially the same thing from a GameServer point of view -- we have too many Allocated GameServers.

One thing I did note though, is that when doing a rolling update, I saw the following on two GameServerSets

kubectl get gsset
NAME                       SCHEDULING   DESIRED   CURRENT   ALLOCATED   READY   AGE
simple-game-server-hzh2h   Packed       3         3         0           3       49s
simple-game-server-x4m5s   Packed       1         2         2           0       2m51s

(simple-game-server-x4m5s is the older GameServerSet)

We can see that it's got a desired state of 1 , when I'm wondering if it should be 0? Might be tied to the 25% default burst / rolling update values -- so I'm going to ignore it for the initial pass of this, and see if it becomes an issue.

@markmandel
Copy link
Member

We can see that it's got a desired state of 1 , when I'm wondering if it should be 0? Might be tied to the 25% default burst / rolling update values -- so I'm going to ignore it for the initial pass of this, and see if it becomes an issue.

Right as I wrote this, i scaled the Fleet up to 10 items, and confirmed that it is, since we end up at:

$ kubectl get gsset
NAME                       SCHEDULING   DESIRED   CURRENT   ALLOCATED   READY   AGE
simple-game-server-hzh2h   Packed       8         8         0           8       18m
simple-game-server-x4m5s   Packed       0         2         2           0       20m

TL;DR - no need to edit any of the GameServerSet scaling code here.

@markmandel markmandel self-assigned this Dec 27, 2022
@markmandel
Copy link
Member

Getting close on this... not sure if it'll make RC next week, but definitely the release after I reckon.

markmandel added a commit to markmandel/agones that referenced this issue Feb 13, 2023
markmandel added a commit to markmandel/agones that referenced this issue Feb 13, 2023
roberthbailey pushed a commit that referenced this issue Feb 15, 2023
markmandel added a commit to markmandel/agones that referenced this issue Feb 16, 2023
Implementation of yaml and backing Go code for the CRDs for supporting
Fleet Allocation Overflow tracking.

Work on #googleforgames#2682
markmandel added a commit to markmandel/agones that referenced this issue Feb 16, 2023
Implementation of yaml and backing Go code for the CRDs for supporting
Fleet Allocation Overflow tracking.

Work on googleforgames#2682
markmandel added a commit to markmandel/agones that referenced this issue Feb 22, 2023
Implementation of yaml and backing Go code for the CRDs for supporting
Fleet Allocation Overflow tracking.

Work on googleforgames#2682
@markmandel
Copy link
Member

Release is next Tuesday. I have the functionality pretty much done, but still splitting PRs. This might slip into the next release... but we'll see how will go. I give it 60% that it will land in time.

markmandel added a commit that referenced this issue Feb 24, 2023
Implementation of yaml and backing Go code for the CRDs for supporting
Fleet Allocation Overflow tracking.

Work on #2682
@markmandel
Copy link
Member

markmandel commented Feb 27, 2023

Given that 1.30.0 is Tomorrow, this will slip to 1.31.0.

@gongmax
Copy link
Collaborator

gongmax commented Apr 13, 2023

Just a small question, should we also apply this to the Reserved GS of the old GSS? Similar as the Allocated GS, I think it might also make sense to notify the Reserved GS that the Fleet is being updated so customer can take action accordingly eg. might release its resources such that it can be replaced by a new version.

@markmandel
Copy link
Member

Given that 1.30.0 is Tomorrow, this will slip to 1.31.0

...and it slipped again, because GDC.

@markmandel
Copy link
Member

Just a small question, should we also apply this to the Reserved GS of the old GSS? Similar as the Allocated GS, I think it might also make sense to notify the Reserved GS that the Fleet is being updated so customer can take action accordingly eg. might release its resources such that it can be replaced by a new version.

That's a good question. I'm leaning towards "no" , or at least "not yet" (and wait and see what feedback is)? Just because since reservation tends to be on a per-gameserver basis, each game server doesn't know if it should let go of a reservation or not if there's a new version - as if they all do, there are no gameservers for anyone to go to.

I think it's probably better to allow reservation -> Ready / Allocated flow to continue through, and then let the process manage things from there -- less opportunity for disruption.

But definitely would like to be open to perspectives from people actually using Reserved.

@gongmax
Copy link
Collaborator

gongmax commented Apr 13, 2023

Makes sense to me. Also, thinking about the potential use cases, I feel that Reserved is a more transient state than Allocated so allow reservation -> Ready / Allocated flow to continue sounds reasonable, although by definition a GS can be reserved forever.

markmandel added a commit to markmandel/agones that referenced this issue Apr 14, 2023
Functional implementation and testing of applying labels and/or
annotations to any `Allocated` `GameServers` that are overflowed past
the configured replica count.

Next ➡️ write some guides to close out the ticket.

Work on googleforgames#2682
markmandel added a commit to markmandel/agones that referenced this issue Apr 14, 2023
Functional implementation and testing of applying labels and/or
annotations to any `Allocated` `GameServers` that are overflowed past
the configured replica count.

Next ➡️ write some guides to close out the ticket.

Work on googleforgames#2682
markmandel added a commit to markmandel/agones that referenced this issue Apr 14, 2023
Functional implementation and testing of applying labels and/or
annotations to any `Allocated` `GameServers` that are overflowed past
the configured replica count.

Next ➡️ write some guides to close out the ticket.

Work on googleforgames#2682
markmandel added a commit to markmandel/agones that referenced this issue Apr 14, 2023
Functional implementation and testing of applying labels and/or
annotations to any `Allocated` `GameServers` that are overflowed past
the configured replica count.

Next ➡️ write some guides to close out the ticket.

Work on googleforgames#2682
markmandel added a commit to markmandel/agones that referenced this issue Apr 22, 2023
Functional implementation and testing of applying labels and/or
annotations to any `Allocated` `GameServers` that are overflowed past
the configured replica count.

Next ➡️ write some guides to close out the ticket.

Work on googleforgames#2682
markmandel added a commit that referenced this issue Apr 22, 2023
* Allocated GameServers updated on Fleet update

Functional implementation and testing of applying labels and/or
annotations to any `Allocated` `GameServers` that are overflowed past
the configured replica count.

Next ➡️ write some guides to close out the ticket.

Work on #2682

* Review updates.
markmandel added a commit to markmandel/agones that referenced this issue Apr 28, 2023
Added documentation to the "Fleet Update" section on how notifications
can occur for Fleet updates and related operations.

Closes googleforgames#2682
markmandel added a commit that referenced this issue Apr 29, 2023
Added documentation to the "Fleet Update" section on how notifications
can occur for Fleet updates and related operations.

Closes #2682

Co-authored-by: Mengye (Max) Gong <8364575+gongmax@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
kind/design Proposal discussing new features / fixes and how they should be implemented kind/feature New features for Agones
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants