Rolling updates should wait for batches to become healthy before iterating #1625

Closed
jonbuckley33 opened this issue Jun 12, 2020 · 5 comments · Fixed by #1626

jonbuckley33 commented Jun 12, 2020

Is your feature request related to a problem? Please describe.
Our game servers take a little while to boot up (a few min), so when we perform a rolling update with the default settings (25% maxUnavailable/maxSurge), the whole fleet is taken down before any of the updated game servers can become "Ready."

Describe the solution you'd like
We would like for Agones to wait for each batch of updated game servers to become Ready before updating another batch of game servers. (One potential extension of this is that Agones could abort an update if the game servers don't become Ready within a certain period.)

Describe alternatives you've considered
Reducing the maxUnavailable/maxSurge percentage may mitigate the issue.
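
For example (the percentages below are purely illustrative, not something we have tuned), shrinking the batch size in the Fleet's strategy slows the rollout down, but as far as I can tell it still does not make Agones wait for the new GameServers to report Ready:

  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 10%
      maxUnavailable: 10%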

@jonbuckley33 jonbuckley33 added the kind/feature New features for Agones label Jun 12, 2020

aLekSer commented Jun 15, 2020

Thanks for posting this issue. There is a Note section in the documentation:
https://agones.dev/site/docs/guides/fleet-updates/#rolling-update-strategy
The issue is indeed related to boot-up time.

When Fleet update contains only changes to the replicas parameter, then new GameServers will be created/deleted straight away, which means in that case maxSurge and maxUnavailable parameters for a RollingUpdate will not be used. The RollingUpdate strategy takes place when you update spec parameters other than replicas.
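
As a rough illustration (the commands are only a sketch, and the fleet name matches the example in my next comment), a replicas-only change skips the RollingUpdate batching, while a template change goes through it:

# replicas-only change: GameServers are created/deleted straight away,
# maxSurge/maxUnavailable are not used
kubectl scale fleet simple-udp --replicas=30

# template change (anything under spec.template): handled by the RollingUpdate strategy
kubectl edit fleet simple-udp   # e.g. bump the container image tag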

Do you use FleetAutoScaler with a Fleet?

I was able to reproduce your issue.

@aLekSer aLekSer self-assigned this Jun 15, 2020
@aLekSer aLekSer added kind/bug These are bugs. and removed kind/feature New features for Agones labels Jun 15, 2020

aLekSer commented Jun 15, 2020

Steps to reproduce are below. @jonbuckley33 please add more comments if your steps were a bit different.

  1. kubectl apply -f the fleet:
fleet.yaml
apiVersion: "agones.dev/v1"
kind: Fleet
metadata:
  name: simple-udp
spec:
  replicas: 20
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 25%
      maxUnavailable: 25%
  template:
    spec:
      ports:
      - name: default
        containerPort: 7654
      template:
        spec:
          containers:
          - name: simple-udp
            image: gcr.io/agones-images/udp-server:0.21
            resources:
              requests:
                memory: "64Mi"
                cpu: "20m"
              limits:
                memory: "64Mi"
                cpu: "20m"
            args:
              - "-ready=true"
  2. Wait for all 20 GameServers to become Ready.

  3. Edit fleet.yaml and change the container args to:
    args:
    - "-ready=false"
    With this flag, GameServers only become Ready after a client connects and marks them ready.

  4. Apply the fleet with -ready=false.

  5. See that the new GameServers, created in batches of 25%, all end up in the Scheduled state (commands to observe this are sketched below).
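
For reference, here is the relevant part of the edit in step 3 and a rough way to watch the result (a sketch only; exact output will differ):

# fleet.yaml: only the container args change
            args:
              - "-ready=false"

# watch GameServer states during the rollout; the new ones stay Scheduled
kubectl get gameservers -w

# the Fleet's GameServerSets (old and new) during the update
kubectl get gss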


aLekSer commented Jun 15, 2020

A Kubernetes Deployment waits for new Pods to become ready before the next wave of scaling down old Pods: https://kubernetes.io/docs/concepts/workloads/controllers/deployment/#strategy

For example, when this value [max unavailable] is set to 30%, the old ReplicaSet can be scaled down to 70% of desired Pods immediately when the rolling update starts. Once new Pods are ready, old ReplicaSet can be scaled down further,
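
For comparison, a minimal plain-Kubernetes sketch of the strategy block that quote describes (names, image and numbers are illustrative only); the point is that readiness of the new Pods gates further scale-down of the old ReplicaSet, which is the behaviour Fleets should mirror:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: example-server              # illustrative name
spec:
  replicas: 20
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 30%
      maxUnavailable: 30%           # old ReplicaSet drops to 70% immediately,
                                    # then further only as new Pods become Ready
  selector:
    matchLabels:
      app: example-server
  template:
    metadata:
      labels:
        app: example-server
    spec:
      containers:
      - name: example-server
        image: example/image:1.0    # placeholder image
        readinessProbe:             # readiness gates the next wave of scale-down
          tcpSocket:
            port: 7654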

jonbuckley33 (Author)

Thanks for posting concise repro steps -- that's exactly the behavior I am seeing.

Do you use FleetAutoScaler with a Fleet?

No, I do not.

Excited to see that you have a PR out that may fix this. I appreciate the quick turnaround! :)


aLekSer commented Sep 13, 2020

I can add steps 6 and 7 to verify:
6. Edit the fleet and change the containerPort.
   k get gss would now return 3 GameServerSets.
7. Fix the containerPort and set -ready=true again.
   Now k get gss should return only one GameServerSet.
On master, after all these steps, we actually end up with one GameServerSet. Fixing my PR to address this new regression.
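
Roughly, the check looks like this (the counts are what I saw on my cluster and the field path is from the fleet above; yours may differ while the rollouts overlap):

# step 6: change the GameServer containerPort, which starts another rollout
kubectl edit fleet simple-udp    # e.g. spec.template.spec.ports[0].containerPort

kubectl get gss                  # 3 GameServerSets while the rollouts overlap

# step 7: restore the containerPort and set the arg back to -ready=true
kubectl edit fleet simple-udp

kubectl get gss                  # should settle back to a single GameServerSet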

@aLekSer aLekSer pinned this issue Sep 18, 2020
@markmandel markmandel added this to the 1.9.0 milestone Sep 22, 2020
@markmandel markmandel unpinned this issue Sep 22, 2020