Fix Fleets RollingUpdate #1626

Merged

markmandel merged 15 commits into googleforgames:master from aLekSer:fix/rolling-update

Sep 22, 2020

Collaborator

aLekSer commented Jun 15, 2020

Make sure that GameServers are actually Ready before scaling down
the inactive GameServerSet.

What type of PR is this?

Uncomment only one /kind <> line, hit enter to put that in a new line, and remove leading whitespace from that line:

/kind breaking

/kind bug

/kind cleanup
/kind documentation
/kind feature
/kind hotfix

What this PR does / Why we need it:
If creating new GameServers takes more than 30 seconds, the old GameServers can all be scaled down to 0 while every new GameServer is still in a Scheduled state.

Which issue(s) this PR fixes:

Closes #1625

Special notes for your reviewer:

Steps to reproduce are included in the ticket. I will create a simple E2E test to make sure this functionality is covered.

google-oss-robot added the do-not-merge/work-in-progress label

google-oss-robot requested review from cyriltovena and pooneh-m

June 15, 2020 15:32

googlebot added the cla: yes label

google-oss-robot added the size/M label

aLekSer force-pushed the fix/rolling-update branch from f698d3a to 80e3466 Compare

June 15, 2020 16:42

google-oss-robot added size/XS and removed size/M labels

Collaborator Author

aLekSer commented Jun 15, 2020

Flaky CSharp SDK conformance test:

For more information on configuring HTTPS see https://go.microsoft.com/fwlink/?linkid=848054.
/usr/share/dotnet/sdk/2.2.402/NuGet.targets(123,5): error : The file '/go/src/agones.dev/agones/sdks/csharp/sdk/obj/csharp-sdk.csproj.nuget.g.props' already exists. [/go/src/agones.dev/agones/sdks/csharp/test/csharp-sdk-test.csproj]
includes/sdk.mk:88: recipe for target 'run-sdk-command' failed
make[1]: *** [run-sdk-command] Error 1
includes/sdk.mk:84: recipe for target 'run-sdk-command-csharp' failed

aLekSer force-pushed the fix/rolling-update branch from 80e3466 to 1966b0e Compare

June 15, 2020 16:58

Collaborator Author

aLekSer commented Jun 15, 2020

With the steps from the original bug, the proposed solution stops shutting down GameServers a bit later than the right point (50% left), but the approach is right:

k get fleets
NAME         SCHEDULING   DESIRED   CURRENT   ALLOCATED   READY   AGE
simple-udp   Packed       20        25        0           10      62s

We should implement a RollingUpdate strategy similar to the one Deployment has: https://kubernetes.io/docs/concepts/workloads/controllers/deployment/#strategy
So looking at the ReadyReplicas count is crucial.
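
A minimal sketch of the readiness-based idea discussed above, assuming the same rule a Deployment rolling update follows: old replicas may only be removed while the Ready count stays above `desired - maxUnavailable`. The function name, parameters, and numbers are hypothetical and chosen for illustration; this is not the code from the PR.

```go
package main

import "fmt"

// maxScaleDown returns how many GameServers may be removed from the old
// GameServerSets right now without dropping below the availability floor.
// Hypothetical helper for illustration, not the Agones implementation.
func maxScaleDown(desired, maxUnavailable, readyAcrossAllSets int32) int32 {
	// The Fleet should always keep at least this many Ready GameServers.
	minAvailable := desired - maxUnavailable
	if minAvailable < 0 {
		minAvailable = 0
	}
	budget := readyAcrossAllSets - minAvailable
	if budget < 0 {
		return 0
	}
	return budget
}

func main() {
	// New GameServers are still Scheduled: only the 20 old ones are Ready.
	fmt.Println(maxScaleDown(20, 5, 20))
	// Five old ones were removed and nothing new is Ready yet: budget is spent.
	fmt.Println(maxScaleDown(20, 5, 15))
	// Three new GameServers became Ready, so three more old ones may go.
	fmt.Println(maxScaleDown(20, 5, 18))
}
```

Because the budget is derived from Ready GameServers rather than from elapsed time, a slow start of the new GameServers pauses the scale down instead of draining the Fleet.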

Member

markmandel commented Jun 15, 2020

Since this is a change in strategy, for such an important piece of infrastructure, should we put this behind a feature flag that we move from alpha->beta->stable?

Collaborator Author

aLekSer commented Jun 15, 2020

Yes, a feature flag would be a must for such a change. By the way, here is an additional point to think about:
https://github.com/kubernetes/kubernetes/blob/323f34858de18b862d43c40b2cced65ad8e24052/pkg/controller/deployment/rolling.go#L192

Collaborator Author

aLekSer commented Jun 25, 2020

The original Kubernetes code does a similar thing: it loops through all available ReplicaSets and calculates totalAvailableReplicas += rs.Status.AvailableReplicas:
https://github.com/kubernetes/kubernetes/blob/e529bd0bcad66fd9afe4e7ad248acbc13563aaa0/pkg/controller/deployment/util/deployment_util.go#L723:1
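
The same accumulation can be sketched for GameServerSets. The `gameServerSet` type and `sumReadyReplicas` function below are hypothetical stand-ins that model only the one field needed here; they are not the Agones types.

```go
package main

import "fmt"

// gameServerSet is a hypothetical, trimmed-down stand-in for a GameServerSet.
type gameServerSet struct {
	name          string
	readyReplicas int32
}

// sumReadyReplicas totals the Ready GameServers across every set, old and
// new, skipping nil entries.
func sumReadyReplicas(sets []*gameServerSet) int32 {
	var total int32
	for _, s := range sets {
		if s != nil {
			total += s.readyReplicas
		}
	}
	return total
}

func main() {
	sets := []*gameServerSet{
		{name: "old", readyReplicas: 10},
		{name: "new", readyReplicas: 0},
		nil,
	}
	fmt.Println(sumReadyReplicas(sets))
}
```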

Collaborator Author

aLekSer commented Jun 25, 2020

We need to check whether a kubectl rolling-update fleet command should be added for Fleets, and whether Unhealthy GameServers can lead to issues similar to kubernetes/kubernetes#16737

google-oss-robot added size/M and removed size/XS labels

Collaborator

agones-bot commented Jun 25, 2020

Build Failed 😱

Build Id: 82c5bd51-0dd0-44e3-ae30-b133937471d3

To get permission to view the Cloud Build view, join the agones-discuss Google Group.

aLekSer force-pushed the fix/rolling-update branch from eaa4a19 to 7bad67d Compare

June 26, 2020 16:10

google-oss-robot added size/L and removed size/M labels

Collaborator

agones-bot commented Jun 26, 2020

Build Failed 😱

Build Id: 23319269-6f17-4b90-90a8-8504efdb2fde

To get permission to view the Cloud Build view, join the agones-discuss Google Group.

aLekSer force-pushed the fix/rolling-update branch from 96ec224 to 1737b60 Compare

June 26, 2020 16:48

Collaborator

agones-bot commented Jun 26, 2020

Build Failed 😱

Build Id: 2ab866da-433a-465c-8e33-54e7c2863229

To get permission to view the Cloud Build view, join the agones-discuss Google Group.

aLekSer force-pushed the fix/rolling-update branch 2 times, most recently from 877958a to 445c52a Compare

June 26, 2020 16:59

Collaborator

agones-bot commented Jun 26, 2020

Build Failed 😱

Build Id: 789c54e0-6469-4c0b-a56e-6c4f914c431d

To get permission to view the Cloud Build view, join the agones-discuss Google Group.

Collaborator

agones-bot commented Jun 26, 2020

Build Failed 😱

Build Id: 5512e1dc-4fb4-4a3e-92a7-6328389ff965

To get permission to view the Cloud Build view, join the agones-discuss Google Group.

Collaborator

agones-bot commented Jun 26, 2020

Build Failed 😱

Build Id: 944942be-a355-4876-b1dd-34971173cf3d

To get permission to view the Cloud Build view, join the agones-discuss Google Group.

aLekSer marked this pull request as ready for review

June 26, 2020 18:30

Collaborator

agones-bot commented Sep 16, 2020

Build Failed 😱

Build Id: 76eee144-e1f1-4bee-8326-ce700027108c

To get permission to view the Cloud Build view, join the agones-discuss Google Group.

Collaborator Author

aLekSer commented Sep 16, 2020

Hugo panic:


fatal error: concurrent map read and map write

goroutine 202 [running]:
runtime.throw(0x1e2fd5a, 0x21)
	/usr/local/go/src/runtime/panic.go:1116 +0x72 fp=0xc003407428 sp=0xc0034073f8 pc=0x4f1712

aLekSer force-pushed the fix/rolling-update branch from 3081096 to dae07dd Compare

September 16, 2020 22:39

Collaborator

agones-bot commented Sep 16, 2020

Build Succeeded 👏

Build Id: 3be32637-3662-456b-9d22-915a1e011502

The following development artifacts have been built, and will exist for the next 30 days:

image: gcr.io/agones-images/agones-controller:1.9.0-3081096
image: gcr.io/agones-images/agones-sdk:1.9.0-3081096
image: gcr.io/agones-images/agones-ping:1.9.0-3081096
Linux C++ SDK (build): agonessdk-1.9.0-3081096-linux-arch_64.tar.gz
SDK Server: agonessdk-server-1.9.0-3081096.zip

A preview of the website (the last 30 builds are retained):

https://3081096-dot-preview-dot-agones-images.appspot.com/

To install this version:

git fetch https://github.com/googleforgames/agones.git pull/1626/head:pr_1626 && git checkout pr_1626
helm install ./install/helm/agones --namespace agones-system --name agones --set agones.image.tag=1.9.0-3081096

Collaborator

agones-bot commented Sep 16, 2020

Build Succeeded 👏

Build Id: 51c81abf-65f0-4188-a0c6-194472b8d930

The following development artifacts have been built, and will exist for the next 30 days:

image: gcr.io/agones-images/agones-controller:1.9.0-dae07dd
image: gcr.io/agones-images/agones-sdk:1.9.0-dae07dd
image: gcr.io/agones-images/agones-ping:1.9.0-dae07dd
Linux C++ SDK (build): agonessdk-1.9.0-dae07dd-linux-arch_64.tar.gz
SDK Server: agonessdk-server-1.9.0-dae07dd.zip

A preview of the website (the last 30 builds are retained):

https://dae07dd-dot-preview-dot-agones-images.appspot.com/

To install this version:

git fetch https://github.com/googleforgames/agones.git pull/1626/head:pr_1626 && git checkout pr_1626
helm install ./install/helm/agones --namespace agones-system --name agones --set agones.image.tag=1.9.0-dae07dd

markmandel reviewed

Member

markmandel left a comment

Looks like mostly some doc things, and maybe a unit test, but this looks good 👍

Sorry it took me a while to review.

site/content/en/docs/Guides/fleet-updates.md Outdated

		{{< alpha title="Rolling Update on Ready" gate="RollingUpdateOnReady" >}}

		If we are updating the Fleet configuration, the new GameServerSet would be created with 0 GameServers at the beginning, if RollingUpdate deployment strategy is used. After creating a first batch of `MaxSurge` GameServers, old GameServerSet should be waiting before some of them become Ready, before scaling down GameServers which belong to an old GameServerSet.

Member

markmandel Sep 22, 2020

This feels to me like it's written in the negative. I would recommend writing it in the positive.

Something like:

When this feature is enabled, Fleets will wait for the new GameServers to become Ready during a Rolling Update, to ensure there is always a set of Ready GameServers before attempting to shut down the previous version Fleet's GameServers.

This ensures a Fleet cannot accidentally have 0 GameServers Ready if something goes wrong during a RollingUpdate, or GameServers have a long delay when moving to a Ready state.

What do you think of that?

Collaborator Author

aLekSer Sep 22, 2020

Thanks for the advice. I will rewrite it today.

pkg/apis/agones/v1/fleet.go

+              // SumSpecReplicas returns the total number of
+              // Spec.Replicas in the list of GameServerSets
+              func SumSpecReplicas(list []*GameServerSet) int32 {

Member

markmandel Sep 22, 2020

Should these functions have their own Unit tests?

Collaborator Author

aLekSer Sep 22, 2020

Yes, that's true, adding those tests.
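
One possible shape for such tests is a small table of cases, including the nil entry that a later commit mentions. The `gameServerSet` type, the body of `sumSpecReplicas`, and the `runCases` helper are assumptions written for this sketch; a real test would use the `testing` package against the actual `SumSpecReplicas`.

```go
package main

import "fmt"

// gameServerSet is a hypothetical stand-in holding only Spec.Replicas.
type gameServerSet struct {
	specReplicas int32
}

// sumSpecReplicas is an assumed implementation, written for this sketch only.
func sumSpecReplicas(list []*gameServerSet) int32 {
	var total int32
	for _, gsSet := range list {
		if gsSet != nil {
			total += gsSet.specReplicas
		}
	}
	return total
}

// runCases executes the table and returns a description of each failing case.
func runCases() []string {
	cases := []struct {
		name string
		list []*gameServerSet
		want int32
	}{
		{name: "nil list", list: nil, want: 0},
		{name: "single set", list: []*gameServerSet{{specReplicas: 5}}, want: 5},
		{name: "nil entry is skipped", list: []*gameServerSet{{specReplicas: 5}, nil, {specReplicas: 7}}, want: 12},
	}
	var failed []string
	for _, tc := range cases {
		if got := sumSpecReplicas(tc.list); got != tc.want {
			failed = append(failed, fmt.Sprintf("%s: got %d, want %d", tc.name, got, tc.want))
		}
	}
	return failed
}

func main() {
	fmt.Println("failed cases:", len(runCases()))
}
```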

pkg/fleets/controller.go

               // rollingUpdateRest applies the rolling update to the inactive GameServerSets
-              func (c *Controller) rollingUpdateRest(fleet *agonesv1.Fleet, rest []*agonesv1.GameServerSet) error {
+              func (c *Controller) rollingUpdateRest(fleet *agonesv1.Fleet, active *agonesv1.GameServerSet, rest []*agonesv1.GameServerSet) error {
+              	if runtime.FeatureEnabled(runtime.FeatureRollingUpdateOnReady) {

Member

markmandel Sep 22, 2020

This is so much nicer 👍

Collaborator Author

aLekSer Sep 22, 2020

thanks, I think so too.
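
The gating pattern in the diff above can be illustrated with a self-contained sketch. The map-based registry and the `featureEnabled` and `scaleDownStrategy` helpers are hypothetical; the PR itself calls `runtime.FeatureEnabled(runtime.FeatureRollingUpdateOnReady)`, and only the gate name `RollingUpdateOnReady` is taken from the thread.

```go
package main

import "fmt"

// featureGates is a hypothetical in-memory gate registry used only here.
var featureGates = map[string]bool{"RollingUpdateOnReady": false}

// featureEnabled reports whether a gate is switched on; unknown gates are off.
func featureEnabled(name string) bool { return featureGates[name] }

// scaleDownStrategy reports which code path a rolling update would take.
func scaleDownStrategy() string {
	if featureEnabled("RollingUpdateOnReady") {
		return "wait-for-ready"
	}
	return "legacy"
}

func main() {
	fmt.Println(scaleDownStrategy())
	featureGates["RollingUpdateOnReady"] = true
	fmt.Println(scaleDownStrategy())
}
```

Keeping the old path behind the disabled gate lets the new behaviour move from alpha to beta to stable without changing existing Fleets by default.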

aLekSer added 15 commits

September 22, 2020 13:06


          Fix RollingUpdate

3eaa39a

Make sure that GameServers actually are Ready before scaling down
inactive GameServerSet.


          Max Unavailable gameservers is bound to the param

1fe8f6e

And never goes up to 2 times bigger than max unavailable.


          Add FeatureGate and count ReadyReplicas in all GSS

f503272

Enhancement: with this featureGate we can scale down multiple GSS at once,
but only by a fraction of maxUnavailable.


          Update feature name and add new feature to make

73ac819

Run E2E tests with new feature.


          Updates to lower bound of maxUnavailable

5c83d7c


          Fix test for the new FeatureGate

4b5559e

Note that tests run in parallel can have the featureGate enabled or disabled at
random.


          Fix tests, add proper FeatureEnabled for old logic

1bab964

When we base our MaxUnavailable on ReadyReplicas a fleet_test check
 should be more precise and can contain more than replicas + maxSurge
 + maxUnavailable. Tested with deployments.


          Change in the name of a featureGate

c95b76e


          Fix flaky new test

ff9a567


          Applying most of the comments

52afd55

What is left: add comment and perform refactoring: create a separate
function for this fix. Add a more verbose error message on errors
connected with MaxUnavailable and MaxSurge.


          Split rollingUpdateRest into two separate funcs

85e295a

Apply comments from PR, this change focus only on rollingUpdateRest,
Active left as it was.


          E2E test made more readable

896dcb9

Add one more additional check of maxScaledDown.


          Add cleanup unhealthy replicas function

08723b4

Fixed all issues with scale down, even if all replicas are in Unhealthy
or Scheduled state in a previous GSS.


          Applying comments

2e95990

Fixing tests: E2E and unit test. Adding docs section.


          Add Unit tests for sum functions

87ff91f

Add nil value check for SumSpecReplicas.

aLekSer force-pushed the fix/rolling-update branch from dae07dd to 87ff91f Compare

September 22, 2020 13:12

Collaborator

agones-bot commented Sep 22, 2020

Build Succeeded 👏

Build Id: 3dbe0d26-76af-4f82-a098-ac6026ae2f62

The following development artifacts have been built, and will exist for the next 30 days:

image: gcr.io/agones-images/agones-controller:1.9.0-87ff91f
image: gcr.io/agones-images/agones-sdk:1.9.0-87ff91f
image: gcr.io/agones-images/agones-ping:1.9.0-87ff91f
Linux C++ SDK (build): agonessdk-1.9.0-87ff91f-linux-arch_64.tar.gz
SDK Server: agonessdk-server-1.9.0-87ff91f.zip

A preview of the website (the last 30 builds are retained):

https://87ff91f-dot-preview-dot-agones-images.appspot.com/

To install this version:

git fetch https://github.com/googleforgames/agones.git pull/1626/head:pr_1626 && git checkout pr_1626
helm install ./install/helm/agones --namespace agones-system --name agones --set agones.image.tag=1.9.0-87ff91f

markmandel approved these changes

google-oss-robot assigned markmandel

google-oss-robot added the lgtm label

google-oss-robot commented Sep 22, 2020

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: aLekSer, markmandel

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~OWNERS~~ [aLekSer,markmandel]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

markmandel merged commit dc56175 into googleforgames:master

markmandel added this to the 1.9.0 milestone

ilkercelikyilmaz pushed a commit to ilkercelikyilmaz/agones that referenced this pull request


          Fix Fleets RollingUpdate (googleforgames#1626)

d1e665c

* Fix RollingUpdate

Make sure that GameServers actually are Ready before scaling down
inactive GameServerSet.

Labels

approved cla: yes kind/feature lgtm size/L