Fix Fleets RollingUpdate #1626

Merged 15 commits on Sep 22, 2020

Conversation

aLekSer
Collaborator

@aLekSer aLekSer commented Jun 15, 2020

Make sure that GameServers are actually Ready before scaling down
the inactive GameServerSet.

What type of PR is this?

Uncomment only one /kind <> line, hit enter to put that in a new line, and remove leading whitespace from that line:

/kind breaking

/kind bug

/kind cleanup
/kind documentation
/kind feature
/kind hotfix

What this PR does / Why we need it:
If creating new GameServers takes more than 30 seconds, the existing GameServers can all be scaled down to 0 while all the new GameServers are still in a Scheduled state.

Which issue(s) this PR fixes:

Closes #1625

Special notes for your reviewer:

Steps to reproduce are included inline in the ticket. I will create a simple E2E test to make sure this functionality is covered.

@aLekSer
Collaborator Author

aLekSer commented Jun 15, 2020

Flaky CSharp SDK conformance test:

For more information on configuring HTTPS see https://go.microsoft.com/fwlink/?linkid=848054.
/usr/share/dotnet/sdk/2.2.402/NuGet.targets(123,5): error : The file '/go/src/agones.dev/agones/sdks/csharp/sdk/obj/csharp-sdk.csproj.nuget.g.props' already exists. [/go/src/agones.dev/agones/sdks/csharp/test/csharp-sdk-test.csproj]
includes/sdk.mk:88: recipe for target 'run-sdk-command' failed
make[1]: *** [run-sdk-command] Error 1
includes/sdk.mk:84: recipe for target 'run-sdk-command-csharp' failed

@aLekSer
Collaborator Author

aLekSer commented Jun 15, 2020

With the steps from the original bug, the proposed solution stops shutting down GameServers slightly after the right point (50% left), but the approach is right:

k get fleets
NAME         SCHEDULING   DESIRED   CURRENT   ALLOCATED   READY   AGE
simple-udp   Packed       20        25        0           10      62s

We should implement a RollingUpdate strategy similar to the one Deployments have: https://kubernetes.io/docs/concepts/workloads/controllers/deployment/#strategy
So looking at the ReadyReplicas count is crucial.
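
As a rough illustration of why the Ready count matters, here is a minimal Go sketch loosely modelled on the scale-down budget used by the Kubernetes Deployment controller. The `replicaSet` type, the `maxScaleDown` function, and the numbers (20 desired, 5 max unavailable) are assumptions made for this example; they are not the Agones types or the code in this PR, and the sketch assumes all old replicas are healthy.

```go
package main

import "fmt"

// replicaSet is a hypothetical, simplified stand-in for a GameServerSet
// (or a Kubernetes ReplicaSet). It is not the Agones type.
type replicaSet struct {
	specReplicas  int32 // replicas requested in the spec
	readyReplicas int32 // replicas that are actually Ready
}

// maxScaleDown returns how many replicas may be removed from the old sets
// without the Ready count dropping below desired - maxUnavailable.
// Replicas of the new set that are not Ready yet reduce the budget.
func maxScaleDown(desired, maxUnavailable int32, newSet replicaSet, oldSets []replicaSet) int32 {
	total := newSet.specReplicas
	for _, rs := range oldSets {
		total += rs.specReplicas
	}
	minAvailable := desired - maxUnavailable
	newUnavailable := newSet.specReplicas - newSet.readyReplicas
	budget := total - minAvailable - newUnavailable
	if budget < 0 {
		return 0
	}
	return budget
}

func main() {
	// First batch: 5 new replicas surged, none Ready yet, 20 old replicas.
	fmt.Println(maxScaleDown(20, 5, replicaSet{5, 0}, []replicaSet{{20, 20}})) // 5

	// Second batch: 10 new replicas, still none Ready, so the old set must wait.
	fmt.Println(maxScaleDown(20, 5, replicaSet{10, 0}, []replicaSet{{15, 15}})) // 0

	// Once 5 new replicas are Ready, 5 more old replicas can go.
	fmt.Println(maxScaleDown(20, 5, replicaSet{10, 5}, []replicaSet{{15, 15}})) // 5
}
```

The second call is the case from the bug report: while the new GameServers are stuck in a non-Ready state, the budget stays at 0 and the old set is left alone.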

@markmandel
Member

markmandel commented Jun 15, 2020

Since this is a change in strategy, for such an important piece of infrastructure, should we put this behind a feature flag that we move from alpha->beta->stable?

@aLekSer
Collaborator Author

aLekSer commented Jun 15, 2020

Yes, a feature flag would be a must for such a change. By the way, an additional point to think about:
https://github.com/kubernetes/kubernetes/blob/323f34858de18b862d43c40b2cced65ad8e24052/pkg/controller/deployment/rolling.go#L192

@aLekSer
Collaborator Author

aLekSer commented Jun 25, 2020

The original Kubernetes code does a similar thing: it loops through all the ReplicaSets and calculates totalAvailableReplicas += rs.Status.AvailableReplicas:
https://github.com/kubernetes/kubernetes/blob/e529bd0bcad66fd9afe4e7ad248acbc13563aaa0/pkg/controller/deployment/util/deployment_util.go#L723:1

@aLekSer
Collaborator Author

aLekSer commented Jun 25, 2020

We need to check whether to add a kubectl rolling-update command for Fleets, and whether Unhealthy GameServers can lead to issues similar to kubernetes/kubernetes#16737

@agones-bot
Collaborator

Build Failed 😱

Build Id: 82c5bd51-0dd0-44e3-ae30-b133937471d3

To get permission to view the Cloud Build view, join the agones-discuss Google Group.

@agones-bot
Collaborator

Build Failed 😱

Build Id: 23319269-6f17-4b90-90a8-8504efdb2fde

To get permission to view the Cloud Build view, join the agones-discuss Google Group.

@agones-bot
Collaborator

Build Failed 😱

Build Id: 2ab866da-433a-465c-8e33-54e7c2863229

To get permission to view the Cloud Build view, join the agones-discuss Google Group.

@aLekSer aLekSer force-pushed the fix/rolling-update branch 2 times, most recently from 877958a to 445c52a Compare June 26, 2020 16:59
@agones-bot
Collaborator

Build Failed 😱

Build Id: 789c54e0-6469-4c0b-a56e-6c4f914c431d

To get permission to view the Cloud Build view, join the agones-discuss Google Group.

@agones-bot
Collaborator

Build Failed 😱

Build Id: 5512e1dc-4fb4-4a3e-92a7-6328389ff965

To get permission to view the Cloud Build view, join the agones-discuss Google Group.

@agones-bot
Collaborator

Build Failed 😱

Build Id: 944942be-a355-4876-b1dd-34971173cf3d

To get permission to view the Cloud Build view, join the agones-discuss Google Group.

@aLekSer aLekSer marked this pull request as ready for review June 26, 2020 18:30
@agones-bot
Collaborator

Build Failed 😱

Build Id: 76eee144-e1f1-4bee-8326-ce700027108c

To get permission to view the Cloud Build view, join the agones-discuss Google Group.

@aLekSer
Collaborator Author

aLekSer commented Sep 16, 2020

Hugo panic:


fatal error: concurrent map read and map write

goroutine 202 [running]:
runtime.throw(0x1e2fd5a, 0x21)
	/usr/local/go/src/runtime/panic.go:1116 +0x72 fp=0xc003407428 sp=0xc0034073f8 pc=0x4f1712

@agones-bot
Collaborator

Build Succeeded 👏

Build Id: 3be32637-3662-456b-9d22-915a1e011502

The following development artifacts have been built, and will exist for the next 30 days:

A preview of the website (the last 30 builds are retained):

To install this version:

  • git fetch https://github.com/googleforgames/agones.git pull/1626/head:pr_1626 && git checkout pr_1626
  • helm install ./install/helm/agones --namespace agones-system --name agones --set agones.image.tag=1.9.0-3081096

@agones-bot
Collaborator

Build Succeeded 👏

Build Id: 51c81abf-65f0-4188-a0c6-194472b8d930

The following development artifacts have been built, and will exist for the next 30 days:

A preview of the website (the last 30 builds are retained):

To install this version:

  • git fetch https://github.com/googleforgames/agones.git pull/1626/head:pr_1626 && git checkout pr_1626
  • helm install ./install/helm/agones --namespace agones-system --name agones --set agones.image.tag=1.9.0-dae07dd

Member

@markmandel markmandel left a comment

Looks like mostly some doc things, and maybe a unit test, but this looks good 👍

Sorry it took me a while to review.

{{< alpha title="Rolling Update on Ready" gate="RollingUpdateOnReady" >}}

If we are updating the Fleet configuration, the new GameServerSet would be created with 0 GameServers at the beginning, if RollingUpdate deployment strategy is used. After creating a first batch of `MaxSurge` GameServers, old GameServerSet should be waiting before some of them become Ready, before scaling down GameServers which belong to an old GameServerSet.
Member

This feels to me like it's written in the negative. I would recommend writing it in the positive.

Something like:

When this feature is enabled, Fleets will wait for the new GameServers to become Ready during a Rolling Update, to ensure there is always a set of Ready GameServers before attempting to shut down the previous version Fleet's GameServers.

This ensures a Fleet cannot accidentally have 0 GameServers Ready if something goes wrong during a RollingUpdate, or GameServers have a long delay when moving to a Ready state.

What do you think of that?

Collaborator Author

Thanks for the advice. I will rewrite it today.


// SumSpecReplicas returns the total number of
// Spec.Replicas in the list of GameServerSets
func SumSpecReplicas(list []*GameServerSet) int32 {
Member

Should these functions have their own Unit tests?

Collaborator Author

Yes, that's true, adding those tests.
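
For illustration only, a table-driven check for a replica-summing helper could look like the sketch below. The `gameServerSet` type and the lowercase `sumSpecReplicas` are hypothetical, trimmed-down stand-ins written for this example, not the Agones definitions; the nil-entry case reflects the nil check mentioned in this PR's commit notes.

```go
package main

import "fmt"

// gameServerSet is a hypothetical stand-in for the Agones GameServerSet type.
// Only the field summed here is modelled.
type gameServerSet struct {
	specReplicas int32
}

// sumSpecReplicas returns the total spec replicas, skipping nil entries.
func sumSpecReplicas(list []*gameServerSet) int32 {
	var total int32
	for _, gsSet := range list {
		if gsSet != nil {
			total += gsSet.specReplicas
		}
	}
	return total
}

// checkSumSpecReplicas runs table-driven cases and reports the first mismatch.
func checkSumSpecReplicas() error {
	cases := []struct {
		name string
		list []*gameServerSet
		want int32
	}{
		{"nil list", nil, 0},
		{"single set", []*gameServerSet{{specReplicas: 10}}, 10},
		{"nil entry is skipped", []*gameServerSet{{specReplicas: 10}, nil, {specReplicas: 5}}, 15},
	}
	for _, tc := range cases {
		if got := sumSpecReplicas(tc.list); got != tc.want {
			return fmt.Errorf("%s: got %d, want %d", tc.name, got, tc.want)
		}
	}
	return nil
}

func main() {
	if err := checkSumSpecReplicas(); err != nil {
		panic(err)
	}
	fmt.Println("all cases passed")
}
```

In a real Go package the same table would normally live in a `_test.go` file inside a `func TestSumSpecReplicas(t *testing.T)` with one `t.Run` per case; a plain function is used here so the sketch runs on its own.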

// rollingUpdateRest applies the rolling update to the inactive GameServerSets
- func (c *Controller) rollingUpdateRest(fleet *agonesv1.Fleet, rest []*agonesv1.GameServerSet) error {
+ func (c *Controller) rollingUpdateRest(fleet *agonesv1.Fleet, active *agonesv1.GameServerSet, rest []*agonesv1.GameServerSet) error {
+ 	if runtime.FeatureEnabled(runtime.FeatureRollingUpdateOnReady) {
Member

This is so much nicer 👍

Collaborator Author

thanks, I think so too.
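
To make the gating pattern concrete, here is a small self-contained Go sketch of switching between the old and new scale-down paths based on a gate read from a query-encoded string such as `RollingUpdateOnReady=true`. The `parseGates` and `scaleDownStrategy` helpers and the strategy names are hypothetical and written only for this example; they are not the Agones `runtime` package.

```go
package main

import (
	"fmt"
	"net/url"
	"strconv"
)

// gateRollingUpdateOnReady is the gate name used in this PR's docs shortcode.
const gateRollingUpdateOnReady = "RollingUpdateOnReady"

// parseGates is a hypothetical helper: it reads a query-encoded string such as
// "RollingUpdateOnReady=true&Other=false" into a map of gate name to state.
func parseGates(s string) (map[string]bool, error) {
	values, err := url.ParseQuery(s)
	if err != nil {
		return nil, err
	}
	gates := map[string]bool{}
	for name, v := range values {
		// Only the first value for each gate name is considered.
		enabled, err := strconv.ParseBool(v[0])
		if err != nil {
			return nil, fmt.Errorf("gate %q: %w", name, err)
		}
		gates[name] = enabled
	}
	return gates, nil
}

// scaleDownStrategy picks the code path. A gate that is not set reads as
// disabled, which is the assumed default for an alpha feature.
func scaleDownStrategy(gates map[string]bool) string {
	if gates[gateRollingUpdateOnReady] {
		return "wait-for-ready"
	}
	return "legacy"
}

func main() {
	gates, err := parseGates("RollingUpdateOnReady=true")
	if err != nil {
		panic(err)
	}
	fmt.Println(scaleDownStrategy(gates))
}
```

Keeping the previous behaviour as the fallthrough branch means an installation that never sets the gate sees no change, which fits the alpha to beta to stable progression discussed earlier in this thread.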

Make sure that GameServers are actually Ready before scaling down the inactive GameServerSet.
And never goes up to 2 times bigger than max unavailable.
Enhancement: with this feature gate we can scale down multiple GSS at once, but only by a fraction of maxUnavailable.
Run E2E tests with the new feature.
Note that tests run in parallel can have the feature gate enabled or disabled at random.
When we base our MaxUnavailable on ReadyReplicas, the fleet_test check should be more precise and can contain more than replicas + maxSurge + maxUnavailable. Tested with deployments.
What is left: add a comment and refactor by creating a separate function for this fix. Add a more verbose error message for errors connected with MaxUnavailable and MaxSurge.
Apply comments from the PR; this change focuses only on rollingUpdateRest, Active is left as it was.
Add one more check of maxScaledDown.
Fixed all issues with scale down, even if all replicas are in an Unhealthy or Scheduled state in a previous GSS.
Fixing tests: E2E and unit test. Adding docs section.
Add nil value check for SumSpecReplicas.
@agones-bot
Collaborator

Build Succeeded 👏

Build Id: 3dbe0d26-76af-4f82-a098-ac6026ae2f62

The following development artifacts have been built, and will exist for the next 30 days:

A preview of the website (the last 30 builds are retained):

To install this version:

  • git fetch https://github.com/googleforgames/agones.git pull/1626/head:pr_1626 && git checkout pr_1626
  • helm install ./install/helm/agones --namespace agones-system --name agones --set agones.image.tag=1.9.0-87ff91f

@google-oss-robot

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: aLekSer, markmandel

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@markmandel markmandel merged commit dc56175 into googleforgames:master Sep 22, 2020
@markmandel markmandel added this to the 1.9.0 milestone Sep 22, 2020
ilkercelikyilmaz pushed a commit to ilkercelikyilmaz/agones that referenced this pull request Oct 23, 2020
* Fix RollingUpdate

Make sure that GameServers actually are Ready before scaling down
inactive GameServerSet.
Development

Successfully merging this pull request may close these issues.

Rolling updates should wait for batches to become healthy before iterating
6 participants