Fix Fleets RollingUpdate #1626
Conversation
Force-pushed from f698d3a to 80e3466 (compare)
Flaky CSharp SDK conformance test:
Force-pushed from 80e3466 to 1966b0e (compare)
The proposed solution, with the steps from the original bug, would stop the shutdown
We should do
Since this is a change in strategy for such an important piece of infrastructure, should we put this behind a feature flag that we move from alpha -> beta -> stable?
Yes, a feature flag would be a must for such a change. By the way, an additional point to think of:
The original Kubernetes code does a similar thing: it loops through all available
Need to see if we need to add
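A minimal Go sketch of the loop shape being discussed, in the spirit of how the Kubernetes Deployment controller walks old ReplicaSets during a rolling update; the type and function names here are illustrative only, not Agones' or Kubernetes' actual identifiers:

```go
package sketch

// gss is a simplified stand-in for agonesv1.GameServerSet.
type gss struct {
	Name         string
	SpecReplicas int32
}

// scaleDownInactive walks every inactive GameServerSet and trims each one by
// at most the remaining budget, so the total scale-down across all old sets
// never exceeds the budget.
func scaleDownInactive(rest []*gss, budget int32) {
	for _, set := range rest {
		if budget <= 0 {
			return
		}
		step := set.SpecReplicas
		if step > budget {
			step = budget
		}
		set.SpecReplicas -= step
		budget -= step
	}
}
```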
Build Failed 😱 Build Id: 82c5bd51-0dd0-44e3-ae30-b133937471d3 To get permission to view the Cloud Build view, join the agones-discuss Google Group.
Force-pushed from eaa4a19 to 7bad67d (compare)
Build Failed 😱 Build Id: 23319269-6f17-4b90-90a8-8504efdb2fde To get permission to view the Cloud Build view, join the agones-discuss Google Group.
Force-pushed from 96ec224 to 1737b60 (compare)
Build Failed 😱 Build Id: 2ab866da-433a-465c-8e33-54e7c2863229 To get permission to view the Cloud Build view, join the agones-discuss Google Group.
Force-pushed from 877958a to 445c52a (compare)
Build Failed 😱 Build Id: 789c54e0-6469-4c0b-a56e-6c4f914c431d To get permission to view the Cloud Build view, join the agones-discuss Google Group.
Build Failed 😱 Build Id: 5512e1dc-4fb4-4a3e-92a7-6328389ff965 To get permission to view the Cloud Build view, join the agones-discuss Google Group.
Build Failed 😱 Build Id: 944942be-a355-4876-b1dd-34971173cf3d To get permission to view the Cloud Build view, join the agones-discuss Google Group.
Build Failed 😱 Build Id: 76eee144-e1f1-4bee-8326-ce700027108c To get permission to view the Cloud Build view, join the agones-discuss Google Group.
Hugo panic:
Force-pushed from 3081096 to dae07dd (compare)
Build Succeeded 👏 Build Id: 3be32637-3662-456b-9d22-915a1e011502 The following development artifacts have been built, and will exist for the next 30 days:
A preview of the website (the last 30 builds are retained): To install this version:
Build Succeeded 👏 Build Id: 51c81abf-65f0-4188-a0c6-194472b8d930 The following development artifacts have been built, and will exist for the next 30 days:
A preview of the website (the last 30 builds are retained): To install this version:
Looks like mostly some doc things, and maybe a unit test, but this looks good 👍
Sorry it took me a while to review.
{{< alpha title="Rolling Update on Ready" gate="RollingUpdateOnReady" >}}

If we are updating the Fleet configuration, the new GameServerSet would be created with 0 GameServers at the beginning, if RollingUpdate deployment strategy is used. After creating a first batch of `MaxSurge` GameServers, old GameServerSet should be waiting before some of them become Ready, before scaling down GameServers which belong to an old GameServerSet.
This feels to me like it's written in the negative. I would recommend writing it in the positive.
Something like:
When this feature is enabled, Fleets will wait for the new GameServers to become Ready during a Rolling Update, to ensure there is always a set of Ready GameServers before attempting to shut down the previous version Fleet's GameServers.
This ensures a Fleet cannot accidentally have 0 GameServers Ready if something goes wrong during a RollingUpdate, or GameServers have a long delay when moving to a Ready state.
What do you think of that?
Thanks for the advice. I will rewrite it today.
// SumSpecReplicas returns the total number of
// Spec.Replicas in the list of GameServerSets
func SumSpecReplicas(list []*GameServerSet) int32 {
Should these functions have their own Unit tests?
Yes, that's true, adding those tests.
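A minimal sketch of what such a test could look like for SumSpecReplicas, assuming the usual GameServerSetSpec.Replicas field, testify assertions, and the nil value check mentioned later in this PR; the package layout and assertions are guesses, not the PR's actual test code:

```go
package v1

import (
	"testing"

	"github.com/stretchr/testify/assert"
)

// TestSumSpecReplicas sums Spec.Replicas across the list, and includes a nil
// entry since a nil value check is added later in this PR.
func TestSumSpecReplicas(t *testing.T) {
	list := []*GameServerSet{
		{Spec: GameServerSetSpec{Replicas: 3}},
		{Spec: GameServerSetSpec{Replicas: 5}},
		nil, // should be skipped, not panic
	}

	assert.Equal(t, int32(8), SumSpecReplicas(list))
	assert.Equal(t, int32(0), SumSpecReplicas(nil))
}
```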
// rollingUpdateRest applies the rolling update to the inactive GameServerSets
func (c *Controller) rollingUpdateRest(fleet *agonesv1.Fleet, rest []*agonesv1.GameServerSet) error {
func (c *Controller) rollingUpdateRest(fleet *agonesv1.Fleet, active *agonesv1.GameServerSet, rest []*agonesv1.GameServerSet) error {
if runtime.FeatureEnabled(runtime.FeatureRollingUpdateOnReady) {
This is so much nicer 👍
thanks, I think so too.
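To make the signature change concrete, here is a self-contained sketch (simplified stand-in types, not the PR's actual code) of why the active GameServerSet is now passed in: with the RollingUpdateOnReady gate enabled, the scale-down budget can be derived from GameServers that are actually Ready rather than from requested replicas alone.

```go
package sketch

// gameServerSet is a simplified stand-in for agonesv1.GameServerSet.
type gameServerSet struct {
	SpecReplicas  int32
	ReadyReplicas int32
}

// scaleDownBudget returns how many old GameServers could be shut down while
// keeping at least fleetReplicas-maxUnavailable of them available. With the
// RollingUpdateOnReady gate on, "available" means actually Ready; with it
// off, it means requested, which is the pre-fix behaviour.
func scaleDownBudget(rollingUpdateOnReady bool, fleetReplicas, maxUnavailable int32,
	active *gameServerSet, rest []*gameServerSet) int32 {
	ready, requested := active.ReadyReplicas, active.SpecReplicas
	for _, set := range rest {
		ready += set.ReadyReplicas
		requested += set.SpecReplicas
	}

	available := requested
	if rollingUpdateOnReady {
		available = ready
	}

	budget := available - (fleetReplicas - maxUnavailable)
	if budget < 0 {
		return 0
	}
	return budget
}
```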
Make sure that GameServers actually are Ready before scaling down the inactive GameServerSet.
And it never goes up to more than two times maxUnavailable.
Enhancement: with this feature gate we can scale down multiple GameServerSets at once, but only by a fraction of maxUnavailable.
Run E2E tests with the new feature.
Note that tests running in parallel can have the feature gate enabled or disabled at random.
When we base our MaxUnavailable on ReadyReplicas, the fleet_test check should be more precise, and the total can be more than replicas + maxSurge + maxUnavailable. Tested with Deployments.
What is left: add comments and refactor: create a separate function for this fix. Add more verbose error messages for errors connected with MaxUnavailable and MaxSurge.
Apply comments from the PR; this change focuses only on rollingUpdateRest, the active GameServerSet is left as it was.
Add one more check of maxScaledDown.
Fixed all issues with scale down, even if all replicas of a previous GameServerSet are in an Unhealthy or Scheduled state.
Fixing tests: E2E and unit tests. Adding a docs section.
Add a nil value check for SumSpecReplicas.
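A hedged sketch of the kind of invariant an E2E check could poll for, per the fleet_test notes above; the helper is hypothetical, not the PR's actual test code:

```go
package sketch

import "fmt"

// checkReadyFloor is an illustrative E2E-style assertion: at every poll
// during a rolling update, the number of Ready GameServers should stay at or
// above replicas - maxUnavailable, which is the guarantee this fix is after.
func checkReadyFloor(ready, replicas, maxUnavailable int32) error {
	if floor := replicas - maxUnavailable; ready < floor {
		return fmt.Errorf("only %d Ready GameServers, expected at least %d", ready, floor)
	}
	return nil
}
```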
Force-pushed from dae07dd to 87ff91f (compare)
Build Succeeded 👏 Build Id: 3dbe0d26-76af-4f82-a098-ac6026ae2f62 The following development artifacts have been built, and will exist for the next 30 days:
A preview of the website (the last 30 builds are retained): To install this version:
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: aLekSer, markmandel. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing
* Fix RollingUpdate: Make sure that GameServers actually are Ready before scaling down the inactive GameServerSet.
What type of PR is this?
/kind bug
What this PR does / Why we need it:
If creating new `GameServers` takes more than 30 seconds, there is a situation where all GameServers would go down to 0 and all new `GameServers` would be in a `Scheduled` state.
Which issue(s) this PR fixes:
Closes #1625
Special notes for your reviewer:
There are steps to reproduce in the linked ticket. Will create a simple E2E test to make sure this functionality is covered.
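To illustrate the failure mode with toy numbers (the values and the budget formula are illustrative only, not the controller's actual arithmetic): while the new GameServers sit in Scheduled, a budget computed from requested replicas allows the old set to be scaled to zero, while a budget computed from Ready replicas does not.

```go
package main

import "fmt"

func main() {
	// Toy fleet mid rolling update: 4 replicas, maxUnavailable of 1.
	fleetReplicas, maxUnavailable := int32(4), int32(1)

	// New GameServerSet: 4 GameServers created, all still Scheduled (not Ready).
	newSpec, newReady := int32(4), int32(0)
	// Old GameServerSet: still fully Ready.
	oldSpec, oldReady := int32(4), int32(4)

	// Budget from requested replicas (pre-fix): 8 - (4-1) = 5, which is enough
	// to shut down every old GameServer before anything new becomes Ready.
	bySpec := (newSpec + oldSpec) - (fleetReplicas - maxUnavailable)

	// Budget from Ready replicas (with the fix): 4 - (4-1) = 1, so at most one
	// old GameServer goes away until the new ones become Ready.
	byReady := (newReady + oldReady) - (fleetReplicas - maxUnavailable)

	fmt.Printf("scale-down budget: by spec=%d, by ready=%d\n", bySpec, byReady)
}
```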