Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix Rolling Update with Allocated GameServers #2420

Merged
merged 18 commits into from
Mar 17, 2022

Conversation

WVerlaek
Copy link
Contributor

@WVerlaek WVerlaek commented Jan 5, 2022

What type of PR is this?

Uncomment only one /kind <> line, press enter to put that in a new line, and remove leading whitespace from that line:

/kind breaking
/kind bug
/kind cleanup
/kind documentation
/kind feature
/kind hotfix

What this PR does / Why we need it:

See discussion in #2397. This PR reproduces the issue there, where a Fleet rolling update can get stuck if there's a significant percentage of GameServers in an Allocated state.

This is demonstrated by updating the existing e2e test for Fleet Rolling Update, by repeatedly allocating and shutting down GameServers in the Fleet, keeping around ~half of the Fleet in an Allocated state.

Recording of a run of the e2e test:

Screen.Recording.2022-01-05.at.15.44.02.mov

First, half of the fleet gets Allocated, then the Fleet update is made which creates the second GameServerSet. These two GSSs then get stuck indefinitely, the old GSS never fully scales down.

Which issue(s) this PR fixes:

Closes #2397

Special notes for your reviewer:

@agones-bot
Copy link
Collaborator

Build Failed 😱

Build Id: 873efebb-0a62-45b0-a32a-d491860e94a0

To get permission to view the Cloud Build view, join the agones-discuss Google Group.

@agones-bot
Copy link
Collaborator

Build Failed 😱

Build Id: 118e3d6e-effc-4498-8993-1a6fec50c2a8

To get permission to view the Cloud Build view, join the agones-discuss Google Group.

@WVerlaek
Copy link
Contributor Author

WVerlaek commented Jan 5, 2022

The potential solution in #2397 (comment) which would change

newGSSUnavailablePodCount := active.Spec.Replicas - active.Status.ReadyReplicas

to

newGSSUnavailablePodCount := active.Spec.Replicas - active.Status.ReadyReplicas - active.Status.AllocatedReplicas

(in https://github.com/googleforgames/agones/blob/main/pkg/fleets/controller.go#L532)

does scale down the old GSS, while leaving Ready replicas up:

Screen.Recording.2022-01-05.at.17.38.18.mov

@markmandel
Copy link
Member

Just letting you know we haven't forgotten about this - been clearing up a bunch of flakey test issues so we're able to get better velocity overall.

@SaitejaTamma SaitejaTamma added the feature-freeze-do-not-merge Only eligible to be merged once we are out of feature freeze (next full release) label Feb 8, 2022
@markmandel
Copy link
Member

Oh man, this has been languishing for too long. Sorry! Perf season has hit, and we all got swamped. I haven't forgotten about this.

If it helps at all, happy for you try your proposed fix in this PR and see if it passed all our CI checks.

Copy link
Contributor Author

@WVerlaek WVerlaek left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@markmandel updated the PR to include the proposed fix, still need to make some other changes/cleanup if this works, but will leave as is to see if the tests pass

Comment on lines 226 to 228
fixtures := []bool{true} // , false} // TODO Enable these again
maxSurge := []string{"25%"} // , "10%"} // TODO
doCycle := true // TODO: fixture?
Copy link
Contributor Author

@WVerlaek WVerlaek Feb 14, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

doCycle should be a fixture as well (to test both with and without allocated GS), but that would double the number of tests to run from 4 to 8? Doesn't seem great, especially since they're all running in parallel in the same cluster?

@WVerlaek WVerlaek changed the title e2e test to reproduce issue#2397 Fix Rolling Update with Allocated GameServers Feb 14, 2022
@agones-bot
Copy link
Collaborator

Build Failed 😱

Build Id: aca927c1-717f-4e97-b529-4235b147bd09

To get permission to view the Cloud Build view, join the agones-discuss Google Group.

@SaitejaTamma SaitejaTamma removed the feature-freeze-do-not-merge Only eligible to be merged once we are out of feature freeze (next full release) label Feb 16, 2022
@markmandel
Copy link
Member

markmandel commented Mar 3, 2022

Finally come back around. I'm going to bump this CI test again (and upgrade to main) since this test failure feels flaky to me, but maybe it's not. Would like to confirm.

e2e-stable:

--- FAIL: TestHealthCheckDisable (70.76s)
    gameserver_test.go:132: 
        	Error Trace:	gameserver_test.go:132
        	Error:      	Not equal: 
        	            	expected: "Ready"
        	            	actual  : "Unhealthy"
        	            	
        	            	Diff:
        	            	--- Expected
        	            	+++ Actual
        	            	@@ -1,2 +1,2 @@
        	            	-(v1.GameServerState) (len=5) "Ready"
        	            	+(v1.GameServerState) (len=9) "Unhealthy"
        	            	 
        	Test:       	TestHealthCheckDisable

@agones-bot
Copy link
Collaborator

Build Failed 😱

Build Id: 081f46c4-bf8a-4fd0-8063-ee4fb5204e99

To get permission to view the Cloud Build view, join the agones-discuss Google Group.

@markmandel
Copy link
Member

Looks like e2e-stable was not happy:

--- FAIL: TestFleetRollingUpdate (137.90s)
    --- FAIL: TestFleetRollingUpdate/Use_fleet_Patch_true_25%_cycle_true (76.79s)
        fleet_test.go:299: 
            	Error Trace:	fleet_test.go:299
            	Error:      	Received unexpected error:
            	            	resource name may not be empty
            	Test:       	TestFleetRollingUpdate/Use_fleet_Patch_true_25%_cycle_true
        fleet_test.go:311: 
            	Error Trace:	fleet_test.go:311
            	Error:      	Received unexpected error:
            	            	timed out waiting for the condition
            	Test:       	TestFleetRollingUpdate/Use_fleet_Patch_true_25%_cycle_true

I'll run it one more time, see if we get consistent failures.

@agones-bot
Copy link
Collaborator

Build Failed 😱

Build Id: da010ed2-7d62-4229-aeab-40e08b5b1bf1

To get permission to view the Cloud Build view, join the agones-discuss Google Group.

@WVerlaek
Copy link
Contributor Author

WVerlaek commented Mar 9, 2022

Hmm, was able to reproduce the failure locally as well. Will take a look at what's going on

The most recent build does look like it failed on a different test:

Step #21 - "e2e-feature-gates":         	Messages:   	Wrong number of GameServerSets
Step #21 - "e2e-feature-gates":         	Test:       	TestUpdateFleetReplicaAndSpec
Step #21 - "e2e-feature-gates":         	Error:      	Condition never satisfied
Step #21 - "e2e-feature-gates":         	Error Trace:	fleet_test.go:416
Step #21 - "e2e-feature-gates":     fleet_test.go:416: 
Step #21 - "e2e-feature-gates": --- FAIL: TestUpdateFleetReplicaAndSpec (82.18s)

but this PR's test was running at the same time, might be related? Could the test cluster be running out of space?

@markmandel
Copy link
Member

but this PR's test was running at the same time,

We lock on each build so only one e2e test is running at a time. I also have some dashboards to check if the cluster is full.

There is some flakiness in the e2e test suite, but this doesn't 100% look like our usual flakes.

But I would focus on the reproducible issues, and then we can dig deeper.

@agones-bot
Copy link
Collaborator

Build Succeeded 👏

Build Id: 97b3b297-b7c1-4889-a0be-74ae139f8dde

The following development artifacts have been built, and will exist for the next 30 days:

A preview of the website (the last 30 builds are retained):

To install this version:

  • git fetch https://github.com/googleforgames/agones.git pull/2420/head:pr_2420 && git checkout pr_2420
  • helm install ./install/helm/agones --namespace agones-system --name agones --set agones.image.tag=1.22.0-088a1c8

@agones-bot
Copy link
Collaborator

Build Failed 😱

Build Id: 2d178aa5-e01b-477a-92c3-2cdf20fbe95b

To get permission to view the Cloud Build view, join the agones-discuss Google Group.

@agones-bot
Copy link
Collaborator

Build Failed 😱

Build Id: 7cd45406-508d-451a-9fdc-72a344778a16

To get permission to view the Cloud Build view, join the agones-discuss Google Group.

@WVerlaek
Copy link
Contributor Author

#2420 (comment) was a conflict in updating the Fleet again. Should hopefully be fixed now after using require.Eventually

Second build failure was a lint error

@agones-bot
Copy link
Collaborator

Build Failed 😱

Build Id: 44276cda-2370-44cf-bde5-9ee8b0082429

To get permission to view the Cloud Build view, join the agones-discuss Google Group.

@markmandel
Copy link
Member

Something got stuck maybe?

goroutine 431 [select]:
k8s.io/apimachinery/pkg/util/wait.WaitFor(0xc00104b1d0, 0x0, 0xc0008fc6c0)
	/go/src/agones.dev/agones/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:539 +0x16a
k8s.io/apimachinery/pkg/util/wait.pollInternal(0xc00081bc58, 0x1f4b6c0)
	/go/src/agones.dev/agones/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:427 +0xb0
k8s.io/apimachinery/pkg/util/wait.pollImmediateInternal(0x3b9aca00, 0x45d964b800)
	/go/src/agones.dev/agones/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:452 +0x65
k8s.io/apimachinery/pkg/util/wait.PollImmediate(0x2320a00, 0xc000a0c000, 0x0)
	/go/src/agones.dev/agones/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:441 +0x46
agones.dev/agones/test/e2e.TestFleetRollingUpdate.func1(0xc000a0c000)
	/go/src/agones.dev/agones/test/e2e/fleet_test.go:343 +0x1565
testing.tRunner(0xc000a0c000, 0xc0007b6040)
	/usr/local/go/src/testing/testing.go:1259 +0x230
created by testing.(*T).Run
	/usr/local/go/src/testing/testing.go:1306 +0x727

@WVerlaek
Copy link
Contributor Author

wait.PollImmediateUntil is returning a timeout error when the context gets cancelled at the end of the test, updated the test to ignore a timeout err

@agones-bot
Copy link
Collaborator

Build Succeeded 👏

Build Id: db367282-675b-4cec-abb5-f5eee48aef3d

The following development artifacts have been built, and will exist for the next 30 days:

A preview of the website (the last 30 builds are retained):

To install this version:

  • git fetch https://github.com/googleforgames/agones.git pull/2420/head:pr_2420 && git checkout pr_2420
  • helm install ./install/helm/agones --namespace agones-system --name agones --set agones.image.tag=1.22.0-57bcd5f

@agones-bot
Copy link
Collaborator

Build Succeeded 👏

Build Id: 3b17e8b6-0ab6-4036-91e1-2a8bcf475365

The following development artifacts have been built, and will exist for the next 30 days:

A preview of the website (the last 30 builds are retained):

To install this version:

  • git fetch https://github.com/googleforgames/agones.git pull/2420/head:pr_2420 && git checkout pr_2420
  • helm install ./install/helm/agones --namespace agones-system --name agones --set agones.image.tag=1.22.0-6bf956c

@markmandel
Copy link
Member

I was going to take this for a spin, but looks like it's consistently passing, so that is awesome.

Copy link
Member

@markmandel markmandel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Took it for a spin! Worked!

@google-oss-prow
Copy link

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: markmandel, WVerlaek

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Fleet RollingUpdate gets stuck when Fleet has high number of allocated GameServers
4 participants