fix: delay deleting GameServers in Error state #3428
We managed to do some testing on a production system and captured the etcd results. This was on a fresh cluster that was scaled to 200 GameServers with a ResourceQuota hard limit of 32 GameServers, so ~168 GameServers were in the Error state. This build was deployed first, at ~16:20, and the cluster was rolled back to Agones v1.34 at ~16:30.
Sorry for the delay, but looks good! Just a couple of small questions for you, but otherwise, looks good to go 👍🏻
Nice testing btw as well!
LGTM! Nice change!
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: markmandel, nrwiersma
What type of PR is this?
/kind bug
What this PR does / Why we need it:
This PR addresses an issue in Agones when it is constrained by a ResourceQuota. A Fleet, specifically its active GameServerSet, will attempt to scale past the ResourceQuota, causing a large amount of network traffic on the Node running the Agones controller (~50Mb/s) as well as high load on etcd. GameServers whose Pod creation is disallowed move into the Error state, are immediately deleted, and are replaced by new GameServers, which then fail in the same way.
This issue is addressed by setting an annotation (agones.dev/errored-at) on the GameServer with the timestamp of when it moved into the Error state. The GameServerSet controller delays the deletion of these GameServers for at least 10s, and during this time counts the GameServer as up and pending. As the reasons for a GameServer moving into the Error state are limited (an incorrect spec or not being allowed to create the Pod), this slows the creation of GameServers in this case only, without affecting other areas of scaling.
In testing it was observed that the traffic on the Node went from ~50Mb/s using Agones v1.34.0 to ~5Mb/s using this patch.
Which issue(s) this PR fixes:
Closes #3384
Special notes for your reviewer: