Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: delay deleting GameServers in Error state #3428

Merged
merged 1 commit into from
Oct 17, 2023

Conversation

nrwiersma
Copy link
Contributor

@nrwiersma nrwiersma commented Oct 12, 2023

What type of PR is this?

Uncomment only one /kind <> line, press enter to put that in a new line, and remove leading whitespace from that line:

/kind breaking

/kind bug

/kind cleanup
/kind documentation
/kind feature
/kind hotfix
/kind release

What this PR does / Why we need it:

This PR addresses an issue in Agones when constrained by a ResourceQuota. A Fleet, specifically the active GameServerSet will attempt to scale past the ResourceQuota causing a large amount of network traffic on the Node running the Agones controller (~50Mb/s) as well as high load on etcd. GameServers where the pod creation is disallowed move into the Error state, immediately being deleted and a new GameServer created.

This issue is addressed by setting an annotation (agoned.dev/errored-at) with the timestamp of when it moved it the Error state. The GameServerSet controller will delay the deletion of these GameServers for at least 10s, in this time counting the GameServer as up and pending. As the reason for a GameServer moving into the Error state are limited (Incorrect spec or not being allowed to create the Pod) the slows the creation of GameServers in this case only, without affecting other areas of scaling.

In testing it was observed that the traffic on the Node went from ~50Mb/s using Agones v1.34.0 to ~5Mb/s using this patch.
Screenshot 2023-10-12 at 13 25 03

Which issue(s) this PR fixes:

Closes #3384

Special notes for your reviewer:

@github-actions github-actions bot added the kind/bug These are bugs. label Oct 12, 2023
@agones-bot
Copy link
Collaborator

Build Failed 😱

Build Id: 13e33ccd-16f8-4131-9cd2-dedf099810fe

To get permission to view the Cloud Build view, join the agones-discuss Google Group.

@agones-bot
Copy link
Collaborator

Build Failed 😱

Build Id: d903925f-d74e-4593-ad97-6e62597a0529

To get permission to view the Cloud Build view, join the agones-discuss Google Group.

@agones-bot
Copy link
Collaborator

Build Succeeded 👏

Build Id: f34877e9-20ee-4582-aae2-97177cbd8e9e

The following development artifacts have been built, and will exist for the next 30 days:

A preview of the website (the last 30 builds are retained):

To install this version:

  • git fetch https://github.com/googleforgames/agones.git pull/3428/head:pr_3428 && git checkout pr_3428
  • helm install agones ./install/helm/agones --namespace agones-system --set agones.image.registry=us-docker.pkg.dev/agones-images/ci --set agones.image.tag=1.36.0-dev-bea2147-amd64

@nrwiersma
Copy link
Contributor Author

We managed to do some testing on a production system, and captured the results of etcd. This was on a fresh cluster that was scaled to 200 GameServers with a ResourceQuota with a hard limit of 32 GameServers, so ~168 GameServers were in Error state. First this build was deployed at ~16:20 and back to Agones v1.34 at ~16:30.

image

Copy link
Member

@markmandel markmandel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for the delay, but looks good! Just a couple of small questions for you, but otherwise, looks good to go 👍🏻

Nice testing btw as well!

pkg/gameserversets/controller.go Outdated Show resolved Hide resolved
pkg/gameserversets/controller_test.go Show resolved Hide resolved
@agones-bot
Copy link
Collaborator

Build Failed 😱

Build Id: 574ef8ad-75a2-41e8-97a2-5502ac21a80b

To get permission to view the Cloud Build view, join the agones-discuss Google Group.

@agones-bot
Copy link
Collaborator

Build Succeeded 👏

Build Id: fd9b7391-18c6-448b-8e92-1b568bf846d1

The following development artifacts have been built, and will exist for the next 30 days:

A preview of the website (the last 30 builds are retained):

To install this version:

  • git fetch https://github.com/googleforgames/agones.git pull/3428/head:pr_3428 && git checkout pr_3428
  • helm install agones ./install/helm/agones --namespace agones-system --set agones.image.registry=us-docker.pkg.dev/agones-images/ci --set agones.image.tag=1.36.0-dev-98017db-amd64

@markmandel markmandel enabled auto-merge (squash) October 17, 2023 17:03
Copy link
Member

@markmandel markmandel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! Nice change!

@markmandel markmandel merged commit 93a4c4a into googleforgames:main Oct 17, 2023
3 checks passed
@google-oss-prow google-oss-prow bot added the lgtm label Oct 17, 2023
@google-oss-prow
Copy link

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: markmandel, nrwiersma

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@nrwiersma nrwiersma deleted the resource-quotas branch October 17, 2023 17:32
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

ResourceQuota causes high GameServer churn rate
3 participants