feat: support webhook fallback #3718

nrwiersma · 2024-03-19T13:18:47Z

What type of PR is this?

/kind feature

What this PR does / Why we need it:

This adds support for fallback policies on the webhook autoscaler policy. If the webhook fails, the autoscaler applies the configured fallback policy instead.

Which issue(s) this PR fixes:

Closes #3686

Special notes for your reviewer:

I found no way to make the CRDs self-referential. The only approach that worked was to set x-kubernetes-preserve-unknown-fields: true on the fallback policy.

agones-bot · 2024-03-19T13:27:20Z

Build Failed 😱

Build Id: 3c506d32-fcb3-4b46-9596-548133f4dbab

To get permission to view the Cloud Build view, join the agones-discuss Google Group.

agones-bot · 2024-03-19T13:47:01Z

Build Failed 😱

Build Id: 0f6ba3cf-0391-4048-a520-508fcef4a091

To get permission to view the Cloud Build view, join the agones-discuss Google Group.

github-actions · 2024-03-19T14:14:20Z

This PR exceeds the recommended size of 1000 lines. Please make sure you are NOT addressing multiple issues with one PR. Note this PR might be rejected due to its size.

agones-bot · 2024-03-19T15:20:10Z

Build Succeeded 👏

Build Id: f863e17c-00e7-4376-a6d7-c53b9e960692

The following development artifacts have been built, and will exist for the next 30 days:

A preview of the website (the last 30 builds are retained):

https://ba564dd-dot-preview-dot-agones-images.appspot.com/

To install this version:

git fetch https://github.com/googleforgames/agones.git pull/3718/head:pr_3718 && git checkout pr_3718
helm install agones ./install/helm/agones --namespace agones-system --set agones.image.registry=us-docker.pkg.dev/agones-images/ci --set agones.image.tag=1.40.0-dev-ba564dd-amd64

github-actions · 2024-03-19T16:41:18Z

This PR exceeds the recommended size of 1000 lines. Please make sure you are NOT addressing multiple issues with one PR. Note this PR might be rejected due to its size.

agones-bot · 2024-03-19T19:36:02Z

Build Succeeded 👏

Build Id: e0b5d167-b6b9-4a48-8bf0-07f9436b0ecb

The following development artifacts have been built, and will exist for the next 30 days:

A preview of the website (the last 30 builds are retained):

https://f82f492-dot-preview-dot-agones-images.appspot.com/

To install this version:

git fetch https://github.com/googleforgames/agones.git pull/3718/head:pr_3718 && git checkout pr_3718
helm install agones ./install/helm/agones --namespace agones-system --set agones.image.registry=us-docker.pkg.dev/agones-images/ci --set agones.image.tag=1.40.0-dev-f82f492-amd64

markmandel

Haven't had a chance to go deep, but figured I'd send you my first thing at least!

markmandel · 2024-03-28T00:13:23Z

install/helm/agones/templates/crds/fleetautoscaler.yaml

+                          properties:
+                            policy:
+                              type: object
+                              x-kubernetes-preserve-unknown-fields: true


I'd rather we have an actual spec here.

My thought here would be to take the webhook element, and turn it into a Helm include or template - much like we do for _gameserverstatus.yaml and use that in both spots.

The include will need a conditional though to make sure it only recurses one level, but since you can pass in a context structure that should be doable.
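As a rough illustration of the one-level recursion described above (not the template from this PR): Helm templates are Go text/template underneath, so the idea can be sketched in plain Go. The `schemaCtx` struct, its `Nested` method, and the single `url` property are hypothetical stand-ins for a Helm `dict` context and the real webhook schema.

```go
package main

import (
	"fmt"
	"os"
	"strings"
	"text/template"
)

// schemaCtx is a hypothetical context passed to the shared template,
// standing in for the dict a Helm include would receive.
type schemaCtx struct {
	Indent        string
	AllowFallback bool
}

// Nested returns the context for the single permitted level of recursion:
// deeper indentation, and no further fallback.
func (c schemaCtx) Nested() schemaCtx {
	return schemaCtx{Indent: c.Indent + "    ", AllowFallback: false}
}

// policySchema defines one reusable schema fragment. It includes itself
// under "fallback" only when the context allows it.
const policySchema = `{{ define "policy" -}}
{{ .Indent }}type: object
{{ .Indent }}properties:
{{ .Indent }}  url:
{{ .Indent }}    type: string
{{- if .AllowFallback }}
{{ .Indent }}  fallback:
{{ template "policy" .Nested }}
{{- end }}
{{- end }}`

// render executes the "policy" fragment with the given context.
func render(ctx schemaCtx) (string, error) {
	tmpl, err := template.New("crd").Parse(policySchema)
	if err != nil {
		return "", err
	}
	var sb strings.Builder
	if err := tmpl.ExecuteTemplate(&sb, "policy", ctx); err != nil {
		return "", err
	}
	return sb.String(), nil
}

func main() {
	out, err := render(schemaCtx{AllowFallback: true})
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Println(out)
}
```

The top-level call passes `AllowFallback: true`, and the nested call always receives `false`, so the schema is emitted exactly twice and the recursion cannot go deeper.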

agones-bot · 2024-03-28T07:41:06Z

Build Failed 😱

Build Id: 7cb2b5df-afd2-4749-90e6-2b5b0586b43b

To get permission to view the Cloud Build view, join the agones-discuss Google Group.

agones-bot · 2024-03-28T09:07:34Z

Build Succeeded 👏

Build Id: c7a1fb0e-4084-4702-b706-55fc6ac60a03

The following development artifacts have been built, and will exist for the next 30 days:

A preview of the website (the last 30 builds are retained):

https://dae3be5-dot-preview-dot-agones-images.appspot.com/

To install this version:

git fetch https://github.com/googleforgames/agones.git pull/3718/head:pr_3718 && git checkout pr_3718
helm install agones ./install/helm/agones --namespace agones-system --set agones.image.registry=us-docker.pkg.dev/agones-images/ci --set agones.image.tag=1.40.0-dev-dae3be5-amd64

markmandel

Looking good!

One thing we'll definitely need is some additional docs here:
https://agones.dev/site/docs/getting-started/create-fleetautoscaler/

Our docs publish on merge, so have a look here https://agones.dev/site/docs/contribute/ to see how to use a feature code to hide stuff until next release.

@zmerlynn @nrwiersma I'd love a second opinion -- should this be behind a feature flag? It's simple enough that it's probably not warranted, but I wanted some consensus before accepting it as is. WDYT?

markmandel · 2024-03-29T00:08:10Z

install/helm/agones/templates/crds/_fleetautoscalerpolicy.yaml

+# limitations under the License.
+
+{{/* schema for a fleet autoscaler policy */}}
+{{- define "fleetautoscaler.policy" }}


markmandel · 2024-03-29T00:12:49Z

pkg/apis/autoscaling/v1/fleetautoscaler.go

+
+	// Fallback defines how the autoscaler should behave in the event the webhook fails.
+	// +optional
+	Fallback *WebhookFallback `json:"fallback,omitempty"`


I'll ask the question just to be sure -- would we ever want a fallback for other policies? Or just webhook? (i.e. should this be further up).

I'm fairly sure the answer is "no", but wanted to triple check just in case.

nrwiersma · 2024-03-29

IMO no. The other policies are purely internal and cannot really fail. This is unique to webhooks.

markmandel · 2024-03-29T00:35:52Z

pkg/fleetautoscalers/fleetautoscalers.go

@@ -162,6 +175,20 @@ func applyWebhookPolicy(w *autoscalingv1.WebhookPolicy, f *agonesv1.Fleet) (repl
 		return 0, false, err
 	}

+	defer func() {


Rather than this fancy defer stuff 😄 should we create a new function called applyWebhookPolicyWithFallback()?

That would replace the call above with:

case autoscalingv1.WebhookPolicyType:
	return applyWebhookPolicyWithFallback(pol.Webhook, f, gameServerLister, nodeCounts)

It can call applyWebhookPolicy and then do the appropriate fallback handling with error management, without the complexity of working out what is happening in a defer statement 😄
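A minimal sketch of the wrapper shape suggested above, assuming simplified stand-in types rather than the real Agones signatures. `policyFunc`, `applyWithFallback`, and `errUnreachable` are hypothetical names used only for illustration.

```go
package main

import (
	"errors"
	"fmt"
)

// policyFunc is a hypothetical stand-in for one policy evaluation, such as
// applyWebhookPolicy. It returns the desired replica count, whether that
// count was limited, and an error.
type policyFunc func() (replicas int32, limited bool, err error)

// errUnreachable simulates a failing webhook call in this sketch.
var errUnreachable = errors.New("webhook unreachable")

// applyWithFallback runs primary first. Only if primary fails and a
// fallback is configured does it run the fallback. If both fail, the
// returned error mentions both failures.
func applyWithFallback(primary, fallback policyFunc) (int32, bool, error) {
	replicas, limited, err := primary()
	if err == nil {
		return replicas, limited, nil
	}
	if fallback == nil {
		return 0, false, err
	}
	replicas, limited, fbErr := fallback()
	if fbErr != nil {
		return 0, false, fmt.Errorf("webhook failed (%v) and fallback failed: %w", err, fbErr)
	}
	return replicas, limited, nil
}

func main() {
	failingWebhook := func() (int32, bool, error) { return 0, false, errUnreachable }
	buffer := func() (int32, bool, error) { return 5, false, nil }

	replicas, limited, err := applyWithFallback(failingWebhook, buffer)
	fmt.Println(replicas, limited, err) // prints: 5 false <nil>
}
```

Keeping the fallback in an explicit wrapper means the control flow reads top to bottom, and the original error is still available when the fallback also fails.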

zmerlynn · 2024-03-29T14:00:28Z

I’ll review this morning - I’d like to take a look though

zmerlynn · 2024-03-30T00:10:26Z

@zmerlynn @nrwiersma I'd love a second opinion -- should this be behind a feature flag? It's simple enough that it's probably not warranted, but I wanted some consensus before accepting it as is. WDYT?

Let's go ahead and feature flag it. We've got resourcing enough that we were going to work on scheduled fleet autoscalers soon, and while I've done some thinking on an API for that, I haven't done a ton.

I have some thoughts on using something like the API added here, but instead creating a new Chain type of policy that operates as a first-match chain of policies, with conditionals. Feels like we could do scheduling pretty easily that way - and we could just implement webhook fallback that way as well (i.e. a chain of Webhook -> Buffer where Webhook matches unless it fails). But that said, I don't think these APIs are incompatible, I just want a little time since we think we're going to be changing this for something else soon.

ETA: Outside of that, the code looks reasonable and love the helm templating. I'll defer to Mark for detailed review.

markmandel · 2024-04-01T23:31:36Z

Chatting with @zmerlynn - I agree, let's feature flag it, so that if we need / want to adjust the API surface once we also tackle autoscaler scheduling, we can break things if need be.

For steps on feature flagging:

And to be fair, better safe than sorry 👍🏻 This way we can get this in for the next release and you can use it without having to wait for the scheduling implementation (which is really the point of feature flags!)

aRestless · 2024-04-02T13:15:31Z

@nrwiersma and I discussed this internally, and if the decision for a Chain policy basically stands, then a Webhook -> Buffer chain is the more generic solution. It's then pretty much guaranteed that the fallback in the Webhook policy being introduced in this PR will be (and should be!) abandoned as soon as the Chain policy is available.

So instead of investing time into a solution path that is highly likely to get deprecated soon, it makes more sense to us to contribute to the next proper iteration of this feature.

Which brings me to the question of whether there is already a basic minimum set of definitions for a Chain policy that can be agreed upon, e.g.

a Chain policy contains a list of policies
the first matching policy of a Chain policy will be applied
a policy in the list does not "match" if it errors
a policy also does not match if it has a "condition" attached to it and its condition is not fulfilled (with conditions and their syntax to be defined later)
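The four rules above can be sketched roughly as follows. This is an illustration of the proposed semantics only, not an agreed API: `chainEntry`, `evaluateChain`, and `errNoMatch` are hypothetical names, and conditions are modelled as plain functions because their syntax is still to be defined.

```go
package main

import (
	"errors"
	"fmt"
)

// chainEntry is a hypothetical entry in a Chain policy: an optional
// condition plus the policy to evaluate when the condition holds.
type chainEntry struct {
	name      string
	condition func() bool           // nil means the entry is always eligible
	apply     func() (int32, error) // returns the desired replica count
}

// errNoMatch is returned when every entry was skipped or failed.
var errNoMatch = errors.New("no policy in the chain matched")

// evaluateChain returns the result of the first entry that matches.
// An entry does not match if its condition is false or if it errors.
func evaluateChain(chain []chainEntry) (string, int32, error) {
	for _, e := range chain {
		if e.condition != nil && !e.condition() {
			continue // condition not fulfilled: fall through to the next entry
		}
		replicas, err := e.apply()
		if err != nil {
			continue // policy errored: fall through to the next entry
		}
		return e.name, replicas, nil
	}
	return "", 0, errNoMatch
}

func main() {
	chain := []chainEntry{
		{name: "webhook", apply: func() (int32, error) { return 0, errors.New("webhook timed out") }},
		{name: "buffer", apply: func() (int32, error) { return 5, nil }},
	}
	name, replicas, err := evaluateChain(chain)
	fmt.Println(name, replicas, err) // prints: buffer 5 <nil>
}
```

With these rules, the webhook fallback from this PR becomes the two-entry chain in `main`: the webhook entry errors, so the buffer entry is applied.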

If that's the case, we'd rather close this PR and do a first pass on the Chain policy instead.

zmerlynn · 2024-04-02T15:30:49Z

@aRestless If y'all are willing to start on that, it would be much appreciated - but I had a counterpoint view: I think even if we wanted to add in the concept of a Chain policy, having fallback explicit makes sense to me. I say that because it just seems kind of messy to have an API that's like [ policy1, policy2, policy3 ] but the conditional for policy1 is failure to execute the policy. In other words, it's OK to chain the policies and do some sort of first-match thing if the controller can evaluate the chain using values it has in front of it at initial execution time (e.g. date-time, whatever else we might add).

Now, I mostly wanted to express that view, but I do like your framing here: a policy falls through to the next policy either if it fails (Webhook or whatever else we might add that has error deps), or if some conditional fails. That seems easy enough to express and means that the structure is "flat" (whereas my differentiation between "failure" and "conditional" feels oddly branchy). So I feel like I'm more on your side.

Which actually brings me to: should we allow a Chain of Chains? There's no particular reason to disallow it, but maybe in the interest of avoiding too much complexity, we start with just a simple linear Chain of other non-compound policies?

(I'm also not tied to Chain but it made sense to me.)

markmandel · 2024-04-02T23:54:31Z

Oooh, this is interesting stuff 👍🏻

Since we're heading into design discussion, I took my design thoughts over to #3008 (comment) and tagged you all to discuss (I find audit trails for decisions easier to track in Issues, than PRs).

nrwiersma · 2024-04-03T08:40:47Z

It seems there is agreement on something like Chain. I will then close this.

github-actions bot added the kind/feature and size/M labels Mar 19, 2024

github-actions bot added the size/XL label Mar 19, 2024

markmandel reviewed Mar 28, 2024

nrwiersma added 5 commits March 28, 2024 08:34

feat: support fallback for webhook autoscaler

57ddd02

feat: regenerate client

3c239c3

feat: update crds

0121c22

fix: linter and tests

fac3419

feat: update docs

18afd0c

nrwiersma force-pushed the webhook-fallback branch from f82f492 to 18afd0c Compare March 28, 2024 07:10

github-actions bot added the size/L label Mar 28, 2024

fix: cr suggestion

dae3be5

nrwiersma force-pushed the webhook-fallback branch from e945fc3 to dae3be5 Compare March 28, 2024 07:58

markmandel reviewed Mar 29, 2024

markmandel mentioned this pull request Apr 2, 2024

Feature Request: Scheduled Autoscalers #3008

Open

nrwiersma closed this Apr 3, 2024

markmandel mentioned this pull request Jun 26, 2024

feat: Add CRD Changes and Feature Flag for chain policy #3880

Merged
