Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: support webhook fallback #3718

Closed
wants to merge 6 commits into from

Conversation

nrwiersma
Copy link
Contributor

What type of PR is this?

Uncomment only one /kind <> line, press enter to put that in a new line, and remove leading whitespace from that line:

/kind breaking
/kind bug
/kind cleanup
/kind documentation
/kind feature
/kind hotfix
/kind release

What this PR does / Why we need it:

This adds support for fallback policies that are applied when the webhook fails. If the webhook were to fail, the autoscaler will apply the configured fallback policy.

Which issue(s) this PR fixes:

Closes #3686

Special notes for your reviewer:

I found no way to get the CRDs to be self referential. The only way I got it to work was to use x-kubernetes-preserve-unknown-fields: true on the fallback policy.

@github-actions github-actions bot added kind/feature New features for Agones size/M labels Mar 19, 2024
@agones-bot
Copy link
Collaborator

Build Failed 😱

Build Id: 3c506d32-fcb3-4b46-9596-548133f4dbab

To get permission to view the Cloud Build view, join the agones-discuss Google Group.

@agones-bot
Copy link
Collaborator

Build Failed 😱

Build Id: 0f6ba3cf-0391-4048-a520-508fcef4a091

To get permission to view the Cloud Build view, join the agones-discuss Google Group.

Copy link

This PR exceeds the recommended size of 1000 lines. Please make sure you are NOT addressing multiple issues with one PR. Note this PR might be rejected due to its size.

@agones-bot
Copy link
Collaborator

Build Succeeded 👏

Build Id: f863e17c-00e7-4376-a6d7-c53b9e960692

The following development artifacts have been built, and will exist for the next 30 days:

A preview of the website (the last 30 builds are retained):

To install this version:

  • git fetch https://github.com/googleforgames/agones.git pull/3718/head:pr_3718 && git checkout pr_3718
  • helm install agones ./install/helm/agones --namespace agones-system --set agones.image.registry=us-docker.pkg.dev/agones-images/ci --set agones.image.tag=1.40.0-dev-ba564dd-amd64

Copy link

This PR exceeds the recommended size of 1000 lines. Please make sure you are NOT addressing multiple issues with one PR. Note this PR might be rejected due to its size.

@agones-bot
Copy link
Collaborator

Build Succeeded 👏

Build Id: e0b5d167-b6b9-4a48-8bf0-07f9436b0ecb

The following development artifacts have been built, and will exist for the next 30 days:

A preview of the website (the last 30 builds are retained):

To install this version:

  • git fetch https://github.com/googleforgames/agones.git pull/3718/head:pr_3718 && git checkout pr_3718
  • helm install agones ./install/helm/agones --namespace agones-system --set agones.image.registry=us-docker.pkg.dev/agones-images/ci --set agones.image.tag=1.40.0-dev-f82f492-amd64

Copy link
Member

@markmandel markmandel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Haven't had a chance to go deep, but figured I'd send you my first thing at least!

properties:
policy:
type: object
x-kubernetes-preserve-unknown-fields: true
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd rather we have an actual spec here.

My thought here would be to take the webhook element, and turn it into a Helm include or template - much like we do for _gameserverstatus.yaml and use that in both spots.

The include will need a conditional though to make sure it only recurses one level, but since you can pass in a context structure that should be doable.

@agones-bot
Copy link
Collaborator

Build Failed 😱

Build Id: 7cb2b5df-afd2-4749-90e6-2b5b0586b43b

To get permission to view the Cloud Build view, join the agones-discuss Google Group.

@agones-bot
Copy link
Collaborator

Build Succeeded 👏

Build Id: c7a1fb0e-4084-4702-b706-55fc6ac60a03

The following development artifacts have been built, and will exist for the next 30 days:

A preview of the website (the last 30 builds are retained):

To install this version:

  • git fetch https://github.com/googleforgames/agones.git pull/3718/head:pr_3718 && git checkout pr_3718
  • helm install agones ./install/helm/agones --namespace agones-system --set agones.image.registry=us-docker.pkg.dev/agones-images/ci --set agones.image.tag=1.40.0-dev-dae3be5-amd64

Copy link
Member

@markmandel markmandel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking good!

One thing we'll definitely need is some additional docs here:
https://agones.dev/site/docs/getting-started/create-fleetautoscaler/

Our docs publish on merge, so have a look here https://agones.dev/site/docs/contribute/ to see how to use a feature code to hide stuff until next release.

@zmerlynn @nrwiersma I'd love a second opinion -- should this be behind a feature flag? It's simple enough that it's probably not warranted, but wanted some consensus before making accepting as is. WDYT?

# limitations under the License.

{{/* schema for a fleet autoscaler policy */}}
{{- define "fleetautoscaler.policy" }}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Love it!


// Fallback defines how the autoscaler should behave in the event the webhook fails.
// +optional
Fallback *WebhookFallback `json:"fallback,omitempty"`
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll ask the question just to be sure -- would we ever want a fallback for other policies? Or just webhook? (i.e. should this be further up).

I'm fairly sure the answer is "no", but wanted to triple check just in case.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMO no. The other policies are purely internal and cannot really fail. This is unique to Webhooks.

@@ -162,6 +175,20 @@ func applyWebhookPolicy(w *autoscalingv1.WebhookPolicy, f *agonesv1.Fleet) (repl
return 0, false, err
}

defer func() {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rather than this fancy defer stuff 😄 should we create a new function called applyWebhookPolicyWithFallback()

That replaces the call above to:

	case autoscalingv1.WebhookPolicyType:
		return applyWebhookPolicyWithFallback(pol.Webhook, f, gameServerLister, nodeCounts)

Which can call applyWebhookPolicy and then does the appropriate fallback handling with error management, without the complexity of working out what is happening in a defer statement? 😄

@zmerlynn
Copy link
Collaborator

I’ll review this morning - I’d like to take a look though

@zmerlynn
Copy link
Collaborator

zmerlynn commented Mar 30, 2024

@zmerlynn @nrwiersma I'd love a second opinion -- should this be behind a feature flag? It's simple enough that it's probably not warranted, but wanted some consensus before making accepting as is. WDYT?

Let's go ahead and feature flag it. We've got resourcing enough that we were going to work on scheduled fleet autoscalers soon, and while I've done some thinking on an API for that, I haven't done a ton.

I have some thoughts on using something like the API added here, but instead creating a new Chain type of policy that operates as a first-match chain of policies, with conditionals. Feels like we could do scheduling pretty easy that way - and we could just implement webhook fallback that way as well (I.e. a chain of Webhook -> Buffer where Webhook matches unless it fails). But that said, I don't think these APIs are incompatible, I just want a little time since we think we're going to be changing this for something else soon.

ETA: Outside of that, the code looks reasonable and love the helm templating. I'll defer to Mark for detailed review.

@markmandel
Copy link
Member

Chatting with @zmerlynn - I agree, let's feature flag it, so that if we need / want to adjust the API surface once we also tackle autoscaler scheduling, we can break things if need be.

For steps on feature flagging:

And to be fair - better safe than sorry 👍🏻 and this way we can get this in for the next release, you can use it - without having to wait for the scheduling implementation (which is really the point in feature flags!)

@aRestless
Copy link

@nrwiersma and I discussed this internally and if the decision for a Chain policy basically stands, then a Webhook -> Buffer chain is the more generic solution. It's pretty much guaranteed then that the fallback in the Webhook policy that is being introduced in this PR will (and should be!) abandoned as soon as the Chain policy is available.

So instead of investing time into a solution path that is highly likely to get deprecated soon, it makes more sense to us to contribute to the next proper iteration of this feature.

Which brings me to the question if there is already some basic minimum set of definitions for a Chain policy that can be agreed upon, e.g.

  • a Chain policy contains a list of policies
  • the first matching policy of a Chain policy will be applied
  • a policy in the list does not "match" if it errors
  • a policy also does not match if it has a "condition" attached to it and its condition is not fulfilled (with conditions and their syntax to be defined later)

If that's the case, we'd rather close this PR and do a first pass on the Chain policy instead.

@zmerlynn
Copy link
Collaborator

zmerlynn commented Apr 2, 2024

@aRestless If y'all are willing to start on that, it would be much appreciated - but I had a counterpoint view: I think even if we wanted to add in the concept of a Chain policy, having fallback explicit makes sense to me. I say that because it just seems kind of messy to have an API thats like [ policy1, policy2, policy3 ] but the conditional for policy1 is failure to execute the policy. In other words, It's ok to chain the policies and do some sort of first match thing if the controller can evaluate the chain using values it has in front of it at initial execution time (e.g. date-time, whatever else we might add).

Now, I mostly wanted to express that view, but I do like your framing here: a policy falls through to the next policy either if it fails (Webhook or whatever else we might add that has error deps), or if some conditional fails. That seems easy enough to express and means that the structure is "flat" (whereas my odd differentiation between "failure" and "conditional" feels oddly branchy.) So I feel like on more on your side.

Which actually brings me to.. Should we allow Chain of Chain? There's no particular reason not to disallow it but maybe in the interest of avoiding too much complexity, we start with just a simple linear Chain of other non-compound policies?

(I'm also not tied to Chain but it made sense to me.)

@markmandel
Copy link
Member

Oooh, this is interesting stuff 👍🏻

Since we're heading into design discussion, I took my design thoughts over to #3008 (comment) and tagged you all to discuss (I find audit trails for decisions easier to track in Issues, than PRs).

@nrwiersma
Copy link
Contributor Author

It seems there is agreement on something like Chain. I will then close this.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Autoscaler WebHook fallback Policy
5 participants