
Feature Request: Scheduled Autoscalers #3008

Open
austin-space opened this issue Mar 1, 2023 · 28 comments
Labels
awaiting-maintainer Block issues from being stale/obsolete/closed kind/feature New features for Agones

Comments

@austin-space
Contributor

austin-space commented Mar 1, 2023

Note

If you want to skip a bunch of discussion and see @indexjoseph's proposed design, see #3008 (comment)

Is your feature request related to a problem? Please describe.
During scheduled in-game events or new version releases, we see rapid spikes in usage of either an already high-use fleet or a new, as-yet-unused fleet. We know the timing of both of these events in advance, and our current options are:

  1. Prescale aggressively: this works, but unless we build scheduling logic ourselves to undo the additional scale afterwards, we're paying for a lot of unused capacity.
  2. Webhook autoscaler: this is a viable solution, but requires us to build and operate a service to do so.

Describe the solution you'd like
Introduce the concept of scheduled overrides that contain the following:

  1. a start time (in UTC)
  2. an end time (in UTC)
  3. a priority int (higher is better, much like PriorityClasses)
  4. a buffer autoscaler block

Then on autoscaling evaluation:

  1. collect those overrides for which we are between the start and end time
    a. if there are no matching overrides, just use the default autoscaling rule
  2. of those select the highest priority
  3. apply that buffer autoscaling rule instead of the default

This would allow us to set special scaling windows for events or new version releases. A further extension could be to allow recurring windows for time-of-day scheduling, so that we could have a buffer window in the off hours and a percentage during higher usage, which could help with issues like the one described in #2504

Describe alternatives you've considered
As described at the top, we can either prescale aggressively, which means adjusting the autoscaler directly, or use the webhook autoscaler.

@austin-space austin-space added the kind/feature New features for Agones label Mar 1, 2023
@markmandel
Member

Brainstorm thought I had while looking at this - we would need to store on the autoscaler CRD whether the scheduling had happened or not (past 3 entries?), just in case the controller goes down over the scheduling time period, so it can do the work it was supposed to do before.

@austin-space
Contributor Author

austin-space commented Mar 1, 2023

I'm not sure that you would need to. This is intended to be stateless, so it just pretends that it's a normal buffer autoscaler. It just so happens that the current buffer rule would be determined by the highest priority active override.

That being said, it's probably worth updating the CRD with the last scale rule used, in order to get a sense of why it might be scaling to some level (or not). In that case a name field would be needed for any overrides.

@markmandel
Member

Oh I see - so the actual result would be something like

default buffer: 10%
Between 1am and 5am: scale buffer 5% (low peak time)
Between 1pm and 5pm: scale buffer 20% (high peak time)

Then on each autoscaler loop you would just be looking at which buffer to apply. That makes a lot of sense 👍🏻

I had it in my head more of a "at 1pm exactly do this thing", but this is better. I like it!

@austin-space
Contributor Author

Yeah, that seemed like the way to keep the changes to the autoscaler as minimal and resilient as possible.

Your comment did remind me that there's an important distinction here between "scheduled once" and "scheduled recurring". Ideally I want to be able to do both (which could be as easy as specifying just a time when you want daily recurrence, and a datetime when you want a single occurrence), but I suspect that there will be a desire for day of week recurrence. For now I feel like that can probably be avoided for simplicity, since that complexity can spiral out of control quickly.

@markmandel
Member

You could also enable a change in min/max as well during a time period?

@austin-space
Contributor Author

Yeah, I think that would be ideal. That way I can articulate the widest range of scheduled events. For example, you provided a very good example of when the % buffer might be useful, but I might actually want to set a lower floor as well at night. Likewise for a scheduled event, I may want to do something like:

  1. Set up a new fleet with the same image type, and a label that indicates its servers are for that event.
  2. 10 minutes before the event, have a scheduled scaling event kick in that sets the floor to the number of additional servers we think we need for the event.
  3. A while into the event, flip the minimum back off so that as people leave, the fleet naturally winds down.
  4. Scale down to 0 after the event is over.

That way we don't try to aggressively scale up as people join the event, we just fill out a fleet that we've already allocated. This puts less load on the cluster during a high stress event, and saves money on overall usage.

@austin-space
Contributor Author

It would probably make sense to either fall back to the default or reject the CRD if there are any values not provided so as to avoid configuration mistakes.

@markmandel
Member

So coming from the conversation in #3718 (@zmerlynn, @aRestless, @nrwiersma), I wanted to capture some thoughts here on a potential "policy chain" implementation. I think in actual examples, so this lets me flesh things out.

So my first thought was backward compatibility, but I think that's easy with the CRD constructs we already have: we just add a new type parameter to FleetAutoscalerSpec, and default it to Policy - so you could have:

apiVersion: autoscaling.agones.dev/v1
kind: FleetAutoscaler
metadata:
  name: simple-game-server-autoscaler
spec:
  fleetName: simple-game-server
  type: Policy # this is the new bit, and "policy" would be the default.
  policy:
    type: Buffer
    buffer:
      bufferSize: 2
      minReplicas: 0
      maxReplicas: 10

But if you wanted to have a chain, then type would be "chain", like so:

apiVersion: autoscaling.agones.dev/v1
kind: FleetAutoscaler
metadata:
  name: simple-game-server-autoscaler
spec:
  fleetName: simple-game-server
  type: Chain # so now populate the `chain` child element.
  chain:
    - type: Webhook
      webhook:
        service:
          name: autoscaler-webhook-service
          namespace: default
          path: scale
    - policy:
        type: Buffer
        buffer:
          bufferSize: 2
          minReplicas: 0
          maxReplicas: 10

Not 100% sure how to capture the fall-through on a Webhook (maybe it doesn't need to be captured)? From here, we can probably add different types of Scheduling entries (start and end dates, recurring, cron? etc.) that would allow scheduling.

100% a sacrificial draft, so feel free to play with it, but WDYT?

@zmerlynn
Collaborator

zmerlynn commented Apr 3, 2024

Let's limit the design for now just to the fallback discussion from #3718 / #3686 - we'll have someone working soon on the scheduling part.

However, lifting a bit from the conversation at #3718 (comment), @aRestless was proposing: a policy falls through to the next policy either if it fails (Webhook, or whatever else we might add that has error returns), or if some conditional isn't met. That seems easy enough to reason about, and means we would basically have linear cascading of policies that are either:

  • an external RPC, in which case we fall-through if the RPC fails (e.g. Webhook, but you could imagine tying to a metric or any number of other things that could error)
  • a conditional based on yet-to-be-defined API fields and values we have on hand (datetime being the obvious one) - in which case we fall-through if the conditional fails
  • we require the last element of the chain to be non-conditional / non-erroring so there's always some policy (maybe this isn't necessary? maybe this just means "don't modify it if nothing applies"?)

@markmandel
Member

we require the last element of the chain to be non-conditional / non-erroring so there's always some policy (maybe this isn't necessary? maybe this just means "don't modify it if nothing applies"?)

Or that's an exercise for the user -- don't put anything at the end you don't want to be the last chance at success.

I don't think we can force it?

@aRestless

we require the last element of the chain to be non-conditional / non-erroring so there's always some policy (maybe this isn't necessary? maybe this just means "don't modify it if nothing applies"?)

I think that "change nothing" is a reasonable default policy if all other steps error or their conditions weren't met. After all, it's the only action that makes sense for an empty list of policies (if that's something we want to allow).

There was another topic that was brought up, and that's if a "chain of chains" is valid.

A chain of chains would make it very easy to shoot oneself in the foot by building nested logical constructs that become hard to reason about. In my opinion there is a healthy pragmatism in saying that anything that gets a little bit more complex simply belongs into a webhook. And since this extension point exists, native support for other functionality might only be warranted for features that cannot be put into the webhook (e.g. fallback for webhook failing) or functionality that is likely to see widespread usage (e.g. the schedules proposed here).

I'm struggling to even come up with a use case for nested chains - but maybe someone else has thoughts on that.

@zmerlynn
Collaborator

zmerlynn commented Apr 3, 2024

I think that "change nothing" is a reasonable default policy if all other steps error or their conditions weren't met. After all, it's the only action that makes sense for an empty list of policies (if that's something we want to allow).

Agreed. Though should we allow a chain of zero? Maybe - it allows someone to construct an object they can manipulate later without having to insert, I.e. if you imagine having scheduling automation that inserts your schedule rules and maybe you just don't have any? Certainly seems fair for a chain of zero just to mean "do nothing".

And since this extension point exists, native support for other functionality might only be warranted for features that cannot be put into the webhook (e.g. fallback for webhook failing) or functionality that is likely to see widespread usage (e.g. the schedules proposed here).

Agreed. With branching chains I'd be awfully tempted to support a unit test element as well. 😆

Sounds like we have consensus that:

  • chains should be linear
  • if you fall off the end of the chain, the autoscaler takes no action.

and possible consensus that chains may be empty even if it's useless.

@markmandel
Member

I concur on the above as well. Only thing I'd be explicit about is to put it behind a feature gate, just so we have room to experiment / change things if we need to.

@markmandel
Member

Though should we allow a chain of zero? Maybe

I think we should - I don't see any reason not to give people the option. We already let people set a fleet name that is invalid.

@austin-space
Contributor Author

Off topic for the scheduled use case, but related to the policy chain: a case I've been running into recently, which it seems like the policy chain could solve (depending on what is allowed in the conditions for falling through), is that I want to set an X% buffer policy, but I also want to make sure that there are N ready game servers available. For example, if the peak number of allocated game servers in a cluster is 100, I may want a buffer of 10%. However, when the cluster is at its lowest usage, it might only have 10 allocated game servers, which would leave only 1 game server in the ready state. I'd love to be able to say "if allocated game servers are below 50, keep a ready buffer of 5, otherwise keep a buffer of 10%" or something to that effect.

That particular kind of feature could be a bit of a footgun: if a user is not careful to make those boundaries somewhat smooth, they could end up doing a lot more scaling up and down when hovering around the boundary (e.g. 10% for under 100 allocated instances and 5% above would mean a scale-down operation as soon as they cross 100 allocated instances). However, I don't think that's too terrible an outcome, especially if the autoscaling interval isn't set too low.
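As a sketch of the rule in that example (the threshold and buffer numbers are the ones from the comment, not a proposed API), note that these particular values meet smoothly at the boundary, which avoids the flapping described:

```go
package main

import (
	"fmt"
	"math"
)

// desiredReady sketches the conditional buffer rule described above:
// below a threshold of allocated servers, keep a fixed ready buffer;
// at or above it, keep a percentage buffer.
func desiredReady(allocated int) int {
	const (
		threshold   = 50   // switch point between the two rules
		fixedBuffer = 5    // absolute ready buffer at low usage
		percent     = 0.10 // percentage buffer at higher usage
	)
	if allocated < threshold {
		return fixedBuffer
	}
	return int(math.Ceil(float64(allocated) * percent))
}

func main() {
	for _, a := range []int{10, 49, 50, 100} {
		fmt.Printf("allocated=%d -> ready buffer=%d\n", a, desiredReady(a))
	}
}
```

At allocated=49 the rule yields 5, and at allocated=50 it yields ceil(50 x 0.10) = 5 as well, so crossing the threshold does not itself trigger a scaling operation.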

@zmerlynn
Collaborator

zmerlynn commented Apr 3, 2024

@austin-space Seems like something that could be implemented as a conditional in the chain, though we'd have to be careful about how we define it so that it's deterministic at evaluation. We'll have a resource working on the scheduling case soon; I can see if we can work on that as a follow-on.

@aRestless

Reading about the "footgun" aspects, and the need to debug complex setups, I'm wondering if there's a need to track some aspects of the last scaling decision in the FleetAutoscaler status, e.g. inputs, results, errors that occurred, policy (in the chain) that was actually used.

Or would that be a pure logging topic to you folks?

@markmandel
Member

This seems like an appropriate use of the Kubernetes event stream on the Autoscaler - which we already do.

We don't want to spam it too much though, so we should be judicious on what we add as an event - but it should track state changes - and especially if something fails (i.e. if a webhook fails, we should definitely log that as a specific event).

@markmandel
Member

Random thought for today - we actually have prior art in Agones for "do the things in a list, in order, and if the first one fails, do the next one" - so the concept is definitely not foreign to the project.

In GameServerAllocationSpec we do exactly this with selectors.

Almost makes me wonder if chain should be selectors... but it doesn't quite fit.

@zmerlynn
Collaborator

zmerlynn commented Apr 22, 2024

@nrwiersma Someone will be working on scheduled autoscalers starting in 3-4 weeks - are you interested in re-driving #3718 with the above discussion prior to that? If not, do you mind if we adapt it? Thanks!

@nrwiersma
Contributor

@zmerlynn You are welcome to adapt it to your needs.


github-actions bot commented Jun 1, 2024

This issue is marked as Stale due to inactivity for more than 30 days. To avoid being marked as 'stale' please add the 'awaiting-maintainer' label or add a comment. Thank you for your contributions.

@github-actions github-actions bot added the stale Pending closure unless there is a strong objection. label Jun 1, 2024
@markmandel markmandel added awaiting-maintainer Block issues from being stale/obsolete/closed and removed stale Pending closure unless there is a strong objection. labels Jun 3, 2024
@markmandel
Member

Setting awaiting-maintainer since this is on our roadmap.

@indexjoseph
Contributor

indexjoseph commented Jun 17, 2024

Scheduled Fleet Autoscaling Design

TLDR: This document outlines the design for a new feature in Agones that enables scheduled autoscaling for Fleet Autoscalers. This functionality allows users to define time windows for automatic adjustments to game server fleets based on predictable events or usage patterns.

Requirements

Critical User Journeys

  • Content Launch: Have a default autoscaling policy, but on <datetime> scale the fleet up.
    • Version Release
    • Flash Sale
    • E.g. On the 16th of January, 2025, make the buffer size 20% instead of the default 10%
  • Periodical Cycle: Between these times, on these days of the week, have a different autoscaling policy than the default.
    • Regional Rollouts - e.g. For clusters in zone us-west-1, at 10:00 AM increase the buffer size to 10000 vs the usual 2000
    • Nightly Maintenance - e.g. Every night at 12:00 am, reduce the buffer size to 5% vs the usual 30%
    • Weekend Cycle - e.g. Every Saturday and Sunday increase the buffer size to 40% vs the usual 30%

Proposed Solution

  1. FleetAutoscaler CRD Field Expansion: Introduce new fields within the existing Fleet Autoscaler Custom Resource Definition (CRD) to accommodate scheduled scaling configurations.
    Implement a chain policy - defines a chain, indicating a sequence of conditions (associated with policies) to be evaluated.
    Implement schedule for chains - defines the scheduling criteria for applying scaling logic.
  2. Feature Gate Implementation: Introduce a feature gate to control access to the new scheduled autoscaling functionality.

Proposed FleetAutoscaler CRD Changes

This section defines the structure of scheduling and applying a policy within an Agones Fleet Autoscaler. It allows you to control when the autoscaler considers scaling the game server fleet based on your specified criteria.

The format includes two parameters for scheduling:

  1. Evaluation Time Window (between): This uses start and end datetimes, which must conform to RFC3339, to define a time range. The policy application window will only be evaluated within this window (e.g. start evaluating the policy application window 6 months from now and stop 12 months from now).
  2. Policy Application Window (activePeriod): This uses a cron expression (startCron) to define a schedule (e.g. daily, weekly) at which point the policy can be applied. Additionally, duration (optional) specifies the length of time for which the policy should be applied after the scheduled start time. If the duration field isn't specified, it is interpreted as forever: once startCron has passed, the policy is considered active until the between end time has passed. Finally, timezone (optional) specifies the timezone used for startCron, and defaults to UTC.
...
schedule:
  between:
    # Start checking whether to apply the policy at this time; must conform to RFC3339.
    start: "2024-02-20T16:04:00Z" # optional
    # Stop checking whether to apply the policy at this time; must conform to RFC3339.
    end: "2024-02-24T16:04:00Z" # optional
  activePeriod:
    # Timezone to use for the startCron field; UTC by default.
    # Here it is set to US Eastern time.
    timezone: "America/New_York"
    # Start applying the bufferSize every Sunday at 1:00 AM.
    startCron: "0 1 * * 0" # optional
    # Only apply the bufferSize for 5 hours after each start.
    duration: "5h" # optional
...

Proposed Chain Policy Implementation

This format defines a Fleet Autoscaler policy that utilizes a chain structure for applying scaling logic based on different conditions. It leverages the concept of "falling through the chain" to achieve flexible scheduling and scaling behavior.

Key Elements:

  • Chain Policy: The type of the overall policy is set to Chain, indicating a sequence of conditions/schedules to be evaluated.

  • Chain Entry: Each entry within the chain list contains an optional condition (schedule) and a required policy to be applied if that condition is met. If a chain entry has no schedule, its policy is always applied when that entry is reached. A chain entry contains the following:

  • ID: Each chain entry has an id (optional) for easier identification, and a type of Schedule. By default the id is the index of the chain entry within the chain (e.g. the first entry is 0, the second is 1).

  • Schedule: A chain entry can have a schedule (optional), as described above.

  • Policy: Each chain entry has a policy (required) that defines the specific policy the FleetAutoscaler should execute to adjust the fleet. Only the following policy types are allowed under this field: Buffer, Counters/Lists, Webhook

Three Execution Flows For Chain Iteration

Schedule/Condition Met - Policy Applied:

  • If the element's schedule can be evaluated and is valid (i.e., it's within the between time range and the cron expression matches the current time), the FleetAutoscaler applies the policy defined within that element.

Schedule/Condition Not Met - Fall Through the Chain:

  • If the element's schedule is not currently active (i.e., it's outside the time window or the cron expression doesn't align with the current time), the FleetAutoscaler doesn't apply the policy within that element.
  • Crucially, the FleetAutoscaler automatically "falls through" to the next element in the chain. This means it continues evaluating subsequent elements in the sequence.

No Schedule Defined - Default Policy Application:

  • The policy is applied; the lack of a schedule is interpreted as a condition that is always true.

Importance of Chaining:
By chaining multiple elements with different schedules and policies, you can create layered scaling logic. The FleetAutoscaler keeps checking elements until it finds an active schedule and applies the corresponding policy for scaling. This approach allows for more nuanced scaling behavior based on various conditions throughout the day or week. If no schedule is applicable, the fleet autoscaler will not apply any policy unless a default (schedule-less) entry is specified or a chain entry's schedule becomes eligible.

Chain Example

apiVersion: autoscaling.agones.dev/v1
kind: FleetAutoscaler
metadata:
    name: simple-game-server-chain-autoscaler
spec:
    policy:
      type: Chain # Chain based policy for autoscaling.
      chain:
        # Id of chain entry.
        # Optional.
      - id: "weekday"
        type: Schedule # Schedule based condition.
        schedule:
          between:
            # The policy becomes eligible for application starting on 
            # Feb 20, 2024 at 4:04 PM EST.
            # Optional.
            start: "2024-02-20T16:04:04-05:00"
            # The policy becomes ineligible for application on 
            # Feb 23, 2024 at 4:04 PM EST.
            # Optional.
            end: "2024-02-23T16:04:04-05:00"
          activePeriod:
            # Timezone to be used for the startCron field.
            # Optional.
            timezone: "America/New_York"
            # Start applying the bufferSize every Sunday at 1:00 AM EST.
            # (Only eligible starting on Feb 20, 2024 at 4:04 PM.)
            # Optional.
            startCron: "0 1 * * 0"
            # Only apply the bufferSize for 5 hours.
            # Optional.
            duration: "5h"
        # Policy to be applied when the condition is met.
        # Required.
        policy:
          type: Buffer
          buffer:
            bufferSize: 50
            minReplicas: 100
            maxReplicas: 2000
        # Id of chain entry.
        # Optional.
      - id: "weekend" 
        type: Schedule
        schedule:
          between:
            # The policy becomes eligible for application starting on
            # Feb 24, 2024 at 4:05 PM EST.
            # Optional.
            start: "2024-02-24T16:04:05-05:00"
            # The policy becomes ineligible for application starting on
            # Feb 26, 2024 at 4:05 PM EST.
            # Optional.
            end: "2024-02-26T16:04:05-05:00"
          activePeriod:
            # Timezone to be used for the schedule.
            timezone: "America/New_York"
            # Start applying the bufferSize every Sunday at 1:00 AM EST.
            # (Only eligible starting on Feb 24, 2024 at 4:05 PM EST)
            # Optional.
            startCron: "0 1 * * 0"
            # Only apply the bufferSize for 7 hours.
            # Optional.
            duration: "7h"
        # Policy to be applied when the condition is met.
        # Required.
        policy:
          type: Counter
          counter:
            key: rooms
            bufferSize: 10
            minCapacity: 500
            maxCapacity: 1000
        # Id of chain entry.
      - id: "default"
        # Policy will always be applied when no other policy is applicable.
        # Required.
        policy:
          type: Buffer
          buffer:
            bufferSize: 5
            minReplicas: 100
            maxReplicas: 2000

@zmerlynn
Collaborator

Design LGTM! A couple of nits:


It is recommended to use ISO8601 time format if you would like to specify a timezone. If a timezone is specified and RFC3339 format is used, the formatted string will take precedence if the timezones differ.

I would be explicit and use the code formatting to help guide the reader here: e.g. "It is recommended to use ISO8601 time format without a time zone if you would like to specify a timezone using .timezone. If .timezone is specified and .between.start or .between.end includes a timezone as well, the formatted string will take precedence if the timezones differ." Note that ISO8601 can include a timezone, so it's one reason I'm being pedantic here.


Schedule a chain entry can have a schedule (optional) contains a:

This section seems redundant with the definition of the schedule above in the design, maybe drop it or shorthand it more?

@markmandel
Member

Evaluation Time Window (between): This uses start and end datetimes that must conform to RFC3339 or ISO8601 to define time range. The policy application window will only be evaluated within this window. (e.g. Start evaluating the policy applications window 6 months from now and stop 12 months from now). It is recommended to use ISO8601 time format without a time zone if you would like to specify a timezone using .timezone. If .timezone is specified and .between.start or .between.end includes a timezone as well, the formatted string will take precedence if the timezones differ.

Rather than precedence - could we fail validation if a user provides both? Basically you could do one or the other, but not both?

Policy Application Window (activePeriod)

I'm assuming activePeriod is optional if a between is not specified - and will default to always essentially?

e.g. for CUJ No. 1 "E.g. On the 16th of January, 2025 Make the buffer size 20%, instead of the default 10%" - there's no need for a activePeriod.

@indexjoseph
Contributor

indexjoseph commented Jun 25, 2024

Rather than precedence - could we fail validation if a user provides both? Basically you could do one or the other, but not both?

Yeah, I like that, so if a user provides a .timezone and a start/end time with a timezone, validation fails. We can do the same w/ CRON_TZ/TZ, if the user decides to specify a TZ for the .activePeriod.startCron and it differs from the .timezone, validation fails.

I'm assuming activePeriod is optional if a between is not specified - and will default to always essentially?

Yes, exactly. If the user really wanted to they could set the .activePeriod.startCron to "* * * * *" and leave the duration empty, which would have the same effect as well.

@igooch
Collaborator

igooch commented Aug 26, 2024

@indexjoseph @zmerlynn are there any outstanding items, or can we mark this as complete?
