User configurable rate limiting for event recording #236

Closed
hwangmoretime opened this issue Mar 10, 2023 · 17 comments
Labels
kind/bug

Comments

@hwangmoretime

Tell us about your request

The request is to

  • add rate limiting to PodFailedToSchedule
  • add the ability for users to configure the constants involved in rate limiting event production

Tell us about the problem you're trying to solve. What are you trying to do, and why is it hard?

Earlier versions of Karpenter refer to the problem that I'm facing:

https://github.com/aws/karpenter/blob/ce235744438601bd78fc89d23cfd402f6e38cb1c/pkg/events/loadshedding.go#L35

This prevents us from hammering the API server with events that likely aren't useful...

We see Karpenter hammering the control plane with events, which has impacted its uptime.
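For illustration only, a minimal sketch of that kind of load shedding, built on client-go's token-bucket rate limiter, might look like the following. The wrapper type, function names, and the qps/burst parameters are hypothetical stand-ins for the user-configurable constants this issue asks for, not Karpenter's actual code:

```go
// Illustrative sketch only (not Karpenter's actual implementation): wrap the
// event recorder in a token-bucket rate limiter so a burst of unschedulable
// pods can't flood the API server with events.
package events

import (
	"k8s.io/apimachinery/pkg/runtime"
	"k8s.io/client-go/tools/record"
	"k8s.io/client-go/util/flowcontrol"
)

type rateLimitedRecorder struct {
	rec     record.EventRecorder
	limiter flowcontrol.RateLimiter
}

// newRateLimitedRecorder wraps an EventRecorder; qps and burst stand in for
// the user-configurable constants requested in this issue.
func newRateLimitedRecorder(rec record.EventRecorder, qps float32, burst int) *rateLimitedRecorder {
	return &rateLimitedRecorder{
		rec:     rec,
		limiter: flowcontrol.NewTokenBucketRateLimiter(qps, burst),
	}
}

// Event forwards to the underlying recorder only when a token is available;
// otherwise the event is shed rather than sent to the API server.
func (r *rateLimitedRecorder) Event(object runtime.Object, eventtype, reason, message string) {
	if !r.limiter.TryAccept() {
		return
	}
	r.rec.Event(object, eventtype, reason, message)
}
```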

Are you currently working around this issue?

No good workarounds currently.

Additional Context

No response

Attachments

No response

Community Note

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • Please do not leave "+1" or "me too" comments, they generate extra noise for issue followers and do not help prioritize the request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment
@jonathan-innis
Member

Can you share how many FailedToSchedule events you are seeing across all your pods?

@jonathan-innis jonathan-innis added the kind/bug label Mar 10, 2023
@jonathan-innis jonathan-innis added kind/support and removed kind/bug labels Mar 15, 2023
@hwangmoretime
Author

5k - 20k FailedToSchedule events per hour during our recent incidents

@jonathan-innis
Member

We should only be firing that many events when there is a large number of pods that can't be scheduled on any provisioner. Do you mind sharing the number of pods you had that couldn't schedule? Also, how did this compare to other cluster components? My assumption is that if Karpenter is reacting to the pod, there are also events coming from the kube-scheduler about not being able to schedule it.

@hwangmoretime
Author

Without getting into specifics, there were a large number of pods waiting to be scheduled.

Our events per hour during the incident ranged from 100k - 150k, so Karpenter FailedToSchedule events accounted for 5%-10% of events during the incident.

@jonathan-innis
Member

If we were to make this user-configurable, what would you want to rate-limit it to? Would you want rate-limiting across all events or across certain types of events?

@github-actions

github-actions bot commented Apr 6, 2023

Labeled for closure due to inactivity in 10 days.

@github-actions github-actions bot added the lifecycle/stale label Apr 6, 2023
@hwangmoretime
Author

@jonathan-innis I'm open to one or both. I think at the very least, rate-limiting across all events.

@jonathan-innis jonathan-innis added the triage/unresolved label Apr 6, 2023
@jonathan-innis
Member

In general, we think our current event recording is fine, considering we see ourselves as a critical cluster component. Adding the wontfix label for now since I don't think we are planning to take this one up.

@github-actions github-actions bot removed the lifecycle/stale label Apr 7, 2023
@github-actions

Labeled for closure due to inactivity in 10 days.

@github-actions github-actions bot added the lifecycle/stale label Apr 28, 2023
@github-actions github-actions bot closed this as not planned May 10, 2023
@hwangmoretime
Author

> In general, we think our current event recording is fine, considering we see ourselves as a critical cluster component. Adding the wontfix label for now since I don't think we are planning to take this one up.

FWIW, my context is ML research, where we regularly have more pods to schedule than compute available. This leads to Karpenter routinely emitting a large number of events, which puts additional pressure on the control plane.

@ellistarn
Contributor

Consider using API Priority and Fairness to limit event QPS to ensure control plane performance. Cc @rschalo

@anthropic-eli

Hi all, I'm on the same team as @hwangmoretime and wanted to provide more details about our use case. We currently use karpenter for autoscaling/provisioning in clusters where we have a mix of CPU-only and GPU workloads. For GPU instances, we manage that capacity ourselves and don't want karpenter to autoscale it. We also encourage our users to launch workloads even though they may not immediately schedule because GPU capacity is freed up throughout the day. Thus we really only use karpenter for autoscaling CPU-only instances, but it still tries to find a provisioner for pending GPU workload pods, and this generates a metric boatload of events--which puts a lot of strain on the control plane.

Which brings us to this issue: we'd like some way to configure karpenter to reduce the number of events it emits. Rate limiting is one way to do it, but we'd also be happy if we could configure karpenter to ignore certain workloads and avoid generating those FailedToSchedule events altogether.

@engedaam
Contributor

@anthropic-eli have you attempted to use API Priority and Fairness to limit events?

@engedaam
Contributor

engedaam commented Jun 1, 2023

I'll be looking into this in the coming days

@engedaam engedaam reopened this Jun 1, 2023
@rschalo
Contributor

rschalo commented Jun 2, 2023

Hi @anthropic-eli and @hwangmoretime, I'm on EKS Scalability and I'm looking into limiting the impact of events on control plane performance. Out of curiosity, is there a controller you manage that relies upon events? What would the impact be, if any, to your workloads if events were rate-limited to 1 qps?

Alternatively, I've looked into creating a FlowSchema that catches all events and sends them to a PriorityLevelConfiguration limited to one concurrency share; I can share some of my work there.
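As a purely illustrative sketch of that approach (not the work mentioned above), a FlowSchema/PriorityLevelConfiguration pair along these lines could route event writes into a single low-concurrency priority level. The object names, API version, and values below are assumptions and would need tuning for a given cluster:

```yaml
# Illustrative sketch only; names, API version, and values are assumptions.
apiVersion: flowcontrol.apiserver.k8s.io/v1beta3
kind: PriorityLevelConfiguration
metadata:
  name: low-priority-events
spec:
  type: Limited
  limited:
    nominalConcurrencyShares: 1     # a single concurrency share for event traffic
    limitResponse:
      type: Queue
      queuing:
        queues: 16
        handSize: 4
        queueLengthLimit: 50
---
apiVersion: flowcontrol.apiserver.k8s.io/v1beta3
kind: FlowSchema
metadata:
  name: catch-all-events
spec:
  priorityLevelConfiguration:
    name: low-priority-events
  matchingPrecedence: 1000
  distinguisherMethod:
    type: ByUser
  rules:
    - subjects:
        # Catches event writes from any authenticated client; this could be
        # narrowed to just Karpenter's service account instead.
        - kind: Group
          group:
            name: system:authenticated
      resourceRules:
        - apiGroups: ["", "events.k8s.io"]
          resources: ["events"]
          verbs: ["create", "update", "patch"]
          namespaces: ["*"]
```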

@github-actions github-actions bot removed the lifecycle/closed and lifecycle/stale labels Jun 2, 2023
@engedaam
Contributor

@anthropic-eli @hwangmoretime After doing some investigation: Karpenter publishes an event when a pod cannot be scheduled with any of the provisioners. While a pod is pending and can't be scheduled, Karpenter emits 3 events per minute; the provisioning reconciliation for pending pods runs every 10 seconds. Karpenter emits an event for every pod that can't be scheduled, so the number of events grows linearly with the number of pending pods. Since in your case pods are intended to stay pending, this is expected behavior. In contrast, kube-scheduler only emits an event for a pod that can't be scheduled approximately every 5 minutes.

@engedaam engedaam added kind/feature and kind/bug and removed kind/support, triage/unresolved, and kind/feature labels Jun 13, 2023
@engedaam
Contributor

engedaam commented Jun 16, 2023

@anthropic-eli @hwangmoretime The team did a deep dive on the issue. There was a bug in the produced events: Karpenter was firing off more events than intended. Here is the PR for the fix: #372
One change introduced along with the bug fix is that Karpenter now mirrors kube-scheduler by producing a FailedToSchedule event every 5 minutes.
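To make that post-fix behavior concrete, here is a small, purely illustrative sketch (not the code from #372) of emitting a given event reason for a given object at most once per interval, mirroring kube-scheduler's roughly 5-minute cadence:

```go
// Illustrative sketch only (not the code from #372): allow a given
// (object, reason) pair through at most once per interval, e.g. a
// FailedToSchedule event at most every 5 minutes per pod.
package events

import (
	"sync"
	"time"
)

type onceEvery struct {
	interval time.Duration

	mu   sync.Mutex
	last map[string]time.Time // key: "<namespace>/<name>:<reason>"
}

func newOnceEvery(interval time.Duration) *onceEvery {
	return &onceEvery{interval: interval, last: map[string]time.Time{}}
}

// ShouldEmit reports whether the event keyed by key should be recorded now,
// and if so remembers the emission time so later calls within the interval
// are suppressed.
func (o *onceEvery) ShouldEmit(key string, now time.Time) bool {
	o.mu.Lock()
	defer o.mu.Unlock()
	if t, ok := o.last[key]; ok && now.Sub(t) < o.interval {
		return false
	}
	o.last[key] = now
	return true
}
```

A recorder would then check ShouldEmit with a key like "default/my-pod:FailedToSchedule" before handing the event to the underlying EventRecorder.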
