Implement priority based evictor #6139
Separate from the review comments, I'm wondering if this is what we want to do. kubectl drain also performs node drain, but it doesn't reuse the kubelet logic: https://kubernetes.io/docs/reference/generated/kubectl/kubectl-commands#drain. Most notably, it doesn't evict DaemonSet pods at all, on the basis that they will be recreated and scheduled on the same node again (DaemonSet pods bypass the usual unready/unschedulable checks while scheduling), and will then be forcibly terminated anyway. I've definitely seen this happen in practice, which makes me think that maybe we should consider reverting to our previous behavior of not evicting DaemonSet pods at all. At which point, maybe this priority feature is not needed.
@x13n @MaciekPytel I'm curious to hear your thoughts on this.
My understanding was that DS eviction was introduced to give some of them a heads-up before deleting a node - e.g. logging agents may need a non-negligible amount of time to flush logs to some backend. If we just delete the VM under them, their cleanup may not complete in time. We don't currently set […]
/assign @towca (since you're already reviewing it anyway)
Good point, although with taints it's easy to determine their "ownership" by name, so we know which taints to clean up in error-handling flows. With the […] In any case, it seems like we have important use cases for evicting DaemonSet pods even if they schedule back on the node afterwards. Let's move forward with this PR @damikag
This very much does introduce a user-facing behavior change and should have a release note explaining it.
Thanks for addressing my previous comments, this approach is way more readable on a high level. Sorry for the delay and the number of new comments. This is a critical part of CA that has historically had huge readability problems, so I'm erring on the side of thoroughness here.
Could you also add Release Notes to the PR description before we merge?
The release note goes a bit too deep into details for an average reader. I'd put something like this:

A new flag (`--drain-priority-config`) is introduced which allows users to configure drain behavior during scale-down based on pod priority. The new flag is mutually exclusive with `--max-graceful-termination-sec`. `--max-graceful-termination-sec` can still be used if the new configuration options are not needed. The default behavior is preserved (simple config, default value of `--max-graceful-termination-sec`).
Thanks for all the hard work on this!
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: damikag, towca.
@towca Can you please give an example for this config? Can I prioritise the drain of nodes in the same way as with the priority expander?
@kost2191 No, this allows you to configure the order in which pods are evicted from a single node, whenever CA decides to scale it down. It doesn't affect the choice of which node is scaled down. An example config could be […]. The feature can be useful if some of the pods depend on other pods (e.g. metric/logging agents) still running on the node during graceful termination.
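The concrete example config in the comment above was lost in the export. As a purely illustrative sketch (the type and field names below are assumptions invented for this example, not the actual cluster-autoscaler types or the real `--drain-priority-config` syntax), such a configuration can be modeled as a list of (priority cutoff, graceful termination seconds) pairs, one per priority group:

```go
// Hypothetical illustration only: not the cluster-autoscaler's actual types
// or the real --drain-priority-config value format.
package sketch

// drainPriorityEntry pairs a pod-priority cutoff with the graceful
// termination time granted to pods in that priority group.
type drainPriorityEntry struct {
	PriorityCutoff             int32 // pods with priority >= this cutoff fall into this group
	GracefulTerminationSeconds int   // how long pods in this group get to shut down
}

// Example: ordinary workloads (priority >= 0) get 60s to terminate, while
// metric/logging agents running at priority >= 1000 get 600s and, because
// groups are drained in ascending priority order, are evicted last - so they
// are still around while the lower-priority pods shut down.
var exampleDrainPriorityConfig = []drainPriorityEntry{
	{PriorityCutoff: 1000, GracefulTerminationSeconds: 600},
	{PriorityCutoff: 0, GracefulTerminationSeconds: 60},
}
```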
What type of PR is this?
/kind bug
What this PR does / why we need it:
This PR implements a priority-based evictor for pod eviction during scale-down. This evictor can be used to ensure that non-critical pods are evicted before critical pods. When this evictor is used, it groups pods by priority and evicts them group by group, from the lowest priority to the highest (see the sketch below).
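For illustration only, here is a minimal Go sketch of that ordering, assuming eviction of a single pod is abstracted behind a callback. The `Pod` type and the `evictInPriorityOrder` helper are hypothetical names made up for this sketch and are not taken from the cluster-autoscaler code.

```go
package main

import (
	"fmt"
	"sort"
)

// Pod is a minimal stand-in for a pod with a priority value.
type Pod struct {
	Name     string
	Priority int32
}

// evictInPriorityOrder groups pods by priority and evicts each group in
// ascending priority order, so non-critical (low-priority) pods go first.
func evictInPriorityOrder(pods []Pod, evict func(Pod) error) error {
	groups := map[int32][]Pod{}
	for _, p := range pods {
		groups[p.Priority] = append(groups[p.Priority], p)
	}

	// Sort the distinct priorities so groups are processed lowest-first.
	priorities := make([]int32, 0, len(groups))
	for prio := range groups {
		priorities = append(priorities, prio)
	}
	sort.Slice(priorities, func(i, j int) bool { return priorities[i] < priorities[j] })

	for _, prio := range priorities {
		for _, p := range groups[prio] {
			if err := evict(p); err != nil {
				return err
			}
		}
	}
	return nil
}

func main() {
	pods := []Pod{
		{Name: "metrics-agent", Priority: 2000},
		{Name: "batch-job", Priority: 0},
		{Name: "web-frontend", Priority: 1000},
	}
	// Prints batch-job, then web-frontend, then metrics-agent: lowest priority first.
	_ = evictInPriorityOrder(pods, func(p Pod) error {
		fmt.Println("evicting", p.Name)
		return nil
	})
}
```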
Which issue(s) this PR fixes:
Fixes #
Special notes for your reviewer:
Does this PR introduce a user-facing change?
This PR introduces a new evictor that can be enabled and configured via the `--drain-priority-config` flag. Setting an empty string will disable the feature and use the default unordered evictor.

Release note:

A new flag (`--drain-priority-config`) is introduced which allows users to configure drain behavior during scale-down based on pod priority. The new flag is mutually exclusive with `--max-graceful-termination-sec`. `--max-graceful-termination-sec` can still be used if the new configuration options are not needed. The default behavior is preserved (simple config, default value of `--max-graceful-termination-sec`).

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.: