Implement priority based evictor #6139

Merged
merged 1 commit into kubernetes:master on Dec 21, 2023

Conversation

@damikag damikag (Contributor) commented Sep 25, 2023

What type of PR is this?

/kind bug

What this PR does / why we need it:

This PR implements a priority-based evictor for pod eviction during scale-down. The evictor can be used to ensure that non-critical pods are evicted before critical pods. When it is enabled, pods are grouped by priority and evicted group by group, from the lowest priority to the highest.
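
To illustrate the grouping idea, here is a minimal Go sketch (assumed names only - evictByPriority and evictPod are hypothetical helpers, not the actual evictor code added by this PR):

```go
// Sketch only: group a node's pods by priority and evict the groups in
// ascending priority order, so the lowest-priority pods go first.
package sketch

import (
	"sort"

	apiv1 "k8s.io/api/core/v1"
)

func evictByPriority(pods []*apiv1.Pod, evictPod func(*apiv1.Pod) error) error {
	// Group pods by their (defaulted) priority.
	groups := map[int32][]*apiv1.Pod{}
	for _, p := range pods {
		prio := int32(0)
		if p.Spec.Priority != nil {
			prio = *p.Spec.Priority
		}
		groups[prio] = append(groups[prio], p)
	}
	// Sort the priorities ascending and evict group by group.
	priorities := make([]int32, 0, len(groups))
	for prio := range groups {
		priorities = append(priorities, prio)
	}
	sort.Slice(priorities, func(i, j int) bool { return priorities[i] < priorities[j] })
	for _, prio := range priorities {
		for _, p := range groups[prio] {
			if err := evictPod(p); err != nil {
				return err
			}
		}
	}
	return nil
}
```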

Which issue(s) this PR fixes:

Fixes #

Special notes for your reviewer:

Does this PR introduce a user-facing change?

This PR introduces a new evictor that can be enabled and configured via the --drain-priority-config flag. Setting an empty string disables the feature and uses the default unordered evictor.

Release-note

A new flag (--drain-priority-config) is introduced which allows users to configure drain behavior during scale-down based on pod priority. The new flag is mutually exclusive with --max-graceful-termination-sec. --max-graceful-termination-sec can still be used if the new configuration options are not needed. The default behavior is preserved (simple config, default value of --max-graceful-termination-sec).

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:


@k8s-ci-robot added the kind/bug, cncf-cla: yes, and size/XL labels on Sep 25, 2023
@k8s-ci-robot added the needs-rebase label (merge conflicts with HEAD) on Sep 27, 2023
@towca towca (Collaborator) left a comment

Separate from the review comments, I'm wondering if this is what we want to do. kubectl drain also performs node drain, but it doesn't reuse the kubelet logic: https://kubernetes.io/docs/reference/generated/kubectl/kubectl-commands#drain. Most notably, it doesn't evict DaemonSet pods at all, on the basis that they will be recreated and scheduled on the same node again (DaemonSet pods bypass the usual unready/unschedulable checks while scheduling), and will then be forcibly terminated anyway. I've definitely seen this happen in practice, which makes me think that maybe we should consider reverting to our previous behavior of not evicting DaemonSet pods at all. At which point, maybe this priority feature is not needed.

@x13n @MaciekPytel I'm curious to hear your thoughts on this.

Review comments (outdated, resolved) on: cluster-autoscaler/main.go, cluster-autoscaler/metrics/metrics.go, cluster-autoscaler/core/scaledown/actuation/drain.go
@x13n x13n (Member) commented Oct 10, 2023

My understanding was that DS eviction was introduced to give some of them a heads-up before deleting a node - e.g. logging agents may need a non-negligible amount of time to flush logs to some backend. If we just delete the VM under them, their cleanup may not complete in time.

We don't currently set the unschedulable bit in CA, we only taint the node before deletion. Proper cordoning (setting unschedulable: true) should prevent DS pods from scheduling again. Maybe we should do this first? The problem with DaemonSets is that they sometimes use a wildcard toleration - we shouldn't have such pods schedule again, so setting the unschedulable field instead of (or in addition to) the taint seems like a better option.
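
For reference, "proper cordoning" via the unschedulable field would look roughly like this with client-go (a sketch under assumed names - cordonNode is a hypothetical helper, not code from this PR or from CA):

```go
// Sketch only: cordon a node by setting spec.unschedulable, instead of
// (or in addition to) adding a CA-owned taint.
package sketch

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

func cordonNode(ctx context.Context, client kubernetes.Interface, nodeName string) error {
	node, err := client.CoreV1().Nodes().Get(ctx, nodeName, metav1.GetOptions{})
	if err != nil {
		return err
	}
	node.Spec.Unschedulable = true // equivalent to `kubectl cordon <node>`
	_, err = client.CoreV1().Nodes().Update(ctx, node, metav1.UpdateOptions{})
	return err
}
```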

@x13n x13n (Member) commented Oct 10, 2023

/assign @towca

(since you're already reviewing it anyway)

@towca towca (Collaborator) commented Oct 10, 2023

Good point, although with taints it's easy to determine their "ownership" by name, so we know which taints to clean up in error-handling flows. With the unschedulable bit, we wouldn't be able to know if it was CA or something else setting it.

In any case, it seems like we have important use cases for evicting DaemonSet pods even if they schedule back on the node afterwards. Let's move forward with this PR @damikag

@MaciekPytel MaciekPytel (Contributor) commented:

> Does this PR introduce a user-facing change?

This very much does introduce a user-facing behavior change and should have a release note explaining it.

Review comments (outdated, resolved) on: cluster-autoscaler/core/scaledown/actuation/drain_test.go, cluster-autoscaler/core/scaledown/actuation/evictor.go, cluster-autoscaler/main.go, cluster-autoscaler/core/scaledown/actuation/drain.go
@damikag damikag requested a review from towca November 14, 2023 09:37
@towca towca (Collaborator) left a comment

Thanks for addressing my previous comments; this approach is way more readable at a high level. Sorry for the delay and the number of new comments. This is a critical part of CA that has historically had huge readability problems, so I'm erring on the side of thoroughness here.

Review comments (outdated, resolved) on: cluster-autoscaler/metrics/metrics.go, cluster-autoscaler/main.go, cluster-autoscaler/config/autoscaling_options.go, cluster-autoscaler/core/scaledown/actuation/actuator.go, cluster-autoscaler/core/scaledown/actuation/drain.go, cluster-autoscaler/core/scaledown/actuation/drain_test.go
@damikag damikag requested a review from towca November 28, 2023 09:28
@k8s-ci-robot removed the needs-rebase label (merge conflicts with HEAD) on Dec 4, 2023
Review comments (outdated, resolved) on: cluster-autoscaler/core/scaledown/actuation/drain.go, cluster-autoscaler/main.go, cluster-autoscaler/core/scaledown/actuation/actuator.go, cluster-autoscaler/config/autoscaling_options.go
@towca towca (Collaborator) commented Dec 21, 2023

Could you also add Release Notes to the PR description before we merge?

@towca towca (Collaborator) commented Dec 21, 2023

The release note goes a bit too deep into details for an average reader. I'd put something like this:

A new flag (--drain-priority-config) is introduced which allows users to configure drain behavior during scale-down based on pod priority. The new flag is mutually exclusive with --max-graceful-termination-sec. --max-graceful-termination-sec can still be used if the new configuration options are not needed. The default behavior is preserved (simple config, default value of --max-graceful-termination-sec).

@towca towca (Collaborator) commented Dec 21, 2023

Thanks for all the hard work on this!
/lgtm
/approve

@k8s-ci-robot added the lgtm and approved labels on Dec 21, 2023
@k8s-ci-robot (Contributor) commented:

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: damikag, towca

The full list of commands accepted by this bot can be found here.

The pull request process is described here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot merged commit fc48d5c into kubernetes:master Dec 21, 2023
6 checks passed
@kost2191 kost2191 commented Feb 4, 2024

@towca Can you please give an example of this config? Can I prioritise the drain of nodes in the same way as with the priority expander?

@towca towca (Collaborator) commented Feb 8, 2024

@kost2191 No, this allows you to configure the order in which pods are evicted from a single node, whenever CA decides to scale it down. It doesn't affect the choice of which node is scaled down.

An example config could be 0:3600,2000000000:600. This means that whenever CA scales down a node, it does so in two batches. First, all pods with priority in the [0, 2000000000) range are evicted and CA waits 60min before they're force-terminated. After all pods from the first batch are evicted (whether forcibly or not), CA evicts all pods with priority >=2000000000, and waits 10min before force-terminating them.

The feature can be useful if some of the pods depend on other pods (e.g. metric/logging agents) still running on the node during graceful termination.
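
A rough sketch of how a value like 0:3600,2000000000:600 breaks down into priority/grace-period pairs (illustrative Go only - the type and function names here are assumptions, not the actual Cluster Autoscaler parsing code):

```go
// Sketch only: parse a --drain-priority-config style value such as
// "0:3600,2000000000:600" into (priority, gracePeriodSeconds) pairs.
package sketch

import (
	"fmt"
	"strconv"
	"strings"
)

type priorityGracePeriod struct {
	Priority           int32 // lowest pod priority covered by this group
	GracePeriodSeconds int64 // time to wait before force-terminating the group
}

func parseDrainPriorityConfig(s string) ([]priorityGracePeriod, error) {
	if s == "" {
		return nil, nil // empty string: feature disabled, default unordered evictor is used
	}
	var out []priorityGracePeriod
	for _, entry := range strings.Split(s, ",") {
		parts := strings.Split(entry, ":")
		if len(parts) != 2 {
			return nil, fmt.Errorf("invalid entry %q", entry)
		}
		prio, err := strconv.ParseInt(parts[0], 10, 32)
		if err != nil {
			return nil, err
		}
		grace, err := strconv.ParseInt(parts[1], 10, 64)
		if err != nil {
			return nil, err
		}
		out = append(out, priorityGracePeriod{Priority: int32(prio), GracePeriodSeconds: grace})
	}
	return out, nil
}
```

With the example value, this yields two groups: priorities starting at 0 with a 3600s (60 min) grace period, and priorities starting at 2000000000 with a 600s (10 min) grace period, matching the two-batch behavior described above.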

Labels

approved - Indicates a PR has been approved by an approver from all required OWNERS files.
area/cluster-autoscaler
cncf-cla: yes - Indicates the PR's author has signed the CNCF CLA.
kind/bug - Categorizes issue or PR as related to a bug.
lgtm - "Looks good to me", indicates that a PR is ready to be merged.
size/XL - Denotes a PR that changes 500-999 lines, ignoring generated files.