From c64c01ee8485fcd8f253523b47d12b59ae0b04c7 Mon Sep 17 00:00:00 2001
From: GreenHand
Date: Tue, 12 Mar 2024 14:13:06 +0800
Subject: [PATCH] proposal: support pod customizing numa policy (#1910)

Signed-off-by: KunWuLuan
---
 .../20240131-pod-level-numa-policy.md | 320 ++++++++++++++++++
 1 file changed, 320 insertions(+)
 create mode 100644 docs/proposals/scheduling/20240131-pod-level-numa-policy.md

diff --git a/docs/proposals/scheduling/20240131-pod-level-numa-policy.md b/docs/proposals/scheduling/20240131-pod-level-numa-policy.md
new file mode 100644
index 000000000..ec4e6c166
--- /dev/null
+++ b/docs/proposals/scheduling/20240131-pod-level-numa-policy.md
@@ -0,0 +1,320 @@
---
title: Pod-Level NUMA Topology Policy
authors:
  - "@kunwuluan"
reviewers:
  - "@eahydra"
  - "@zwzhang0107"
  - "@hormes"
creation-date: 2024-01-31
last-updated: 2024-01-31
---

# Pod-Level NUMA Topology Policy

## Table of Contents

- [Pod-Level NUMA Topology Policy](#pod-level-numa-topology-policy)
  - [Table of Contents](#table-of-contents)
  - [Summary](#summary)
  - [Motivation](#motivation)
  - [Proposal](#proposal)
    - [Goal](#goal)
    - [Non-Goal](#non-goal)
    - [Use Cases](#use-cases)
    - [API](#api)
      - [SingleNUMANodeExclusive](#singlenumanodeexclusive)
      - [Examples for NUMA Topology Policy](#examples-for-numa-topology-policy)
      - [Examples for SingleNUMANodeExclusive](#examples-for-singlenumanodeexclusive)
    - [Work with Node-Level Policy](#work-with-node-level-policy)
    - [CPU Policy for Different Policies](#cpu-policy-for-different-policies)
    - [Work with Device Joint Allocation](#work-with-device-joint-allocation)
    - [Changes In Scheduler](#changes-in-scheduler)
      - [NUMA Aware Resource](#numa-aware-resource)
        - [Arguments](#arguments)
        - [Filter](#filter)
        - [Hint](#hint)
        - [Score](#score)
  - [Graduation criteria](#graduation-criteria)
  - [Alternatives](#alternatives)
  - [Implementation History](#implementation-history)

## Summary

Introduce a new API for a pod-level NUMA topology policy. With the new API, users can specify the NUMA topology policy
for each pod, so that pods that are more sensitive to latency can decide how they need to be orchestrated, rather than
being passively scheduled according to the NUMA topology policy on the node.

## Motivation

With the NUMA topology policy set on nodes, users need to set a nodeSelector or nodeAffinity to place an application
within a single NUMA node. This means splitting the nodes in the cluster into static groups. In large clusters, with
lots of nodes that can serve different purposes, it is not unreasonable to dedicate a node (or a set of nodes) to a
certain NUMA topology policy and direct pods to those nodes using a nodeSelector. It is mostly problematic in smaller
clusters, where you do not have the luxury of special-purposing nodes like this and you want to bin-pack pods onto
nodes as tightly as possible (while still reaping the benefits of topology alignment). So we need a way to specify the
NUMA topology policy for each pod.

## Proposal

### Goal

- Allow users to specify the NUMA topology policy on workloads.
- Protect the QoS requirement of workloads with the `SingleNUMANode` policy by preventing pods that cross NUMA nodes
and pods with the `SingleNUMANode` policy from being deployed on the same NUMA node.
- The new API should be compatible with the NUMA topology policy on nodes. This means that if users do not specify a
NUMA scheduling policy on workloads, the NUMA topology policy on nodes should still work.

### Non-Goal

- Change the device, CPU, and memory allocation rules during NUMA-aware scheduling.
### Use Cases

- As a cluster user, I do not have the privileges to set the NUMA topology policy on the nodes, and I want to set
`SingleNUMANode` for my pods so that they can be bin-packed as tightly as possible.
- As a cluster admin, splitting my cluster into two smaller clusters according to the NUMA topology policy may reduce
GPU utilization.
- As a cluster user, I want a way to indicate that my workload should not be co-located with a cross-NUMA workload,
because such workloads may use too much memory and prevent me from obtaining the memory resources I requested.

[//]: # (- As a cluster admin, I hope there is no interference between jobs with different numa topology policy. This option should be a global)

[//]: # (setting so that I can enable and disable it easily.)

### API

We will introduce a new field in the pod annotation `scheduling.koordinator.sh/resource-spec`: `numaTopologyPolicy`.
The value of this field can be `""`, `Restricted`, `SingleNUMANode`, or `BestEffort`. The default value is `""`.
The meaning of these values is the same as that of the corresponding node-level policy.

#### SingleNUMANodeExclusive

To protect the QoS of pods with the `SingleNUMANode` policy, pods that cross NUMA nodes should not be placed on the
same NUMA node as a pod that is pinned to a single NUMA node. Sometimes this may lead to situations where workloads
that use multiple NUMA nodes can never be scheduled. Therefore, we allow some critical workloads to break this
restriction.

To reach this goal, we will introduce another field in the pod annotation `scheduling.koordinator.sh/resource-spec`:
`singleNUMANodeExclusive`. The value of this field can be:
- `Preferred`: a NUMA node that hosts a `SingleNUMANode` pod will not be assigned another pod that spans multiple NUMA
nodes if there is another idle NUMA node.
- `Required`: a NUMA node that hosts a `SingleNUMANode` pod can never be assigned another pod that spans multiple NUMA
nodes.

If `singleNUMANodeExclusive` is not set by the user, it is treated as `Required` to ensure that `SingleNUMANode` pods
are not affected by other policies.

[//]: # (We will introduce a new argument in scheduler's config: `SingleNUMANodeExclusive`, the value of this property can be:)

[//]: # (- `none`: a numa with a SingleNUMANode pod can be scheduled another pod with multi-numa.)

[//]: # (- `besteffort`: a numa with a SingleNUMANode pod will not be scheduled another pod with multi-numa if there is another idle numa.)

[//]: # (- `exclusive`: a numa with a SingleNUMANode pod can not be scheduled another pod with multi-numa.)

[//]: # (`SingleNUMANodeExclusive` should not be set by users, because some users may set all his/her pods as exclusive so that all pods with restricted policy cannot be scheduled.)

#### Examples for NUMA Topology Policy

``` yaml
metadata:
  annotations:
    scheduling.koordinator.sh/resource-spec: |-
      {
        "numaTopologyPolicy": "SingleNUMANode"
      }
spec:
  containers:
    - resources:
        requests:
          cpu: 1
        limits:
          cpu: 1
```
This pod will be allocated an exclusive CPU.

``` yaml
metadata:
  annotations:
    scheduling.koordinator.sh/resource-spec: |-
      {
        "numaTopologyPolicy": "SingleNUMANode"
      }
spec:
  containers:
    - resources:
        requests:
          cpu: 1
        limits:
          cpu: 2
```
This pod will be allocated CPUs from the shared pool.

``` yaml
metadata:
  annotations:
    scheduling.koordinator.sh/resource-spec: |-
      {
        "numaTopologyPolicy": "SingleNUMANode"
      }
spec:
  containers:
    - resources:
        requests:
          cpu: 1
          nvidia.com/gpu: 1
        limits:
          cpu: 1
          nvidia.com/gpu: 1
```
The GPU and CPU will be aligned in one NUMA node for this pod.
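
The three examples above all carry the policy inside the `scheduling.koordinator.sh/resource-spec` annotation. As a
rough sketch of how the scheduler side might decode that annotation, consider the following; the struct and helper
names here are illustrative assumptions rather than the final koordinator API:

``` go
package main

import (
	"encoding/json"
	"fmt"
)

// AnnotationResourceSpec is the annotation key described in the API section above.
const AnnotationResourceSpec = "scheduling.koordinator.sh/resource-spec"

// resourceSpec mirrors only the two fields introduced by this proposal.
type resourceSpec struct {
	NUMATopologyPolicy      string `json:"numaTopologyPolicy,omitempty"`
	SingleNUMANodeExclusive string `json:"singleNUMANodeExclusive,omitempty"`
}

// getResourceSpec decodes the pod-level NUMA settings from pod annotations.
// A missing annotation or an empty policy means "no pod-level requirement".
func getResourceSpec(annotations map[string]string) (*resourceSpec, error) {
	spec := &resourceSpec{}
	raw, ok := annotations[AnnotationResourceSpec]
	if !ok {
		return spec, nil
	}
	if err := json.Unmarshal([]byte(raw), spec); err != nil {
		return nil, err
	}
	return spec, nil
}

func main() {
	annotations := map[string]string{
		AnnotationResourceSpec: `{"numaTopologyPolicy": "SingleNUMANode"}`,
	}
	spec, err := getResourceSpec(annotations)
	if err != nil {
		panic(err)
	}
	fmt.Println(spec.NUMATopologyPolicy) // prints: SingleNUMANode
}
```

An absent annotation or an empty `numaTopologyPolicy` simply means "no pod-level requirement", so scheduling falls
back to the node-level policy as before.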
In contrast, consider the same pod without the annotation:

``` yaml
spec:
  containers:
    - resources:
        requests:
          cpu: 1
          nvidia.com/gpu: 1
        limits:
          cpu: 1
          nvidia.com/gpu: 1
```
We will not align the GPU and CPU in one NUMA node for this pod, because there is no NUMA topology policy on the pod.

#### Examples for SingleNUMANodeExclusive

Pod-a:

``` yaml
metadata:
  annotations:
    scheduling.koordinator.sh/resource-spec: |-
      {
        "numaTopologyPolicy": "SingleNUMANode"
      }
```

Pod-b:

``` yaml
metadata:
  annotations:
    scheduling.koordinator.sh/resource-spec: |-
      {
        "numaTopologyPolicy": "Restricted"
      }
```

If pod-a requires `SingleNUMANode` and pod-b requires `Restricted`, all devices, CPU, and memory will be aligned in one
NUMA node for pod-a. Besides, the scheduler will not place pod-b on the same NUMA node as pod-a if pod-b would span
multiple NUMA nodes, because `singleNUMANodeExclusive` is treated as `Required` by default.

Pod-a:

``` yaml
metadata:
  annotations:
    scheduling.koordinator.sh/resource-spec: |-
      {
        "numaTopologyPolicy": "SingleNUMANode"
      }
```

Pod-b:

``` yaml
metadata:
  annotations:
    scheduling.koordinator.sh/resource-spec: |-
      {
        "numaTopologyPolicy": "Restricted",
        "singleNUMANodeExclusive": "Preferred"
      }
```

If pod-a has already been placed on the node and `singleNUMANodeExclusive` is set to `Preferred` on pod-b, then pod-b
can be placed on the same NUMA node as pod-a. This can be used when a user wants to place a pod with `SingleNUMANode`
and a pod with `Restricted` on the same node.

### Work with Node-Level Policy

We want to make the new API compatible with the NUMA topology policy on nodes. So if the policy on the pod and the
policy on the node are different, we will not place the pod on the node. If the policy on the node is `""`, the node
can host workloads with any policy, so we can schedule the pod onto it. On the other hand, `""` as the workload's
policy means the pod has no requirement for NUMA-aware scheduling, so it can be placed on any node and is scheduled
just as before.

So we have these rules:

- If the policy on the pod is not `""`, the scheduler will use the policy set on the pod. A pod with a non-none policy
can be scheduled on a node only if the policy on the node is `""` or the same as the policy on the pod.

- If the policy on the pod is not set or is `""`, the scheduler will use the policy set on the node.

| Pod policy \ Node policy | SingleNUMANode node | Restricted node | BestEffort node | none node |
|--------------------------|---------------------|-----------------|-----------------|-----------|
| SingleNUMANode           | ✅                  | ❌              | ❌              | ✅        |
| Restricted               | ❌                  | ✅              | ❌              | ✅        |
| BestEffort               | ❌                  | ❌              | ✅              | ✅        |
| none                     | ✅                  | ✅              | ✅              | ✅        |

### CPU Policy for Different Policies

First, we should maintain consistent behavior when scheduling a pod without a NUMA topology policy in its annotations.
For a pod with a NUMA topology policy in its annotations scheduled on a node without a NUMA topology policy, the CPU
allocation behavior will be consistent with that of nodes whose policy is set via labels.

### Work with Device Joint Allocation

This proposal is compatible with Device Joint Allocation, because the pod-level policy keeps the same allocation
behavior as the node-level policy.

### Changes In Scheduler

#### NUMA Aware Resource

##### Arguments

We will add a property to the plugin arguments, like the following:

``` go
// NUMAExclusivePolicy defines how strictly NUMA nodes that already host
// SingleNUMANode pods are protected from pods spanning multiple NUMA nodes.
type NUMAExclusivePolicy = string

const (
	NUMAExclusivePreferred = "Preferred"
	NUMAExclusiveRequired  = "Required"
)

type NodeNUMAResourceArgs struct {
	...
	SingleNUMANodeExclusive NUMAExclusivePolicy
}
```

This argument only works on pods that carry the new annotation, so behavior stays exactly the same as before for users
who do not use the new annotation.
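
The Filter step described below relies on the compatibility rule from
[Work with Node-Level Policy](#work-with-node-level-policy). As a hedged illustration of that rule (the type and
function names are assumptions, not the final plugin code), the resolution between the pod-level and node-level
policies could look like this:

``` go
package main

import "fmt"

type NUMATopologyPolicy string

const (
	PolicyNone           NUMATopologyPolicy = ""
	PolicyBestEffort     NUMATopologyPolicy = "BestEffort"
	PolicyRestricted     NUMATopologyPolicy = "Restricted"
	PolicySingleNUMANode NUMATopologyPolicy = "SingleNUMANode"
)

// effectivePolicy returns the policy the scheduler should apply for a pod on a
// given node and whether the node is schedulable, following the compatibility
// table in "Work with Node-Level Policy".
func effectivePolicy(podPolicy, nodePolicy NUMATopologyPolicy) (NUMATopologyPolicy, bool) {
	if podPolicy == PolicyNone {
		// No requirement on the pod: fall back to the node-level policy.
		return nodePolicy, true
	}
	if nodePolicy == PolicyNone || nodePolicy == podPolicy {
		// The pod-level policy takes precedence.
		return podPolicy, true
	}
	// Conflicting policies: Filter marks the node UnschedulableAndUnresolvable.
	return PolicyNone, false
}

func main() {
	if _, ok := effectivePolicy(PolicySingleNUMANode, PolicyRestricted); !ok {
		fmt.Println("node rejected: conflicting NUMA topology policies")
	}
}
```

With this rule, a pod that leaves the policy empty keeps today's behavior, while a pod-level policy only narrows the
set of feasible nodes.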
##### Filter

The topology manager should read the policy on the pod and check whether the policy on the node is compatible with it,
that is, the node policy is `""` or the same as the pod policy. If not, the node will be marked as
`UnschedulableAndUnresolvable`.

If there is no NUMA topology policy on the pod, we should maintain consistent behavior as before.

##### Hint

We will take the policy from the pod in preference to the policy on the node.

When trying to find available hints, if a hint contains a NUMA node that already hosts a `SingleNUMANode` pod and
`SingleNUMANodeExclusive=Required`, we will skip that hint.

When calculating the scores for hints, if a hint contains a NUMA node that already hosts a `SingleNUMANode` pod and
`SingleNUMANodeExclusive=Preferred`, the score for that hint will be set to 0.

##### Score

By default, the scheduler spreads pods across NUMA nodes, so pods that need to span multiple NUMA nodes can fail to
schedule when every NUMA node already hosts a `SingleNUMANode` pod. So we will prioritize packing `SingleNUMANode`
pods together. The score will be calculated in the following way:

`score = (100 - (num-of-numanode) * 10) + 10 * (requested / allocated)`

`(100 - (num-of-numanode) * 10)` is the base score: nodes that require multiple NUMA nodes to satisfy the request will
always score lower than nodes that can satisfy it with just one NUMA node.

`10 * (requested / allocated)` means nodes with high utilization score higher than those with low utilization.

## Graduation criteria

This plugin takes effect only when users enable it in the scheduler framework and add the annotation to their pods.
So it is safe to be beta.

* Beta
  - [ ] Add E2E tests.

## Alternatives

## Implementation History
\ No newline at end of file