diff --git a/i18n/zh-Hans/docusaurus-plugin-content-docs/current/designs/fine-grained-device-scheduling.md b/i18n/zh-Hans/docusaurus-plugin-content-docs/current/designs/fine-grained-device-scheduling.md
index e27e8a951..14a8bb45f 100644
--- a/i18n/zh-Hans/docusaurus-plugin-content-docs/current/designs/fine-grained-device-scheduling.md
+++ b/i18n/zh-Hans/docusaurus-plugin-content-docs/current/designs/fine-grained-device-scheduling.md
@@ -1,43 +1,45 @@
-# Fine-grained device scheduling
+# 精细化设备调度

-## Summary
+## 摘要

-This proposal provides a fine-grained mechanism for managing GPUs and other devices such as RDMA and FPGA, defines a set of APIs to describe device information on nodes, including GPU, RDMA, and FPGA, and a new set of resource names to flexibly support users to apply at a finer granularity GPU resources. This mechanism is the basis for subsequent other GPU scheduling capabilities such as GPU Share, GPU Overcommitment, etc.
+该提案提供了一种细粒度地管理 GPU 以及 RDMA、FPGA 等设备的机制,定义了一套 API 来描述节点上的设备信息(包括 GPU、RDMA、FPGA),并引入一套新的资源名称,让用户可以以更细的粒度申请 GPU 资源。该机制是后续其他 GPU 调度能力(如 GPU Share、GPU Overcommitment 等)的基础。

-## Motivation
+## 动机
+
+GPU 设备具有很强的计算能力,但是价格昂贵。如何更好地利用 GPU 设备,发挥 GPU 的价值,降低成本,是一个亟待解决的问题。在 K8s 社区现有的 GPU 分配机制中,GPU 由 kubelet 分配,并且分配的是完整的设备。这种方法简单可靠,但与 CPU 和内存类似,GPU 也会被浪费。因此,一些用户希望仅使用 GPU 的一部分资源,并将其余部分与其他工作负载共享以节省成本。此外,GPU 具有特殊性。比如下面提到的 NVIDIA GPU 支持的 NVLink 和超卖场景,都需要通过调度器进行中央决策,以获得全局最优的分配结果。

-GPU devices have very strong computing power, but are expensive. How to make better use of GPU equipment, give full play to the value of GPU and reduce costs is a problem that needs to be solved. In the existing GPU allocation mechanism of the K8s community, the GPU is allocated by the kubelet, and it is a complete device allocation. This method is simple and reliable, but similar to the CPU and memory, the GPU will also be wasted. Therefore, some users expect to use only a portion of the GPU's resources and share the rest with other workloads to save costs. Moreover, GPU has particularities. For example, the NVLink and oversold scenarios supported by NVIDIA GPU mentioned below both require a central decision through the scheduler to obtain globally optimal allocation results.

![image](/img/nvlink.jpg)

-From the picture, we can find that although the node has 8 GPU instances whose model is A100/V100, the data transmission speed between GPU instances is different. When a Pod requires multiple GPU instances, we can assign the Pod the GPU instances with the maximum data transfer speed combined relationship. In addition, when we want the GPU instances among a group of Pods to have the maximum data transfer speed combined relationship, the scheduler should batch allocate the best GPU instances to these Pods and assign them to the same node.
+从图中我们可以发现,虽然该节点有 8 个型号为 A100/V100 的 GPU 实例,但 GPU 实例之间的数据传输速度是不同的。当一个 Pod 需要多个 GPU 实例时,我们可以为该 Pod 分配数据传输速度组合最大的 GPU 实例。此外,当我们希望一组 Pod 中的 GPU 实例之间具有最大的数据传输速度组合关系时,调度器应该将最佳的 GPU 实例批量分配给这些 Pod,并将它们分配到同一个节点。
+
+### 目标

-### Goals

+1. 定义 Device CRD 和资源 API。
+1. 在 koordlet 中提供一个 Reporter 组件来上报设备信息和资源容量。
+1. 提供一个调度器插件,支持用户以更细的粒度申请 GPU 资源。
+1. 在 koordlet 中提供一个新的运行时钩子插件,以支持更新被调度器分配了 GPU 的容器的环境变量。

-1. Definition Device CRD and the Resource API.
-1. Provides a reporter component in koordlet to report Device information and resource capacities.
-1. Provides a scheduler plugin to support users to apply at a finer granularity GPU resources.
-1. Provider a new runtime hook plugin in koordlet to support update the environments of containers with GPUs that be allocated by scheduler.
-### Non-goals/Future work
+### 非目标/未来工作

-1. Define flexible allocation strategies, such as implementing BinPacking or Spread according to GPU resources
+1. 定义灵活的分配策略,如根据 GPU 资源实现 BinPacking 或 Spread

-## Proposal
+## 提案

### API

-#### Device resource dimensions
+#### 设备资源维度

-Due to GPU is complicated, we will introduce GPU first. As we all know there is compute and GPU Memory capability for the GPU device. Generally user apply GPU like "I want 1/2/4/8 GPUs", but if node support GPU level isolation mechanism, user may apply GPU like "I want 0.5/0.25 GPU resources". Moreover, user may set different compute capability and GPU memory capability for best resource utilization, so the user want apply GPU like "I want X percent of "compute capability and Y percent of memory capability".
+由于 GPU 比较复杂,我们先介绍一下 GPU。众所周知,GPU 设备具有计算能力和 GPU 内存容量。通常用户会像“我想要 1/2/4/8 个 GPU”那样申请 GPU,但是如果节点支持 GPU 级别的隔离机制,用户可能会像“我想要 0.5/0.25 个 GPU 资源”那样申请 GPU。此外,用户可能会为了最佳的资源利用率而设置不同的计算能力和 GPU 内存容量,因此用户希望像“我想要 X% 的计算能力和 Y% 的内存容量”那样申请 GPU。

-We abstract GPU resources into different dimensions:
+我们将 GPU 资源抽象为不同的维度:

-- `kubernetes.io/gpu-core` represents the computing capacity of the GPU. Similar to K8s MilliCPU, we abstract the total computing power of GPU into one hundred, and users can apply for the corresponding amount of GPU computing power according to their needs.
-- `kubernetes.io/gpu-memory` represents the memory capacity of the GPU in bytes.
-- `kubernetes.io/gpu-memory-ratio` represents the percentage of the GPU's memory.
+- `kubernetes.io/gpu-core` 代表 GPU 的计算能力。与 Kubernetes MilliCPU 类似,我们将 GPU 的总算力抽象为 100,用户可以根据需要申请相应数量的 GPU 算力。
+- `kubernetes.io/gpu-memory` 表示 GPU 的内存容量,以字节为单位。
+- `kubernetes.io/gpu-memory-ratio` 代表 GPU 内存的百分比。

-Assuming that node A has 4 GPU instances, and the total memory of each instance is 8GB, when device reporter reports GPU capacity information to `Node.Status.Allocatable`, it no longer reports nvidia.com/gpu=4, but reports the following information:
+假设节点 A 有 4 个 GPU 实例,每个实例的总内存为 8GB,设备上报组件向 `Node.Status.Allocatable` 上报 GPU 容量信息时,不再上报 `nvidia.com/gpu=4`,而是上报以下信息:

```yaml
status:
@@ -51,9 +53,9 @@ status:
    kubernetes.io/gpu-memory-ratio: 400
```

-For the convenience of users, an independent resource name `kubernetes.io/gpu` is defined. For example, when a user wants to use half of the computing resources and memory resources of a GPU instance, the user can directly declare `kubernetes.io/gpu: 50`, and the scheduler will convert it to `kubernetes.io/gpu-core: 50, kubernetes.io/gpu-memory-ratio: 50`
+为了方便用户,我们定义了一个独立的资源名 `kubernetes.io/gpu`。例如,当用户想使用 GPU 实例一半的计算资源和内存资源时,可以直接声明 `kubernetes.io/gpu: 50`,调度器会将其转换为 `kubernetes.io/gpu-core: 50, kubernetes.io/gpu-memory-ratio: 50`

-For other devices like RDMA and FPGA, the node has 1 RDMA and 1 FGPA, will report the following information:
+对于 RDMA 和 FPGA 等其他设备,假设节点有 1 个 RDMA 和 1 个 FPGA,则会上报以下信息:

```yaml
status:
@@ -65,20 +67,17 @@ status:
    kubernetes.io/fpga: 100
```

-Why do we need `kubernetes.io/gpu-memory-ratio` and `kubernetes.io/gpu-memory` ?
-When user apply 0.5/0.25 GPU, the user don't know the exact memory total bytes per GPU, only wants to use
-half or quarter percentage of memory, so user can request the GPU memory with `kubernetes.io/gpu-memory-ratio`.
-When scheduler assigned Pod on concrete node, scheduler will translate the `kubernetes.io/gpu-memory-ratio` to `kubernetes.io/gpu-memory` by the formulas: ***allocatedMemory = totalMemoryOf(GPU) * `kubernetes.io/gpu-memory-ratio`***, so that the GPU isolation can work.
+为什么我们需要 `kubernetes.io/gpu-memory-ratio` 和 `kubernetes.io/gpu-memory`?当用户申请 0.5/0.25 GPU 时,用户并不知道每个 GPU 的确切内存总字节数,只想使用一半或四分之一的内存,因此可以使用 `kubernetes.io/gpu-memory-ratio` 来申请 GPU 内存。当调度器将 Pod 分配到具体节点时,会通过公式 ***allocatedMemory = totalMemoryOf(GPU) * `kubernetes.io/gpu-memory-ratio`*** 将 `kubernetes.io/gpu-memory-ratio` 转换为 `kubernetes.io/gpu-memory`,这样 GPU 隔离就能发挥作用。
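+
+下面用一小段 Go 代码来示意这个换算过程(仅为示意性草图,`convertGPUMemoryRatio` 等名称为假设,并非实际实现):
+
+```go
+package main
+
+import "fmt"
+
+// 示意:按公式 allocatedMemory = totalMemoryOf(GPU) * ratio / 100,
+// 将 kubernetes.io/gpu-memory-ratio 换算为 kubernetes.io/gpu-memory(单位:字节)。
+func convertGPUMemoryRatio(totalMemoryBytes, ratio int64) int64 {
+	return totalMemoryBytes * ratio / 100
+}
+
+func main() {
+	// 例如:单卡总显存 8GiB、申请 50%,换算后为 4GiB(4294967296 字节)。
+	fmt.Println(convertGPUMemoryRatio(8*1024*1024*1024, 50))
+}
+```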

-During the scheduling filter phase, the scheduler will do special processing for `kubernetes.io/gpu-memory` and `kubernetes.io/gpu-memory-ratio`. When a Pod specifies `kubernetes.io/gpu-memory-ratio`, the scheduler checks each GPU instance on each node for unallocated or remaining resources to ensure that the remaining memory on each GPU instance meets the ratio requirement.
+在调度过滤阶段,调度器会对 `kubernetes.io/gpu-memory` 和 `kubernetes.io/gpu-memory-ratio` 做特殊处理。当 Pod 指定 `kubernetes.io/gpu-memory-ratio` 时,调度器会检查每个节点上的每个 GPU 实例是否有未分配或剩余的资源,以确保每个 GPU 实例上的剩余内存满足比例要求。

-If the user knows exactly or can roughly estimate the specific memory consumption of the workload, he can apply for GPU memory through `kubernetes.io/gpu-memory`. All details can be seen below.
+如果用户确切知道或可以粗略估计工作负载的具体内存消耗,则可以通过 `kubernetes.io/gpu-memory` 申请 GPU 内存。所有细节都可以在下面看到。

-Besides, when dimension's value > 100, means Pod need multi-devices. now only allow the value can be divided by 100.
+此外,当维度的值 > 100 时,意味着 Pod 需要多个设备。目前只允许该值能被 100 整除。

-#### User apply device resources scenarios
+#### 用户申请设备资源场景

-##### Compatible with `nvidia.com/gpu`
+##### 兼容 `nvidia.com/gpu`

```yaml
resources:
@@ -88,7 +87,7 @@ resources:
  memory: "8Gi"
```

-The scheduler translates the `nvida.com/gpu: 2` to the following spec:
+调度器将 `nvidia.com/gpu: 2` 转换为以下描述:

```yaml
resources:
@@ -100,7 +99,7 @@ resources:
  memory: "8Gi"
```

-##### Apply whole resources of GPU or part resources of GPU
+##### 申请 GPU 的全部或部分资源

```yaml
resources:
@@ -110,7 +109,7 @@ resources:
  memory: "8Gi"
```

-The scheduler translates the `kubernetes.io/gpu: "50"` to the following spec:
+调度器将 `kubernetes.io/gpu: "50"` 转换为以下描述:

```yaml
resources:
@@ -122,7 +121,7 @@ resources:
  memory: "8Gi"
```

-##### Apply `kubernetes.io/gpu-core` and `kubernetes.io/gpu-memory-ratio` separately
+##### 分别申请 `kubernetes.io/gpu-core` 和 `kubernetes.io/gpu-memory-ratio`

```yaml
resources:
@@ -133,7 +132,7 @@ resources:
  memory: "8Gi"
```

-##### Apply `kubernetes.io/gpu-core` and `kubernetes.io/gpu-memory` separately
+##### 分别申请 `kubernetes.io/gpu-core` 和 `kubernetes.io/gpu-memory`

```yaml
resources:
@@ -144,7 +143,7 @@ resources:
  memory: "8Gi"
```

-##### Apply RDMA
+##### 申请 RDMA

```yaml
resources:
@@ -154,17 +153,17 @@ resources:
  memory: "8Gi"
```

-### Implementation Details
+### 详细设计

-#### Scheduling
+#### 调度

-1. Abstract new data structure to describe resources and healthy status per device on the node.
-2. Implements the Filter/Reserve/PreBind extenstion points.
-3. Automatically recognize different kind devices. When a new device added, we don't need modify any code
+1. 抽象新的数据结构来描述节点上每个设备的资源和健康状态。
+2. 实现 Filter/Reserve/PreBind 扩展点。
+3. 自动识别不同种类的设备。添加新设备时,我们不需要修改任何代码。

##### DeviceAllocation

-In the PreBind stage, the scheduler will update the device (including GPU) allocation results, including the device's Minor and resource allocation information, to the Pod in the form of annotations.
+在 PreBind 阶段,调度器会将设备(包括 GPU)的分配结果,包括设备的 Minor 和资源分配信息,以注解的形式更新到 Pod 上。

```go
/*
@@ -232,16 +231,16 @@ type deviceResources map[int]corev1.ResourceList

```

-We will register node and device event handler to maintain device account.
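+
+作为补充,下面给出一个假设性的注解示例,用来展示 PreBind 阶段写入 Pod 注解的分配结果的大致形态(注解 key `scheduling.koordinator.sh/device-allocated` 与具体取值仅为示意,请以实际实现为准):
+
+```yaml
+metadata:
+  annotations:
+    # 假设的注解 key;value 为 DeviceAllocations 序列化后的 JSON(仅为示意)
+    scheduling.koordinator.sh/device-allocated: '{"gpu":[{"minor":0,"resources":{"kubernetes.io/gpu-core":"50","kubernetes.io/gpu-memory":"4Gi","kubernetes.io/gpu-memory-ratio":"50"}}]}'
+```
+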
+我们将注册节点和设备事件 handler 来维护设备资源账目。

-- In Filter, we will make-up each device request by a node(the gpu-memory example), and try compare each device free resource and Pod device request.
-- In Reserve/Unreserve, we will update nodeDeviceCache's used/free resource and allocateSet. Now device selection rule just based on device minor id order.
-- In PreBind, we will write DeviceAllocations to Pod's annotation.
-- In Init stage, we should list all Node/Device/Pods to recover device accounts.
+- 在 Filter 阶段,我们会按节点补齐每个设备请求(如上文 gpu-memory 的例子),并尝试将每个设备的空闲资源与 Pod 的设备请求进行比较。
+- 在 Reserve/Unreserve 阶段,我们将更新 nodeDeviceCache 的已用/空闲资源和 allocateSet。目前设备选择规则仅基于设备 Minor ID 的顺序。
+- 在 PreBind 阶段,我们将 DeviceAllocations 写到 Pod 的注解中。
+- 在 Init 阶段,我们会 list 所有 Node/Device/Pod 来恢复设备资源账目。

#### Device Reporter

-Implements a new component called `Device Reporter` in koordlet to create or update `Device` CRD instance with the resources information and healthy status per device including GPU, RDMA and FPGA, etc. This version we only support GPU. It will execution `nccl` commands to get each minor resource just like k8s-gpu-device-plugins. We will apply community health check logic.
+在 koordlet 中实现一个名为 `Device Reporter` 的新组件,用于创建或更新 `Device` CRD 实例,其中包含每个设备(包括 GPU、RDMA 和 FPGA 等)的资源信息和健康状态。当前版本我们只支持 GPU。它会像 k8s-gpu-device-plugins 一样执行 `nccl` 命令来获取每个 Minor 设备的资源。我们将复用社区的健康检查逻辑。

#### Device CRD Scheme definition
```go
@@ -288,9 +287,9 @@ type DeviceList struct {
}
```

-##### Compatible
+##### 兼容性

-Considering that some users already have many existing GPU Pods in their clusters, it is necessary to ensure that Koordinator GPU Scheduling does not repeatedly allocate the GPU devices held by these GPU Pods. Therefore, koord-scheduler needs to obtain the GPU devices's information held by these existing Pods. These GPU devices are allocated by the kubelet and recorded in the local file `/var/lib/kubelet/device-plugins/kubelet_internal_checkpoint`, so the device reporter will parse the file to obtain the GPU Device ID assigned to each Pod. When parsing, it needs to exclude the Pod that allocates GPU through koord-scheduler, and finally update it to Device CRD in the form of annotation. The corresponding annotation key is `node.koordinator.sh/devices-checkpoints`, and the annotation value is defined as follows:
+考虑到一些用户的集群中已经存在很多 GPU Pod,需要保证 Koordinator GPU Scheduling 不会重复分配这些 GPU Pod 持有的 GPU 设备。因此,koord-scheduler 需要获取这些现有 Pod 持有的 GPU 设备的信息。这些 GPU 设备由 kubelet 分配并记录在本地文件 `/var/lib/kubelet/device-plugins/kubelet_internal_checkpoint` 中,因此 device reporter 会解析该文件,获取分配给每个 Pod 的 GPU Device ID。解析时需要排除掉通过 koord-scheduler 分配 GPU 的 Pod,最后以注解的形式更新到 Device CRD。对应的注解 key 为 `node.koordinator.sh/devices-checkpoints`,注解 value 定义如下:

```go
type PodDevicesEntry struct {
@@ -304,7 +303,7 @@ type PodDevicesEntry struct {
type PodDevicesEntries []PodDevicesEntry
```

-#### CRD Example
+#### CRD 示例
```yaml
apiVersion: scheduling.koordinator.sh/v1alpha1
kind: Device
@@ -341,17 +340,15 @@ spec:
status: {}
```

-#### koordlet and koord-runtime-proxy
+#### koordlet 和 koord-runtime-proxy

-Our target is to work compatible with origin k8s kubelet and k8s device plugins, so:
+我们的目标是与原生的 k8s kubelet 和 k8s 设备插件兼容,因此:

-1. We still allow kubelet and device plugin to allocate concrete device, which means no matter there's a k8s device
-plugin or not, our design can work well.
+1. 我们仍然允许 kubelet 和设备插件分配具体设备,这意味着无论是否有 k8s 设备插件,我们的设计都可以正常工作。

-2. In koord-runtime-proxy, we will use Pod's `DeviceAllocation` in annotation to replace the step1's result of container's
-args and envs.

+2. 在 koord-runtime-proxy 中,我们会使用 Pod 的 `DeviceAllocation` 注解来替换步骤 1 中容器的 args 和 envs 的结果。

-We should modify protocol between koord-runtime-proxy and koordlet to add container env:
+我们会修改 koord-runtime-proxy 和 koordlet 之间的协议来添加容器环境变量:

```go
type ContainerResourceHookRequest struct {
@@ -365,44 +362,39 @@ type ContainerResourceHookResponse struct {
}
```

-Then we will add a new `gpu-hook` in koordlet's runtimehooks, registered to `PreCreateContainer` stage.
-We will generate new GPU env `NVIDIA_VISIBLE_DEVICES` by Pod GPU allocation result in annotation.
+然后我们将在 koordlet 的 runtimehooks 中添加一个新的 `gpu-hook`,注册到 `PreCreateContainer` 阶段。我们将根据 Pod 注解中的 GPU 分配结果生成新的 GPU 环境变量 `NVIDIA_VISIBLE_DEVICES`。

-The koord-runtime-proxy can see these Pod's env, we need koord-runtime-proxy to pass these environments to koordlet, and koordlet parse the GPU related env to find the concrete device ids.
+koord-runtime-proxy 可以看到这些 Pod 的环境变量,我们需要 koord-runtime-proxy 将这些环境变量传递给 koordlet,由 koordlet 解析 GPU 相关的环境变量来找到具体的设备 ID。

-Besides, the koordlet should report GPU model to node labels same as device plugin, this is in-case Koordinator working without device-plugin.
+此外,koordlet 应像设备插件一样将 GPU 型号上报到节点标签,以便 Koordinator 在没有设备插件的情况下也能正常工作。

-Finally, we should modify `ContainerResourceExecutor`'s `UpdateRequest` function in koord-runtime-proxy, and let new GPU env covering old GPU env.
+最后,我们会修改 koord-runtime-proxy 中 `ContainerResourceExecutor` 的 `UpdateRequest` 函数,让新的 GPU 环境变量覆盖旧的 GPU 环境变量。

-When we handle hot-update processing, we can handle the existing scheduled Pods without device allocation in Pod's annotation. If GPU allocation info is not in annotation, we will find the GPU allocations from `ContainerResourceHookRequest`'s `Env`, and we will update all GPU allocations to Device CRD instance.
+当我们处理热更新时,我们可以处理那些已调度但注解中没有设备分配信息的 Pod。如果注解中没有 GPU 分配信息,我们会从 `ContainerResourceHookRequest` 的 `Env` 中找到 GPU 分配信息,并将所有 GPU 分配信息更新到 Device CRD 实例。

-### Compatibility
+### 兼容性

-As we know, the GPU scheduling in kube-scheduler side has no any different with other scalar resources. The concrete device-level assigning is done by kubelet and GPU device plugin, which will generate container's GPU env.
+众所周知,kube-scheduler 侧的 GPU 调度与其他标量资源没有任何区别。具体的设备级分配由 kubelet 和 GPU device plugin 完成,并由其生成容器的 GPU 环境变量。

-Our design has no conflict with the above process. Our device reporter reports Koordinator GPU resources for kubelet
-updating node resources. Then we schedule device request in our new plugin with new device resource account. In pre-bind
-stage, we will update container resources with Koordinator GPU resources, this is for kubelet to check resource limitation.
-We will also add device allocation information to Pod's annotation. In node side, the k8s device plugin will first patch
-container env, but we will overwrite these envs in runtimeproxy by allocation result in Pod's annotation.
+我们的设计与上述流程没有冲突。`Device Reporter` 组件上报 Koordinator GPU 资源,供 kubelet 更新节点资源。然后我们的新插件基于新的设备资源账目来调度设备请求。在 PreBind 阶段,我们会使用 Koordinator GPU 资源更新容器资源,以便 kubelet 检查资源限制。我们还会在 Pod 的注解中添加设备分配信息。在节点侧,k8s device plugin 会先 patch 容器的 env,但我们会在 runtimeproxy 中用 Pod 注解中的分配结果覆盖这些 env。

-### Upgrade strategy
+### 升级策略

-If using Koordinator GPU Scheduling to schedule GPU Pods in a brand new cluster, simply install Koordinator components.
+如果使用 Koordinator GPU Scheduling 在全新的集群中调度 GPU Pod,只需安装 Koordinator 组件即可。
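+
+例如,新建的 GPU Pod 可以将 `schedulerName` 指定为 koord-scheduler,并按前文的资源维度申请 GPU(以下仅为示意,镜像名为假设):
+
+```yaml
+apiVersion: v1
+kind: Pod
+metadata:
+  name: gpu-pod-example
+spec:
+  schedulerName: koord-scheduler
+  containers:
+    - name: main
+      image: cuda-app:latest # 假设的镜像名
+      resources:
+        limits:
+          kubernetes.io/gpu: "50"
+          cpu: "4"
+          memory: "8Gi"
+```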

-However, if you want to upgrade to Koordinator GPU Scheduing in an existing cluster, you need to avoid GPU devices being repeatedly allocated because of switching between different scheduling mechanisms. You need to pay attention to the order when upgrading:
-1. Install the Koordinator components. In particular, make sure that the koordlets are all started successfully.
-2. Stop the system or platform that creates the new GPU Pod.
-3. Stop the scheduler currently responsible for the GPU Pod and ensure that there are no pending GPU Pods in the current cluster.
-3. Wait a few minutes to ensure that each node's koordlet creates and updates the Device CRD.
-4. Modify all components that create GPU Pods to switch the schedulerName of the Pod to koord-scheduler
-5. Start trying to create a GPU Pod and verify the koord-scheduler GPU Scheduling scheduling result.
-6. Restore the system or platform that created the GPU Pod and the old scheduler.
+但是,如果要在现有集群中升级到 Koordinator GPU Scheduling,则需要避免 GPU 设备因为在不同调度机制之间切换而被重复分配。升级时需要注意以下顺序:
+1. 安装 Koordinator 组件。特别要确保 koordlet 都已成功启动。
+2. 停止创建新 GPU Pod 的系统或平台。
+3. 停止当前负责 GPU Pod 调度的调度器,并确保当前集群中没有 pending 的 GPU Pod。
+4. 等待几分钟,以确保每个节点的 koordlet 都已创建并更新 Device CRD。
+5. 修改所有创建 GPU Pod 的组件,将 Pod 的 schedulerName 切换为 koord-scheduler。
+6. 开始尝试创建 GPU Pod,并验证 koord-scheduler GPU Scheduling 的调度结果。
+7. 恢复创建 GPU Pod 的系统或平台以及旧的调度器。

-In the future Koordinator will provide a webhook to solve the upgrade existing cluster problem. The webhook will identify the GPU Pod and modify the schedulerName of the newly created GPU Pod to koord-scheduler. At the same time, the webhook will take over the Binding operation of the GPU Pod. If the Binding is not initiated by koord-scheduler, it will be rejected.
+未来 Koordinator 会提供一个 webhook 来解决现有集群的升级问题。该 webhook 会识别 GPU Pod,并将新创建的 GPU Pod 的 schedulerName 修改为 koord-scheduler。同时,webhook 将接管 GPU Pod 的 Binding 操作。如果 Binding 不是由 koord-scheduler 发起的,它将被拒绝。

-## Unsolved Problems
+## 未解决的问题

-## Alternatives
+## 备选方案

-1. User can choose whether use k8s-device plugin. as mentioned above, we can compatible in both cases.
+1. 用户可以选择是否使用 k8s 设备插件。如上所述,这两种场景我们都可以兼容。