Support memory qos using cgroups v2.
xiaoxubeii committed Mar 17, 2021
1 parent ada8226 commit 221b648
Showing 2 changed files with 172 additions and 0 deletions.
165 changes: 165 additions & 0 deletions keps/sig-node/2570-memory-qos/README.md
@@ -0,0 +1,165 @@
# KEP-2570: Support Memory QoS using cgroups v2
## Summary
Support memory QoS for pod QoS classes using cgroups v2.

## Motivation
In the existing cgroups v1 implementation in Kubernetes, the pod QoS class and QoS manager only tune CPU-related knobs such as cpu_shares; memory QoS is not implemented. cgroups v2 brings new capabilities to the memory controller that would help Kubernetes improve the quality of memory isolation.

### Goals
- Support memory QoS for pod QoS classes

### Non-Goals
- Additional QoS designs
- QoS support for other resources

## Proposal
Use the cgroups v2 memory controller interfaces to support memory QoS for pod QoS classes.

The Kubernetes cgroup manager uses `memory.limit_in_bytes` in cgroups v1 and `memory.max` in cgroups v2 to cap memory usage, and uses OOM scores to determine the order in which container processes are killed when an OOM occurs.

This implementation has the following disadvantages:
- `Guaranteed` memory cannot be fully reserved; page cache is at risk of being reclaimed
- Memory overcommit for `Burstable` pods (request < limit) increases the chance of OOM kills when node memory runs out

Currently we only apply a hard memory limit: `memory.limit_in_bytes = limits.memory` in cgroups v1 and `memory.max = limits.memory` in cgroups v2.

cgroups v2 introduces better mechanisms to limit and protect memory:

| File | Description |
| -------- | -------- |
| memory.high | memory.high is the memory usage throttle limit. This is the main mechanism to control a cgroup’s memory use. If a cgroup's memory use goes over the high boundary specified here, the cgroup’s processes are throttled and put under heavy reclaim pressure. The default is max, meaning there is no limit. |
| memory.max | memory.max is the memory usage hard limit, acting as the final protection mechanism: If a cgroup's memory usage reaches this limit and can't be reduced, the system OOM killer is invoked on the cgroup. Under certain circumstances, usage may go over the memory.high limit temporarily. When the high limit is used and monitored properly, memory.max serves mainly to provide the final safety net. The default is max. |
| memory.low | memory.low is the best-effort memory protection, a “soft guarantee” that if the cgroup and all its descendants are below this threshold, the cgroup's memory won't be reclaimed unless memory can’t be reclaimed from any unprotected cgroups. |
| memory.min | memory.min specifies a minimum amount of memory the cgroup must always retain, i.e., memory that can never be reclaimed by the system. If the cgroup’s memory usage reaches this low limit and can’t be increased, the system OOM killer will be invoked. |
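
For background, these knobs are plain files under a cgroup directory; setting one amounts to writing a decimal byte count (or `max`) into the corresponding interface file. The sketch below is for illustration only and is not part of this KEP's code path (the kubelet goes through the cgroup manager and runc, as described later); the cgroup path is hypothetical.

```
// Sketch only: set memory.high by writing the cgroups v2 interface file.
// The kubelet does not write these files directly; runc handles the writes.
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

func setMemoryHigh(cgroupDir string, bytes int64) error {
	data := []byte(fmt.Sprintf("%d", bytes))
	return os.WriteFile(filepath.Join(cgroupDir, "memory.high"), data, 0644)
}

func main() {
	// Hypothetical pod cgroup directory; throttle at 256Mi.
	if err := setMemoryHigh("/sys/fs/cgroup/kubepods/burstable/example-pod", 256<<20); err != nil {
		fmt.Println(err)
	}
}
```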

This proposal uses `memory.min` to reserve the requested memory for `Guaranteed` pods. For `Burstable` pods, it uses `memory.low` to provide best-effort protection of the requested memory and `memory.high` to throttle allocation above the request when pods specify `requests.memory`. We also keep `memory.max = limits.memory` as the hard limit so the OOM killer still applies. For example, a `Burstable` pod with `requests.memory=256Mi` and `limits.memory=1Gi` would get `memory.low = memory.high = 268435456` and `memory.max = 1073741824` at the pod level.

### User Stories (Optional)
- A workload is memory-sensitive: this feature reserves its requested memory, reducing allocation latency
- A workload overcommits memory: this feature throttles allocation above the request, improving stability by reducing the risk of kernel OOM kills

### Risks and Mitigations
- Memory reserved for `Guaranteed` pods reduces the memory available to the rest of the node
- Memory allocation for `Burstable` pods slows down once usage exceeds the request

## Design Details
### Feature Gate
Set `--feature-gates=MemoryQOS=true` on the kubelet to enable the feature.
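
For reference, a minimal sketch of how the gate would typically be registered in `pkg/features/kube_features.go`, following the usual Kubernetes feature-gate pattern (exact wiring is up to the implementation):

```
// Sketch: feature gate registration.
const (
	// MemoryQOS enables memory QoS via the cgroups v2 memory controller.
	MemoryQOS featuregate.Feature = "MemoryQOS"
)

var defaultKubernetesFeatureGates = map[featuregate.Feature]featuregate.FeatureSpec{
	...
	MemoryQOS: {Default: false, PreRelease: featuregate.Alpha},
}
```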

### CGroup Resource
We add a new `Unified` field to `ResourceConfig` to hold `memory.min / low / high`, leaving room for other cgroups v2 memory controller parameters later. Its meaning mirrors the `Unified` field proposed for the OCI runtime spec: https://github.com/opencontainers/runtime-spec/pull/1040
```
pkg/kubelet/cm/types.go

// ADDED
// ResourceQoSType names a cgroups v2 memory controller interface file.
type ResourceQoSType string

const (
	MemoryLow  ResourceQoSType = "memory.low"
	MemoryHigh ResourceQoSType = "memory.high"
	MemoryMin  ResourceQoSType = "memory.min"
)

// ResourceConfig holds information about all the supported cgroup resource parameters.
type ResourceConfig struct {
	...
	// ADDED
	// Unified holds cgroups v2 parameters keyed by interface file name.
	Unified map[ResourceQoSType]int64
	...
}
```
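
For example, a pod-level cgroup config for a `Guaranteed` pod requesting 2Gi could carry the following (illustrative values; the actual population happens in `setMemoryQoSConfig` below):

```
// Sketch: a pod-level CgroupConfig carrying cgroups v2 memory parameters
// via the new Unified field. Values are hypothetical.
cgroupConfig := &CgroupConfig{
	Name: GetPodContainerName(pod),
	ResourceParameters: &ResourceConfig{
		Unified: map[ResourceQoSType]int64{
			MemoryMin: 2 * 1024 * 1024 * 1024, // requests.memory = 2Gi
		},
	},
}
```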

### Memory QoS Rules
| QoS Class | memory.min | memory.max | memory.low | memory.high |
| -------- | -------- | -------- | -------- | -------- |
| Guaranteed | requests.memory | limits.memory | n/a | n/a |
| Burstable | n/a | limits.memory | requests.memory | requests.memory |

Loop over the active pods and update pod-level cgroup configs according to the rules above. This feature only takes effect in cgroups v2 mode.
```
pkg/kubelet/cm/qos_container_manager_linux.go

func (m *qosContainerManagerImpl) UpdateCgroups() error {
	m.Lock()
	defer m.Unlock()
	qosConfigs := map[v1.PodQOSClass]*CgroupConfig{
		v1.PodQOSBurstable: {
			Name:               m.qosContainersInfo.Burstable,
			ResourceParameters: &ResourceConfig{},
		},
		v1.PodQOSBestEffort: {
			Name:               m.qosContainersInfo.BestEffort,
			ResourceParameters: &ResourceConfig{},
		},
	}
	...
	// ADDED
	// Collect per-pod memory QoS configs when the feature gate is on and
	// the node runs cgroups v2 in unified mode.
	var configs []*CgroupConfig
	if utilfeature.DefaultFeatureGate.Enabled(kubefeatures.MemoryQOS) && libcontainercgroups.IsCgroup2UnifiedMode() {
		configs = m.setMemoryQoSConfig()
	}
	...
	configs = append(configs, qosConfigs[v1.PodQOSBurstable], qosConfigs[v1.PodQOSBestEffort])
	for _, config := range configs {
		err := m.cgroupManager.Update(config)
		if err != nil {
			klog.Errorf("[ContainerManager]: Failed to update QoS cgroup configuration")
			return err
		}
	}
	klog.V(4).Infof("[ContainerManager]: Updated QoS cgroup configuration")
	return nil
}
```

Add a new function `setMemoryQoSConfig` that sets memory QoS in the cgroup config when a pod is `Guaranteed` or `Burstable`.
```
pkg/kubelet/cm/qos_container_manager_linux.go

// ADDED
func (m *qosContainerManagerImpl) setMemoryQoSConfig() []*CgroupConfig {
	pods := m.activePods()
	cgs := make([]*CgroupConfig, 0)
	for i := range pods {
		pod := pods[i]
		qosClass := v1qos.GetPodQOS(pod)
		// BestEffort pods get no memory protection or throttling.
		if qosClass == v1.PodQOSBestEffort {
			continue
		}
		// Aggregate container memory requests for the pod.
		reqs, _ := resource.PodRequestsAndLimits(pod)
		cgroupConfig := CgroupConfig{
			Name: GetPodContainerName(pod),
			ResourceParameters: &ResourceConfig{
				Unified: map[ResourceQoSType]int64{},
			},
		}
		if qosClass == v1.PodQOSGuaranteed {
			cgroupConfig.ResourceParameters.Unified[MemoryMin] = reqs.Memory().Value()
		} else if qosClass == v1.PodQOSBurstable {
			cgroupConfig.ResourceParameters.Unified[MemoryLow] = reqs.Memory().Value()
			cgroupConfig.ResourceParameters.Unified[MemoryHigh] = reqs.Memory().Value()
		}
		cgs = append(cgs, &cgroupConfig)
	}
	return cgs
}
```

### CGroups v2 Support
Since Kubernetes v1.19, the kubelet can identify cgroups v2 and do the conversion. Since [v1.0.0-rc93](https://github.com/opencontainers/runc/releases/tag/v1.0.0-rc93), runc supports `Unified` to pass through cgroups v2 parameters. So we use this field to pass `memory.min / memory.low / memory.high` when cgroups v2 mode is detected.

```
pkg/kubelet/cm/cgroup_manager_linux.go

func (m *cgroupManagerImpl) toResources(resourceConfig *ResourceConfig) *libcontainerconfigs.Resources {
	...
	if libcontainercgroups.IsCgroup2UnifiedMode() {
		if resources.Unified == nil {
			resources.Unified = make(map[string]string)
		}
		// Pass the cgroups v2 memory parameters through libcontainer's Unified
		// map, which expects string values keyed by interface file name.
		if v, ok := resourceConfig.Unified[MemoryLow]; ok {
			resources.Unified["memory.low"] = strconv.FormatInt(v, 10)
		}
		if v, ok := resourceConfig.Unified[MemoryHigh]; ok {
			resources.Unified["memory.high"] = strconv.FormatInt(v, 10)
		}
		if v, ok := resourceConfig.Unified[MemoryMin]; ok {
			resources.Unified["memory.min"] = strconv.FormatInt(v, 10)
		}
	}
	...
}
```
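
Since libcontainer's `Resources.Unified` maps interface file names to string values, the conversion above ends up handing runc entries like the following (a minimal sketch, assuming a `Burstable` pod with `requests.memory = 256Mi` and the int64 values formatted as decimal strings):

```
// Sketch: the libcontainer resources produced for runc after conversion.
resources := &libcontainerconfigs.Resources{
	// Other fields (CpuShares, Memory, ...) elided.
	Unified: map[string]string{
		"memory.low":  "268435456",
		"memory.high": "268435456",
	},
}
```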
7 changes: 7 additions & 0 deletions keps/sig-node/2570-memory-qos/kep.yaml
@@ -0,0 +1,7 @@
title: memory qos
kep-number: 2570
authors:
- "@xiaoxubeii"
owning-sig: sig-node
editor: Tim Xu
creation-date: 2021-03-14
