Support memory qos using cgroups v2.
xiaoxubeii committed Mar 17, 2021
1 parent ada8226 commit 221b648
Showing 2 changed files with 172 additions and 0 deletions.
165 changes: 165 additions & 0 deletions keps/sig-node/2570-memory-qos/README.md
@@ -0,0 +1,165 @@
# KEP-2570: Support Memory QoS using cgroups v2
## Summary
Support memory QoS for pod QoS classes using cgroups v2.

## Motivation
In the existing cgroups v1 implementation in Kubernetes, the pod QoS class and QoS manager only tune CPU-related knobs such as cpu_shares; memory QoS is not implemented. cgroups v2 brings new capabilities to the memory controller that would help Kubernetes improve the quality of memory isolation.

### Goals
- Support memory QoS for pod QoS classes

### Non-Goals
- Additional QoS designs
- QoS support for other resources

## Proposal
Use the cgroups v2 memory controller interfaces to support memory QoS for pod QoS classes.

The Kubernetes cgroup manager uses `memory.limit_in_bytes` in cgroups v1 and `memory.max` in cgroups v2 to cap memory usage, and uses OOM scores to determine the order in which container processes are killed when an OOM occurs.

This implementation has the following disadvantages:
- `Guaranteed` memory cannot be fully reserved; page cache is at risk of being reclaimed
- Memory overcommit for `Burstable` pods (request < limit) increases the chance of OOM kills when node memory runs out

Currently we only apply a hard memory limit: `memory.limit_in_bytes = limits.memory` in cgroups v1 and `memory.max = limits.memory` in cgroups v2.

cgroups v2 introduces better mechanisms to limit and protect memory:

| File | Description |
| -------- | -------- |
| memory.high | memory.high is the memory usage throttle limit. This is the main mechanism to control a cgroup’s memory use. If a cgroup's memory use goes over the high boundary specified here, the cgroup’s processes are throttled and put under heavy reclaim pressure. The default is max, meaning there is no limit. |
| memory.max | memory.max is the memory usage hard limit, acting as the final protection mechanism: If a cgroup's memory usage reaches this limit and can't be reduced, the system OOM killer is invoked on the cgroup. Under certain circumstances, usage may go over the memory.high limit temporarily. When the high limit is used and monitored properly, memory.max serves mainly to provide the final safety net. The default is max. |
| memory.low | memory.low is the best-effort memory protection, a “soft guarantee” that if the cgroup and all its descendants are below this threshold, the cgroup's memory won't be reclaimed unless memory can’t be reclaimed from any unprotected cgroups. |
| memory.min | memory.min specifies a minimum amount of memory the cgroup must always retain, i.e., memory that can never be reclaimed by the system. If the cgroup’s memory usage reaches this low limit and can’t be increased, the system OOM killer will be invoked. |
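
For background, these knobs are plain files under a cgroup directory; setting one amounts to writing a decimal byte count (or `max`) into the corresponding interface file. The sketch below is for illustration only and is not part of this KEP's code path (the kubelet goes through the cgroup manager and runc, as described later); the cgroup path is hypothetical.

```
// Sketch only: set memory.high by writing the cgroups v2 interface file.
// The kubelet does not write these files directly; runc handles the writes.
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

func setMemoryHigh(cgroupDir string, bytes int64) error {
	data := []byte(fmt.Sprintf("%d", bytes))
	return os.WriteFile(filepath.Join(cgroupDir, "memory.high"), data, 0644)
}

func main() {
	// Hypothetical pod cgroup directory; throttle at 256Mi.
	if err := setMemoryHigh("/sys/fs/cgroup/kubepods/burstable/example-pod", 256<<20); err != nil {
		fmt.Println(err)
	}
}
```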

This proposal uses `memory.min` to reserve the requested memory for `Guaranteed` pods. For `Burstable` pods, it uses `memory.low` to provide best-effort protection of the requested memory and `memory.high` to throttle allocation above the request when pods specify `requests.memory`. We also keep `memory.max = limits.memory` as the hard limit so the OOM killer still applies. For example, a `Burstable` pod with `requests.memory=256Mi` and `limits.memory=1Gi` would get `memory.low = memory.high = 268435456` and `memory.max = 1073741824` at the pod level.

### User Stories (Optional)
- A workload is memory-sensitive: this feature reserves its requested memory, reducing allocation latency
- A workload overcommits memory: this feature throttles allocation above the request, improving stability by reducing the risk of kernel OOM kills

### Risks and Mitigations
- Memory reserved for `Guaranteed` pods reduces the memory available to the rest of the node
- Memory allocation for `Burstable` pods slows down once usage exceeds the request

## Design Details
### Feature Gate
Set `--feature-gates=MemoryQOS=true` on the kubelet to enable the feature.
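
For reference, a minimal sketch of how the gate would typically be registered in `pkg/features/kube_features.go`, following the usual Kubernetes feature-gate pattern (exact wiring is up to the implementation):

```
// Sketch: feature gate registration.
const (
	// MemoryQOS enables memory QoS via the cgroups v2 memory controller.
	MemoryQOS featuregate.Feature = "MemoryQOS"
)

var defaultKubernetesFeatureGates = map[featuregate.Feature]featuregate.FeatureSpec{
	...
	MemoryQOS: {Default: false, PreRelease: featuregate.Alpha},
}
```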

### CGroup Resource
We add a new `Unified` field to `ResourceConfig` to hold `memory.min / low / high`, leaving room for other cgroups v2 memory controller parameters later. Its meaning mirrors the `Unified` field proposed for the OCI runtime spec: https://github.com/opencontainers/runtime-spec/pull/1040
```
pkg/kubelet/cm/types.go

// ADDED
// ResourceQoSType names a cgroups v2 memory controller interface file.
type ResourceQoSType string

const (
	MemoryLow  ResourceQoSType = "memory.low"
	MemoryHigh ResourceQoSType = "memory.high"
	MemoryMin  ResourceQoSType = "memory.min"
)

// ResourceConfig holds information about all the supported cgroup resource parameters.
type ResourceConfig struct {
	...
	// ADDED
	// Unified holds cgroups v2 parameters keyed by interface file name.
	Unified map[ResourceQoSType]int64
	...
}
```
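
For example, a pod-level cgroup config for a `Guaranteed` pod requesting 2Gi could carry the following (illustrative values; the actual population happens in `setMemoryQoSConfig` below):

```
// Sketch: a pod-level CgroupConfig carrying cgroups v2 memory parameters
// via the new Unified field. Values are hypothetical.
cgroupConfig := &CgroupConfig{
	Name: GetPodContainerName(pod),
	ResourceParameters: &ResourceConfig{
		Unified: map[ResourceQoSType]int64{
			MemoryMin: 2 * 1024 * 1024 * 1024, // requests.memory = 2Gi
		},
	},
}
```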

### Memory QoS Rules
| QoS Class | memory.min | memory.max | memory.low | memory.high |
| -------- | -------- | -------- | -------- | -------- |
| Guaranteed | requests.memory | limits.memory | n/a | n/a |
| Burstable | n/a | limits.memory | requests.memory | requests.memory |

Loop over the active pods and update pod-level cgroup configs according to the rules above. This feature only takes effect in cgroups v2 mode.
```
pkg/kubelet/cm/qos_container_manager_linux.go

func (m *qosContainerManagerImpl) UpdateCgroups() error {
	m.Lock()
	defer m.Unlock()
	qosConfigs := map[v1.PodQOSClass]*CgroupConfig{
		v1.PodQOSBurstable: {
			Name:               m.qosContainersInfo.Burstable,
			ResourceParameters: &ResourceConfig{},
		},
		v1.PodQOSBestEffort: {
			Name:               m.qosContainersInfo.BestEffort,
			ResourceParameters: &ResourceConfig{},
		},
	}
	...
	// ADDED
	// Collect per-pod memory QoS configs when the feature gate is on and
	// the node runs cgroups v2 in unified mode.
	var configs []*CgroupConfig
	if utilfeature.DefaultFeatureGate.Enabled(kubefeatures.MemoryQOS) && libcontainercgroups.IsCgroup2UnifiedMode() {
		configs = m.setMemoryQoSConfig()
	}
	...
	configs = append(configs, qosConfigs[v1.PodQOSBurstable], qosConfigs[v1.PodQOSBestEffort])
	for _, config := range configs {
		err := m.cgroupManager.Update(config)
		if err != nil {
			klog.Errorf("[ContainerManager]: Failed to update QoS cgroup configuration")
			return err
		}
	}
	klog.V(4).Infof("[ContainerManager]: Updated QoS cgroup configuration")
	return nil
}
```

Add a new function `setMemoryQoSConfig` that sets memory QoS in the cgroup config when a pod is `Guaranteed` or `Burstable`.
```
pkg/kubelet/cm/qos_container_manager_linux.go

// ADDED
func (m *qosContainerManagerImpl) setMemoryQoSConfig() []*CgroupConfig {
	pods := m.activePods()
	cgs := make([]*CgroupConfig, 0)
	for i := range pods {
		pod := pods[i]
		qosClass := v1qos.GetPodQOS(pod)
		// BestEffort pods get no memory protection or throttling.
		if qosClass == v1.PodQOSBestEffort {
			continue
		}
		// Aggregate container memory requests for the pod.
		reqs, _ := resource.PodRequestsAndLimits(pod)
		cgroupConfig := CgroupConfig{
			Name: GetPodContainerName(pod),
			ResourceParameters: &ResourceConfig{
				Unified: map[ResourceQoSType]int64{},
			},
		}
		if qosClass == v1.PodQOSGuaranteed {
			cgroupConfig.ResourceParameters.Unified[MemoryMin] = reqs.Memory().Value()
		} else if qosClass == v1.PodQOSBurstable {
			cgroupConfig.ResourceParameters.Unified[MemoryLow] = reqs.Memory().Value()
			cgroupConfig.ResourceParameters.Unified[MemoryHigh] = reqs.Memory().Value()
		}
		cgs = append(cgs, &cgroupConfig)
	}
	return cgs
}
```

### CGroups v2 Support
Since Kubernetes v1.19, the kubelet can identify cgroups v2 and do the conversion. Since [v1.0.0-rc93](https://github.com/opencontainers/runc/releases/tag/v1.0.0-rc93), runc supports `Unified` to pass through cgroups v2 parameters. So we use this field to pass `memory.min / memory.low / memory.high` when cgroups v2 mode is detected.

```
pkg/kubelet/cm/cgroup_manager_linux.go

func (m *cgroupManagerImpl) toResources(resourceConfig *ResourceConfig) *libcontainerconfigs.Resources {
	...
	if libcontainercgroups.IsCgroup2UnifiedMode() {
		if resources.Unified == nil {
			resources.Unified = make(map[string]string)
		}
		// Pass the cgroups v2 memory parameters through libcontainer's Unified
		// map, which expects string values keyed by interface file name.
		if v, ok := resourceConfig.Unified[MemoryLow]; ok {
			resources.Unified["memory.low"] = strconv.FormatInt(v, 10)
		}
		if v, ok := resourceConfig.Unified[MemoryHigh]; ok {
			resources.Unified["memory.high"] = strconv.FormatInt(v, 10)
		}
		if v, ok := resourceConfig.Unified[MemoryMin]; ok {
			resources.Unified["memory.min"] = strconv.FormatInt(v, 10)
		}
	}
	...
}
```
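
Since libcontainer's `Resources.Unified` maps interface file names to string values, the conversion above ends up handing runc entries like the following (a minimal sketch, assuming a `Burstable` pod with `requests.memory = 256Mi` and the int64 values formatted as decimal strings):

```
// Sketch: the libcontainer resources produced for runc after conversion.
resources := &libcontainerconfigs.Resources{
	// Other fields (CpuShares, Memory, ...) elided.
	Unified: map[string]string{
		"memory.low":  "268435456",
		"memory.high": "268435456",
	},
}
```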
7 changes: 7 additions & 0 deletions keps/sig-node/2570-memory-qos/kep.yaml
@@ -0,0 +1,7 @@
title: memory qos
kep-number: 2570
authors:
- "@xiaoxubeii"
owning-sig: sig-node
editor: Tim Xu
creation-date: 2021-03-14
