Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

doc: add design doc for RBD QoS #4089

Merged
merged 1 commit into from
Sep 7, 2023
Merged
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
271 changes: 271 additions & 0 deletions docs/design/proposals/rbd-qos.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,271 @@
# Design Doc for RBD QoS using cgroup v2

## Introduction

The RBD QoS (Quality of Service) design aims to address the issue of IO noisy
neighbor problems encountered in early Ceph deployments catering to OpenStack
environments. These problems were effectively managed by implementing QEMU
throttling at the virtio-blk/scsi level. To further enhance this,
capacity-based IOPS were introduced, providing a more dynamic experience
similar to public cloud environments.

The challenge arises in virtual environments, where a noisy neighbor can lead
to performance degradation for other instances sharing the same resources.
Although it's uncommon to observe noisy neighbor issues in Kubernetes
environments backed by Ceph storage, the possibility exists. The existing QoS
support with rbd-nbd doesn't apply to krbd, and as rbd-nbd isn't suitable for
container production workloads, a solution is needed for krbd.

To mitigate resource starvation issues, setting QoS at the device level through
cgroup v2 when enabled becomes crucial. This approach guarantees that I/O
capacity isn't overcommitted and is fairly distributed among workloads.

## Dependency

* cgroup v2 must be enabled on the Node
* We might have Kubernetes dependency as well
* Container runtime dependency that supports cgroupv2

## Manual steps for implementing RBD QoS in a Kubernetes Cluster

```bash
[$] ssh root@node1
sh-4.4# chroot /host
sh-5.1# cat /proc/partitions
major minor #blocks name

259 0 125829120 nvme0n1
259 1 1024 nvme0n1p1
259 2 130048 nvme0n1p2
259 3 393216 nvme0n1p3
259 4 125303791 nvme0n1p4
259 6 52428800 nvme2n1
7 0 536870912 loop0
259 5 536870912 nvme1n1
252 0 52428800 rbd0
sh-5.1#
```

Once the rbd device is mapped on the node we get the device's major and minor
number we need to set the io limit on the device but we need to find the right
cgroup file where we need to set the limit

Kubernetes/Openshift creates a custom cgroup hierarchy for the pods it created
but start is `/sys/fs/cgroup` folder

```bash
sh-5.1# cd /sys/fs/cgroup/
sh-5.1# ls
cgroup.controllers cgroup.subtree_control cpuset.mems.effective io.stat memory.reclaim sys-kernel-debug.mount
cgroup.max.depth cgroup.threads dev-hugepages.mount kubepods.slice memory.stat sys-kernel-tracing.mount
cgroup.max.descendants cpu.pressure dev-mqueue.mount machine.slice misc.capacity system.slice
cgroup.procs cpu.stat init.scope memory.numa_stat sys-fs-fuse-connections.mount user.slice
cgroup.stat cpuset.cpus.effective io.pressure memory.pressure sys-kernel-config.mount
```

`kubepods.slice` is the starting point and it contains multiple slices

```bash
sh-5.1# cd kubepods.slice
sh-5.1# ls
cgroup.controllers cpuset.cpus hugetlb.2MB.rsvd.max memory.pressure
cgroup.events cpuset.cpus.effective io.bfq.weight memory.reclaim
cgroup.freeze cpuset.cpus.partition io.latency memory.stat
cgroup.kill cpuset.mems io.max memory.swap.current
cgroup.max.depth cpuset.mems.effective io.pressure memory.swap.events
cgroup.max.descendants hugetlb.1GB.current io.stat memory.swap.high
cgroup.procs hugetlb.1GB.events kubepods-besteffort.slice memory.swap.max
cgroup.stat hugetlb.1GB.events.local kubepods-burstable.slice memory.zswap.current
cgroup.subtree_control hugetlb.1GB.max kubepods-pod2b38830b_c2d6_4528_8935_b1c08511b1e3.slice memory.zswap.max
cgroup.threads hugetlb.1GB.numa_stat memory.current misc.current
cgroup.type hugetlb.1GB.rsvd.current memory.events misc.max
cpu.idle hugetlb.1GB.rsvd.max memory.events.local pids.current
cpu.max hugetlb.2MB.current memory.high pids.events
cpu.max.burst hugetlb.2MB.events memory.low pids.max
cpu.pressure hugetlb.2MB.events.local memory.max rdma.current
cpu.stat hugetlb.2MB.max memory.min rdma.max
cpu.weight hugetlb.2MB.numa_stat memory.numa_stat
cpu.weight.nice hugetlb.2MB.rsvd.current memory.oom.group
```

Based on the QoS of the pod, either our application pod will end up in the
above `kubepods-besteffort.slice` or `kubepods-burstable.slice` or
`kubepods.slice` (Guaranteed QoS) cgroup. The 3 QoS classes are defined
[here](https://kubernetes.io/docs/concepts/workloads/pods/pod-QoS/#quality-of-service-classes)

To identify the right cgroup file, we need pod UUID and container UUID from the
`pod yaml` output

```bash
[$]kubectl get po csi-rbd-demo-pod -oyaml |grep uid
uid: cdf7b785-4eb7-44f7-99cc-ef53890f4dfd
[$]kubectl get po csi-rbd-demo-pod -oyaml |grep -i containerID
- containerID: cri-o://77e57fbbc0f0630f41f9f154f4b5fe368b6dcf7bef7dcd75a9c4b56676f10bc9
[$]kubectl get po csi-rbd-demo-pod -oyaml |grep -i qosClass
qosClass: BestEffort
```

Now check in the `kubepods-besteffort.slice` and identify the right path using
pod UID and container UID

Before that check `io.max` on the application pod and see if there is any limit

```bash
[$]kubectl exec -it csi-rbd-demo-pod -- sh
sh-4.4# cat /sys/fs/cgroup/io.max
sh-4.4#
```

Come back to the Node and find the right cgroup scope

```bash
sh-5.1# cd kubepods-besteffort.slice/kubepods-besteffort-podcdf7b785_4eb7_44f7_99cc_ef53890f4dfd.slice/crio-77e57fbbc0f0630f41f9f154f4b5fe368b6dcf7bef7dcd75a9c4b56676f10bc9.scope/


sh-5.1# echo "252:0 wbps=1048576" > io.max
sh-5.1# cat io.max
252:0 rbps=max wbps=1048576 riops=max wiops=max
```

Now go back to the application pod and check if we have the right limit set

```bash
[$]kubectl exec -it csi-rbd-demo-pod -- sh
sh-4.4# cat /sys/fs/cgroup/io.max
252:0 rbps=max wbps=1048576 riops=max wiops=max
sh-4.4#
```

Note:- We can only support the QoS that cgroup v2 io controller supports, this
means that cumulative read+write QoS limits won't be supported.

Below are the configurations that will be supported

| Parameter | Description |
| --- | --- |
| MaxReadIOPS | Max read IO operations per second |
| MaxWriteIOPS | Max write IO operations per second |
| MaxReadBytesPerSecond | Max read bytes per second |
| MaxWriteBytesPerSecond | Max write bytes per second |

## Different approaches

The above solution can be implemented using 3 different approaches.

### 1. QoS using new parameters in RBD StorageClass

```yaml
---
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
name: csi-rbd-sc
provisioner: rbd.csi.ceph.com
parameters:
MaxReadIOPS: ""
MaxWriteIOPS: ""
MaxReadBytesPerSecond: ""
MaxWriteBytesPerSecond: ""
```

#### Implementation for StorageClass QoS

1. Create new storageClass with new parameters for QoS
1. Modify CSIDriver object to pass pod details to the NodePublishVolume CSI
procedure
1. During NodePublishVolume CSI procedure
* Retrieve the QoS configuration from the volumeContext in NodePublishRequest
* Identify the rbd device using the NodeStageVolumePath
* Get the pod UUID from the NodeStageVolume
* Set io.max file in all the containers in the pod

#### Drawbacks of StorageClass QoS

1. No way to update the QoS at runtime
1. Need to take a backup and restore to New QoS StorageClass
1. Delete and Recreate the PV object

### 2. QoS using parameters in VolumeAttributeClass

```yaml
apiVersion: storage.k8s.io/v1alpha1
kind: VolumeAttributesClass
metadata:
name: silver
parameters:
MaxReadIOPS: ""
MaxWriteIOPS: ""
MaxReadBytesPerSecond: ""
MaxWriteBytesPerSecond: ""
```

VolumeAttributesClassName is a new parameter in the PVC object the user can
choose from and this can also be updated or removed later.

This new VolumeAttributeClass is designed to keep storage that supports setting
QoS at the storage level which means setting some configuration at the storage
(like QoS for nbd)

#### Implementation of VolumeAttributeClass QoS

1. Modify CSIDriver object to pass pod details to the NodePublishVolume CSI
procedure
1. Add support in Ceph-CSI to expose ModifyVolume CSI procedure
1. Ceph-CSI will store QoS in the rbd image metadata
1. During NodeStage operation retrieve the image metadata and store it in
stagingPath
1. Whenever a new pod comes in apply the QoS

#### Drawbacks of VolumeAttributeClass QoS

One problem with above is all application need to be scaled downed and scaled
up to get the new QoS value even though its changed in the PVC object, this is
sometime impossible as it will have downtime.

### 3. QoS using parameters in VolumeAttributeClass with NodePublish Secret

1. Modify CSIDriver object to pass pod details to the NodePublishVolume CSI
procedure
1. Add support in Ceph-CSI to expose ModifyVolume CSI procedure
1. Ceph-CSI will store QoS in the rbd image metadata
1. During NodePublishVolume operation retrieve the QoS from image metadata
1. Whenever a new pod comes in apply the QoS

This solution addresses the aforementioned issue, but it requires a secret to
communicate with the ceph cluster. Therefore, we must create a new
PublishSecret for the storageClass, which may be beneficial when Kubernetes
eventually enables Node operations.

Both options 2 and 3 are contingent upon changes to the CSI spec and Kubernetes
Madhu-1 marked this conversation as resolved.
Show resolved Hide resolved
support. Additionally,
[VolumeAttributeClass](https://github.com/kubernetes/enhancements/blob/master/keps/sig-storage/3751-volume-attributes-class/README.md)
is currently being developed within the Kubernetes realm and will initially be
in the Alpha stage. Consequently, it will be disabled by default.

#### Advantages of QoS using VolumeAttributeClass

1. No Restore/Clone operation is required to change the QoS
1. Easily QoS can be changed for existing PVC only with second approach not
with third as it needs new secret.

### Hybrid Approach

Considering the advantages and drawbacks, we can use StorageClass and
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it possible for users to override the QoS in the storage class with the VolumeAttributeClass settings? This sounds good assuming the admin controls which VolumeAttributeClass is available for the user to specify in their PVC. Otherwise, the user could override admin settings in the storage class.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

both storageclass and VolumeAttributeClass are admin created ones and the user will not have access to create one, Yes users can update the VolumeAttributeClass in the PVC later creation but he/she should get the name of the VolumeAttributeClass from the admin.

VolumeAttributeClass to support QoS, with VolumeAttributeClass taking
precedence over StorageClass. This approach offers a flexible solution that
accounts for dynamic changes while addressing the challenges of existing
approaches.
idryomov marked this conversation as resolved.
Show resolved Hide resolved

### References

Some of the useful links that helped me to understand cgroup v2 and how to set
QoS on the device.

* [Kubernetes cgroup v2
Architecture](https://kubernetes.io/docs/concepts/architecture/cgroups/)
* [cgroup v2 kernel doc](https://docs.kernel.org/admin-guide/cgroup-v2.html)
* [ceph RBD QoS tracker](https://tracker.ceph.com/issues/36191)
* [cgroup v2 io
controller](https://facebookmicrosites.github.io/cgroup2/docs/io-controller.html)
* [Kubernetes IOPS
issue](https://github.com/kubernetes/kubernetes/issues/92287)