---
title: workload-partitioning-in-microshift
authors:
- "@eslutsky"
reviewers:
- "@sjug, Performance and Scalability expert"
- "@DanielFroehlich, PM"
- "@jogeo, QE lead"
- "@pmtk, working on low latency workloads"
- "@pacevedom, MicroShift lead"
approvers:
- "@jerpeter1"
api-approvers:
- "@jerpeter1"
creation-date: 2024-06-20
last-updated: 2024-06-20
tracking-link:
- https://issues.redhat.com/browse/USHIFT-409
---

# Workload Partitioning in MicroShift

## Summary

This enhancement describes how workload partitioning will be supported on MicroShift hosts.


## Motivation

In constrained environments, management workloads, including the MicroShift control plane, need to be configured to use fewer resources than they would by default in normal clusters.

The goal of workload partitioning is to limit the CPU usage of all control plane components. For example, on an 8-core system, we can limit and guarantee that the control plane uses at most 2 cores.

### User Stories

* As a MicroShift administrator, I want to configure the MicroShift host and all involved subsystems
so that I can isolate the control plane services to run on a restricted set of CPUs, reserving the rest of the device's CPU resources for my own workloads.


### Goals

Provide guidance and example artifacts for configuring the system for workload partitioning on MicroShift:
- Ensure that MicroShift's embedded goroutines (API server, etc.) respect the cpuset configuration and run exclusively on those CPUs.
- Ensure that containers started by MicroShift run exclusively on the configured CPUs.




### Non-Goals

- Low latency workloads (see [OCPSTRAT-361](https://issues.redhat.com/browse/OCPSTRAT-361))


## Proposal

To ease the configuration of the system for running workload partitioning on MicroShift, the following parts need to be put in place:
- Kubelet configuration:
  - MicroShift manages its own embedded kubelet instance.
  - A read-only rendering of the resulting configuration can be found at
    `/var/lib/microshift/resources/kubelet/config/config.yaml`.
- Override the CRI-O and MicroShift systemd unit configuration by adding drop-in files.


Currently, ovs-vswitchd and ovsdb-server have a hardcoded CPU pinning configuration,
which is applied during the RPM [installation](https://github.com/openshift/microshift/blob/main/packaging/rpm/microshift.spec#L276-L277); a sketch of such a drop-in is shown below.
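
For illustration, a minimal sketch of such an RPM-installed drop-in, assuming the OVS services are pinned to CPU 0; the exact drop-in path and CPU list are assumptions based on the linked spec file and may change:

```ini
# /etc/systemd/system/ovs-vswitchd.service.d/microshift-cpuaffinity.conf
# Illustrative content; an equivalent drop-in is assumed for ovsdb-server as well.
[Service]
CPUAffinity=0
```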



### Workflow Description

#### System and MicroShift configuration

##### OSTree
1. User supplies the MicroShift configuration using a blueprint:
   - `/etc/microshift/config.yaml` - embeds the kubelet configuration through the MicroShift config.
   - `/etc/kubernetes/openshift-workload-pinning` - kubelet workload pinning configuration file.
1. User supplies the CRI-O configuration file using a blueprint.
1. User specifies `CPUAffinity` in systemd drop-in configuration using a blueprint for:
   - CRI-O
   - MicroShift


1. User builds the blueprint
1. User deploys the commit / installs the system.
1. System boots

Example blueprint:
```toml
name = "microshift-workload-partitioning"
version = "0.0.1"
modules = []
groups = []
distro = "rhel-94"
[[packages]]
name = "microshift"
version = "4.17.*"
[[customizations.services]]
enabled = ["microshift"]
[[customizations.files]]
path = "/etc/microshift/config.yaml"
data = """
kubelet:
reservedSystemCPUs: "0,6,7"
cpuManagerPolicy: static
cpuManagerPolicyOptions:
full-pcpus-only: "true"
cpuManagerReconcilePeriod: 5s
"""
[[customizations.files]]
path = "/etc/crio/crio.conf.d/20-microshift-wp.conf"
data = """
[crio.runtime]
infra_ctr_cpuset = "0,6,7"
[crio.runtime.workloads.management]
activation_annotation = "target.workload.openshift.io/management"
annotation_prefix = "resources.workload.openshift.io"
resources = { "cpushares" = 0, "cpuset" = "0" }
"""
[[customizations.files]]
path = "/etc/systemd/system/crio.service.d/microshift-cpuaffinity.conf"
data = """
[Service]
CPUAffinity=0,6,7
"""
[[customizations.files]]
path = "/etc/systemd/system/microshift.service.d/microshift-cpuaffinity.conf"
data = """
[Service]
CPUAffinity=0,6,7
"""
[[customizations.files]]
path = "/etc/kubernetes/openshift-workload-pinning"
data = """
{
"management": {
"cpuset": "0,6,7"
}
}
"""
```

##### RPM
1. User creates the configuration files with the workload partitioning settings (see Implementation Details below).
1. User reboots the host.

### API Extensions

The following API extensions are expected:
- A passthrough from MicroShift's config to Kubelet config.


### Topology Considerations

#### Hypershift / Hosted Control Planes

N/A

#### Standalone Clusters

N/A

#### Single-node Deployments or MicroShift

This is purely a MicroShift enhancement.

### Implementation Details/Notes/Constraints

#### Kubelet configuration
- Instructs the kubelet to modify the node resource with the capacity and allocatable CPUs (an illustrative result is sketched at the end of this section).

Add the following to the file `/etc/kubernetes/openshift-workload-pinning`:
```json
{
"management": {
"cpuset": "0,6,7"
}
}
```
- MicroShift passthrough configuration for the kubelet.

Add the following to the file `/etc/microshift/config.yaml`:
```yaml
kubelet:
reservedSystemCPUs: 0,6,7
cpuManagerPolicy: static
cpuManagerPolicyOptions:
full-pcpus-only: "true"
cpuManagerReconcilePeriod: 5s
```
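
With the pinning file in place, the kubelet advertises a management CPU pool on the node object. The following is a hypothetical sketch of the resulting node status; the `management.workload.openshift.io/cores` resource name follows the OpenShift workload partitioning convention, and the quantities are illustrative:

```yaml
# Hypothetical excerpt of the node status after enabling workload pinning.
status:
  capacity:
    cpu: "8"
    management.workload.openshift.io/cores: "8000"   # illustrative value
  allocatable:
    cpu: "8"
    management.workload.openshift.io/cores: "8000"   # illustrative value
```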

#### CRI-O configuration

- Add a drop-in configuration file for CRI-O (e.g. `/etc/crio/crio.conf.d/20-microshift-wp.conf`, as in the blueprint above) with the following content:
```ini
[crio.runtime]
infra_ctr_cpuset = "0,6,7"

[crio.runtime.workloads.management]
activation_annotation = "target.workload.openshift.io/management"
annotation_prefix = "resources.workload.openshift.io"
resources = { "cpushares" = 0, "cpuset" = "0,6,7" }
```

- Add a systemd drop-in for CRI-O.

In the file `/etc/systemd/system/crio.service.d/microshift-cpuaffinity.conf`:
```ini
[Service]
CPUAffinity=0,6,7
```

#### Admission Webhook
An admission webhook will update the control plane pod annotations that CRI-O acts upon (a sketch is shown below).
Cpuset profiles will be introduced to the MicroShift configuration.
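
For illustration, a hypothetical sketch of the annotations such a webhook could apply to a control plane pod, following the OpenShift workload partitioning annotation convention (the pod name, container name, image, and values are illustrative assumptions):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: example-control-plane-pod          # hypothetical pod
  annotations:
    # Matches the activation_annotation configured for CRI-O above.
    target.workload.openshift.io/management: '{"effect": "PreferredDuringScheduling"}'
    # Per-container resources under the configured annotation_prefix;
    # CRI-O uses these to place the container on the management cpuset.
    resources.workload.openshift.io/example-container: '{"cpushares": 256}'
spec:
  containers:
  - name: example-container
    image: registry.example.com/example:latest
```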


#### MicroShift Control Plane CPU Pinning
MicroShift runs as a single systemd unit. The main binary embeds as goroutines only those services strictly necessary to bring up a *minimal Kubernetes/OpenShift control and data plane*.
MicroShift will be pinned using the systemd [CPUAffinity](https://docs.redhat.com/en/documentation/red_hat_enterprise_linux/8/html/managing_monitoring_and_updating_the_kernel/assembly_configuring-cpu-affinity-and-numa-policies-using-systemd_managing-monitoring-and-updating-the-kernel) configuration option.

Using a systemd drop-in file at `/etc/systemd/system/microshift.service.d/microshift-cpuaffinity.conf`:
```ini
[Service]
CPUAffinity=0,6,7
```
#### Extra manifests
TBD
### Risks and Mitigations
The biggest risk is system misconfiguration.
CPU starvation of the MicroShift control plane would cause a service outage.
### Drawbacks
The approach described in this enhancement does not provide much of the NTO's functionality
due to the "static" nature of RPMs and packaged files (compared to NTO's dynamic templating),
but it must be noted that NTO goes beyond workload partitioning.
One of the NTO's strengths is that it can create systemd units for runtime configuration
(such as offlining CPUs, setting hugepages per NUMA node, clearing IRQ balance banned CPUs,
setting RPS masks). Such dynamic actions are beyond the capabilities of static files shipped via RPM.
If such features are required by users, we could ship such systemd units as no-ops
unless they are turned on in MicroShift's config. However, it is unknown to the author of this enhancement
whether these are an integral part of the low latency use case.
## Open Questions [optional]
TBD
## Test Plan
## Graduation Criteria
The feature is meant to be GA in its first release.
### Dev Preview -> Tech Preview
Not applicable.
### Tech Preview -> GA
Not applicable.
### Removing a deprecated feature
Not applicable.
## Upgrade / Downgrade Strategy
TBD
## Version Skew Strategy
TBD
## Operational Aspects of API Extensions
Kubelet configuration will be exposed in MicroShift config as a passthrough.
## Support Procedures
## Alternatives
### Deploying Node Tuning Operator
Most of the functionality discussed in the scope of this enhancement is already handled by the Node Tuning
Operator (NTO). However, incorporating it into MicroShift is not the best approach for a couple of reasons:
- NTO depends on the Machine Config Operator, which is also not supported on MicroShift,
- MicroShift takes a different approach to host management than OpenShift,
- MicroShift, being intended for edge devices, aims to reduce runtime resource consumption, and
  introducing an operator works against this goal.
### Reusing NTO code
Instead of deploying NTO, its code could be partially incorporated into MicroShift.
However, this does not improve the operational aspects: MicroShift would transform a CR into TuneD,
CRI-O config, and kubelet configuration, which means it is still a controller, just running in a
different binary, and that does not help with runtime resource consumption.
Parts that depend on the MCO would need to be rewritten and maintained.
Another aspect is that NTO is highly generic, supporting many configuration options for users to mix
and match, whereas this enhancement focuses solely on workload partitioning. The responsibility of the
dev team is to remove common hurdles from the user's path so that users make fewer mistakes and want to
continue using the product.
### Providing users with upstream documentations on how to configure CRI-O
This is the least UX-friendly way of providing the functionality.
## Infrastructure Needed [optional]
N/A
