---
title: workload-partitioning-in-microshift
authors:
- "@eslutsky"
reviewers:
- "@sjug, Performance and Scalability expert"
- "@DanielFroehlich, PM"
- "@jogeo, QE lead"
- "@pmtk, working on low latency workloads"
- "@pacevedom, MicroShift lead"
approvers:
- "@jerpeter1"
api-approvers:
- "@jerpeter1"
creation-date: 2024-06-20
last-updated: 2024-06-20
tracking-link:
- https://issues.redhat.com/browse/USHIFT-409
---

# Workload Partitioning in MicroShift

## Summary

This enhancement describes how workload partitioning will be supported on MicroShift hosts.


## Motivation

In constrained environments, management workloads, including the MicroShift control plane, need to be configured to use fewer resources than they would by default in normal clusters.

The goal of workload partitioning is to limit the CPU usage of all control plane components. For example, on an 8-core system, we can limit and guarantee that the control plane uses at most 2 cores.

### User Stories

* As a MicroShift administrator, I want to configure the MicroShift host and all involved subsystems
so that I can isolate the control plane services to run on a restricted set of CPUs, reserving the rest of the device's CPU resources for my own workloads.


### Goals

Provide guidance and example artifacts for configuring the system for workload partitioning on MicroShift:
- Ensure that MicroShift's embedded goroutines (API server, etc.) respect the cpuset configuration and run exclusively on those CPUs.
- Ensure that containers started by MicroShift run exclusively on the configured CPUs.




### Non-Goals

- Low latency workloads (see [OCPSTRAT-361](https://issues.redhat.com/browse/OCPSTRAT-361))


## Proposal

To ease the configuration of the system for running workload partitioning on MicroShift, the following parts need to be put in place:
- Kubelet configuration:
  - MicroShift manages its own embedded kubelet instance.
  - A read-only rendering of the resulting configuration can be found at
    `/var/lib/microshift/resources/kubelet/config/config.yaml`.
- Override the CRI-O and MicroShift systemd unit configuration by adding drop-in files.


Currently, ovs-vswitchd and ovsdb-server have a hardcoded CPU pinning configuration,
which is applied during the RPM [installation](https://github.com/openshift/microshift/blob/main/packaging/rpm/microshift.spec#L276-L277); a sketch of such a drop-in is shown below.
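
For illustration, a minimal sketch of such an RPM-installed drop-in, assuming the OVS services are pinned to CPU 0; the exact drop-in path and CPU list are assumptions based on the linked spec file and may change:

```ini
# /etc/systemd/system/ovs-vswitchd.service.d/microshift-cpuaffinity.conf
# Illustrative content; an equivalent drop-in is assumed for ovsdb-server as well.
[Service]
CPUAffinity=0
```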



### Workflow Description

#### System and MicroShift configuration

##### OSTree
1. User supplies the MicroShift configuration using a blueprint:
   - `/etc/microshift/config.yaml` - embeds the kubelet configuration through the MicroShift config.
   - `/etc/kubernetes/openshift-workload-pinning` - kubelet workload pinning configuration file.
1. User supplies the CRI-O configuration file using a blueprint.
1. User specifies `CPUAffinity` in systemd drop-in configuration using a blueprint for:
   - CRI-O
   - MicroShift


1. User builds the blueprint
1. User deploys the commit / installs the system.
1. System boots

Example blueprint:
```toml
name = "microshift-workload-partitioning"
version = "0.0.1"
modules = []
groups = []
distro = "rhel-94"
[[packages]]
name = "microshift"
version = "4.17.*"
[[customizations.services]]
enabled = ["microshift"]
[[customizations.files]]
path = "/etc/microshift/config.yaml"
data = """
kubelet:
reservedSystemCPUs: "0,6,7"
cpuManagerPolicy: static
cpuManagerPolicyOptions:
full-pcpus-only: "true"
cpuManagerReconcilePeriod: 5s
"""
[[customizations.files]]
path = "/etc/crio/crio.conf.d/20-microshift-wp.conf"
data = """
[crio.runtime]
infra_ctr_cpuset = "0,6,7"
[crio.runtime.workloads.management]
activation_annotation = "target.workload.openshift.io/management"
annotation_prefix = "resources.workload.openshift.io"
resources = { "cpushares" = 0, "cpuset" = "0" }
"""
[[customizations.files]]
path = "/etc/systemd/system/crio.service.d/microshift-cpuaffinity.conf"
data = """
[Service]
CPUAffinity=0,6,7
"""
[[customizations.files]]
path = "/etc/systemd/system/microshift.service.d/microshift-cpuaffinity.conf"
data = """
[Service]
CPUAffinity=0,6,7
"""
[[customizations.files]]
path = "/etc/kubernetes/openshift-workload-pinning"
data = """
{
"management": {
"cpuset": "0,6,7"
}
}
"""
```

##### RPM
1. User creates the configuration files with the workload partitioning settings (see Implementation Details below).
1. User reboots the host.

### API Extensions

The following API extensions are expected:
- A passthrough from MicroShift's config to Kubelet config.


### Topology Considerations

#### Hypershift / Hosted Control Planes

N/A

#### Standalone Clusters

N/A

#### Single-node Deployments or MicroShift

This is purely a MicroShift enhancement.

### Implementation Details/Notes/Constraints

#### Kubelet configuration
- Instructs the kubelet to modify the node resource with the capacity and allocatable CPUs (an illustrative result is sketched at the end of this section).

Add the following to the file `/etc/kubernetes/openshift-workload-pinning`:
```json
{
"management": {
"cpuset": "0,6,7"
}
}
```
- MicroShift passthrough configuration for the kubelet.

Add the following to the file `/etc/microshift/config.yaml`:
```yaml
kubelet:
reservedSystemCPUs: 0,6,7
cpuManagerPolicy: static
cpuManagerPolicyOptions:
full-pcpus-only: "true"
cpuManagerReconcilePeriod: 5s
```
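
With the pinning file in place, the kubelet advertises a management CPU pool on the node object. The following is a hypothetical sketch of the resulting node status; the `management.workload.openshift.io/cores` resource name follows the OpenShift workload partitioning convention, and the quantities are illustrative:

```yaml
# Hypothetical excerpt of the node status after enabling workload pinning.
status:
  capacity:
    cpu: "8"
    management.workload.openshift.io/cores: "8000"   # illustrative value
  allocatable:
    cpu: "8"
    management.workload.openshift.io/cores: "8000"   # illustrative value
```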

#### CRI-O configuration

- Add a drop-in configuration file for CRI-O (e.g. `/etc/crio/crio.conf.d/20-microshift-wp.conf`, as in the blueprint above) with the following content:
```ini
[crio.runtime]
infra_ctr_cpuset = "0,6,7"

[crio.runtime.workloads.management]
activation_annotation = "target.workload.openshift.io/management"
annotation_prefix = "resources.workload.openshift.io"
resources = { "cpushares" = 0, "cpuset" = "0,6,7" }
```

- Add a systemd drop-in for CRI-O.

In the file `/etc/systemd/system/crio.service.d/microshift-cpuaffinity.conf`:
```ini
[Service]
CPUAffinity=0,6,7
```

#### Admission Webhook
An admission webhook will update the control plane pod annotations that CRI-O acts upon (a sketch is shown below).
Cpuset profiles will be introduced to the MicroShift configuration.
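
For illustration, a hypothetical sketch of the annotations such a webhook could apply to a control plane pod, following the OpenShift workload partitioning annotation convention (the pod name, container name, image, and values are illustrative assumptions):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: example-control-plane-pod          # hypothetical pod
  annotations:
    # Matches the activation_annotation configured for CRI-O above.
    target.workload.openshift.io/management: '{"effect": "PreferredDuringScheduling"}'
    # Per-container resources under the configured annotation_prefix;
    # CRI-O uses these to place the container on the management cpuset.
    resources.workload.openshift.io/example-container: '{"cpushares": 256}'
spec:
  containers:
  - name: example-container
    image: registry.example.com/example:latest
```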


#### MicroShift Control Plane CPU Pinning
MicroShift runs as a single systemd unit. The main binary embeds as goroutines only those services strictly necessary to bring up a *minimal Kubernetes/OpenShift control and data plane*.
MicroShift will be pinned using the systemd [CPUAffinity](https://docs.redhat.com/en/documentation/red_hat_enterprise_linux/8/html/managing_monitoring_and_updating_the_kernel/assembly_configuring-cpu-affinity-and-numa-policies-using-systemd_managing-monitoring-and-updating-the-kernel) configuration option.

Using a systemd drop-in file at `/etc/systemd/system/microshift.service.d/microshift-cpuaffinity.conf`:
```ini
[Service]
CPUAffinity=0,6,7
```
#### Extra manifests
TBD
### Risks and Mitigations
The biggest risk is system misconfiguration.
CPU starvation of the MicroShift control plane would cause a service outage.
### Drawbacks
The approach described in this enhancement does not provide much of the NTO's functionality
due to the "static" nature of RPMs and packaged files (compared to NTO's dynamic templating),
but it must be noted that NTO goes beyond workload partitioning.
One of the NTO's strengths is that it can create systemd units for runtime configuration
(such as offlining CPUs, setting hugepages per NUMA node, clearing IRQ balance banned CPUs,
setting RPS masks). Such dynamic actions are beyond the capabilities of static files shipped via RPM.
If such features are required by users, we could ship such systemd units as no-ops
unless they are turned on in MicroShift's config. However, it is unknown to the author of this enhancement
whether these are an integral part of the low latency use case.
## Open Questions [optional]
TBD
## Test Plan
## Graduation Criteria
The feature is meant to be GA in its first release.
### Dev Preview -> Tech Preview
Not applicable.
### Tech Preview -> GA
Not applicable.
### Removing a deprecated feature
Not applicable.
## Upgrade / Downgrade Strategy
TBD
## Version Skew Strategy
TBD
## Operational Aspects of API Extensions
Kubelet configuration will be exposed in MicroShift config as a passthrough.
## Support Procedures
## Alternatives
### Deploying Node Tuning Operator
Most of the functionality discussed in the scope of this enhancement is already handled by the Node Tuning
Operator (NTO). However, incorporating it into MicroShift is not the best approach for a couple of reasons:
- NTO depends on the Machine Config Operator, which is also not supported on MicroShift,
- MicroShift takes a different approach to host management than OpenShift,
- MicroShift, being intended for edge devices, aims to reduce runtime resource consumption, and
  introducing an operator works against this goal.
### Reusing NTO code
Instead of deploying NTO, its code could be partially incorporated into MicroShift.
However, this does not improve the operational aspects: MicroShift would transform a CR into TuneD,
CRI-O config, and kubelet configuration, which means it is still a controller, just running in a
different binary, and that does not help with runtime resource consumption.
Parts that depend on the MCO would need to be rewritten and maintained.
Another aspect is that NTO is highly generic, supporting many configuration options for users to mix
and match, whereas this enhancement focuses solely on workload partitioning. The responsibility of the
dev team is to remove common hurdles from the user's path so that users make fewer mistakes and want to
continue using the product.
### Providing users with upstream documentations on how to configure CRI-O
This is the least UX-friendly way of providing the functionality.
## Infrastructure Needed [optional]
N/A
