
CNF-7603: mixed-cpu-node-plugin #1396

Merged
merged 9 commits, Nov 7, 2023

Conversation

Tal-or
Contributor

@Tal-or Tal-or commented May 8, 2023

Runtime-level node plugin that allows containers to request both exclusive and shared CPUs.

The plugin is optional (to minimize risk), runtime-agnostic, and free of vendor lock-in,
and it enables better resource utilization, CPU optimization, and higher pod density.

Signed-off-by: Talor Itzhak titzhak@redhat.com

@openshift-ci openshift-ci bot requested review from mandre and soltysh May 8, 2023 14:13
@Tal-or Tal-or changed the title enhancement: mixed-cpu-node-plugin mixed-cpu-node-plugin May 8, 2023
Contributor

@jmencak jmencak left a comment

Thank you for the PR, I have just a few minor nits and a question.

@Tal-or Tal-or force-pushed the mixed-cpu-node-plugin branch 3 times, most recently from 5f87a4c to 38e6aa2 on May 9, 2023 13:04
minimizing the changes to the platform

### Non-Goals
* Introducing a generic mechanism in the platform that does involve Kubelet and pod spec changes.
Contributor

what's the compelling argument for not upstreaming this?

time required?

no one else would use it/be interested in it?

Contributor Author

I mentioned that in the Alternatives section,
but the gist is that upstreaming this proposal means we need a generic solution
that could fit a broader set of use cases and requirements,
not only the scope-specific requirement we're trying to address.

In addition, there is an ongoing effort to support greater resource management flexibility upstream (a KEP which is still in discussion) but:

  1. We are not sure how another KEP blends alongside this ongoing effort.
  2. We can't wait for the existing KEP due to time constraints.

I mentioned the KEP in the Alternative section as well.

Contributor

Is anyone from the team at least involved in the discussions of that KEP?

If the KEP is approved, how would this design evolve to use it or at least co-exist safely?

Contributor Author

Is anyone from the team at least involved in the discussions of that KEP?

Yes. @ffromani and @swatisehgal are.

If the KEP is approved, how would this design evolve to use it or at least co-exist safely?

The KEP is not going to break any of the existing behavior of the CPU manager, and this feature is meant to work alongside the CPU manager, hence it won't be affected by the proposed changes.

A change such as this requires changes to the Kubelet, the scheduler, and other supporting controllers such as
the eviction manager, HPA (Horizontal Pod Autoscaler), etc.
Considering the upstream velocity, current deadlines, and the number of open questions that have to be addressed,
the plugin solution has a better chance of being completed on time.
Contributor

I'd suggest that "bigger chance to be completed on time" isn't the best criterion for choosing an alternative. Another consideration: what are the implications for migrating to a proper solution if this short-term approach is implemented?

Contributor Author

Hey Bart, thank you for taking the time to provide your feedback.

I'd suggest that "bigger chance to be completed on time" isn't the best criterion for choosing an alternative.

  1. I would not underestimate the upstream pace, which can be extremely slow, especially for centralized changes in resource management and the pod spec API.
  2. Time to market is not the only obstacle. In order to reach consensus in the upstream community, we would need to present a broader set of use cases and scenarios relevant to a wider audience, which we don't have at the moment beyond the specific use case mentioned in the user-story section.

Another consideration is what are going to be the implications for migrating to a proper solution if this short-term approach is implemented?

That depends on what the proper solution would be, but let's assume it is a new API
field in the pod spec plus a new pool in the kubelet config.
An upgrade from the short-term to the proper solution would then roughly look like:

  1. The mixed-cpu-node-plugin can be shut off easily by a simple change in the performance profile.
  2. The new pool would be configured in the Kubelet config via NTO.
  3. All pod specs that request openshift.io/shared-cpu would be converted (via an API admission hook, IIRC) to the new API field.


The node-plugin populates a special device named `openshift.io/shared-cpus` to provide a way for pods to request
this special type of CPUs.
The value/number of `openshift.io/shared-cpus` devices that the pod requests has no meaning.
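A pod requesting the device described above could look roughly like this (a hedged sketch; the pod name, image, and CPU/memory amounts are hypothetical, and per the text the `openshift.io/shared-cpus` value itself carries no meaning):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: mixed-cpu-app                  # hypothetical name
spec:
  containers:
  - name: app
    image: registry.example.com/app:1  # hypothetical image
    resources:
      requests:
        cpu: "4"                       # exclusive CPUs (Guaranteed QoS)
        memory: "2Gi"
        openshift.io/shared-cpus: "1"  # presence matters, not the value
      limits:                          # limits == requests for Guaranteed QoS
        cpu: "4"
        memory: "2Gi"
        openshift.io/shared-cpus: "1"
```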
Contributor

Is this request going to be used by the scheduler to place the pod on a node that has these resources? That should be discussed here.

Contributor Author

This is part of the scheduler behavior, but since it's a dummy device, it would be populated on all nodes.
The actual reason for having it as a resource is described at lines 110-114:
"The reason for specifying a device, and not, for example, an annotation,
is that when an application pod requests a device, the scheduler keeps the pod pending until the device is up and running.
This gives the node plugin room for setup without depending on pod admission order.
Only when the node plugin finishes the registration process with NRI and the device plugin does the scheduler
admit the application pod."

Contributor

I'm curious about the case where the shared-cpus are not configured on a node in its performance profile, so that node would not publish the openshift.io/shared-cpus resource. In that case, will the scheduler avoid putting pods with requests for that resource on those nodes? Or is the requirement that all nodes need to have shared-cpus configured - and if so, how would that be enforced?

Contributor Author

I'm curious about the case where the shared-cpus are not configured on a node in its performance profile, so that node would not publish the openshift.io/shared-cpus resource

Yes. If shared cpus are not configured, NTO won't deploy the node-plugin that is responsible for populating the openshift.io/shared-cpus devices.

In that case, will the scheduler avoid putting pods with requests for that resource on those nodes?

If a pod asks for a device, the scheduler schedules the pod only to nodes that have that device. If none of the nodes has an available device, the pod stays pending. This is native scheduler behavior.

Or is the requirement that all nodes need to have shared-cpus configured - and if so, how would that be enforced?

No, it's not a requirement.

Comment on lines 198 to 271
A way of mitigating that is to check whether it's possible to use workload partitioning
to ensure the platform housekeeping processes don't run on the shared cpus.
(This statement needs to be verified.)
Contributor

In any case, I think this proposal needs to discuss the interactions with workload partitioning, which currently places management workloads on the reserved cpus. Will workload partitioning use the larger set of reserved cpus that now includes the "shared" cpus? What happens if the new shared cpu set is created at installation time vs. on an already running system? I think there are implications for changing the reserved cpuset after workload partitioning is configured.

Contributor Author

Yeah, this is still under investigation.

Will workload partitioning use the larger set of reserved cpus that now includes the "shared" cpus?

@bartwensley @browsell If we would make sure (as part of this feature) that housekeeping processes would run only on a subset of the reserved that doesn't include the shared cpus, would that be a more satisfying approach for this solution?

Contributor Author

@Tal-or Tal-or May 21, 2023

This is the file that contains management workloads when SNO spins up with WP enabled.
According to my understanding, changing the reserved pool later on (via NTO) won't affect this file, because it's a static configuration; hence housekeeping processes won't be expanded to the shared cpus.
I asked for an SNO cluster to test that theory, but please keep me honest here if I presumed something wrong.


First of all, workload partitioning is not just SNO. The cpuset in the CRI-O drop-in for management workload partitioning is generated based on the number of reserved cpus in the performance profile; see https://github.com/openshift/cluster-node-tuning-operator/blob/master/pkg/performanceprofile/controller/performanceprofile/components/machineconfig/machineconfig.go#L530.

System reserved with this proposal no longer maps to kubelet's view of system reserved. As long as NTO continues to generate the CRI-O drop-in based on the system reserved defined in the performance profile, there should not be an issue.

Contributor Author

@Tal-or Tal-or May 22, 2023

Thank you, Brent, for your comment.

As long as NTO continues to generate the CRI-O drop-in based on the system reserved defined in the performance profile, there should not be an issue.

With this proposal, the reservedSystemCpus in the kubelet configuration file is composed of the performance profile's .spec.cpu.reserved + .spec.cpu.shared.
But as you said, the generated CRI-O drop-in would be composed only of the performance profile's .spec.cpu.reserved.

Do we agree on that?
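The split under discussion could look roughly like this in a performance profile (a hedged sketch: the CPU IDs and profile name are hypothetical; `.spec.cpu.reserved` and `.spec.cpu.shared` are the fields referenced in this thread):

```yaml
apiVersion: performance.openshift.io/v2
kind: PerformanceProfile
metadata:
  name: performance            # hypothetical name
spec:
  cpu:
    reserved: "0-1"            # housekeeping; basis for the CRI-O drop-in cpuset
    shared: "2-3"              # shared cpus; added to kubelet's reservedSystemCpus
    isolated: "4-15"           # exclusive cpus for Guaranteed workloads
```

Per the comment above, kubelet's `reservedSystemCpus` would then be `reserved` + `shared` (`0-3`), while the CRI-O management drop-in keeps using only `reserved` (`0-1`).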

Contributor Author

BTW, the enablement of WP in NTO doesn't appear at: https://github.com/openshift/enhancements/blob/master/enhancements/workload-partitioning/management-workload-partitioning.md#implementation-history
This is why I thought this file was a static configuration and not handled by an operator.

Member

This comment thread is useful, and the key point about the crio.conf config separating the cpus reserved for management-component use from the other cpus now available in the shared pool is clarifying.

@bartwensley
Contributor

/cc browsell

@openshift-ci openshift-ci bot requested a review from browsell May 19, 2023 15:44
@Tal-or Tal-or force-pushed the mixed-cpu-node-plugin branch 2 times, most recently from 57a248e to fcf173c on June 7, 2023 15:42
Comment on lines 171 to 203
Once shared-partition lands, the annotation should be deprecated,
and the node-plugin will use the shared cpus specified
at `spec.cpu.shared`.
Contributor

This is assuming that there are no cases where we want to have one feature enabled but not the other. Just want to confirm that is a safe assumption? If the user wants mixed-cpu support, then that implies they are using isolated/guaranteed cpus for some workloads, so I think they would want to also have the shared-cpu-pool feature enabled as well to enable proper configuration of the shared/guaranteed cpu pools. In the other case, where the user wants the shared-cpu-pool feature, I guess it is possible that they don't need the mixed-cpu support. Is there any harm in enabling it in that case? What would be the cost of having it enabled but not used?

Contributor Author

That is a fair argument we should discuss.
Do we want to be able to shut down the node-plugin completely even when spec.cpu.shared is specified?
Maybe we can have some emergency API for doing that. WDYT?

Is there any harm in enabling it in that case? What would be the cost of having it enabled but not used?

Essentially the node-plugin does nothing unless a pod requests explicitly for shared cpus.
Nonetheless, the pod of the node-plugin still runs on the cluster and requires its own resources.
This might be critical on systems with tight resources.

If the user wants mixed-cpu support, then that implies they are using isolated/guaranteed cpus for some workloads, so I think they would want to also have the shared-cpu-pool feature enabled as well to enable proper configuration of the shared/guaranteed cpu pools.

I don't know if the shared partition is needed for all the workloads that need the extra shared cpu - @MarSik could you please advise here?

Contributor

I wasn't thinking of an emergency API, but more of a separate way to enable/disable the mixed-cpu support. I guess the question is how much cpu/memory will the node-plugin use when mixed-cpu is enabled and is this acceptable on all systems (mostly SNO) with tight resources?

Contributor Author

There are no hard requirements on how many resources the node-plugin needs, but I'll propose a way to enable it anyway, to decouple the enablement of the two features.

Contributor

Well, in theory, no: you do not need both at the same time. I would go the other way though: opt-in. mixedCpus has some special considerations that need to be documented, so it can be a workload hint. The docs about that hint can then explain how it works exactly (no limits, etc.).

Contributor Author

Thanks @MarSik that is exactly what I did.

node-role.kubernetes.io/performance: "test"
```

Specifying the annotation activates the feature and signals NTO to update the `reservedSystemCpus` in the Kubelet config,
Contributor

Can the feature be activated/deactivated on an already running system? Would that just require a reboot? What happens to existing application pods using CPUs that are moved to the reservedSystemCpus - would they automatically have their affinity updated to move them off those CPUs? I think it would be good to mention this here.

Contributor Author

Can the feature be activated/deactivated on an already running system?

Yes, I'll mention it.

Would that just require a reboot?

Yes, I'll mention it.

What happens to existing application pods using CPUs that are moved to the reservedSystemCpus - would they automatically have their affinity updated to move them off those CPUs?

Yes, but it's not related to this feature, hence I didn't specify it.
For example, we can achieve the same scenario described above just by changing the reserved CPUs in the performance profile.

Contributor

I guess the difference here is that if you deactivate the feature and some pods already have requests for the new openshift.io/shared-cpus resource, then those pods are going to fail - right? Is that OK? Do we need to prevent that from happening?

Contributor Author

@Tal-or Tal-or Jun 13, 2023

The pods will keep running until something happens (a Kubelet/node restart or a pod failure), but then they would go through the normal pod lifecycle process.

Contributor

OK - but deactivating this feature is going to require a restart. So that means any pods using the new resource will not come up after the restart - right? And is that OK? I guess we need a big warning in the customer docs?

Contributor Author

@Tal-or Tal-or Jun 13, 2023

Yes, all pods would be pending.

I guess we need a big warning in the customer docs?

Maybe. I'll check how other features behave in such cases and change accordingly.

resources (ServiceAccount, RBAC resources, SecurityContextConstraint, etc.)
NTO will also be responsible for watching, monitoring, and reporting the node-plugin's DaemonSet status.

In addition, since the two components are related, the mixed-cpu-node-plugin code would be vendored under NTO.
Copy link

@jmencak @MarSik @Tal-or this means we will have to vendor stable branches/tags/(commits?) of this repo for every fix/bulk of fixes once in a while to NTO as we discussed and agreed.

Comment on lines 195 to 206
Both workloadHints and the annotation have to be specified in order to activate the feature.
If only one of them is specified, NTO should report a warning (in its logs or status).

If this is a must, I wonder if we can enforce via validation that if you explicitly specify the annotation, then you must also have the workloadHints specified.
Let's ask it this way: if the annotation alone is specified, will anything happen?

Contributor Author

Nothing should happen.

Contributor

What information is provided in the hints vs. the annotation? From an API perspective, if these are joined and need to be set in tandem, then they should be API fields close together, with validation to ensure they are both set.

Having them disjoint is likely to lead to confusion for end users.

@Tal-or Tal-or force-pushed the mixed-cpu-node-plugin branch 4 times, most recently from a2917a6 to 044650f on June 14, 2023 08:43
@Tal-or
Contributor Author

Tal-or commented Jun 14, 2023

/cc @mrunalp

soltysh pushed a commit to soltysh/kubernetes that referenced this pull request Jul 17, 2024
Kubelet should advertise the shared cpus as extended resources.
This has the benefit of limiting the number of containers
that can request access to the shared cpus.

For more information see - openshift/enhancements#1396

Signed-off-by: Talor Itzhak <titzhak@redhat.com>
soltysh pushed a commit to soltysh/kubernetes that referenced this pull request Jul 17, 2024
Adding a new mutation plugin that handles the following:

1. In case of a `workload.openshift.io/enable-shared-cpus` request, it
   adds an annotation to hint the runtime about the request. The runtime
   is not aware of extended resources, hence we need the annotation.
2. It validates the pod's QoS class and returns an error if it's not a
   guaranteed QoS class.
3. It validates that no more than a single resource is being requested.
4. It validates that the pod is deployed in a namespace that has the
   mixedcpus-workloads-allowed annotation.

For more information see - openshift/enhancements#1396

Signed-off-by: Talor Itzhak <titzhak@redhat.com>

UPSTREAM: <carry>: Update management webhook pod admission logic

Updating the logic for pod admission to allow a pod creation with workload partitioning annotations to be run in a namespace that has no workload allow annotations.

The pod will be stripped of its workload annotations and treated as if it were normal; a warning annotation will be placed to note the behavior on the pod.

Signed-off-by: ehila <ehila@redhat.com>

UPSTREAM: <carry>: add support for cpu limits into management workloads

Added support to allow workload partitioning to use the CPU limits for a container. To allow the runtime to make better decisions around workload cpu quotas, we are passing down the cpu limit as part of the cpulimit value in the annotation. CRI-O will take that information and calculate the quota per node. This should support situations where workloads might have different cpu period overrides assigned.

Updated kubelet for static pods and the admission webhook for regular pods to support cpu limits.

Updated unit test to reflect changes.

Signed-off-by: ehila <ehila@redhat.com>
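The admission checks listed in the mutation-plugin commit message above can be sketched roughly as follows. This is a simplified, hypothetical model in Python: the real plugin operates on Kubernetes API objects, the dict shapes and function names here are illustrative, and the runtime-hint annotation key is an assumption (the thread does not spell it out).

```python
# Hypothetical sketch of the admission checks from the commit message above.
SHARED_CPU_RESOURCE = "workload.openshift.io/enable-shared-cpus"
RUNTIME_HINT_ANNOTATION = "io.openshift.mixed-cpus"  # illustrative key only

def admit(pod, namespace_allows_mixedcpus):
    """Mutate and return the pod, or raise ValueError when a check fails."""
    requesting = [c for c in pod["containers"]
                  if SHARED_CPU_RESOURCE in c.get("requests", {})]
    if not requesting:
        return pod  # nothing to do for pods without a shared-cpus request

    # 2. The pod must have a Guaranteed QoS class.
    if pod.get("qosClass") != "Guaranteed":
        raise ValueError("shared-cpus request requires Guaranteed QoS")

    # 3. No more than a single shared-cpus resource may be requested.
    for c in requesting:
        if int(c["requests"][SHARED_CPU_RESOURCE]) > 1:
            raise ValueError("at most one shared-cpus resource per container")

    # 4. The namespace must allow mixedcpus workloads.
    if not namespace_allows_mixedcpus:
        raise ValueError("namespace does not allow mixedcpus workloads")

    # 1. Hint the runtime via an annotation, since the runtime cannot
    #    see extended resources.
    pod.setdefault("annotations", {})[RUNTIME_HINT_ANNOTATION] = "enabled"
    return pod
```

For example, a Guaranteed pod requesting one shared-cpus resource in an allowed namespace passes and gains the hint annotation, while a Burstable pod making the same request is rejected.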
soltysh pushed a commit to soltysh/kubernetes that referenced this pull request Jul 18, 2024
Kubelet should advertise the shared cpus as extedned resources.
This has the benefit of limiting the amount of containers
that can request an access to the shared cpus.

For more information see - openshift/enhancements#1396

Signed-off-by: Talor Itzhak <titzhak@redhat.com>
soltysh pushed a commit to soltysh/kubernetes that referenced this pull request Jul 18, 2024
Adding a new mutation plugin that handles the following:

1. In case of `workload.openshift.io/enable-shared-cpus` request, it
   adds an annotation to hint runtime about the request. runtime
   is not aware of extended resources, hence we need the annotation.
2. It validates the pod's QoS class and return an error if it's not a
   guaranteed QoS class
3. It validates that no more than a single resource is being request.
4. It validates that the pod deployed in a namespace that has mixedcpus
   workloads allowed annotation.

For more information see - openshift/enhancements#1396

Signed-off-by: Talor Itzhak <titzhak@redhat.com>

UPSTREAM: <carry>: Update management webhook pod admission logic

Updating the logic for pod admission to allow a pod creation with workload partitioning annotations to be run in a namespace that has no workload allow annoations.

The pod will be stripped of its workload annotations and treated as if it were normal, a warning annoation will be placed to note the behavior on the pod.

Signed-off-by: ehila <ehila@redhat.com>

UPSTREAM: <carry>: add support for cpu limits into management workloads

Added support to allow workload partitioning to use the CPU limits for a container, to allow the runtime to make better decisions around workload cpu quotas we are passing down the cpu limit as part of the cpulimit value in the annotation. CRI-O will take that information and calculate the quota per node. This should support situations where workloads might have different cpu period overrides assigned.

Updated kubelet for static pods and the admission webhook for regular to support cpu limits.

Updated unit test to reflect changes.

Signed-off-by: ehila <ehila@redhat.com>
bertinatto pushed a commit to bertinatto/kubernetes that referenced this pull request Aug 1, 2024
Kubelet should advertise the shared cpus as extedned resources.
This has the benefit of limiting the amount of containers
that can request an access to the shared cpus.

For more information see - openshift/enhancements#1396

Signed-off-by: Talor Itzhak <titzhak@redhat.com>
bertinatto pushed a commit to bertinatto/kubernetes that referenced this pull request Aug 1, 2024
Adding a new mutation plugin that handles the following:

1. In case of `workload.openshift.io/enable-shared-cpus` request, it
   adds an annotation to hint runtime about the request. runtime
   is not aware of extended resources, hence we need the annotation.
2. It validates the pod's QoS class and return an error if it's not a
   guaranteed QoS class
3. It validates that no more than a single resource is being request.
4. It validates that the pod deployed in a namespace that has mixedcpus
   workloads allowed annotation.

For more information see - openshift/enhancements#1396

Signed-off-by: Talor Itzhak <titzhak@redhat.com>

UPSTREAM: <carry>: Update management webhook pod admission logic

Updating the logic for pod admission to allow a pod creation with workload partitioning annotations to be run in a namespace that has no workload allow annoations.

The pod will be stripped of its workload annotations and treated as if it were normal, a warning annoation will be placed to note the behavior on the pod.

Signed-off-by: ehila <ehila@redhat.com>

UPSTREAM: <carry>: add support for cpu limits into management workloads

Added support to allow workload partitioning to use the CPU limits for a container, to allow the runtime to make better decisions around workload cpu quotas we are passing down the cpu limit as part of the cpulimit value in the annotation. CRI-O will take that information and calculate the quota per node. This should support situations where workloads might have different cpu period overrides assigned.

Updated kubelet for static pods and the admission webhook for regular to support cpu limits.

Updated unit test to reflect changes.

Signed-off-by: ehila <ehila@redhat.com>
bertinatto pushed a commit to bertinatto/kubernetes that referenced this pull request Aug 2, 2024
Kubelet should advertise the shared cpus as extedned resources.
This has the benefit of limiting the amount of containers
that can request an access to the shared cpus.

For more information see - openshift/enhancements#1396

Signed-off-by: Talor Itzhak <titzhak@redhat.com>
bertinatto pushed a commit to bertinatto/kubernetes that referenced this pull request Aug 2, 2024
Adding a new mutation plugin that handles the following:

1. In case of `workload.openshift.io/enable-shared-cpus` request, it
   adds an annotation to hint runtime about the request. runtime
   is not aware of extended resources, hence we need the annotation.
2. It validates the pod's QoS class and return an error if it's not a
   guaranteed QoS class
3. It validates that no more than a single resource is being request.
4. It validates that the pod deployed in a namespace that has mixedcpus
   workloads allowed annotation.

For more information see - openshift/enhancements#1396

Signed-off-by: Talor Itzhak <titzhak@redhat.com>

UPSTREAM: <carry>: Update management webhook pod admission logic

Updating the logic for pod admission to allow a pod creation with workload partitioning annotations to be run in a namespace that has no workload allow annoations.

The pod will be stripped of its workload annotations and treated as if it were normal, a warning annoation will be placed to note the behavior on the pod.

Signed-off-by: ehila <ehila@redhat.com>

UPSTREAM: <carry>: add support for cpu limits into management workloads

Added support to allow workload partitioning to use the CPU limits for a container, to allow the runtime to make better decisions around workload cpu quotas we are passing down the cpu limit as part of the cpulimit value in the annotation. CRI-O will take that information and calculate the quota per node. This should support situations where workloads might have different cpu period overrides assigned.

Updated kubelet for static pods and the admission webhook for regular to support cpu limits.

Updated unit test to reflect changes.

Signed-off-by: ehila <ehila@redhat.com>
bertinatto pushed a commit to bertinatto/kubernetes that referenced this pull request Aug 5, 2024
Kubelet should advertise the shared cpus as extedned resources.
This has the benefit of limiting the amount of containers
that can request an access to the shared cpus.

For more information see - openshift/enhancements#1396

Signed-off-by: Talor Itzhak <titzhak@redhat.com>
bertinatto pushed a commit to bertinatto/kubernetes that referenced this pull request Aug 5, 2024
Adding a new mutation plugin that handles the following:

1. In case of a `workload.openshift.io/enable-shared-cpus` request, it
   adds an annotation to hint the runtime about the request. The runtime
   is not aware of extended resources, hence we need the annotation.
2. It validates the pod's QoS class and returns an error if it is not
   the Guaranteed QoS class.
3. It validates that no more than a single shared-cpus resource unit is
   being requested.
4. It validates that the pod is deployed in a namespace that carries the
   mixed-cpus workloads allowed annotation.

For more information see - openshift/enhancements#1396

Signed-off-by: Talor Itzhak <titzhak@redhat.com>

UPSTREAM: <carry>: Update management webhook pod admission logic

Updating the logic for pod admission to allow a pod creation with workload partitioning annotations to be run in a namespace that has no workload allow annoations.

The pod will be stripped of its workload annotations and treated as if it were normal, a warning annoation will be placed to note the behavior on the pod.

Signed-off-by: ehila <ehila@redhat.com>

UPSTREAM: <carry>: add support for cpu limits into management workloads

Added support to allow workload partitioning to use the CPU limits for a container, to allow the runtime to make better decisions around workload cpu quotas we are passing down the cpu limit as part of the cpulimit value in the annotation. CRI-O will take that information and calculate the quota per node. This should support situations where workloads might have different cpu period overrides assigned.

Updated kubelet for static pods and the admission webhook for regular to support cpu limits.

Updated unit test to reflect changes.

Signed-off-by: ehila <ehila@redhat.com>
bertinatto pushed a commit to bertinatto/kubernetes that referenced this pull request Aug 6, 2024
Kubelet should advertise the shared cpus as extedned resources.
This has the benefit of limiting the amount of containers
that can request an access to the shared cpus.

For more information see - openshift/enhancements#1396

Signed-off-by: Talor Itzhak <titzhak@redhat.com>
bertinatto pushed a commit to bertinatto/kubernetes that referenced this pull request Aug 6, 2024
Adding a new mutation plugin that handles the following:

1. In case of `workload.openshift.io/enable-shared-cpus` request, it
   adds an annotation to hint runtime about the request. runtime
   is not aware of extended resources, hence we need the annotation.
2. It validates the pod's QoS class and return an error if it's not a
   guaranteed QoS class
3. It validates that no more than a single resource is being request.
4. It validates that the pod deployed in a namespace that has mixedcpus
   workloads allowed annotation.

For more information see - openshift/enhancements#1396

Signed-off-by: Talor Itzhak <titzhak@redhat.com>

UPSTREAM: <carry>: Update management webhook pod admission logic

Updating the logic for pod admission to allow a pod creation with workload partitioning annotations to be run in a namespace that has no workload allow annoations.

The pod will be stripped of its workload annotations and treated as if it were normal, a warning annoation will be placed to note the behavior on the pod.

Signed-off-by: ehila <ehila@redhat.com>

UPSTREAM: <carry>: add support for cpu limits into management workloads

Added support to allow workload partitioning to use the CPU limits for a container, to allow the runtime to make better decisions around workload cpu quotas we are passing down the cpu limit as part of the cpulimit value in the annotation. CRI-O will take that information and calculate the quota per node. This should support situations where workloads might have different cpu period overrides assigned.

Updated kubelet for static pods and the admission webhook for regular to support cpu limits.

Updated unit test to reflect changes.

Signed-off-by: ehila <ehila@redhat.com>
bertinatto pushed a commit to bertinatto/kubernetes that referenced this pull request Aug 7, 2024
Kubelet should advertise the shared cpus as extedned resources.
This has the benefit of limiting the amount of containers
that can request an access to the shared cpus.

For more information see - openshift/enhancements#1396

Signed-off-by: Talor Itzhak <titzhak@redhat.com>
bertinatto pushed a commit to bertinatto/kubernetes that referenced this pull request Aug 7, 2024
Adding a new mutation plugin that handles the following:

1. In case of `workload.openshift.io/enable-shared-cpus` request, it
   adds an annotation to hint runtime about the request. runtime
   is not aware of extended resources, hence we need the annotation.
2. It validates the pod's QoS class and return an error if it's not a
   guaranteed QoS class
3. It validates that no more than a single resource is being request.
4. It validates that the pod deployed in a namespace that has mixedcpus
   workloads allowed annotation.

For more information see - openshift/enhancements#1396

Signed-off-by: Talor Itzhak <titzhak@redhat.com>

UPSTREAM: <carry>: Update management webhook pod admission logic

Updating the logic for pod admission to allow a pod creation with workload partitioning annotations to be run in a namespace that has no workload allow annoations.

The pod will be stripped of its workload annotations and treated as if it were normal, a warning annoation will be placed to note the behavior on the pod.

Signed-off-by: ehila <ehila@redhat.com>

UPSTREAM: <carry>: add support for cpu limits into management workloads

Added support to allow workload partitioning to use the CPU limits for a container, to allow the runtime to make better decisions around workload cpu quotas we are passing down the cpu limit as part of the cpulimit value in the annotation. CRI-O will take that information and calculate the quota per node. This should support situations where workloads might have different cpu period overrides assigned.

Updated kubelet for static pods and the admission webhook for regular to support cpu limits.

Updated unit test to reflect changes.

Signed-off-by: ehila <ehila@redhat.com>
atiratree pushed a commit to atiratree/kubernetes that referenced this pull request Aug 9, 2024
Kubelet should advertise the shared cpus as extedned resources.
This has the benefit of limiting the amount of containers
that can request an access to the shared cpus.

For more information see - openshift/enhancements#1396

Signed-off-by: Talor Itzhak <titzhak@redhat.com>
atiratree pushed a commit to atiratree/kubernetes that referenced this pull request Aug 9, 2024
Adding a new mutation plugin that handles the following:

1. In case of `workload.openshift.io/enable-shared-cpus` request, it
   adds an annotation to hint runtime about the request. runtime
   is not aware of extended resources, hence we need the annotation.
2. It validates the pod's QoS class and return an error if it's not a
   guaranteed QoS class
3. It validates that no more than a single resource is being request.
4. It validates that the pod deployed in a namespace that has mixedcpus
   workloads allowed annotation.

For more information see - openshift/enhancements#1396

Signed-off-by: Talor Itzhak <titzhak@redhat.com>

UPSTREAM: <carry>: Update management webhook pod admission logic

Updating the logic for pod admission to allow a pod creation with workload partitioning annotations to be run in a namespace that has no workload allow annoations.

The pod will be stripped of its workload annotations and treated as if it were normal, a warning annoation will be placed to note the behavior on the pod.

Signed-off-by: ehila <ehila@redhat.com>

UPSTREAM: <carry>: add support for cpu limits into management workloads

Added support to allow workload partitioning to use the CPU limits for a container, to allow the runtime to make better decisions around workload cpu quotas we are passing down the cpu limit as part of the cpulimit value in the annotation. CRI-O will take that information and calculate the quota per node. This should support situations where workloads might have different cpu period overrides assigned.

Updated kubelet for static pods and the admission webhook for regular to support cpu limits.

Updated unit test to reflect changes.

Signed-off-by: ehila <ehila@redhat.com>
bertinatto pushed a commit to bertinatto/kubernetes that referenced this pull request Aug 13, 2024
Kubelet should advertise the shared cpus as extedned resources.
This has the benefit of limiting the amount of containers
that can request an access to the shared cpus.

For more information see - openshift/enhancements#1396

Signed-off-by: Talor Itzhak <titzhak@redhat.com>
bertinatto pushed a commit to bertinatto/kubernetes that referenced this pull request Aug 13, 2024
Adding a new mutation plugin that handles the following:

1. In case of `workload.openshift.io/enable-shared-cpus` request, it
   adds an annotation to hint runtime about the request. runtime
   is not aware of extended resources, hence we need the annotation.
2. It validates the pod's QoS class and return an error if it's not a
   guaranteed QoS class
3. It validates that no more than a single resource is being request.
4. It validates that the pod deployed in a namespace that has mixedcpus
   workloads allowed annotation.

For more information see - openshift/enhancements#1396

Signed-off-by: Talor Itzhak <titzhak@redhat.com>

UPSTREAM: <carry>: Update management webhook pod admission logic

Updating the logic for pod admission to allow a pod creation with workload partitioning annotations to be run in a namespace that has no workload allow annoations.

The pod will be stripped of its workload annotations and treated as if it were normal, a warning annoation will be placed to note the behavior on the pod.

Signed-off-by: ehila <ehila@redhat.com>

UPSTREAM: <carry>: add support for cpu limits into management workloads

Added support to allow workload partitioning to use the CPU limits for a container, to allow the runtime to make better decisions around workload cpu quotas we are passing down the cpu limit as part of the cpulimit value in the annotation. CRI-O will take that information and calculate the quota per node. This should support situations where workloads might have different cpu period overrides assigned.

Updated kubelet for static pods and the admission webhook for regular to support cpu limits.

Updated unit test to reflect changes.

Signed-off-by: ehila <ehila@redhat.com>
atiratree pushed a commit to atiratree/kubernetes that referenced this pull request Aug 15, 2024
Kubelet should advertise the shared cpus as extedned resources.
This has the benefit of limiting the amount of containers
that can request an access to the shared cpus.

For more information see - openshift/enhancements#1396

Signed-off-by: Talor Itzhak <titzhak@redhat.com>
atiratree pushed a commit to atiratree/kubernetes that referenced this pull request Aug 15, 2024
Adding a new mutation plugin that handles the following:

1. In case of `workload.openshift.io/enable-shared-cpus` request, it
   adds an annotation to hint runtime about the request. runtime
   is not aware of extended resources, hence we need the annotation.
2. It validates the pod's QoS class and return an error if it's not a
   guaranteed QoS class
3. It validates that no more than a single resource is being request.
4. It validates that the pod deployed in a namespace that has mixedcpus
   workloads allowed annotation.

For more information see - openshift/enhancements#1396

Signed-off-by: Talor Itzhak <titzhak@redhat.com>

UPSTREAM: <carry>: Update management webhook pod admission logic

Updating the logic for pod admission to allow a pod creation with workload partitioning annotations to be run in a namespace that has no workload allow annoations.

The pod will be stripped of its workload annotations and treated as if it were normal, a warning annoation will be placed to note the behavior on the pod.

Signed-off-by: ehila <ehila@redhat.com>

UPSTREAM: <carry>: add support for cpu limits into management workloads

Added support to allow workload partitioning to use the CPU limits for a container, to allow the runtime to make better decisions around workload cpu quotas we are passing down the cpu limit as part of the cpulimit value in the annotation. CRI-O will take that information and calculate the quota per node. This should support situations where workloads might have different cpu period overrides assigned.

Updated kubelet for static pods and the admission webhook for regular to support cpu limits.

Updated unit test to reflect changes.

Signed-off-by: ehila <ehila@redhat.com>
atiratree pushed a commit to atiratree/kubernetes that referenced this pull request Aug 15, 2024
Kubelet should advertise the shared cpus as extedned resources.
This has the benefit of limiting the amount of containers
that can request an access to the shared cpus.

For more information see - openshift/enhancements#1396

Signed-off-by: Talor Itzhak <titzhak@redhat.com>
atiratree pushed a commit to atiratree/kubernetes that referenced this pull request Aug 15, 2024
Adding a new mutation plugin that handles the following:

1. In case of `workload.openshift.io/enable-shared-cpus` request, it
   adds an annotation to hint runtime about the request. runtime
   is not aware of extended resources, hence we need the annotation.
2. It validates the pod's QoS class and return an error if it's not a
   guaranteed QoS class
3. It validates that no more than a single resource is being request.
4. It validates that the pod deployed in a namespace that has mixedcpus
   workloads allowed annotation.

For more information see - openshift/enhancements#1396

Signed-off-by: Talor Itzhak <titzhak@redhat.com>

UPSTREAM: <carry>: Update management webhook pod admission logic

Updating the logic for pod admission to allow a pod creation with workload partitioning annotations to be run in a namespace that has no workload allow annoations.

The pod will be stripped of its workload annotations and treated as if it were normal, a warning annoation will be placed to note the behavior on the pod.

Signed-off-by: ehila <ehila@redhat.com>

UPSTREAM: <carry>: add support for cpu limits into management workloads

Added support to allow workload partitioning to use the CPU limits for a container, to allow the runtime to make better decisions around workload cpu quotas we are passing down the cpu limit as part of the cpulimit value in the annotation. CRI-O will take that information and calculate the quota per node. This should support situations where workloads might have different cpu period overrides assigned.

Updated kubelet for static pods and the admission webhook for regular to support cpu limits.

Updated unit test to reflect changes.

Signed-off-by: ehila <ehila@redhat.com>
bertinatto pushed a commit to bertinatto/kubernetes that referenced this pull request Aug 20, 2024
Kubelet should advertise the shared cpus as extedned resources.
This has the benefit of limiting the amount of containers
that can request an access to the shared cpus.

For more information see - openshift/enhancements#1396

Signed-off-by: Talor Itzhak <titzhak@redhat.com>
bertinatto pushed a commit to bertinatto/kubernetes that referenced this pull request Aug 20, 2024
Adding a new mutation plugin that handles the following:

1. In case of `workload.openshift.io/enable-shared-cpus` request, it
   adds an annotation to hint runtime about the request. runtime
   is not aware of extended resources, hence we need the annotation.
2. It validates the pod's QoS class and return an error if it's not a
   guaranteed QoS class
3. It validates that no more than a single resource is being request.
4. It validates that the pod deployed in a namespace that has mixedcpus
   workloads allowed annotation.

For more information see - openshift/enhancements#1396

Signed-off-by: Talor Itzhak <titzhak@redhat.com>

UPSTREAM: <carry>: Update management webhook pod admission logic

Updating the logic for pod admission to allow a pod creation with workload partitioning annotations to be run in a namespace that has no workload allow annoations.

The pod will be stripped of its workload annotations and treated as if it were normal, a warning annoation will be placed to note the behavior on the pod.

Signed-off-by: ehila <ehila@redhat.com>

UPSTREAM: <carry>: add support for cpu limits into management workloads

Added support to allow workload partitioning to use the CPU limits for a container, to allow the runtime to make better decisions around workload cpu quotas we are passing down the cpu limit as part of the cpulimit value in the annotation. CRI-O will take that information and calculate the quota per node. This should support situations where workloads might have different cpu period overrides assigned.

Updated kubelet for static pods and the admission webhook for regular to support cpu limits.

Updated unit test to reflect changes.

Signed-off-by: ehila <ehila@redhat.com>
bertinatto pushed a commit to bertinatto/kubernetes that referenced this pull request Aug 20, 2024
Kubelet should advertise the shared cpus as extedned resources.
This has the benefit of limiting the amount of containers
that can request an access to the shared cpus.

For more information see - openshift/enhancements#1396

Signed-off-by: Talor Itzhak <titzhak@redhat.com>
bertinatto pushed a commit to bertinatto/kubernetes that referenced this pull request Aug 20, 2024
Adding a new mutation plugin that handles the following:

1. In case of `workload.openshift.io/enable-shared-cpus` request, it
   adds an annotation to hint runtime about the request. runtime
   is not aware of extended resources, hence we need the annotation.
2. It validates the pod's QoS class and return an error if it's not a
   guaranteed QoS class
3. It validates that no more than a single resource is being request.
4. It validates that the pod deployed in a namespace that has mixedcpus
   workloads allowed annotation.

For more information see - openshift/enhancements#1396

Signed-off-by: Talor Itzhak <titzhak@redhat.com>

UPSTREAM: <carry>: Update management webhook pod admission logic

Updating the logic for pod admission to allow a pod creation with workload partitioning annotations to be run in a namespace that has no workload allow annoations.

The pod will be stripped of its workload annotations and treated as if it were normal, a warning annoation will be placed to note the behavior on the pod.

Signed-off-by: ehila <ehila@redhat.com>

UPSTREAM: <carry>: add support for cpu limits into management workloads

Added support to allow workload partitioning to use the CPU limits for a container, to allow the runtime to make better decisions around workload cpu quotas we are passing down the cpu limit as part of the cpulimit value in the annotation. CRI-O will take that information and calculate the quota per node. This should support situations where workloads might have different cpu period overrides assigned.

Updated kubelet for static pods and the admission webhook for regular to support cpu limits.

Updated unit test to reflect changes.

Signed-off-by: ehila <ehila@redhat.com>
bertinatto pushed a commit to bertinatto/kubernetes that referenced this pull request Aug 21, 2024
Kubelet should advertise the shared cpus as extedned resources.
This has the benefit of limiting the amount of containers
that can request an access to the shared cpus.

For more information see - openshift/enhancements#1396

Signed-off-by: Talor Itzhak <titzhak@redhat.com>
bertinatto pushed a commit to bertinatto/kubernetes that referenced this pull request Aug 21, 2024
Adding a new mutation plugin that handles the following:

1. In case of `workload.openshift.io/enable-shared-cpus` request, it
   adds an annotation to hint runtime about the request. runtime
   is not aware of extended resources, hence we need the annotation.
2. It validates the pod's QoS class and return an error if it's not a
   guaranteed QoS class
3. It validates that no more than a single resource is being request.
4. It validates that the pod deployed in a namespace that has mixedcpus
   workloads allowed annotation.

For more information see - openshift/enhancements#1396

Signed-off-by: Talor Itzhak <titzhak@redhat.com>

UPSTREAM: <carry>: Update management webhook pod admission logic

Updating the logic for pod admission to allow a pod creation with workload partitioning annotations to be run in a namespace that has no workload allow annoations.

The pod will be stripped of its workload annotations and treated as if it were normal, a warning annoation will be placed to note the behavior on the pod.

Signed-off-by: ehila <ehila@redhat.com>

UPSTREAM: <carry>: add support for cpu limits into management workloads

Added support to allow workload partitioning to use the CPU limits for a container, to allow the runtime to make better decisions around workload cpu quotas we are passing down the cpu limit as part of the cpulimit value in the annotation. CRI-O will take that information and calculate the quota per node. This should support situations where workloads might have different cpu period overrides assigned.

Updated kubelet for static pods and the admission webhook for regular to support cpu limits.

Updated unit test to reflect changes.

Signed-off-by: ehila <ehila@redhat.com>
atiratree pushed a commit to atiratree/kubernetes that referenced this pull request Aug 22, 2024
Kubelet should advertise the shared cpus as extedned resources.
This has the benefit of limiting the amount of containers
that can request an access to the shared cpus.

For more information see - openshift/enhancements#1396

Signed-off-by: Talor Itzhak <titzhak@redhat.com>
atiratree pushed a commit to atiratree/kubernetes that referenced this pull request Aug 22, 2024
Adding a new mutation plugin that handles the following:

1. In case of `workload.openshift.io/enable-shared-cpus` request, it
   adds an annotation to hint runtime about the request. runtime
   is not aware of extended resources, hence we need the annotation.
2. It validates the pod's QoS class and return an error if it's not a
   guaranteed QoS class
3. It validates that no more than a single resource is being request.
4. It validates that the pod deployed in a namespace that has mixedcpus
   workloads allowed annotation.

For more information see - openshift/enhancements#1396

Signed-off-by: Talor Itzhak <titzhak@redhat.com>

UPSTREAM: <carry>: Update management webhook pod admission logic

Updating the logic for pod admission to allow a pod creation with workload partitioning annotations to be run in a namespace that has no workload allow annoations.

The pod will be stripped of its workload annotations and treated as if it were normal, a warning annoation will be placed to note the behavior on the pod.

Signed-off-by: ehila <ehila@redhat.com>

UPSTREAM: <carry>: add support for cpu limits into management workloads

Added support to allow workload partitioning to use the CPU limits for a container, to allow the runtime to make better decisions around workload cpu quotas we are passing down the cpu limit as part of the cpulimit value in the annotation. CRI-O will take that information and calculate the quota per node. This should support situations where workloads might have different cpu period overrides assigned.

Updated kubelet for static pods and the admission webhook for regular to support cpu limits.

Updated unit test to reflect changes.

Signed-off-by: ehila <ehila@redhat.com>
Labels: approved, jira/valid-reference, lgtm