
PSAP-1236: Containerize Tuned #1597

Open
yanirq wants to merge 1 commit into master from containerized_tuned

Conversation

yanirq commented Mar 14, 2024

No description provided.

@openshift-ci-robot added the jira/valid-reference label Mar 14, 2024
openshift-ci-robot commented Mar 14, 2024

@yanirq: This pull request references PSAP-1236 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the epic to target the "4.16.0" version, but no target version was set.


yanirq commented Mar 14, 2024

@yanirq force-pushed the containerized_tuned branch 2 times, most recently from a761560 to 7cbf1f9 on March 15, 2024 at 09:48

* “Accumulating” NTO images during upgrades.
* Even with “podmanized” TuneD, we’ll likely not be able to resolve issues such as [these](https://github.com/openshift/machine-config-operator/pull/2944#issuecomment-1029739561) and centralize the tuning into NTO – chicken-and-egg with default OpenShift TuneD profiles. Rendering them via MCO/installer, doing some built-in NTO bootstrap, or using host-native TuneD would help.
* We need to ensure TuneD does not block the kubelet systemd service when starting before it; even 15s of TuneD startup time would be an issue.
Contributor:

Quite the contrary! Tuned must finish tuning before Kubelet starts, or there is a chance the workloads will start on a partially configured system.

The tuning time itself is something that should be kept short of course.

Author (@yanirq):

So to elaborate:
If any workload starts before tuning is applied, it's an issue, so we might want to hold kubelet back (blocked by TuneD).
This could be risky though.

Contributor:

When it comes to bugs, this is a choice between having a fully broken cluster in case of TuneD (or the supervising process) issues or a working cluster with incomplete tuning. The ideal scenario, of course, is to iron out all the bugs, reduce the TuneD startup time and block the kubelet. I'm not convinced we're in a position to block kubelet by running TuneD to completion in this way at this point.

Contributor:

Could we block kubelet but then have a (long) timeout that allows kubelet to come up even if tuned fails to complete the tuning? Then in the (very rare) case where tuning hangs for some reason we still get a partially functioning cluster.

Contributor:

Yes, we discussed this yesterday with Martin. This is a possibility. Apart from a few design choices, there is the question of what a "(long) timeout" (as you say) should be. When TuneD was run in a pod, the timeout of 60s did not seem to be sufficient in constrained SNO environments. This might be slightly different with TuneD starting prior to kubelet, but we simply don't have any data yet. My bet is that 60s is enough, though, as you'll not have all the pods starting at the same time. But this needs testing.
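To make the trade-off discussed in this thread concrete, here is a minimal sketch of the "order TuneD before kubelet, but cap it with a generous timeout" idea as a MachineConfig-delivered systemd unit. The unit name, timeout value, image pull spec and ExecStart command are illustrative assumptions, not the design settled in this PR:

```yaml
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  name: 99-worker-ocp-tuned-one-shot        # hypothetical name
  labels:
    machineconfiguration.openshift.io/role: worker
spec:
  config:
    ignition:
      version: 3.2.0
    systemd:
      units:
        - name: ocp-tuned-one-shot.service  # hypothetical unit name
          enabled: true
          contents: |
            [Unit]
            Description=One-shot TuneD run before kubelet (sketch)
            # Ordering only: kubelet will not start until this unit finishes.
            Before=kubelet.service
            After=network-online.target
            Wants=network-online.target

            [Service]
            Type=oneshot
            RemainAfterExit=yes
            # Bounds how long kubelet can be held back if TuneD hangs.
            # 600s is an assumption, not a value agreed in this discussion.
            TimeoutStartSec=600
            # Illustrative command only; the real entrypoint and flags are part
            # of the design under discussion, not something this sketch pins down.
            ExecStart=/usr/bin/podman run --rm --privileged --network=host \
              --volume /sys:/sys --volume /etc/tuned:/etc/tuned \
              quay.io/example/cluster-node-tuning-operator:latest \
              apply-tuning-one-shot

            [Install]
            WantedBy=multi-user.target
```

TimeoutStartSec is exactly the "(long) timeout" knob from the comment above: with only an ordering dependency (Before=), a failed or timed-out run does not prevent kubelet from starting afterwards, giving a partially tuned but functional node.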


* Tuning integrity is kept and not corrupted when node restarts occur
* Tuning integrity is kept and not corrupted when pod restarts occur
* Tuning will be done as “one shot” for the majority of use cases
Contributor:

Will this EP include a change to stop tuned profiles from being re-applied when the profile changes? If so, then I think that would enforce the "one shot" for all use cases - right? If you are not including that change, then what is enforcing the "one shot" aspect of this?

@yanirq (Author), Mar 24, 2024:

So we are looking into that, e.g. solving it at the bug level with OCPBUGS-26401 and OCPBUGS-28647.

If the implementation intersects with this enhancement then we might indeed be enforcing "one shot" here, but there are some exclusions, such as possible TuneD runtime reloads and still using the rtentsk plug-in.

@jmencak @ffromani any additional thoughts here ?

Author (@yanirq):

A solution suggested by @jmencak, to be done before this EP:
Remove the "rendered" Tuned resource and augment the Profiles with the list of Tuned profiles (Spec.Profile) that each Profile's Spec.Config.TunedProfile was calculated from. This unlinks Profile changes from Tuned changes when they are not actually needed, in order to reduce TuneD reloads.
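For illustration, this is roughly what a Profile object could look like with the suggested list embedded. The existing spec.config fields are real, but the spec.profile list below is only a sketch of the suggested shape, not an existing API, and the profile data is abbreviated:

```yaml
apiVersion: tuned.openshift.io/v1
kind: Profile
metadata:
  name: worker-node-1                       # Profile objects are named after nodes
  namespace: openshift-cluster-node-tuning-operator
spec:
  config:
    tunedProfile: openshift-node            # the profile calculated for this node
    debug: false
  # Sketch of the proposed Spec.Profile list: the TuneD profiles that
  # tunedProfile above was calculated from, embedded by name and content,
  # so that unrelated Tuned object changes no longer force a TuneD reload.
  profile:
    - name: openshift
      data: |
        [main]
        summary=Optimize systems running OpenShift (parent profile)
    - name: openshift-node
      data: |
        [main]
        summary=Optimize systems running OpenShift nodes
        include=openshift
```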

Reviewer:

This is a key component to nail down.

  1. Are we only allowing one-shot, period? Do any new profile changes require a reboot, like a MachineConfig?
  2. Are we allowing one-shot for some tunings/plugins but runtime changes for others? If so, how is that achieved with the service approach?

* Allow static key IPIs to be disabled (caused when sockets are opened with timestamping enabled). Currently these IPIs cause latency hits for low-latency applications. This applies to RHEL and also to OpenShift deployments. Currently, to prevent these IPIs, TuneD has an [rtentsk plugin](https://github.com/redhat-performance/tuned/blob/master/tuned/plugins/plugin_rtentsk.py) that opens a socket with timestamping enabled and holds this socket open.
This will have to be implemented on RHEL + RHCOS - [RHEL-23928](https://issues.redhat.com/browse/RHEL-23928)

### Workflow Description
Contributor:

Not sure if it belongs here, but it would be nice to have a walkthrough of the steps for the initial deployment of an SNO with the NTO to show how everything gets set up and when/how the server is restarted and the new MC is applied to create the new systemd service.

@browsell, Apr 1, 2024:

Agree, we should also document any workflow differences on a standard cluster

* Disabling that functionality when PerformanceProfile is in place should be considered a viable option.

* An extra reboot during installation/upgrade can be caused by the machine config.
* This is potentially solvable by providing an MC manifest for the TuneD systemd service at bootstrap.
Contributor:

But the tuned profile wouldn't be available during bootstrap - so how does the initial application of the tuned profile work in that case (when it has been created by NTO)?

Contributor:

There was some preliminary attempt to do this at bootstrap but I wasn't part of that effort. Yanir/Martin can probably comment. Not only the TuneD profiles, but also the NTO/TuneD image needs to be identified. At bootstrap using the installer this seems easier than using MCO, but using MCO for this covers more installation types. This area needs more investigation/work.


- The [pod label based matching](https://docs.openshift.com/container-platform/4.15/scalability_and_performance/using-node-tuning-operator.html#accessing-an-example-node-tuning-operator-specification_node-tuning-operator) needs to be officially deprecated, although the official warning already exists.

## Upgrade / Downgrade Strategy
Contributor:

I think this section needs to be filled out to describe how upgrades will work:

  • Upgrade NTO from version with tuned daemonset to version with containerized tuned.
  • Upgrade NTO with both versions supporting containerized tuned.
  • IBU (Image Based Upgrades) for SNO where new release has containerized tuned.

Reviewer:

Agree. The IBU upgrade should be fairly straightforward


* Order of pod start-up ([OCPBUGS-26401](https://issues.redhat.com/browse/OCPBUGS-26401)): in case of a node reboot, all pods restart again in random order. Since there is no control over pod restart order, it is possible that the TuneD pod will start after the workloads. This means the workload pods start with partial tuning, which can affect performance or even cause the workload to crash. One of the reasons is that partial tuning and/or dynamic tuning done by CRI-O hooks is undone.

* TuneD restarts / reapplication ([OCPBUGS-26400](https://issues.redhat.com/browse/OCPBUGS-26400), [OCPBUGS-27778](https://issues.redhat.com/browse/OCPBUGS-27778), [OCPBUGS-22474](https://issues.redhat.com/browse/OCPBUGS-22474)): if TuneD restarts at runtime for any reason, tuning is broken.
Contributor:

While this might be partially mitigated by starting TuneD and running it prior to kubelet, there must be a control mechanism to apply new configurations -- hence possible TuneD runtime reloads.

@yanirq (Author), Mar 24, 2024:

I will remove the last line then. I think the bugs sum up the restart issues nicely.

> control mechanism to apply new configurations

With the current daemon set? Or should we add another detailed mechanism in this EP?

Signed-off-by: Yanir Quinn <yquinn@redhat.com>

### Implementation Details/Notes/Constraints

* “Accumulating” NTO images during upgrades: by using podman to run TuneD from the NTO image that comes from the release payload, the old payload image will not be removed from the cluster during upgrades as it is still used by the systemd unit running podman.
Reviewer:

How do we ensure it does not get GC'd


* Even with this approach, TuneD still must finish tuning before Kubelet starts, or there is a chance the workloads will start on a partially configured system.
Reviewer:

It is a requirement that this service finishes before any workloads are launched.



* [rtentsk plugin](https://github.com/redhat-performance/tuned/blob/master/tuned/plugins/plugin_rtentsk.py): keeps a socket open to avoid static key IPIs. If TuneD didn't do this, we'd need another process to do it. In the past there was such a binary, [rt-entsk, part of the rt-setup package](https://www.redhat.com/sysadmin/real-time-kernel); it was removed in favor of the rtentsk plugin.
* [cpu plugin](https://github.com/redhat-performance/tuned/blob/bcfbd5de1163f95deb45b5a8319aff99020ccef9/tuned/plugins/plugin_cpu.py#L497): keeps the /dev/cpu_dma_latency file open to control the maximum C-state. It may be possible to eliminate this by writing to /sys/devices/system/cpu/cpuX/cpuidle/stateY/disable for each unwanted C-state.

* Even if we eliminated the above and allowed TuneD to just run, apply settings and then exit, we'd still have an issue if the TuneD pod restarted. In that case we would need a way to determine that the configuration had already been applied and avoid doing it again.
Reviewer:

Could be run as a job if truly one shot
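To illustrate the "run it as a Job" suggestion, here is a minimal sketch of a per-node one-shot Job; the image, command and mounts are placeholders, and this is not the approach the enhancement settles on:

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: tuned-one-shot-worker-node-1        # hypothetical, one Job per node
  namespace: openshift-cluster-node-tuning-operator
spec:
  backoffLimit: 2
  template:
    spec:
      nodeName: worker-node-1               # pin the Job to the node it tunes
      restartPolicy: Never
      hostPID: true
      containers:
        - name: tuned-one-shot
          image: quay.io/example/cluster-node-tuning-operator:latest  # placeholder
          # Placeholder command; a real one-shot entrypoint would apply the
          # calculated profile and then exit.
          command: ["/bin/sh", "-c", "echo 'apply tuning here' && sleep 1"]
          securityContext:
            privileged: true
          volumeMounts:
            - name: sys
              mountPath: /sys
      volumes:
        - name: sys
          hostPath:
            path: /sys
```

A Job only addresses the "truly one shot" lifecycle; by itself it does not solve the ordering problem relative to kubelet and the workload pods, which is the main concern elsewhere in this review.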

Use a taint to prevent application pods from running before the TuneD pod has applied the configuration (and then NTO would remove the taint and unblock the application pods). There are some concerns with this approach:

* Platform pods would require this taint to be tolerated, possibly requiring changes to a lot of pod specs.
* How would this work when the node was rebooted and the pods are already scheduled?
Reviewer:

In the SNO scenario pods would already have been scheduled
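For reference, a sketch of the taint idea from the list above, with a hypothetical taint key; NTO (or whichever agent applies the tuning) would remove the taint once tuning completes, and every platform pod would need the matching toleration:

```yaml
# Node carrying the hypothetical "tuning not applied yet" taint.
apiVersion: v1
kind: Node
metadata:
  name: worker-node-1
spec:
  taints:
    - key: tuned.openshift.io/not-ready     # hypothetical key
      effect: NoSchedule
---
# A platform pod that must be allowed to run before the taint is removed.
apiVersion: v1
kind: Pod
metadata:
  name: example-platform-pod
  namespace: openshift-example
spec:
  tolerations:
    - key: tuned.openshift.io/not-ready     # must match the hypothetical taint
      operator: Exists
      effect: NoSchedule
  containers:
    - name: app
      image: registry.example.com/platform/app:latest   # placeholder
```

As the comment above points out, a NoSchedule taint does not help after a reboot (for example on SNO), because pods already bound to the node are not re-scheduled; only a NoExecute taint would evict them, which brings its own problems.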



Options such as reducing the TuneD startup time and/or blocking the kubelet with a timeout on TuneD failures can be taken into consideration.
* Note: to remove the need for a daemon that keeps files open for some TuneD plugins (rtentsk, cpu), [RHEL-23928](https://issues.redhat.com/browse/RHEL-23928) and [RHEL-23926](https://issues.redhat.com/browse/RHEL-23926) will have to be implemented first.

### Risks and Mitigations
Reviewer:

We introduce an explicit dependency between MCO and NTO. Can this be an issue? Is it worth highlighting among the risks and/or the drawbacks?


#### Pod ordering problems in general

Alternative ways that pod startup could be ordered - potentially with some upstream k8s changes:
Reviewer:

More than ordering, it seems the problem is that there is no concept of dependencies across pods. That may be worth spelling out explicitly.

@openshift-bot:

Inactive enhancement proposals go stale after 28d of inactivity.

See https://github.com/openshift/enhancements#life-cycle for details.

Mark the proposal as fresh by commenting /remove-lifecycle stale.
Stale proposals rot after an additional 7d of inactivity and eventually close.
Exclude this proposal from closing by commenting /lifecycle frozen.

If this proposal is safe to close now please do so with /close.

/lifecycle stale

openshift-ci bot added the lifecycle/stale label May 1, 2024
@openshift-bot:

Stale enhancement proposals rot after 7d of inactivity.

See https://github.com/openshift/enhancements#life-cycle for details.

Mark the proposal as fresh by commenting /remove-lifecycle rotten.
Rotten proposals close after an additional 7d of inactivity.
Exclude this proposal from closing by commenting /lifecycle frozen.

If this proposal is safe to close now please do so with /close.

/lifecycle rotten
/remove-lifecycle stale

openshift-ci bot added the lifecycle/rotten label and removed the lifecycle/stale label May 8, 2024
@openshift-bot:

Rotten enhancement proposals close after 7d of inactivity.

See https://github.com/openshift/enhancements#life-cycle for details.

Reopen the proposal by commenting /reopen.
Mark the proposal as fresh by commenting /remove-lifecycle rotten.
Exclude this proposal from closing again by commenting /lifecycle frozen.

/close

openshift-ci bot closed this May 16, 2024
openshift-ci bot commented May 16, 2024

@openshift-bot: Closed this PR.


yanirq commented May 16, 2024

/reopen
/remove-lifecycle rotten

openshift-ci bot reopened this May 16, 2024
openshift-ci bot commented May 16, 2024

@yanirq: Reopened this PR.


openshift-ci-robot commented May 16, 2024

@yanirq: This pull request references PSAP-1236 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the epic to target the "4.16.0" version, but no target version was set.


openshift-ci bot removed the lifecycle/rotten label May 16, 2024
openshift-ci bot commented May 16, 2024

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign sinnykumari for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

openshift-ci bot commented May 16, 2024

@yanirq: all tests passed!

Full PR test history. Your PR dashboard.


@openshift-bot:

/lifecycle stale

openshift-ci bot added the lifecycle/stale label Jun 13, 2024
jmencak commented Jun 13, 2024

/remove-lifecycle stale

openshift-ci bot removed the lifecycle/stale label Jun 13, 2024
@openshift-bot:

/lifecycle stale

openshift-ci bot added the lifecycle/stale label Jul 12, 2024
jmencak commented Jul 12, 2024

/remove-lifecycle stale

openshift-ci bot removed the lifecycle/stale label Jul 12, 2024
@openshift-bot:

/lifecycle stale

openshift-ci bot added the lifecycle/stale label Aug 9, 2024
jmencak commented Aug 9, 2024

/remove-lifecycle stale

openshift-ci bot removed the lifecycle/stale label Aug 9, 2024
Labels: jira/valid-reference
8 participants