OCPBUGS-8113: daemon: Make switchKernel less stateful #3580

Merged: 3 commits, Mar 8, 2023

Member

@cgwalters cgwalters commented Mar 3, 2023

daemon: Clean up switchKernel a bit

De-duplicate calls to canonicalizeKernelType to make the
logic easier to read. Also add a few comments.


vendor: Bump coreos/rpm-ostree-client-go

In prep for usage in MCD.


daemon: Make switchKernel less stateful

This is prep for fixing RHEL9 upgrades while maintaining kernel-rt.

Previously the switchKernel logic tried to carefully handle
all 4 cases (default -> default, default -> rt, rt -> default, rt -> rt).

But, the last one (rt -> rt) was not quite right because
the previous rpm-ostree rebase command already preserved the previous
kernel. In fact it was pretty expensive to do things this way
because we'd e.g. regenerate the initramfs twice.

To say this another way: when doing a RHEL9 update, it's actually
the first rpm-ostree rebase command which fails before we
even get to switchKernel.

And the reason is due to the introduction of a new -core subpackage;
xref https://issues.redhat.com/browse/OCPBUGS-8113

So here's the new logic to handle this:

  • Before we do the rebase operation to the new OS, we detect
    any previous overrides of any packages starting with kernel-rt
    and we remove them. Notably this avoids hardcoding any specific
    kernel subpackages; we just remove everything starting with
    kernel-rt which should be more robust to subpackage changes
    in the future.
  • Consequently the rebase operation will start out by deploying the
    stock image i.e. with throughput kernel (though note we are
    carefully preserving other local overrides)
  • The switchKernel function no longer needs to take the previous
    machineconfig state into account (except for logging).
    Instead, we just detect if the target is RT, and if so we then
    apply the latest packages.

This significantly simplifies the logic in switchKernel, and will
help fix RHEL9 upgrades.
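
A minimal sketch of the steps above, not the actual MCO code. The helper names (`splitKernelRTPackages`, `needsRTKernel`), the sample package names, and the `"realtime"` kernel type string are assumptions made for illustration. It shows the prefix-based partition of local overrides before the rebase, and a post-rebase decision that looks only at the target kernel type.

```go
package main

import (
	"fmt"
	"strings"
)

// rtKernelPrefix is the package-name prefix used to spot realtime kernel
// packages without hardcoding any specific subpackage.
const rtKernelPrefix = "kernel-rt"

// splitKernelRTPackages partitions locally layered package names into the
// realtime kernel packages (to drop before the rebase) and everything else
// (to preserve). Hypothetical helper for illustration only.
func splitKernelRTPackages(layered []string) (rt, other []string) {
	for _, pkg := range layered {
		if strings.HasPrefix(pkg, rtKernelPrefix) {
			rt = append(rt, pkg)
		} else {
			other = append(other, pkg)
		}
	}
	return rt, other
}

// needsRTKernel reports whether the realtime kernel has to be applied after
// the rebase. Only the target kernel type matters, not the previous one.
// The "realtime" string is an assumed value.
func needsRTKernel(targetKernelType string) bool {
	return targetKernelType == "realtime"
}

func main() {
	// Sample data: made-up list of locally layered packages.
	layered := []string{"kernel-rt-core", "kernel-rt-modules", "usbguard"}
	rt, other := splitKernelRTPackages(layered)
	fmt.Println("remove before rebase:", rt)
	fmt.Println("preserve:", other)
	fmt.Println("apply RT after rebase:", needsRTKernel("realtime"))
}
```

Matching on the prefix rather than on a fixed list is what keeps the removal step working if a new subpackage such as `-core` shows up.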


@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Mar 3, 2023
@cgwalters
Member Author

Now that we've branched, we can benefit from the fact that we can land PRs like this in master with much lower risk/impact. And openshift/release#36937 just landed, so let's give that a go:

/test e2e-gcp-ovn-rt-upgrade

@openshift-ci
Contributor

openshift-ci bot commented Mar 3, 2023

@cgwalters: The specified target(s) for /test were not found.
The following commands are available to trigger required jobs:

  • /test 4.12-upgrade-from-stable-4.11-images
  • /test cluster-bootimages
  • /test e2e-aws-ovn
  • /test e2e-aws-ovn-upgrade
  • /test e2e-gcp-op
  • /test images
  • /test okd-scos-images
  • /test unit
  • /test verify

The following commands are available to trigger optional jobs:

  • /test 4.12-upgrade-from-stable-4.11-e2e-aws-ovn-upgrade
  • /test bootstrap-unit
  • /test e2e-alibabacloud-ovn
  • /test e2e-aws-disruptive
  • /test e2e-aws-ovn-fips
  • /test e2e-aws-ovn-fips-op
  • /test e2e-aws-ovn-workers-rhel8
  • /test e2e-aws-proxy
  • /test e2e-aws-serial
  • /test e2e-aws-single-node
  • /test e2e-aws-upgrade-single-node
  • /test e2e-aws-workers-rhel8
  • /test e2e-azure
  • /test e2e-azure-ovn-upgrade
  • /test e2e-azure-upgrade
  • /test e2e-gcp-op-single-node
  • /test e2e-gcp-rt
  • /test e2e-gcp-rt-op
  • /test e2e-gcp-single-node
  • /test e2e-gcp-upgrade
  • /test e2e-hypershift
  • /test e2e-metal-assisted
  • /test e2e-metal-ipi
  • /test e2e-metal-ipi-ovn-dualstack
  • /test e2e-metal-ipi-ovn-ipv6
  • /test e2e-openstack
  • /test e2e-openstack-externallb-techpreview
  • /test e2e-openstack-parallel
  • /test e2e-ovirt
  • /test e2e-ovirt-upgrade
  • /test e2e-ovn-step-registry
  • /test e2e-vsphere
  • /test e2e-vsphere-upgrade
  • /test e2e-vsphere-upi
  • /test okd-e2e-aws
  • /test okd-e2e-gcp-op
  • /test okd-e2e-upgrade
  • /test okd-e2e-vsphere
  • /test okd-images
  • /test okd-scos-e2e-aws-ovn
  • /test okd-scos-e2e-gcp-op
  • /test okd-scos-e2e-gcp-ovn-upgrade
  • /test okd-scos-e2e-vsphere

Use /test all to run the following jobs that were automatically triggered:

  • pull-ci-openshift-machine-config-operator-master-e2e-alibabacloud-ovn
  • pull-ci-openshift-machine-config-operator-master-e2e-aws-ovn
  • pull-ci-openshift-machine-config-operator-master-e2e-aws-ovn-upgrade
  • pull-ci-openshift-machine-config-operator-master-e2e-gcp-op
  • pull-ci-openshift-machine-config-operator-master-e2e-hypershift
  • pull-ci-openshift-machine-config-operator-master-images
  • pull-ci-openshift-machine-config-operator-master-okd-images
  • pull-ci-openshift-machine-config-operator-master-okd-scos-e2e-aws-ovn
  • pull-ci-openshift-machine-config-operator-master-okd-scos-e2e-gcp-ovn-upgrade
  • pull-ci-openshift-machine-config-operator-master-okd-scos-images
  • pull-ci-openshift-machine-config-operator-master-unit
  • pull-ci-openshift-machine-config-operator-master-verify

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@cgwalters
Member Author

/test e2e-gcp-ovn-rt-upgrade


if canonicalizeKernelType(oldConfig.Spec.KernelType) == ctrlcommon.KernelTypeRealtime && canonicalizeKernelType(newConfig.Spec.KernelType) == ctrlcommon.KernelTypeDefault {
switchingToThroughput := oldKtype == ctrlcommon.KernelTypeRealtime && newKtype == ctrlcommon.KernelTypeDefault
Member

@cheesesashimi cheesesashimi Mar 3, 2023

question: Wouldn't it be clearer to use switchingToDefault instead of switchingToThroughput?

I'm a bit confused on where throughput comes from.

Member Author

I'm a bit confused on where throughput comes from.

Yeah sorry that's just me having scars from years of people saying e.g. "normal RHEL" with the implication that e.g. RHEL CoreOS is not-normal. (Or people say "normal Fedora" etc. versus "Silverblue"). Or less pejoratively they say "default RHEL"...which isn't bad but is also not super descriptive because, hey maybe one day what the default is changes 😉

From the kernel side you could certainly say kernel is the default. But it is really about latency (kernel-rt) versus throughput (kernel). And I personally find this is a better description.

(Also IMO the "realtime" kernel is a bit of a misnomer because it's really soft real-time which is actually quite different from hard real time, so I personally think calling it "latency optimized" is better)

Member Author

And to expand on this, personally I'd rename kernel -> kernel-throughput-optimized and kernel-rt -> kernel-latency-optimized, I'd also rename "rhel coreos" => "rhel (image mode)" and most crucially "rhel" => "rhel (package mode)" - but only where it matters; otherwise they're both just RHEL. Just like how kernel-throughput-optimized and kernel-latency-optimized are both really just Linux (aka kernel) in different modes.

Or to say it another way, both are normal. We don't strictly think of one as "default" even. They both have qualifiers, but only where it matters.

Member Author

And to expand even more, of course today we say "OpenShift" and "Hypershift" - with the implication that the latter is the different/not-normal case. I've even seen people refer to current OpenShift as "normal" OpenShift! But in fact "hypershift" is (and should be!) well on its way to becoming the default, so it's also like OpenShift => OpenShift (standalone) and Hypershift => OpenShift (hosted control plane) etc.

Member Author

But all this aside, actually I was not consistent in trying to use "throughput" instead of "default" and the top patch stops using "throughput" anyways 😄

Member

Thanks for the very detailed clarification! Naming things is hard and even when we think it's easy, overloaded names, changing contexts, history, etc. all make things even more complicated. I'm OK with the name throughput now.

@cheesesashimi
Member

Overall this seems reasonable. I just have two minor concerns that might help clarify things that I've put inline. The first is where the word "throughput" came from. And the second, is what looks like an unfinished comment.

The only other (non-blocking) thought is my surprise at how many dependencies were bumped solely from bumping coreos/rpm-ostree-client-go.

@cgwalters
Member Author

The only other (non-blocking) thought is my surprise at how many dependencies were bumped solely from bumping coreos/rpm-ostree-client-go.

Yeah, I think a lot of that is transitive deps from containers/image. But also, because we don't update vendored deps here regularly at all, every time we do there's usually a large set.

@cgwalters
Member Author

/test e2e-gcp-ovn-rt-upgrade
/test e2e-gcp-op

@cheesesashimi
Member

Overall this looks fine. I'll approve once the test suites pass, solely because of the large number of dependency changes.

@cgwalters
Member Author

Hmm, the e2e-gcp-ovn-rt-upgrade job failed...but not for a reason I was expecting. First, one thing I notice in this job is that, confusingly, the -upgrade jobs actually just synthesize an upgrade from current CI without the PR to code with the PR. Consequently, we're not actually doing an OS update in this job, which means we're not actually running this modified code.

@cgwalters
Member Author

Actually...I am confused by that failure since it seems to say that the machine-config operator was failing, but AFAICS it isn't? Though there is a spam of warnings in the operator logs.

@cgwalters
Member Author

/test ?

@openshift-ci
Contributor

openshift-ci bot commented Mar 3, 2023

@cgwalters: The following commands are available to trigger required jobs:

  • /test 4.12-upgrade-from-stable-4.11-images
  • /test cluster-bootimages
  • /test e2e-aws-ovn
  • /test e2e-aws-ovn-upgrade
  • /test e2e-gcp-op
  • /test images
  • /test okd-scos-images
  • /test unit
  • /test verify

The following commands are available to trigger optional jobs:

  • /test 4.12-upgrade-from-stable-4.11-e2e-aws-ovn-upgrade
  • /test bootstrap-unit
  • /test e2e-alibabacloud-ovn
  • /test e2e-aws-disruptive
  • /test e2e-aws-ovn-fips
  • /test e2e-aws-ovn-fips-op
  • /test e2e-aws-ovn-workers-rhel8
  • /test e2e-aws-proxy
  • /test e2e-aws-serial
  • /test e2e-aws-single-node
  • /test e2e-aws-upgrade-single-node
  • /test e2e-aws-workers-rhel8
  • /test e2e-azure
  • /test e2e-azure-ovn-upgrade
  • /test e2e-azure-upgrade
  • /test e2e-gcp-op-single-node
  • /test e2e-gcp-ovn-rt-upgrade
  • /test e2e-gcp-rt
  • /test e2e-gcp-rt-op
  • /test e2e-gcp-single-node
  • /test e2e-gcp-upgrade
  • /test e2e-hypershift
  • /test e2e-metal-assisted
  • /test e2e-metal-ipi
  • /test e2e-metal-ipi-ovn-dualstack
  • /test e2e-metal-ipi-ovn-ipv6
  • /test e2e-openstack
  • /test e2e-openstack-externallb-techpreview
  • /test e2e-openstack-parallel
  • /test e2e-ovirt
  • /test e2e-ovirt-upgrade
  • /test e2e-ovn-step-registry
  • /test e2e-vsphere
  • /test e2e-vsphere-upgrade
  • /test e2e-vsphere-upi
  • /test okd-e2e-aws
  • /test okd-e2e-gcp-op
  • /test okd-e2e-upgrade
  • /test okd-e2e-vsphere
  • /test okd-images
  • /test okd-scos-e2e-aws-ovn
  • /test okd-scos-e2e-gcp-op
  • /test okd-scos-e2e-gcp-ovn-upgrade
  • /test okd-scos-e2e-vsphere

Use /test all to run the following jobs that were automatically triggered:

  • pull-ci-openshift-machine-config-operator-master-e2e-alibabacloud-ovn
  • pull-ci-openshift-machine-config-operator-master-e2e-aws-ovn
  • pull-ci-openshift-machine-config-operator-master-e2e-aws-ovn-upgrade
  • pull-ci-openshift-machine-config-operator-master-e2e-gcp-op
  • pull-ci-openshift-machine-config-operator-master-e2e-gcp-ovn-rt-upgrade
  • pull-ci-openshift-machine-config-operator-master-e2e-hypershift
  • pull-ci-openshift-machine-config-operator-master-images
  • pull-ci-openshift-machine-config-operator-master-okd-images
  • pull-ci-openshift-machine-config-operator-master-okd-scos-e2e-aws-ovn
  • pull-ci-openshift-machine-config-operator-master-okd-scos-e2e-gcp-ovn-upgrade
  • pull-ci-openshift-machine-config-operator-master-okd-scos-images
  • pull-ci-openshift-machine-config-operator-master-unit
  • pull-ci-openshift-machine-config-operator-master-verify


@cgwalters
Member Author

/payload-job periodic-ci-openshift-release-master-ci-4.13-upgrade-from-stable-4.12-e2e-gcp-ovn-rt-upgrade

xref #3485 (comment)

@openshift-ci
Contributor

openshift-ci bot commented Mar 3, 2023

@cgwalters: trigger 1 job(s) for the /payload-(job|aggregate) command

  • periodic-ci-openshift-release-master-ci-4.13-upgrade-from-stable-4.12-e2e-gcp-ovn-rt-upgrade

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/93cd3720-ba16-11ed-9286-0a1a70c20a75-0

@cgwalters cgwalters force-pushed the kernel-updates-refactor branch 2 times, most recently from a902c97 to 6d929c3 Compare March 4, 2023 13:29
@cgwalters
Member Author

cgwalters commented Mar 4, 2023

Man, I was so confused why the code wasn't working and yeah...I modified the legacy dead-code OS update path 😢 😢 Going to do a separate PR to excise that from existence 🪓 entirely. (Edit: done in #3583)

@cgwalters
Member Author

/payload-job periodic-ci-openshift-release-master-ci-4.13-upgrade-from-stable-4.12-e2e-gcp-ovn-rt-upgrade

@openshift-ci
Contributor

openshift-ci bot commented Mar 4, 2023

@cgwalters: trigger 1 job(s) for the /payload-(job|aggregate) command

  • periodic-ci-openshift-release-master-ci-4.13-upgrade-from-stable-4.12-e2e-gcp-ovn-rt-upgrade

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/c96e59b0-ba91-11ed-83e3-f9126a759ef7-0

@cgwalters
Member Author

cgwalters commented Mar 4, 2023

🎉 Got a green payload run on that previous commit. I had to push a fixup to handle the case of going rt -> throughput without an OS update. I think this only happens in the MCO's CI runs; switching rt -> throughput is an unusual thing to do in production.

So let's do one more payload run with tip
/payload-job periodic-ci-openshift-release-master-ci-4.13-upgrade-from-stable-4.12-e2e-gcp-ovn-rt-upgrade
and if both that and e2e-gcp-op are good, I think let's merge this.
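
The rt -> throughput fixup mentioned above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code that was pushed: the function name `mustRemoveRTExplicitly`, the `osUpdated` flag, and the kernel type strings are all hypothetical.

```go
package main

import "fmt"

// Kernel type names; assumed values for illustration.
const (
	kernelDefault  = "default"
	kernelRealtime = "realtime"
)

// mustRemoveRTExplicitly reports whether the kernel-rt overrides have to be
// removed as a separate step. When an OS update (rebase) happened, the
// pre-rebase step already dropped them, so nothing more is needed. Without
// an OS update, going from realtime to the default kernel still requires an
// explicit removal. Hypothetical helper, not the MCO's actual logic.
func mustRemoveRTExplicitly(osUpdated bool, oldType, newType string) bool {
	if osUpdated {
		return false
	}
	return oldType == kernelRealtime && newType == kernelDefault
}

func main() {
	fmt.Println(mustRemoveRTExplicitly(false, kernelRealtime, kernelDefault))
	fmt.Println(mustRemoveRTExplicitly(true, kernelRealtime, kernelDefault))
}
```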

@openshift-ci-robot
Contributor

@cgwalters: This pull request references Jira Issue OCPBUGS-8113, which is valid.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.14.0) matches configured target version for branch (4.14.0)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact:
/cc @rioliu-rh

return nil
}

// TODO: Drop this code and use https://github.com/coreos/rpm-ostree/issues/2542 instead
defaultKernel := []string{"kernel", "kernel-core", "kernel-modules", "kernel-modules-extra"}
Contributor

I forgot to ask this: we are not yet adding kernel-modules-core to the defaultKernel package list. Fixing OCPBUGS-8113 will still need that, correct?

Member Author

@cgwalters cgwalters Mar 7, 2023

Yes; but we can't land this in 4.14/master (still using rhel8.6) if we specify that package. The change to do so for rhel9 is part of that PR, see 4e9fca2

But at a technical level I think we can say that this is still "the" fix for OCPBUGS-8113 since it has 98% of the required code changes?
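
To make the constraint concrete, here is a hedged sketch of picking the package list by OS major version. The function `defaultKernelPackages` and its `rhelMajor` parameter are hypothetical, and whether the follow-up change is actually keyed on the version this way is an assumption. Only the package names come from this thread.

```go
package main

import "fmt"

// defaultKernelPackages returns the default (throughput) kernel packages.
// kernel-modules-core is only added for RHEL 9 and later, since specifying
// it on a RHEL 8 based host would not work. Hypothetical helper.
func defaultKernelPackages(rhelMajor int) []string {
	pkgs := []string{"kernel", "kernel-core", "kernel-modules", "kernel-modules-extra"}
	if rhelMajor >= 9 {
		pkgs = append(pkgs, "kernel-modules-core")
	}
	return pkgs
}

func main() {
	fmt.Println(defaultKernelPackages(8))
	fmt.Println(defaultKernelPackages(9))
}
```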

Contributor

Thanks, this will help QE when they perform testing.

@sinnykumari
Contributor

/lgtm
/test e2e-gcp-op

Putting hold for qe approval under pre-merge testing
/hold

@openshift-ci openshift-ci bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Mar 7, 2023
@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Mar 7, 2023
@cgwalters
Member Author

Out of curiosity what would QE be testing that isn't covered by the payload test run and e2e-gcp-op?

@openshift-ci openshift-ci bot removed the lgtm Indicates that a PR is ready to be merged. label Mar 8, 2023
@cgwalters
Member Author

Rebased 🏄 since another PR bumped the Go deps in the meantime

@sdodson
Member

sdodson commented Mar 8, 2023

/hold cancel
This has been tested in CI and needs to be backported to release-4.13 with some urgency. There will be a final QE round of testing once all of the pieces have landed which I believe when paired with CI testing is sufficient.

@openshift-ci openshift-ci bot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Mar 8, 2023
@sdodson
Member

sdodson commented Mar 8, 2023

/lgtm
Code has just been rebased, no material changes since last lgtm.

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Mar 8, 2023
@openshift-ci
Contributor

openshift-ci bot commented Mar 8, 2023

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: cgwalters, sdodson, sinnykumari

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:
  • OWNERS [cgwalters,sinnykumari]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@sdodson sdodson merged commit 4fb7117 into openshift:master Mar 8, 2023
@sdodson
Member

sdodson commented Mar 8, 2023

/cherry-pick release-4.13

@openshift-ci-robot
Contributor

@cgwalters: Jira Issue OCPBUGS-8113: All pull requests linked via external trackers have merged:

Jira Issue OCPBUGS-8113 has been moved to the MODIFIED state.

@openshift-cherrypick-robot

@sdodson: #3580 failed to apply on top of branch "release-4.13":

Applying: daemon: Clean up `switchKernel` a bit
Applying: vendor: Bump coreos/rpm-ostree-client-go
.git/rebase-apply/patch:15768: trailing whitespace.
 
.git/rebase-apply/patch:15845: trailing whitespace.
 
.git/rebase-apply/patch:15872: trailing whitespace.
    
.git/rebase-apply/patch:15896: trailing whitespace.
    
.git/rebase-apply/patch:15971: trailing whitespace.
                quotedString.WriteString(fmt.Sprintf("\\u%04x", c))         
error: patch failed: vendor/github.com/klauspost/compress/README.md:9
error: vendor/github.com/klauspost/compress/README.md: patch does not apply
error: Did you hand edit your patch?
It does not apply to blobs recorded in its index.
hint: Use 'git am --show-current-patch=diff' to see the failed patch
Using index info to reconstruct a base tree...
M	go.mod
M	go.sum
M	vendor/google.golang.org/grpc/balancer/balancer.go
A	vendor/google.golang.org/grpc/balancer/conn_state_evaluator.go
M	vendor/google.golang.org/grpc/clientconn.go
M	vendor/google.golang.org/grpc/dialoptions.go
M	vendor/google.golang.org/grpc/internal/envconfig/xds.go
M	vendor/google.golang.org/grpc/internal/grpcutil/method.go
M	vendor/google.golang.org/grpc/internal/transport/http2_client.go
M	vendor/google.golang.org/grpc/internal/transport/http2_server.go
M	vendor/google.golang.org/grpc/internal/transport/http_util.go
M	vendor/google.golang.org/grpc/server.go
M	vendor/google.golang.org/grpc/stream.go
M	vendor/google.golang.org/grpc/version.go
M	vendor/google.golang.org/grpc/vet.sh
M	vendor/modules.txt
Patch failed at 0002 vendor: Bump coreos/rpm-ostree-client-go
When you have resolved this problem, run "git am --continue".
If you prefer to skip this patch, run "git am --skip" instead.
To restore the original branch and stop patching, run "git am --abort".

@cgwalters
Member Author

Set up a manual cherry pick in #3595

@sinnykumari
Contributor

Out of curiosity what would QE be testing that isn't covered by the payload test run and e2e-gcp-op?

I understand your point. Per our agreement with the QE team, we have followed a pre-merge testing process since OCP 4.13 to keep things stable for sprintly releases; it covers stories with the qe_required label and all OCPBUGS-related PRs. QE are free to decide when to test, but we need a manual qe_approved ack (hopefully someday prow will have an automated workflow for this, like we have for backport bugs).
And of course, these can be overridden by staff engineers, with follow-up risks 😆

Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. bugzilla/valid-bug Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting. jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. lgtm Indicates that a PR is ready to be merged.

6 participants