Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

OKD-223: Load custom SELinux rules in SCOS and workaround afterburn failures #1556

Merged
merged 2 commits into from
Aug 7, 2024

Conversation

aleskandro
Copy link
Member

@aleskandro aleskandro commented Jul 18, 2024

Recent changes in the SELinux policy to make more service confined has
broken a lot of our code. The SELinux team is working through relaxing
the policy, but in the meantime, let's revert back the affected types to
permissive mode:

  1. afterburn fail when trying to write to /run, /run/metadata and
    /home/$user/.ssh.
    See: https://issues.redhat.com/browse/RHEL-49735
  2. coreos-installer installation fails due to various denials.
    See: https://issues.redhat.com/browse/RHEL-38614
  3. network functionality that rely on systemd-network-generator is
    broken due to the latter being unable to create temporary files.
    See: https://issues.redhat.com/browse/RHEL-47033

@openshift-ci openshift-ci bot requested review from marmijo and prestist July 18, 2024 21:41
@aleskandro aleskandro force-pushed the afterburn-fix-scos branch 2 times, most recently from d18f4f0 to 11f7e8c Compare July 18, 2024 21:43
@aleskandro
Copy link
Member Author

/cc @Prashanth684 @travier

@@ -0,0 +1,7 @@
(typeattributeset cil_gen_require var_run_t)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

will these rules also persist after the bootimage pivots to the scos image?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd also say that we could delete this file once a proper fix lands in CS9? I didn't reproduce in a CS9 fresh env with no other modifications as the ones here, but it seems a bug to file there for afterburn or the selinux-policy package you mentioned in the issue.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

once we have scos publicly available, yes we would need this. But by that time I hope that we can get the actual issue with selinux-policy sorted out.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@travier see the two PRs above for #1556 (comment)

@aleskandro aleskandro changed the title Load custom SELinux rules in SCOS and workaround afterburn failures OKD-223: Load custom SELinux rules in SCOS and workaround afterburn failures Jul 18, 2024
@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Jul 18, 2024
@openshift-ci-robot
Copy link

openshift-ci-robot commented Jul 18, 2024

@aleskandro: This pull request references OKD-223 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the bug to target the "4.17.0" version, but no target version was set.

In response to this:

The afterburn systemd units fail because the afterburn binary's SELinux domain is restricted from changing the content of files in /run, /run/metadata, and /home/$user/.ssh. See #1555.

This commit implements a systemd unit to apply custom SELinux modules in SCOS shipped as CILs in /usr/lib/okd/selinux and adds an afterburn-custom.cil SELinux module to allow the afterburn services to succeed and the nodes to properly join a cluster.

Refers #1555

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@travier
Copy link
Member

travier commented Jul 19, 2024

Hum, why is this not failing in FCOS?

@aleskandro
Copy link
Member Author

Hum, why is this not failing in FCOS?

I think it is because of previous patches to SELinux-policy that are already available in FCOS.

@aleskandro
Copy link
Member Author

aleskandro commented Jul 19, 2024

Oh here it can be:

zpytela/selinux-policy@18e59aa removed the permissive domain, but non-rawhide fedora has it yet: https://github.com/zpytela/selinux-policy/blob/f40/policy/modules/contrib/afterburn.te#L15

@travier
Copy link
Member

travier commented Jul 19, 2024

How will all of this work in the initrd? We really need to fix this in Rawhide first.

I'm not convinced we should carry SELinux policy modules here. Those domains were in permissive mode before, let's make them permissive again instead.

@aleskandro
Copy link
Member Author

aleskandro commented Jul 19, 2024

How will all of this work in the initrd? We really need to fix this in Rawhide first.

I'm not convinced we should carry SELinux policy modules here. Those domains were in permissive mode before, let's make them permissive again instead.

I've changed the module to set the afterburn_t domain in permissive mode. I could do the same for the coreos installer one so that we have a workaround for the other issue too, until they are are solved upstream and we can revert the two policies' commits. I would still continue to propose having that unit that loops through a folder in /usr for applying custom SELinux rules that can be needed from time to time.

We have something similar in okd-machine-os for the FCOS variant.

@aleskandro aleskandro force-pushed the afterburn-fix-scos branch 2 times, most recently from fd52060 to eb2bd38 Compare July 21, 2024 00:34
@Prashanth684
Copy link
Contributor

/lgtm

@Prashanth684
Copy link
Contributor

How will all of this work in the initrd? We really need to fix this in Rawhide first.
I'm not convinced we should carry SELinux policy modules here. Those domains were in permissive mode before, let's make them permissive again instead.

I've changed the module to set the afterburn_t domain in permissive mode. I could do the same for the coreos installer one so that we have a workaround for the other issue too, until they are are solved upstream and we can revert the two policies' commits. I would still continue to propose having that unit that loops through a folder in /usr for applying custom SELinux rules that can be needed from time to time.

We have something similar in okd-machine-os for the FCOS variant.

@travier could we get an approval for this?

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Jul 22, 2024
@jlebon
Copy link
Member

jlebon commented Jul 22, 2024

Note that in FCOS rawhide, we're currently pinning selinux-policy due to other issues: https://github.com/coreos/fedora-coreos-config/blob/441ccdd5cc3f2e3ac0cb5e8e0bcf71b301a51c7e/manifest-lock.overrides.yaml#L17-L26

But even then, the policy between c9s and rawhide are different and not mean to be kept in sync AIUI. See the conversation in https://github.com/openshift/os/issues/1514.\

I sympathize with this. I'm not opposed in theory, but it's also a much more invasive workaround that might have subtle repercussions. Note also that not many people on the team are familiar with maintaining SELinux policies.

What we really need of course is better CI integration with c9s. This should be a big part of the bootable containers effort.

Short-term, I guess we can just host the last working version RPM somewhere else and revert #1552 ? (Or, if you have the right channels, try to get a new tag created in the CentOS Stream koji instance with tag2distrepo set up so we can just tag older packages we need in there.)

@Prashanth684
Copy link
Contributor

I sympathize with this. I'm not opposed in theory, but it's also a much more invasive workaround that might have subtle repercussions. Note also that not many people on the team are familiar with maintaining SELinux policies.

What repercussions are we worried about?

Short-term, I guess we can just host the last working version RPM somewhere else and revert #1552 ? (Or, if you have the right channels, try to get a new tag created in the CentOS Stream koji instance with tag2distrepo set up so we can just tag older packages we need in there.)

Isn't this equivalent to having a workaround here?

@travier
Copy link
Member

travier commented Jul 23, 2024

As I asked before:

How will all of this work in the initrd?

I don't think this will work as this unit runs after the initrd, only in the real root.

@travier
Copy link
Member

travier commented Jul 23, 2024

/hold
As I don't think this is the right fix.

@openshift-ci openshift-ci bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jul 23, 2024
@aleskandro
Copy link
Member Author

As I asked before:

How will all of this work in the initrd?

I don't think this will work as this unit runs after the initrd, only in the real root.

@travier this branch is already being used in the releases that turned green from Sunday thanks to this fix (for AWS).

  • Do we really need this after rebasing from the FCOS boot image (and initrd) to SCOS?
  • Which cases in particular?
  • How can installing the previous RPM in the base OS tree resolve this problem for initrd too?
  • Is there a temporary solution that we can achieve to ensure the releases get green again before the actual and permanent fix downstreams to CS9?

Note that in FCOS rawhide, we're currently pinning selinux-policy due to other issues: https://github.com/coreos/fedora-coreos-config/blob/441ccdd5cc3f2e3ac0cb5e8e0bcf71b301a51c7e/manifest-lock.overrides.yaml#L17-L26

But even then, the policy between c9s and rawhide are different and not mean to be kept in sync AIUI. See the conversation in #1514

I sympathize with this. I'm not opposed in theory, but it's also a much more invasive workaround that might have subtle repercussions. Note also that not many people on the team are familiar with maintaining SELinux policies.

What we really need of course is better CI integration with c9s. This should be a big part of the bootable containers effort.

Short-term, I guess we can just host the last working version RPM somewhere else and revert #1552 ? (Or, if you have the right channels, try to get a new tag created in the CentOS Stream koji instance with tag2distrepo set up so we can just tag older packages we need in there.)

@jlebon the current version of this PR 'reverts' the SELinux policies to the previous ones for the afterburn_t domain (permissive). If it's invasive because we allow a lot more than what's needed, I can go back to the fine-grained policies we proposed earlier.

I also understand that maintaining SELinux policies is not the most exciting thing here. Still, it allows the SCOS release to unblock and OKD/SCOS to have dedicated SELinux policies when needed, especially since, based on what you said, the SELinux policies in Fedora C9S might not always be in sync (and well-tested as they are for RHCOS).

To have working and quickly released OKD/SCOS, we will occasionally need more workarounds dedicated to C9S than RHCOS.

This PR adds that systemd unit only to C9S builds, with the hope that whoever maintains and reviews a change in that folder will always bind a Jira or GH issue to revert the changes eventually when they are fixed and downstreamed to be consumed as a permanent solution.

We had similar cases for OKD/FCOS: https://github.com/openshift/okd-machine-os/blob/master/overlay.d/99okd/usr/lib/okd/selinux-fixes.cil.

I would avoid further delays for the release of OKD/SCOS, 4.16 and beyond when similar cases arise.

@Prashanth684
Copy link
Contributor

@travier @jlebon any suggestions on how this can be unblocked in the short term? I think @aleskandro has given sufficient reasoning above for the approach.

@ausil
Copy link

ausil commented Jul 26, 2024

From my point of view this approach is better than setting up a repo and locking to an old version, we will be missing all of the bug fixes in the newer selinux-policy builds. This is not an issue in the FCOS portion because the bug is fixed upstream. We are missing all the other fixes in selinux-policy if we lock to an old version. The selinux team needs to prioritise backporting the upstream fix. As I understand the issue, we only hit it on reboot when moving to CentOS Stream, which is where the fix is needed.

@ausil
Copy link

ausil commented Jul 31, 2024

@dustymabe @cverna This is blocking OKD releases from going stable cc @preethit

@jlebon
Copy link
Member

jlebon commented Aug 6, 2024

How about having just the overlay.d/50scos/usr/lib/okd/selinux/afterburn-permissive.cil file, and a postprocess step to semodule -i it? I think it'd be a lot cleaner to have this be part of the baked policy instead of forcing a local policy rebuild.

@openshift-ci openshift-ci bot removed the lgtm Indicates that a PR is ready to be merged. label Aug 6, 2024
@aleskandro
Copy link
Member Author

How about having just the overlay.d/50scos/usr/lib/okd/selinux/afterburn-permissive.cil file, and a postprocess step to semodule -i it? I think it'd be a lot cleaner to have this be part of the baked policy instead of forcing a local policy rebuild.

Hi @jlebon, that is better. Done

Thanks for following up.

@aleskandro
Copy link
Member Author

/retest-required

@jlebon
Copy link
Member

jlebon commented Aug 7, 2024

Thanks! I've reworked this as follow:

  1. There's actually no need to ship this in /usr. Doing semodule -i will automatically compile and copy the results into /etc/selinux/targeted/active/ along with the other packages, so it'd just be duplicating it. Instead, we now just create a temporary file to be able to install it, and then delete the file.
  2. I've added additional workarounds for other outstanding SELinux issues.
  3. I've moved the postprocess script to manifest-c9s.yaml since this affects both the c9s and okd-c9s variants.

Recent changes in the SELinux policy to make more service confined has
broken a lot of our code. The SELinux team is working through relaxing
the policy, but in the meantime, let's revert back the affected types to
permissive mode:

1. afterburn fail when trying to write to `/run`, `/run/metadata` and
   `/home/$user/.ssh`.
   See: https://issues.redhat.com/browse/RHEL-49735
2. coreos-installer installation fails due to various denials.
   See: https://issues.redhat.com/browse/RHEL-38614
3. network functionality that rely on systemd-network-generator is
   broken due to the latter being unable to create temporary files.
   See: https://issues.redhat.com/browse/RHEL-47033

Co-authored-by: Jonathan Lebon <jonathan@jlebon.com>
@openshift-ci-robot
Copy link

openshift-ci-robot commented Aug 7, 2024

@aleskandro: This pull request references OKD-223 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the bug to target the "4.17.0" version, but no target version was set.

In response to this:

Recent changes in the SELinux policy to make more service confined has
broken a lot of our code. The SELinux team is working through relaxing
the policy, but in the meantime, let's revert back the affected types to
permissive mode:

  1. afterburn fail when trying to write to /run, /run/metadata and
    /home/$user/.ssh.
    See: https://issues.redhat.com/browse/RHEL-49735
  2. coreos-installer installation fails due to various denials.
    See: https://issues.redhat.com/browse/RHEL-38614
  3. network functionality that rely on systemd-network-generator is
    broken due to the latter being unable to create temporary files.
    See: https://issues.redhat.com/browse/RHEL-47033

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@jlebon
Copy link
Member

jlebon commented Aug 7, 2024

/hold cancel

@openshift-ci openshift-ci bot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Aug 7, 2024
We're currently hitting OpenSSL issues there:
openshift#1540
@jlebon
Copy link
Member

jlebon commented Aug 7, 2024

I've also inlined the first commit of #1544 here. This PR should now be all we need to unblock building c9s-based bootimages in the CoreOS pipeline.

@jlebon
Copy link
Member

jlebon commented Aug 7, 2024

CI failing on #1551. I've verified this PR locally.

/override ci/prow/rhcos-9-build-test-metal
/override ci/prow/rhcos-9-build-test-qemu
/override ci/prow/scos-9-build-test-metal
/override ci/prow/scos-9-build-test-qemu

/approve
/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Aug 7, 2024
Copy link
Contributor

openshift-ci bot commented Aug 7, 2024

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: aleskandro, jlebon, Prashanth684

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Aug 7, 2024
Copy link
Contributor

openshift-ci bot commented Aug 7, 2024

@jlebon: Overrode contexts on behalf of jlebon: ci/prow/rhcos-9-build-test-metal, ci/prow/rhcos-9-build-test-qemu, ci/prow/scos-9-build-test-metal, ci/prow/scos-9-build-test-qemu

In response to this:

CI failing on #1551. I've verified this PR locally.

/override ci/prow/rhcos-9-build-test-metal
/override ci/prow/rhcos-9-build-test-qemu
/override ci/prow/scos-9-build-test-metal
/override ci/prow/scos-9-build-test-qemu

/approve
/lgtm

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

Copy link
Contributor

openshift-ci bot commented Aug 7, 2024

@aleskandro: all tests passed!

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@aleskandro
Copy link
Member Author

Looks good to me!

Thanks @jlebon

@openshift-merge-bot openshift-merge-bot bot merged commit 1eaeec6 into openshift:master Aug 7, 2024
7 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. lgtm Indicates that a PR is ready to be merged.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants