Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

kola: Do failed units/SELinux checks both before *and* after tests #2067

Closed
wants to merge 1 commit into from

Conversation

cgwalters
Copy link
Member

When I went to add the SELinux denials check, I completely missed
that we only were checking for denials before the test runs.

We obviously want to check both before and after - the same
way we do with e.g. looking for errors on the console.

As part of this, I'd also added --no-default-checks but it
actually didn't work. Fix that so that it does work, and add
a test case with a failing systemd unit that verifies this,
so we can test the test system.

@openshift-ci-robot
Copy link

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: cgwalters

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@cgwalters
Copy link
Member Author

Ahh oops I didn't see #2064 before this...hmm, I think we can merge the two?

Copy link
Member

@jlebon jlebon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ahh oops I didn't see #2064 before this...hmm, I think we can merge the two?

Sure, SGTM!

But I'm really hesitant to get this in again without at least a declarative way to disable the check via e.g. kola-denylist.yaml. (But ideally, we'd have fine-grained control on allowed SELinux violations like suggested in #1920 (comment), but that could come after).

// CheckMachineBasics validates that no systemd units have
// failed and no SELinux AVC denials were logged. It may be
// extended in the future - the idea is these are "baseline"
// errors that shouldn't be seen across any tests.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the idea is these are "baseline" errors that shouldn't be seen across any tests.

Hmm, I'm concerned this might not be a safe assumption. The tests themselves could be testing these errors. I don't really imagine us testing the SELinux policy really, but failed systemd units sounds plausible. Though I guess... the test then could fix the error condition and restart the unit so that it's not in a failed state?

Anyway, willing to try!

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The tests themselves could be testing these errors.

Right, but anything intentionally causing a systemd unit failure could do e.g. systemctl reset-failed before it exits. Scrubbing SELinux avc denials though is probably a lot more annoying to do.

Perhaps simplest is a per-test flag to turn this off.

mantle/platform/platform.go Show resolved Hide resolved
Copy link
Member

@ashcrow ashcrow left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 to this. Will relook once the two similar PRs are brought together.

Copy link
Member

@travier travier left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 to merging PRs and what's in this one :)

First: Revert "Revert "kola: Add --no-default-checks""
 (this reverts commit: 57b3520 )

Now, when I went to add the SELinux denials check, I completely missed
that we only were checking for denials *before* the test runs.

We obviously want to check both before and after - the same
way we do with e.g. looking for errors on the console.

As part of this, I'd also added `--no-default-checks` but it
actually didn't work.  Fix that so that it does work, and add
a test case with a failing systemd unit that verifies this,
so we can test the test system.

In order to help ratchet things, also add a special `default-checks`
entry to the denylist.  This way if e.g. we are hitting SELinux AVC
denials just on one arch/test we can turn them off temporarily.
@@ -63,6 +63,7 @@ func init() {
bv(&kola.NoNet, "no-net", false, "Don't run tests that require an Internet connection")
ssv(&kola.Tags, "tag", []string{}, "Test tag to run. Can be specified multiple times.")
bv(&kola.Options.SSHOnTestFailure, "ssh-on-test-failure", false, "SSH into a machine when tests fail")
bv(&kola.Options.SuppressDefaultChecks, "no-default-checks", false, "Disable default checks for failed systemd units and SELinux AVC denials")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thinking more on this; WDYT about splitting this into two separate switches? One for systemd units and one for SELinux denials.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To elaborate: I think eventually we should have precise control over what checks we want to allow to fail. Refining that can come later (because it requires more kola work), but at least right now we already have two broad categories we can split on.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm. Agree that unit failures are usually much worse. But making it more granular feels tricky because we're likely to add more here. A good example is systemd generator failures and systemd ordering failures are also missed today.
And another good one is coredumps (which usually result in a unit failure, but not always).

Maybe what we really want is a syntax to ignore specific AVC denials or unit failures? Something like:

# This file documents currently known-to-fail kola tests. It is consumed by
# coreos-assembler to automatically skip some tests. For more information,
# see: https://github.com/coreos/coreos-assembler/pull/866.
tests:
  - pattern: fcos.internet
    tracker: https://github.com/coreos/coreos-assembler/pull/1478
  - pattern: podman.workflow
    tracker: https://github.com/coreos/coreos-assembler/pull/1478
avc:
  # 3 tuple is string match on (vector, scontext, tcontext)
  - read chronyd_t systemd_resolved_var_run_t
units:
  # Failing on vsphere, but not a blocker
  - NetworkManager.service

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But making it more granular feels tricky because we're likely to add more here.

True. But while having a single switch is useful for dev, it's too blunt an instrument for CI/prod because it makes it too easy to miss out on regressions because of e.g. one AVC we're working around. (This was part of the motivation for #2064.)

Maybe what we really want is a syntax to ignore specific AVC denials or unit failures? Something like:

Yes, 100%. (This is what I mean with #1920 (comment); though yeah it'd be cleaner to first make the top-level object a dict rather than a list).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And probably before that, move this denylist directly into kola.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we're willing to blanket-ignore denials/failures in specific tests, we have the existing test flags mechanism for this. It doesn't currently support more granular filters, but in principle it could.

cgwalters added a commit to cgwalters/fedora-coreos-config that referenced this pull request Mar 8, 2021
Pairs with coreos/coreos-assembler#2067
which will allow us to turn on SELinux AVC checks before/after
all tests.  But we can't do that until the chrony AVC is fixed.
@openshift-ci
Copy link

openshift-ci bot commented Jun 19, 2021

@cgwalters: PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci
Copy link

openshift-ci bot commented Feb 2, 2022

@cgwalters: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
ci/prow/images d53a4f6 link true /test images
ci/prow/rhcos d53a4f6 link true /test rhcos

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@jlebon
Copy link
Member

jlebon commented Apr 18, 2024

I think there's rough agreement that this should be more granular rather than a global flag.

One easy approach would just be special tags (e.g. allow-failed-units and allow-avc-denials) that tests could use to opt out of those checks.

That said, we actually haven't felt this pain too much in recent times AFAIK.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants