Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CI Cleanup: Remove cgroups v1 & runc support #23020

Merged
merged 1 commit into from
Jun 21, 2024

Conversation

cevich
Copy link
Member

@cevich cevich commented Jun 17, 2024

With (esp. Debian) CI VM images built by containers/automation_images#338 CI no-longer tests with runc nor cgroups v1. Add logic to fail under these conditions. Prune back high-level YAML/script envars and logic formerly required to support these things.

Does this PR introduce a user-facing change?

None

@cevich cevich requested a review from edsantiago June 17, 2024 19:35
@openshift-ci openshift-ci bot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. release-note-none labels Jun 17, 2024
@openshift-merge-robot openshift-merge-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jun 18, 2024
@cevich
Copy link
Member Author

cevich commented Jun 18, 2024

@edsantiago I sent you a slack message, posting it here assuming it was lost:

I noticed a TON of system test calls to skip_if_cgroupsv1() and some is_cgroupsv2(). I'm guessing there are a lot of similar e2e conditionals. I'm unsure if I should bother updating/removing all of them for the new "cgroups v2 only" world-order. I'm assuming no, but do you have a different opinion?

@openshift-merge-robot openshift-merge-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jun 18, 2024
@Luap99
Copy link
Member

Luap99 commented Jun 18, 2024

I noticed a TON of system test calls to skip_if_cgroupsv1() and some is_cgroupsv2(). I'm guessing there are a lot of similar e2e conditionals. I'm unsure if I should bother updating/removing all of them for the new "cgroups v2 only" world-order. I'm assuming no, but do you have a different opinion?

Does downstream QE run these tests with cgroupsv1 on RHEL 9? If so I think it is best to keep them for a while at least. If not I like to remove them, although I wouldn't block this PR on it. That could happen in a follow up, I think there a bigger CI priorities right now compared to removing a bunch of conditionals.

@edsantiago
Copy link
Member

Saw your message, am still catching up from PTO. I'd say removing the cgroups conditionals is a rainy-day exercise for the future. Although it seems trivial, it won't be (I expect linter issues, easy but tedious). Oh, and Paul's point is a good one: I had assumed that podman v5 is cgroupsv2-only, but I never know what RHEL is going to do.

For the time being, I think it's best to not tackle cgroups conditionals.

@cevich
Copy link
Member Author

cevich commented Jun 18, 2024

Thanks for the feedback guys, I too had not considered the RHEL case 😊

Do we even ever have "rainy days" 🤣

Comment on lines 92 to 96
# See https://github.com/containers/podman/pull/21431
if [[ -n "$PODMAN_IGNORE_CGROUPSV1_WARNING" ]]; then
skip "impossible to test due to pitfalls in our SSH implementation"
fi

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@edsantiago PTAL this sys podman debian-13 root host sqlite failure is curious. ISTM like PODMAN_IGNORE_CGROUPSV1_WARNING being forced 'true' was hiding this failure somehow.

The test error is coming from the end of run_podman(), but the condition catching it is really complex. Since the rootless version passed, should I add a special-case in setup_environment() on rootful Debian, where setup_rootless() enables lingering for $ROOTLESS_USER (like the warning message suggests)?

@edsantiago
Copy link
Member

[+0815s] not ok 277 [120] podman image scp transfer in 2599ms
...
<+052ms> # # podman image scp foo.bar/nonesuch/c_9yzja1xujd:mytag some9825dude@localhost::
...
         # time="2024-06-18T15:55:35Z" level=warning msg="The cgroupv2 manager is set to systemd but there is no systemd user session available"

This smells like an ssh problem. Maybe a missing loginctl, or some sort of pam setup not being done in debian for rootless?

@cevich
Copy link
Member Author

cevich commented Jun 18, 2024

Looks like we cross-posted.

Maybe a missing loginctl, or some sort of pam setup not being done in debian for rootless?

I'm pretty sure there is no loginctl used for the rootless setup. Since that was my thought too, let's give it a try...

@cevich
Copy link
Member Author

cevich commented Jun 18, 2024

Force-push: Added test fixme! commit to see if enabling rootless lingering fixes podman image scp problem on rootful debian.

@cevich
Copy link
Member Author

cevich commented Jun 18, 2024

Same/similar failure despite lingering being enabled for the rootless user. Thinking more, I wonder if this is happening because CGROUP_MANAGER=systemd is set in /etc/ci_environment and is getting passed through into the rootless user environment somehow 🤔

@cevich
Copy link
Member Author

cevich commented Jun 18, 2024

I wonder if

Answer: Doesn't appear to be. There's no modification of the rootless user's .bashrc or .bash_profile or anything else that would load /etc/ci_environment for the user, unless maybe podman itself is passing the current CGROUP_MANAGER value through?

I think the next step is to just go hands-on with hack/get_ci_vm.sh where the entire setup can be simulated for experimentation.

@Luap99
Copy link
Member

Luap99 commented Jun 19, 2024

What system version is used? On fedora we saw a regression were lingering was broken on 256-rc1 to 3 using 256-rc4 or final release fixed it again AFAIK

@edsantiago
Copy link
Member

I think that may be the problem: systemd on debian is 256~rc3-5. The important thing there is the rc3, which is bad; I had misread the 5 as being >4 and therefore good. That was the wrong part to look at.

Since rootless tests work despite the bad systemd, I would suggest just leaving this ssh test disabled for now. Unless someone feels like building new CI VMs.

@cevich
Copy link
Member Author

cevich commented Jun 20, 2024

I would suggest just leaving this ssh test disabled

Maybe add a timebomb() onto it?

@cevich
Copy link
Member Author

cevich commented Jun 20, 2024

Confirmed, looks like Lokesh's recent builds have 256~rc3-5 😞

@cevich
Copy link
Member Author

cevich commented Jun 20, 2024

Force-push: Added skip for scp test on debian

@cevich
Copy link
Member Author

cevich commented Jun 20, 2024

This is a new one for me:

<+016ms> # # podman pull quay.io/libpod/testimage:20240123
<+0120s> # Trying to pull quay.io/libpod/testimage:20240123...
         # timeout: sending signal TERM to command ‘/var/tmp/go/src/github.com/containers/podman/bin/podman’
<+005ms> # [ rc=124 ]
         # *** TIMED OUT ***
         # # [teardown]

Assuming it's a flake and re-running.

@cevich cevich marked this pull request as ready for review June 20, 2024 20:16
@openshift-ci openshift-ci bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jun 20, 2024
@cevich
Copy link
Member Author

cevich commented Jun 20, 2024

@edsantiago want me to wait for #23058 to go in, then re-test this w/o the debian/systemd scp test skip?

@edsantiago
Copy link
Member

@cevich CI is blowing up hard right now (see #23059); I don't know if it's a github problem, or cirrus, or something to do with the new cirrus.yml skips. Let's get that resolved before pushing anything.

(But yes, once things clear up, I think it'd be good to rebase with the debian skip removed)

@cevich
Copy link
Member Author

cevich commented Jun 20, 2024

CI is blowing up hard right now

Ya I saw that 23059. IMO (I didn't look closely) ISTM could easily be a networking/quay timeout of some form. I think it's probably just a coincidence with the new skips.

@cevich
Copy link
Member Author

cevich commented Jun 20, 2024

force-push: Rebased on top of #23059 w/ updated CI VM images (#23058).

@edsantiago
Copy link
Member

I don't understand why you included #23059, but otherwise LGTM. Fingers crossed for debian CI

@edsantiago
Copy link
Member

Sigh

@cevich
Copy link
Member Author

cevich commented Jun 21, 2024

Oh my bad, for some reason I thought the timeout fixed the scp problem. Must have been brain-tired 😊

With (esp. Debian) CI VM images built by
https://github.com/containers/automation_images/ pull/338 CI no-longer
tests with runc nor cgroups v1.  Add logic to fail under these
conditions.  Prune back high-level YAML/script envars and logic formerly
required to support these things.

Signed-off-by: Chris Evich <cevich@redhat.com>
Copy link
Member

@Luap99 Luap99 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@rhatdan
Copy link
Member

rhatdan commented Jun 21, 2024

/approve
/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Jun 21, 2024
Copy link
Contributor

openshift-ci bot commented Jun 21, 2024

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: cevich, rhatdan

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jun 21, 2024
@openshift-merge-bot openshift-merge-bot bot merged commit 794c139 into containers:main Jun 21, 2024
90 checks passed
@stale-locking-app stale-locking-app bot added the locked - please file new issue/PR Assist humans wanting to comment on an old issue or PR with locked comments. label Sep 20, 2024
@stale-locking-app stale-locking-app bot locked as resolved and limited conversation to collaborators Sep 20, 2024
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. lgtm Indicates that a PR is ready to be merged. locked - please file new issue/PR Assist humans wanting to comment on an old issue or PR with locked comments. release-note-none
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants