
process_linux.go:258: applying cgroup configuration for process caused \"open /sys/fs/cgroup/pids/kubepods.slice/kubepods-besteffort.slice/.../cgroup.procs: no such file or directory #16246

Closed
wongma7 opened this issue Sep 8, 2017 · 14 comments
Assignees
Labels
component/kubernetes kind/test-flake priority/P1
Milestone

Comments

@wongma7
Contributor

wongma7 commented Sep 8, 2017

https://openshift-gce-devel.appspot.com/build/origin-ci-test/pr-logs/pull/16232/test_pull_request_origin_extended_conformance_gce/7498/

Reason:ContainerCannotRun,Message:invalid header field value "oci runtime error: container_linux.go:247: starting container process caused "process_linux.go:258: applying cgroup configuration for process caused \"open /sys/fs/cgroup/pids/kubepods.slice/kubepods-besteffort.slice/kubepods-besteffort-pod86daabc9_9486_11e7_a2b8_42010a800005.slice/docker-ca877fc648dcf466185be3e66b95e774f4589f866620920615c40c12bfe86f2a.scope/cgroup.procs: no such file or directory\""\n"

@smarterclayton
Contributor

https://openshift-gce-devel.appspot.com/build/origin-ci-test/logs/test_branch_origin_extended_conformance_gce/1757/

Since this could be a runtime regression I'd like this triaged and investigated. I assume it could be just a poorly handled error but that's no excuse.

@smarterclayton smarterclayton added this to the 3.7.0 milestone Oct 21, 2017
@jim-minter
Contributor

@Kargakis @jwhonce @bparees @mfojtik @ewolinetz @simo5 @pweil- @tnozicka @sjenning I have a suspicion that #16994, #16881, #16985, #16023 and this issue are all duplicates of each other. If you think so, I suggest closing the others and renaming this issue to process_linux.go:258: applying cgroup configuration for process caused \"open /sys/fs/cgroup/pids/kubepods.slice/kubepods-besteffort.slice/.../cgroup.procs: no such file or directory.

@sjenning sjenning changed the title flake: [k8s.io] Projected should be consumable from pods in volume as non-root [Conformance] [Volume] process_linux.go:258: applying cgroup configuration for process caused \"open /sys/fs/cgroup/pids/kubepods.slice/kubepods-besteffort.slice/.../cgroup.procs: no such file or directory Nov 29, 2017
@sjenning
Contributor

@vikaschoudhary16 can you take a look at this one please? I've duped many issues to this one, so it seems to happen pretty frequently.

@sjenning sjenning removed their assignment Nov 29, 2017
@vikaschoudhary16
Contributor

@sjenning sure!!

@runcom
Member

runcom commented Nov 30, 2017

we're experiencing the very same issue in CRI-O; it's either a runc or systemd bug 😕

@runcom
Member

runcom commented Nov 30, 2017

/cc @mrunalp

@vikaschoudhary16
Contributor

Here is the experimental PR: #17600
I found a missing lock in runc. With this PR, I am verifying whether this missing lock is the reason for this flake. I will keep the PR open for a long time and keep rerunning CI tests to see if the same flake reappears on it.
Meanwhile, if the flake is observed often on other PRs but never on this experimental PR over a longer period of time, chances are high that the lock has really helped.

Corresponding runc PR: opencontainers/runc#1668

@vikaschoudhary16
Contributor

Was this flake noticed recently? I have not seen it in the past couple of days.

@vikaschoudhary16
Contributor

This issue is definitely not an Origin-specific issue. Here it is reported on OSP:
https://bugzilla.redhat.com/show_bug.cgi?id=1523043

@vikaschoudhary16
Contributor

vikaschoudhary16 commented Jan 7, 2018

How I debugged a race in systemd

Context/Background

There are several open bugs, across several repos, which are basically about a rarely occurring container creation failure: containers fail to join the pids cgroup controller.

process_linux.go:258: applying cgroup configuration for process caused "open /sys/fs/cgroup/pids/kubepods.slice/kubepods-besteffort.slice/.../cgroup.procs: no such file or directory

Following are some of these issues:

Where is the race in the code:

  • T0: runc triggers a systemd unit creation asynchronously from here
  • T1: runc then moves ahead and starts creating the cgroup paths (.scope directories), here. The kernel creates the .scope directory and the cgroup.procs file (along with the other default files) in the directory automatically, in an atomic manner.
  • T3: the systemd execution thread that was invoked at time T0 is still in the process of unit creation. systemd is also trying to create the cgroup paths, and it deletes the .scope directory that was created at time T1 by runc, from here in the code. Here is the related call trace from systemd:
(gdb) bt
#0  cg_trim (controller=<optimized out>, path=<optimized out>, delete_root=<optimized out>) at src/shared/cgroup-util.c:575
#1  0x0000564b1c6b85b0 in cg_create_everywhere (path=0x564b1caefc90 "/libcontainer_54684_systemd_test_default.slice", mask=_CGROUP_CONTROLLER_MASK_ALL, supported=63)
    at src/shared/cgroup-util.c:1637
#2  unit_create_cgroups (mask=_CGROUP_CONTROLLER_MASK_ALL, u=0x564b1c9e4e70) at src/core/cgroup.c:679
#3  unit_realize_cgroup_now.26298 (u=u@entry=0x564b1c9e4e70, state=MANAGER_RUNNING) at src/core/cgroup.c:751
#4  0x0000564b1c6b8a89 in unit_realize_cgroup (u=<optimized out>) at src/core/cgroup.c:853
#5  0x0000564b1c688e9e in slice_start.65343 (u=<optimized out>) at src/core/slice.c:187
#6  0x0000564b1c69dd9c in unit_start (u=<optimized out>) at src/core/unit.c:1486
#7  job_run_and_invalidate (j=<optimized out>) at src/core/job.c:532
#8  manager_dispatch_run_queue.4007 (source=<optimized out>, userdata=0x564b1c9d3e10) at src/core/manager.c:1539
#9  0x0000564b1c6cb032 in source_dispatch (s=s@entry=0x564b1c9d27e0) at src/libsystemd/sd-event/sd-event.c:2155
#10 0x0000564b1c6cb2aa in sd_event_dispatch (e=0x564b1c9d47e0) at src/libsystemd/sd-event/sd-event.c:2472
#11 0x0000564b1c6704d7 in manager_loop (m=0x564b1c9d3e10) at src/core/manager.c:2212
#12 main (argc=4, argv=0x7fff036998b8) at src/core/main.c:1801
  • T4: runc, unaware that systemd has already deleted the .scope directory it created, tries to write the container PID to cgroup.procs, which results in the container creation failure with the error message shown at the beginning (see the sketch below).
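
To make the interleaving concrete, here is a minimal sketch of the fire-and-forget pattern that opens the race window. This is not runc's actual code: startScopeUnit and joinPidsCgroup are hypothetical stand-ins, and only the ordering matters.

package race

import (
	"io/ioutil"
	"os"
	"path/filepath"
	"strconv"
)

// T0: ask systemd to create the transient .scope unit, but return without
// waiting for the job result. systemd realizes the unit (and trims cgroup
// hierarchies it considers unwanted) on its own event loop, concurrently.
func startScopeUnit(name string) {
	// e.g. fire the dbus StartTransientUnit call and ignore the result channel
}

// T1/T4: meanwhile runc creates the cgroup path itself and writes the
// container PID into cgroup.procs. If systemd's cg_trim() removes the
// directory between MkdirAll and the write, the write fails with ENOENT,
// which is exactly the "no such file or directory" error at the top.
func joinPidsCgroup(scopePath string, pid int) error {
	if err := os.MkdirAll(scopePath, 0700); err != nil {
		return err
	}
	procs := filepath.Join(scopePath, "cgroup.procs")
	return ioutil.WriteFile(procs, []byte(strconv.Itoa(pid)), 0700)
}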

Debugging Steps:

To fix any issue, or to verify the effect of a probable fix, one should be able to reproduce it.
This issue was not easily reproducible. I built runc from source and used runc --systemd-cgroup create <container-id> in a Go script, test-runc.go, to launch ~200 containers in parallel using goroutines.

package main

import (
	"bytes"
	"fmt"
	"os"
	"os/exec"
	"strconv"
)

func main() {
	ch := make(chan int, 200)
	// launch 200 containers (ids 100..299) in parallel
	for i := 100; i < 300; i++ {
		go run(i, ch)
	}
	// wait until every goroutine has reported back
	for i := 100; i < 300; i++ {
		fmt.Printf("waiting on channel, i=%v\n", i)
		j := <-ch
		fmt.Printf("received on channel %v\n", j)
	}
	close(ch)
	fmt.Printf("All created\n")
}

func run(i int, ch chan int) {
	var stderr bytes.Buffer
	id := strconv.Itoa(i)
	cmd := exec.Command("runc", "--systemd-cgroup", "create", id)
	cmd.Dir = "/mycontainer"
	cmd.Stderr = &stderr
	if err := cmd.Run(); err != nil {
		fmt.Println(fmt.Sprint(err) + ": " + stderr.String())
		fmt.Printf("error on channel, i=%v\n", i)
		os.Exit(-1)
	}
	fmt.Printf("writing to channel, i=%v\n", i)
	ch <- i
}

Didn't work out. Every time, the script created 200 containers successfully. I ensured that my systemd, runc and kernel versions were the same as in the reported issues. Beyond 200, container creation started failing with a "too many open files" error. After increasing the system's default limits, the script started creating 500 containers successfully, but the original issue still did not reproduce. What next?
I did some analysis on the situation.

...
func join(c *configs.Cgroup, subsystem string, pid int) (string, error) {
...
350         if err := os.MkdirAll(path, 0755); err != nil {
351                 return "", err
352         }
353         if err := cgroups.WriteCgroupProc(path, pid); err != nil {
354                 return "", err
355         }
356         return path, nil
357 }
...

There could be only two possible reasons for runc not finding the cgroup.procs file at L#353:

  • First, L#350 returned after creating the .scope directory, but the child files had not yet been created by the kernel when runc execution reached L#353
  • OR cgroup.procs was created by the kernel in an atomic fashion, but some other process (most probably systemd) deleted the file (and the .scope directory) between the execution of L#350 and L#353

I strongly suspected the second one, which proved to be true after some browsing of the kernel code in cgroup.c.
This means systemd must be deleting the file after the execution of L#350 and before the execution of L#353. Since these are almost consecutive lines, systemd gets only a very small time window in which to delete the file, which makes the issue hard to reproduce.
Let's widen this time window for systemd by adding a sleep of 1 second between L#350 and L#353. I made the following change in runc and reran test-runc.go:

...
350         if err := os.MkdirAll(path, 0755); err != nil {
351                 return "", err
352         }
353         if strings.Contains(path, "pids") {
354                 time.Sleep(1 * time.Second)
355         }
356         if err := cgroups.WriteCgroupProc(path, pid); err != nil {
...

And BANG!!! The issue was now reproducible on every run of test-runc.go, i.e. the race would hit for sure each time the script tried to create 200 containers.

strace on systemd:

Then I ran strace on PID 1 while executing test-runc.go. Comparing the syscalls made by systemd for a successful container create against a failure case further verified that systemd is actually doing an rmdir() on the scope unit. To be specific, there was a failed rmdir() on the pids cgroup scope in each successful case, and a successful rmdir() on the pids cgroup in the race-hit container case.
But how to find out from where, and why, this rmdir() syscall on the pids cgroup is invoked in the code?
To debug further and understand the cause, a systemd dev setup was needed, with an instrumented systemd, built from source, running.

Building and installing instrumented systemd:

  • First I tried systemd-nspawn. It did not serve the purpose: inside the launched container /sys/fs/cgroup was read-only and it was not possible to create any cgroups. Maybe I missed some configuration.
  • Tried make; make install on the systemd source inside a Fedora VM. The VM hung at reboot and I could not find the reason, so I dropped this approach as well.
  • Then I tried kvmtool to launch a VM, with a filesystem prepared separately with systemd installed:
$ yum -y --releasever=25 --nogpg --installroot=$HOME/distro-trees/fedora-24/ --disablerepo='*' --enablerepo=fedora install systemd passwd yum fedora-release vim-minimal
$ cd systemd; DESTDIR=$HOME/distro-trees/fedora-24 make install
$ cd ../kvmtool
$ ./lkvm run --disk /root/distro-trees/fedora-24 --kernel ../bzImage --network bridge --mem 8024

This approach seemed to be working. I set up runc, systemd and test-runc.go in this VM. But good times were still a bit far off: test-runc.go was able to create as many as 500 containers each time, and the issue did not reproduce. When I verified through strace and the systemd logs, I found that systemd, somehow, was not invoking the rmdir() syscall on the pids cgroup at all. Since on bare metal I had seen the connection between the race and the rmdir() on the pids cgroup, it was clear why the issue was not reproducing in the kvmtool VM.
While I was still trying to understand the cause of the missing rmdir and the reason for the difference in behavior from bare metal, one of my team mates suggested using an rpm-based systemd installation on bare metal.

  • Then finally I used rhpkg and rpmbuild to create and install a patched systemd. It worked like a charm. I finally had the setup I wanted for tapping into systemd.

Why does systemd invoke rmdir() on the pids cgroup?

There is a patch in systemd-219, generated from the sources fetched using rhpkg, 0475-core-add-support-for-the-pids-cgroup-controller.patch. This patch adds CGROUP_PIDS to CGroupControllerMask (along with some other changes) but does not update _CGROUP_CONTROLLER_MASK_ALL from 31 to 63. This mistake makes the check at L#1628 fail for pids, and therefore cg_trim() is always invoked for pids.

Here is the relevant code:

1605         "cpu\0"
1606         "cpuacct\0"
1607         "blkio\0"
1608         "memory\0"
1609         "devices\0"
1610         "pids\0";
1611  
1612 int cg_create_everywhere(CGroupControllerMask supported, CGroupControllerMask mask, const char *path) {
1613         CGroupControllerMask bit = 1;
1614         const char *n;
1615         int r;
1616                 
1617         /* This one will create a cgroup in our private tree, but also
1618          * duplicate it in the trees specified in mask, and remove it
1619          * in all others */
1620  
1621         /* First create the cgroup in our own hierarchy. */
1622         r = cg_create(SYSTEMD_CGROUP_CONTROLLER, path);
1623         if (r < 0)
1624                 return r;
1625                         
1626         /* Then, do the same in the other hierarchies */
1627         NULSTR_FOREACH(n, mask_names) {
1628                 if (mask & bit)
1629                         cg_create(n, path);
1630                 else if (supported & bit)
1631                         cg_trim(n, path, true);
1632 
1633                 bit <<= 1;
1634         }
1635 
1636         return 0;
1637 }

src/shared/cgroup-util.h
32 typedef enum CGroupControllerMask {
 33         CGROUP_CPU = 1,
 34         CGROUP_CPUACCT = 2,
 35         CGROUP_BLKIO = 4,
 36         CGROUP_MEMORY = 8,
 37         CGROUP_DEVICE = 16,
 38         CGROUP_PIDS = 32,
 39         _CGROUP_CONTROLLER_MASK_ALL = 31
 40 } CGroupControllerMask;

At L#39, the value should be 63, so that _CGROUP_CONTROLLER_MASK_ALL includes the CGROUP_PIDS bit.
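
To see the consequence, here is a small Go sketch (not systemd code; it just redoes the bit arithmetic of cg_create_everywhere above) showing which branch the pids controller takes for the two values of _CGROUP_CONTROLLER_MASK_ALL:

package main

import "fmt"

// Controller bits as defined in src/shared/cgroup-util.h above.
const (
	CGROUP_CPU     = 1
	CGROUP_CPUACCT = 2
	CGROUP_BLKIO   = 4
	CGROUP_MEMORY  = 8
	CGROUP_DEVICE  = 16
	CGROUP_PIDS    = 32
)

func main() {
	supported := 63 // kernel exposes all six controllers, as in the backtrace (supported=63)
	for _, maskAll := range []int{31, 63} { // buggy vs. fixed _CGROUP_CONTROLLER_MASK_ALL
		switch {
		case maskAll&CGROUP_PIDS != 0:
			fmt.Printf("MASK_ALL=%d: pids takes the cg_create() branch\n", maskAll)
		case supported&CGROUP_PIDS != 0:
			fmt.Printf("MASK_ALL=%d: pids takes the cg_trim() branch (the bug)\n", maskAll)
		}
	}
}

With 31, the pids bit (32) is present in supported but missing from mask, so the else-if at L#1630 runs and systemd trims (removes) the pids cgroup that runc has just created; with 63, pids falls into the cg_create() branch instead.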

Can this be handled on the runc side?

Yes, here is the fix: wait on the channel until StartTransientUnit is done.
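
For reference, the idea behind that change looks roughly like this (a minimal sketch assuming the coreos/go-systemd dbus bindings; startUnitAndWait is a hypothetical helper and the one-second timeout is illustrative):

package cgroupfix

import (
	"fmt"
	"time"

	systemdDbus "github.com/coreos/go-systemd/dbus"
)

// startUnitAndWait blocks until systemd reports the result of the transient
// unit job, so the cg_trim() performed while the unit is being realized can
// no longer race with runc's own MkdirAll/WriteCgroupProc calls.
func startUnitAndWait(conn *systemdDbus.Conn, unitName string, properties []systemdDbus.Property) error {
	statusChan := make(chan string, 1)
	if _, err := conn.StartTransientUnit(unitName, "replace", properties, statusChan); err != nil {
		return err
	}
	select {
	case s := <-statusChan:
		// systemd sends "done" on success; anything else means the job failed.
		if s != "done" {
			return fmt.Errorf("error creating systemd unit %s: got %s", unitName, s)
		}
	case <-time.After(time.Second):
		return fmt.Errorf("timed out while waiting for unit %s to start", unitName)
	}
	return nil
}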

@jim-minter
Contributor

@vikaschoudhary16 nice write-up, and I'm glad I was on the right lines :) Have you notified the systemd folks about the needed _CGROUP_CONTROLLER_MASK_ALL = 63 change?

@derekwaynecarr
Member

@vikaschoudhary16 -- great job isolating this.

@vikaschoudhary16
Contributor

@jim-minter not yet; I will talk to @lnykrn (Lukáš Nykrýn) today.

@vikaschoudhary16
Contributor

vikaschoudhary16 commented Jan 9, 2018

Reported : lnykryn/systemd-rhel#180
Fix got merged both at systemd and runc:
lnykryn/systemd-rhel#181
opencontainers/runc#1683
