
process_linux.go:258: applying cgroup configuration for process caused \"open /sys/fs/cgroup/pids/kubepods.slice/kubepods-besteffort.slice/.../cgroup.procs: no such file or directory #16246

Closed
wongma7 opened this issue Sep 8, 2017 · 14 comments
Assignees
Labels
component/kubernetes kind/test-flake priority/P1
Milestone

Comments

@wongma7
Contributor

wongma7 commented Sep 8, 2017

https://openshift-gce-devel.appspot.com/build/origin-ci-test/pr-logs/pull/16232/test_pull_request_origin_extended_conformance_gce/7498/

Reason:ContainerCannotRun,Message:invalid header field value "oci runtime error: container_linux.go:247: starting container process caused "process_linux.go:258: applying cgroup configuration for process caused \"open /sys/fs/cgroup/pids/kubepods.slice/kubepods-besteffort.slice/kubepods-besteffort-pod86daabc9_9486_11e7_a2b8_42010a800005.slice/docker-ca877fc648dcf466185be3e66b95e774f4589f866620920615c40c12bfe86f2a.scope/cgroup.procs: no such file or directory\""\n"

@smarterclayton
Contributor

https://openshift-gce-devel.appspot.com/build/origin-ci-test/logs/test_branch_origin_extended_conformance_gce/1757/

Since this could be a runtime regression I'd like this triaged and investigated. I assume it could be just a poorly handled error but that's no excuse.

@smarterclayton smarterclayton added this to the 3.7.0 milestone Oct 21, 2017
@jim-minter
Contributor

@Kargakis @jwhonce @bparees @mfojtik @ewolinetz @simo5 @pweil- @tnozicka @sjenning I have a suspicion that #16994, #16881, #16985, #16023 and this issue are all duplicates of each other. If you think so, I suggest closing the others and renaming this issue to process_linux.go:258: applying cgroup configuration for process caused \"open /sys/fs/cgroup/pids/kubepods.slice/kubepods-besteffort.slice/.../cgroup.procs: no such file or directory.

@sjenning sjenning changed the title flake: [k8s.io] Projected should be consumable from pods in volume as non-root [Conformance] [Volume] process_linux.go:258: applying cgroup configuration for process caused \"open /sys/fs/cgroup/pids/kubepods.slice/kubepods-besteffort.slice/.../cgroup.procs: no such file or directory Nov 29, 2017
@sjenning
Contributor

@vikaschoudhary16 can you take a look at this one please? I've duped many issues to this one, so it seems to happen pretty frequently.

@sjenning sjenning removed their assignment Nov 29, 2017
@vikaschoudhary16
Contributor

@sjenning sure!!

@runcom
Member

runcom commented Nov 30, 2017

we're experiencing the very same issue in CRI-O; it's either a runc or systemd bug 😕

@runcom
Member

runcom commented Nov 30, 2017

/cc @mrunalp

@vikaschoudhary16
Contributor

Here is the experimental PR: #17600
I found a missing lock in runc. With this PR, I am verifying whether this missing lock is the reason for this flake. I will keep the PR open for a long time and keep rerunning CI tests to see if the same flake reappears on it.
Meanwhile, if the flake is observed often on other PRs but never on this experimental PR over a longer period of time, chances are high that the lock has really helped.

Corresponding runc PR: opencontainers/runc#1668

@vikaschoudhary16
Contributor

Was this flake noticed recently? I have not seen it in the past couple of days.

@vikaschoudhary16
Contributor

This issue is definitely not an Origin-specific issue. Here it is reported on OSP:
https://bugzilla.redhat.com/show_bug.cgi?id=1523043

@vikaschoudhary16
Contributor

vikaschoudhary16 commented Jan 7, 2018

How I debugged a race in systemd

Context/Background

There are several open bugs, across several repos, which are basically about a rarely occurring container creation failure: containers fail to join the pids cgroup controller.

process_linux.go:258: applying cgroup configuration for process caused "open /sys/fs/cgroup/pids/kubepods.slice/kubepods-besteffort.slice/.../cgroup.procs: no such file or directory

Following are some of these issues:

Where is the race in the code:

  • T0: runc triggers a systemd unit creation asynchronously from here
  • T1: runc then moves ahead and starts creating the cgroup paths (.scope directories), here. The kernel creates the .scope directory and the cgroup.procs file (along with the other default files) in the directory automatically, in an atomic manner.
  • T3: the systemd execution thread that was invoked at time T0 is still in the process of unit creation. systemd is also trying to create the cgroup paths, and it deletes the .scope directory that was created at time T1 by runc, from here in the code. Here is the related call trace from systemd:
(gdb) bt
#0  cg_trim (controller=<optimized out>, path=<optimized out>, delete_root=<optimized out>) at src/shared/cgroup-util.c:575
#1  0x0000564b1c6b85b0 in cg_create_everywhere (path=0x564b1caefc90 "/libcontainer_54684_systemd_test_default.slice", mask=_CGROUP_CONTROLLER_MASK_ALL, supported=63)
    at src/shared/cgroup-util.c:1637
#2  unit_create_cgroups (mask=_CGROUP_CONTROLLER_MASK_ALL, u=0x564b1c9e4e70) at src/core/cgroup.c:679
#3  unit_realize_cgroup_now.26298 (u=u@entry=0x564b1c9e4e70, state=MANAGER_RUNNING) at src/core/cgroup.c:751
#4  0x0000564b1c6b8a89 in unit_realize_cgroup (u=<optimized out>) at src/core/cgroup.c:853
#5  0x0000564b1c688e9e in slice_start.65343 (u=<optimized out>) at src/core/slice.c:187
#6  0x0000564b1c69dd9c in unit_start (u=<optimized out>) at src/core/unit.c:1486
#7  job_run_and_invalidate (j=<optimized out>) at src/core/job.c:532
#8  manager_dispatch_run_queue.4007 (source=<optimized out>, userdata=0x564b1c9d3e10) at src/core/manager.c:1539
#9  0x0000564b1c6cb032 in source_dispatch (s=s@entry=0x564b1c9d27e0) at src/libsystemd/sd-event/sd-event.c:2155
#10 0x0000564b1c6cb2aa in sd_event_dispatch (e=0x564b1c9d47e0) at src/libsystemd/sd-event/sd-event.c:2472
#11 0x0000564b1c6704d7 in manager_loop (m=0x564b1c9d3e10) at src/core/manager.c:2212
#12 main (argc=4, argv=0x7fff036998b8) at src/core/main.c:1801
  • T4: runc, unaware that systemd has already deleted the .scope directory it created, tries to write the container PID to cgroup.procs, which results in the container creation failure with the error message shown at the beginning (see the sketch below).
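
To make the interleaving concrete, here is a minimal sketch of the fire-and-forget pattern that opens the race window. This is not runc's actual code: startScopeUnit and joinPidsCgroup are hypothetical stand-ins, and only the ordering matters.

package race

import (
	"io/ioutil"
	"os"
	"path/filepath"
	"strconv"
)

// T0: ask systemd to create the transient .scope unit, but return without
// waiting for the job result. systemd realizes the unit (and trims cgroup
// hierarchies it considers unwanted) on its own event loop, concurrently.
func startScopeUnit(name string) {
	// e.g. fire the dbus StartTransientUnit call and ignore the result channel
}

// T1/T4: meanwhile runc creates the cgroup path itself and writes the
// container PID into cgroup.procs. If systemd's cg_trim() removes the
// directory between MkdirAll and the write, the write fails with ENOENT,
// which is exactly the "no such file or directory" error at the top.
func joinPidsCgroup(scopePath string, pid int) error {
	if err := os.MkdirAll(scopePath, 0700); err != nil {
		return err
	}
	procs := filepath.Join(scopePath, "cgroup.procs")
	return ioutil.WriteFile(procs, []byte(strconv.Itoa(pid)), 0700)
}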

Debugging Steps:

To fix any issue, or to verify the effect of a probable fix, one should be able to reproduce it.
This issue was not easily reproducible. I built runc from source and used runc --systemd-cgroup create <container-id> in a Go script, test-runc.go, to launch ~200 containers in parallel using goroutines.

package main

import (
	"bytes"
	"fmt"
	"os"
	"os/exec"
	"strconv"
)

func main() {
	ch := make(chan int, 200)
	// launch 200 containers (ids 100..299) in parallel
	for i := 100; i < 300; i++ {
		go run(i, ch)
	}
	// wait until every goroutine has reported back
	for i := 100; i < 300; i++ {
		fmt.Printf("waiting on channel, i=%v\n", i)
		j := <-ch
		fmt.Printf("received on channel %v\n", j)
	}
	close(ch)
	fmt.Printf("All created\n")
}

func run(i int, ch chan int) {
	var stderr bytes.Buffer
	id := strconv.Itoa(i)
	cmd := exec.Command("runc", "--systemd-cgroup", "create", id)
	cmd.Dir = "/mycontainer"
	cmd.Stderr = &stderr
	if err := cmd.Run(); err != nil {
		fmt.Println(fmt.Sprint(err) + ": " + stderr.String())
		fmt.Printf("error on channel, i=%v\n", i)
		os.Exit(-1)
	}
	fmt.Printf("writing to channel, i=%v\n", i)
	ch <- i
}

Didn't work out. Every time, the script created 200 containers successfully. I ensured that my systemd, runc and kernel versions were the same as in the reported issues. Beyond 200, container creation started failing with a "too many open files" error. After increasing the system's default limits, the script started creating 500 containers successfully, but the original issue still did not reproduce. What next?
I did some analysis on the situation.

...
func join(c *configs.Cgroup, subsystem string, pid int) (string, error) {
...
350         if err := os.MkdirAll(path, 0755); err != nil {
351                 return "", err
352         }
353         if err := cgroups.WriteCgroupProc(path, pid); err != nil {
354                 return "", err
355         }
356         return path, nil
357 }
...

There could be only two possible reasons for runc not finding the cgroup.procs file at L#353:

  • First, L#350 returned after creating the .scope directory, but the child files had not yet been created by the kernel when runc execution reached L#353
  • OR cgroup.procs was created by the kernel in an atomic fashion, but some other process (most probably systemd) deleted the file (and the .scope directory) between the execution of L#350 and L#353

I strongly suspected the second one, which proved to be true after some browsing of the kernel code in cgroup.c.
This means systemd must be deleting the file after the execution of L#350 and before the execution of L#353. Since these are almost consecutive lines, systemd gets only a very small time window in which to delete the file, which makes the issue hard to reproduce.
Let's widen this time window for systemd by adding a sleep of 1 second between L#350 and L#353. I made the following change in runc and reran test-runc.go:

...
350         if err := os.MkdirAll(path, 0755); err != nil {
351                 return "", err
352         }
353         if strings.Contains(path, "pids") {
354                 time.Sleep(1 * time.Second)
355         }
356         if err := cgroups.WriteCgroupProc(path, pid); err != nil {
...

And BANG!!! The issue was now reproducible on every run of test-runc.go, i.e. the race would hit for sure each time the script tried to create 200 containers.

strace on systemd:

Then I ran strace on PID 1 while executing test-runc.go. Comparing the syscalls made by systemd for a successful container create against a failure case further verified that systemd is actually doing an rmdir() on the scope unit. To be specific, there was a failed rmdir() on the pids cgroup scope in each successful case, and a successful rmdir() on the pids cgroup in the race-hit container case.
But how to find out from where, and why, this rmdir() syscall on the pids cgroup is invoked in the code?
To debug further and understand the cause, a systemd dev setup was needed, with an instrumented systemd, built from source, running.

Building and installing instrumented systemd:

  • First I tried systemd-nspawn. It did not serve the purpose: inside the launched container /sys/fs/cgroup was read-only and it was not possible to create any cgroups. Maybe I missed some configuration.
  • Tried make; make install on the systemd source inside a Fedora VM. The VM hung at reboot and I could not find the reason, so I dropped this approach as well.
  • Then I tried kvmtool to launch a VM, with a filesystem prepared separately with systemd installed:
$ yum -y --releasever=25 --nogpg --installroot=$HOME/distro-trees/fedora-24/ --disablerepo='*' --enablerepo=fedora install systemd passwd yum fedora-release vim-minimal
$ cd systemd; DESTDIR=$HOME/distro-trees/fedora-24 make install
$ cd ../kvmtool
$ ./lkvm run --disk /root/distro-trees/fedora-24 --kernel ../bzImage --network bridge --mem 8024

This approach seemed to be working. I set up runc, systemd and test-runc.go in this VM. But good times were still a bit far off: test-runc.go was able to create as many as 500 containers each time, and the issue did not reproduce. When I verified through strace and the systemd logs, I found that systemd, somehow, was not invoking the rmdir() syscall on the pids cgroup at all. Since on bare metal I had seen the connection between the race and the rmdir() on the pids cgroup, it was clear why the issue was not reproducing in the kvmtool VM.
While I was still trying to understand the cause of the missing rmdir and the reason for the difference in behavior from bare metal, one of my team mates suggested using an rpm-based systemd installation on bare metal.

  • Then finally I used rhpkg and rpmbuild to create and install a patched systemd. It worked like a charm. I finally had the setup I wanted for tapping into systemd.

Why does systemd invoke rmdir() on the pids cgroup?

There is a patch in systemd-219, generated from the sources fetched using rhpkg, 0475-core-add-support-for-the-pids-cgroup-controller.patch. This patch adds CGROUP_PIDS to CGroupControllerMask (along with some other changes) but does not update _CGROUP_CONTROLLER_MASK_ALL from 31 to 63. This mistake makes the check at L#1628 fail for pids, and therefore cg_trim() is always invoked for pids.

Here is the relevant code:

1605         "cpu\0"
1606         "cpuacct\0"
1607         "blkio\0"
1608         "memory\0"
1609         "devices\0"
1610         "pids\0";
1611  
1612 int cg_create_everywhere(CGroupControllerMask supported, CGroupControllerMask mask, const char *path) {
1613         CGroupControllerMask bit = 1;
1614         const char *n;
1615         int r;
1616                 
1617         /* This one will create a cgroup in our private tree, but also
1618          * duplicate it in the trees specified in mask, and remove it
1619          * in all others */
1620  
1621         /* First create the cgroup in our own hierarchy. */
1622         r = cg_create(SYSTEMD_CGROUP_CONTROLLER, path);
1623         if (r < 0)
1624                 return r;
1625                         
1626         /* Then, do the same in the other hierarchies */
1627         NULSTR_FOREACH(n, mask_names) {
1628                 if (mask & bit)
1629                         cg_create(n, path);
1630                 else if (supported & bit)
1631                         cg_trim(n, path, true);
1632 
1633                 bit <<= 1;
1634         }
1635 
1636         return 0;
1637 }

src/shared/cgroup-util.h
32 typedef enum CGroupControllerMask {
 33         CGROUP_CPU = 1,
 34         CGROUP_CPUACCT = 2,
 35         CGROUP_BLKIO = 4,
 36         CGROUP_MEMORY = 8,
 37         CGROUP_DEVICE = 16,
 38         CGROUP_PIDS = 32,
 39         _CGROUP_CONTROLLER_MASK_ALL = 31
 40 } CGroupControllerMask;

At L#39, the value should be 63, so that _CGROUP_CONTROLLER_MASK_ALL includes the CGROUP_PIDS bit.
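
To see the consequence, here is a small Go sketch (not systemd code; it just redoes the bit arithmetic of cg_create_everywhere above) showing which branch the pids controller takes for the two values of _CGROUP_CONTROLLER_MASK_ALL:

package main

import "fmt"

// Controller bits as defined in src/shared/cgroup-util.h above.
const (
	CGROUP_CPU     = 1
	CGROUP_CPUACCT = 2
	CGROUP_BLKIO   = 4
	CGROUP_MEMORY  = 8
	CGROUP_DEVICE  = 16
	CGROUP_PIDS    = 32
)

func main() {
	supported := 63 // kernel exposes all six controllers, as in the backtrace (supported=63)
	for _, maskAll := range []int{31, 63} { // buggy vs. fixed _CGROUP_CONTROLLER_MASK_ALL
		switch {
		case maskAll&CGROUP_PIDS != 0:
			fmt.Printf("MASK_ALL=%d: pids takes the cg_create() branch\n", maskAll)
		case supported&CGROUP_PIDS != 0:
			fmt.Printf("MASK_ALL=%d: pids takes the cg_trim() branch (the bug)\n", maskAll)
		}
	}
}

With 31, the pids bit (32) is present in supported but missing from mask, so the else-if at L#1630 runs and systemd trims (removes) the pids cgroup that runc has just created; with 63, pids falls into the cg_create() branch instead.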

Can this be handled on the runc side?

Yes, here is the fix: wait on the channel until StartTransientUnit is done.
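
For reference, the idea behind that change looks roughly like this (a minimal sketch assuming the coreos/go-systemd dbus bindings; startUnitAndWait is a hypothetical helper and the one-second timeout is illustrative):

package cgroupfix

import (
	"fmt"
	"time"

	systemdDbus "github.com/coreos/go-systemd/dbus"
)

// startUnitAndWait blocks until systemd reports the result of the transient
// unit job, so the cg_trim() performed while the unit is being realized can
// no longer race with runc's own MkdirAll/WriteCgroupProc calls.
func startUnitAndWait(conn *systemdDbus.Conn, unitName string, properties []systemdDbus.Property) error {
	statusChan := make(chan string, 1)
	if _, err := conn.StartTransientUnit(unitName, "replace", properties, statusChan); err != nil {
		return err
	}
	select {
	case s := <-statusChan:
		// systemd sends "done" on success; anything else means the job failed.
		if s != "done" {
			return fmt.Errorf("error creating systemd unit %s: got %s", unitName, s)
		}
	case <-time.After(time.Second):
		return fmt.Errorf("timed out while waiting for unit %s to start", unitName)
	}
	return nil
}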

@jim-minter
Contributor

@vikaschoudhary16 nice write-up, and I'm glad I was on the right lines :) Have you notified the systemd folks about the needed _CGROUP_CONTROLLER_MASK_ALL = 63 change?

@derekwaynecarr
Member

@vikaschoudhary16 -- great job isolating this.

@vikaschoudhary16
Contributor

@jim-minter not yet; I will talk to @lnykrn (Lukáš Nykrýn) today.

@vikaschoudhary16
Contributor

vikaschoudhary16 commented Jan 9, 2018

Reported : lnykryn/systemd-rhel#180
Fix got merged both at systemd and runc:
lnykryn/systemd-rhel#181
opencontainers/runc#1683
