
v1.1.6 regression: adding misc controller to cgroup v1 makes kubelet sad #3849

Closed · bcressey opened this issue Apr 27, 2023 · 14 comments

@bcressey commented Apr 27, 2023

Description

The most recent Bottlerocket release included an update to runc 1.1.6. Shortly after the release, we received reports of a regression where nodes would fall over after kubelet, systemd, and dbus-broker consumed excessive CPU and memory resources.

In bottlerocket-os/bottlerocket#3057 I narrowed this down via git bisect to e4ce94e, which was meant to fix this issue but instead now causes it to happen consistently.

I've confirmed that reverting that specific patch fixes the regression.

Steps to reproduce the issue

On an EKS 1.26 cluster with a single worker node, apply a consistent load via this spec:

apiVersion: batch/v1
kind: CronJob
metadata:
  name: hello
spec:
  schedule: "* * * * *"
  jobTemplate:
    spec:
      template:
        spec:
          containers:
          - name: hello
            image: busybox:1.28
            imagePullPolicy: IfNotPresent
            command:
            - /bin/sh
            - -c
            - date; echo Hello from the Kubernetes cluster; sleep $(( ( RANDOM % 10 )  + 1 ));
          restartPolicy: OnFailure
          hostNetwork: true
      parallelism: 100

(repro credit to @yeazelm)

After a short time, the "Path does not exist" and "Failed to delete cgroup paths" errors appear and continue even after the spec is deleted and the load is removed.

Describe the results you received and expected

systemd, kubelet, and dbus-broker all showed high CPU usage.

journalctl -f and busctl monitor showed these messages repeatedly:

Apr 27 00:20:09 ip-10-0-83-192.us-west-2.compute.internal kubelet[1206]: I0427 00:20:09.181437    1206 kubelet_getters.go:306] "Path does not exist" path="/var/lib/kubelet/pods/b08478e4-6c1b-461e-9fb5-e6c6411cf3ef/volumes"

‣ Type=signal  Endian=l  Flags=1  Version=1 Cookie=5417700  Timestamp="Thu 2023-04-27 00:21:08.887527 UTC"
  Sender=:1.0  Path=/org/freedesktop/systemd1  Interface=org.freedesktop.systemd1.Manager  Member=UnitNew
  UniqueName=:1.0
  MESSAGE "so" {
          STRING "kubepods-besteffort-podb08478e4_6c1b_461e_9fb5_e6c6411cf3ef.slice";
          OBJECT_PATH "/org/freedesktop/systemd1/unit/kubepods_2dbesteffort_2dpodb08478e4_5f6c1b_5f461e_5f9fb5_5fe6c6411cf3ef_2eslice";
  };

‣ Type=signal  Endian=l  Flags=1  Version=1 Cookie=5417701  Timestamp="Thu 2023-04-27 00:21:08.887570 UTC"
  Sender=:1.0  Path=/org/freedesktop/systemd1  Interface=org.freedesktop.systemd1.Manager  Member=JobNew
  UniqueName=:1.0
  MESSAGE "uos" {
          UINT32 454294;
          OBJECT_PATH "/org/freedesktop/systemd1/job/454294";
          STRING "kubepods-besteffort-podb08478e4_6c1b_461e_9fb5_e6c6411cf3ef.slice";
  };

‣ Type=method_call  Endian=l  Flags=0  Version=1 Cookie=449180  Timestamp="Thu 2023-04-27 00:21:10.622122 UTC"
  Sender=:1.11  Destination=org.freedesktop.systemd1  Path=/org/freedesktop/systemd1  Interface=org.freedesktop.systemd1.Manager  Member=StopUnit
  UniqueName=:1.11
  MESSAGE "ss" {
          STRING "kubepods-besteffort-podb08478e4_6c1b_461e_9fb5_e6c6411cf3ef.slice";
          STRING "replace";
  };

‣ Type=signal  Endian=l  Flags=1  Version=1 Cookie=5418558  Timestamp="Thu 2023-04-27 00:21:09.106241 UTC"
  Sender=:1.0  Path=/org/freedesktop/systemd1  Interface=org.freedesktop.systemd1.Manager  Member=UnitRemoved
  UniqueName=:1.0
  MESSAGE "so" {
          STRING "kubepods-besteffort-podb08478e4_6c1b_461e_9fb5_e6c6411cf3ef.slice";
          OBJECT_PATH "/org/freedesktop/systemd1/unit/kubepods_2dbesteffort_2dpodb08478e4_5f6c1b_5f461e_5f9fb5_5fe6c6411cf3ef_2eslice";
  };

What version of runc are you using?

# runc -v
runc version 1.1.6+bottlerocket
commit: 0f48801a0e21e3f0bc4e74643ead2a502df4818d
spec: 1.0.2-dev
go: go1.19.6
libseccomp: 2.5.4

Host OS information

Bottlerocket

Host kernel information

Bottlerocket releases cover a variety of kernels, so to break it down a bit:

  • Kubernetes 1.23, Linux 5.10, cgroup v1 - not affected
  • Kubernetes 1.24, Linux 5.15, cgroup v1 - affected
  • Kubernetes 1.25, Linux 5.15, cgroup v1 - affected
  • Kubernetes 1.26, Linux 5.15, cgroup v2 - not affected
@AkihiroSuda (Member)

Kubernetes 1.23, Linux 5.10, cgroup v1 - not affected
Kubernetes 1.24, Linux 5.15, cgroup v1 - affected

Could this be rather a regression on the kernel side?

@bcressey (Author)

Could this be rather a regression on the kernel side?

Conceivably, although:

  • The runc 1.1.5 to 1.1.6 change was the only runtime (vs. first boot time) change in the affected Bottlerocket release
  • No kernel changes were included
  • The "misc" controller is not present in the 5.10 kernel, so only images using the 5.15 kernel would be affected (a quick way to check for it is sketched below)
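
Since the presence of the "misc" controller is what separates the affected kernels from the unaffected ones, here is a minimal sketch (my own illustration, not tooling from this report) that checks a node by scanning /proc/cgroups, where the kernel lists every compiled-in controller:

package main

// Minimal sketch: report whether this kernel exposes the "misc" cgroup
// controller. /proc/cgroups has one controller per line; the first column
// is the subsystem name. Illustrative helper only.

import (
	"bufio"
	"fmt"
	"os"
	"strings"
)

func main() {
	f, err := os.Open("/proc/cgroups")
	if err != nil {
		fmt.Fprintln(os.Stderr, "cannot read /proc/cgroups:", err)
		os.Exit(1)
	}
	defer f.Close()

	sc := bufio.NewScanner(f)
	for sc.Scan() {
		fields := strings.Fields(sc.Text())
		if len(fields) > 0 && fields[0] == "misc" {
			// misc was added in kernel 5.13, so 5.15 nodes report it.
			fmt.Println("misc controller present: node can hit this issue on cgroup v1")
			return
		}
	}
	fmt.Println("misc controller not present: node should not be affected")
}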

I am happy to provide additional logs or dig deeper, since I can easily reproduce this on demand.

What's puzzling to me is why this apparently fixed the Kubernetes issue for some users, but triggers it reliably here where we never saw the problem before.

It's possible it comes down to kernel config differences or systemd versions or something like that.

@kolyshkin (Contributor)

I suspect that the runc 1.1.6 binary creates the misc cgroup, and then kubelet uses runc's libcontainer (of an older version) to remove it. That older libcontainer version doesn't know about misc (and systemd doesn't know about it either), so it's not removed.

Thus, bumping the runc/libcontainer dependency to 1.1.6 should fix this.
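
If that hypothesis is right, the failure mode looks roughly like the sketch below: a cleanup loop driven by a hardcoded subsystem list that predates "misc" never touches /sys/fs/cgroup/misc/<path>, so the pod's cgroup is never fully gone and kubelet keeps retrying. This is an illustration of the mechanism only; the subsystem list and the destroy helper are invented for the example, not libcontainer's actual code.

package main

// Illustration only (not libcontainer code): why a cgroup v1 cleanup loop
// that predates the "misc" controller leaves /sys/fs/cgroup/misc/<path> behind.

import (
	"fmt"
	"os"
	"path/filepath"
)

// Hypothetical hardcoded v1 subsystem list, as an older library might ship it;
// "misc" is absent because the list predates kernel 5.13.
var knownV1Subsystems = []string{
	"cpu", "cpuacct", "cpuset", "memory", "devices", "freezer",
	"net_cls", "net_prio", "blkio", "perf_event", "hugetlb", "pids",
}

// destroy removes the container's directory under each *known* controller.
// Anything the list does not mention is silently left in place.
func destroy(innerPath string) {
	for _, sub := range knownV1Subsystems {
		dir := filepath.Join("/sys/fs/cgroup", sub, innerPath)
		if err := os.Remove(dir); err != nil && !os.IsNotExist(err) {
			fmt.Fprintf(os.Stderr, "failed to remove %s: %v\n", dir, err)
		}
	}
	// /sys/fs/cgroup/misc/<innerPath> is never removed, so from kubelet's
	// perspective the pod cgroup still exists and the delete loop repeats.
}

func main() {
	destroy("kubepods.slice/kubepods-besteffort.slice/example.slice")
}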

@bcressey (Author)

@dims pointed me at #3028 which was helpful background for me.

Let me try to step through the timeline as I understand it:

  1. kubernetes/kubernetes/issues/112124 is opened and narrowed down to the "misc" cgroup being added to the pod, preventing its cleanup.
  2. Although runc is mentioned, prior to e4ce94e in 1.1.6, runc did not always join containers to the misc controller.
  3. kubelet depends on runc's cgroup libraries, so to clean up pods that use the "misc" controller, runc's cgroup library first needed to be made aware of it, and then kubelet's dependency on that library needed to be updated.
  4. Meanwhile, any host with a 5.13+ kernel that updates to runc 1.1.6 under the assumption that it's a stable bugfix release now experiences the original failure from (1).
  5. kubelet can now include the updated cgroup library, so new releases of kubelet can address the problems in (1) and (4).

One way to frame it is:

kubelet uses runc's cgroup library, which is a fully supported use case. Since kubelet needed a library update to handle the unexpected "misc" controller, and that change did not break the library API, it was appropriate for a stable release.

Another way might be:

End users install the latest binary release of runc alongside the latest binary release of their desired Kubernetes version. However, the latest runc binary breaks at least some of the latest Kubernetes binaries, given the inherent time lag involved in coordinating multiple project releases. Therefore the change was not appropriate for a stable release.

I don't intend to criticize, I'm just trying to understand the path forward. If the only safe path is to update runc and kubelet's runc dependency in lockstep, I can work with that. If they're expected to be independent, I can file bug reports if it ever turns out that they're not. Right now I'm not sure.

(Would it have been possible to add awareness of the "misc" controller to the cgroup library, so kubelet could handle its existence, without also changing the runc binary's default behavior to join that controller?)

@dims (Contributor) commented Apr 29, 2023

@bcressey right now, for the record, the sequence of events is:

Also, once a release of k8s is made, we usually don't update the vendored dependency (this current snafu could be an exception), though we may update the binary we test against.

So currently what we tell folks is that a distro should probably follow the lead from k8s (and not jump ahead of the signals we get from the dozens of CI jobs across containerd and k8s).

@bcressey (Author)

So currently what we tell folks is that a distro should probably follow the lead from k8s (and not jump ahead of the signals we get from the dozens of CI jobs across containerd and k8s).

@dims how does this work in a situation where the runc update also contains security fixes?

That's not the case with runc 1.1.6, but it did fix at least one concerning bug around adding containers to the proper cgroup, which is why we pulled it into Bottlerocket.

So this is not that situation, exactly, but it's very much at the top of my mind, and not just a theoretical exercise.

@dims (Contributor) commented Apr 29, 2023

@dims how does this work in a situation where the runc update also contains security fixes?

@bcressey yes, this is a problem currently for sure.

@kolyshkin (Contributor)

I don't intend to criticize, I'm just trying to understand the path forward. If the only safe path is to update runc and kubelet's runc dependency in lockstep, I can work with that. If they're expected to be independent, I can file bug reports if it ever turns out that they're not. Right now I'm not sure.

(Would it have been possible to add awareness of the "misc" controller to the cgroup library, so kubelet could handle its existence, without also changing the runc binary's default behavior to join that controller?)

To clarify:

  • A newer runc binary that is aware of the misc controller, combined with Kubernetes built against an older runc/libcontainer that is not aware of it, is a problem.
  • An older runc binary with a newer Kubernetes is not a problem.

In other words, the runc binary and runc/libcontainer do not need to be totally in sync. However, if the runc binary knows about misc but k8s doesn't, it's a problem (provided we have cgroup v1 and a sufficiently new kernel).

One way to fix this would be for runc/libcontainer's Destroy method to remove all controllers, known and unknown. The problem here is that the cgroup v1 hierarchy is a forest, not a tree, i.e. we have a bunch of paths like /sys/fs/cgroup/<controller>/<some/path>, and the list of known controllers is hardcoded.

There may be a way to fix this issue. Let me see...
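
For reference, the rough shape of that "remove all controllers, known and unknown" idea is sketched below: instead of a hardcoded list, walk the hierarchies actually mounted under /sys/fs/cgroup and remove the container's path under each. This is only my sketch of the direction, not a proposed patch; it assumes the standard systemd cgroup v1 layout and skips the hard parts (processes that still need to be moved out, retries, nested paths, named hierarchies).

package main

// Sketch (assumes the usual v1 mounts under /sys/fs/cgroup): remove a
// container's path under every mounted hierarchy, including controllers
// this code has never heard of, such as "misc".

import (
	"fmt"
	"os"
	"path/filepath"
)

func destroyAll(innerPath string) {
	entries, err := os.ReadDir("/sys/fs/cgroup")
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		return
	}
	for _, e := range entries {
		// Each directory here is a mounted v1 hierarchy (possibly comounted,
		// e.g. "cpu,cpuacct"). Skip the named systemd hierarchy and the
		// hybrid "unified" mount, which are managed separately.
		if !e.IsDir() || e.Name() == "systemd" || e.Name() == "unified" {
			continue
		}
		dir := filepath.Join("/sys/fs/cgroup", e.Name(), innerPath)
		if err := os.Remove(dir); err != nil && !os.IsNotExist(err) {
			fmt.Fprintf(os.Stderr, "failed to remove %s: %v\n", dir, err)
		}
	}
}

func main() {
	destroyAll("kubepods.slice/kubepods-besteffort.slice/example.slice")
}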

@kolyshkin (Contributor)

There may be a way to fix this issue. Let me see...

Such a way will still require updating runc/libcontainer in kubernetes, so it will not help with the current problem, only with similar future ones. I am trying to decide whether we should fix it or not.

Arguments for:

  • the issue is pretty bad; high CPU usage and tons of logs are no good;
  • it is likely that future kernels will add more controllers, so a similar issue may happen again.

Arguments against:

  • cgroup v1 is going to be deprecated in a few years (fingers crossed);
  • the implementation does not look trivial.

@kolyshkin (Contributor)

I spent more than half a day last week working on this, and it's not very easy. So, let's hope the kernel will not add more controllers any time soon.

In the meantime, here are the fixes for kubernetes:

@dims (Contributor) commented May 12, 2023

let's hope the kernel will not add more controllers any time soon.

🤞🏾 🤞🏾 🤞🏾 🤞🏾 🤞🏾 🤞🏾 🤞🏾

@hakman commented May 13, 2023

@kolyshkin To help others bumping into this issue, it would be helpful to update the release notes for v1.1.6 to make it clear that this may be / is a breaking change. Having "cgroup v1 drivers are now aware of misc controller" as a quick mention doesn't fully express the "sadness" it brings. 🥲

@kolyshkin (Contributor)

@hakman thanks, I've added a "Known issues" section to https://github.com/opencontainers/runc/releases/tag/v1.1.6

@kolyshkin (Contributor)

I believe this issue is addressed (as much as we can), so closing.
