runc init hang #2530

Closed
likakuli opened this issue Jul 28, 2020 · 20 comments

@likakuli

runc exec hangs when the runc init command receives an abort signal and the error message is larger than 65536 bytes before execv is executed.

By default, containerd starts the exec process with its stderr set to a pipe. runc init hangs when the error message is larger than 65536 bytes, because the pipe is full and cannot be written to any more.
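
For illustration, here is a minimal, self-contained Go sketch (a standalone demo, not runc or containerd code) of this class of deadlock: the child fills the default 64 KiB pipe buffer on its stderr while the parent waits for it to exit without ever reading the pipe, so both sides are stuck.

package main

import (
	"fmt"
	"os"
	"os/exec"
)

func main() {
	r, w, err := os.Pipe()
	if err != nil {
		panic(err)
	}
	defer r.Close()

	// The child writes ~128 KiB to stderr, exceeding the default 64 KiB pipe buffer.
	cmd := exec.Command("sh", "-c", "head -c 131072 /dev/zero >&2")
	cmd.Stderr = w
	if err := cmd.Start(); err != nil {
		panic(err)
	}
	w.Close() // the parent keeps only the read end

	// Deadlock: the parent waits before draining the pipe, so the child
	// blocks forever in write(2) and Wait() never returns.
	fmt.Println("waiting (this hangs)...")
	fmt.Println("wait returned:", cmd.Wait())
}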

@cyphar
Member

cyphar commented Jul 28, 2020

Can you give an example of this happening? Also it seems to me -- at least from your description -- like the issue is that containerd isn't reading from the other side of the pipe (or increasing the pipe buffer size).

@likakuli
Author

likakuli commented Jul 29, 2020

(screenshots attached in the original comment)

fd 2 is the stderr; it is created via os.Pipe() and passed to runc by containerd. If runc exec triggers a cgroup limit and receives an abort signal before execv is executed, the error message is written to stderr (the pipe).

containerd reads the pipe only after the runc process has exited, but the runc process hangs.

@cyphar
Member

cyphar commented Jul 29, 2020

containerd reads the pipe only after the runc process has exited, but the runc process hangs.

This seems like strange behaviour -- if you run a program with its stdio as a pipe, it's reasonable to expect that the calling program should read from the pipe frequently. From runc we could work around this by increasing the buffer size of stdio if we're called under a pipe but that is not a good idea to do generally -- there might be a reason why the caller has set the pipe buffer to a particular size.

Containerd should (at least based on the description) either read from the fifo regularly, use a regular file as stdio for runc, or increase the fifo buffer size to be large enough for runc's stack trace output. Is there a bug open in containerd for this issue?
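
For reference, here is a hedged sketch of the "increase the pipe buffer size" option using fcntl(F_SETPIPE_SZ) via golang.org/x/sys/unix. This is only illustrative (not something runc or containerd is claimed to do); the size the kernel will accept for an unprivileged process is bounded by /proc/sys/fs/pipe-max-size.

package main

import (
	"fmt"
	"os"

	"golang.org/x/sys/unix"
)

func main() {
	r, w, err := os.Pipe()
	if err != nil {
		panic(err)
	}
	defer r.Close()
	defer w.Close()

	// Grow the pipe buffer from the default 64 KiB to 1 MiB so a large error
	// message (e.g. a Go stack trace) fits without blocking the writer.
	newSize, err := unix.FcntlInt(w.Fd(), unix.F_SETPIPE_SZ, 1<<20)
	if err != nil {
		panic(err)
	}
	fmt.Println("pipe buffer is now", newSize, "bytes")
}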

@likakuli
Author

Containerd should (at least based on the description) either read from the fifo regularly, use a regular file as stdio for runc, or increase the fifo buffer size to be large enough for runc's stack trace output. Is there a bug open in containerd for this issue?

Not yet. If containerd read from the pipe before waiting for the runc process to exit, this behaviour would disappear as well. I will open an issue in containerd.
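
A sketch of that caller-side fix (illustrative only, not containerd's actual code): start draining the child's stderr in a goroutine before calling Wait, so the pipe can never fill up and block the child.

package main

import (
	"bytes"
	"fmt"
	"io"
	"os"
	"os/exec"
)

func main() {
	r, w, err := os.Pipe()
	if err != nil {
		panic(err)
	}

	cmd := exec.Command("sh", "-c", "head -c 131072 /dev/zero >&2")
	cmd.Stderr = w
	if err := cmd.Start(); err != nil {
		panic(err)
	}
	w.Close() // the parent keeps only the read end

	// Drain stderr concurrently, *before* waiting for the child to exit.
	var stderr bytes.Buffer
	done := make(chan struct{})
	go func() {
		io.Copy(&stderr, r)
		close(done)
	}()

	err = cmd.Wait() // safe now: the goroutine keeps emptying the pipe
	<-done
	fmt.Printf("child exited (err=%v), captured %d bytes of stderr\n", err, stderr.Len())
}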

@shizhx

shizhx commented Sep 19, 2020

Docker version 19.03.2, build 6a30dfc

We ran into a similar problem: the docker exec <container name> command sometimes hangs on runc init:

admin    18930  0.0  0.1 108760  5496 ?        Sl   Sep17   0:11  \_ containerd-shim -namespace moby -workdir /var/lib/containerd/io.containerd.runtime.v1.linux/moby/cd93cca285f42a419ac10e39f1cfce42523b3deab1ff71005574e8390c27741e -address /run/containerd/containerd.sock -containerd-binary /usr/bin/containerd -runtime-root /var/run/docker/runtime-runc
admin    19984  0.1  0.3 563436 13240 ?        Sl   14:38   0:00  |   \_ runc --root /var/run/docker/runtime-runc/moby --log /run/containerd/io.containerd.runtime.v1.linux/moby/cd93cca285f42a419ac10e39f1cfce42523b3deab1ff71005574e8390c27741e/log.json --log-format json exec --process /tmp/runc-process873404975 --detach --pid-file /run/containerd/io.containerd.runtime.v1.linux/moby/cd93cca285f42a419
admin    19991  0.0  0.1  19732  5800 ?        D    14:38   0:00  |       \_ runc init

and the Linux kernel soon panics via khungtaskd:

    DUMPFILE: vmcore  [PARTIAL DUMP]
        CPUS: 4
        DATE: Fri Sep 18 14:41:32 CST 2020
      UPTIME: 3 days, 16:32:11
LOAD AVERAGE: 3.09, 1.95, 1.74
       TASKS: 1054
    NODENAME: aTrust
     RELEASE: 4.18.0
     VERSION: #5 SMP Thu Aug 6 14:08:14 CST 2020
     MACHINE: x86_64  (2095 Mhz)
      MEMORY: 4 GB
       PANIC: "Kernel panic - not syncing: hung_task: blocked tasks"
         PID: 37
     COMMAND: "khungtaskd"
        TASK: ffff9e8abbbb0000  [THREAD_INFO: ffff9e8abbbb0000]
         CPU: 1
       STATE: TASK_RUNNING (PANIC)

crash> bt
PID: 37     TASK: ffff9e8abbbb0000  CPU: 1   COMMAND: "khungtaskd"
 #0 [ffffbb1600777d28] machine_kexec at ffffffffb6657b5e
 #1 [ffffbb1600777d80] __crash_kexec at ffffffffb675356d
 #2 [ffffbb1600777e48] panic at ffffffffb66ae418
 #3 [ffffbb1600777ec8] watchdog at ffffffffb6788647
 #4 [ffffbb1600777f10] kthread at ffffffffb66d1232
 #5 [ffffbb1600777f50] ret_from_fork at ffffffffb7000255

but we don't know why. We'd appreciate any help.

@fuweid
Member

fuweid commented Sep 28, 2020

/cc

@gaopeiliang

watch

@zvier
Contributor

zvier commented Mar 9, 2021

(quotes @shizhx's full report above)

Same here. Have you solved this problem?

@cseufert

cseufert commented Mar 16, 2021

I get the same problem with runc init hanging, however I don't see any kernel panics.

If I try to strace the runc init PID, it just makes it exit, the container does not start, and I get:
docker: Error response from daemon: OCI runtime create failed: container_linux.go:367: starting container process caused: error loading seccomp filter into kernel: loading seccomp filter: invalid argument: unknown.

And if I kill the runc --root /var/run/docker/runtime-runc/moby --log /run/... process, then it will usually start; sometimes I have to do the kill/start cycle more than once.

Server is using CentOS 7

# uname -a
Linux host 3.10.0-957.27.2.el7.x86_64 #1 SMP Mon Jul 29 17:46:05 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux

# docker version                                                                                                                                            
Client: Docker Engine - Community
 Version:           20.10.5
 API version:       1.41
 Go version:        go1.13.15
 Git commit:        55c4c88
 Built:             Tue Mar  2 20:33:55 2021
 OS/Arch:           linux/amd64
 Context:           default
 Experimental:      true

Server: Docker Engine - Community
 Engine:
  Version:          20.10.5
  API version:      1.41 (minimum version 1.12)
  Go version:       go1.13.15
  Git commit:       363e9a8
  Built:            Tue Mar  2 20:32:17 2021
  OS/Arch:          linux/amd64
  Experimental:     false
 containerd:
  Version:          1.4.4
  GitCommit:        05f951a3781f4f2c1911b05e61c160e9c30eaa8e
 runc:
  Version:          1.0.0-rc93
  GitCommit:        12644e614e25b05da6fd08a38ffa0cfe1903fdec
 docker-init:
  Version:          0.19.0
  GitCommit:        de40ad0

@isobit

isobit commented Mar 16, 2021

I'm running into this exact same problem on Ubuntu 18.04. It seems to happen intermittently on some of our machines: containers get created but won't start, the process tree shows runc init, and attaching strace causes runc init to exit with error loading seccomp filter into kernel: loading seccomp filter: invalid argument.

$ uname -a
Linux <REDACTED> 5.4.0-1039-aws #41~18.04.1-Ubuntu SMP Fri Feb 26 11:20:14 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
$ docker version
Client: Docker Engine - Community
 Version:           20.10.5
 API version:       1.41
 Go version:        go1.13.15
 Git commit:        55c4c88
 Built:             Tue Mar  2 20:18:05 2021
 OS/Arch:           linux/amd64
 Context:           default
 Experimental:      true

Server: Docker Engine - Community
 Engine:
  Version:          20.10.5
  API version:      1.41 (minimum version 1.12)
  Go version:       go1.13.15
  Git commit:       363e9a8
  Built:            Tue Mar  2 20:16:00 2021
  OS/Arch:          linux/amd64
  Experimental:     false
 containerd:
  Version:          1.4.4
  GitCommit:        05f951a3781f4f2c1911b05e61c160e9c30eaa8e
 runc:
  Version:          1.0.0-rc93
  GitCommit:        12644e614e25b05da6fd08a38ffa0cfe1903fdec
 docker-init:
  Version:          0.19.0
  GitCommit:        de40ad0

@cseufert

Another interesting note is that if you kill the runc init child process, docker returns the following error:

docker: Error response from daemon: cannot start a stopped process: unknown.

Also, if I docker rm -f container1 and then immediately start it again (docker run --restart=always ... my/image:v1.0) after killing the runc init process, sometimes it will start up and sometimes it takes 5+ tries (kill, rm, run) to get the container to start, but it eventually does. It is also interesting that @isobit and I are running the exact same version of Docker but different kernels (theirs being Ubuntu, I assume).

@isobit

isobit commented Mar 16, 2021

@cseufert we're on 5.4.0; I've updated my uname -a output above to include it.

@cseufert

I wonder if it's this commit:
7a8d716#diff-3693bdb9f47092da6d6e35007a1689f74ddc39be8e017d475a593481c437ac3b

I am tempted to try and drop in runc 1.0.0-rc92 and see if that changes anything, but not sure I want to do that on a production machine.

@isobit

isobit commented Mar 19, 2021

So it turns out for us the root cause of this issue was saturated disk throughput (we were exhausting our EBS burst credits), which makes sense given that this issue seems to be related to backlogged pipes.

@wu0407

wu0407 commented Mar 31, 2021

same issue containerd/containerd#5280

@dresnick-sf

I am tempted to try and drop in runc 1.0.0-rc92 and see if that changes anything, but not sure I want to do that on a production machine.

I tried that and it finally solved my problem.

@cseufert

I ended up dropping in the rc92 binaries and yes, it seems to be working fine.

@cyphar
Member

cyphar commented May 30, 2021

We have fixed what I believe to be the root cause of this issue in 1.0.0-rc94 (though you should update to 1.0.0-rc95 since it fixes a security issue). The PR was #2871.

@cyphar cyphar closed this as completed May 30, 2021
@cyphar
Member

cyphar commented May 30, 2021

Actually the original issue pre-dates the issue fixed by #2871, but there has been no update from the original reporter so I'm keeping this closed (the original issue looked as though containerd was not reading from the container's stdio pipes).

@hemangjoshi37a

Has this issue been solved or not? I am having it on my PC, and it is quite annoying to hard reset the machine every couple of hours. If anyone has a solution, please let me know. Thanks.
