Spawning PTY processes is many times slower on Docker 18.09 #502

Closed
2 of 3 tasks
spantaleev opened this issue Nov 26, 2018 · 13 comments

@spantaleev

  • This is a bug report
  • This is a feature request
  • I searched existing issues before opening this one

Expected behavior

Spawning processes which use a pseudo TTY is consistently fast.

Actual behavior

Sometimes, spawning processes that use a pseudo TTY inside a container is many times slower than usual.

The first time I encountered this problem (on a really slow machine), the time to spawn such a process in a container jumped from something reasonable (~0.5 seconds) to something much slower (8-10 seconds).

On faster machines, the problem is still reproducible (see below). It's much less noticeable, since the absolute numbers are smaller, but there's at least a 4-5x slowdown (process spawning goes from 0.1 seconds to 0.5+ seconds).

Steps to reproduce the behavior

As described in Additional environment details (below), I've managed to reproduce it on various CentOS 7 machines (Hetzner Cloud servers and Digital Ocean droplets; it's reproducible on both).

Before describing the steps, let me first define a few files for a "benchmark" Docker image that will be used to illustrate the performance problem.

bench.py:

print('Starting')

import pexpect
import time

s = time.time()
print('Spawn..')
# Well, you would usually do this with an interactive program.
# This is just for illustrative purposes.
pexpect.spawn('whoami')
print(time.time() - s)

bench.sh:

#!/bin/bash

# Strace is also available in the image, so running with it is possible
#time docker run -it --rm --security-opt seccomp:unconfined docker-benchmark strace -r python bench.py

time docker run -it --rm docker-benchmark python bench.py

Steps to reproduce on a clean CentOS 7 install:

# Make sure we're up to date and have the latest Docker (18.09)
yum update -y
yum install -y wget
wget https://download.docker.com/linux/centos/docker-ce.repo -O /etc/yum.repos.d/docker-ce.repo
yum install -y docker-ce
systemctl start docker # just start, do not enable (!) on startup

mkdir docker-benchmark
cd docker-benchmark
# Copy over the necessary files (bench.sh, bench.py, Dockerfile)

docker build -t docker-benchmark -f Dockerfile .

# Run this a couple of times. It should be fast (e.g. 0.1 sec.)
./bench.sh

# Restarting will likely cause performance to degrade
systemctl restart docker

# Try benchmarking again now. It should be consistently slower (e.g. 0.4+ sec)
./bench.sh

reboot -f

cd docker-benchmark
systemctl start docker

# Run this a couple of times. It's likely fast again.
./bench.sh

# Restarting will likely cause performance to degrade
systemctl restart docker

# Try benchmarking again now. It should be consistently slower.
./bench.sh


# Downgrade Docker (18.09 -> 18.06)
systemctl stop docker
yum remove -y docker-ce docker-ce-cli
yum install -y docker-ce-18.06.1.ce-3.el7

systemctl start docker

# Try benchmarking again now. It should be fast.
# Restarting Docker, rebooting or whatever, will not slow it down.
./bench.sh

Output of docker version:

Client:
 Version:           18.09.0
 API version:       1.39
 Go version:        go1.10.4
 Git commit:        4d60db4
 Built:             Wed Nov  7 00:48:22 2018
 OS/Arch:           linux/amd64
 Experimental:      false

Server: Docker Engine - Community
 Engine:
  Version:          18.09.0
  API version:      1.39 (minimum version 1.12)
  Go version:       go1.10.4
  Git commit:       4d60db4
  Built:            Wed Nov  7 00:19:08 2018
  OS/Arch:          linux/amd64
  Experimental:     false

Output of docker info:

Containers: 0
 Running: 0
 Paused: 0
 Stopped: 0
Images: 5
Server Version: 18.09.0
Storage Driver: overlay2
 Backing Filesystem: xfs
 Supports d_type: true
 Native Overlay Diff: true
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
 Volume: local
 Network: bridge host macvlan null overlay
 Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
Swarm: inactive
Runtimes: runc
Default Runtime: runc
Init Binary: docker-init
containerd version: c4446665cb9c30056f4998ed953e6d4ff22c7c39
runc version: 4fc53a81fb7c994640722ac585fa9ca548971871
init version: fec3683
Security Options:
 seccomp
  Profile: default
Kernel Version: 3.10.0-862.14.4.el7.x86_64
Operating System: CentOS Linux 7 (Core)
OSType: linux
Architecture: x86_64
CPUs: 1
Total Memory: 991.7MiB
Name: centos-s-1vcpu-1gb-fra1-01
ID: HJ5J:T65L:WR5U:W7GA:X3E7:NSEQ:HBIE:F6DF:B47K:C4OG:K33Q:I3XY
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): false
Registry: https://index.docker.io/v1/
Labels:
Experimental: false
Insecure Registries:
 127.0.0.0/8
Live Restore Enabled: false
Product License: Community Engine

Additional environment details (AWS, VirtualBox, physical, etc.)

I first encountered this problem on a CentOS 7.5 KVM machine running on Proxmox.

I have since also managed to reproduce it using CentOS 7 on:

  • Hetzner Cloud VM servers (€2.99 server in the Falkenstein region)
  • Digital Ocean droplets ($5 droplet in the Frankfurt region)

As the Steps to reproduce on a clean CentOS 7 install flow (above) shows, this is only an issue with Docker 18.09. Downgrading to 18.06 makes the issue go away.

I have tried to reproduce the problem on Ubuntu 18.04 (LTS), and at first I thought I hadn't managed to.
Or maybe I have: the performance I get on Ubuntu 18.04 matches the slow benchmarks from CentOS 7 + Docker 18.09, and even after downgrading Docker to 18.06 on Ubuntu 18.04 it's still just as slow. In short, it's always slow on Ubuntu, regardless of the Docker version, so maybe there is a problem there too.

Additional debugging output (strace):

bench.sh (above) also contains a line to run the same benchmark with strace -r.

I got the following results (the last part is the most interesting, as it shows where the slowdown is).

Slowness at SIGCHLD:

    ... more stuff above ...
     0.000026 stat("/dev/pts/1", {st_mode=S_IFCHR|0620, st_rdev=makedev(136, 1), ...}) = 0
     0.000033 open("/dev/pts/1", O_RDWR|O_NOCTTY) = 6
     0.000035 clone(child_stack=NULL, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x7f9d30ff89d0) = 8
     0.000232 close(6)                  = 0
     0.000276 ioctl(1, TCGETS, {B38400 opost isig icanon echo ...}) = 0
     0.000066 ioctl(1, TCGETS, {B38400 opost isig icanon echo ...}) = 0
     0.000117 fstat(5, {st_mode=S_IFCHR|0666, st_rdev=makedev(5, 2), ...}) = 0
     0.000039 ioctl(5, TCGETS, {B38400 opost isig icanon echo ...}) = 0
     0.000031 fstat(5, {st_mode=S_IFCHR|0666, st_rdev=makedev(5, 2), ...}) = 0
     0.000033 ioctl(5, TCGETS, {B38400 opost isig icanon echo ...}) = 0
     0.000062 lseek(5, 0, SEEK_CUR)     = -1 ESPIPE (Illegal seek)
     0.000056 lseek(5, 0, SEEK_CUR)     = -1 ESPIPE (Illegal seek)
     0.000114 close(4)                  = 0
     0.000039 read(3, "", 4096)         = 0
     0.648181 --- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=8, si_uid=0, si_status=0, si_utime=27, si_stime=37} ---
     0.000037 close(3)                  = 0
     0.000102 ioctl(5, TIOCSWINSZ, {ws_row=24, ws_col=80, ws_xpixel=0, ws_ypixel=0}) = 0
     0.000125 close(5)                  = 0
     0.000050 select(0, NULL, NULL, NULL, {tv_sec=0, tv_usec=100000}) = 0 (Timeout)
     0.100275 wait4(8, [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], WNOHANG, NULL) = 8
     0.000154 write(1, "0.7546336650848389\n", 19) = 19
     0.000185 rt_sigaction(SIGINT, {sa_handler=SIG_DFL, sa_mask=[], sa_flags=SA_RESTORER, sa_restorer=0x7f9d3067e0c0}, {sa_handler=0x7f9d30a96af0, sa_mask=[], sa_flags=SA_RESTORER, sa_restorer=0x7f9d3067e0c0}, 8) = 0
     0.006374 sigaltstack(NULL, {ss_sp=0x55aca4608c90, ss_flags=0, ss_size=8192}) = 0
     0.000051 sigaltstack({ss_sp=NULL, ss_flags=SS_DISABLE, ss_size=0}, NULL) = 0
     0.000242 exit_group(0)             = ?
     0.001359 +++ exited with 0 +++

or slowness at close():

    ... more stuff above ...
     0.000027 stat("/dev/pts/1", {st_mode=S_IFCHR|0620, st_rdev=makedev(136, 1), ...}) = 0
     0.000032 open("/dev/pts/1", O_RDWR|O_NOCTTY) = 6
     0.000038 clone(child_stack=NULL, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x7f35d72f39d0) = 8
     0.000235 close(6)                  = 0
     0.000289 ioctl(1, TCGETS, {B38400 opost isig icanon echo ...}) = 0
     0.000063 ioctl(1, TCGETS, {B38400 opost isig icanon echo ...}) = 0
     0.000114 fstat(5, {st_mode=S_IFCHR|0666, st_rdev=makedev(5, 2), ...}) = 0
     0.000040 ioctl(5, TCGETS, {B38400 opost isig icanon echo ...}) = 0
     0.000031 fstat(5, {st_mode=S_IFCHR|0666, st_rdev=makedev(5, 2), ...}) = 0
     0.000034 ioctl(5, TCGETS, {B38400 opost isig icanon echo ...}) = 0
     0.000052 lseek(5, 0, SEEK_CUR)     = -1 ESPIPE (Illegal seek)
     0.000057 lseek(5, 0, SEEK_CUR)     = -1 ESPIPE (Illegal seek)
     0.000083 close(4)                  = 0
     0.000035 read(3, "", 4096)         = 0
     0.647527 close(3)                  = 0
     0.000104 ioctl(5, TIOCSWINSZ, {ws_row=24, ws_col=80, ws_xpixel=0, ws_ypixel=0}) = 0
     0.000129 close(5)                  = 0
     0.000047 select(0, NULL, NULL, NULL, {tv_sec=0, tv_usec=100000}) = ? ERESTARTNOHAND (To be restarted if no handler)
     0.000495 --- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=8, si_uid=0, si_status=0, si_utime=24, si_stime=39} ---
     0.000018 select(0, NULL, NULL, NULL, {tv_sec=0, tv_usec=99545}) = 0 (Timeout)
     0.099814 wait4(8, [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], WNOHANG, NULL) = 8
     0.000204 write(1, "0.7540490627288818\n", 19) = 19
     0.000292 rt_sigaction(SIGINT, {sa_handler=SIG_DFL, sa_mask=[], sa_flags=SA_RESTORER, sa_restorer=0x7f35d69790c0}, {sa_handler=0x7f35d6d91af0, sa_mask=[], sa_flags=SA_RESTORER, sa_restorer=0x7f35d69790c0}, 8) = 0
     0.008504 sigaltstack(NULL, {ss_sp=0x5564cd06cc90, ss_flags=0, ss_size=8192}) = 0
     0.000052 sigaltstack({ss_sp=NULL, ss_flags=SS_DISABLE, ss_size=0}, NULL) = 0
     0.000216 exit_group(0)             = ?
     0.001180 +++ exited with 0 +++
@spantaleev
Author

TL;DR: the RLIMIT_NOFILE (nofile, the open-files limit) used when starting the underlying runc container strongly affects PTY performance: the higher the limit, the slower it gets. Using --ulimit nofile=1024 is a good workaround.


I've made some progress figuring this out.

After finding out that podman is also affected by this same performance problem (and always, not just sometimes), I dug deeper. I figured it might be a bug in runc or something like that.

Because it seems a little difficult to obtain the exact runc invocation when going through Docker, I proceeded to test with podman.

I managed to start a container with podman and find the runc config.json file it launches the container with.

After unpacking the problematic container into an OCI bundle, I could test it with both the config.json coming from podman and a default one generated by runc (runc spec).

Using the runc spec-generated config.json led to no performance trouble.

Using the config.json generated by podman led to performance problems.

Comparing the two, I found that the setting that affects performance is RLIMIT_NOFILE.

By default (as generated by runc spec), RLIMIT_NOFILE is 1024:1024.
There's no performance trouble with this.

The config.json generated by podman (and I assume Docker does something similar) has a much higher RLIMIT_NOFILE: 1048576:1048576. For some reason, accessing a PTY (e.g. /dev/pts/1) is much slower with such a limit.


Workaround & Benchmarks

It seems this limit is easily controlled with the --ulimit nofile=X flag to docker run (or podman run).

I've since found that low values for the nofile limit give good performance when dealing with PTYs, and that the higher the value, the slower it gets.

Here are some results (docker run -it --rm --ulimit nofile=X docker-benchmark python bench.py) for different values of X:

  • --ulimit nofile=1024: 0.10 seconds
  • --ulimit nofile=10240: 0.11 seconds
  • --ulimit nofile=102400: 0.19 seconds
  • --ulimit nofile=1024000: 1.00 seconds
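
The same workaround can also be applied from inside the process, without any docker run flags, by lowering the soft limit before spawning. A minimal sketch (the file name and the 1024 value are just examples; this assumes the spawning code sizes its fd-close loop from the soft limit):

workaround_bench.py:

import resource
import time

import pexpect

# Drop only the soft nofile limit for this process; the hard limit stays put.
# Assumes the hard limit is at least 1024 (it is 1048576 in the problematic case).
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
resource.setrlimit(resource.RLIMIT_NOFILE, (1024, hard))

s = time.time()
pexpect.spawn('whoami')
print(time.time() - s)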

Why the nofile limit affects performance so much, I don't know.

It doesn't seem to ultimately be a Docker bug (maybe the kernel is doing something crazy that should be reported/fixed?), but whichever component generates the runc spec with such a high default value (1048576) seems to be causing the trouble. Or is it simply inheriting the limit from somewhere else?

Perhaps due to packaging differences, Docker 18.06 and 18.09 (at least on CentOS 7) use different nofile limits, and that yields wildly different results.

Even on the same CentOS 7 machine, starting the docker systemd service at different times seems to "randomly" make it inherit different limits. If you're lucky, you get a reasonably small limit of 65536, but after a docker restart it's usually 16 times higher (1048576).


I haven't checked what the nofile limit situation is on other distros, but given my earlier finding that "it's always slow on Ubuntu, no matter the Docker version", I'd imagine it's a high value there.

Perhaps CentOS just happened to have a lower nofile limit, at least in the past.

@thaJeztah
Member

Even on the same CentOS 7 machine, starting the docker systemd service at different times seems to "randomly" make it inherit different limits. If you're lucky, you get a reasonably small limit of 65536, but after a docker restart it's usually 16 times higher (1048576).

That's odd. By default, the container should inherit the containerd configuration (Docker 18.09) or the dockerd configuration (Docker 18.06 and lower); on Docker 18.09 and up, containerd runs as a separate service and is no longer a child process of dockerd:

ps auxf

...
root      3263  0.3  4.2 670856 42884 ?        Ssl  Jan03  29:08 /usr/bin/containerd
root     19957  0.2  0.4  10616  4892 ?        Sl   14:33   0:00  \_ containerd-shim -namespace moby -workdir /var/lib/containerd/io.containerd.runtime.v1.linux/moby/399e534fb00189cb745eeca24b36f49ece449e7075aadf413844c291f445d126 -address /run/conta
root     19981  1.0  0.0   1296     4 pts/0    Ss+  14:33   0:00      \_ sh
...

containerd sets a 1048576 nofile limit (effectively "maximum"/infinity) to avoid the accounting overhead in the kernel. (I opened a tracking issue to review those configs though; #73)

systemctl cat containerd.service
# /lib/systemd/system/containerd.service
[Unit]
Description=containerd container runtime
Documentation=https://containerd.io
After=network.target

[Service]
ExecStartPre=/sbin/modprobe overlay
ExecStart=/usr/bin/containerd
KillMode=process
Delegate=yes
LimitNOFILE=1048576
# Having non-zero Limit*s causes performance problems due to accounting overhead
# in the kernel. We recommend using cgroups to do container-local accounting.
LimitNPROC=infinity
LimitCORE=infinity
TasksMax=infinity

[Install]
WantedBy=multi-user.target

So I'm not sure how it could be switching between 65536 (which may be the default on the host) and 1048576 (the default for the container).

@kolyshkin @crosbymichael PTAL

@spantaleev
Author

spantaleev commented Jan 9, 2019

On a brand new server, after installing Docker and starting it for the first time (systemctl start docker), I see the following systemd service status (systemctl status docker):

● docker.service - Docker Application Container Engine
   Loaded: loaded (/usr/lib/systemd/system/docker.service; disabled; vendor preset: disabled)
   Active: active (running) since Wed 2019-01-09 16:42:26 CET; 2s ago
     Docs: https://docs.docker.com
 Main PID: 3359 (dockerd)
    Tasks: 18
   Memory: 131.5M
   CGroup: /system.slice/docker.service
           ├─3359 /usr/bin/dockerd -H unix://
           └─3380 containerd --config /var/run/docker/containerd/containerd.toml --log-level info

With that, grep 'Max open files' /proc/<PID>/limits says 65536 for both processes (PIDs 3359 and 3380 in this case).

systemctl status containerd says:

● containerd.service - containerd container runtime
   Loaded: loaded (/usr/lib/systemd/system/containerd.service; disabled; vendor preset: disabled)
   Active: active (running) since Wed 2019-01-09 16:42:25 CET; 9s ago
     Docs: https://containerd.io
  Process: 3358 ExecStartPre=/sbin/modprobe overlay (code=exited, status=0/SUCCESS)
 Main PID: 3363 (containerd)
    Tasks: 9
   Memory: 59.8M
   CGroup: /system.slice/containerd.service
           └─3363 /usr/bin/containerd

That process (/usr/bin/containerd) has a limit of 1048576.

pstree -a -l shows that the tree looks something like this:

  ├─containerd
  │   └─8*[{containerd}]
  │...........
  │...........
  ├─dockerd -H unix://
  │   ├─containerd --config /var/run/docker/containerd/containerd.toml --log-level info
  │   │   └─8*[{containerd}]
  │   └─8*[{dockerd}]


If we restart Docker now (systemctl restart docker), the tree starts to look like this:

  ├─containerd
  │   └─8*[{containerd}]
  │...........
  │...........
  ├─dockerd -H unix://
  │   └─7*[{dockerd}]

Since docker.service requires (well, BindsTo) containerd.service, and containerd.service gets started as part of docker.service's initialization, I'm guessing that containerd inherits docker.service's limits at first.

After stopping/restarting docker.service, containerd probably gets reparented, and that's why different limits start applying?

@thaJeztah
Member

Thanks for that. Hm, I have a hunch about what may be happening.

I suspect there may be a race condition: when the dockerd daemon starts, it looks for a running containerd service by checking for its socket. If no socket is found, it starts containerd as a child process (which is how it was started in Docker 18.06 and before).

@seemethere @crosbymichael ^^

@thaJeztah
Member

Let me spin up a clean machine and see if I can reproduce that. (It's a bit orthogonal to the original issue, but it of course contributes if the limits differ between the two situations.)

@kolyshkin

I think I saw a similar thing earlier. The value of RLIMIT_NOFILE is used in a loop to close all (potentially opened) file descriptors which could leak to a child (before doing fork/exec)...

Nope, it was not close() but fcntl(fd, F_SETFD, FD_CLOEXEC), but it's still the same issue. Here it is, an issue in RPM: rpm-software-management/rpm#444

Now, there might be similar code in runc and/or in any software that runs inside a container. Just strace it (inside and out) and look for massive amounts of close or fcntl calls. If you find it, the solution is similar to the one in the above rpm issue (see the second commit).
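
For illustration, the shape of that fix is to touch only the descriptors that are actually open (listed under /proc/self/fd) instead of every possible descriptor up to RLIMIT_NOFILE. A rough Python 3 sketch of the idea (not the actual rpm or runc code; the function name is made up):

import fcntl
import os

def set_cloexec_on_open_fds(keep=(0, 1, 2)):
    # Walk only the fds that really exist instead of looping up to the limit.
    for name in os.listdir('/proc/self/fd'):
        fd = int(name)
        if fd in keep:
            continue
        try:
            flags = fcntl.fcntl(fd, fcntl.F_GETFD)
            fcntl.fcntl(fd, fcntl.F_SETFD, flags | fcntl.FD_CLOEXEC)
        except OSError:
            pass  # e.g. the fd used to list /proc/self/fd is already gone

set_cloexec_on_open_fds()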

@kolyshkin

@spantaleev if you have some time to play with it, see my previous comment ^^^^

@maetthu

maetthu commented Jan 16, 2019

I think I saw a similar thing earlier. The value of RLIMIT_NOFILE is used in a loop to close all (potentially opened) file descriptors which could leak to a child (before doing fork/exec)...

@kolyshkin I can reproduce the loop that closes fds with pycompile, e.g. by installing a random Python package:

$ docker run --rm -it ubuntu:bionic
root@a0eaaa3bff90:/# apt-get update
[...]
root@a0eaaa3bff90:/# apt install python-pkg-resources
[...]
Setting up python-pkg-resources (39.0.1-2) ...

The setup command /usr/bin/python2 /usr/bin/pycompile -p python-pkg-resources just loops through RLIMIT_NOFILE fds, which, if the limit is set to infinity in the systemd unit, hogs the CPU for quite a while:

root@8fff674cd310:/# grep "open files" /proc/1/limits
Max open files            1073741816           1073741816           files     

Corresponding strace:

[...]
close(616283354)                        = -1 EBADF (Bad file descriptor)
close(616283355)                        = -1 EBADF (Bad file descriptor)
close(616283356)                        = -1 EBADF (Bad file descriptor)
close(616283357)                        = -1 EBADF (Bad file descriptor)
[...]
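
The cost of such a loop scales directly with the limit. A small standalone sketch (Python 3, purely illustrative) that times the per-fd close loop and contrasts it with enumerating only the fds that actually exist:

import os
import resource
import time

soft, _hard = resource.getrlimit(resource.RLIMIT_NOFILE)
if soft == resource.RLIM_INFINITY:
    soft = 2**20  # fall back to an illustrative bound if the limit is unlimited

s = time.time()
for fd in range(3, soft):
    try:
        os.close(fd)  # almost every call fails with EBADF, as in the strace above
    except OSError:
        pass
print("close loop up to %d fds: %.3f sec" % (soft, time.time() - s))

s = time.time()
open_fds = os.listdir('/proc/self/fd')  # only the fds that are really open
print("listing /proc/self/fd (%d entries): %.6f sec" % (len(open_fds), time.time() - s))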

@kolyshkin

kolyshkin commented Jan 17, 2019

I have a fix for python 2.7: python/cpython#11584

Python 3 apparently solved this -- tested on 3.7.2 and 3.6.7 with the following small program:

import subprocess
subprocess.check_call(["/bin/true"], close_fds=True)

(Looking at the Python 3 source code, I see they also get the list of fds from /proc/self/fd.)

But when using the pexpect module on Python 3, unfortunately the slowness is still there. I still need to investigate it; I ran out of time tonight.

@spantaleev
Author

Sounds like great progress! Thanks for tracking it down that far!

It appears PyPy is similarly affected.

docker run -it --rm pypy:2-6.0.0 pypy -c $'import subprocess\nimport time\ns = time.time()\nsubprocess.check_call([\'/bin/true\'], close_fds=True)\nprint(time.time() - s)\n'
0.35715508461

docker run -it --rm pypy:3-6.0.0 pypy3 -c $'import subprocess\nimport time\ns = time.time()\nsubprocess.check_call([\'/bin/true\'], close_fds=True)\nprint(time.time() - s)\n'
0.001934051513671875

Looking at PyPy's subprocess.py history, it seems like they keep the stdlib up to date with CPython, so your patch should eventually trickle down there as well.

@kolyshkin

I have a fix for python 2.7: python/cpython#11584

It was rejected (Python 2.7 fixes are not accepted unless they are critical or security-related). They also pointed to https://github.com/google/python-subprocess32, which is a backport of the relevant functionality from Python 3 (which uses /proc/self/fd).
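
So for code stuck on Python 2, that backport can serve as a drop-in workaround. A sketch (assumes pip install subprocess32; per the pointer above, it handles close_fds the way Python 3 does):

import time

import subprocess32 as subprocess  # pip install subprocess32

s = time.time()
subprocess.check_call(["/bin/true"], close_fds=True)
print(time.time() - s)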

Anyway, it looks like Python 2 is WONTFIX and Python 3 is good, except for pexpect/ptyprocess. I have filed an issue for that: pexpect/ptyprocess#50; let's see what the maintainers say.

@spantaleev
Author

spantaleev commented Jan 21, 2019

Thanks for all your hard work on this, @kolyshkin! 🙇‍♂️

Looks like the fixes are hitting some problems here and there (WONTFIX for Python 2, pexpect/ptyprocess hard to fix), but at least the issue has been found and reported upstream, where it belongs.

I guess this issue can be closed now.

@kolyshkin

By the way, a workaround is to have something like this:

kir@kd:~/go/src/github.com/docker/docker$ cat /etc/docker/daemon.json-ulimits 
{
	"default-ulimits": {
		"nofile": {
			"Name": "nofile",
			"Hard": 1024,
			"Soft": 1024
		},
		"nproc": {
			"Name": "nproc",
			"Soft": 65536,
			"Hard": 65536
		}
	}
}
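
After restarting the daemon, one quick way to confirm the new defaults took effect is to check the limit from inside a container, e.g. with a small Python check (the image and file names here are just examples):

# check_nofile.py -- print the nofile limit the container actually got, e.g.:
#   docker run --rm -v "$PWD":/work -w /work python:3 python check_nofile.py
import resource

soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print("RLIMIT_NOFILE: soft=%d hard=%d" % (soft, hard))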
