Spawning PTY processes is many times slower on Docker 18.09 #502
TL;DR: the culprit is the nofile (RLIMIT_NOFILE) limit that the container is started with.

I've made some progress figuring this out. After finding out that podman is also always affected by this same performance problem, I dug in deeper. I figured it may be a bug in runc or something like that. Because it seems a little difficult to see exactly how runc gets invoked when coming from Docker, I proceeded with testing using podman. I managed to start a container with podman and find the runc invocation it used. After exploding the problematic container to make an OCI bundle, I could run it with runc directly and test it with different spec files. Comparing the differences, I found out that the part that affects performance is the rlimits section, specifically the nofile limit. By default (as generated by runc spec), that limit is far lower than what the container gets under Docker 18.09.

Workaround & Benchmarks

It seems like this limit is easily controllable using a ulimit setting. I've since found out that low values for the nofile limit lead to good performance when dealing with PTYs, and that the higher the value, the slower it gets. I'll provide a bunch of benchmark results below.
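For reference, a minimal sketch of how this correlation can be measured from Python. This is not the bench.py attached to this issue; it assumes the third-party ptyprocess package is installed and that the process is allowed to lower its own limit.

```python
# Illustrative benchmark (not the original bench.py): time PTY spawns at two
# nofile limits. The current (possibly huge) limit is tried first, because an
# unprivileged process may lower, but not raise, its own hard limit.
import resource
import time

from ptyprocess import PtyProcess  # assumption: ptyprocess is installed

def time_spawns(n=20):
    start = time.time()
    for _ in range(n):
        proc = PtyProcess.spawn(["/bin/true"])  # spawn a trivial program on a PTY
        proc.wait()
        proc.close()
    return time.time() - start

_, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
for nofile in (hard, 1024):
    resource.setrlimit(resource.RLIMIT_NOFILE, (nofile, nofile))
    print("nofile=%-8d %.2fs for 20 spawns" % (nofile, time_spawns()))
```

On an affected host, the run with the lower limit should come out dramatically faster.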
Why the nofile limit affects PTY spawning at all, I don't know. It doesn't seem like it's ultimately a Docker bug (maybe the kernel is doing something crazy and that should be reported/fixed?), but whichever part generates the runc spec file with such high default values should probably be changed. Perhaps due to packaging differences, Docker 18.06 and 18.09 (at least on CentOS 7) end up with different limits. Even on the same CentOS 7 machine, starting the daemons in different ways seems to produce different limits. I haven't checked where the value ultimately comes from in each case. Perhaps CentOS luckily had a lower limit with 18.06, which is why it was fast there.
That's odd. By default, the container should inherit its limits from the containerd configuration (docker 18.09) or from the docker daemon itself (docker 18.06).

Containerd has LimitNOFILE=1048576 set in its systemd unit:

```
systemctl cat containerd.service
# /lib/systemd/system/containerd.service
[Unit]
Description=containerd container runtime
Documentation=https://containerd.io
After=network.target
[Service]
ExecStartPre=/sbin/modprobe overlay
ExecStart=/usr/bin/containerd
KillMode=process
Delegate=yes
LimitNOFILE=1048576
# Having non-zero Limit*s causes performance problems due to accounting overhead
# in the kernel. We recommend using cgroups to do container-local accounting.
LimitNPROC=infinity
LimitCORE=infinity
TasksMax=infinity
[Install]
WantedBy=multi-user.target
```

So not sure how it could be switching between the two. @kolyshkin @crosbymichael PTAL
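One way to check which value a given container actually inherited is to read the limit from inside it; a minimal sketch (how you run it, e.g. via docker exec, is up to you):

```python
# Sketch: print the nofile limit of the current process and of the container's
# PID 1, to check what a container inherited from dockerd/containerd.
import resource

soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print("this process: soft=%s hard=%s" % (soft, hard))

with open("/proc/1/limits") as f:  # PID 1 inside the container's PID namespace
    for line in f:
        if line.startswith("Max open files"):
            print("pid 1 (init):", line.split()[3:5])
```

Running this in a container created right after boot and in one created after a daemon restart should show whether the inherited value really differs between the two situations.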
On a brand new server, after installing Docker and starting it for the first time, the daemon processes come up with one set of limits, and containers started at that point inherit that nofile value. If we restart Docker now (stop the services and start them again), the processes come up with different limits, and containers started afterwards inherit a different nofile value. Since containers inherit these limits from the process that creates them, that would explain how the same machine can show both the fast and the slow behaviour. After stopping/restarting the services, the switch is reproducible.
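To compare the two situations from the host side, the limits of the daemon processes themselves can be inspected; a rough sketch, assuming the processes are named dockerd/containerd (containerd-shim shares the prefix and is matched as well):

```python
# Sketch: print the "Max open files" limit of every dockerd/containerd process,
# to compare what the daemons run with after a first start vs. after a restart.
import os

DAEMON_PREFIXES = ("dockerd", "containerd")  # illustrative process names

for pid in filter(str.isdigit, os.listdir("/proc")):
    try:
        with open("/proc/%s/comm" % pid) as f:
            comm = f.read().strip()
        if not comm.startswith(DAEMON_PREFIXES):
            continue
        with open("/proc/%s/limits" % pid) as f:
            for line in f:
                if line.startswith("Max open files"):
                    print(pid, comm, "soft/hard:", line.split()[3:5])
    except OSError:  # the process may have exited in the meantime
        continue
```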
Thanks for that. Hm... I have a hunch about what may be happening: I suspect there may be a race condition in how the services are brought up, which could leave them running with different limits.
Let me spin up a clean machine and see if I can reproduce that. (It's a bit orthogonal to the original issue, but it of course contributes to it if the limits end up different in the two situations.)
I think I saw a similar thing earlier. The value of RLIMIT_NOFILE is used in a loop to close all (potentially opened) file descriptors which could leak to a child (before doing fork/exec)... Nope, that was not it. Still, there might be similar code in runc and/or in any software that runs inside the container. Just strace it (inside and out) and look for massive amounts of close() calls.
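For illustration, the pattern described here looks roughly like the following generic sketch (not code taken from runc or from any particular package):

```python
# Generic sketch of the problematic pattern: before exec'ing a child, close
# every descriptor that *could* be open, which costs one close() syscall per
# possible fd, i.e. RLIMIT_NOFILE of them, whether they are open or not.
import os
import resource

def close_all_fds_naive(first_fd=3):
    max_fd = resource.getrlimit(resource.RLIMIT_NOFILE)[0]  # 1048576 under containerd's unit above
    for fd in range(first_fd, max_fd):
        try:
            os.close(fd)
        except OSError:  # the vast majority are not open and fail with EBADF
            pass
```

With nofile at 1024 this is about a thousand close() calls per spawn; with the 1048576 from containerd's unit file it is over a million, which is exactly the kind of syscall storm strace makes visible.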
@spantaleev if you have some time to play with it, see my previous comment ^^^^
@kolyshkin I can reproduce the loop to close fds with pycompile, e.g. by installing a random Python package:
The setup command is the slow part.
Corresponding strace:
I have a fix for python 2.7: python/cpython#11584

Python3 apparently solved this already -- tested on 3.7.2 and 3.6.7 with the following proggy:

```python
import subprocess
subprocess.check_call(["/bin/true"], close_fds=True)
```

(By looking at the source code of python3, I see they are also getting the fd list from /proc/self/fd.) But when I'm using pexpect/ptyprocess, the massive close() loop is still there.
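For comparison, a sketch of the /proc/self/fd approach mentioned above: only descriptors that actually exist are visited, so the cost no longer depends on RLIMIT_NOFILE.

```python
# Sketch of the /proc/self/fd approach: close only the descriptors that are
# actually open instead of iterating up to RLIMIT_NOFILE.
import os

def close_open_fds(keep=(0, 1, 2)):
    for name in os.listdir("/proc/self/fd"):
        fd = int(name)
        if fd in keep:
            continue
        try:
            os.close(fd)
        except OSError:  # the fd used to read the directory itself is already gone
            pass
```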
Sounds like great progress! Thanks for tracking it down that far! It appears PyPy is similarly affected.
Looking at PyPy's subprocess.py history, it seems like they keep the stdlib up to date with CPython, so your patch should eventually trickle down there as well.
It was rejected (Python 2.7 fixes are not accepted unless it's something critical or security-related). They have also pointed me to https://github.com/google/python-subprocess32, which is a backport of the relevant functionality from Python3 (and which uses /proc/self/fd). Anyway, it looks like Python2 is WONTFIX, and Python3 is good, except for pexpect/ptyprocess. I have filed an issue for that: pexpect/ptyprocess#50; let's see what the maintainers say.
Thanks for all your hard work on this, @kolyshkin! 🙇♂️ Looks like it is hitting some problems here and there (WONTFIX Python 2, hard to fix pexpect/ptyprocess), but at least the issue has been found and reported upstream, where it belongs. I guess this issue can be closed now.
By the way, a workaround is to have something like this:

```
kir@kd:~/go/src/github.com/docker/docker$ cat /etc/docker/daemon.json-ulimits
{
  "default-ulimits": {
    "nofile": {
      "Name": "nofile",
      "Hard": 1024,
      "Soft": 1024
    },
    "nproc": {
      "Name": "nproc",
      "Soft": 65536,
      "Hard": 65536
    }
  }
}
```
Expected behavior
Spawning processes which use a pseudo TTY is consistently fast.
Actual behavior
Sometimes, spawning processes that use a pseudo TTY inside a container is many times slower than usual.
The first time I encountered this problem (on a really slow machine), the time to spawn such a process in a container jumped from something reasonable (~0.5 seconds) to something much slower (8-10 seconds).
On faster machines, it's still a reproducible problem (see below). It's much less noticeable as we're talking about smaller numbers, but there's at least a factor of 4-5x slowdown (process spawning goes from taking 0.1 seconds to 0.5+ seconds).
Steps to reproduce the behavior
As described in Additional environment details (below), I've managed to reproduce it on various CentOS 7 machines (I've tried Hetzner Cloud servers and Digital Ocean droplets and it's reproducible on both).
Before describing the steps, let me first define a few files for a "benchmark" Docker image that would be used to illustrate the performance problem.
bench.py:
bench.sh:
Steps to reproduce on a clean CentOS 7 install:
Output of `docker version`:

Output of `docker info`:

Additional environment details (AWS, VirtualBox, physical, etc.)
I have first encountered this problem on a CentOS 7.5 KVM machine running on Proxmox.
I have since also managed to reproduce it using CentOS 7 on:
As the flow for reproducing in Steps to reproduce on a clean CentOS 7 install (above) says, this is only an issue with Docker 18.09. Downgrading to 18.06 makes the issue go away.
I have tried to reproduce the problem on Ubuntu 18.04 (LTS), but I haven't managed to.
Or maybe I have... Actually, the performance I managed to achieve on Ubuntu 18.04 matches the slow benchmarks from CentOS 7 + Docker 18.09. Even if I downgrade Docker to 18.06 on Ubuntu 18.04, it's still just as slow. In summary, it's always slow on Ubuntu regardless of the Docker version, so maybe there is a problem there too.
Additional debugging output (strace):
`bench.sh` (above) also contains a line to run the same benchmark with `strace -r`. I've received the following results (the last part is the most interesting, as it shows where the slowdown is).

Slowness at `SIGCHLD`:

or slowness at `close()`: