
Docker daemon threads can become affixed to a container's namespace, causing network errors later when the container is stopped #1113

Closed
alindeman opened this issue Apr 14, 2016 · 12 comments


@alindeman
Contributor

Background

I'm able to consistently reproduce an issue where some operating system threads of the Docker daemon start in a container's network namespace and stay in that namespace indefinitely. If or when the Go runtime later decides to schedule code on one of these threads, that network namespace may be in a broken state, especially if the container has since been torn down. In many instances I've observed, the network namespace no longer has any routes in its route table.

The symptoms can manifest as docker commands on the host (e.g., docker pull or docker push) erroring with network is unreachable or unknown host.

Requisite Version & System Information

[root@devenv proc]# docker info
Containers: 31
 Running: 0
 Paused: 0
 Stopped: 31
Images: 180
Server Version: 1.11.0
Storage Driver: devicemapper
 Pool Name: docker-8:16-67174528-pool
 Pool Blocksize: 65.54 kB
 Base Device Size: 10.74 GB
 Backing Filesystem: xfs
 Data file: /dev/loop0
 Metadata file: /dev/loop1
 Data Space Used: 10.86 GB
 Data Space Total: 107.4 GB
 Data Space Available: 23.19 GB
 Metadata Space Used: 19.17 MB
 Metadata Space Total: 2.147 GB
 Metadata Space Available: 2.128 GB
 Udev Sync Supported: true
 Deferred Removal Enabled: false
 Deferred Deletion Enabled: false
 Deferred Deleted Device Count: 0
 Data loop file: /var/lib/docker/devicemapper/devicemapper/data
 WARNING: Usage of loopback devices is strongly discouraged for production use. Either use `--storage-opt dm.thinpooldev` or use `--storage-opt dm.no_warn_on_loop_devices=true` to suppress this warning.
 Metadata loop file: /var/lib/docker/devicemapper/devicemapper/metadata
 Library Version: 1.02.107-RHEL7 (2015-10-14)
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
 Volume: local
 Network: null host bridge
Kernel Version: 3.10.0-327.el7.x86_64
Operating System: CentOS Linux 7 (Core)
OSType: linux
Architecture: x86_64
CPUs: 1
Total Memory: 3.703 GiB
Name: devenv
ID: XEWN:LJUD:FPKB:6Q4A:24K2:UIXX:DIDD:OQSO:D6PH:AT45:GNN6:RC7X
Docker Root Dir: /var/lib/docker
Debug mode (client): false
Debug mode (server): false
Registry: https://index.docker.io/v1/

Steps to Reproduce

I find that it's easiest to reproduce when the Docker daemon has been started afresh, when the Go runtime has created only a small number of operating system threads.

I also find that it's easiest to reproduce when starting a decent number of containers all at once. I threw together a docker-compose.yml file that starts 10 Redis containers:

version: "2"
services:
  redis0:
    image: redis:latest
  redis1:
    image: redis:latest
  redis2:
    image: redis:latest
  redis3:
    image: redis:latest
  redis4:
    image: redis:latest
  redis5:
    image: redis:latest
  redis6:
    image: redis:latest
  redis7:
    image: redis:latest
  redis8:
    image: redis:latest
  redis9:
    image: redis:latest

Start all the containers at once with:

docker-compose up -d

Tear down the containers and the network with:

docker-compose down

Navigate to /proc/<docker pid>/task and view the network namespace for each thread:

[root@devenv task]# ls -al */ns/net
lrwxrwxrwx. 1 root root 0 Apr 14 02:01 6382/ns/net -> net:[4026531956]
[ ... elided ... ]
lrwxrwxrwx. 1 root root 0 Apr 14 02:01 6992/ns/net -> net:[4026531956]
lrwxrwxrwx. 1 root root 0 Apr 14 02:01 7044/ns/net -> net:[4026533029]

If the bug has manifested, there will be a thread or two in a different namespace from the rest. In my example, thread 7044 is in a namespace previously used by one of the containers, but since the container has been killed and the network torn down, it's no longer functional:

[root@devenv task]# nsenter --net=7044/ns/net ip route
[root@devenv task]# nsenter --net=7044/ns/net ip link
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN mode DEFAULT
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
[root@devenv task]# nsenter --net=7044/ns/net ping google.com
ping: unknown host google.com

The bug doesn't always manifest as there is non-determinism involved (which I try to explain later on). Sometimes I need to restart the Docker daemon and try again to get it to manifest.

Because thread 7044 (or any thread like it) is running as part of the Docker daemon and on top of the Go runtime, the Go runtime is free to schedule code to execute on it. If it does, for instance to handle a Remote API request to pull an image, that code will fail with a network error.

Analysis

I believe these rogue threads are created when the libnetwork code starts the embedded DNS server. I'll do my best to step through the code path I believe to be problematic.

When a new container in a non-default network is created, the sandbox.SetKey function is invoked to, among other duties, set up the DNS resolver. Specifically, the function returned by sb.resolver.SetupFunc() is invoked within the network namespace of the container in order to start the resolver.

libnetwork provides functions like InvokeFunc and nsInvoke to execute code in a given network namespace.

nsInvoke invokes the function provided as its first argument just before switching namespaces, then switches namespaces, then invokes the function provided as its second argument (now within the new namespace), and finally switches back to the original namespace. Because the Go runtime could ordinarily preempt the code and start execution of a different piece of code on the same operating system thread, nsInvoke calls runtime.LockOSThread() to make sure the current goroutine is the only one that can use the operating system thread. This prevents other code from inadvertently executing in the network namespace while the operating system thread is switched over to the container namespace.
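
For reference, here is a simplified sketch of what nsInvoke effectively does (paraphrased, not the exact libnetwork code: the real function takes separate "before" and "inside" callbacks, and I'm using golang.org/x/sys/unix for setns here for brevity):

package nsutil

import (
	"fmt"
	"runtime"

	"golang.org/x/sys/unix"
)

// nsInvoke runs inNS with the current OS thread joined to the network
// namespace referenced by nsPath (e.g. /var/run/docker/netns/<id>).
func nsInvoke(nsPath string, inNS func() error) error {
	runtime.LockOSThread() // pin this goroutine to its OS thread
	defer runtime.UnlockOSThread()

	origNS, err := unix.Open("/proc/self/ns/net", unix.O_RDONLY, 0)
	if err != nil {
		return err
	}
	defer unix.Close(origNS)

	targetNS, err := unix.Open(nsPath, unix.O_RDONLY, 0)
	if err != nil {
		return err
	}
	defer unix.Close(targetNS)

	// Switch this OS thread into the container's network namespace.
	if err := unix.Setns(targetNS, unix.CLONE_NEWNET); err != nil {
		return err
	}
	// Everything executed here runs while the thread is inside the
	// container's namespace.
	errInNS := inNS()

	// Switch the thread back to the host's namespace before unlocking.
	if err := unix.Setns(origNS, unix.CLONE_NEWNET); err != nil {
		return fmt.Errorf("restoring namespace: %v (inner error: %v)", err, errInNS)
	}
	return errInNS
}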

However, I've determined in my analysis that it is also not safe to start a new goroutine in the critical section where the namespace is switched. Starting a new goroutine in that region can prompt the Go runtime to start a new operating system thread, cloned from the current one (so it inherits the current network namespace), if the Go scheduler doesn't have any other available threads to run code on.

And, unfortunately, there are pieces of code that (inadvertently?) start goroutines during the setup of the embedded DNS server. Specifically, the resolver.SetupFunc function invokes iptables.RawCombinedOutputNative. Following the call chain, we arrive at some code that invokes exec.Command(...).CombinedOutput().

The exec package is implemented in Go and is pretty easy to read. The CombinedOutput method invokes the Run method, which in turn invokes the Start method. And within the Start method, several goroutines are spawned to monitor for errors (via the errch channel).

Unfortunately these goroutines are spawned during the time when the network namespace is switched; and when a goroutine is spawned, the Go runtime is free to clone(2) a new operating system thread to run it if it doesn't have any other threads available. And when the Docker daemon is early in its lifetime, I believe this can happen easily.
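
To make that concrete, here is a small standalone toy (not Docker code) that mimics the pattern: lock the thread the way nsInvoke does, run a command via os/exec, and compare the runtime's threadcreate profile before and after. Early in a process's life the count may increase, meaning a brand-new OS thread was cloned while the command's helper goroutines ran; in the daemon, that clone would happen while the parent thread was still inside the container namespace:

package main

import (
	"fmt"
	"os/exec"
	"runtime"
	"runtime/pprof"
)

func main() {
	runtime.LockOSThread() // pin this goroutine, as nsInvoke does
	defer runtime.UnlockOSThread()

	before := pprof.Lookup("threadcreate").Count()

	// CombinedOutput -> Run -> Start; Start spawns goroutines to copy the
	// child's output and report errors. If no idle OS thread is available,
	// the runtime clones a new one to run them.
	if _, err := exec.Command("true").CombinedOutput(); err != nil {
		fmt.Println("exec failed:", err)
	}

	after := pprof.Lookup("threadcreate").Count()
	fmt.Printf("OS threads created around the exec: %d\n", after-before)
}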

To verify this was the issue, I ran sysdig to follow the syscalls that led up to a thread being spawned that I observed was 'stuck' in the container namespace:

First, the setns call is invoked to switch namespaces on a parent thread, 29917:

17426 02:35:02.592507436 0 docker (29917) > setns fd=67(<f>/var/run/docker/netns/14d91abe4631) nstype=16(CLONE_NEWNET)
17427 02:35:02.592507720 0 docker (29917) < setns res=0

The thread reports that it is invoking iptables:

17457 02:35:02.592936615 0 docker (29917) < write res=177 data=time="2016-04-14T02:35:02.592919131Z" level=debug msg="/usr/sbin/iptables, [--wa
17458 02:35:02.592939980 0 docker (29917) > openat dirfd=-100(ENETDOWN) name=/dev/null flags=4097(O_RDONLY|O_CLOEXEC) mode=0

It does a typical clone/execve procedure to start iptables. This thread's image is replaced by iptables, so is of no concern:

17492 02:35:02.600341433 0 docker (30070) < clone res=0 exe=/usr/bin/docker args=daemon.-H.fd://.-H.tcp://0.0.0.0:2375.--log-level=debug. tid=30070(docker) pid=30070(docker) ptid=29917(docker) cwd=/ fdlimit=1048576 pgft_maj=0 pgft_min=0 vm_si
ze=765528 vm_rss=24336 vm_swap=0 comm=docker cgroups=cpuset=/.cpu_cgroup=/system.slice/docker.service.cpuacct=/system.slice/docker... flags=0 uid=0 gid=0 vtid=30070(docker) vpid=30070(docker)
17493 02:35:02.600354571 0 docker (30070) > dup fd=70(<f>/dev/null)
17494 02:35:02.600355669 0 docker (30070) < dup res=0(<f>/dev/null)
17495 02:35:02.600355937 0 docker (30070) > dup fd=72(<p>)
17496 02:35:02.600356455 0 docker (30070) < dup res=1(<p>)
17497 02:35:02.600356626 0 docker (30070) > dup fd=72(<p>)
17498 02:35:02.600356737 0 docker (30070) < dup res=2(<p>)
17499 02:35:02.600357022 0 docker (30070) > execve

Next, however, a new operating system thread is created as part of the goroutine(s) spun up in the exec.Start method:

17522 02:35:02.602428507 0 docker (29917) < clone res=30077(docker) exe=/usr/bin/docker args=daemon.-H.fd://.-H.tcp://0.0.0.0:2375.--log-level=debug. tid=29917(docker) pid=29717(docker) ptid=1(systemd) cwd=/ fdlimit=1048576 pgft_maj=0 pgft_min=371 vm_size=773724 vm_rss=36432 vm_swap=0 comm=docker cgroups=cpuset=/.cpu_cgroup=/system.slice/docker.service.cpuacct=/system.slice/docker... flags=47619(CLONE_FILES|CLONE_FS|CLONE_PARENT_SETTID|CLONE_SIGHAND|CLONE_SYSVSEM|CLONE_THREAD|CLONE_VM) uid=0 gid=0 vtid=29917(docker) vpid=29717(docker)

This thread, 30077, is started as a regular course of the Go runtime. It's not used to execve another program: it's used to send to the errch channel in exec.Start. When the goroutine finishes, the Go runtime is free to use the operating system thread for another purpose. And because it was started during the time when its parent thread had changed its namespace, it's more-or-less permanently stuck in the container namespace.

The main thread, 29917, goes on to setns itself back to the original namespace:

18586 02:35:02.620067754 0 docker (29917) > setns fd=11(<f>net:[4026531956]) nstype=16(CLONE_NEWNET)
18587 02:35:02.620069887 0 docker (29917) < setns res=0

All is well for thread 29917. But, no such setns happens with thread 30077 because the Go runtime abstracts threads from the program. As far as I know, there's no way to even be aware of a new operating system thread being spawned.

If and when thread 30077 is used to run other code in the normal course of Docker daemon operation, that code will run in the network context of the container. When and if the container and/or network is destroyed, the route table for that network namespace will also be destroyed, leaving it with only a loopback interface and an empty route table (shown above in the nsenter commands). At that point, any code running on thread 30077 will be unable to reach the local network or the Internet, so it will return errors for operations like docker pull or docker push.

Because of the nature of the Go runtime and scheduler, these errors will only occur sporadically, only if the code is scheduled on an affected thread.

Thanks to @lstoll and @wfarr for helping me debug this issue.

@thaJeztah
Member

Thanks @alindeman @lstoll for the research and a very precise bug report ❤️

@lstoll

lstoll commented Apr 14, 2016

I spent some time thinking about this and poking around the Go runtime, and it seems like this is unavoidable; there are no controls exposed for managing OS thread creation. Judging by things like golang/go#1435 the prevailing attitude is "functionality should be added to go" as opposed to "more controls should be exposed for those using syscalls".

Based on this, I think a potential solution could be passing channels to the function called by InvokeFunc. Rather than the function doing its own execs, it would pass the commands in to this channel, and they would get run in a goroutine outside the locked OS thread. We can use SysProcAttr.Cloneflags to set the namespace for the command being run. This should avoid any further thread creation from inside the namespaced thread, but provide a way to run commands inside the namespace.
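
A rough sketch of the idea (names are illustrative, and the part that places the child process into the container namespace, e.g. via SysProcAttr, is left out here):

package nsexec

import "os/exec"

// cmdRequest is a command to be run outside the namespace-locked thread.
type cmdRequest struct {
	args  []string
	reply chan error
}

// StartCmdWorker starts a worker goroutine before any namespace switch
// happens; the exec and its internal helper goroutines then run outside
// the locked, namespace-switched thread.
func StartCmdWorker() chan<- cmdRequest {
	reqs := make(chan cmdRequest)
	go func() {
		for req := range reqs {
			_, err := exec.Command(req.args[0], req.args[1:]...).CombinedOutput()
			req.reply <- err
		}
	}()
	return reqs
}

// RunViaWorker is what the function passed to InvokeFunc would call instead
// of exec'ing directly while the namespace is switched.
func RunViaWorker(reqs chan<- cmdRequest, args ...string) error {
	reply := make(chan error, 1)
	reqs <- cmdRequest{args: args, reply: reply}
	return <-reply
}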

I'd be happy to give this a shot, I wanted to put it out there to see if anyone had other proposed solutions.

@sanimej

sanimej commented Apr 14, 2016

@lstoll @alindeman Thanks for the detailed analysis. As you observed this is a fundamental limitation with Golang because of the lack of control in the OS thread creation/management. libnetwork code is particularly exposed to this problem because we have to switch to the containers' net namespace to do the networking plumbing. We do have a mechanism to protect against this, which helps in many cases. Please see the usages of InitOSContext in libnetwork.

I am trying out a solution, using the reexec option Docker has (it basically spawns a new Docker process, to do just some specific operations). This guarantees that the Docker daemon is completely isolated.
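
A rough sketch of how the reexec approach could look (Docker's pkg/reexec helper is real, but the "netns-setup" handler, its argument, and the namespace path below are illustrative, not the actual fix):

package main

import (
	"fmt"
	"os"

	"github.com/docker/docker/pkg/reexec"
	"golang.org/x/sys/unix"
)

func init() {
	// Runs in the re-executed child process: join the container's netns and
	// do the plumbing there. Any OS threads the child creates die with it,
	// so the daemon's own threads never change namespace.
	reexec.Register("netns-setup", func() {
		nsPath := os.Args[1]
		fd, err := unix.Open(nsPath, unix.O_RDONLY, 0)
		if err != nil {
			fmt.Fprintln(os.Stderr, err)
			os.Exit(1)
		}
		if err := unix.Setns(fd, unix.CLONE_NEWNET); err != nil {
			fmt.Fprintln(os.Stderr, err)
			os.Exit(1)
		}
		// ... run the iptables / resolver setup here, inside the namespace ...
	})
}

func main() {
	if reexec.Init() { // true if we were invoked as a registered handler
		return
	}
	// Daemon side: spawn a short-lived copy of ourselves to do the
	// namespaced work.
	cmd := reexec.Command("netns-setup", "/var/run/docker/netns/example")
	cmd.Stdout, cmd.Stderr = os.Stdout, os.Stderr
	if err := cmd.Run(); err != nil {
		fmt.Fprintln(os.Stderr, "netns setup failed:", err)
	}
}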

@alindeman
Contributor Author

I am trying out a solution, using the reexec option Docker has (it basically spawns a new Docker process, to do just some specific operations). This guarantees that the Docker daemon is completely isolated.

Setting up the network in a totally separate, ephemeral process sounds like a good solution to me 👍

@lstoll

lstoll commented Apr 14, 2016

Yep, complete isolation makes even more sense, especially if there's something in place for it 👍


@thaJeztah
Member

Think this can be closed; this will be fixed in 1.11.1 through moby/moby#22261 (which was just merged)

@chenchun
Contributor

libnetwork code is particularly exposed to this problem because we have to switch to the containers' net namespace to do the networking plumbing. We do have a mechanism to protect against this, which helps in many cases. Please see the usages of InitOSContext in libnetwork.

I never understood what InitOSContext was used for. Thanks all for sharing.

@sanimej

sanimej commented May 18, 2016

Closing this issue. The fix for this is available in 1.11.1.

@bboreham
Contributor

Related: golang/go#20676

@coolljt0725
Contributor

We still encounter this problem on v1.11.2.
@aboch, it seems #1115 doesn't fix this problem completely.
I think we should reopen this issue.

@sanimej

sanimej commented Jun 27, 2017

@coolljt0725 #1115 fixes one specific case where this issue can happen. Since libnetwork switches into different namespaces, there could be some other trigger as well.

If you have identified a specific sequence to recreate the issue, can you share it? Instead of reopening this, we should probably create a new issue with the details on how to recreate it.

@coolljt0725
Contributor

@sanimej We are trying to find a way to reproduce it more easily, and also investigating why this is still happening; we will share any progress we make.

liusdu pushed a commit to liusdu/moby that referenced this issue Oct 30, 2017

initialize defOsSbox in controller New()

This will initialize defOsSbox once the controller is set. It's a workaround to prevent the problems described in moby/libnetwork#1113 from happening.

Without it, there is a risk of defOsSbox being created on a thread that is in a non-host namespace, in which case containers created with --net=host won't work as expected, because defOsSbox would be bound to a container network namespace.

Signed-off-by: Deng Guangxing <dengguangxing@huawei.com>

See merge request docker/docker!617