This repository has been archived by the owner on Jun 20, 2024. It is now read-only.

set up app container networking before containers start #47

Closed
rade opened this issue Sep 11, 2014 · 45 comments

@rade
Member

rade commented Sep 11, 2014

Currently 'weave run' sets up the app container's interface into the weave network after the container has been launched with 'docker run -d'. That means the network may not be available to the container process straight away. Depending on what the container is doing, that can be benign, annoying, or disastrous.

Containers can themselves ensure that the interface is available, by running something like https://github.com/jpetazzo/pipework/blob/master/pipework#L30, i.e.

while ! grep -q ^1$ /sys/class/net/ethwe/carrier 2>/dev/null
do sleep 1
done

before starting the actual container process, but this of course requires containers to have been constructed with weave in mind, which is limiting.

There is no way around this issue without some changes to docker.

@rade rade added the feature label Sep 11, 2014
@rade
Member Author

rade commented Sep 11, 2014

My current thinking is that docker should add a --net=ns:NAMESPACE option to 'docker run' that would work like the default, i.e. --net=bridge, but place the container into the given namespace instead of creating a fresh one.

Then

# weave run <ipaddr>/<mask> <docker-args>

would become

# NS=$(weave prepare <ipaddr>/<mask>)
# docker run -d --net=ns:$NS <docker-args>

Bonus point: no more wrapping of 'docker run'!

@kenshin54

This PR moby/moby#7436 allows users to set the IP and netmask before containers start, but it is still under review.

@rade
Member Author

rade commented Sep 13, 2014

PR moby/moby#7436

That doesn't help. Firstly, it either sets an IP, or specifies a range (via CIDR notation) from which it then picks an IP. So this is not at all what weave does/needs, which is to set both the IP and also the netmask. Secondly, it determines IPs for the docker0 bridge, whereas weave doesn't touch docker0 (so it can play nicely with containers not connected to weave and thus doesn't become an all-or-nothing choice) and operates via its own bridge.

@rade
Member Author

rade commented Sep 13, 2014

My current thinking is that docker should add a --net=ns:NAMESPACE

@jpetazzo points out that we can get pretty close to that with the existing --net=container:CONTAINERID option, namely by launching a placeholder container, configuring the weave networking in that, and then starting the application container with --net=container:.

The main problem with taking this approach is that we end up with an extra container per application container. That container will show up in 'docker ps', which may confuse users. More importantly, it won't get removed when the corresponding application container stops. Furthermore, any network related 'docker run' options, such as exposing ports, setting the hostname, configuring dns, must be supplied when starting the placeholder container; they won't work (and in some cases even cause errors) when supplied to the application container. So 'weave run' would have to parse all the docker args, figure out which ones are networking related, and pass them to the placeholder container start. yuck. Plus it means we are still wrapping 'docker run'.

@kenshin54

@rade You are right. --net=ns:NAMESPACE seems like a better solution.

@binocarlos
Contributor

+1 to this because it opens the possibility of running short lived jobs that actively make network connections right off the bat (like database backups)

@jpetazzo

Following up our conversation with @rade:

The main problem with taking this approach (--net container:<container_name_or_id>)
is that we end up with an extra container per application container. That container will show
up in 'docker ps', which may confuse users.

I think it's both good and bad. Right, it may confuse users. But at the same time, it
materializes the namespace so that it's visible for the users, the API, etc.

IMHO it's not a huge issue, especially if the name/image/... of the container can
be set to be something explicit (e.g. placeholder_for_weave ... :-))

More importantly, it won't get removed when the corresponding application container stops.

Agreed, but it will allow re-using it if needed. So it's both good and bad (again).

Furthermore, any network related 'docker run' options, such as exposing ports,
setting the hostname, configuring dns, must be supplied when starting the placeholder
container; they won't work (and in some case even cause errors) when supplied to the
application container.

Ah, wait, some stuff won't work (exposing ports) however hostname and DNS should work,
since they are actually not related to the network namespace.

In the long run, I also hope that we can get something like e.g. --net=pid:<pidnumber>
but in the short term, I hope that --net=container:... can help :-)

If there are things that seem weird/impossible/etc don't hesitate to ping me.
There are many cool things that are not 100% elegant, but can help weave to
be easier to deploy/use/maintain right now, instead of waiting for the right
feature to be there in Docker.

Thanks for making weave, anyway!

@rade
Member Author

rade commented Sep 15, 2014

@jpetazzo

It's not ok to leave behind placeholder containers every time an application container terminates.

hostname and DNS should work, since they are actually not related to the network namespace.

Theoretically that is correct. Alas it doesn't work. I had actually tried this before making my claims ;). --hostname causes a Conflicting options: -h and the network mode (--net) error. And --dns and --dns-search are simply ignored. So is -p.

I also hope that we can get something like e.g. --net=pid:<pidnumber>

What is <pidnumber> here? I was hoping to be able to create a fresh namespace with ip netns add <nsname>, configure it, and then run an application container with --net=ns:<nsname>. Docker should effectively take over that network namespace and perform all the normal configuration in it, e.g. create an eth0 connected to the docker0 bridge, etc, etc. And it should make sure that the namespace disappears when the container stops.

@jpetazzo

Understood. So, hmm:

It's not ok to leave behind placeholder containers every time an application container terminates.

I totally agree! But IIRC, you can:

  • start placeholder container C1
  • start app container with --net container:C1
  • remove placeholder container

Regarding hostname and DNS: if it doesn't work, it's a bug in Docker, and we should totally fix it.

I remember discussing "conflicting options" a while ago, and it's totally a bug, since the hostname
is in the UTS namespace (and not the NET namespace). Same thing for DNS.

I have mixed feelings about having Docker deal with the namespaces created by ip netns add <nsname>.
Under the hood, this:

  • creates a new namespace for the current process
  • bind-mounts this namespace to /var/run/netns/<nsname>

If Docker starts to manage those namespaces as you suggest (giving them IP addresses etc.),
it means that it will do part of the job it does for containers already; and it will have to track them.
It is not impossible or hard, but it is a lot of new code, which IMHO is not strictly necessary.

I suggest that we rehash that a bit more, then we can draw the attention of the maintainers
for the network code to see what they think...?

@rade
Member Author

rade commented Sep 16, 2014

@jpetazzo

you can:

  • start placeholder container C1
  • start app container with --net container:C1
  • remove placeholder container

Ah! Yes, you are absolutely right. Have just tried that.

Which leaves two problems then...

  1. port mappings; they cannot be configured when starting the application container, and when configuring them on the placeholder container they disappear when that container is removed
  2. non-working hostname and DNS options; as previously noted

Let's assume that the second issue can be fixed. What can be done about the first?

ip netns add <nsname>.
Under the hood, this:

  • creates a new namespace for the current process
  • bind-mount this namespace to /var/run/netns/

afaict the namespaces created with ip netns add are not associated with any process. Certainly none are listed with ip netns pids.

If Docker starts to manage those namespaces as you suggest (giving them IP addresses etc.),
it means that it will do part of the job it does for containers already; and it will have to track them.
It is not impossible or hard, but it is a lot of new code, which IMHO is not strictly necessary.

I would have thought that telling docker to work with an existing network namespace when starting a container involves a subset of the steps it currently performs to configure container networking, namely everything except the namespace creation. What am I missing?

@neilellis

As a workaround in the meantime consider offering a base image for people to use.

@rade
Member Author

rade commented Sep 25, 2014

As a workaround in the meantime consider offering a base image for people to use.

Docker images aren't really flexible enough to make this work. We would want to arrive at a situation where derived images on startup execute some code that waits for the network, and then whatever code the user wanted. There is no way to compose process execution in Docker this way. You'd have to drag in a process manager, which, combined with the choice of underlying OS etc would make such a base image extremely opinionated.

@neilellis

Or simply include a command runWhenReady which executes its parameters
when the network is ready.

i.e.

RUN "runWhenReady java -jar myapp.jar"

More talking convenience than technical solution.


@rade
Member Author

rade commented Sep 25, 2014

Or simply includes a command runWhenReady.

Ah yes, that could work. Or at the very least we could have such a script available so folks can construct their own images with it.

@binocarlos
Contributor

This might be totally dumb, so ignore it if it is - it's more of a write-up of my explorations in this area.

I've had some success in hijacking the --entrypoint of the container and setting it to a volume mounted script which waits for ethwe.

The waitfornetwork.sh script - mostly from the post above, with a "run the arguments as a command" step at the end:

#!/bin/sh
# Block until the weave interface reports carrier, then run the
# original command; exec and quoting preserve arguments and signals.
while ! grep -q ^1$ /sys/class/net/ethwe/carrier 2>/dev/null
do sleep 1
done
exec "$@"

I have a box with weave launch 10.255.0.1/8 and weave expose 10.255.0.1/8 (exposed simply so I can ping 10.255.0.1 without running a 2nd container)

Then an ubuntu based container that has an entrypoint of ping - here is the original (don't wait for network) version:

weave run 10.255.0.2/8 binocarlos/ping -c 1 10.255.0.1

And then a version that hijacks the entrypoint:

weave run 10.255.0.2/8 \
  -v /tmp/waitfornetwork.sh:/bin/waitfornetwork.sh \
  --entrypoint="/bin/waitfornetwork.sh" \
  binocarlos/ping ping -c 1 10.255.0.1

And it works! However - there are many problems with this approach in terms of trying to generically apply it.

  1. You need to know the entrypoint and prepend it onto the arguments to the container (i.e. -c 1 10.255.0.1 becomes ping -c 1 10.255.0.1)
  2. Knowing the entrypoint of a container involves something like this and the image must already be downloaded:
docker inspect --format '{{ range $index, $element := .Config.Entrypoint }}{{ if eq $index 0 }}{{ $element }}{{ end }}{{ end }}' binocarlos/ping
  3. If the entrypoint has been overridden in the command, steps 1 and 2 are replaced by a "modify the entrypoint" step
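Rather than wrestling with the inspect template above, a small program can parse the `docker inspect` JSON directly. A minimal Go sketch; the struct mirrors only the fields needed here (real inspect output carries many more), and the sample input is illustrative:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// inspectOutput mirrors just the fields of `docker inspect` output we need.
type inspectOutput struct {
	Config struct {
		Entrypoint []string
		Cmd        []string
	}
}

// entrypointOf pulls the configured entrypoint out of the JSON printed by
// `docker inspect <image>`, which is an array with one element per image.
func entrypointOf(inspectJSON []byte) ([]string, error) {
	var out []inspectOutput
	if err := json.Unmarshal(inspectJSON, &out); err != nil {
		return nil, err
	}
	if len(out) == 0 {
		return nil, fmt.Errorf("empty inspect output")
	}
	return out[0].Config.Entrypoint, nil
}

func main() {
	sample := []byte(`[{"Config":{"Entrypoint":["ping"],"Cmd":["-c","1","10.255.0.1"]}}]`)
	ep, _ := entrypointOf(sample)
	fmt.Println(ep) // prints "[ping]"
}
```

The same caveat applies: the image must already have been pulled for inspect to return anything.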

Because of the hackery above, it feels better for the user to explicitly state their intention regarding this and not try to have weave do it generically.

I have a couple of images to run that I need to wait for the network (db backups) but there is no way I can access or control the image and so this allows me to get around it.

All of the above obviously goes away if we had --net=ns:NAMESPACE :)

@rade
Member Author

rade commented Sep 28, 2014

@binocarlos Interesting. Not something I'd want to attempt to do in a shell script though. Also, waitfornetwork can't be a shell script, since that requires the container to have a shell. Needs to be a standalone executable. All in all though it seems like a viable approach.

@binocarlos
Contributor

@rade ahh yes good point about using a binary - it makes it totally agnostic to the underlying image of course.

I will do a golang equivalent of the shell script and report back.

@errordeveloper
Contributor

Ok, but overriding --entrypoint is not going to work if someone's container has /usr/bin/python or whatever else as their entry point.

@errordeveloper
Contributor

The question is whether there is a way of getting the original entry point somehow, like through an environment variable or something?

@rade
Member Author

rade commented Nov 7, 2014

Of course you can get the original entrypoint. Docker knows what it is, and it's returned by inspect, as @binocarlos has demonstrated. So this is all do-able.

@errordeveloper
Contributor

My bad, missed that bit. Looks doable.

@rade
Member Author

rade commented Jan 5, 2015

One challenge with the "rewrite entrypoint to a volume-mounted exe" approach is making it work when we don't have direct access to the docker host and instead run weave itself in a container (as per #312). Where would the exe live such that we can volume-mount it into containers?

The answer is --volumes-from... As part of weave launch we start a named container that copies the required exe into a volume created with -v. For example, we could add the exe, and the program to copy it, to the weavetools image, in which case we would start that container with something like docker run --name weavetools -v /home/weave zettio/weavetools /bin/copy-entrypoint-exe /home/weave. We can then start all other containers with --volumes-from=weavetools:ro.

Note that weavetools doesn't need to remain running for this to work, though the container needs to continue to exist. We'd need to take care of removing it with docker rm -v on weave stop.

@binocarlos
Contributor

Hey guys sorry for delay have been out of action for a coupla months...

@rade Great idea r.e. mounting the exe from a volume - makes this whole thing much more portable and allows the exe to be distributed as a container.

It's mad how fiendishly hard parsing the docker run args can be - I made a start in Go (which is new for me) and hit that same problem of 'where do the docker arguments end and container arguments begin'

Had not thought of using docker's runconfig.Parse however - that plus some extra wrapping in its own Go library called parse-docker-args would be a good start - I will pick this up again this week and see where my beginner golang skills get me :)

@errordeveloper pausing the container would be great if it works!

The challenge I suppose would be how to check /sys/class/net/ethwe/carrier from the outside of the container and if the initiation of the network continues for a paused container.

@rade
Member Author

rade commented Jan 6, 2015

@errordeveloper

Am I understanding right that this implies docker run --entrypoint=foo will not reflect .Config.Entrypoint? This doesn't look like it's been reported yet, is it considered a feature?

Not sure what you mean. The whole point of --entrypoint is to be able to override the entrypoint configured in the image.

have we considered docker pause (or SIGSTOP) with consequent docker unpause (or SIGCONT) approach?

That's just racy; there's no way to guarantee we'd be able to pause the container before it attempts to access the network.

@rade
Member Author

rade commented Jan 6, 2015

@squaremo suggested an alternative technique for modifying the entrypoint: run a docker proxy, i.e. a process (in a container, naturally) that the weave script would perform all docker actions on (using DOCKER_HOST=... docker ... or docker -H ...), and that forwards requests/responses to/from the real docker daemon. In the process it could rewrite the entrypoints, which, since it is speaking the docker remoting protocol, are easy to pick out and modify.

That same proxy could also take care of #251.

@binocarlos
Contributor

@rade yes the proxy would be a great way to bypass the cli arguments issue - I've been messing with a docker proxy in node - mdock - and it works a treat because as you say, the arguments are now in nice, already processed JSON format.

When doing this, I had lots of fun and games realizing that docker run was (as #251 points out) in fact a sequence of commands rather than a single action.

So perhaps I can make a simple container based on mdock to quickly test this idea (which feels much better than command line processing the arguments).

@rade rade mentioned this issue Jan 10, 2015
@lukemarsden

@binocarlos Do you have your own fork of mdock? It looks like it's meant for single-host <=> many-hosts, rather than single-host docker API proxying.

I'm working on my own single-host Docker proxy in Twisted, but would be curious to see the node version.

@binocarlos
Contributor

@lukemarsden I've published mdock onto npm here (so npm install mdock will grab the code locally). You are right though - it's trying to be a single entry point to multiple docker servers - not the single destination proxy we need here.

Slight distraction: I had originally called this library flocker (as in flock of dockers) - then I realised clusterhq existed and so renamed it to mdock (multiple dockers) - you are welcome to take over the npm package flocker at any time :)

@rade - here is a rough plan for the proxy I have in mind - I have no idea if a) it will work or b) how it might integrate into weave so feedback is welcome!

We need a container that can be named 'weave-proxy' and:

  • has the weave command
  • has --net=host -v /proc:/hostproc so weave can mess with the system (i.e. don't wrap docker #230)
  • has access to the docker host (via HTTP 2375 or mounted /var/run/docker.sock)
  • publishes a tcp port and listens for HTTP (e.g. port 2475)

We need another container that contains a wait-for-weave binary in a volume. This binary will wait for the weave network to be ready and then execute its arguments as a command. So echo "weave rocks" becomes wait-for-weave echo "weave rocks". Any container that has '--volumes-from=wait-for-weave' can then access that binary.

The HTTP proxy port (2475) will forward all requests to the docker server (2375) by default. Any requests to POST /containers/create will be intercepted and the JSON packet processed as follows:

  • add a --volumes-from=wait-for-weave
  • change the entry-point to wait-for-weave $@

Any requests to POST /containers/:id/start will:

  • ensure the entry-point is wait-for-weave $@
  • extract $CIDR from the env or auto-create it
  • forward the request to the docker server
  • the container is now running but blocked by wait-for-weave
  • capture the returned $CONTAINER_ID
  • run a weave with_container_netns $CONTAINER_ID attach $CIDR style command
  • run a weave tell_dns PUT $CONTAINER_ID $CIDR command
  • return the $CONTAINER_ID to the docker client

We point the docker client at our proxy:

$ export DOCKER_HOST=tcp://127.0.0.1:2475

Now we can run long-lived servers:

$ docker run -d --env WEAVE_CIDR=10.0.1.10/24 mystack/mysql

And interactive jobs:

$ docker run --env WEAVE_CIDR=10.0.1.11/24 ubuntu bash -c "ping 10.0.1.10"

The interactive job will be blocked by wait-for-weave before attempting to ping 10.0.1.10.

There are a couple of problems I can see:

  • point number 3 from don't wrap docker #230 - i.e. how do we know what CIDR address to instruct weave with
  • returning the $CONTAINER_ID to the docker client before wait-for-weave has run the container's original entry point (perhaps adding a delay might help)

I'll give this a try this week and discover all the things that I've missed out :)

@lukemarsden

@binocarlos Thanks for this. By the way, do you live in Bristol? (I do...)

@binocarlos
Contributor

@lukemarsden No problem - yes I live in Bristol - it's a great city!

Perhaps I should come and say hello sometime - I met Richard Wall a few months ago at a meetup in the Engine Shed and he was telling me about what you guys were doing upstairs.

@binocarlos
Contributor

I'm making good progress with a container that:

  • runs weave from inside a container
  • presents a docker API proxy that will
    • rewrite the entry-point for /containers/create
    • add --volumes-from=weavetools
  • will run weave attach before returning the container id for /containers/:id/start

In the meantime - I've uploaded wait-for-weave which is the golang program that will wait for /sys/class/net/ethwe/carrier before running its args as a command.

This binary is what will be mounted in a volume in a container called weavetools

@rade if you could check this (golang being very new to me) I would be most grateful :)

@rade
Member Author

rade commented Jan 19, 2015

@binocarlos you may want to use EnsureInterface

@binocarlos
Contributor

@rade thanks! - that is far more civilized :)

I've added an exit with a non-zero code if it decides there is no ethwe to be found - I'm assuming this is a good plan rather than try to run the entrypoint anyway.

@rade
Member Author

rade commented Jan 19, 2015

I've added an exit with a non-zero code if it decides there is no ethwe to be found - I'm assuming this is a good plan rather than try to run the entrypoint anyway.

Agreed.

@tangzhankun

Hi, @rade
Is there any update in docker that lets Docker wait for the NIC to be ready, in a Docker way? I searched, but this thread is the only one that closely matches what I need. I saw that you are using another container to replace the entrypoint and then run the original one after the weave device is ready.

@rade
Member Author

rade commented Jul 8, 2015

@tangzhankun The answer is Docker Network Plugins, which obviously do a whole lot more than just wait before running an entrypoint.

@tangzhankun

hi, @rade
Oh, thanks for the quick reply. Writing a plugin is obviously overkill for my needs. I think using the wait in the entrypoint is enough. BTW, if I want to wait for a custom NIC name, like eth0 or if0, will wait-for-weave support this in the future? Or should I build a wait-for-NIC?

@rade
Member Author

rade commented Jul 8, 2015

if I want to wait for a custom NIC name, like eth0 or if0, will wait-for-weave support this in the future? Or should I build a wait-for-NIC?

Submit a PR :) This should be a straightforward extension.

@tangzhankun

hi, @rade
:) I'll do it if I have time in the future. Haha. Thanks again.

@apassi

apassi commented May 22, 2017

Is there some kind of summary of the correct solution to fix this? There are situations, like a DNS server in docker & weave networks, which needs to be up and running before the other containers after a reboot.

@marccarre
Contributor

marccarre commented May 22, 2017

@apassi, what are you looking for exactly?

Weave Net's Docker proxy already waits for the NIC to be ready before your container gets started, and Weave Net's Docker plugin relies on Docker to do that but leads to the same outcome, so dependencies like the NIC, DNS servers, etc. should be ready by the time your container's ENTRYPOINT gets run.

@apassi

apassi commented May 22, 2017

Basically I'd like to understand how weave and containers handle the host's reboot situation. I haven't been able to find any "internals" document about weave. After I run weave launch, there seem to be new weave-related containers running, weavedb etc., but I haven't been able to find documentation for those.

@marccarre
Contributor

@apassi, I understand what you meant now, thanks for clarifying it.
Indeed, details on the internals of Weave Net, when present at all, are very much diluted in the online user docs. You can find lower-level details in our developer docs, but ultimately the code is the source of truth.

If you think a "Weave Net internals" documentation could be useful, feel free to open a new GitHub issue (as this is unrelated to this one) and to list exactly what you would like to see there, since as developers, we are biased and may think some details are obvious. Thanks!
