This repository has been archived by the owner on Jun 20, 2024. It is now read-only.

set up app container networking before containers start #47

Closed
rade opened this issue Sep 11, 2014 · 45 comments

@rade
Member

rade commented Sep 11, 2014

Currently 'weave run' sets up the app container's interface into the weave network after the container has been launched with 'docker run -d'. That means the network may not be available to the container process straight away. Depending on what the container is doing, that can be benign, annoying, or disastrous.

Containers can themselves ensure that the interface is available, by running something like https://github.com/jpetazzo/pipework/blob/master/pipework#L30, i.e.

while ! grep -q ^1$ /sys/class/net/ethwe/carrier 2>/dev/null
do sleep 1
done

before starting the actual container process, but this of course requires containers to have been constructed with weave in mind, which is limiting.

There is no way around this issue without some changes to docker.

@rade rade added the feature label Sep 11, 2014
@rade
Member Author

rade commented Sep 11, 2014

My current thinking is that docker should add a --net=ns:NAMESPACE option to 'docker run' that would work like the default, i.e. --net=bridge, but place the container into the given namespace instead of creating a fresh one.

Then

# weave run <ipaddr>/<mask> <docker-args>

would become

# NS=$(weave prepare <ipaddr>/<mask>)
# docker run -d --net=ns:$NS <docker-args>

Bonus point: no more wrapping of 'docker run'!

@kenshin54

This PR moby/moby#7436 allows users to set the IP and netmask before containers start, but it is still under review.

@rade
Member Author

rade commented Sep 13, 2014

PR moby/moby#7436

That doesn't help. Firstly, it either sets an IP, or specifies a range (via CIDR notation) from which it then picks an IP. So this is not at all what weave does/needs, which is to set both the IP and also the netmask. Secondly, it determines IPs for the docker0 bridge, whereas weave doesn't touch docker0 (so it can play nicely with containers not connected to weave and thus doesn't become an all-or-nothing choice) and operates via its own bridge.

@rade
Member Author

rade commented Sep 13, 2014

My current thinking is that docker should add a --net=ns:NAMESPACE

@jpetazzo points out that we can get pretty close to that with the existing --net=container:CONTAINERID option, namely by launching a placeholder container, configuring the weave networking in that, and then starting the application container with --net=container:.

The main problem with taking this approach is that we end up with an extra container per application container. That container will show up in 'docker ps', which may confuse users. More importantly, it won't get removed when the corresponding application container stops. Furthermore, any network related 'docker run' options, such as exposing ports, setting the hostname, configuring dns, must be supplied when starting the placeholder container; they won't work (and in some cases even cause errors) when supplied to the application container. So 'weave run' would have to parse all the docker args, figure out which ones are networking related, and pass them to the placeholder container start. yuck. Plus it means we are still wrapping 'docker run'.

@kenshin54

@rade You are right. --net=ns:NAMESPACE seems like a better solution.

@binocarlos
Contributor

+1 to this because it opens the possibility of running short lived jobs that actively make network connections right off the bat (like database backups)

@jpetazzo

Following up our conversation with @rade:

The main problem with taking this approach (--net container:<container_name_or_id>)
is that we end up with an extra container per application container. That container will show
up in 'docker ps', which may confuse users.

I think it's both good and bad. Right, it may confuse users. But at the same time, it
materializes the namespace so that it's visible for the users, the API, etc.

IMHO it's not a huge issue, especially if the name/image/... of the container can
be set to be something explicit (e.g. placeholder_for_weave ... :-))

More importantly, it won't get removed when the corresponding application container stops.

Agreed, but it will allow re-using it if needed. So it's both good and bad (again).

Furthermore, any network related 'docker run' options, such as exposing ports,
setting the hostname, configuring dns, must be supplied when starting the placeholder
container; they won't work (and in some case even cause errors) when supplied to the
application container.

Ah, wait, some stuff won't work (exposing ports) however hostname and DNS should work,
since they are actually not related to the network namespace.

In the long run, I also hope that we can get something like e.g. --net=pid:<pidnumber>
but in the short term, I hope that --net=container:... can help :-)

If there are things that seem weird/impossible/etc don't hesitate to ping me.
There are many cool things that are not 100% elegant, but can help weave to
be easier to deploy/use/maintain right now, instead of waiting for the right
feature to be there in Docker.

Thanks for making weave, anyway!

@rade
Member Author

rade commented Sep 15, 2014

@jpetazzo

It's not ok to leave behind placeholder containers every time an application container terminates.

hostname and DNS should work, since they are actually not related to the network namespace.

Theoretically that is correct. Alas it doesn't work. I had actually tried this before making my claims ;). --hostname causes a Conflicting options: -h and the network mode (--net) error. And --dns and --dns-search are simply ignored. So is -p.

I also hope that we can get something like e.g. --net=pid:<pidnumber>

What is <pidnumber> here? I was hoping to be able to create a fresh namespace with ip netns add <nsname>, configure it, and then run an application container with --net=ns:<nsname>. Docker should effectively take over that network namespace and perform all the normal configuration in it, e.g. create an eth0 connected to the docker0 bridge, etc, etc. And it should make sure that the namespace disappears when the container stops.

@jpetazzo

Understood. So, hmm:

It's not ok to leave behind placeholder containers every time an application container terminates.

I totally agree! But IIRC, you can:

  • start placeholder container C1
  • start app container with --net container:C1
  • remove placeholder container

Regarding hostname and DNS: if it doesn't work, it's a bug in Docker, and we should totally fix it.

I remember discussing "conflicting options" a while ago, and it's totally a bug, since the hostname
is in the UTS namespace (and not the NET namespace). Same thing for DNS.

I have mixed feelings about having Docker deal with the namespaces created by ip netns add <nsname>.
Under the hood, this:

  • creates a new namespace for the current process
  • bind-mounts this namespace to /var/run/netns/<nsname>

If Docker starts to manage those namespaces as you suggest (giving them IP addresses etc.),
it means that it will do part of the job it does for containers already; and it will have to track them.
It is not impossible or hard, but it is a lot of new code, which IMHO is not strictly necessary.

I suggest that we rehash that a bit more, then we can draw the attention of the maintainers
for the network code to see what they think...?

@rade
Member Author

rade commented Sep 16, 2014

@jpetazzo

you can:

  • start placeholder container C1
  • start app container with --net container:C1
  • remove placeholder container

Ah! Yes, you are absolutely right. Have just tried that.

Which leaves two problems then...

  1. port mappings; they cannot be configured when starting the application container, and when configuring them on the placeholder container they disappear when that container is removed
  2. non-working hostname and DNS options; as previously noted

Let's assume that the second issue can be fixed. What can be done about the first?

ip netns add <nsname>.
Under the hood, this:

  • creates a new namespace for the current process
  • bind-mount this namespace to /var/run/netns/

afaict the namespaces created with ip netns add are not associated with any process. Certainly none are listed with ip netns pids.

If Docker starts to manage those namespaces as you suggest (giving them IP addresses etc.),
it means that it will do part of the job it does for containers already; and it will have to track them.
It is not impossible or hard, but it is a lot of new code, which IMHO is not strictly necessary.

I would have thought that telling docker to work with an existing network namespace when starting a container involves a subset of the steps it currently performs to configure container networking, namely everything except the namespace creation. What am I missing?

@neilellis

As a workaround in the meantime consider offering a base image for people to use.

@rade
Member Author

rade commented Sep 25, 2014

As a workaround in the meantime consider offering a base image for people to use.

Docker images aren't really flexible enough to make this work. We would want to arrive at a situation where derived images on startup execute some code that waits for the network, and then whatever code the user wanted. There is no way to compose process execution in Docker this way. You'd have to drag in a process manager, which, combined with the choice of underlying OS etc would make such a base image extremely opinionated.

@neilellis

Or simply include a command runWhenReady which executes its parameters
when the network is ready.

i.e.

RUN "runWhenReady java -jar myapp.jar"

More talking convenience than technical solution.


@rade
Member Author

rade commented Sep 25, 2014

Or simply includes a command runWhenReady.

Ah yes, that could work. Or at the very least we could have such a script available so folks can construct their own images with it.

@binocarlos
Contributor

This might be totally dumb, so ignore it if it is - it's more of a write-up of my explorations in this area.

I've had some success in hijacking the --entrypoint of the container and setting it to a volume mounted script which waits for ethwe.

The waitfornetwork.sh script - mostly from the post above, with a "run the arguments as a command" step at the end:

#!/bin/sh
# Block until the weave interface reports carrier, then run the
# original command; exec and quoting preserve arguments and signals.
while ! grep -q ^1$ /sys/class/net/ethwe/carrier 2>/dev/null
do sleep 1
done
exec "$@"

I have a box with weave launch 10.255.0.1/8 and weave expose 10.255.0.1/8 (exposed simply so I can ping 10.255.0.1 without running a 2nd container)

Then an ubuntu based container that has an entrypoint of ping - here is the original (don't wait for network) version:

weave run 10.255.0.2/8 binocarlos/ping -c 1 10.255.0.1

And then a version that hijacks the entrypoint:

weave run 10.255.0.2/8 \
  -v /tmp/waitfornetwork.sh:/bin/waitfornetwork.sh \
  --entrypoint="/bin/waitfornetwork.sh" \
  binocarlos/ping ping -c 1 10.255.0.1

And it works! However - there are many problems with this approach in terms of trying to generically apply it.

  1. You need to know the entrypoint and prepend it onto the arguments to the container (i.e. -c 1 10.255.0.1 becomes ping -c 1 10.255.0.1)
  2. Knowing the entrypoint of a container involves something like this and the image must already be downloaded:
docker inspect --format '{{ range $index, $element := .Config.Entrypoint }}{{ if eq $index 0 }}{{ $element }}{{ end }}{{ end }}' binocarlos/ping
  3. If the entrypoint has been overridden in the command, steps 1 and 2 are replaced by a "modify the entrypoint" step
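Rather than wrestling with the inspect template above, a small program can parse the `docker inspect` JSON directly. A minimal Go sketch; the struct mirrors only the fields needed here (real inspect output carries many more), and the sample input is illustrative:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// inspectOutput mirrors just the fields of `docker inspect` output we need.
type inspectOutput struct {
	Config struct {
		Entrypoint []string
		Cmd        []string
	}
}

// entrypointOf pulls the configured entrypoint out of the JSON printed by
// `docker inspect <image>`, which is an array with one element per image.
func entrypointOf(inspectJSON []byte) ([]string, error) {
	var out []inspectOutput
	if err := json.Unmarshal(inspectJSON, &out); err != nil {
		return nil, err
	}
	if len(out) == 0 {
		return nil, fmt.Errorf("empty inspect output")
	}
	return out[0].Config.Entrypoint, nil
}

func main() {
	sample := []byte(`[{"Config":{"Entrypoint":["ping"],"Cmd":["-c","1","10.255.0.1"]}}]`)
	ep, _ := entrypointOf(sample)
	fmt.Println(ep) // prints "[ping]"
}
```

The same caveat applies: the image must already have been pulled for inspect to return anything.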

Because of the hackery above, it feels better for the user to explicitly state their intention regarding this and not try to have weave do it generically.

I have a couple of images to run that I need to wait for the network (db backups) but there is no way I can access or control the image and so this allows me to get around it.

All of the above obviously goes away if we had --net=ns:NAMESPACE :)

@rade
Member Author

rade commented Sep 28, 2014

@binocarlos Interesting. Not something I'd want to attempt to do in a shell script though. Also, waitfornetwork can't be a shell script, since that requires the container to have a shell. Needs to be a standalone executable. All in all though it seems like a viable approach.

@binocarlos
Contributor

@rade ahh yes good point about using a binary - it makes it totally agnostic to the underlying image of course.

I will do a golang equivalent of the shell script and report back.

@errordeveloper
Contributor

Ok, but overriding --entrypoint is not going to work if someone's container has /usr/bin/python or whatever else as their entry point.

@errordeveloper
Contributor

The question is whether there is a way of getting the original entry point somehow, like through an environment variable or something?

@rade
Member Author

rade commented Nov 7, 2014

Of course you can get the original entrypoint. Docker knows what it is, and it's returned by inspect, as @binocarlos has demonstrated. So this is all do-able.

@errordeveloper
Contributor

My bad, missed that bit. Looks doable.

@rade
Member Author

rade commented Jan 5, 2015

One challenge with the "rewrite entrypoint to a volume-mounted exe" approach is making it work when we don't have direct access to the docker host and instead run weave itself in a container (as per #312). Where would the exe live such that we can volume-mount it into containers?

The answer is --volumes-from... As part of weave launch we start a named container that copies the required exe into a volume created with -v. For example, we could add the exe, and the program to copy it, to the weavetools image, in which case we would start that container with something like docker run --name weavetools -v /home/weave zettio/weavetools /bin/copy-entrypoint-exe /home/weave. We can then start all other containers with --volumes-from=weavetools:ro.

Note that weavetools doesn't need to remain running for this to work, though the container needs to continue to exist. We'd need to take care of removing it with docker rm -v on weave stop.

@binocarlos
Contributor

Hey guys sorry for delay have been out of action for a coupla months...

@rade Great idea r.e. mounting the exe from a volume - makes this whole thing much more portable and allows the exe to be distributed as a container.

It's mad how fiendishly hard parsing the docker run args can be - I made a start in Go (which is new for me) and hit that same problem of 'where do the docker arguments end and container arguments begin'

Had not thought of using docker's runconfig.Parse however - that plus some extra wrapping in its own Go library called parse-docker-args would be a good start - I will pick this up again this week and see where my beginner golang skills get me :)

@errordeveloper pausing the container would be great if it works!

The challenge I suppose would be how to check /sys/class/net/ethwe/carrier from the outside of the container and if the initiation of the network continues for a paused container.

@rade
Member Author

rade commented Jan 6, 2015

@errordeveloper

Am I understanding right that this implies docker run --entrypoint=foo will not reflect .Config.Entrypoint? This doesn't look like it's been reported yet, is it considered a feature?

Not sure what you mean. The whole point of --entrypoint is to be able to override the entrypoint configured in the image.

have we considered docker pause (or SIGSTOP) with consequent docker unpause (or SIGCONT) approach?

That's just racy; there's no way to guarantee we'd be able to pause the container before it attempts to access the network.

@rade
Member Author

rade commented Jan 6, 2015

@squaremo suggested an alternative technique for modifying the entrypoint: run a docker proxy, i.e. a process (in a container, naturally) that the weave script would perform all docker actions on (using DOCKER_HOST=... docker ... or docker -H ...), and that forwards requests/responses to/from the real docker daemon. In the process it could rewrite the entrypoints, which, since it is speaking the docker remoting protocol, are easy to pick out and modify.

That same proxy could also take care of #251.

@binocarlos
Contributor

@rade yes the proxy would be a great way to bypass the cli arguments issue - I've been messing with a docker proxy in node - mdock - and it works a treat because as you say, the arguments are now in nice, already processed JSON format.

When doing this, I had lots of fun and games realizing that docker run was (as #251 points out) in fact a sequence of commands rather than a single action.

So perhaps I can make a simple container based on mdock to quickly test this idea (which feels much better than command line processing the arguments).

@rade rade mentioned this issue Jan 10, 2015
@lukemarsden

@binocarlos Do you have your own fork of mdock? It looks like it's meant for single-host <=> many-hosts, rather than single-host docker API proxying.

I'm working on my own single-host Docker proxy in Twisted, but would be curious to see the node version.

@binocarlos
Contributor

@lukemarsden I've published mdock onto npm here (so npm install mdock will grab the code locally). You are right though - it's trying to be a single entry point to multiple docker servers - not the single destination proxy we need here.

Slight distraction: I had originally called this library flocker (as in flock of dockers) - then I realised clusterhq existed and so renamed it to mdock (multiple dockers) - you are welcome to take over the npm package flocker at any time :)

@rade - here is a rough plan for the proxy I have in mind - I have no idea if a) it will work or b) how it might integrate into weave so feedback is welcome!

We need a container that can be named 'weave-proxy' and:

  • has the weave command
  • has --net=host -v /proc:/hostproc so weave can mess with the system (i.e. don't wrap docker #230)
  • has access to the docker host (via HTTP 2375 or mounted /var/run/docker.sock)
  • publishes a tcp port and listens for HTTP (e.g. port 2475)

We need another container that contains a wait-for-weave binary in a volume. This binary will wait for the weave network to be ready and then execute its arguments as a command. So echo "weave rocks" becomes wait-for-weave echo "weave rocks". Any container that has '--volumes-from=wait-for-weave' can then access that binary.

The HTTP proxy port (2475) will forward all requests to the docker server (2375) by default. Any requests to POST /containers/create will be intercepted and the JSON packet processed as follows:

  • add a --volumes-from=wait-for-weave
  • change the entry-point to wait-for-weave $@

Any requests to POST /containers/:id/start will:

  • ensure the entry-point is wait-for-weave $@
  • extract $CIDR from the env or auto-create it
  • forward the request to the docker server
  • the container is now running but blocked by wait-for-weave
  • capture the returned $CONTAINER_ID
  • run a weave with_container_netns $CONTAINER_ID attach $CIDR style command
  • run a weave tell_dns PUT $CONTAINER_ID $CIDR command
  • return the $CONTAINER_ID to the docker client

We point the docker client at our proxy:

$ export DOCKER_HOST=tcp://127.0.0.1:2475

Now we can run long-lived servers:

$ docker run -d --env WEAVE_CIDR=10.0.1.10/24 mystack/mysql

And interactive jobs:

$ docker run --env WEAVE_CIDR=10.0.1.11/24 ubuntu bash -c "ping 10.0.1.10"

The interactive job will be blocked by wait-for-weave before attempting to ping 10.0.1.10.

There are a couple of problems I can see:

  • point number 3 from don't wrap docker #230 - i.e. how do we know what CIDR address to instruct weave with
  • returning the $CONTAINER_ID to the docker client before wait-for-weave has run the container's original entry point (perhaps adding a delay might help)

I'll give this a try this week and discover all the things that I've missed out :)

@lukemarsden

@binocarlos Thanks for this. By the way, do you live in Bristol? (I do...)

@binocarlos
Contributor

@lukemarsden No problem - yes I live in Bristol - it's a great city!

Perhaps I should come and say hello sometime - I met Richard Wall a few months ago at a meetup in the Engine Shed and he was telling me about what you guys were doing upstairs.

@binocarlos
Contributor

I'm making good progress with a container that:

  • runs weave from inside a container
  • presents a docker API proxy that will
    • rewrite the entry-point for /containers/create
    • add --volumes-from=weavetools
  • will run weave attach before returning the container id for /containers/:id/start

In the meantime - I've uploaded wait-for-weave which is the golang program that will wait for /sys/class/net/ethwe/carrier before running its args as a command.

This binary is what will be mounted in a volume in a container called weavetools

@rade if you could check this (golang being very new to me) I would be most grateful :)

@rade
Member Author

rade commented Jan 19, 2015

@binocarlos you may want to use EnsureInterface

@binocarlos
Contributor

@rade thanks! - that is far more civilized :)

I've added an exit with a non-zero code if it decides there is no ethwe to be found - I'm assuming this is a good plan rather than try to run the entrypoint anyway.

@rade
Member Author

rade commented Jan 19, 2015

I've added an exit with a non-zero code if it decides there is no ethwe to be found - I'm assuming this is a good plan rather than try to run the entrypoint anyway.

Agreed.

@tangzhankun

Hi, @rade
Is there any update in docker that lets Docker wait for the NIC to be ready, in a Docker way? I searched, but this thread is the only one that closely matches what I need. I saw that you are using another container to replace the entrypoint and then run the original one after the weave device is ready.

@rade
Member Author

rade commented Jul 8, 2015

@tangzhankun The answer is Docker Network Plugins, which obviously do a whole lot more than just wait before running an entrypoint.

@tangzhankun

hi, @rade
Oh, thanks for the quick reply. Writing a plugin is obviously overkill for my needs. I think using the wait in the entrypoint is enough. BTW, if I want to wait for a custom NIC name, like eth0 or if0, will wait-for-weave support this in the future? Or should I build a wait-for-NIC?

@rade
Member Author

rade commented Jul 8, 2015

if I want to wait for a custom NIC name, like eth0 or if0, will wait-for-weave support this in the future? Or should I build a wait-for-NIC?

Submit a PR :) This should be a straightforward extension.

@tangzhankun

hi, @rade
:) I'll do it if I have time in the future. Haha. Thanks again.

@apassi

apassi commented May 22, 2017

Is there some kind of summary of the correct solution to fix this? There are situations, like a DNS server in docker & weave networks, which needs to be up and running before the other containers after a reboot.

@marccarre
Contributor

marccarre commented May 22, 2017

@apassi, what are you looking for exactly?

Weave Net's Docker proxy already waits for the NIC to be ready before your container gets started, and Weave Net's Docker plugin relies on Docker to do that but leads to the same outcome, so dependencies like the NIC, DNS servers, etc. should be ready by the time your container's ENTRYPOINT gets run.

@apassi

apassi commented May 22, 2017

Basically I'd like to understand how weave and containers handle the host's reboot situation. I haven't been able to find any "internals" document about weave. After I run weave launch, there seem to be new weave-related containers running, weavedb etc., but I haven't been able to find documentation for those.

@marccarre
Contributor

@apassi, I understand what you meant now, thanks for clarifying it.
Indeed, details on the internals of Weave Net, when present at all, are very much diluted in the online user docs. You can find lower-level details in our developer docs, but ultimately the code is the source of truth.

If you think a "Weave Net internals" documentation could be useful, feel free to open a new GitHub issue (as this is unrelated to this one) and to list exactly what you would like to see there, since as developers, we are biased and may think some details are obvious. Thanks!
