Performance issue with persisting driver endpoint to store #1374

coolljt0725 · 2016-08-04T04:09:43Z

To support container live restore, we persist driver endpoint to store which is a good way
for each network driver. But persisting endpoint to store cause a performance issue. It will take
take more time to run a container and the situation is worse in parallel. Here is some test results using
https://github.com/crosbymichael/docker-stress.
The stress.json is

$ cat stress.json
[
    {"name":"busybox", "args": ["true"]}
]

docker 1.11.2 with live restore(we backport the liver restore patch)

 ./stress --containers 50
INFO[0339] ran 52 containers in 339.05 seconds (0.15 container per sec. or 6.52 sec. per container)
INFO[0339] failures 0

 ./stress --containers 50 -c 10
ERRO[0024] docker: Error response from daemon: failed to create endpoint fervent_noyce on network bridge: failed to save bridge endpoint f3d7dbe to store: failed to update bridge store for object type *bridge.bridgeEndpoint: timeout.
  error=exit status 125
ERRO[0035] docker: Error response from daemon: failed to create endpoint lonely_jepsen on network bridge: failed to save bridge endpoint cb6b394 to store: failed to update bridge store for object type *bridge.bridgeEndpoint: timeout.
  error=exit status 125
INFO[0330] ran 61 containers in 330.20 seconds (0.18 container per sec. or 5.41 sec. per container)
INFO[0330] failures 1

$ time docker run -ti busybox true

real    0m2.720s
user    0m0.017s
sys     0m0.019s
$ time docker run -ti busybox true

real    0m6.190s
user    0m0.015s
sys     0m0.018s
$ time docker run -ti busybox true

real    0m2.837s
user    0m0.014s
sys     0m0.015s
$ time docker run -ti busybox true

real    0m3.864s
user    0m0.017s
sys     0m0.015s
$ time docker run -ti busybox true

real    0m3.694s
user    0m0.016s
sys     0m0.011s
$ time docker run -ti busybox true

real    0m2.747s
user    0m0.015s
sys     0m0.018s

If we increase the concurrent workers of stress, there will be a lot of timeout error, see

./stress -c 50
ERRO[0133] docker: Error response from daemon: failed to create endpoint dreamy_keller on network bridge: failed to save bridge endpoint 9b9f68e to store: failed to update bridge store for object type *bridge.bridgeEndpoint: timeout.
  error=exit status 125
ERRO[0133] docker: Error response from daemon: failed to create endpoint ecstatic_boyd on network bridge: failed to save bridge endpoint b9db3b1 to store: failed to update bridge store for object type *bridge.bridgeEndpoint: timeout.
  error=exit status 125
ERRO[0137] docker: Error response from daemon: failed to update store for object type *libnetwork.endpoint: timeout.
  error=exit status 125
ERRO[0153] docker: Error response from daemon: failed to create endpoint hungry_mclean on network bridge: failed to save bridge endpoint 9678724 to store: failed to update bridge store for object type *bridge.bridgeEndpoint: timeout.
  error=exit status 125
ERRO[0166] docker: Error response from daemon: failed to create endpoint suspicious_leakey on network bridge: failed to save bridge endpoint 39b33ea to store: failed to update bridge store for object type *bridge.bridgeEndpoint: timeout.
  error=exit status 125

docker 1.11.2 without live restore

 ./stress --containers 50
INFO[0228] ran 52 containers in 228.99 seconds (0.23 container per sec. or 4.40 sec. per container)
INFO[0228] failures 0

 ./stress --containers 50 -c 10
INFO[0238] ran 61 containers in 238.24 seconds (0.26 container per sec. or 3.91 sec. per container)
INFO[0238] failures 0

$ time docker run -ti busybox true

real    0m2.379s
user    0m0.014s
sys     0m0.015s
$ time docker run -ti busybox true

real    0m2.387s
user    0m0.022s
sys     0m0.010s
$ time docker run -ti busybox true

real    0m2.584s
user    0m0.020s
sys     0m0.013s

There is significant performance decrease with live restore.
And also with persisting driver endpoint to store, there are also some consistent issues, so I suggest we can reconsider the re-construct endpoint approach, if it can reconstruct, we avoid persisting. I still think the less persisting to store the better.
I think
@aboch @mavenugo WDYT?

The text was updated successfully, but these errors were encountered:

danbeaulieu · 2016-09-06T22:57:25Z

Also being affected by this on docker 1.12.1.

It appears from the time a host is provisioned until the time we start seeing errors is very small.
From the logs, it DOES look like it recovers but it can get back in to the bad state and stay there.

@coolljt0725 interested in the stress app/script. Is it open source?

aboch · 2016-10-05T18:24:28Z

@coolljt0725
A couple of weeks ago I ran your test with docker-stress, and also increased the parallelism to 100 and more with 50/100/1000 containers.

I was not able to reproduce the problem.

I believe it has to do with the disk driver speed (I am running on a pretty empty SSD), its overload and maybe disk space and fragmentation. Somebody has infact reported that the stale data issue is easily reproducible on old spinning hard drivers.

At this moment we have not heard enough of such complains and given the issue seems related to slow disks, it does not seem enough of a strong point for a pervasive code rework. Also given the fact there will likely be issues with the other approach as well, few of which were already raised in #1135 comments, in my opinion we should hold on on this and revisit later.

vincer · 2016-10-24T23:58:10Z

We're running into this issue a lot. We have an image with a VOLUME /foo where there's 1-2GB of data in /foo that needs to be copied from the image into the volume at container creation time. I think that's causing us to run into this issue quickly even though we are running on bare metal hosts with SSDs in a RAID0 config. I can repro with our image with docker-stress easily.

Our workaround for now is to not make that directory a volume, however we incur a performance penalty at runtime due to that. Is there any other workaround to this issue or any progress to for fixing it?

coolljt0725 · 2016-10-25T01:31:45Z

@aboch
This happens on some of our server, I'm sure this has something to do with the driver speed. For now, the workaround for us is to increase the timeout of transientTimeout of boltdb . I hope the libnetwork could set the transientTimeout rather than just use the default timeout. WDYT?

aboch · 2016-10-25T03:41:08Z

@coolljt0725
After this issue, when we looked at the current timeout, we thought 10 seconds was already a big one.
But I agree, we can increase it. Because it is clear something wrong is happening down the chain and that does not seem to be under our control. Minus well we can increase the timeout to a minute, as long as that makes it better. Were you guys able to verify that the bigger timeout improved things ?

Given libnetwork does not create the bolt interface with persistent connection option, it can't currently set the transient timeout. I think it is fine to just change the default in libkv project as you are doing with your PR.

aboch · 2016-10-25T04:09:57Z

@coolljt0725
What about instead changing libkv to set BoltDB.timeout = options.ConnectionTimeout if options.PersistConnection==false && options.ConnectionTimeout != 0.
At the end that is the timeout which is used by bolt.Open()

Then we can control it via libnetwork in https://github.com/docker/libnetwork/blob/release/v0.8/datastore/datastore.go#L136.

coolljt0725 · 2016-10-25T06:37:36Z

@aboch
We had improved the timeout to 2 min on our servers and this problem never happened again, but this is still just a workaround, there are a little loss on container starting time, but it's not a big deal at the moment.

coolljt0725 · 2016-10-25T06:38:59Z

What about instead changing libkv to set BoltDB.timeout = options.ConnectionTimeout if options.PersistConnection==false && options.ConnectionTimeout != 0.
At the end that is the timeout which is used by bolt.Open()
Then we can control it via libnetwork in https://github.com/docker/libnetwork/blob/release/v0.8/datastore/datastore.go#L136.

That's a good idea. 👍

aboch · 2016-12-05T23:01:35Z

Closed by #1546

jeffutter mentioned this issue Aug 17, 2016

failed to update bridge store for object type *bridge.networkConfiguration: timeout #1399

Closed

tswift242 mentioned this issue Sep 2, 2016

Daemon hangs after starting and stopping many containers moby/moby#26251

Closed

danbeaulieu mentioned this issue Sep 8, 2016

Sporadic RPC error "containerd: container did not start before the specified timeout" moby/moby#22226

Closed

aboch mentioned this issue Oct 25, 2016

Allow client to set boltdb timeout docker/libkv#140

Merged

richardpen mentioned this issue Oct 31, 2016

failed to update bridge store for object type *bridge.bridgeEndpoint: timeout" aws/amazon-ecs-agent#574

Closed

aboch closed this as completed Dec 5, 2016

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Performance issue with persisting driver endpoint to store #1374

Performance issue with persisting driver endpoint to store #1374

coolljt0725 commented Aug 4, 2016 •

edited

Loading

danbeaulieu commented Sep 6, 2016

aboch commented Oct 5, 2016

vincer commented Oct 24, 2016

coolljt0725 commented Oct 25, 2016

aboch commented Oct 25, 2016

aboch commented Oct 25, 2016

coolljt0725 commented Oct 25, 2016

coolljt0725 commented Oct 25, 2016 •

edited

Loading

aboch commented Dec 5, 2016

Performance issue with persisting driver endpoint to store #1374

Performance issue with persisting driver endpoint to store #1374

Comments

coolljt0725 commented Aug 4, 2016 • edited Loading

danbeaulieu commented Sep 6, 2016

aboch commented Oct 5, 2016

vincer commented Oct 24, 2016

coolljt0725 commented Oct 25, 2016

aboch commented Oct 25, 2016

aboch commented Oct 25, 2016

coolljt0725 commented Oct 25, 2016

coolljt0725 commented Oct 25, 2016 • edited Loading

aboch commented Dec 5, 2016

coolljt0725 commented Aug 4, 2016 •

edited

Loading

coolljt0725 commented Oct 25, 2016 •

edited

Loading