Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Scheduling Jobs with Docker Fails with Error "bind: address already in use" #2402

Closed
xytis opened this issue Mar 6, 2017 · 19 comments
Closed

Comments

@xytis
Copy link

xytis commented Mar 6, 2017

Nomad version

Nomad v0.5.4

Operating system and Environment details

$ uname -a
Linux docker-06 3.10.0-514.2.2.el7.x86_64 #1 SMP Tue Dec 6 23:06:41 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux
$ docker version
Server:
 Version:      1.12.6
 API version:  1.24
 Go version:   go1.6.4
 Git commit:   78d1802
 Built:        Tue Jan 10 20:20:01 2017
 OS/Arch:      linux/amd64

Issue

Reproduced #1814

Reproduction steps

Happens in production. Not reproduced in test environment, suspect that nomad performs allocation on a port which was just released from previous allocation.

Logs

$ journalctl
Mar 06 11:40:11 docker-06 nomad[77332]: 2017/03/06 11:40:11.913373 [DEBUG] driver.docker: docker pull <REDACTED> succeeded
Mar 06 11:40:11 docker-06 nomad[77332]: 2017/03/06 11:40:11 [DEBUG] plugin: starting plugin: /usr/sbin/nomad []string{"/usr/sbin/nomad", "executor", "{\"LogFile\":\"/var/nomad/alloc/a8cc717c-a557-c4a4-6dd8-0e269506a31f/<REDACTED>/executor.out\",\"LogLevel\":\"DEBUG\"}"}
Mar 06 11:40:11 docker-06 nomad[77332]: 2017/03/06 11:40:11 [DEBUG] plugin: waiting for RPC address for: /usr/sbin/nomad
Mar 06 11:40:11 docker-06 nomad[77332]: 2017/03/06 11:40:11 [DEBUG] plugin: nomad: 2017/03/06 11:40:11 [DEBUG] plugin: plugin address: unix /tmp/plugin543102755
Mar 06 11:40:12 docker-06 nomad[77332]: 2017/03/06 11:40:12.002221 [DEBUG] driver.docker: Setting default logging options to syslog and unix:///tmp/plugin324275750
Mar 06 11:40:12 docker-06 nomad[77332]: 2017/03/06 11:40:12.002268 [DEBUG] driver.docker: Using config for logging: {Type:syslog ConfigRaw:[] Config:map[syslog-address:unix:///tmp/plugin324275750]}
Mar 06 11:40:12 docker-06 nomad[77332]: 2017/03/06 11:40:12.002276 [DEBUG] driver.docker: using 268435456 bytes memory for <REDACTED>
Mar 06 11:40:12 docker-06 nomad[77332]: 2017/03/06 11:40:12.002281 [DEBUG] driver.docker: using 500 cpu shares for <REDACTED>
Mar 06 11:40:12 docker-06 nomad[77332]: 2017/03/06 11:40:12.002301 [DEBUG] driver.docker: binding directories []string{"/var/nomad/alloc/a8cc717c-a557-c4a4-6dd8-0e269506a31f/alloc:/alloc", "/var/nomad/alloc/a8cc717c-a557-c4a4-6dd8-0e269506a31f/<REDACTED>/local:/local", "/var/nomad/alloc/a8cc717c-a557-c4a4-6dd8-0e269506a31f/<REDACTED>/secrets:/secrets"} for <REDACTED>
Mar 06 11:40:12 docker-06 nomad[77332]: 2017/03/06 11:40:12.002316 [DEBUG] driver.docker: allocated port 10.5.244.35:26136 -> 80 (mapped)
Mar 06 11:40:12 docker-06 nomad[77332]: 2017/03/06 11:40:12.002329 [DEBUG] driver.docker: exposed port 80
Mar 06 11:40:12 docker-06 nomad[77332]: 2017/03/06 11:40:12.002407 [DEBUG] driver.docker: setting container name to: <REDACTED>-a8cc717c-a557-c4a4-6dd8-0e269506a31f
Mar 06 11:40:12 docker-06 nomad[77332]: 2017/03/06 11:40:12.005095 [DEBUG] http: Request /v1/node/c9aa33c0-1b3f-1f11-3857-f1bf10e41408/allocations (235.275834ms)
Mar 06 11:40:12 docker-06 nomad[77332]: 2017/03/06 11:40:12.026095 [INFO] driver.docker: created container 0019ed6ea9f3716823bf3242d52051723db38a4c16497b58efd258c0f5790d1e
Mar 06 11:40:12 docker-06 telegraf[147298]: 2017-03-06T11:40:12Z E! ERROR in input [inputs.exec]: Errors encountered: [exec: Command timed out. for command '/usr/local/sbin/parse_nomad.sh']
Mar 06 11:40:12 docker-06 dockerd[3071]: time="2017-03-06T11:40:12.474919534Z" level=warning msg="Failed to allocate and map port 26136-26136: Error starting userland proxy: listen tcp 10.5.244.35:26136: bind: address already in use"
Mar 06 11:40:12 docker-06 dockerd[3071]: time="2017-03-06T11:40:12.560294851Z" level=error msg="Handler for POST /containers/0019ed6ea9f3716823bf3242d52051723db38a4c16497b58efd258c0f5790d1e/start returned error: driver failed programming external connectivity on endpoint <REDACTED>-a8cc717c-a557-c4a4-6dd8-0e269506a31f (34696001c3864bc0dbfda1b74422a6b12c46b1d2e81450066407ea0bb42d49c8): Error starting userland proxy: listen tcp 10.5.244.35:26136: bind: address already in use"
Mar 06 11:40:12 docker-06 nomad[77332]: 2017/03/06 11:40:12.560537 [DEBUG] driver.docker: failed to start container "0019ed6ea9f3716823bf3242d52051723db38a4c16497b58efd258c0f5790d1e" (attempt 1): API error (500): {"message":"driver failed programming external connectivity on endpoint <REDACTED>-a8cc717c-a557-c4a4-6dd8-0e269506a31f (34696001c3864bc0dbfda1b74422a6b12c46b1d2e81450066407ea0bb42d49c8): Error starting userland proxy: listen tcp 10.5.244.35:26136: bind: address already in use"}
Mar 06 11:40:14 docker-06 dockerd[3071]: time="2017-03-06T11:40:14.016189913Z" level=warning msg="Failed to allocate and map port 26136-26136: Error starting userland proxy: listen tcp 10.5.244.35:26136: bind: address already in use"
Mar 06 11:40:14 docker-06 dockerd[3071]: time="2017-03-06T11:40:14.101933907Z" level=error msg="Handler for POST /containers/0019ed6ea9f3716823bf3242d52051723db38a4c16497b58efd258c0f5790d1e/start returned error: driver failed programming external connectivity on endpoint <REDACTED>-a8cc717c-a557-c4a4-6dd8-0e269506a31f (382c6dc0b84f485d3b6c3bd80f9f0743b6f7a16e79fb440aa000e98ed54ee9d3): Error starting userland proxy: listen tcp 10.5.244.35:26136: bind: address already in use"
Mar 06 11:40:14 docker-06 nomad[77332]: 2017/03/06 11:40:14.102148 [DEBUG] driver.docker: failed to start container "0019ed6ea9f3716823bf3242d52051723db38a4c16497b58efd258c0f5790d1e" (attempt 2): API error (500): {"message":"driver failed programming external connectivity on endpoint <REDACTED>-a8cc717c-a557-c4a4-6dd8-0e269506a31f (382c6dc0b84f485d3b6c3bd80f9f0743b6f7a16e79fb440aa000e98ed54ee9d3): Error starting userland proxy: listen tcp 10.5.244.35:26136: bind: address already in use"}
Mar 06 11:40:15 docker-06 telegraf[147298]: 2017-03-06T11:40:15Z E! ERROR: input [inputs.procstat] took longer to collect than collection interval (10s)
Mar 06 11:40:15 docker-06 dockerd[3071]: time="2017-03-06T11:40:15.556613480Z" level=warning msg="Failed to allocate and map port 26136-26136: Error starting userland proxy: listen tcp 10.5.244.35:26136: bind: address already in use"
Mar 06 11:40:15 docker-06 dockerd[3071]: time="2017-03-06T11:40:15.638172362Z" level=error msg="Handler for POST /containers/0019ed6ea9f3716823bf3242d52051723db38a4c16497b58efd258c0f5790d1e/start returned error: driver failed programming external connectivity on endpoint <REDACTED>-a8cc717c-a557-c4a4-6dd8-0e269506a31f (9a082444d249b134693efc07026bb351c20075bc117fa1038ca9d731c24a73c5): Error starting userland proxy: listen tcp 10.5.244.35:26136: bind: address already in use"
Mar 06 11:40:15 docker-06 nomad[77332]: 2017/03/06 11:40:15.638367 [DEBUG] driver.docker: failed to start container "0019ed6ea9f3716823bf3242d52051723db38a4c16497b58efd258c0f5790d1e" (attempt 3): API error (500): {"message":"driver failed programming external connectivity on endpoint <REDACTED>-a8cc717c-a557-c4a4-6dd8-0e269506a31f (9a082444d249b134693efc07026bb351c20075bc117fa1038ca9d731c24a73c5): Error starting userland proxy: listen tcp 10.5.244.35:26136: bind: address already in use"}
Mar 06 11:40:17 docker-06 dockerd[3071]: time="2017-03-06T11:40:17.094884796Z" level=warning msg="Failed to allocate and map port 26136-26136: Error starting userland proxy: listen tcp 10.5.244.35:26136: bind: address already in use"
Mar 06 11:40:17 docker-06 dockerd[3071]: time="2017-03-06T11:40:17.172650262Z" level=error msg="Handler for POST /containers/0019ed6ea9f3716823bf3242d52051723db38a4c16497b58efd258c0f5790d1e/start returned error: driver failed programming external connectivity on endpoint <REDACTED>-a8cc717c-a557-c4a4-6dd8-0e269506a31f (2384c371c651564a75779b5f263a8876bdb954e8b4f6c56ce71e314f3c572845): Error starting userland proxy: listen tcp 10.5.244.35:26136: bind: address already in use"
Mar 06 11:40:17 docker-06 nomad[77332]: 2017/03/06 11:40:17.172823 [DEBUG] driver.docker: failed to start container "0019ed6ea9f3716823bf3242d52051723db38a4c16497b58efd258c0f5790d1e" (attempt 4): API error (500): {"message":"driver failed programming external connectivity on endpoint <REDACTED>-a8cc717c-a557-c4a4-6dd8-0e269506a31f (2384c371c651564a75779b5f263a8876bdb954e8b4f6c56ce71e314f3c572845): Error starting userland proxy: listen tcp 10.5.244.35:26136: bind: address already in use"}
Mar 06 11:40:18 docker-06 dockerd[3071]: time="2017-03-06T11:40:18.635416443Z" level=warning msg="Failed to allocate and map port 26136-26136: Error starting userland proxy: listen tcp 10.5.244.35:26136: bind: address already in use"
Mar 06 11:40:18 docker-06 dockerd[3071]: time="2017-03-06T11:40:18.703832987Z" level=error msg="Handler for POST /containers/0019ed6ea9f3716823bf3242d52051723db38a4c16497b58efd258c0f5790d1e/start returned error: driver failed programming external connectivity on endpoint <REDACTED>-a8cc717c-a557-c4a4-6dd8-0e269506a31f (804fa1f7d69385c8ccdcf2e468b8e26f6e4dd1d2549c657bc6db52a48b6faca9): Error starting userland proxy: listen tcp 10.5.244.35:26136: bind: address already in use"
Mar 06 11:40:18 docker-06 nomad[77332]: 2017/03/06 11:40:18.704044 [DEBUG] driver.docker: failed to start container "0019ed6ea9f3716823bf3242d52051723db38a4c16497b58efd258c0f5790d1e" (attempt 5): API error (500): {"message":"driver failed programming external connectivity on endpoint <REDACTED>-a8cc717c-a557-c4a4-6dd8-0e269506a31f (804fa1f7d69385c8ccdcf2e468b8e26f6e4dd1d2549c657bc6db52a48b6faca9): Error starting userland proxy: listen tcp 10.5.244.35:26136: bind: address already in use"}
Mar 06 11:40:20 docker-06 nomad[77332]: 2017/03/06 11:40:20.128229 [DEBUG] http: Request /v1/agent/self (565.367µs)
Mar 06 11:40:20 docker-06 dockerd[3071]: time="2017-03-06T11:40:20.146648650Z" level=warning msg="Failed to allocate and map port 26136-26136: Error starting userland proxy: listen tcp 10.5.244.35:26136: bind: address already in use"
Mar 06 11:40:20 docker-06 nomad[77332]: 2017/03/06 11:40:20.186162 [DEBUG] http: Request /v1/nodes?prefix=c9aa33c0-1b3f-1f11-3857-f1bf10e41408 (55.405756ms)
$ cat /var/nomad/alloc/a8cc717c-a557-c4a4-6dd8-0e269506a31f/<REDACTED>/executor.out
2017/03/06 11:40:12.000996 [DEBUG] sylog-server: launching syslog server on addr: /tmp/plugin324275750
$ nomad alloc-status a8cc717c
ID                  = a8cc717c
Eval ID             = ae9a7ce2
Name                = <REDACTED>.<REDACTED>[3]
Node ID             = c9aa33c0
Job ID              = <REDACTED>
Client Status       = failed
Client Description  = <none>
Desired Status      = run
Desired Description = <none>
Created At          = 03/06/17 11:39:56 UTC

Task "<REDACTED>" is "dead"
Task Resources
CPU      Memory   Disk  IOPS  Addresses
500 MHz  256 MiB  0 B   0     http: 10.5.244.35:26136

Recent Events:
Time                   Type            Description
03/06/17 11:40:20 UTC  Not Restarting  Error was unrecoverable
03/06/17 11:40:20 UTC  Driver Failure  failed to start task "<REDACTED>" for alloc "a8cc717c-a557-c4a4-6dd8-0e269506a31f": Failed to start container 0019ed6ea9f3716823bf3242d52051723db38a4c16497b58efd258c0f5790d1e: API error (500): {"message":"driver failed programming external connectivity on endpoint <REDACTED>-a8cc717c-a557-c4a4-6dd8-0e269506a31f (a4cedc9e5c29631b83321bd15ca1a6fb1a7da4057f1b3e871d8146ed239e8e38): Error starting userland proxy: listen tcp 10.5.244.35:26136: bind: address already in use"}
03/06/17 11:39:57 UTC  Driver          Downloading image <REDACTED>
03/06/17 11:39:57 UTC  Received        Task received by client

Note that nomad client is not responsive during the deploy attempts (that might be a separate bug)

$ cat /usr/local/sbin/parse_nomad.sh
#!/bin/bash
export NOMAD_ADDR=https://localhost:4646
export NOMAD_CLIENT_CERT=/etc/nomad/tls/nomad.pem
export NOMAD_CLIENT_KEY=/etc/nomad/tls/nomad-key.pem
export NOMAD_CACERT=/etc/nomad/tls/ca.pem
STR=$(nomad node-status -self | grep Allocated -A 2 )

echo "${STR}" |  awk -F'[ /]+' 'NR == "3" {print "{\"cur_CPU\": "$1",\"max_CPU\": "$2",\"cur_Mem\": "$4",\"max_Mem\": "$6",\"cur_Disc\": "$8",\"max_Disc\": "$10"}"}'
@gokhansengun
Copy link

gokhansengun commented Mar 7, 2017

+1 to this, happening time-to-time in my test cluster too. Was not lucky enough yet to see what was wrong but suspect that Nomad is trying to allocate a TCP port in a TIME_WAIT state. I see this issue in v0.5.2

@dadgar
Copy link
Contributor

dadgar commented Mar 10, 2017

@xytis Can you show the output of cat /proc/sys/net/ipv4/ip_local_port_range

@xytis
Copy link
Author

xytis commented Mar 10, 2017

[root@docker-06 vytis]# cat /proc/sys/net/ipv4/ip_local_port_range
10000	65000

@dadgar
Copy link
Contributor

dadgar commented Mar 13, 2017

@xytis It might be too late but is anything bound to that port?

@xytis
Copy link
Author

xytis commented Mar 13, 2017

I can not be certain. But host is running docker and nomad. I suspected they could have made the collision happen on their own, due to some race condition.

At the time of reporting, the port was free.

@gokhansengun
Copy link

Happened to me again. This time after an allocation was unsuccessful in the first attempt for some other reason and then continuing in the second attempt. Second attempt resulted in with this issue.

@dadgar
Copy link
Contributor

dadgar commented Mar 14, 2017

@gokhansengun can you share the output of nomad alloc-status <id>

@gokhansengun
Copy link

@dadgar I am running a test cluster where I test lots of things. Sorry cleaned everything for this time but happening nearly daily, will update here with the info.

@gokhansengun
Copy link

@dadgar did not need to wait until tomorrow, happened again, logs below.

ubuntu@worker-node-01:~/mgmt/scripts$ nomad alloc-status baa659f8
ID                 = baa659f8
Eval ID            = bd5361d2
Name               = backoffice-portal.backoffice-portal[0]
Node ID            = 092f4d9f
Job ID             = backoffice-portal
Client Status      = failed
Client Description = <none>
Created At         = 03/14/17 11:18:21 UTC

Task "backoffice-portal" is "dead"
Task Resources
CPU      Memory   Disk  IOPS  Addresses
600 MHz  512 MiB  0 B   0     web: 172.22.44.121:46814

Recent Events:
Time                   Type            Description
03/14/17 11:18:57 UTC  Not Restarting  Error was unrecoverable
03/14/17 11:18:57 UTC  Driver Failure  failed to start task 'backoffice-portal' for alloc 'baa659f8-2243-7b64-2681-8cad79b462c8': Failed to start container 909acca03edc1e608d46207bfe84c42e7acaa96b77bf22024742569512c71169: API error (500): {"message":"driver failed programming external connectivity on endpoint backoffice-portal-baa659f8-2243-7b64-2681-8cad79b462c8 (7ee29894bb1aeca83d70110b658b7906b9bc669919f99261ab1ee7a29f7aec4c): Error starting userland proxy: listen tcp 172.22.44.121:46814: bind: address already in use"}
03/14/17 11:18:22 UTC  Received        Task received by client
CONTAINER ID        IMAGE                              COMMAND                  CREATED             STATUS              PORTS                                                                                                                                                                                        NAMES
909acca03edc        gsengun/bio-id-backoffice-portal   "/bin/bash /script..."   7 minutes ago       Created                                                                                                                                                                                                          backoffice-portal-baa659f8-2243-7b64-2681-8cad79b462c8
8e10af305af5        mongo:3.4.0                        "/entrypoint.sh mo..."   10 minutes ago      Up 10 minutes       172.22.44.121:27017->27017/tcp, 172.22.44.121:27017->27017/udp                                                                                                                               mongodb-cluster-a71970cb-1479-fcc0-65d8-66415b21ea79
ubuntu@worker-node-01:~/mgmt/scripts$ sudo netstat -p | grep 46814
tcp        0      0 172.22.44.121:46814     172.22.44.121:8500      ESTABLISHED 2831/nomad
tcp6       0      0 172.22.44.121:8500      172.22.44.121:46814     ESTABLISHED 2748/consul

@dadgar
Copy link
Contributor

dadgar commented Mar 14, 2017

@gokhansengun If the allocations are all still there or next time it happens can you run:

$ curl http://127.0.0.1:4646/v1/node/<node-id>/allocations?pretty=true
$ sudo netstat -nlp

@gokhansengun
Copy link

@dadgar added logs to secret Gist and shared with you in Gitter.

@ghost
Copy link

ghost commented May 20, 2017

We see this issue about once a week in our production env. I would like to see Nomad intercept this error and attempt to restart the container on another random port. Is this possible?

@dadgar
Copy link
Contributor

dadgar commented May 24, 2017

@gokhansengun Did changing the available port range fix this for you? We talked about that on Gitter.

@gokhansengun
Copy link

@dadgar sorry Alex, I did not update you guys on this. I waited 0.5.6 before doing the change you suggested. Then all of a sudden, issue has not occurred again. I am not sure if it was < 0.5.6 or the refactoring I did all over the job dispatching.

@dadgar
Copy link
Contributor

dadgar commented May 25, 2017

@gokhansengun Thanks for the update.

@grobinson-blockchain What version of Nomad are you on?

@ghost
Copy link

ghost commented May 25, 2017

We are on Nomad v0.5.6.

@anishmashankar
Copy link

We are facing the same issue on Nomad v0.5.6. A job with 120 task-groups with one task each. One of the tasks did not start for the same reason.

@djenriquez
Copy link

djenriquez commented May 31, 2018

Not sure if this is fixed. Running Nomad v0.8.0 saw this in a deployment of a docker task with a count of 70:

nomad -v
Nomad v0.8.0 (751b42aaa60f1bc679e5062e16db7d15f8ff9654)

docker -v
Docker version 17.12.1-ce, build 3dfb8343b139d6342acfd9975d7f1068b5b1c3d3
Failed to start container 25d554b3682d850580bb41d223cb82db891932f5511e191bad822b11c11379c9: API error (500): {"message":"driver failed programming external connectivity on endpoint <REDACTED>-9b51d619-f7ea-105d-c6a0-40fec309c38e (4c0681fe8b3a36ae62bdf225a2d2531bf6b494268852763534585860e78e0544): Error starting userland proxy: listen tcp 10.7.10.154:28268: bind: address already in use

I can't get this thing to get out of the pending state =/. Had to drain the node it was on then disable the drain. Any ideas?

@github-actions
Copy link

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Nov 30, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

No branches or pull requests

6 participants