
[BUG] All stopped batch jobs restart and the Docker daemon enters a stuck state after clients are restarted. #6438

Closed
jonatasfreitasv opened this issue Oct 8, 2019 · 8 comments

Comments

jonatasfreitasv commented Oct 8, 2019

Nomad version

0.9.5

Operating system and Environment details

CentOS

Issue

Oct 08 14:43:08 nomad-client-1 containerd[1573]: time="2019-10-08T14:43:08.398942292Z" level=info msg="shim containerd-shim started" address="/containerd-shim/moby/cfbbcf22400fb08a194c43d8a828ce7cc7cba56bcc313623ff1c34b7e03fe4c9/shim.sock" debug=false pid=89331
Oct 08 14:43:08 nomad-client-1 kernel: docker0: port 399(vethc5dc747) entered blocking state
Oct 08 14:43:08 nomad-client-1 kernel: docker0: port 399(vethc5dc747) entered forwarding state
Oct 08 14:43:08 nomad-client-1 nomad[82286]: 2019-10-08T14:43:08.435Z [ERROR] client.driver_mgr.docker: failed to start container: driver=docker container_id=fcb11c224f2ccd91df9da9b48f06bacfc818d7a6f38143c71b82da421c8c5d19 error="Post http://unix.sock/containers/fcb11c224f2ccd91df9da9b48f06bacfc818d7a6f38143c71b82da421c8c5d19/start: net/http: request canceled (Client.Timeout exceeded while awaiting headers)"
Oct 08 14:43:08 nomad-client-1 nomad[82286]: 2019-10-08T14:43:08.435Z [ERROR] client.driver_mgr.docker: failed to start container: driver=docker container_id=27cc5e7fb7b9079875dbfdde0c5d9bc7cd170aab65dd858588919b761247ec26 error="Post http://unix.sock/containers/27cc5e7fb7b9079875dbfdde0c5d9bc7cd170aab65dd858588919b761247ec26/start: net/http: request canceled (Client.Timeout exceeded while awaiting headers)"
Oct 08 14:43:08 nomad-client-1 kernel: docker0: port 859(vethe11142c) entered blocking state
Oct 08 14:43:08 nomad-client-1 kernel: docker0: port 859(vethe11142c) entered disabled state
Oct 08 14:43:08 nomad-client-1 kernel: device vethe11142c entered promiscuous mode
Oct 08 14:43:08 nomad-client-1 kernel: IPv6: ADDRCONF(NETDEV_UP): vethe11142c: link is not ready
Oct 08 14:43:08 nomad-client-1 kernel: docker0: port 859(vethe11142c) entered blocking state
Oct 08 14:43:08 nomad-client-1 kernel: docker0: port 859(vethe11142c) entered forwarding state
Oct 08 14:43:08 nomad-client-1 containerd[1573]: time="2019-10-08T14:43:08.496102592Z" level=info msg="shim containerd-shim started" address="/containerd-shim/moby/97bec97c391871cbddac169f28232cef653ca2e51f5076a4c018bf67e4f3ccaa/shim.sock" debug=false pid=89353
Oct 08 14:43:08 nomad-client-1 NetworkManager[1228]: <info>  [1570545788.4965] manager: (veth1d935dd): new Veth device (/org/freedesktop/NetworkManager/Devices/2277)
Oct 08 14:43:08 nomad-client-1 NetworkManager[1228]: <info>  [1570545788.5036] manager: (vetha377293): new Veth device (/org/freedesktop/NetworkManager/Devices/2278)
Oct 08 14:43:08 nomad-client-1 NetworkManager[1228]: <info>  [1570545788.5120] device (vethc5dc747): carrier: link connected
Oct 08 14:43:08 nomad-client-1 kernel: docker0: port 860(veth6efe334) entered blocking state
Oct 08 14:43:08 nomad-client-1 kernel: docker0: port 860(veth6efe334) entered disabled state
Oct 08 14:43:08 nomad-client-1 kernel: device veth6efe334 entered promiscuous mode
Oct 08 14:43:08 nomad-client-1 kernel: IPv6: ADDRCONF(NETDEV_UP): veth6efe334: link is not ready
Oct 08 14:43:08 nomad-client-1 kernel: docker0: port 860(veth6efe334) entered blocking state
Oct 08 14:43:08 nomad-client-1 kernel: docker0: port 860(veth6efe334) entered forwarding state
Oct 08 14:43:08 nomad-client-1 NetworkManager[1228]: <info>  [1570545788.5733] manager: (veth1736d50): new Veth device (/org/freedesktop/NetworkManager/Devices/2279)
Oct 08 14:43:08 nomad-client-1 NetworkManager[1228]: <info>  [1570545788.5845] manager: (vethe11142c): new Veth device (/org/freedesktop/NetworkManager/Devices/2280)
Oct 08 14:43:08 nomad-client-1 containerd[1573]: time="2019-10-08T14:43:08.601355134Z" level=info msg="shim containerd-shim started" address="/containerd-shim/moby/12495c8ec42961d3d001af0c5ad0c18e7648318d3f502191eba0986af4cc76b7/shim.sock" debug=false pid=89371
Oct 08 14:43:08 nomad-client-1 NetworkManager[1228]: <info>  [1570545788.6217] manager: (veth8bbab06): new Veth device (/org/freedesktop/NetworkManager/Devices/2281)
Oct 08 14:43:08 nomad-client-1 NetworkManager[1228]: <info>  [1570545788.6282] manager: (veth6efe334): new Veth device (/org/freedesktop/NetworkManager/Devices/2282)
Oct 08 14:43:08 nomad-client-1 nomad[82286]: 2019-10-08T14:43:08.661Z [ERROR] client.driver_mgr.docker: failed to start container: driver=docker container_id=071c6bc449de9e0ddbc31355e30636d97625696ee1fdd0d96750e1289bc1e592 error="Post http://unix.sock/containers/071c6bc449de9e0ddbc31355e30636d97625696ee1fdd0d96750e1289bc1e592/start: net/http: request canceled (Client.Timeout exceeded while awaiting headers)"
Oct 08 14:43:08 nomad-client-1 nomad[82286]: 2019-10-08T14:43:08.666Z [ERROR] client.driver_mgr.docker: failed to start container: driver=docker container_id=521d4b7ef2195df0afae7511685dcd87971b62a703073cbb1654c859103bc7cd error="Post http://unix.sock/containers/521d4b7ef2195df0afae7511685dcd87971b62a703073cbb1654c859103bc7cd/start: net/http: request canceled (Client.Timeout exceeded while awaiting headers)"
Oct 08 14:43:08 nomad-client-1 kernel: docker0: port 859(vethe11142c) entered disabled state
Oct 08 14:43:08 nomad-client-1 kernel: docker0: port 860(veth6efe334) entered disabled state
Oct 08 14:43:08 nomad-client-1 kernel: IPv6: ADDRCONF(NETDEV_CHANGE): vethbe1c687: link becomes ready
Oct 08 14:43:08 nomad-client-1 kernel: docker0: port 388(vethbe1c687) entered blocking state
Oct 08 14:43:08 nomad-client-1 kernel: docker0: port 388(vethbe1c687) entered forwarding state
Oct 08 14:43:08 nomad-client-1 nomad[82286]: 2019-10-08T14:43:08.785Z [ERROR] client.driver_mgr.docker: failed to start container: driver=docker container_id=d8909ced71d32fbd3ddf39c0f27ea816b443b23d8fd94294c66038303ea02d5f error="Post http://unix.sock/containers/d8909ced71d32fbd3ddf39c0f27ea816b443b23d8fd94294c66038303ea02d5f/start: net/http: request canceled (Client.Timeout exceeded while awaiting headers)"
Oct 08 14:43:08 nomad-client-1 NetworkManager[1228]: <info>  [1570545788.7937] device (vethbe1c687): carrier: link connected
Oct 08 14:43:08 nomad-client-1 nomad[82286]: 2019-10-08T14:43:08.811Z [ERROR] client.driver_mgr.docker: failed to start container: driver=docker container_id=188f3a244b18f38c0dca84a1ab2ce4426b0adba8d028b926dce92fd792870299 error="Post http://unix.sock/containers/188f3a244b18f38c0dca84a1ab2ce4426b0adba8d028b926dce92fd792870299/start: net/http: request canceled (Client.Timeout exceeded while awaiting headers)"
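
The repeated "Client.Timeout exceeded while awaiting headers" errors above mean the Docker API on the unix socket has stopped answering. As a sketch, the daemon being wedged can be confirmed with standard Docker and systemd tooling (the docker service/unit name here is an assumption for this CentOS host):

    docker info                   # hangs or times out while the daemon is stuck
    sudo journalctl -u docker -e  # jump to the most recent daemon log entries
    top -b -n 1 | head -20        # a dockerd process pegged on CPU matches the reported symptom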

Hello!

After I ran a large number of batch jobs (10k+) and they all finished, stopping and starting the Nomad clients puts the Docker daemon into a stuck state where it consumes all CPU and I/O.

It also starts all of the stopped batch jobs again...

Any idea about this?
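
For reference, a minimal sketch of a reproduction under the conditions described; the job spec, image, and file names below are illustrative, not taken from the report:

    # Register a short-lived Docker batch job (repeat with many copies to approximate 10k+ runs).
    cat > batch-example.nomad <<'EOF'
    job "batch-example" {
      datacenters = ["dc1"]
      type        = "batch"
      group "g" {
        task "t" {
          driver = "docker"
          config {
            image   = "alpine:3.10"
            command = "echo"
            args    = ["done"]
          }
        }
      }
    }
    EOF
    nomad job run batch-example.nomad

    # Once all jobs have finished, restart the Nomad client:
    sudo systemctl restart nomad
    # Reported symptom: dockerd consumes all CPU and I/O, and the already
    # completed batch allocations are started again.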

jonatasfreitasv changed the title from "[BUG - Docker] Docker daemon enter in stuck state" to "[BUG] All stopped jobs restart and Docker daemon enter in stuck state after clients is restarted." Oct 8, 2019
jonatasfreitasv changed the title from "[BUG] All stopped jobs restart and Docker daemon enter in stuck state after clients is restarted." to "[BUG] All stopped batch jobs restart and Docker daemon enter in stuck state after clients is restarted." Oct 8, 2019
cgbaker self-assigned this Oct 8, 2019
jonatasfreitasv (Author)

@cgbaker is there any other information I should send for the investigation?

jonatasfreitasv (Author)

Are these errors normal?

Oct 08 18:37:01 nomad-client-1 nomad[15674]: 2019-10-08T18:37:01.769Z [ERROR] client.alloc_runner.task_runner.task_hook.logmon.nomad: reading plugin stderr: alloc_id=cb7af9ea-6d20-e890-3a09-dab205d967cc task=c1dc61df.nor-542435181 error="read |0: file already closed"
Oct 08 18:37:02 nomad-client-1 nomad[15674]: 2019-10-08T18:37:02.364Z [ERROR] client.alloc_runner.task_runner.task_hook.stats_hook: failed to start stats collection for task with unrecoverable error: alloc_id=77afcec7-08a9-70ee-780d-5b4607d22b9e task=c1dc61df.rio-541653320 error="container stopped"
Oct 08 18:37:02 nomad-client-1 nomad[15674]: 2019-10-08T18:37:02.959Z [ERROR] client.alloc_runner.task_runner.task_hook.stats_hook: failed to start stats collection for task with unrecoverable error: alloc_id=5e399533-2dd6-c62b-3208-1f0517d5707a task=c1dc61df.sul-545505121 error="container stopped"
Oct 08 18:37:03 nomad-client-1 nomad[15674]: 2019-10-08T18:37:03.714Z [ERROR] client.alloc_runner.task_runner.task_hook.stats_hook: failed to start stats collection for task with unrecoverable error: alloc_id=941a1e25-4274-853a-9883-28a8151ee513 task=c1dc61df.spo-519940550 error="container stopped"
Oct 08 18:37:04 nomad-client-1 nomad[15674]: 2019-10-08T18:37:04.210Z [ERROR] client.alloc_runner.task_runner.task_hook.stats_hook: failed to start stats collection for task with unrecoverable error: alloc_id=3d3e10dd-4ce4-da47-0bde-b9fa052603dd task=c1dc61df.rio-537957662 error="container stopped"

cgbaker (Contributor) commented Oct 8, 2019

Thanks, @jonatasfreitasv. I reproduced the issue and I'm tracking down the cause. I'll let you know if there's anything else I need.

cgbaker (Contributor) commented Oct 8, 2019

@jonatasfreitasv: this issue was fixed by #6216 and/or #6207, which are part of 0.9.6.
Please try 0.9.6 and let me know if you still see the issue.
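
A sketch of the verification steps implied here (service and unit names assumed):

    nomad version                 # should report Nomad v0.9.6 after upgrading
    sudo systemctl restart nomad  # restart the client on the patched binary
    nomad job status              # previously completed batch jobs should stay dead
    docker info                   # the daemon should now respond promptly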

jonatasfreitasv (Author)

@cgbaker Nice!!! I'm starting to test now.

cgbaker (Contributor) commented Oct 9, 2019

Did 0.9.6 resolve this issue? If so, please close this issue on GitHub.

jonatasfreitasv (Author)

Yes! For now it's OK!!! Thanks a lot...

github-actions (bot)

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

github-actions bot locked as resolved and limited conversation to collaborators Nov 18, 2022