
Driver start recoverable #1891

Merged · 2 commits merged into master from f-driver-start-recoverable on Oct 31, 2016
Conversation

@dadgar (Contributor) commented Oct 29, 2016

This PR fixes a regression in which we weren't handling recoverable errors, and adds unit tests to the task runner and docker driver to prevent future regressions.

Fixes #1858
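
For context on what "recoverable" means here, a minimal, self-contained sketch of the pattern (illustrative names only; Nomad's real helpers live in nomad/structs and differ in detail):

```go
// A minimal sketch of the recoverable-error pattern this PR restores.
// Names are illustrative, not Nomad's actual API.
package main

import (
	"errors"
	"fmt"
)

// RecoverableError wraps an error with a flag telling the task runner
// whether the failure is transient and the task may be restarted.
type RecoverableError struct {
	Err         error
	Recoverable bool
}

func (r *RecoverableError) Error() string { return r.Err.Error() }

// NewRecoverableError marks err as recoverable (restart policy applies)
// or unrecoverable (the task fails immediately).
func NewRecoverableError(err error, recoverable bool) error {
	return &RecoverableError{Err: err, Recoverable: recoverable}
}

// IsRecoverable reports whether err was marked recoverable.
func IsRecoverable(err error) bool {
	var re *RecoverableError
	return errors.As(err, &re) && re.Recoverable
}

func main() {
	err := NewRecoverableError(errors.New("Failed to create container: timeout"), true)
	if IsRecoverable(err) {
		fmt.Println("transient failure, restart policy applies:", err)
	}
}
```

Under a scheme like this, a driver wraps transient failures (such as a Docker API timeout) with recoverable=true so the client's restart policy applies instead of the allocation failing outright.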

@dadgar dadgar merged commit 9f5f130 into master Oct 31, 2016
@dadgar dadgar deleted the f-driver-start-recoverable branch October 31, 2016 17:26
@jippi (Contributor) commented Nov 2, 2016

@dadgar I'm still seeing the following under Nomad v0.5.0-rc1 ('a8c8199e413d387021a15d7a1400c8b8372124d6+CHANGES')

failed to start task 'server' for alloc '737fd32b-c732-d677-4f85-564dfcec4696': Failed to create container from image quay.io/bownty/php: Post http://unix.sock/containers/create?name=server-737fd32b-c732-d677-4f85-564dfcec4696: net/http: request canceled (Client.Timeout exceeded while awaiting headers)

Shouldn't those be recoverable too?

@dadgar (Contributor, Author) commented Nov 2, 2016

@jippi 6a0999c should fix.

@pdalbora commented Feb 8, 2017

Seeing a similar error on Nomad v0.5.4:

...
2017/02/08 18:22:38.382328 [WARN] client: failed to start task "MyTask" for alloc "1cbf1ba3-98c7-8bfc-3ee5-4b1aa4a41b12": Failed to start container c81de3b4c77f103a93f3fc5e87ab970f92c3194bff90ee0cc395ce82c6bd2179: Post http://unix.sock/containers/c81de3b4c77f103a93f3fc5e87ab970f92c3194bff90ee0cc395ce82c6bd2179/start: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
2017/02/08 18:22:38.382529 [INFO] client: Not restarting task: MyTask for alloc: 1cbf1ba3-98c7-8bfc-3ee5-4b1aa4a41b12
...

In this case, the timeout is happening when starting the container. Should this timeout be retry-able as well?

@dadgar (Contributor, Author) commented Feb 8, 2017

@pdalbora Was docker functioning properly on that machine? That endpoint timing out makes me think docker was unresponsive, and failing would be correct (future versions could use that signal to push the task onto another driver).
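
For illustration, a hedged, self-contained sketch of detecting this failure mode in Go (not Nomad's actual driver code): the "Client.Timeout exceeded while awaiting headers" error reports itself as a network timeout, which a driver could check when deciding whether to mark the error recoverable.

```go
// Self-contained sketch: detect an HTTP client timeout like the one in the
// logs above. Illustrative only; not Nomad's actual docker driver code.
package main

import (
	"errors"
	"fmt"
	"net"
	"net/http"
	"net/http/httptest"
	"time"
)

// isClientTimeout reports whether err is a network timeout, e.g.
// "net/http: request canceled (Client.Timeout exceeded while awaiting headers)".
func isClientTimeout(err error) bool {
	var netErr net.Error
	return errors.As(err, &netErr) && netErr.Timeout()
}

func main() {
	// Simulate an overloaded dockerd: the server hangs longer than the
	// client is willing to wait.
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		time.Sleep(200 * time.Millisecond)
	}))
	defer srv.Close()

	client := &http.Client{Timeout: 50 * time.Millisecond}
	_, err := client.Post(srv.URL+"/containers/create", "application/json", nil)
	fmt.Println("timeout:", isClientTimeout(err)) // timeout: true
}
```

Whether such a timeout should then be retried is exactly the policy question in this thread: a create timeout may be transient, while a start timeout may mean dockerd itself is unhealthy.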

@pdalbora commented Feb 9, 2017

@dadgar Yes, I too thought it was strange for the start endpoint to time out, but Docker otherwise seemed to be working fine. My guess is that dockerd was overloaded, as we were also pulling some pretty hefty containers on the same machine. Are there any particular logs that would be helpful for you to look at? This was in a test environment that I've since destroyed, but it's reproducible.

@dadgar (Contributor, Author) commented Feb 9, 2017

@pdalbora The most useful thing would be the reproduction steps!

@pdalbora commented
@dadgar It's reproducible in our rather complex testing environment. It will take me some time to narrow it down to a portable reproducible test case.

@dadgar (Contributor, Author) commented Feb 14, 2017

@pdalbora Hmm, okay. If you can, that would be awesome, because I haven't been able to reproduce it and thus fix it :(

@tino commented Apr 21, 2018

We ran into the same issue on 0.7.1. It looks like this in hashi-ui:
[screenshot of the hashi-ui task event list omitted]

The errors are:

21 minutes ago · Received
21 minutes ago · Task Setup · Building Task Directory
21 minutes ago · Driver · Downloading image docker.ownrepo.nl/app-backend:production-2018.14.0
21 minutes ago · Driver Failure · failed to initialize task "p-default_worker" for alloc "0b749e88-cc98-d45f-7371-3855c00b1340": Failed to pull `docker.ownrepo.nl/app-backend:production-2018.14.0`: Error: image app-backend:production-2018.14.0 not found
21 minutes ago · Not Restarting · Error was unrecoverable

It may fail on one or two machines, but there are always others that run fine. And sometimes it works on all of them after a few retries.

The weird thing is that the image does exist, since it pulls fine on the other machines.

Other times the error is:

46 minutes ago · Driver Failure · failed to initialize task "p-web" for alloc "a1944824-4434-bfce-64a3-f62d2aed9a9b": Failed to pull `docker.ownrepo.nl/app-backend/production:2018.14.0`: API error (500): {"message":"Get https://docker.ownrepo.nl/v2/: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)"}
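
These two failures arguably call for different handling: a genuinely missing image is permanent, while a registry timeout or 500 is transient. A hedged sketch of that classification (heuristics for illustration only; Nomad's actual docker driver logic differs):

```go
// Sketch of classifying docker pull failures as recoverable or not.
// String matching is a stand-in for inspecting structured API errors.
package main

import (
	"errors"
	"fmt"
	"net"
	"strings"
)

// classifyPullError decides whether a failed docker pull should be retried.
func classifyPullError(err error) (recoverable bool) {
	var netErr net.Error
	switch {
	case errors.As(err, &netErr) && netErr.Timeout():
		return true // registry slow or unreachable: retry
	case strings.Contains(err.Error(), "API error (500)"):
		return true // transient server-side failure: retry
	case strings.Contains(err.Error(), "not found"):
		return false // image genuinely missing: fail fast
	default:
		return false
	}
}

func main() {
	for _, err := range []error{
		errors.New("Error: image app-backend:production-2018.14.0 not found"),
		errors.New(`API error (500): {"message":"Get https://registry/v2/: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)"}`),
	} {
		fmt.Printf("recoverable=%v  %v\n", classifyPullError(err), err)
	}
}
```

As the report above notes, the image did exist, so a "not found" message isn't always trustworthy either; that fragility is why string matching can only ever be a heuristic.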

@goutham-sabapathy commented
What is the issue and the fix for this? We are facing the same issue.

$ nomad version
Nomad v1.1.12+ent (69b50b5)

$ docker version
Client:
 Version:       20.10.7
 API version:   1.41
 Go version:    go1.15.14
 Git commit:    f0df350
 Built:         Wed Nov 17 03:05:36 2021
 OS/Arch:       linux/amd64
 Context:       default
 Experimental:  true

Server:
 Engine:
  Version:      20.10.7
  API version:  1.41 (minimum version 1.12)
  Go version:   go1.15.14
  Git commit:   b0f5bc3
  Built:        Wed Nov 17 03:06:14 2021
  OS/Arch:      linux/amd64
  Experimental: false
 containerd:
  Version:      1.4.6
  GitCommit:    d71fcd7d8303cbf684402823e425e9dd2e99285d
 runc:
  Version:      1.0.0
  GitCommit:    84113eef6fc27af1b01b3181f31bbaf708715301
 docker-init:
  Version:      0.19.0
  GitCommit:    de40ad0

@tgross (Member) commented May 2, 2022

Hi @goutham-sabapathy! This is a long-closed issue and was on a version that had a very different model for task driver plugins. Please open a new issue describing what you're seeing. Thanks!

@github-actions (bot) commented
I'm going to lock this pull request because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active contributions.
If you have found a problem that seems related to this change, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Oct 14, 2022

Successfully merging this pull request may close these issues:

Docker "Client.Timeout" is considered fatal even though the job policy is default "retry"