Docker "Client.Timeout" is considered fatal even though the job policy is default "retry" #1858

Closed
jippi opened this issue Oct 25, 2016 · 2 comments · Fixed by #1891

jippi commented Oct 25, 2016

Nomad Version

0.4.1

Driver

Docker

Problem

Retrying a Docker container doesn't work when the Docker call times out.

A timeout talking to the Docker daemon should not be fatal; it could be retried. I'm not even sure why it would time out, since the server is more or less idle, with no CPU, RAM, or IO pressure.

I'm not sure exactly which step is running when this happens. If it's a slow image pull from quay.io, it should still be retried, maybe even on a different client?

-> nomad alloc-status -stats 605be8d0-ceb9-ea6e-8596-5c6bc90d1d15
ID            = 605be8d0
Eval ID       = cdf62a8d
Name          = production-popularity.impression-counter[0]
Node ID       = 22c97a7e
Job ID        = production-popularity
Client Status = failed

Task "server" is "dead"
Task Resources
CPU      Memory   Disk     IOPS  Addresses
500 MHz  256 MiB  300 MiB  0

Recent Events:
Time                   Type            Description
10/25/16 11:01:10 UTC  Not Restarting  Error was unrecoverable
10/25/16 11:01:10 UTC  Driver Failure  failed to start task 'server' for alloc '605be8d0-ceb9-ea6e-8596-5c6bc90d1d15': Failed to create container from image quay.io/bownty/php: Post http://unix.sock/containers/create?name=server-605be8d0-ceb9-ea6e-8596-5c6bc90d1d15: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
10/25/16 10:59:59 UTC  Received        Task received by client

Job File

https://gist.github.com/jippi/d2c60ae634f931ef379652481c57216f

Client log file

https://gist.github.com/jippi/2bef00eb2d3c335c8d9a98fd2ceeeb99

The log contains 50 lines of "grep context" around the allocation ID.

Please don't worry about the Nomad version being "0.4.3"; it's a 0.4.1 tag build with #1816 and #1762 cherry-picked in, as recommended by @dadgar

jippi commented Oct 28, 2016

I'm getting hit by this more and more often.....

Every 0.2s: nomad alloc-status -stats 31079a1f    Fri Oct 28 05:57:36 2016

ID            = 31079a1f
Eval ID       = 979c8bae
Name          = production-importer.state-machine[0]
Node ID       = 22c97a7e
Job ID        = production-importer
Client Status = failed

Task "server" is "dead"
Task Resources
CPU      Memory   Disk     IOPS  Addresses
500 MHz  256 MiB  300 MiB  0

Recent Events:
Time                   Type            Description
10/28/16 05:55:45 UTC  Not Restarting  Error was unrecoverable
10/28/16 05:55:45 UTC  Driver Failure  failed to start task 'server' for alloc '31079a1f-4038-5903-80a0-161ffa17052d': Failed to create container from image quay.io/bownty/php: Post http://unix.sock/containers/create?name=server-31079a1f-4038-5903-80a0-161ffa17052d: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
10/28/16 05:54:34 UTC  Received        Task received by client

Adding "docker.cleanup.image" = "false" to the client options does not alleviate the issue either.

Could there be a retry or a longer timeout when starting containers? Maybe one that is configurable on the client?

As suggested initially, this kind of error should be retried on the client, or retried on a different box, as it's a highly transient, non-permanent error condition.
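
To make the request concrete, this is roughly the retry behaviour I have in mind, as a small Go sketch. It is only an illustration under assumptions, not how Nomad or the Docker driver work today: `retry`, `flakyCreate`, and `errTimeout` are hypothetical names, and the attempt count and delay stand in for whatever would be configurable on the client.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// errTimeout is a stand-in for the transient Docker timeout.
var errTimeout = errors.New("Client.Timeout exceeded while awaiting headers")

// retry calls op up to attempts times, sleeping between failed tries and
// doubling the delay each time. It returns nil on the first success, or the
// last error wrapped with the number of attempts made.
func retry(attempts int, delay time.Duration, op func() error) error {
	if attempts < 1 {
		attempts = 1
	}
	var err error
	for i := 0; i < attempts; i++ {
		if err = op(); err == nil {
			return nil
		}
		if i < attempts-1 {
			time.Sleep(delay)
			delay *= 2
		}
	}
	return fmt.Errorf("gave up after %d attempts: %w", attempts, err)
}

// flakyCreate simulates a container create call that times out `failures`
// times before succeeding; calls counts how often it was invoked.
func flakyCreate(failures int, calls *int) func() error {
	return func() error {
		*calls++
		if *calls <= failures {
			return errTimeout
		}
		return nil
	}
}

func main() {
	calls := 0
	err := retry(3, 10*time.Millisecond, flakyCreate(2, &calls))
	fmt.Println(err, calls) // <nil> 3
}
```

A real implementation would presumably retry only errors known to be transient and leave everything else fatal.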

@github-actions

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Dec 18, 2022