
nomad fails to restart docker task when client reconnects to server after connection loss #2184

Closed
drscre opened this issue Jan 11, 2017 · 2 comments

Comments


drscre commented Jan 11, 2017

Nomad v0.5.1

I have two machines:
- a Nomad server (bootstrap = 1)
- a Nomad client that runs Docker tasks

I was investigating why my tasks ended up dead after running for a while.

It boiled down to a connectivity issue.
When the network connection between the client and the server fails, the task group enters the "lost" state.
When the client machine later rediscovers the Nomad server, Nomad presumably tries to restart the task group, and the restart fails with "Failed to create container: no such image".
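For anyone trying to reproduce this, one way to simulate the connection loss from the client side (10.0.0.1 is just a placeholder for the server address, and <job-name> stands in for my job):

# Block traffic from the client to the Nomad server to simulate the connection loss
# (10.0.0.1 is a placeholder, not my real server address):
iptables -A OUTPUT -d 10.0.0.1 -j DROP

# Once the client misses its heartbeats, the server marks the allocation "lost":
nomad status <job-name>

# Restore connectivity; the replacement allocation then fails with "no such image":
iptables -D OUTPUT -d 10.0.0.1 -j DROP
nomad status <job-name>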

Docker daemon log (registry.lingualeo-funk.com/config-service:dev-301 is the image used by the task):

Handler for POST /containers/dc1f44d9da8f37291b74c505467392668c9921d6fa73730fd88eaa7d2becb427/stop returned error: Container dc1f44d9da8f37291b74c505467392668c9921d6fa73730fd88eaa7d2becb427 is already stopped
Handler for GET /images/registry.lingualeo-funk.com/config-service:dev-301/json returned error: No such image: registry.lingualeo-funk.com/config-service:dev-301
Handler for POST /containers/cf40b4cc63b9bb3717ec9e0e31cb175e239bb3152bbe973fe2238c3f8d470239/stop returned error: Container cf40b4cc63b9bb3717ec9e0e31cb175e239bb3152bbe973fe2238c3f8d470239 is already stopped
Handler for POST /containers/create returned error: No such image: registry.lingualeo-funk.com/config-service:dev-301
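In case it helps, these are the checks I would run on the client node to see whether the image is still present locally and whether the registry is reachable (illustrative commands, not output from my machines):

# List any local copies of the image on the client node:
docker images registry.lingualeo-funk.com/config-service

# Try to pull the tag manually to rule out registry/auth problems:
docker pull registry.lingualeo-funk.com/config-service:dev-301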

Nomad status during the connection failure:

Summary
Task Group      Queued  Starting  Running  Failed  Complete  Lost
config-service  1       0         0        0       0         1

Evaluations
ID        Priority  Triggered By  Status    Placement Failures
d2297b4c  50        node-update   blocked   N/A - In Progress
0a14de18  50        node-update   complete  true
5a916b1b  50        job-register  complete  false

Placement Failure
Task Group "config-service":
  * No nodes were eligible for evaluation
  * No nodes are available in datacenter "production"

Allocations
ID        Eval ID   Node ID   Task Group      Desired  Status  Created At
38189d21  5a916b1b  d203b13d  config-service  stop     lost    01/11/17 22:43:15 UTC

Nomad status after the connection is back again:

Summary
Task Group      Queued  Starting  Running  Failed  Complete  Lost
config-service  0       0         0        1       1         0

Evaluations
ID        Priority  Triggered By  Status    Placement Failures
af5ace80  50        node-update   complete  false
5e83a333  50        node-update   complete  false
d2297b4c  50        node-update   complete  false
0a14de18  50        node-update   complete  true
5a916b1b  50        job-register  complete  false

Allocations
ID        Eval ID   Node ID   Task Group      Desired  Status    Created At
6649b0c8  d2297b4c  d203b13d  config-service  run      failed    01/11/17 22:48:18 UTC
38189d21  5a916b1b  d203b13d  config-service  stop     complete  01/11/17 22:43:15 UTC

P.S. I don't quite get why the Nomad client tries to restart the task. Why not just leave it running?


dadgar commented Feb 14, 2017

Hey, I am going to close this since a lot of the Docker issues around images were fixed in 0.5.2+. As for why it gets restarted: the server detects that the client is gone and tries to place the allocation on a new machine. You just happen to have a small enough cluster that it was replaced on the same node.
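If you want to confirm that, you can compare the lost allocation and its replacement from your output above; both should show the same node (using the 0.5-era hyphenated CLI commands):

# Original allocation that was marked lost when the node missed its heartbeats:
nomad alloc-status 38189d21

# Replacement allocation created by the node-update evaluation:
nomad alloc-status 6649b0c8

# Node that both allocations were placed on:
nomad node-status d203b13d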

dadgar closed this as completed Feb 14, 2017
@github-actions

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

github-actions bot locked as resolved and limited conversation to collaborators Dec 15, 2022