
docker driver stalls downloading images with poor DNS servers #157

Closed
sarahhodne opened this issue Sep 29, 2015 · 4 comments

Comments

@sarahhodne

I was going through the tutorial at home and couldn't get any of the Docker containers to start up. Every time I ran nomad run, the job would just get stuck in a "pending" state, even if I waited for 30 minutes:

vagrant@nomad:/vagrant$ nomad status example
ID          = example
Name        = example
Type        = service
Priority    = 50
Datacenters = dc1
Status      = <none>

==> Evaluations
ID                                    Priority  TriggeredBy   Status
7c8d14cd-f54d-678b-dac0-faafce248b9a  50        job-register  complete

==> Allocations
ID                                    EvalID                                NodeID                                TaskGroup  Desired  Status
bce93254-48a6-49ac-9de2-0283077b1556  7c8d14cd-f54d-678b-dac0-faafce248b9a  73eeaf7a-bf6b-800f-b567-7377697c075c  cache      run      pending
d380e273-296c-86de-5e5d-12bbccc1ba86  7c8d14cd-f54d-678b-dac0-faafce248b9a  73eeaf7a-bf6b-800f-b567-7377697c075c  cache      run      pending
ffaf4920-162d-47d3-1041-8ccd8b460bcb  7c8d14cd-f54d-678b-dac0-faafce248b9a  9c36d562-60d9-04ff-112a-7be192faba3e  cache      run      pending

I ran sudo docker images to see if the images had been downloaded at all, since I suspected that's where it was getting stuck, and saw this:

vagrant@nomad:/vagrant$ sudo docker images
REPOSITORY          TAG                 IMAGE ID            CREATED             VIRTUAL SIZE
<none>              <none>              d688671efba6        2 weeks ago         109 MB

So it looks like something stopped working partway through the docker pull, and it never got to the tagging stage. In my case it seems to have been a DNS server that wasn't responding much of the time, causing a lot of DNS resolution timeouts. That this makes docker pull fail is arguably a Docker bug, but it would be nice if Nomad could somehow detect when it happens so the allocation isn't left pending forever.
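A note for anyone hitting the same symptom: it's worth confirming on the client node that the stall really is inside the Docker daemon's pull and that registry DNS is the culprit. A rough sketch (the redis image is the one the getting-started tutorial uses; the journalctl line assumes a systemd-managed Docker, so substitute your init system's equivalent otherwise):

# Can the client node resolve the registry reliably? Run this a few times.
vagrant@nomad:/vagrant$ nslookup registry-1.docker.io

# Pull the tutorial image by hand; if this hangs too, the stall is in dockerd, not Nomad.
vagrant@nomad:/vagrant$ sudo docker pull redis:latest

# Watch the Docker daemon logs for DNS/lookup errors while the pull runs.
vagrant@nomad:/vagrant$ sudo journalctl -u docker -f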

@cbednarski
Contributor

Thanks for reporting this! I would like to improve it, but this might be a bit tricky for us to fix, for two reasons:

  1. The DNS lookup failure is happening in the docker daemon, not the nomad process.
  2. The failure happens on the nomad agent node and there is not a great way to inform the client without passing logs back. See HTTP endpoint to tail job logs, stdout, and stderr (#277).

I think we can improve this via #277 by adding simple remote logging to the CLI, but solving this in a way that provides more immediate feedback will be difficult without some additional plumbing on our side.
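As an aside, independent of better error reporting in Nomad, the underlying stall can usually be avoided by giving the client node a resolver that actually answers: dockerd resolves the registry through the host's /etc/resolv.conf, so pointing that at a reliable DNS server and bouncing the daemon tends to unstick pulls. A hedged sketch (the nameserver address is just an example, and systemctl assumes a systemd-managed Docker):

# Point the host resolver at a DNS server that responds (example address).
$ echo "nameserver 8.8.8.8" | sudo tee /etc/resolv.conf

# Restart the daemon so any wedged pull fails; Nomad then retries the pull.
$ sudo systemctl restart docker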

@dadgar
Contributor

dadgar commented Mar 22, 2016

Nomad retries docker pulls if they fail. If docker pull doesn't fail but is misbehaving it is a docker issue.

@dadgar dadgar closed this as completed Mar 22, 2016
@kurtwheeler

I think I just ran into this bug. I have 154 dispatch jobs all on the same node that say they're running, but their allocations are stuck in a pending state:

$ nomad status SALMON_0_12288/dispatch-1541535060-3b915938
ID            = SALMON_0_12288/dispatch-1541535060-3b915938
Name          = SALMON_0_12288/dispatch-1541535060-3b915938
Submit Date   = 2018-11-06T20:11:00Z
Type          = batch
Priority      = 50
Datacenters   = dc1
Status        = running
Periodic      = false
Parameterized = false

Summary
Task Group  Queued  Starting  Running  Failed  Complete  Lost
jobs        0       1         0        0       0         0

Allocations
ID        Node ID   Task Group  Version  Desired  Status   Created    Modified
6af316bb  0ccc140a  jobs        0        run      pending  40m9s ago  39m27s ago

And it's because they're waiting for the Docker image to download:

Recent Events:
Time                  Type        Description
2018-11-06T20:11:42Z  Driver      Downloading image ccdlstaging/dr_salmon:v1.1.6-dev
2018-11-06T20:11:42Z  Task Setup  Building Task Directory
2018-11-06T20:11:42Z  Received    Task received by client

I checked to see if that image had been downloaded:

$ docker images
REPOSITORY                     TAG                 IMAGE ID            CREATED             SIZE
ccdlstaging/dr_foreman         v1.1.6-dev          e4e43ffe14ac        48 minutes ago      752 MB
ccdlstaging/dr_downloaders     v1.1.6-dev          937d64f2500c        48 minutes ago      1.03 GB
ccdlstaging/dr_smasher         v1.1.6-dev          7a27e6bf7ce3        51 minutes ago      1.58 GB
ccdlstaging/dr_downloaders     v1.1.5-dev          cce20920e2f5        28 hours ago        1.03 GB
docker/docker-bench-security   latest              82523a3d637f        9 months ago        31.7 MB
busybox                        latest              5b0d59026729        9 months ago        1.15 MB

and it hadn't, so I tried to pull it onto the node manually:

$ docker pull ccdlstaging/dr_salmon:v1.1.6-dev
v1.1.6-dev: Pulling from ccdlstaging/dr_salmon
18d680d61657: Already exists 
0addb6fece63: Already exists 
78e58219b215: Already exists 
eb6959a66df2: Already exists 
681fa79ced90: Already exists 
6da690b18cb2: Already exists 
f4f1d90dc102: Already exists 
5f86024f6c88: Already exists 
845bdb6b5bff: Already exists 
c0c840a944df: Already exists 
92cf3aa48705: Already exists 
2a0a7cbf8772: Already exists 
f95dc3322e9e: Already exists 
b629317ebcab: Already exists 
7507392ece52: Already exists 
856d142a4e00: Already exists 
d15e592d4d9e: Already exists 
cbaaf994edca: Already exists 
9f5edf172ca7: Already exists 
0c0a2c08563c: Already exists 
2e1fdf5d8f8a: Already exists 
cfbb782df8af: Already exists 
d4e167c88c1a: Already exists 
386bc4f51a5a: Already exists 
768202b3f30c: Already exists 
a61c0af1bc41: Already exists 
2a9faa4f47e8: Already exists 
1255108907e4: Already exists 
46337c104093: Already exists 
cddfc4966485: Already exists 
cc31134a0d70: Already exists 
67da5c9ed657: Already exists 
e06ca47d6538: Already exists 
e36075509bb4: Already exists 
c99e31d8564e: Already exists 
94ade1886bb2: Already exists 
977f6036fc08: Already exists 
6b004961624e: Already exists 
Digest: sha256:02087773a5b7399bbcca199852070dbbb72e7453bd72bb3b7083eec0598d6f56

All of the layers had already been downloaded; the image just hadn't been tagged. However, after I pulled it manually, the image showed up on the node:

$ docker images
REPOSITORY                     TAG                 IMAGE ID            CREATED             SIZE
ccdlstaging/dr_foreman         v1.1.6-dev          e4e43ffe14ac        50 minutes ago      752 MB
ccdlstaging/dr_downloaders     v1.1.6-dev          937d64f2500c        50 minutes ago      1.03 GB
ccdlstaging/dr_salmon          v1.1.6-dev          dbfe27673c1c        50 minutes ago      2.39 GB
ccdlstaging/dr_smasher         v1.1.6-dev          7a27e6bf7ce3        53 minutes ago      1.58 GB
ccdlstaging/dr_downloaders     v1.1.5-dev          cce20920e2f5        28 hours ago        1.03 GB
docker/docker-bench-security   latest              82523a3d637f        9 months ago        31.7 MB
busybox                        latest              5b0d59026729        9 months ago        1.15 MB

However, all of the jobs waiting on that image are still waiting, and it's been about 45 minutes now.

I saw @dadgar's comment:

Nomad retries docker pulls if they fail. If docker pull doesn't fail but is misbehaving it is a docker issue.

so I know I have a Docker issue. The question is: how do I make Nomad work? It's still stuck with 154 jobs waiting on a Docker image that is now downloaded. Also, given that this Docker issue is now over two years old, is there any workaround for it within Nomad? Is there a parameter I could set that would have the effect of: if downloading a Docker image takes more than 30 minutes, consider it a failure?
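For later readers: newer Nomad releases added an image_pull_timeout option to the docker driver's task config, which is roughly the knob being asked for here; check the docs for your version before relying on it. A minimal sketch, with the image name taken from the output above and the job/task names invented for illustration:

job "salmon_example" {            # hypothetical job name, for illustration only
  datacenters = ["dc1"]
  type        = "batch"

  group "jobs" {
    task "salmon" {               # hypothetical task name
      driver = "docker"

      config {
        image = "ccdlstaging/dr_salmon:v1.1.6-dev"
        # Give up on the pull after 30 minutes instead of waiting forever,
        # so the task fails and the normal restart policy applies.
        image_pull_timeout = "30m"
      }
    }
  }
}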

I guess my move here is to manually kill all those jobs and let them get rescheduled, but having to resolve this manually is rather painful, and it won't be a great solution when it happens at 2 a.m. and isn't discovered for hours.
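One possible manual recovery, sketched under the assumption that the services on the node are systemd-managed: restarting dockerd makes the wedged pull API call fail, and since Nomad retries failed pulls (per the comment above) and the image is now cached locally, the retried pull should complete almost immediately.

# On the affected client node: force the stuck pull to fail so Nomad retries it.
$ sudo systemctl restart docker

# Then confirm the allocations move out of "pending".
$ nomad status SALMON_0_12288/dispatch-1541535060-3b915938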
