Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Docker Image Pull fails #4934

Closed
n-junge opened this issue Nov 29, 2018 · 14 comments
Closed

Docker Image Pull fails #4934

n-junge opened this issue Nov 29, 2018 · 14 comments

Comments

@n-junge
Copy link

n-junge commented Nov 29, 2018

Nomad version

Nomad v0.8.6 ('ab54ebcfcde062e9482558b7c052702d4cb8aa1b'+CHANGES)

Operating system and Environment details

Windows 10 Enterprise 64-bit

Issue

Nomad Task fails trying to pull Docker image. Some images can be pulled, others cannot.

Example with 2 restarts:

Recent Events: 
Time                       Type            Description
2018-11-29T13:30:27+01:00  Not Restarting  Exceeded allowed attempts 2 in interval 24h0m0s and mode is "fail"
2018-11-29T13:30:27+01:00  Driver Failure  failed to initialize task "ping-task" for alloc "3c81b66a-2cb2-9fc9-56ea-44dec51f0791": Failed to pull `microsoft/windowsservercore:ltsc2016`: context canceled
2018-11-29T13:30:17+01:00  Driver          Docker image microsoft/windowsservercore:ltsc2016 pull progress: Pulled 2/2 (0 B/0 B) layers: 0 waiting/0 pulling
2018-11-29T13:28:17+01:00  Driver          Downloading image microsoft/windowsservercore:ltsc2016
2018-11-29T13:28:02+01:00  Restarting      Task restarting in 15.754195448s
2018-11-29T13:28:02+01:00  Driver Failure  failed to initialize task "ping-task" for alloc "3c81b66a-2cb2-9fc9-56ea-44dec51f0791": Failed to pull `microsoft/windowsservercore:ltsc2016`: context canceled
2018-11-29T13:27:52+01:00  Driver          Docker image microsoft/windowsservercore:ltsc2016 pull progress: Pulled 2/2 (0 B/0 B) layers: 0 waiting/0 pulling
2018-11-29T13:25:52+01:00  Driver          Downloading image microsoft/windowsservercore:ltsc2016
2018-11-29T13:25:33+01:00  Restarting      Task restarting in 18.519750579s
2018-11-29T13:25:33+01:00  Driver Failure  failed to initialize task "ping-task" for alloc "3c81b66a-2cb2-9fc9-56ea-44dec51f0791": Failed to pull `microsoft/windowsservercore:ltsc2016`: context canceled
2018-11-29T13:24:03+01:00  Driver          Docker image microsoft/windowsservercore:ltsc2016 pull progress: Pulled 2/2 (1.457 GiB/1.457 GiB) layers: 0 waiting/0 pulling
2018-11-29T13:21:53+01:00  Driver          Docker image microsoft/windowsservercore:ltsc2016 pull progress: Pulled 2/2 (1.457 GiB/1.457 GiB) layers: 0 waiting/0 pulling
2018-11-29T13:19:53+01:00  Driver          Docker image microsoft/windowsservercore:ltsc2016 pull progress: Pulled 2/2 (1.457 GiB/1.457 GiB) layers: 0 waiting/0 pulling
2018-11-29T13:17:53+01:00  Driver          Downloading image microsoft/windowsservercore:ltsc2016
2018-11-29T13:17:53+01:00  Task Setup      Building Task Directory
2018-11-29T13:17:53+01:00  Received        Task received by client

There is no problem pulling the images manually.

@preetapan
Copy link
Contributor

@n-junge any more diagnostic info you can share? that looks like it took 6 minutes and still timed out but the log messages are a bit misleading (pulled 2/2)

Pls share your job spec file so we can try to reproduce this.

@n-junge
Copy link
Author

n-junge commented Nov 29, 2018

I noticed that to, it always shows that all layers are pulled. I think Nomad times out while Docker is extracting the layers, but this shouldn't happen.

The job file is very simple.

job ping {
  datacenters = ["dc1"]
  type = "batch"
  group "ping-group" {
    count = 1
    restart {
      attempts = 2
      delay    = "15s"
      mode     = "fail"
    }
    task "ping-task" {
      driver = "docker"
      config {
        image = "microsoft/windowsservercore:ltsc2016"
	command = "ping"
	args = ["localhost", "-n", "25"]
      }
    }
  }
}

This is the server/client log (extracts):

2018/11/29 15:46:48.647151 [DEBUG] driver.docker: image microsoft/windowsservercore:ltsc2016 pull progress: Pulled 1/2 (67.75 MiB/1.457 GiB) layers: 0 waiting/1 pulling - est 208.9s remaining
2018/11/29 15:46:58.644334 [DEBUG] driver.docker: image microsoft/windowsservercore:ltsc2016 pull progress: Pulled 1/2 (411 MiB/1.457 GiB) layers: 0 waiting/1 pulling - est 51.9s remaining
2018/11/29 15:47:08.642179 [DEBUG] driver.docker: image microsoft/windowsservercore:ltsc2016 pull progress: Pulled 1/2 (712.3 MiB/1.457 GiB) layers: 0 waiting/1 pulling - est 32.6s remaining
2018/11/29 15:47:18.640426 [DEBUG] driver.docker: image microsoft/windowsservercore:ltsc2016 pull progress: Pulled 1/2 (1.133 GiB/1.457 GiB) layers: 0 waiting/1 pulling - est 10.0s remaining
2018/11/29 15:47:25.396971 [DEBUG] driver.docker: using client connection initialized from environment
2018/11/29 15:47:26.398966 [DEBUG] client: state changed, updating node and re-registering.
2018/11/29 15:47:26.399959 [INFO] client: node registration complete
2018/11/29 15:47:28.638697 [DEBUG] driver.docker: image microsoft/windowsservercore:ltsc2016 pull progress: Pulled 1/2 (1.346 GiB/1.457 GiB) layers: 0 waiting/1 pulling - est 3.2s remaining
2018/11/29 15:47:38.637383 [DEBUG] driver.docker: image microsoft/windowsservercore:ltsc2016 pull progress: Pulled 2/2 (1.457 GiB/1.457 GiB) layers: 0 waiting/0 pulling
2018/11/29 15:47:48.636326 [DEBUG] driver.docker: image microsoft/windowsservercore:ltsc2016 pull progress: Pulled 2/2 (1.457 GiB/1.457 GiB) layers: 0 waiting/0 pulling

and then it prints the last line until it fails with the message from the alloc status above

@n-junge
Copy link
Author

n-junge commented Dec 14, 2018

?

@cpumanaz
Copy link

cpumanaz commented Feb 3, 2019

I have experienced this as well. The same status pulled N/N 0 waiting/0 pulling but it stays like that until times out.

@Miserlou
Copy link

Miserlou commented Feb 27, 2019

Seeing a lot of this in 0.9.0-beta3 (Linux)

2019-02-27T17:34:40Z  Driver Failure  Failed to pull `myorg/myimage:v1.9.4-dev`: context canceled
2019-02-27T17:34:10Z  Driver          Docker image pull progress: Pulled 28/28 (9.108 GiB/9.108 GiB) layers: 0 waiting/0 pulling
2019-02-27T17:32:00Z  Driver          Docker image pull progress: Pulled 27/28 (6.908 GiB/9.108 GiB) layers: 0 waiting/1 pulling - est 36.5s remaining
2019-02-27T17:30:00Z  Driver          Downloading image

@cpumanaz
Copy link

cpumanaz commented Feb 27, 2019 via email

@cgbaker
Copy link
Contributor

cgbaker commented Mar 5, 2019

In 0.8.x and 0.9.0, if Nomad doesn't receive any updates from the Docker daemon for over two minutes, it will abort and reschedule the task. In general, this isn't a problem, because the Docker daemon is somewhat chatty. I tried to verify this with some very large Docker images on a machine with very low bandwidth, and things worked as expected (in my case, after about 20 minutes, the image finished downloading and the task ran). If any of you could provide the following, it would be helpful for reproducing this and fixing it:

  • Docker version on client node
  • Client logs
  • Job spec (if possible)
  • Docker image, or some general statistics (size, number of layers) if they seem relevant

I appreciate your help.

@Miserlou
Copy link

Miserlou commented Mar 5, 2019

Is that two minutes configurable?

@Miserlou
Copy link

Miserlou commented Mar 5, 2019

Also does that reschedule occur even if the jobspec forbids retrying?

@cgbaker
Copy link
Contributor

cgbaker commented Mar 5, 2019

The two minutes is not configurable right now... it certainly could/should be made configurable, on a global level if not at the job-level. This error (timeout during pull) is marked as recoverable error and is subject to the configured restart and reschedule policy.

@stale
Copy link

stale bot commented May 10, 2019

This issue will be auto-closed because there hasn't been any activity for a few months. Feel free to open a new one if you still experience this problem 👍

@kaspergrubbe
Copy link

@cgbaker Do you know if this is configurable now?

@holtwilkins
Copy link
Contributor

I just came across this, hopefully this fixes the issue: https://www.nomadproject.io/docs/drivers/docker#pull_activity_timeout

@github-actions
Copy link

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Oct 27, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

No branches or pull requests

7 participants