job submitted with tcp docker endpoint fails after 60 seconds #1184

achattaway · 2016-05-18T19:34:43Z

Nomad version

Output from nomad version
Nomad v0.3.2 (also tried the 0.4.0 dev build)

Operating system and Environment details

CentOS 7, current (yum updated)

Issue

Setting the client option docker.endpoint to a TCP address causes the container to fail and be restarted after 60 seconds.

Reproduction steps

Run a single-node Nomad server/client, not in dev mode, with the client option set to
"docker.endpoint" = "tcp://0.0.0.0:2375"
or
"docker.endpoint" = "tcp://127.0.0.1:2375"
Run any simple job.
Wait 60 seconds.

Nomad Server logs (if appropriate)

2016/05/18 15:31:55 [DEBUG] sched: <Eval '88aa1256-d141-8eb5-869b-128136e3ec03' JobID: 'test'>: allocs: (place 0) (update 1) (migrate 0) (stop 0) (ignore 0)
2016/05/18 15:31:55 [DEBUG] sched: <Eval '88aa1256-d141-8eb5-869b-128136e3ec03' JobID: 'test'>: 1 in-place updates of 1
2016/05/18 15:31:55 [DEBUG] http: Request /v1/evaluation/88aa1256-d141-8eb5-869b-128136e3ec03 (294.158µs)
2016/05/18 15:31:55 [DEBUG] http: Request /v1/evaluation/88aa1256-d141-8eb5-869b-128136e3ec03/allocations (326.546µs)
2016/05/18 15:31:55 [DEBUG] worker: submitted plan for evaluation 88aa1256-d141-8eb5-869b-128136e3ec03
2016/05/18 15:31:55 [DEBUG] sched: <Eval '88aa1256-d141-8eb5-869b-128136e3ec03' JobID: 'test'>: setting status to complete
2016/05/18 15:31:55 [DEBUG] client: updated allocations at index 84 (pulled 1) (filtered 0)
2016/05/18 15:31:55 [DEBUG] client: allocs: (added 0) (removed 0) (updated 1) (ignore 0)
2016/05/18 15:31:55 [DEBUG] worker: updated evaluation <Eval '88aa1256-d141-8eb5-869b-128136e3ec03' JobID: 'test'>
2016/05/18 15:31:55 [DEBUG] worker: ack for evaluation 88aa1256-d141-8eb5-869b-128136e3ec03
2016/05/18 15:31:56 [DEBUG] client: state changed, updating node.
2016/05/18 15:31:56 [DEBUG] client: node registration complete
2016/05/18 15:31:56 [DEBUG] http: Request /v1/evaluation/88aa1256-d141-8eb5-869b-128136e3ec03 (146.105µs)
2016/05/18 15:31:56 [DEBUG] http: Request /v1/evaluation/88aa1256-d141-8eb5-869b-128136e3ec03/allocations (149.335µs)
2016/05/18 15:32:52 [ERR] driver.docker: failed to wait for 42f1211121489678df7578b91946064ad835c8040c64eb1945a1d954effbdf64; container already terminated
2016/05/18 15:32:52 [INFO] client: task "alpine" for alloc "209570af-5eeb-db1b-58cf-5181524c7d0a" failed: Wait returned exit code 0, signal 0, and error Post http://127.0.0.1:2375/containers/42f1211121489678df7578b91946064ad835c8040c64eb1945a1d954effbdf64/wait: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
2016/05/18 15:32:52 [INFO] client: Restarting task "alpine" for alloc "209570af-5eeb-db1b-58cf-5181524c7d0a" in 17.084959638s
2016/05/18 15:32:52 [DEBUG] plugin: /tmp/nomad/nomad: plugin process exited
2016/05/18 15:32:52 [DEBUG] client: updated allocations at index 88 (pulled 0) (filtered 1)
2016/05/18 15:32:52 [DEBUG] client: allocs: (added 0) (removed 0) (updated 0) (ignore 1)

Nomad Client logs (if appropriate)

Job file (if appropriate)

job "test" {
  region = "global"
  datacenters = ["dc1"]
  type = "service"
  priority = 50

  group "alpine" {
    count = 1

    task "alpine" {
      driver = "docker"

      config {
        image = "alpine:latest"
        command = "/bin/sh"
        tty = true
        interactive = true
      }

      resources {
        network {
          port "ha" {}
          mbits = 10
        }
      }
    }
  }
}

lfarnell · 2016-05-18T20:03:51Z

I was investigating this as well, but on Windows, which now supports Docker containers. Images there range in size from hundreds of MB to several GB, and I noticed in the code that the default timeout is hard-coded at 1 minute. On a slow network or in a virtualized environment I can see this being problematic. It would be nice if we could define a timeout option in the task block, like below:

task "webservice" {
    driver = "docker"
    config = {
        image = "redis"
        labels = {
            group = "webservice-cache"
        }
        timeout = "2m"
    }
}

When doing this on a Linux box, the images were much smaller and didn't seem to have any issues. The error message you received is different from what I received:
2016/05/18 10:22:33 [ERR] driver.docker: failed pulling container microsoft/iis:windowsservercore: net/http: request canceled (Client.Timeout exceeded while reading body)
2016/05/18 10:22:33 [ERR] client: failed to start task 'redis' for alloc 'fbe52450-f588-9369-7a24-9bd803553efd': failed to create image: Failed to pull microsoft/iis:windowsservercore: net/http: request canceled (Client.Timeout exceeded while reading body)
Yours times out while awaiting headers, whereas mine times out while reading the body. I'm not sure why it would fail on headers.
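
The two variants of the error can be reproduced with Go's standard net/http client, whose Client.Timeout covers the whole exchange, including reading the response body. The following is a minimal sketch, not Nomad's code: the test server and the names sleepOrDone, isTimeout, timeoutPhase and runDemo are hypothetical, and the delays are arbitrary. The exact error wording also varies between Go versions, so the sketch checks the timeout flag rather than the message text.

```go
package main

import (
	"errors"
	"fmt"
	"io"
	"net"
	"net/http"
	"net/http/httptest"
	"time"
)

// sleepOrDone pauses for d, returning early if the client has gone away.
func sleepOrDone(r *http.Request, d time.Duration) {
	select {
	case <-time.After(d):
	case <-r.Context().Done():
	}
}

// isTimeout reports whether err was caused by a timeout.
func isTimeout(err error) bool {
	var netErr net.Error
	return errors.As(err, &netErr) && netErr.Timeout()
}

// timeoutPhase issues one GET against a test server that delays its headers
// by headerDelay and its body by bodyDelay, and reports in which phase the
// client timeout fired: "headers", "body", or "none".
func timeoutPhase(headerDelay, bodyDelay, timeout time.Duration) string {
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		sleepOrDone(r, headerDelay)
		w.WriteHeader(http.StatusOK)
		w.(http.Flusher).Flush() // send the headers before the body is ready
		sleepOrDone(r, bodyDelay)
		fmt.Fprint(w, "layer data")
	}))
	defer srv.Close()

	client := &http.Client{Timeout: timeout}
	resp, err := client.Get(srv.URL)
	if err != nil {
		if isTimeout(err) {
			return "headers" // no response headers arrived in time
		}
		return "error: " + err.Error()
	}
	defer resp.Body.Close()

	if _, err := io.ReadAll(resp.Body); err != nil {
		if isTimeout(err) {
			return "body" // headers arrived, the body did not finish in time
		}
		return "error: " + err.Error()
	}
	return "none"
}

// runDemo exercises a slow-headers case (like a blocking wait call), a
// slow-body case (like a large image pull), and a fast case.
func runDemo() (slowHeaders, slowBody, fast string) {
	slowHeaders = timeoutPhase(500*time.Millisecond, 0, 100*time.Millisecond)
	slowBody = timeoutPhase(0, 500*time.Millisecond, 100*time.Millisecond)
	fast = timeoutPhase(0, 0, time.Second)
	return
}

func main() {
	fmt.Println(runDemo())
}
```

In this sketch, an endpoint that sends nothing until it is finished trips the timeout in the headers phase, while one that starts responding and then streams slowly trips it in the body phase. That would be consistent with a blocking wait call producing the "awaiting headers" variant and a long image pull producing the "reading body" variant.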

achattaway · 2016-05-18T20:17:18Z

This will also then fail at 2 minutes. I changed the hard-coded value to 2m and then to 10m, and it failed at each point. The actual problem is that a timeout is being set on a call that may never return. The wait call should only return when the container terminates, which could essentially be never. It's the same idea as fork() followed by wait().

Docker Remote API definition:
Wait a container
POST /containers/(id or name)/wait
Block until container id stops, then returns the exit code
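
To illustrate why any fixed timeout fails here, the sketch below fakes a daemon whose wait endpoint blocks for the container's lifetime and then calls it with two clients: one with a Client.Timeout and one without. This is only a sketch under stated assumptions: newFakeDaemon, wait, runDemo and the container name "demo" are hypothetical, the fake handler stands in for the real Docker daemon, and the durations are scaled down from minutes to milliseconds. The two-client split follows the idea in the title of #1257, not its actual implementation.

```go
package main

import (
	"errors"
	"fmt"
	"io"
	"net"
	"net/http"
	"net/http/httptest"
	"time"
)

// newFakeDaemon stands in for the Docker daemon. Its handler blocks for
// runFor (the container's lifetime) and only then reports an exit code.
func newFakeDaemon(runFor time.Duration) *httptest.Server {
	return httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		select {
		case <-time.After(runFor):
			fmt.Fprint(w, `{"StatusCode":0}`)
		case <-r.Context().Done(): // the caller gave up
		}
	}))
}

// wait posts to the fake wait endpoint and returns the raw response body.
func wait(client *http.Client, baseURL string) (string, error) {
	resp, err := client.Post(baseURL+"/containers/demo/wait", "application/json", nil)
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()
	body, err := io.ReadAll(resp.Body)
	return string(body), err
}

// runDemo waits on a "container" that runs for 300ms, first with a client
// limited to 100ms per request and then with a client that has no timeout.
func runDemo() (timedOut bool, result string, err error) {
	srv := newFakeDaemon(300 * time.Millisecond)
	defer srv.Close()

	// Client for ordinary API calls: a per-request timeout is reasonable.
	apiClient := &http.Client{Timeout: 100 * time.Millisecond}
	_, apiErr := wait(apiClient, srv.URL)
	var netErr net.Error
	timedOut = errors.As(apiErr, &netErr) && netErr.Timeout()

	// Client reserved for blocking calls: no Timeout, so it returns only
	// when the container exits.
	waitClient := &http.Client{}
	result, err = wait(waitClient, srv.URL)
	return
}

func main() {
	timedOut, result, err := runDemo()
	fmt.Println("timeout client timed out:", timedOut)
	fmt.Println("no-timeout client result:", result, err)
}
```

Raising the timeout only moves the failure later, because the container can always outlive it. A client without a timeout can still be interrupted by passing a cancellable context with the request.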

achattaway · 2016-05-18T20:37:25Z

I should add that there is an easy workaround for this: point docker.endpoint at the unix socket instead of the TCP endpoint and it works fine. The unix socket path presumably uses a different mechanism without a timeout.
I couldn't see a reason not to do this in my environment; it's just that everything else we use talks to the TCP socket, so this would be an exception.
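
For reference, this is roughly how a Go HTTP client can talk to a daemon over a unix socket. It is a minimal sketch, not Nomad's or the Docker client library's code: newUnixClient and runDemo are hypothetical, the server is a stand-in listening on a temporary socket, and the host name in the URL is a placeholder that the custom dialer ignores. The client sets no Timeout, so a blocking call made through it would not be cut off.

```go
package main

import (
	"context"
	"fmt"
	"io"
	"net"
	"net/http"
	"os"
	"path/filepath"
)

// newUnixClient returns an HTTP client that dials socketPath for every
// request, regardless of the host in the URL. No Timeout is set.
func newUnixClient(socketPath string) *http.Client {
	return &http.Client{
		Transport: &http.Transport{
			DialContext: func(ctx context.Context, _, _ string) (net.Conn, error) {
				var d net.Dialer
				return d.DialContext(ctx, "unix", socketPath)
			},
		},
	}
}

// runDemo starts a stand-in daemon on a temporary unix socket and fetches
// one response from it through newUnixClient.
func runDemo() (string, error) {
	dir, err := os.MkdirTemp("", "sock")
	if err != nil {
		return "", err
	}
	defer os.RemoveAll(dir)
	socketPath := filepath.Join(dir, "demo.sock")

	ln, err := net.Listen("unix", socketPath)
	if err != nil {
		return "", err
	}
	srv := &http.Server{Handler: http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprint(w, "OK")
	})}
	go srv.Serve(ln)
	defer srv.Close()

	// "docker" is a placeholder host; the dialer always uses socketPath.
	resp, err := newUnixClient(socketPath).Get("http://docker/_ping")
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()
	body, err := io.ReadAll(resp.Body)
	return string(body), err
}

func main() {
	fmt.Println(runDemo())
}
```

Whether a timeout applies depends on how the client is constructed rather than on the socket type, so this sketch does not explain why the unix endpoint avoided the failure in Nomad; it only shows that such a client can be built without one.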

achattaway · 2016-05-18T22:09:44Z

@lfarnell actually yours is a different but related issue. I also saw it when testing large images pulled over a slow network. I agree that there should be a configurable timeout for this; however, I would set it at the image level, i.e.:

task "webservice" {
    driver = "docker"
    config = {
        image = {
            tag = "redis"
            pull_timeout = "2m"
        }
        labels = {
            group = "webservice-cache"
        }
    }
}

github-actions · 2022-12-22T02:14:54Z

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

dadgar added type/bug theme/driver/docker labels May 19, 2016

dadgar added this to the v0.4 milestone May 19, 2016

diptanu mentioned this issue Jun 11, 2016

Using a different client for collecting stats and waiting on containers #1257

Merged

diptanu closed this as completed in #1257 Jun 11, 2016

github-actions bot locked as resolved and limited conversation to collaborators Dec 22, 2022
