job submitted with tcp docker endpoint fails after 60 seconds #1184

Closed
achattaway opened this issue May 18, 2016 · 5 comments

Comments

@achattaway

If you have a question, prepend your issue with [question] or preferably use the nomad mailing list.

If filing a bug please include the following:

Nomad version

Output from nomad version
Nomad v0.3.2 - also tried 4.0 dev

Operating system and Environment details

CentOS 7, current (yum updated)

Issue

Setting the client option docker.endpoint to a TCP address causes the container to fail and be restarted after 60 seconds.

Reproduction steps

1. Run a single-node Nomad server/client, not in dev mode, with docker.endpoint set to either
   "docker.endpoint" = "tcp://0.0.0.0:2375"
   or
   "docker.endpoint" = "tcp://127.0.0.1:2375"
2. Run any simple job.
3. Wait 60 seconds.

Nomad Server logs (if appropriate)

2016/05/18 15:31:55 [DEBUG] sched: <Eval '88aa1256-d141-8eb5-869b-128136e3ec03' JobID: 'test'>: allocs: (place 0) (update 1) (migrate 0) (stop 0) (ignore 0)
2016/05/18 15:31:55 [DEBUG] sched: <Eval '88aa1256-d141-8eb5-869b-128136e3ec03' JobID: 'test'>: 1 in-place updates of 1
2016/05/18 15:31:55 [DEBUG] http: Request /v1/evaluation/88aa1256-d141-8eb5-869b-128136e3ec03 (294.158µs)
2016/05/18 15:31:55 [DEBUG] http: Request /v1/evaluation/88aa1256-d141-8eb5-869b-128136e3ec03/allocations (326.546µs)
2016/05/18 15:31:55 [DEBUG] worker: submitted plan for evaluation 88aa1256-d141-8eb5-869b-128136e3ec03
2016/05/18 15:31:55 [DEBUG] sched: <Eval '88aa1256-d141-8eb5-869b-128136e3ec03' JobID: 'test'>: setting status to complete
2016/05/18 15:31:55 [DEBUG] client: updated allocations at index 84 (pulled 1) (filtered 0)
2016/05/18 15:31:55 [DEBUG] client: allocs: (added 0) (removed 0) (updated 1) (ignore 0)
2016/05/18 15:31:55 [DEBUG] worker: updated evaluation <Eval '88aa1256-d141-8eb5-869b-128136e3ec03' JobID: 'test'>
2016/05/18 15:31:55 [DEBUG] worker: ack for evaluation 88aa1256-d141-8eb5-869b-128136e3ec03
2016/05/18 15:31:56 [DEBUG] client: state changed, updating node.
2016/05/18 15:31:56 [DEBUG] client: node registration complete
2016/05/18 15:31:56 [DEBUG] http: Request /v1/evaluation/88aa1256-d141-8eb5-869b-128136e3ec03 (146.105µs)
2016/05/18 15:31:56 [DEBUG] http: Request /v1/evaluation/88aa1256-d141-8eb5-869b-128136e3ec03/allocations (149.335µs)
2016/05/18 15:32:52 [ERR] driver.docker: failed to wait for 42f1211121489678df7578b91946064ad835c8040c64eb1945a1d954effbdf64; container already terminated
2016/05/18 15:32:52 [INFO] client: task "alpine" for alloc "209570af-5eeb-db1b-58cf-5181524c7d0a" failed: Wait returned exit code 0, signal 0, and error Post http://127.0.0.1:2375/containers/42f1211121489678df7578b91946064ad835c8040c64eb1945a1d954effbdf64/wait: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
2016/05/18 15:32:52 [INFO] client: Restarting task "alpine" for alloc "209570af-5eeb-db1b-58cf-5181524c7d0a" in 17.084959638s
2016/05/18 15:32:52 [DEBUG] plugin: /tmp/nomad/nomad: plugin process exited
2016/05/18 15:32:52 [DEBUG] client: updated allocations at index 88 (pulled 0) (filtered 1)
2016/05/18 15:32:52 [DEBUG] client: allocs: (added 0) (removed 0) (updated 0) (ignore 1)

Nomad Client logs (if appropriate)

Job file (if appropriate)

job "test" {
  region = "global"
  datacenters = ["dc1"]
  type = "service"
  priority = 50

  group "alpine" {
    count = 1

    task "alpine" {
      driver = "docker"

      config {
        image = "alpine:latest"
        command = "/bin/sh"
        tty = true
        interactive = true
      }

      resources {
        network {
          port "ha" {}
          mbits = 10
        }
      }
    }
  }
}

@lfarnell
Contributor

I was investigating this as well, but on the Windows platform, which now supports Docker containers whose images range in size from hundreds of MB to several GB. I noticed in the code that the default timeout is hard-coded at 1 minute. On a slow network or in a virtualized environment I can see this being problematic. It would be nice if we could define a timeout option in the task block, like below:

task "webservice" {
    driver = "docker"
    config = {
        image = "redis"
        labels = {
            group = "webservice-cache"
        }
        timeout = "2m"
    }
}

When doing this on a Linux box, the images were much smaller and didn't seem to have any issues. The error message you received is different than what I received:
2016/05/18 10:22:33 [ERR] driver.docker: failed pulling container microsoft/iis:windowsservercore: net/http: request canceled (Client.Timeout exceeded while reading body)
2016/05/18 10:22:33 [ERR] client: failed to start task 'redis' for alloc 'fbe52450-f588-9369-7a24-9bd803553efd': failed to create image: Failed to pull microsoft/iis:windowsservercore: net/http: request canceled (Client.Timeout exceeded while reading body)
Yours timed out while awaiting headers, whereas mine timed out while reading the body. I'm not sure why it would fail on headers.

@achattaway
Author

achattaway commented May 18, 2016

That would then fail at 2 minutes as well. I changed the hard-coded value to 2m and then to 10m, and it failed at each point. The actual problem is that a timeout is being set on a call that may never return. The wait call should only return when the container terminates, and that could essentially be never. It's the same idea as fork() followed by wait().

Docker Remote API definition:
Wait a container
POST /containers/(id or name)/wait
Block until container id stops, then returns the exit code

@achattaway
Author

achattaway commented May 18, 2016

I should add that there is an easy workaround for this: point docker.endpoint at the Unix socket instead of the TCP socket and it works fine, i.e. the Unix socket must use a different mechanism without a timeout.
I couldn't see a reason not to do this in my environment; it's just that everything else we use talks to the TCP socket, so this is an exception.

@achattaway
Author

achattaway commented May 18, 2016

@lfarnell actually yours is a different but related issue. I also saw it when I was testing large images pulled over a slow network. I agree with your solution that there should be a timeout for this; however, I would set it at the image level, i.e.:

task "webservice" {
    driver = "docker"
    config = {
        image = {
            tag = "redis"
            pull_timeout = "2m"
        }
        labels = {
            group = "webservice-cache"
        }
    }
}

@github-actions

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Dec 22, 2022