
Nomad client hangs indefinitely on startup #1202

Closed
ghost opened this issue May 24, 2016 · 19 comments

@ghost

ghost commented May 24, 2016

Nomad version

Nomad v0.3.2

Issue

The Nomad client hangs on Starting Nomad agent... at startup. It never makes progress.

● nomad.service - nomad
   Loaded: loaded (/etc/systemd/system/nomad.service; enabled)
   Active: active (running) since Tue 2016-05-24 11:13:51 UTC; 1min 50s ago
 Main PID: 7206 (nomad)
   CGroup: /system.slice/nomad.service
           └─7206 /usr/local/bin/nomad agent -config /etc/nomad.d

May 24 11:13:51 backend-eu-west-1 systemd[1]: Started nomad.
May 24 11:13:51 backend-eu-west-1 nomad[7206]: Loaded configuration from /etc/nomad.d/config.json
May 24 11:13:51 backend-eu-west-1 nomad[7206]: ==> Starting Nomad agent...

When started via an interactive terminal:

/usr/local/bin/nomad agent -config /etc/nomad.d -log-level=DEBUG
    Loaded configuration from /etc/nomad.d/config.json
    ==> Starting Nomad agent...

Nomad configuration file:

{
  "advertise": {
    "http": "10.133.0.5:4646",
    "rpc": "10.133.0.5:4647",
    "serf": "10.133.0.5:4648"
  },
  "bind_addr": "0.0.0.0",
  "client": {
    "enabled": true,
    "servers": [
      "consul-master-eu-west-1",
      "consul-master-eu-west-2",
      "consul-master-eu-west-3"
    ]
  },
  "data_dir": "/var/nomad",
  "datacenter": "europe-west1",
  "region": "europe"
}
@camerondavison
Contributor

I have seen this a couple of times. I turned on DEBUG logging, and still nothing. The only way I could get it to start up was to start killing existing docker containers that were running.

@dadgar
Contributor

dadgar commented May 24, 2016

Can you give some background on what the client was running and how it was restarted? If you can give reproduction steps, that would be awesome!

@ghost
Author

ghost commented May 24, 2016

@dadgar so I can reproduce this error under the same conditions reported by @a86c6f7964. If you restart the Nomad client whilst you have running containers, and those containers continue to run after Nomad has exited, then the Nomad client is unable to start back up.

@diptanu
Contributor

diptanu commented Jun 12, 2016

@grobinson-blockchain @a86c6f7964 So I started an agent in server mode and another in client mode, ran a redis container (the one that nomad init generates), and killed the client. At that point the nomad executor and container were still running. I then restarted the client, which connected back to the executor, and I could even stop the container.

Can I get some more help in reproducing this?

@camerondavison
Contributor

I can try again. But I think that it has to do with having many containers running.


@wuub
Contributor

wuub commented Jun 15, 2016

I can try again. But I think that it has to do with having many containers running.

Not necessarily. I've got a very simple cluster (3 server nodes, 1 client node) with a single system job, and I'm experiencing identical behaviour.

Nomad Agent

Nomad v0.4.0-dev ('bc09a0444722617a3a0ee0daa28d24b93d9d3e5b+CHANGES')

Nomad Config

region = "us"
datacenter = "us-central1"

bind_addr = "0.0.0.0"
data_dir = "/var/nomad"

advertise {
    http = "10.xxx.0.39:4646"
    rpc = "10.xxx.0.39:4647"
}

client {
    enabled = true
    network_speed = 10000  # AFAICT GCE has 10GBe network
    servers = ["consul-1-us-central1", "consul-2-us-central1", "consul-3-us-central1"]
    options {
        "docker.auth.config" = "/root/.docker/config.json"
        "driver.raw_exec.enable" = "1"
    }
}

Job Spec

job "internal-fluentd" {
    region = "us"
    datacenters = ["us-central1"]

    type = "system"

    group "fluentd" {
        task "fluentd" {
            driver = "docker"

            config {
                image = "gcr.io/.../fluentd:emiter"
                port_map {
                    forward = 24224
                    http = 9880
                    syslog = 5140
                }
            }

            service {
                name = "${TASKGROUP}"
                tags = ["emiter"]
                port = "http"
                check {
                    name = "alive"
                    type = "tcp"
                    interval = "10s"
                    timeout = "2s"
                }
            }

            resources {
                cpu = 500 
                memory = 128 
                network {
                    mbits = 10
                    port "http" {
                    }
                    port "forward" {    
                    }
                }
            }
        }
    }
}

Initially everything works as expected; Nomad starts and launches the system job:

Jun 15 17:42:23 apps-1 nomad[1432]: ==> Nomad agent started! Log data will stream in below:
Jun 15 17:42:23 apps-1 nomad[1432]:     2016/06/15 17:42:21 [INFO] client: using state directory /var/nomad/client
Jun 15 17:42:23 apps-1 nomad[1432]:     2016/06/15 17:42:21 [INFO] client: using alloc directory /var/nomad/alloc
Jun 15 17:42:23 apps-1 nomad[1432]:     2016/06/15 17:42:21 [INFO] fingerprint.cgroups: cgroups are available
Jun 15 17:42:23 apps-1 nomad[1432]:     2016/06/15 17:42:21 [WARN] fingerprint.network: Unable to parse Speed in output of '/sbin/ethtool ens4'
Jun 15 17:42:36 apps-1 nomad[1432]:     2016/06/15 17:42:36 [INFO] fingerprint.consul: consul agent is available
Jun 15 17:42:37 apps-1 nomad[1432]:     2016/06/15 17:42:37 [INFO] driver.docker: created container aff71e6f83b0113db88d6fdf322c4f5bf5f94778c4f68b088d7a527ff7d2117d
Jun 15 17:42:37 apps-1 nomad[1432]:     2016/06/15 17:42:37 [INFO] driver.docker: created container 7207b7c63d284d35fcba38c97f0e362fad74aa64e8ba9b6db396d3a0324e9625
Jun 15 17:42:38 apps-1 nomad[1432]:     2016/06/15 17:42:38 [INFO] driver.docker: started container 7207b7c63d284d35fcba38c97f0e362fad74aa64e8ba9b6db396d3a0324e9625
Jun 15 17:42:38 apps-1 nomad[1432]:     2016/06/15 17:42:38 [WARN] client: error fetching stats of task fluentd: stats collection hasn't started yet
Jun 15 17:42:38 apps-1 nomad[1432]:     2016/06/15 17:42:38 [INFO] driver.docker: started container aff71e6f83b0113db88d6fdf322c4f5bf5f94778c4f68b088d7a527ff7d2117d
Jun 15 17:42:38 apps-1 nomad[1432]:     2016/06/15 17:42:38 [WARN] client: error fetching stats of task fluentd: stats collection hasn't started yet
Jun 15 17:42:43 apps-1 nomad[1432]:     2016/06/15 17:42:43 [INFO] driver.docker: stopped container aff71e6f83b0113db88d6fdf322c4f5bf5f94778c4f68b088d7a527ff7d2117d
Jun 15 17:49:53 apps-1 systemd[1]: Stopping Nomad Agent...
Jun 15 17:49:53 apps-1 nomad[1432]: ==> Caught signal: interrupt
Jun 15 17:49:53 apps-1 nomad[1432]:     2016/06/15 17:49:53 [INFO] agent: requesting shutdown
Jun 15 17:49:53 apps-1 nomad[1432]:     2016/06/15 17:49:53 [INFO] client: shutting down
Jun 15 17:49:53 apps-1 nomad[1432]:     2016/06/15 17:49:53 [INFO] client.rpcproxy: shutting down
Jun 15 17:49:53 apps-1 nomad[1432]:     2016/06/15 17:49:53 [ERR] client: RPC failed to server 10.xxx.0.35:4647: rpc error: EOF
Jun 15 17:49:53 apps-1 nomad[1432]:     2016/06/15 17:49:53 [ERR] client: failed to query for node allocations: rpc error: EOF
Jun 15 17:49:53 apps-1 nomad[1432]:     2016/06/15 17:49:53 [INFO] agent: shutdown complete
Jun 15 17:49:53 apps-1 systemd[1]: nomad.service: Main process exited, code=exited, status=1/FAILURE
Jun 15 17:49:53 apps-1 systemd[1]: Stopped Nomad Agent.

As seen above, I stopped it w/ SIGINT.

The system job's docker container is still present after Nomad has exited:

root@apps-1:/home/wuub# docker ps
CONTAINER ID        IMAGE                                           COMMAND                CREATED             STATUS              PORTS                                                                                                                            NAMES
7207b7c63d28        gcr.io/internal/internal-fluentd:emiter   "/bin/sh -c fluentd"   7 minutes ago       Up 7 minutes        5140/tcp, 10.xxx.0.39:21284->9880/tcp, 10.xxx.0.39:21284->9880/udp, 10.xxx.0.39:54833->24224/tcp, 10.xxx.0.39:54833->24224/udp   fluentd-4fe5acc1-fcfa-d7d8-a25b-8459b42d8489

When I issue systemctl start nomad, things don't go according to plan:

-- Logs begin at Wed 2016-06-15 17:42:16 UTC, end at Wed 2016-06-15 17:50:27 UTC. --
Jun 15 17:42:21 apps-1 systemd[1]: Started Nomad Agent.
Jun 15 17:42:21 apps-1 nomad[1432]:     Loaded configuration from /etc/nomad.d/nomad.hcl
Jun 15 17:42:21 apps-1 nomad[1432]: ==> Starting Nomad agent...
Jun 15 17:42:23 apps-1 nomad[1432]: ==> Nomad agent configuration:
Jun 15 17:42:23 apps-1 nomad[1432]:                  Atlas: <disabled>
Jun 15 17:42:23 apps-1 nomad[1432]:                 Client: true
Jun 15 17:42:23 apps-1 nomad[1432]:              Log Level: INFO
Jun 15 17:42:23 apps-1 nomad[1432]:                 Region: us (DC: us-central1)
Jun 15 17:42:23 apps-1 nomad[1432]:                 Server: false
Jun 15 17:42:23 apps-1 nomad[1432]: ==> Nomad agent started! Log data will stream in below:
Jun 15 17:42:23 apps-1 nomad[1432]:     2016/06/15 17:42:21 [INFO] client: using state directory /var/nomad/client
Jun 15 17:42:23 apps-1 nomad[1432]:     2016/06/15 17:42:21 [INFO] client: using alloc directory /var/nomad/alloc
Jun 15 17:42:23 apps-1 nomad[1432]:     2016/06/15 17:42:21 [INFO] fingerprint.cgroups: cgroups are available
Jun 15 17:42:23 apps-1 nomad[1432]:     2016/06/15 17:42:21 [WARN] fingerprint.network: Unable to parse Speed in output of '/sbin/ethtool ens4'
Jun 15 17:42:36 apps-1 nomad[1432]:     2016/06/15 17:42:36 [INFO] fingerprint.consul: consul agent is available
Jun 15 17:42:37 apps-1 nomad[1432]:     2016/06/15 17:42:37 [INFO] driver.docker: created container aff71e6f83b0113db88d6fdf322c4f5bf5f94778c4f68b088d7a527ff7d2117d
Jun 15 17:42:37 apps-1 nomad[1432]:     2016/06/15 17:42:37 [INFO] driver.docker: created container 7207b7c63d284d35fcba38c97f0e362fad74aa64e8ba9b6db396d3a0324e9625
Jun 15 17:42:38 apps-1 nomad[1432]:     2016/06/15 17:42:38 [INFO] driver.docker: started container 7207b7c63d284d35fcba38c97f0e362fad74aa64e8ba9b6db396d3a0324e9625
Jun 15 17:42:38 apps-1 nomad[1432]:     2016/06/15 17:42:38 [WARN] client: error fetching stats of task fluentd: stats collection hasn't started yet
Jun 15 17:42:38 apps-1 nomad[1432]:     2016/06/15 17:42:38 [INFO] driver.docker: started container aff71e6f83b0113db88d6fdf322c4f5bf5f94778c4f68b088d7a527ff7d2117d
Jun 15 17:42:38 apps-1 nomad[1432]:     2016/06/15 17:42:38 [WARN] client: error fetching stats of task fluentd: stats collection hasn't started yet
Jun 15 17:42:43 apps-1 nomad[1432]:     2016/06/15 17:42:43 [INFO] driver.docker: stopped container aff71e6f83b0113db88d6fdf322c4f5bf5f94778c4f68b088d7a527ff7d2117d
Jun 15 17:49:53 apps-1 systemd[1]: Stopping Nomad Agent...
Jun 15 17:49:53 apps-1 nomad[1432]: ==> Caught signal: interrupt
Jun 15 17:49:53 apps-1 nomad[1432]:     2016/06/15 17:49:53 [INFO] agent: requesting shutdown
Jun 15 17:49:53 apps-1 nomad[1432]:     2016/06/15 17:49:53 [INFO] client: shutting down
Jun 15 17:49:53 apps-1 nomad[1432]:     2016/06/15 17:49:53 [INFO] client.rpcproxy: shutting down
Jun 15 17:49:53 apps-1 nomad[1432]:     2016/06/15 17:49:53 [ERR] client: RPC failed to server 10.xxx.0.35:4647: rpc error: EOF
Jun 15 17:49:53 apps-1 nomad[1432]:     2016/06/15 17:49:53 [ERR] client: failed to query for node allocations: rpc error: EOF
Jun 15 17:49:53 apps-1 nomad[1432]:     2016/06/15 17:49:53 [INFO] agent: shutdown complete
Jun 15 17:49:53 apps-1 systemd[1]: nomad.service: Main process exited, code=exited, status=1/FAILURE
Jun 15 17:49:53 apps-1 systemd[1]: Stopped Nomad Agent.
Jun 15 17:49:53 apps-1 systemd[1]: nomad.service: Unit entered failed state.
Jun 15 17:49:53 apps-1 systemd[1]: nomad.service: Failed with result 'exit-code'.
Jun 15 17:50:27 apps-1 systemd[1]: Started Nomad Agent.
Jun 15 17:50:27 apps-1 nomad[3491]:     Loaded configuration from /etc/nomad.d/nomad.hcl
Jun 15 17:50:27 apps-1 nomad[3491]: ==> Starting Nomad agent...

<<<<LOCK>>>>
Nomad just sits here. No ports opened, nothing.
But if I stop the remaining docker container:
$ docker stop fluentd-4fe5acc1-fcfa-d7d8-a25b-8459b42d8489

<<<<LOCKDOWN IMMEDIATELY ENDS>>>>

Jun 15 17:51:52 apps-1 nomad[3491]: ==> Nomad agent configuration:
Jun 15 17:51:52 apps-1 nomad[3491]:                  Atlas: <disabled>
Jun 15 17:51:52 apps-1 nomad[3491]:                 Client: true
Jun 15 17:51:52 apps-1 nomad[3491]:              Log Level: INFO
Jun 15 17:51:52 apps-1 nomad[3491]:                 Region: us (DC: us-central1)
Jun 15 17:51:52 apps-1 nomad[3491]:                 Server: false
Jun 15 17:51:52 apps-1 nomad[3491]: ==> Nomad agent started! Log data will stream in below:
Jun 15 17:51:52 apps-1 nomad[3491]:     2016/06/15 17:50:27 [INFO] client: using state directory /var/nomad/client
Jun 15 17:51:52 apps-1 nomad[3491]:     2016/06/15 17:50:27 [INFO] client: using alloc directory /var/nomad/alloc
Jun 15 17:51:52 apps-1 nomad[3491]:     2016/06/15 17:50:27 [INFO] fingerprint.cgroups: cgroups are available
Jun 15 17:51:52 apps-1 nomad[3491]:     2016/06/15 17:50:27 [INFO] fingerprint.consul: consul agent is available
Jun 15 17:51:52 apps-1 nomad[3491]:     2016/06/15 17:50:27 [WARN] fingerprint.network: Unable to parse Speed in output of '/sbin/ethtool ens4'
Jun 15 17:51:52 apps-1 nomad[3491]:     2016/06/15 17:50:27 [INFO] driver.docker: re-attaching to docker process: 7207b7c63d284d35fcba38c97f0e362fad74aa64e8ba9b6db396d3a0324e9625
Jun 15 17:51:52 apps-1 nomad[3491]:     2016/06/15 17:50:27 [INFO] driver.docker: couldn't re-attach to the plugin process: error creating rpc client for executor plugin: Reattachment process not found
Jun 15 17:51:52 apps-1 nomad[3491]:     2016/06/15 17:51:52 [ERR] client: failed to open handle to task 'fluentd' for alloc '4fe5acc1-fcfa-d7d8-a25b-8459b42d8489': error creating rpc client for executor plugin: Reattachment process not fo
Jun 15 17:51:52 apps-1 nomad[3491]:     2016/06/15 17:51:52 [INFO] driver.docker: a container with the name fluentd-4fe5acc1-fcfa-d7d8-a25b-8459b42d8489 already exists; will attempt to purge and re-create
Jun 15 17:51:52 apps-1 nomad[3491]:     2016/06/15 17:51:52 [INFO] driver.docker: purged container 7207b7c63d284d35fcba38c97f0e362fad74aa64e8ba9b6db396d3a0324e9625
Jun 15 17:51:52 apps-1 nomad[3491]:     2016/06/15 17:51:52 [INFO] driver.docker: created container b2d1ad00ceec817b694fd4aef12968527700828e363a324e0daed645406c0e63
Jun 15 17:51:52 apps-1 nomad[3491]:     2016/06/15 17:51:52 [INFO] driver.docker: started container b2d1ad00ceec817b694fd4aef12968527700828e363a324e0daed645406c0e63
Jun 15 17:51:52 apps-1 nomad[3491]:     2016/06/15 17:51:52 [WARN] client: error fetching stats of task fluentd: stats collection hasn't started yet
Jun 15 17:51:58 apps-1 nomad[3491]:     2016/06/15 17:51:58 [INFO] driver.docker: stopped container b2d1ad00ceec817b694fd4aef12968527700828e363a324e0daed645406c0e63
Jun 15 17:52:12 apps-1 nomad[3491]:     2016/06/15 17:52:12 [ERR] driver.docker: failed to create container from image gcr.io/internal/internal-fluentd:emiter: no such image
Jun 15 17:52:12 apps-1 nomad[3491]:     2016/06/15 17:52:12 [ERR] client: failed to start task 'fluentd' for alloc '5f41d451-11e5-764c-5978-cde0b7b99567': Failed to create container from image gcr.io/internal/internal-fluentd:emiter
Jun 15 17:52:12 apps-1 nomad[3491]:     2016/06/15 17:52:12 [INFO] client: Not restarting task: fluentd for alloc: 5f41d451-11e5-764c-5978-cde0b7b99567

But the system job enters the dead state and never recovers:

root@apps-1:/home/wuub# nomad status
ID                Type    Priority  Status
internal-fluentd  system  50        dead

@dadgar
Contributor

dadgar commented Jun 16, 2016

Hey @grobinson-blockchain @a86c6f7964 @wuub,

So I tried everything you guys have said and still couldn't reproduce this. I restarted the client like 30 times. If one of you can reliably get into this state, can you issue a SIGQUIT (kill -3 <pid>) and give me the stack dump, plus what version of Nomad you are on (preferably the 0.4-rc)?

Thanks,
Alex

@wuub
Contributor

wuub commented Jun 17, 2016

@dadgar I can repro this every time, so if you need me to try something else just let me know.

The SIGQUIT goroutine dump is here, same Nomad version as before (v0.4.0-dev ('bc09a0444722617a3a0ee0daa28d24b93d9d3e5b+CHANGES')):

https://gist.github.com/wuub/9f1671d70ffb041dec8f2d9c77404437

System information:

$ lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description:    Ubuntu 16.04 LTS
Release:    16.04
Codename:   xenial

$ docker info 
Containers: 1
 Running: 1
 Paused: 0
 Stopped: 0
Images: 4
Server Version: 1.11.2
Storage Driver: overlay
 Backing Filesystem: extfs
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins: 
 Volume: local
 Network: bridge null host
Kernel Version: 4.4.0-24-generic
Operating System: Ubuntu 16.04 LTS
OSType: linux
Architecture: x86_64
CPUs: 1
Total Memory: 3.613 GiB
Name: apps-1
ID: [...]
Docker Root Dir: /var/lib/docker
Debug mode (client): false
Debug mode (server): false
Registry: https://index.docker.io/v1/
WARNING: No swap limit support

$ sudo apt show docker-engine
Package: docker-engine
Version: 1.11.2-0~xenial
Priority: optional
Section: admin
Maintainer: Docker <support@docker.com>
Installed-Size: 73.4 MB
Depends: iptables, init-system-helpers (>= 1.18~), lsb-base (>= 4.1+Debian11ubuntu7), libapparmor1 (>= 2.6~devel), libc6 (>= 2.17), libdevmapper1.02.1 (>= 2:1.02.97), libltdl7 (>= 2.4.6), libseccomp2 (>= 2.1.0), libsystemd0
Recommends: aufs-tools, ca-certificates, cgroupfs-mount | cgroup-lite, git, xz-utils, apparmor
Conflicts: docker (<< 1.5~), docker-engine-cs, docker.io, lxc-docker, lxc-docker-virtual-package
Homepage: https://dockerproject.org
Download-Size: 14.5 MB
APT-Manual-Installed: yes
APT-Sources: https://apt.dockerproject.org/repo ubuntu-xenial/main amd64 Packages

The instance is created with the following Terraform:

variable "apps-servers" {
  default = 1
}


resource "google_compute_disk" "apps" {
  count = "${var.apps-servers}"
  name  = "apps-disk-${count.index + 1}"
  type  = "pd-ssd"
  zone  = "${lookup(var.instance_zones, count.index)}"
  image = "ubuntu-1604-xenial-v20160516a"
  size  = 10
}

resource "google_compute_instance" "apps" {
  count        = "${var.apps-servers}"
  name         = "apps-${count.index + 1}"
  machine_type = "n1-standard-1"
  zone         = "${lookup(var.instance_zones, count.index)}"
  tags         = ["apps"]

  disk {
    disk        = "${element(google_compute_disk.apps.*.name, count.index)}"
    auto_delete = false
  }

  network_interface {
    network       = "default"
    access_config = {
    }
  }

  can_ip_forward = true

  service_account {
    scopes = [
      "logging-write",
      "https://www.googleapis.com/auth/monitoring.write",
    ]
  }
}

@dadgar
Contributor

dadgar commented Jun 17, 2016

@wuub Thanks! That stack trace had what we needed. It looks like it is hung calling docker stop. We'll get the timeout to work properly, though I don't know why it is hung.

@dadgar
Contributor

dadgar commented Jun 17, 2016

@wuub So we make the following request to docker: POST /containers/<container-id>/stop?t=10. Could you do the same and see what happens when Nomad is in the hung state? It could potentially expose a docker engine bug.
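
For reference, here is a minimal sketch of making that same request through go-dockerclient (the client library the strace further down in this thread shows Nomad using); the container ID is just the one from the earlier docker ps output, and error handling is simplified:

package main

import (
    "log"

    docker "github.com/fsouza/go-dockerclient"
)

func main() {
    // Talk to the local Docker daemon over its unix socket.
    client, err := docker.NewClient("unix:///var/run/docker.sock")
    if err != nil {
        log.Fatal(err)
    }

    // StopContainer(id, timeout) issues POST /containers/<id>/stop?t=<timeout>.
    // The timeout is the grace period, in seconds, before Docker sends SIGKILL.
    if err := client.StopContainer("7207b7c63d28", 10); err != nil {
        log.Fatalf("stop failed: %v", err)
    }
}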

@wuub
Contributor

wuub commented Jun 17, 2016

I reproduced Nomad's hung state, waited a while, then issued the stop command. It took several seconds (expected, since fluentd does a graceful shutdown that takes a while to complete), but ultimately docker returned "204 No Content" and removed the container.
After that, Nomad resumed booting almost immediately and correctly recreated the fluentd container.

docker ps
CONTAINER ID        IMAGE                                                      COMMAND                  CREATED             STATUS              PORTS                                                                                                                                                  NAMES
51ba1f9dd0e3        gcr.io/internal/internal-fluentd:emiter              "/bin/sh -c fluentd"     5 minutes ago       Up 5 minutes        5140/tcp, 24220/tcp, 24231/tcp, 10.240.0.39:38270->9880/tcp, 10.240.0.39:38270->9880/udp, 10.240.0.39:20685->24224/tcp, 10.240.0.39:20685->24224/udp   fluentd-e64646d8-a538-7c3f-5c6e-7daddeacebb2

# curl --request POST -v --unix-socket /var/run/docker.sock  http:/containers/51ba1f9dd0e3af05fb4f8ca44d3e4261498ef5616d0b6d2ce23e3c31190fb340/stop?t=10
*   Trying /var/run/docker.sock...
* Connected to http (/var/run/docker.sock) port 80 (#0)
> POST /containers/51ba1f9dd0e3af05fb4f8ca44d3e4261498ef5616d0b6d2ce23e3c31190fb340/stop?t=10 HTTP/1.1
> Host: http
> User-Agent: curl/7.47.0
> Accept: */*
> 
< HTTP/1.1 204 No Content
< Server: Docker/1.11.2 (linux)
< Date: Fri, 17 Jun 2016 21:44:23 GMT
< 
* Connection #0 to host http left intact

@wuub
Contributor

wuub commented Jun 28, 2016

@dadgar I think I found the reason. After stracing the nomad binary a few times, I noticed something like this:

[pid  2537] write(5, "POST /containers/9386048b8f46e851416e7d3c6042a369f80ba5ff42477126534232e1597b2220/stop?t=5000000000000000000 HTTP/1.1\r\nHost: unix.sock\r\nUser-Agent: go-dockerclient\r\nContent-Length: 0\r\nContent-Type: pl"..., 254) = 254

When I replayed it using curl, I was able to replicate the infinite lockup:

root@apps-1:/home/wuub# curl --request POST -v --unix-socket /var/run/docker.sock  http:/containers/9386048b8f46e851416e7d3c6042a369f80ba5ff42477126534232e1597b2220/stop?t=5000000000000000000
*   Trying /var/run/docker.sock...
* Connected to http (/var/run/docker.sock) port 80 (#0)
> POST /containers/9386048b8f46e851416e7d3c6042a369f80ba5ff42477126534232e1597b2220/stop?t=5000000000000000000 HTTP/1.1
> Host: http
> User-Agent: curl/7.47.0
> Accept: */*
> 

After retrying out-of-band with curl and t=10, the nomad process that was stuck in the background unlocked and booted quasi-successfully.

I'll do some spelunking to find out why it wants to wait so long and let you know.

@wuub
Contributor

wuub commented Jun 28, 2016

@dadgar
Pretty sure it's this line:

if e := client.StopContainer(pid.ContainerID, uint(pid.KillTimeout*time.Second)); e != nil {

The author probably had something like this in mind:

err := h.client.StopContainer(h.containerID, uint(h.killTimeout.Seconds()))

I'll test it in a minute and send a PR if it fixes my problem.
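
For illustration, a minimal standalone Go sketch (not Nomad source; the killTimeout variable is hypothetical) of why that conversion produces the t=5000000000000000000 seen in the strace above. KillTimeout is a time.Duration, i.e. an int64 count of nanoseconds, so multiplying it by time.Second and casting to uint yields roughly 5e18 "seconds", whereas Seconds() gives the intended 5:

package main

import (
    "fmt"
    "time"
)

func main() {
    // Hypothetical kill timeout of 5 seconds, stored as a time.Duration
    // (an int64 count of nanoseconds).
    killTimeout := 5 * time.Second

    // Buggy conversion: Duration * time.Second multiplies nanoseconds by
    // nanoseconds, giving 5e9 * 1e9 = 5,000,000,000,000,000,000.
    buggy := uint(killTimeout * time.Second)

    // Intended conversion: Seconds() returns the duration as a float64
    // number of seconds, so the cast gives 5.
    fixed := uint(killTimeout.Seconds())

    fmt.Println(buggy) // 5000000000000000000 (on 64-bit platforms)
    fmt.Println(fixed) // 5
}

Since Docker treats t as the number of seconds to wait before sending SIGKILL, the buggy value amounts to waiting essentially forever for an uncooperative container to stop.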

wuub added a commit to wuub/nomad that referenced this issue Jun 28, 2016
Unit mismatch caused docker driver to wait almost indefinitely during boot 
(when one or more containers were a bit uncooperative during StopContainer())
This should fix problems described in  hashicorp#1202
@dadgar
Contributor

dadgar commented Jun 28, 2016

@wuub if the problem is fixed we should close this. Let me know

@wuub
Contributor

wuub commented Jun 29, 2016

@dadgar If I understand correctly this fix made it to 0.4.0?
@grobinson-blockchain it would be great if you could give it a try and confirm that it fixes your issues as well

@dadgar
Contributor

dadgar commented Jun 29, 2016

Yes, it made it into 0.4. I was waiting for confirmation since I have never been able to reproduce it.

@wuub
Contributor

wuub commented Jul 1, 2016

After more testing I can confirm that this is now solved for us :) Thanks

@ghost ghost closed this as completed Jul 1, 2016
@tarasvaskiv

I fixed this by restarting docker: sudo systemctl restart docker.service
This issue was fixed in #5568.

@github-actions

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Nov 23, 2022