Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Docker containers don't exit until after 5 minute timeout #2119

Closed
pgporada opened this issue Dec 15, 2016 · 13 comments · Fixed by #2177
Closed

Docker containers don't exit until after 5 minute timeout #2119

pgporada opened this issue Dec 15, 2016 · 13 comments · Fixed by #2177

Comments

@pgporada
Copy link
Contributor

pgporada commented Dec 15, 2016

Nomad version

Nomad v0.5.1 on 3 servers and 3 clients

[root@centos-7-srv1 ~]# nomad server-members
Name                     Address        Port  Status  Leader  Protocol  Build  Datacenter  Region
centos-7-srv1.us-east-1  192.168.33.10  4648  alive   false   2         0.5.1  molecule    us-east-1
centos-7-srv2.us-east-1  192.168.33.11  4648  alive   true    2         0.5.1  molecule    us-east-1
centos-7-srv3.us-east-1  192.168.33.12  4648  alive   false   2         0.5.1  molecule    us-east-1

[root@centos-7-srv1 ~]# nomad node-status
ID        DC        Name          Class   Drain  Status
b973a2ab  molecule  centos-7-cl1  <none>  false  ready
ea1c52a4  molecule  centos-7-cl3  <none>  false  ready
7a734562  molecule  centos-7-cl2  <none>  false  ready

Operating system and Environment details

All 6 nodes running
Linux centos-7-srv1 3.10.0-327.36.3.el7.x86_64 #1 SMP Mon Oct 24 16:09:20 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux

Issue

Signal to shutdown a container isn't fired.

Reproduction steps

Create a nomad job with count set to 2.

Issue commands nomad plan phil.nomad and nomad run phil.nomad

You will see the 2 containers correctly start up on the appropriate clients.

ID          = phil
Name        = phil
Type        = service
Priority    = 50
Datacenters = molecule
Status      = running
Periodic    = false

^[1mSummary^[0m^[0m
Task Group  Queued  Starting  Running  Failed  Complete  Lost
cache       0       0         2        0       20        0

^[1mAllocations^[0m^[0m
ID        Eval ID   Node ID   Task Group  Desired  Status    Created At
57bc8460  e1123b6b  ea1c52a4  cache	  run	   running   12/15/16 20:19:11 UTC
e698cf48  e1123b6b  7a734562  cache	  run	   running   12/15/16 20:18:16 UTC

Change the count from 2 to 0, run a nomad plan phil.nomad and nomad run phil.nomad and you will see the containers continuing to stay up for ~5 extra minutes. You will also see a

ID          = phil
Name        = phil
Type        = service
Priority    = 50
Datacenters = molecule
Status      = dead
Periodic    = false

^[1mSummary^[0m^[0m
Task Group  Queued  Starting  Running  Failed  Complete  Lost
cache       0       0         1        0       21        0

^[1mAllocations^[0m^[0m
ID        Eval ID   Node ID   Task Group  Desired  Status    Created At
57bc8460  e1123b6b  ea1c52a4  cache	  stop     complete  12/15/16 20:19:11 UTC
e698cf48  9b746ea4  7a734562  cache	  stop     running   12/15/16 20:18:16 UTC

Nomad Server logs (if appropriate)

[root@centos-7-srv1 ~]# cat /var/log/nomad/nomad.log 
    Loaded configuration from /etc/nomad/server.hcl
==> Starting Nomad agent...
==> Nomad agent configuration:

                 Atlas: <disabled>
                Client: false
             Log Level: INFO
                Region: us-east-1 (DC: molecule)
                Server: true
               Version: 0.5.1

==> Nomad agent started! Log data will stream in below:

    2016/12/15 14:23:45 [INFO] serf: EventMemberJoin: centos-7-srv1.us-east-1 192.168.33.10
    2016/12/15 14:23:45.635033 [INFO] nomad: starting 1 scheduling worker(s) for [system service batch _core]
    2016/12/15 14:23:45 [INFO] raft: Node at 192.168.33.10:4647 [Follower] entering Follower state (Leader: "")
    2016/12/15 14:23:45.636033 [INFO] nomad: adding server centos-7-srv1.us-east-1 (Addr: 192.168.33.10:4647) (DC: molecule)
    2016/12/15 14:23:45 [INFO] serf: EventMemberJoin: centos-7-srv2.us-east-1 192.168.33.11
    2016/12/15 14:23:45.772685 [INFO] nomad: adding server centos-7-srv2.us-east-1 (Addr: 192.168.33.11:4647) (DC: molecule)
    2016/12/15 14:23:45.772737 [INFO] nomad: Attempting bootstrap with nodes: [192.168.33.10:4647 192.168.33.11:4647]
    2016/12/15 14:23:45 [INFO] serf: EventMemberJoin: centos-7-srv3.us-east-1 192.168.33.12
    2016/12/15 14:23:45.776479 [INFO] server.consul: successfully contacted 2 Nomad Servers
    2016/12/15 14:23:45.776601 [INFO] nomad: adding server centos-7-srv3.us-east-1 (Addr: 192.168.33.12:4647) (DC: molecule)
    2016/12/15 14:23:47 [WARN] raft: Rejecting vote request from 192.168.33.12:4647 since we have a leader: 192.168.33.11:4647
    2016/12/15 14:23:48 [WARN] raft: Rejecting vote request from 192.168.33.12:4647 since we have a leader: 192.168.33.11:4647
    2016/12/15 14:23:49 [WARN] raft: Heartbeat timeout from "192.168.33.11:4647" reached, starting election
    2016/12/15 14:23:49 [INFO] raft: Node at 192.168.33.10:4647 [Candidate] entering Candidate state
    2016/12/15 14:23:49 [INFO] raft: Node at 192.168.33.10:4647 [Follower] entering Follower state (Leader: "")

Nomad Client logs (if appropriate)

Every 2.0s: hostname ; echo ; docker ps ; echo ; tail -n 5 /var/log/nomad/nomad.log                                                                               Thu Dec 15 20:25:48 2016

centos-7-cl1

CONTAINER ID        IMAGE               COMMAND             CREATED             STATUS              PORTS               NAMES

    2016/12/15 20:14:27.187577 [INFO] client: node registration complete
    2016/12/15 20:15:32.435209 [INFO] driver.docker: created container 7798b58ca79d0a55d757076f36c62a104e18b128b9e34a336862887c45951a2d
    2016/12/15 20:15:33.127100 [INFO] driver.docker: started container 7798b58ca79d0a55d757076f36c62a104e18b128b9e34a336862887c45951a2d
    2016/12/15 20:15:33.130685 [WARN] client: error fetching stats of task redis: stats collection hasn't started yet
    2016/12/15 20:16:42.892537 [INFO] driver.docker: stopped container 7798b58ca79d0a55d757076f36c62a104e18b128b9e34a336862887c45951a2d
                                                                                                                                                                                         
                                                                                                                                                                                         


                                                                                                                                                                                         
──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
Every 2.0s: hostname ; echo ; docker ps ; echo ; tail -n 5 /var/log/nomad/nomad.log                                                                               Thu Dec 15 20:25:47 2016

centos-7-cl2

CONTAINER ID        IMAGE               COMMAND                  CREATED             STATUS              PORTS                                                  NAMES
aa9adce9803d        redis:3.2           "docker-entrypoint.sh"   7 minutes ago       Up 7 minutes        10.0.2.15:40067->6379/tcp, 10.0.2.15:40067->6379/udp   redis-e698cf48-b708-2212-8
83f-b2022380b1c5

    2016/12/15 20:17:27.832225 [WARN] client: error fetching stats of task redis: stats collection hasn't started yet
    2016/12/15 20:18:06.733780 [INFO] driver.docker: stopped container e71f1ed9dba1f56536b298443fe4e16ac11a7690ee2466024554a28a4ed6eb07                                                  
    2016/12/15 20:18:45.839302 [INFO] driver.docker: created container aa9adce9803d80759a2397ded741bd6d5e740056b1437272307418c97a93712d
    2016/12/15 20:18:46.085315 [INFO] driver.docker: started container aa9adce9803d80759a2397ded741bd6d5e740056b1437272307418c97a93712d
    2016/12/15 20:18:46.087891 [WARN] client: error fetching stats of task redis: stats collection hasn't started yet
                                                                                                                                                                                         



                                                                                                                                                                                         
                                                                                                                                                                                         
──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
Every 2.0s: hostname ; echo ; docker ps ; echo ; tail -n 5 /var/log/nomad/nomad.log                                                                               Thu Dec 15 20:25:47 2016

centos-7-cl3

CONTAINER ID        IMAGE               COMMAND             CREATED             STATUS              PORTS               NAMES

    2016/12/15 20:19:03.445758 [INFO] driver.docker: stopped container d1efd8e4600faa790eaef39c18cac521650f7b8418b4cc04352d6a024d8f8a43
    2016/12/15 20:20:08.928978 [INFO] driver.docker: created container 2de7461e2d82d830bfd548bb392fa240ac5054c704419a66ec000709efca8349                                                  
    2016/12/15 20:20:09.117096 [INFO] driver.docker: started container 2de7461e2d82d830bfd548bb392fa240ac5054c704419a66ec000709efca8349
    2016/12/15 20:20:09.119012 [WARN] client: error fetching stats of task redis: stats collection hasn't started yet
    2016/12/15 20:24:59.953326 [INFO] driver.docker: stopped container 2de7461e2d82d830bfd548bb392fa240ac5054c704419a66ec000709efca8349
                                                                                                                                              

I can still hit the container that should be stopped.

[root@centos-7-cl2 ~]# docker ps
CONTAINER ID        IMAGE               COMMAND                  CREATED             STATUS              PORTS                                                  NAMES
aa9adce9803d        redis:3.2           "docker-entrypoint.sh"   8 minutes ago       Up 8 minutes        10.0.2.15:40067->6379/tcp, 10.0.2.15:40067->6379/udp   redis-e698cf48-b708-2212-883f-b2022380b1c5
[root@centos-7-cl2 ~]# telnet 10.0.2.15 40067
Trying 10.0.2.15...
Connected to 10.0.2.15.
Escape character is '^]'.
ping
+PONG

When this happens, the container will hang around indefinitely. The signal(s) will be fired during the next nomad run phil.nomad assuming that I make a change to the job file.

centos-7-cl2

=============>
    2016/12/15 20:18:45.839302 [INFO] driver.docker: created container aa9adce9803d80759a2397ded741bd6d5e740056b1437272307418c97a93712d
    2016/12/15 20:18:46.085315 [INFO] driver.docker: started container aa9adce9803d80759a2397ded741bd6d5e740056b1437272307418c97a93712d
    2016/12/15 20:18:46.087891 [WARN] client: error fetching stats of task redis: stats collection hasn't started yet
    2016/12/15 20:30:17.234377 [INFO] driver.docker: stopped container aa9adce9803d80759a2397ded741bd6d5e740056b1437272307418c97a93712d
<=============

    2016/12/15 20:31:01.851016 [INFO] driver.docker: created container b13507cd35da28130c4156122097d57338cd4f7c57db355c2257d86747802b77

This is not exactly ideal.

Job file (if appropriate)

count=2

job "phil" {
  datacenters = ["molecule"]
  region = "us-east-1"
  type = "service"
  update {
    stagger = "10s"
    max_parallel = 1
  }
  group "cache" {
    count = 2
    restart {
      attempts = 10
      interval = "5m"
      delay = "25s"
      mode = "delay"
    }
    ephemeral_disk {
      size = 300
    }
    task "redis" {
      driver = "docker"
      config {
        image = "redis:3.2"
        port_map {
          db = 6379
        }
      }
      resources {
        network {
          mbits = 10
          port "db" {}
        }
      }
      service {
        name = "global-redis-check"
        tags = ["global", "cache"]
        port = "db"
        check {
          name     = "alive"
          type     = "tcp"
          interval = "10s"
          timeout  = "2s"
        }
      }
    }
  }
}

count=0

job "phil" {
  datacenters = ["molecule"]
  region = "us-east-1"
  type = "service"
  update {
    stagger = "10s"
    max_parallel = 1
  }
  group "cache" {
    count = 0
    restart {
      attempts = 10
      interval = "5m"
      delay = "25s"
      mode = "delay"
    }
    ephemeral_disk {
      size = 300
    }
    task "redis" {
      driver = "docker"
      config {
        image = "redis:3.2"
        port_map {
          db = 6379
        }
      }
      resources {
        network {
          mbits = 10
          port "db" {}
        }
      }
      service {
        name = "global-redis-check"
        tags = ["global", "cache"]
        port = "db"
        check {
          name     = "alive"
          type     = "tcp"
          interval = "10s"
          timeout  = "2s"
        }
      }
    }
  }
}
@pgporada pgporada changed the title Lowering count doesn't fire signals to client and causes containers to hang around indefinitely Lowering count sometimes doesn't fire signals to client and causes containers to hang around indefinitely Dec 15, 2016
@dadgar
Copy link
Contributor

dadgar commented Dec 15, 2016

Can you do nomad alloc-status e698cf48? And paste the full logs of the client running that alloc?

How reliably does this happen?

@pgporada
Copy link
Contributor Author

pgporada commented Dec 15, 2016

I can force this to happen within ~2 minutes by changing the count and running the job, waiting for the job output, running the job, "rinse and repeat". This particular change was count=2 to count=1 on my job from above.

My alloc-status log was cleared for some reason so nomad alloc-status e698cf48 returned

[root@centos-7-srv1 ~]# nomad alloc-status e698cf48
No allocation(s) with prefix or id "e698cf48" found

However, I just reproduced the issue again and here's the output

Every 2.0s: hostname ; echo ; docker ps ; echo ; tail -n 5 /var/log/nomad/nomad.log                                                                               Thu Dec 15 21:39:59 2016

centos-7-cl1

CONTAINER ID        IMAGE               COMMAND                  CREATED              STATUS              PORTS                                                  NAMES
33bd282be38b        redis:3.2           "docker-entrypoint.sh"   About a minute ago   Up About a minute   10.0.2.15:22996->6379/tcp, 10.0.2.15:22996->6379/udp   redis-75a2dba7-ddd4-1977-
3008-864986fcc2bc

    2016/12/15 21:37:29.609524 [WARN] client: error fetching stats of task redis: stats collection hasn't started yet
    2016/12/15 21:38:12.663605 [INFO] driver.docker: stopped container 7cf5f1d87baf197a46f3cfc36186c95bbc967c0eb3080f02015f05bd2826de3e
    2016/12/15 21:38:25.976447 [INFO] driver.docker: created container 33bd282be38bc217a867e5415d9d5002a1d3486a138d362240937c521fbd9f34
    2016/12/15 21:38:26.108368 [INFO] driver.docker: started container 33bd282be38bc217a867e5415d9d5002a1d3486a138d362240937c521fbd9f34
    2016/12/15 21:38:26.109462 [WARN] client: error fetching stats of task redis: stats collection hasn't started yet


                                                                                                                                                                                         
──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
Every 2.0s: hostname ; echo ; docker ps ; echo ; tail -n 5 /var/log/nomad/nomad.log                                                                               Thu Dec 15 21:40:01 2016

centos-7-cl2

CONTAINER ID        IMAGE               COMMAND             CREATED             STATUS              PORTS               NAMES

    2016/12/15 20:42:15.417791 [INFO] client: using alloc directory /opt/nomad/alloc
    2016/12/15 20:42:15.421197 [INFO] fingerprint.cgroups: cgroups are available
    2016/12/15 20:42:15.423134 [INFO] fingerprint.consul: consul agent is available
    2016/12/15 20:42:18.457273 [INFO] client: Node ID "7a734562-617a-a4c0-cefe-c798cd13fc27"
    2016/12/15 20:42:18.479463 [INFO] client: node registration complete

                                                                                                                                                                                         





                                                                                                                                                                                         
──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
Every 2.0s: hostname ; echo ; docker ps ; echo ; tail -n 5 /var/log/nomad/nomad.log                                                                               Thu Dec 15 21:40:00 2016

centos-7-cl3

CONTAINER ID        IMAGE               COMMAND                  CREATED              STATUS              PORTS                                                  NAMES
e84548972440        redis:3.2           "docker-entrypoint.sh"   About a minute ago   Up About a minute   10.0.2.15:32857->6379/tcp, 10.0.2.15:32857->6379/udp   redis-540e7bbf-8754-2592-
870c-836249b1b30a

    2016/12/15 21:37:28.788605 [WARN] client: error fetching stats of task redis: stats collection hasn't started yet
    2016/12/15 21:38:19.763671 [INFO] driver.docker: stopped container 5e20b09acb1efc40a8156a0dd7ef0546684f9c799eae1b0df7cf9d29eefdd43c
    2016/12/15 21:38:26.000837 [INFO] driver.docker: created container e8454897244036676264d18c0fe46446290d01158ba35c7fcc2fa4e7da92f10b                                                  
    2016/12/15 21:38:26.162356 [INFO] driver.docker: started container e8454897244036676264d18c0fe46446290d01158ba35c7fcc2fa4e7da92f10b
    2016/12/15 21:38:26.165919 [WARN] client: error fetching stats of task redis: stats collection hasn't started yet
[root@centos-7-srv1 ~]# nomad run phil.nomad 
==> Monitoring evaluation "ea3acdee"
    Evaluation triggered by job "phil"
    Allocation "75a2dba7" modified: node "b973a2ab", group "cache"
    Evaluation status changed: "pending" -> "complete"
==> Evaluation "ea3acdee" finished with status "complete"
[root@centos-7-srv1 ~]# nomad alloc-status 75a2dba7
ID                 = 75a2dba7
Eval ID            = ea3acdee
Name               = phil.cache[0]
Node ID            = b973a2ab
Job ID             = phil
Client Status      = running
Client Description = <none>
Created At         = 12/15/16 21:38:25 UTC

Task "redis" is "running"
Task Resources
CPU        Memory          Disk  IOPS  Addresses
2/100 MHz  992 KiB/10 MiB  0 B   0     db: 10.0.2.15:22996

Recent Events:
Time                   Type      Description
12/15/16 21:38:26 UTC  Started   Task started by client
12/15/16 21:38:25 UTC  Received  Task received by client
[root@centos-7-srv1 ~]# nomad alloc-status 540e7bbf
ID                 = 540e7bbf
Eval ID            = b382ea91
Name               = phil.cache[1]
Node ID            = ea1c52a4
Job ID             = phil
Client Status      = running
Client Description = <none>
Created At         = 12/15/16 21:38:25 UTC

Task "redis" is "running"
Task Resources
CPU        Memory          Disk  IOPS  Addresses
2/100 MHz  992 KiB/10 MiB  0 B   0     db: 10.0.2.15:32857

Recent Events:
Time                   Type      Description
12/15/16 21:38:26 UTC  Started   Task started by client
12/15/16 21:38:25 UTC  Received  Task received by client
[root@centos-7-srv1 ~]# nomad status phil
ID          = phil
Name        = phil
Type        = service
Priority    = 50
Datacenters = molecule
Status      = running
Periodic    = false

Summary
Task Group  Queued  Starting  Running  Failed  Complete  Lost
cache       0       0         2        0       28        0

Allocations
ID        Eval ID   Node ID   Task Group  Desired  Status    Created At
540e7bbf  b382ea91  ea1c52a4  cache       stop     running   12/15/16 21:38:25 UTC
75a2dba7  ea3acdee  b973a2ab  cache       run      running   12/15/16 21:38:25 UTC

In this particular case, the containers decided that ~5 minutes was enough time and got into the correct state.

[root@centos-7-srv1 ~]# nomad alloc-status 540e7bbf
ID                 = 540e7bbf
Eval ID            = b382ea91
Name               = phil.cache[1]
Node ID            = ea1c52a4
Job ID             = phil
Client Status      = complete
Client Description = <none>
Created At         = 12/15/16 21:38:25 UTC

Task "redis" is "dead"
Task Resources
CPU      Memory  Disk  IOPS  Addresses
100 MHz  10 MiB  0 B   0     db: 10.0.2.15:32857

Recent Events:
Time                   Type      Description
12/15/16 21:43:46 UTC  Killed    Task successfully killed
12/15/16 21:43:46 UTC  Killing   Sent interrupt. Waiting 5s before force killing
12/15/16 21:38:26 UTC  Started   Task started by client
12/15/16 21:38:25 UTC  Received  Task received by client

@pgporada
Copy link
Contributor Author

I should note the version of docker I am using

[root@centos-7-cl1 ~]# docker version
Client:
 Version:      1.11.2
 API version:  1.23
 Go version:   go1.5.4
 Git commit:   b9f10c9
 Built:        Wed Jun  1 21:23:11 2016
 OS/Arch:      linux/amd64

Server:
 Version:      1.11.2
 API version:  1.23
 Go version:   go1.5.4
 Git commit:   b9f10c9
 Built:        Wed Jun  1 21:23:11 2016
 OS/Arch:      linux/amd64

@dadgar
Copy link
Contributor

dadgar commented Dec 15, 2016

Can you paste the full client logs of node ea1c52a4. Can't really follow that style of logs. Also are you running Nomad clients in containers?

@pgporada
Copy link
Contributor Author

I am running Nomad on the host itself via systemd. If you would like to schedule a screenshare or something like that, I'd be happy to oblige.

==> Starting Nomad agent...
==> Nomad agent configuration:

                 Atlas: <disabled>
                Client: true
             Log Level: INFO
                Region: us-east-1 (DC: molecule)
                Server: false
               Version: 0.5.1

==> Nomad agent started! Log data will stream in below:

    2016/12/15 20:42:15.500090 [INFO] client: using state directory /opt/nomad/client
    2016/12/15 20:42:15.500170 [INFO] client: using alloc directory /opt/nomad/alloc
    2016/12/15 20:42:15.502557 [INFO] fingerprint.cgroups: cgroups are available
    2016/12/15 20:42:15.504121 [INFO] fingerprint.consul: consul agent is available
    2016/12/15 20:42:18.544095 [INFO] client: Node ID "ea1c52a4-785f-d00a-911b-65e5e79fba02"
    2016/12/15 20:42:18.564452 [INFO] client: node registration complete
    2016/12/15 21:37:28.225018 [INFO] driver.docker: created container 5e20b09acb1efc40a8156a0dd7ef0546684f9c799eae1b0df7cf9d29eefdd43c
    2016/12/15 21:37:28.758178 [INFO] driver.docker: started container 5e20b09acb1efc40a8156a0dd7ef0546684f9c799eae1b0df7cf9d29eefdd43c
    2016/12/15 21:37:28.788605 [WARN] client: error fetching stats of task redis: stats collection hasn't started yet
    2016/12/15 21:38:19.763671 [INFO] driver.docker: stopped container 5e20b09acb1efc40a8156a0dd7ef0546684f9c799eae1b0df7cf9d29eefdd43c
    2016/12/15 21:38:26.000837 [INFO] driver.docker: created container e8454897244036676264d18c0fe46446290d01158ba35c7fcc2fa4e7da92f10b
    2016/12/15 21:38:26.162356 [INFO] driver.docker: started container e8454897244036676264d18c0fe46446290d01158ba35c7fcc2fa4e7da92f10b
    2016/12/15 21:38:26.165919 [WARN] client: error fetching stats of task redis: stats collection hasn't started yet
    2016/12/15 21:43:46.819799 [INFO] driver.docker: stopped container e8454897244036676264d18c0fe46446290d01158ba35c7fcc2fa4e7da92f10b

@dadgar
Copy link
Contributor

dadgar commented Dec 15, 2016

@pgporada Appreciate the offer. Hopefully we can get to the root of it async. If it is possible could you reproduce while running at log level DEBUG for both the server and client and paste the results.

@pgporada
Copy link
Contributor Author

I will get you that by tomorrow. We're about to kick off a company Christmas party. Thank you for the help so far.

@pgporada
Copy link
Contributor Author

pgporada commented Dec 19, 2016

Sorry for the delay. Please see the enclosed files

  • centos-7-srv1.zip

    • consul.log with standard logging
    • nomad.log with debug logging
    • httpd.nomad which is the job I've been planning/running for this round of expirements
    • server_status.txt which is output from the server I've been running the jobs from. You will see the problem here
    • server_status_allocations.txt which shows the job allocations. The problem can be seen here as well.
    • troubleshooting_script the commands I ran to gather this data
  • centos-7-cl3.zip

  • centos-7-cl2.zip

  • centos-7-cl1.zip

    • consul.log with standard logging
    • nomad.log with debug logging

Specifically, why did it take 5+ minutes to kill a container?

+/- Job: "httpd"
    Datacenters {
  Datacenters: "molecule"
}
+/- Task Group: "cache" (6 destroy)
  +/- Count: "6" => "0" (forces destroy)
      Task: "httpd"

Scheduler dry-run:
- All tasks successfully allocated.

---

Mon Dec 19 14:45:28 UTC 2016
==> Monitoring evaluation "6b81a051"
    Evaluation triggered by job "httpd"
    Evaluation status changed: "pending" -> "complete"
==> Evaluation "6b81a051" finished with status "complete"

---

ID                 = 05364f3b
Eval ID            = 3ca39c14
Name               = httpd.cache[3]
Node ID            = a42fd16a
Job ID             = httpd
Client Status      = complete
Client Description = <none>
Created At         = 12/19/16 14:45:08 UTC

Task "httpd" is "dead"
Task Resources
CPU      Memory  Disk  IOPS  Addresses
100 MHz  10 MiB  0 B   0     http: 10.0.2.15:50777

Recent Events:
Time                   Type      Description
12/19/16 14:50:40 UTC  Killed    Task successfully killed
12/19/16 14:50:39 UTC  Killing   Sent interrupt. Waiting 5s before force killing
12/19/16 14:45:08 UTC  Started   Task started by client
12/19/16 14:45:08 UTC  Received  Task received by client

@bluen
Copy link

bluen commented Dec 19, 2016

We're hitting the same issue - I guess that somehow the timeout in docker.go line 89 is hit for some reason, or maybe JobGCInterval or NodeGCInterval which are also set to 5 minutes.

@dadgar
Copy link
Contributor

dadgar commented Dec 19, 2016

Thanks for the logs and report. Will get a fix!

@dadgar dadgar added this to the v0.5.1 milestone Dec 19, 2016
@dadgar dadgar changed the title Lowering count sometimes doesn't fire signals to client and causes containers to hang around indefinitely Docker containers don't exit until after 5 minute timeout Dec 19, 2016
@bluen
Copy link

bluen commented Dec 28, 2016

Just noted that issue #2133 reports a similar symptom for the raw_exec driver.

@dadgar
Copy link
Contributor

dadgar commented Jan 3, 2017

@bluen I think they are unrelated. Docker driver has an http client with an explicit 5 minute timeout. We use this because occasionally docker engine stops being responsive and we don't want it to hang the Nomad client.

dadgar added a commit that referenced this issue Jan 10, 2017
This PR makes GetAllocs use a blocking query as well as adding a sanity
check to the clients watchAllocation code to ensure it gets the correct
allocations.

This PR fixes #2119 and
#2153.

The issue was that the client was talking to two different servers, one
to check which allocations to pull and the other to pull those
allocations.  However the latter call was not with a blocking query and
thus the client would not retreive the allocations it requested.

The logging has been improved to make the problem more clear as well.
@tgross tgross modified the milestones: near-term, unscheduled Jan 9, 2020
@github-actions
Copy link

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Nov 14, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants