Occasional "close of nil channel" panic on task restart #2479

Closed
hobochili opened this issue Mar 24, 2017 · 1 comment · Fixed by #2480

Comments

@hobochili
Contributor

Nomad version

Nomad v0.5.5

Operating system and Environment details

NAME="Ubuntu"
VERSION="16.04.1 LTS (Xenial Xerus)"

Issue

I have a task that depends on a template with "ChangeMode": "restart". In roughly 25% of the cases where the template is updated, the restart triggers a "close of nil channel" panic as the task_runner attempts to close the stopCollection channel.

This configuration worked before the 0.5.3 -> 0.5.5 upgrade. Commit 4826d84 modified how the task_runner run function handles the stopCollection channel, but I'm not familiar enough with the code to isolate the bug.
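
For readers unfamiliar with this panic: in Go, closing a channel variable that was declared but never initialized with make panics at runtime with exactly this message. The sketch below reproduces that in isolation. closePanics is a hypothetical helper written for this illustration and is not part of Nomad.

```go
package main

import "fmt"

// closePanics is a hypothetical helper (not Nomad code). It closes ch and
// reports whether that close panicked, recovering so the caller survives.
func closePanics(ch chan struct{}) (panicked bool, msg string) {
	defer func() {
		if r := recover(); r != nil {
			panicked = true
			msg = fmt.Sprint(r)
		}
	}()
	close(ch)
	return false, ""
}

func main() {
	var neverMade chan struct{} // nil: declared but never created with make

	fmt.Println(closePanics(neverMade))
	fmt.Println(closePanics(make(chan struct{})))
}
```

A restart that arrives before the channel has been created would hit the first case, which may explain why the panic only shows up on some restarts.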

Nomad Client logs

Mar 24 10:30:42 tauros nomad[9859]:     2017/03/24 10:30:42 [DEBUG] (runner) receiving dependency health.service(disque-worker@aws|passing)
Mar 24 10:30:42 tauros nomad[9859]:     2017/03/24 10:30:42 [INFO] (runner) initiating run
Mar 24 10:30:42 tauros nomad[9859]:     2017/03/24 10:30:42 [DEBUG] (runner) checking template 5be7920e5f2ab5d81942594895349c82
Mar 24 10:30:42 tauros nomad[9859]:     2017/03/24 10:30:42 [DEBUG] (runner) rendering "(dynamic)" => "/var/lib/nomad/alloc/72ab2d7a-25f3-1a39-2d01-17680f51dac5/disque-web/local/env"
Mar 24 10:30:42 tauros nomad[9859]:     2017/03/24 10:30:42 [INFO] (runner) rendered "(dynamic)" => "/var/lib/nomad/alloc/72ab2d7a-25f3-1a39-2d01-17680f51dac5/disque-web/local/env"
Mar 24 10:30:42 tauros nomad[9859]:     2017/03/24 10:30:42 [DEBUG] (runner) diffing and updating dependencies
Mar 24 10:30:42 tauros nomad[9859]:     2017/03/24 10:30:42 [DEBUG] (runner) health.service(disque@aws|passing) is still needed
Mar 24 10:30:42 tauros nomad[9859]:     2017/03/24 10:30:42 [DEBUG] (runner) health.service(disque-worker@aws|passing) is still needed
Mar 24 10:30:42 tauros nomad[9859]:     2017/03/24 10:30:42 [DEBUG] (runner) watching 2 dependencies
Mar 24 10:30:43 tauros nomad[9859]:     2017/03/24 10:30:43.378485 [DEBUG] client: restarting task disque-web for alloc "72ab2d7a-25f3-1a39-2d01-17680f51dac5": consul-template: template with change_mode restart re-rendered
Mar 24 10:30:43 tauros nomad[9859]:     2017/03/24 10:30:43.378526 [DEBUG] client: task being restarted: consul-template: template with change_mode restart re-rendered
Mar 24 10:30:43 tauros nomad[9859]:     2017/03/24 10:30:43.394881 [DEBUG] http: Request /v1/client/stats?region=us-east-1&wait=60000ms (219.138µs)
Mar 24 10:30:43 tauros nomad[9859]:     2017/03/24 10:30:43.686622 [DEBUG] client: updated allocations at index 371634 (total 13) (pulled 0) (filtered 13)
Mar 24 10:30:43 tauros nomad[9859]:     2017/03/24 10:30:43.687030 [DEBUG] client: allocs: (added 0) (removed 0) (updated 0) (ignore 13)
Mar 24 10:30:47 tauros nomad[9859]:     2017/03/24 10:30:47 [DEBUG] (runner) receiving dependency health.service(disque@aws|passing)
Mar 24 10:30:47 tauros nomad[9859]:     2017/03/24 10:30:47 [INFO] (runner) initiating run
Mar 24 10:30:47 tauros nomad[9859]:     2017/03/24 10:30:47 [DEBUG] (runner) checking template 5be7920e5f2ab5d81942594895349c82
Mar 24 10:30:47 tauros nomad[9859]:     2017/03/24 10:30:47 [DEBUG] (runner) rendering "(dynamic)" => "/var/lib/nomad/alloc/72ab2d7a-25f3-1a39-2d01-17680f51dac5/disque-web/local/env"
Mar 24 10:30:47 tauros nomad[9859]:     2017/03/24 10:30:47 [DEBUG] (runner) diffing and updating dependencies
Mar 24 10:30:47 tauros nomad[9859]:     2017/03/24 10:30:47 [DEBUG] (runner) health.service(disque@aws|passing) is still needed
Mar 24 10:30:47 tauros nomad[9859]:     2017/03/24 10:30:47 [DEBUG] (runner) health.service(disque-worker@aws|passing) is still needed
Mar 24 10:30:47 tauros nomad[9859]:     2017/03/24 10:30:47 [DEBUG] (runner) watching 2 dependencies
Mar 24 10:30:48 tauros nomad[9859]:     2017/03/24 10:30:48.011739 [DEBUG] client: restarting task disque-web for alloc "72ab2d7a-25f3-1a39-2d01-17680f51dac5": consul-template: template with change_mode restart re-rendered
Mar 24 10:30:48 tauros nomad[9859]:     2017/03/24 10:30:48.113289 [DEBUG] http: Request /v1/agent/servers (168.134µs)
Mar 24 10:30:48 tauros nomad[9859]:     2017/03/24 10:30:48.391545 [DEBUG] http: Request /v1/client/stats?region=us-east-1&wait=60000ms (98.091µs)
Mar 24 10:30:48 tauros nomad[9859]:     2017/03/24 10:30:48.560422 [DEBUG] driver.docker: error collecting stats from container 296e2ad2e3204b1512ef7ce4edde0849d2b29b14b49fac1d092ed4c067603e61: io: read/write on closed pipe
Mar 24 10:30:48 tauros nomad[9859]:     2017/03/24 10:30:48.560526 [INFO] driver.docker: stopped container 296e2ad2e3204b1512ef7ce4edde0849d2b29b14b49fac1d092ed4c067603e61
Mar 24 10:30:48 tauros nomad[9859]:     2017/03/24 10:30:48 [DEBUG] plugin: /usr/local/nomad-0.5.5/nomad: plugin process exited
Mar 24 10:30:48 tauros nomad[9859]:     2017/03/24 10:30:48.608270 [INFO] client: Restarting task "disque-web" for alloc "72ab2d7a-25f3-1a39-2d01-17680f51dac5" in 0s
Mar 24 10:30:48 tauros nomad[9859]:     2017/03/24 10:30:48.608550 [DEBUG] client: task being restarted: consul-template: template with change_mode restart re-rendered
Mar 24 10:30:48 tauros nomad[9859]: panic: close of nil channel
Mar 24 10:30:48 tauros nomad[9859]: goroutine 153 [running]:
Mar 24 10:30:48 tauros nomad[9859]: github.com/hashicorp/nomad/client.(*TaskRunner).run(0xc4205fa580)
Mar 24 10:30:48 tauros nomad[9859]:         /opt/gopath/src/github.com/hashicorp/nomad/client/task_runner.go:972 +0xa9e
Mar 24 10:30:48 tauros nomad[9859]: github.com/hashicorp/nomad/client.(*TaskRunner).Run(0xc4205fa580)
Mar 24 10:30:48 tauros nomad[9859]:         /opt/gopath/src/github.com/hashicorp/nomad/client/task_runner.go:442 +0x556
Mar 24 10:30:48 tauros nomad[9859]: created by github.com/hashicorp/nomad/client.(*AllocRunner).RestoreState
Mar 24 10:30:48 tauros nomad[9859]:         /opt/gopath/src/github.com/hashicorp/nomad/client/alloc_runner.go:190 +0x82f

Excerpts from Job file

      "Tasks": [
        {
          "Name": "disque-web",
          "Driver": "docker",
          "User": "",
          "Config": {
            "args": [
              "-c",
              "source ${NOMAD_TASK_DIR}/env && ${NOMAD_TASK_DIR}/disque-web"
            ],
            "command": "/bin/bash",
            "image": "<private ubuntu image>"
          },
          "Templates": [
            {
              "SourcePath": "",
              "DestPath": "local/env",
              "EmbeddedTmpl": "export DW_DISQUE_ADDRS=\"{{range $i, $service := service \"disque@aws\"}}{{if ne $i 0}},{{end}}{{.Address}}:{{.Port}}{{end}}\"\nexport DW_WORKER_ADDRS=\"{{range $i, $service := service \"disque-worker@aws\"}}{{if ne $i 0}},{{end}}{{.Address}}:{{.Port}}{{end}}\"\n",
              "ChangeMode": "restart",
              "ChangeSignal": "",
              "Splay": 5000000000,
              "Perms": "",
              "LeftDelim": "",
              "RightDelim": ""
            }
          ]
        }
      ]
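
To make the rendered file concrete, the first export line of the embedded template can be run through Go's text/template package, which consul-template builds on. This is a sketch only: fakeService, the instance type, and the addresses are invented stand-ins for consul-template's service function, which would query Consul instead.

```go
package main

import (
	"fmt"
	"strings"
	"text/template"
)

// instance holds the two fields the template reads from each service entry.
type instance struct {
	Address string
	Port    int
}

// fakeService is a hypothetical stub for consul-template's service lookup.
// The addresses below are made up for illustration.
func fakeService(name string) []instance {
	if name == "disque@aws" {
		return []instance{{"10.0.0.1", 7711}, {"10.0.0.2", 7711}}
	}
	return nil
}

// envTmpl is the first line of the EmbeddedTmpl value, with JSON escapes removed.
const envTmpl = `export DW_DISQUE_ADDRS="{{range $i, $service := service "disque@aws"}}{{if ne $i 0}},{{end}}{{.Address}}:{{.Port}}{{end}}"
`

// render executes the template with the stubbed service function.
func render() (string, error) {
	funcs := template.FuncMap{"service": fakeService}
	t, err := template.New("env").Funcs(funcs).Parse(envTmpl)
	if err != nil {
		return "", err
	}
	var sb strings.Builder
	if err := t.Execute(&sb, nil); err != nil {
		return "", err
	}
	return sb.String(), nil
}

func main() {
	out, err := render()
	if err != nil {
		panic(err)
	}
	fmt.Print(out)
}
```

Any change in the set of passing service instances changes this rendered line, and with "ChangeMode": "restart" each such change restarts the task.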
dadgar added a commit that referenced this issue Mar 24, 2017
This PR fixes an issue that is hit when running templates with restart
mode in which the client could panic when the handle is not running.

Fixes #2479
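
One common way to avoid this class of panic is to guard the close so that stopping is a no-op when collection never started. The following is a minimal sketch of that pattern under stated assumptions, not the actual change made in #2480. statsCollector, start, and stop are hypothetical names; only the stopCh field loosely mirrors the stopCollection channel mentioned in the report.

```go
package main

import "fmt"

// statsCollector is a hypothetical stand-in, not a Nomad type.
type statsCollector struct {
	stopCh chan struct{} // nil until start is called
}

// start creates the stop channel, as would happen once the handle is running.
func (c *statsCollector) start() {
	c.stopCh = make(chan struct{})
}

// stop closes the channel only if it exists, then resets it to nil so a
// second call is also harmless. It reports whether a close happened.
// Note: this sketch is not safe for concurrent use without a mutex.
func (c *statsCollector) stop() bool {
	if c.stopCh == nil {
		return false
	}
	close(c.stopCh)
	c.stopCh = nil
	return true
}

func main() {
	c := &statsCollector{}
	fmt.Println(c.stop()) // restart arrives before start: no panic
	c.start()
	fmt.Println(c.stop()) // normal stop
	fmt.Println(c.stop()) // repeated stop: no panic
}
```

Resetting the field to nil after closing also guards against the related "close of closed channel" panic if a restart is signalled twice.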
@github-actions

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Dec 14, 2022