Nomad node becoming unresponsive #7987

Closed
raunak2004 opened this issue May 16, 2020 · 30 comments

raunak2004 commented May 16, 2020

The Nomad node became unresponsive, which means no jobs can be scheduled on it.
systemctl status nomad reports the service as active, but none of the Nomad CLI commands return any response. Since Nomad is unresponsive, any service that depends on it also starts acting up; in this case that service is ZooKeeper.
[three screenshots attached]

Logs:

May 07 18:35:18 us02pA-zkp1 nomad[7324]: 2020-05-07T18:35:18.953Z [INFO] client.driver_mgr.docker: created container: driver=docker container_id=17ea92a789da5cb775ac62e2039df3b8e2d3ddfc66c48e103516bb9b131b5988
May 07 18:35:18 us02pA-zkp1 nomad[7324]: 2020-05-07T18:35:18.953Z [INFO] client.driver_mgr.docker: created container: driver=docker container_id=12cc065111f1cff4333e617c1b2bcd1db3129a160edaa03a6ca7e746fcfa5a48
May 07 18:35:19 us02pA-zkp1 nomad[7324]: 2020-05-07T18:35:19.393Z [INFO] client.driver_mgr.docker: started container: driver=docker container_id=17ea92a789da5cb775ac62e2039df3b8e2d3ddfc66c48e103516bb9b131b5988
May 07 18:35:19 us02pA-zkp1 nomad[7324]: 2020-05-07T18:35:19.510Z [INFO] client.driver_mgr.docker: started container: driver=docker container_id=12cc065111f1cff4333e617c1b2bcd1db3129a160edaa03a6ca7e746fcfa5a48
May 07 20:03:41 us02pA-zkp1 nomad[7324]: 2020-05-07T20:03:41.525Z [INFO] client.gc: garbage collecting allocation: alloc_id=1e9be7e0-3756-aa5a-3a85-ef0930eb798c reason="forced collection"
May 12 18:00:48 us02pA-zkp1 nomad[7324]: 2020-05-12T18:00:48.295Z [ERROR] client.rpc: error performing RPC to server: error="rpc error: EOF" rpc=Node.GetClientAllocs server=10.102.7.13:4647
May 12 18:00:48 us02pA-zkp1 nomad[7324]: 2020-05-12T18:00:48.295Z [ERROR] client: error querying node allocations: error="rpc error: EOF"
May 12 18:01:06 us02pA-zkp1 nomad[7324]: 2020-05-12T18:01:06.887Z [ERROR] client.rpc: error performing RPC to server: error="rpc error: EOF" rpc=Node.GetClientAllocs server=10.102.7.14:4647
May 12 18:01:06 us02pA-zkp1 nomad[7324]: 2020-05-12T18:01:06.887Z [ERROR] client: error querying node allocations: error="rpc error: EOF"
May 12 18:03:00 us02pA-zkp1 systemd[1]: Stopping "HashiCorp Nomad - Workload orchestrator"...
May 12 18:03:00 us02pA-zkp1 nomad[7324]: ==> Caught signal: interrupt
May 12 18:03:00 us02pA-zkp1 nomad[7324]: 2020-05-12T18:03:00.378Z [INFO] agent: requesting shutdown
May 12 18:03:00 us02pA-zkp1 nomad[7324]: 2020-05-12T18:03:00.378Z [INFO] client: shutting down
May 12 18:03:00 us02pA-zkp1 nomad[7324]: 2020-05-12T18:03:00.386Z [INFO] client.plugin: shutting down plugin manager: plugin-type=device
May 12 18:03:00 us02pA-zkp1 nomad[7324]: 2020-05-12T18:03:00.387Z [INFO] client.plugin: plugin manager finished: plugin-type=device
May 12 18:03:00 us02pA-zkp1 nomad[7324]: 2020-05-12T18:03:00.387Z [INFO] client.plugin: shutting down plugin manager: plugin-type=driver
May 12 18:03:00 us02pA-zkp1 nomad[7324]: 2020-05-12T18:03:00.394Z [INFO] client.plugin: plugin manager finished: plugin-type=driver
May 12 18:03:00 us02pA-zkp1 nomad[7324]: 2020-05-12T18:03:00.394Z [INFO] client.plugin: shutting down plugin manager: plugin-type=csi
May 12 18:03:00 us02pA-zkp1 nomad[7324]: 2020-05-12T18:03:00.394Z [INFO] client.plugin: plugin manager finished: plugin-type=csi
May 12 18:03:00 us02pA-zkp1 nomad[7324]: 2020-05-12T18:03:00.412Z [ERROR] consul.sync: failed deregistering agent check: check_id=_nomad-check-a258ff77935a905d2f76d1a6b5912b7771336321 error="Unexpected response code: 500 (Unknown check "_nomad-check-a258ff77935a905d2f76d1a6b5912b7771336321")"
May 12 18:03:00 us02pA-zkp1 nomad[7324]: 2020-05-12T18:03:00.412Z [INFO] agent: shutdown complete
May 12 18:03:00 us02pA-zkp1 systemd[1]: nomad.service: Main process exited, code=exited, status=1/FAILURE
May 12 18:03:00 us02pA-zkp1 systemd[1]: nomad.service: Failed with result 'exit-code'.
May 12 18:03:00 us02pA-zkp1 systemd[1]: Stopped "HashiCorp Nomad - Workload orchestrator".
May 12 18:03:00 us02pA-zkp1 systemd[1]: nomad.service: Found left-over process 21694 (nomad) in control group while starting unit. Ignoring.
May 12 18:03:00 us02pA-zkp1 systemd[1]: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
May 12 18:03:00 us02pA-zkp1 systemd[1]: nomad.service: Found left-over process 21695 (nomad) in control group while starting unit. Ignoring.
May 12 18:03:00 us02pA-zkp1 systemd[1]: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
May 12 18:03:00 us02pA-zkp1 systemd[1]: nomad.service: Found left-over process 21815 (nomad) in control group while starting unit. Ignoring.
May 12 18:03:00 us02pA-zkp1 systemd[1]: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
May 12 18:03:00 us02pA-zkp1 systemd[1]: nomad.service: Found left-over process 21855 (nomad) in control group while starting unit. Ignoring.
May 12 18:03:00 us02pA-zkp1 systemd[1]: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
May 12 18:03:00 us02pA-zkp1 systemd[1]: nomad.service: Found left-over process 22185 (nomad) in control group while starting unit. Ignoring.
May 12 18:03:00 us02pA-zkp1 systemd[1]: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
May 12 18:03:00 us02pA-zkp1 systemd[1]: nomad.service: Found left-over process 22247 (nomad) in control group while starting unit. Ignoring.
May 12 18:03:00 us02pA-zkp1 systemd[1]: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
May 12 18:03:00 us02pA-zkp1 systemd[1]: nomad.service: Found left-over process 108874 (nomad) in control group while starting unit. Ignoring.
May 12 18:03:00 us02pA-zkp1 systemd[1]: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
May 12 18:03:00 us02pA-zkp1 systemd[1]: nomad.service: Found left-over process 108875 (nomad) in control group while starting unit. Ignoring.
May 12 18:03:00 us02pA-zkp1 systemd[1]: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
May 12 18:03:00 us02pA-zkp1 systemd[1]: nomad.service: Found left-over process 108881 (nomad) in control group while starting unit. Ignoring.
May 12 18:03:00 us02pA-zkp1 systemd[1]: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
May 12 18:03:00 us02pA-zkp1 systemd[1]: nomad.service: Found left-over process 108998 (nomad) in control group while starting unit. Ignoring.
May 12 18:03:00 us02pA-zkp1 systemd[1]: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
May 12 18:03:00 us02pA-zkp1 systemd[1]: nomad.service: Found left-over process 109039 (nomad) in control group while starting unit. Ignoring.
May 12 18:03:00 us02pA-zkp1 systemd[1]: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
May 12 18:03:00 us02pA-zkp1 systemd[1]: nomad.service: Found left-over process 109198 (nomad) in control group while starting unit. Ignoring.
May 12 18:03:00 us02pA-zkp1 systemd[1]: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
May 12 18:03:00 us02pA-zkp1 systemd[1]: nomad.service: Found left-over process 114847 (nomad) in control group while starting unit. Ignoring.
May 12 18:03:00 us02pA-zkp1 systemd[1]: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
May 12 18:03:00 us02pA-zkp1 systemd[1]: nomad.service: Found left-over process 114848 (nomad) in control group while starting unit. Ignoring.
May 12 18:03:00 us02pA-zkp1 systemd[1]: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
May 12 18:03:00 us02pA-zkp1 systemd[1]: nomad.service: Found left-over process 114959 (nomad) in control group while starting unit. Ignoring.
May 12 18:03:00 us02pA-zkp1 systemd[1]: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
May 12 18:03:00 us02pA-zkp1 systemd[1]: nomad.service: Found left-over process 114996 (nomad) in control group while starting unit. Ignoring.
May 12 18:03:00 us02pA-zkp1 systemd[1]: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
May 12 18:03:00 us02pA-zkp1 systemd[1]: Started "HashiCorp Nomad - Workload orchestrator".
May 12 18:03:00 us02pA-zkp1 nomad[121900]: ==> Loaded configuration from /etc/nomad.d/acl.hcl, /etc/nomad.d/client.json, /etc/nomad.d/nomad.hcl, /etc/nomad.d/nomad_vault_management.hcl, /etc/nomad.d/telemetry.json
May 12 18:03:00 us02pA-zkp1 nomad[121900]: ==> Starting Nomad agent...
May 12 18:03:06 us02pA-zkp1 nomad[121900]: ==> Nomad agent configuration:
May 12 18:03:06 us02pA-zkp1 nomad[121900]: Advertise Addrs: HTTP: *.*.*.*:4646
May 12 18:03:06 us02pA-zkp1 nomad[121900]: Bind Addrs: HTTP: 0.0.0.0:4646
May 12 18:03:06 us02pA-zkp1 nomad[121900]: Client: true
May 12 18:03:06 us02pA-zkp1 nomad[121900]: Log Level: INFO
May 12 18:03:06 us02pA-zkp1 nomad[121900]: Region: global (DC: us02pA)
May 12 18:03:06 us02pA-zkp1 nomad[121900]: Server: false
May 12 18:03:06 us02pA-zkp1 nomad[121900]: Version: 0.11.1
May 12 18:03:06 us02pA-zkp1 nomad[121900]: ==> Nomad agent started! Log data will stream in below:
May 12 18:03:06 us02pA-zkp1 nomad[121900]: 2020-05-12T18:03:00.473Z [WARN] agent.plugin_loader: skipping external plugins since plugin_dir doesn't exist: plugin_dir=/opt/nomad/plugins
May 12 18:03:06 us02pA-zkp1 nomad[121900]: 2020-05-12T18:03:00.474Z [INFO] agent: detected plugin: name=java type=driver plugin_version=0.1.0
May 12 18:03:06 us02pA-zkp1 nomad[121900]: 2020-05-12T18:03:00.474Z [INFO] agent: detected plugin: name=docker type=driver plugin_version=0.1.0
May 12 18:03:06 us02pA-zkp1 nomad[121900]: 2020-05-12T18:03:00.474Z [INFO] agent: detected plugin: name=raw_exec type=driver plugin_version=0.1.0
May 12 18:03:06 us02pA-zkp1 nomad[121900]: 2020-05-12T18:03:00.474Z [INFO] agent: detected plugin: name=exec type=driver plugin_version=0.1.0
May 12 18:03:06 us02pA-zkp1 nomad[121900]: 2020-05-12T18:03:00.474Z [INFO] agent: detected plugin: name=qemu type=driver plugin_version=0.1.0
May 12 18:03:06 us02pA-zkp1 nomad[121900]: 2020-05-12T18:03:00.474Z [INFO] agent: detected plugin: name=nvidia-gpu type=device plugin_version=0.1.0
May 12 18:03:06 us02pA-zkp1 nomad[121900]: 2020-05-12T18:03:00.474Z [INFO] client: using state directory: state_dir=/opt/nomad/client
May 12 18:03:06 us02pA-zkp1 nomad[121900]: 2020-05-12T18:03:00.474Z [INFO] client: using alloc directory: alloc_dir=/opt/nomad/alloc
May 12 18:03:06 us02pA-zkp1 nomad[121900]: 2020-05-12T18:03:00.476Z [INFO] client.fingerprint_mgr.cgroup: cgroups are available
May 12 18:03:06 us02pA-zkp1 nomad[121900]: 2020-05-12T18:03:00.489Z [INFO] client.fingerprint_mgr.consul: consul agent is available
May 12 18:03:06 us02pA-zkp1 nomad[121900]: 2020-05-12T18:03:00.512Z [INFO] client.fingerprint_mgr.vault: Vault is available
May 12 18:03:06 us02pA-zkp1 nomad[121900]: 2020-05-12T18:03:06.513Z [INFO] client.plugin: starting plugin manager: plugin-type=csi
May 12 18:03:06 us02pA-zkp1 nomad[121900]: 2020-05-12T18:03:06.513Z [INFO] client.plugin: starting plugin manager: plugin-type=driver
May 12 18:03:06 us02pA-zkp1 nomad[121900]: 2020-05-12T18:03:06.513Z [INFO] client.plugin: starting plugin manager: plugin-type=device
May 12 18:03:06 us02pA-zkp1 nomad[121900]: 2020-05-12T18:03:06.529Z [WARN] client.server_mgr: no servers available
May 12 18:03:06 us02pA-zkp1 nomad[121900]: 2020-05-12T18:03:06.529Z [INFO] client.consul: discovered following servers: servers=[*.*.*.*:4647, *.*.*.*:4647, *.*.*.*:4647]
May 12 18:03:06 us02pA-zkp1 nomad[121900]: 2020-05-12T18:03:06.541Z [INFO] client: node registration complete
May 12 18:03:06 us02pA-zkp1 nomad[121900]: 2020-05-12T18:03:06.547Z [INFO] client: started client: node_id=33933e50-8f99-8963-0d1b-00dee71ba6a1
May 12 18:03:06 us02pA-zkp1 nomad[121900]: 2020/05/12 18:03:06.558532 [INFO] (runner) creating new runner (dry: false, once: false)
May 12 18:03:06 us02pA-zkp1 nomad[121900]: 2020/05/12 18:03:06.559114 [INFO] (runner) creating watcher
May 12 18:03:06 us02pA-zkp1 nomad[121900]: 2020/05/12 18:03:06.562814 [INFO] (runner) starting
May 12 18:03:06 us02pA-zkp1 nomad[121900]: 2020/05/12 18:03:06.559576 [INFO] (runner) creating new runner (dry: false, once: false)
May 12 18:03:06 us02pA-zkp1 nomad[121900]: 2020/05/12 18:03:06.564766 [INFO] (runner) creating watcher
May 12 18:03:06 us02pA-zkp1 nomad[121900]: 2020/05/12 18:03:06.559908 [INFO] (runner) creating new runner (dry: false, once: false)
May 12 18:03:06 us02pA-zkp1 nomad[121900]: 2020/05/12 18:03:06.561780 [INFO] (runner) creating new runner (dry: false, once: false)
May 12 18:03:06 us02pA-zkp1 nomad[121900]: 2020/05/12 18:03:06.562629 [INFO] (runner) creating new runner (dry: false, once: false)
May 12 18:03:06 us02pA-zkp1 nomad[121900]: 2020/05/12 18:03:06.564988 [INFO] (runner) starting
May 12 18:03:06 us02pA-zkp1 nomad[121900]: 2020/05/12 18:03:06.565369 [INFO] (runner) creating watcher
May 12 18:03:06 us02pA-zkp1 nomad[121900]: 2020/05/12 18:03:06.565447 [INFO] (runner) starting
May 12 18:03:06 us02pA-zkp1 nomad[121900]: 2020/05/12 18:03:06.566255 [INFO] (runner) creating watcher
May 12 18:03:06 us02pA-zkp1 nomad[121900]: 2020/05/12 18:03:06.566340 [INFO] (runner) starting
May 12 18:03:06 us02pA-zkp1 nomad[121900]: 2020/05/12 18:03:06.570125 [INFO] (runner) creating watcher
May 12 18:03:06 us02pA-zkp1 nomad[121900]: 2020/05/12 18:03:06.570388 [INFO] (runner) starting
May 12 18:03:06 us02pA-zkp1 nomad[121900]: 2020/05/12 18:03:06.607259 [INFO] (runner) creating new runner (dry: false, once: false)
May 12 18:03:06 us02pA-zkp1 nomad[121900]: 2020/05/12 18:03:06.607379 [INFO] (runner) creating watcher
May 12 18:03:06 us02pA-zkp1 nomad[121900]: 2020/05/12 18:03:06.607455 [INFO] (runner) starting
May 12 18:03:15 us02pA-zkp1 nomad[121900]: 2020-05-12T18:03:15.716Z [INFO] client: node registration complete
May 12 20:12:05 us02pA-zkp1 nomad[121900]: 2020/05/12 20:12:05.564336 [WARN] (view) kv.block(core/inventory/zk_server_linux): Unexpected response code: 500 (retry attempt 1 after "250ms")
May 12 20:13:23 us02pA-zkp1 nomad[121900]: 2020/05/12 20:13:23.492408 [WARN] (view) kv.block(core/inventory/zk_server_linux): Unexpected response code: 500 (retry attempt 1 after "250ms")
May 12 20:13:23 us02pA-zkp1 nomad[121900]: 2020/05/12 20:13:23.743344 [WARN] (view) kv.block(core/inventory/zk_server_linux): Get "http://*.*.*.*:8500/v1/kv/core/inventory/zk_server_linux?index=37&stale=&wait=60000ms": dial tcp *.*.*.*:8500: connect: connection refused (retry attempt 2 after "500ms")

Configuration used for Nomad:

acl {
  enabled = true
}

{
  "client": {
    "enabled": true,
    "options": {
      "driver.raw_exec.enable": "1"
    },
    "meta": [{ "nomad_version": "0.11.1", "linux": "true", "zk_server_linux": "true" }]
  }
}
vault {
  enabled = true
  address = "http://vault.service.consul:8200"
}

log_level = "INFO"
data_dir = "/opt/nomad"
log_file = "/opt/nomad/nomad.log"
log_rotate_max_files = 5
log_rotate_bytes = 10000000
disable_update_check = true
datacenter = "us1234"
bind_addr = "0.0.0.0"
advertise {
  http = "*.*.*.*"
  rpc  = "*.*.*.*"
  serf = "*.*.*.*"
}
consul {
  address = "*.*.*.*:8500"
  token   = "7116df5a-921d-e2ab-5e0c-sfsdfsfsfsdf"
}
plugin "docker" {
  config {
    auth {
      config = "/etc/docker/config.json"
    }
  }
}
tgross (Member) commented May 18, 2020

Hi @raunak2004, and thanks for reporting this! So it looks from the systemd logs like this client was restarted? There are a couple of things I'm seeing here.

systemctl status nomad reports the service as active, but none of the Nomad CLI commands return any response. Since Nomad is unresponsive, any service that depends on it also starts acting up; in this case that service is ZooKeeper.

The Nomad CLI should typically be pointed to the servers, not to the clients. Are you talking only about the CLI commands that directly interact with the "stuck" client (like nomad alloc logs, etc.)?
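
For example (illustrative only; 10.102.7.13 is one of the server addresses from the logs above, and 4646 is Nomad's default HTTP port):

# point the CLI at one of the Nomad servers instead of the stuck client
export NOMAD_ADDR=http://10.102.7.13:4646

# cluster-level queries like these are answered by the servers
nomad node status
nomad job status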

The log lines with the longer timestamps are from Nomad's embedded consul-template runner. It looks like it can't reach Consul:

nomad[121900]: 2020/05/12 20:12:05.564336 [WARN] (view) kv.block(core/inventory/zk_server_linux): Unexpected response code: 500 (retry attempt 1 after "250ms")

Is your Consul agent on that host showing any logs that might give us another clue?

Probably unrelated, but from the systemd process tree output, this log line suggests that we have a logmon process that didn't get reparented after the restart (but that shouldn't cause Nomad to become unresponsive at all). It might be helpful if you could share your systemd unit file.

nomad.service: Found left-over process 21694 (nomad) in control group while starting unit. Ignoring.
May 12 18:03:00 us02pA-zkp1 systemd[1]: This usually indicates unclean termination of a previous run, or service implementation deficiencies.

(Also, I hope you don't mind, but I wrapped your logs and config in some tags to make it a bit easier to read.)

@svudutala

@tgross Posting this for @raunak2004.

Here is the systemd file

[Unit]
Description= "HashiCorp Nomad - Workload orchestrator"
Documentation=https://nomadproject.io/docs/
Requires=network-online.target
After=network-online.target
ConditionFileNotEmpty=/etc/nomad.d/nomad.hcl
StartLimitBurst=3
StartLimitIntervalSec=10

[Service]
# Service running as client needs root permissions and client runs everywhere
User=root
Group=ws1applications
ExecStart=/opt/nomad/0.11.1/nomad agent -config /etc/nomad.d
LimitNOFILE=65536
KillMode=process
KillSignal=SIGINT
Restart=on-failure
RestartSec=2
TasksMax=infinity

[Install]
WantedBy=multi-user.target

@raunak2004 (Author)

Is your Consul agent on that host showing any logs that might give us another clue?

No, there are no logs in Consul that would point to the issue; all the Consul servers were healthy when the Nomad client went unhealthy. As a workaround, for now we have restarted the agent.

tgross (Member) commented May 18, 2020

all the Consul servers were healthy when the Nomad client went unhealthy

Ok, but if we saw a 500 error in consul-template I'd expect the local Consul agent to be having an issue. Nothing in the systemd logs for that?

Also, can you describe the failure mode in a little more detail... are you talking only about the CLI commands that directly interact with the "stuck" client (like nomad alloc logs, etc.)?

@idrennanvmware (Contributor)

@tgross I didn't realize @raunak2004 had already opened this issue. Here are more details:

#8085

rkettelerij (Contributor) commented Jun 4, 2020

So we have #8085, #7987, and #8038, which all look pretty similar (at first glance).

@idrennanvmware (Contributor)

We just had this happen again today, this time on a Windows node.

PowerShell shows the service as running; Consul shows the Nomad health check failing.

Restart the service and everything returns to normal.

@idrennanvmware (Contributor)

[screenshot: Screen Shot 2020-06-08 at 9 40 17 AM]

@tgross - here's what we see in the health check on Consul when this scenario occurs:

  • The service on the Linux/Windows node shows a status of "running"
  • The service health check is timing out, as in the screenshot
  • Metrics and logs stop being written and emitted

Remediation is restarting the agent.

shoenig self-assigned this Jun 8, 2020
shoenig (Member) commented Jun 9, 2020

Next time this happens (to anyone), it would be super helpful to get a stack trace from while the agent is unresponsive.

This can be done by issuing SIGABRT to the Nomad agent, which will then dump the trace into stderr/stdout (or journald logs, etc.).

e.g. pkill -SIGABRT nomad

https://golang.org/pkg/os/signal/#hdr-Default_behavior_of_signals_in_Go_programs
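
On a systemd-managed node, a rough sequence for capturing the dump might look like this (just a sketch, assuming the unit is named nomad as in the logs above):

# send SIGABRT; the Go runtime prints all goroutine stacks and the process exits
sudo pkill -SIGABRT nomad

# the stacks end up in the journal; save them to a file to attach here
sudo journalctl -u nomad --since "10 minutes ago" > nomad-stacktrace.txt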

idrennanvmware (Contributor) commented Jun 9, 2020

@shoenig

So I just checked all our clusters, dev and prod, and none are currently exhibiting this behavior. As soon as this happens again we will execute the command and paste the results here. I'd be surprised if we don't see this in the next few days; it seems to happen about once a week or so.

Thanks!

idrennanvmware (Contributor) commented Jun 9, 2020

Had it happen. Logs attached :)
hashilogs.txt

@shoenig

You can see the last log at 14:19:30 today, then a long period where things looked good (but they weren't), and then around 17:30 we saw the Nomad client health check fail with the timeout.

The command you sent restarted the agent and restored functionality.

@svudutala

Adding screenshots from Consul and Nomad, along with the logs @idrennanvmware attached above.

[screenshots: consul-hc failure, nomad-allocation-failure]

@pznamensky

In case a log file with the trace for the server process could be helpful, I've attached it here: #8038 (comment)

svudutala commented Jun 10, 2020

@shoenig Is there any specific Nomad telemetry metric that you think would help if we monitored it?

Along the same lines, we are planning to add an alert specifically on the Nomad client HTTP check failing in Consul, to catch this the next time it happens.
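
The check we have in mind is roughly the following (just a sketch registered with the Consul agent that runs alongside the Nomad client; the id, name, and timings are placeholders, and it polls Nomad's own agent health endpoint):

{
  "check": {
    "id": "nomad-client-agent-health",
    "name": "Nomad client agent health",
    "http": "http://localhost:4646/v1/agent/health?type=client",
    "interval": "10s",
    "timeout": "5s"
  }
}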

shoenig (Member) commented Jun 15, 2020

Thanks for providing stack traces, @idrennanvmware, @pznamensky, @kung-foo.

Interestingly so far I believe none of them include a goroutine in a running state, which is odd to reconcile with the common symptom of 1 core being used at 100%.

I was looking into golang/go#38933, but that seems unlikely as none of the stack traces reference startTemplateThread.

One thing I'm poking at currently is that netpoll has changed substantially between go1.12 and go1.14, which correlates with our Nomad v0.10 -> v0.11 upgrade cycle, when symptoms began. I'm curious about this because towards the end of each trace is something like,

Jun 09 22:07:22 perflab-us01prfcpclu3-zkp2 nomad[125585]: goroutine 69909 [IO wait]:
Jun 09 22:07:22 perflab-us01prfcpclu3-zkp2 nomad[125585]: internal/poll.runtime_pollWait(0x7f2c51e96c48, 0x72, 0xffffffffffffffff)
Jun 09 22:07:22 perflab-us01prfcpclu3-zkp2 nomad[125585]:         runtime/netpoll.go:203 +0x55
Jun 09 22:07:22 perflab-us01prfcpclu3-zkp2 nomad[125585]: internal/poll.(*pollDesc).wait(0xc000ed4d98, 0x72, 0x1000, 0x1000, 0xffffffffffffffff)
Jun 09 22:07:22 perflab-us01prfcpclu3-zkp2 nomad[125585]:         internal/poll/fd_poll_runtime.go:87 +0x45
Jun 09 22:07:22 perflab-us01prfcpclu3-zkp2 nomad[125585]: internal/poll.(*pollDesc).waitRead(...)
Jun 09 22:07:22 perflab-us01prfcpclu3-zkp2 nomad[125585]:         internal/poll/fd_poll_runtime.go:92
Jun 09 22:07:22 perflab-us01prfcpclu3-zkp2 nomad[125585]: internal/poll.(*FD).Read(0xc000ed4d80, 0xc000df7000, 0x1000, 0x1000, 0x0, 0x0, 0x0)
Jun 09 22:07:22 perflab-us01prfcpclu3-zkp2 nomad[125585]:         internal/poll/fd_unix.go:169 +0x19b
Jun 09 22:07:22 perflab-us01prfcpclu3-zkp2 nomad[125585]: net.(*netFD).Read(0xc000ed4d80, 0xc000df7000, 0x1000, 0x1000, 0xc, 0x0, 0x0)
Jun 09 22:07:22 perflab-us01prfcpclu3-zkp2 nomad[125585]:         net/fd_unix.go:202 +0x4f
Jun 09 22:07:22 perflab-us01prfcpclu3-zkp2 nomad[125585]: net.(*conn).Read(0xc0000d53e8, 0xc000df7000, 0x1000, 0x1000, 0x0, 0x0, 0x0)
Jun 09 22:07:22 perflab-us01prfcpclu3-zkp2 nomad[125585]:         net/net.go:184 +0x8e
Jun 09 22:07:22 perflab-us01prfcpclu3-zkp2 nomad[125585]: bufio.(*Reader).Read(0xc0008fd9e0, 0xc0010a14f0, 0xc, 0xc, 0x3184500, 0x15, 0xc000533ad0)
Jun 09 22:07:22 perflab-us01prfcpclu3-zkp2 nomad[125585]:         bufio/bufio.go:226 +0x24f
Jun 09 22:07:22 perflab-us01prfcpclu3-zkp2 nomad[125585]: io.ReadAtLeast(0x37b4d20, 0xc0008fd9e0, 0xc0010a14f0, 0xc, 0xc, 0xc, 0xc, 0x0, 0x0)
Jun 09 22:07:22 perflab-us01prfcpclu3-zkp2 nomad[125585]:         io/io.go:310 +0x87
Jun 09 22:07:22 perflab-us01prfcpclu3-zkp2 nomad[125585]: io.ReadFull(...)
Jun 09 22:07:22 perflab-us01prfcpclu3-zkp2 nomad[125585]:         io/io.go:329
Jun 09 22:07:22 perflab-us01prfcpclu3-zkp2 nomad[125585]: github.com/hashicorp/nomad/vendor/github.com/hashicorp/yamux.(*Session).recvLoop(0xc000533ad0, 0x0, 0x0)
Jun 09 22:07:22 perflab-us01prfcpclu3-zkp2 nomad[125585]:         github.com/hashicorp/nomad/vendor/github.com/hashicorp/yamux/session.go:458 +0xe3
Jun 09 22:07:22 perflab-us01prfcpclu3-zkp2 nomad[125585]: github.com/hashicorp/nomad/vendor/github.com/hashicorp/yamux.(*Session).recv(0xc000533ad0)
Jun 09 22:07:22 perflab-us01prfcpclu3-zkp2 nomad[125585]:         github.com/hashicorp/nomad/vendor/github.com/hashicorp/yamux/session.go:437 +0x2b
Jun 09 22:07:22 perflab-us01prfcpclu3-zkp2 nomad[125585]: created by github.com/hashicorp/nomad/vendor/github.com/hashicorp/yamux.newSession
Jun 09 22:07:22 perflab-us01prfcpclu3-zkp2 nomad[125585]:         github.com/hashicorp/nomad/vendor/github.com/hashicorp/yamux/session.go:113 +0x2a5
Jun 15 09:13:09  nomad[35920]: internal/poll.runtime_pollWait(0x7face04749a8, 0x72, 0xffffffffffffffff)
Jun 15 09:13:09  nomad[35920]:         runtime/netpoll.go:203 +0x55
Jun 15 09:13:09  nomad[35920]: internal/poll.(*pollDesc).wait(0xc00056e198, 0x72, 0x1000, 0x1000, 0xffffffffffffffff)
Jun 15 09:13:09  nomad[35920]:         internal/poll/fd_poll_runtime.go:87 +0x45
Jun 15 09:13:09  nomad[35920]: internal/poll.(*pollDesc).waitRead(...)
Jun 15 09:13:09  nomad[35920]:         internal/poll/fd_poll_runtime.go:92
Jun 15 09:13:09  nomad[35920]: internal/poll.(*FD).Read(0xc00056e180, 0xc0006d2000, 0x1000, 0x1000, 0x0, 0x0, 0x0)
Jun 15 09:13:09  nomad[35920]:         internal/poll/fd_unix.go:169 +0x19b
Jun 15 09:13:09  nomad[35920]: net.(*netFD).Read(0xc00056e180, 0xc0006d2000, 0x1000, 0x1000, 0x0, 0xa000000000, 0x0)
Jun 15 09:13:09  nomad[35920]:         net/fd_unix.go:202 +0x4f
Jun 15 09:13:09  nomad[35920]: net.(*conn).Read(0xc001170158, 0xc0006d2000, 0x1000, 0x1000, 0x0, 0x0, 0x0)
Jun 15 09:13:09  nomad[35920]:         net/net.go:184 +0x8e
Jun 15 09:13:09  nomad[35920]: bufio.(*Reader).Read(0xc000bf5080, 0xc0008de4e0, 0xc, 0xc, 0xc000256e20, 0x0, 0x0)
Jun 15 09:13:09  nomad[35920]:         bufio/bufio.go:226 +0x24f
Jun 15 09:13:09  nomad[35920]: io.ReadAtLeast(0x37b4d20, 0xc000bf5080, 0xc0008de4e0, 0xc, 0xc, 0xc, 0xc, 0x0, 0x0)
Jun 15 09:13:09  nomad[35920]:         io/io.go:310 +0x87
Jun 15 09:13:09  nomad[35920]: io.ReadFull(...)
Jun 15 09:13:09  nomad[35920]:         io/io.go:329
Jun 15 09:13:09  nomad[35920]: github.com/hashicorp/nomad/vendor/github.com/hashicorp/yamux.(*Session).recvLoop(0xc000256dc0, 0x0, 0x0)
Jun 15 09:13:09  nomad[35920]:         github.com/hashicorp/nomad/vendor/github.com/hashicorp/yamux/session.go:458 +0xe3
Jun 15 09:13:09  nomad[35920]: github.com/hashicorp/nomad/vendor/github.com/hashicorp/yamux.(*Session).recv(0xc000256dc0)
Jun 15 09:13:09  nomad[35920]:         github.com/hashicorp/nomad/vendor/github.com/hashicorp/yamux/session.go:437 +0x2b
Jun 15 09:13:09  nomad[35920]: created by github.com/hashicorp/nomad/vendor/github.com/hashicorp/yamux.newSession
Jun 15 09:13:09  nomad[35920]:         github.com/hashicorp/nomad/vendor/github.com/hashicorp/yamux/session.go:113 +0x2a5
Jun 10 09:03:54 srv1 nomad: internal/poll.runtime_pollWait(0x7f23c80df4e8, 0x72, 0xffffffffffffffff)
Jun 10 09:03:54 srv1 nomad: runtime/netpoll.go:203 +0x55
Jun 10 09:03:54 srv1 nomad: internal/poll.(*pollDesc).wait(0xc003902298, 0x72, 0x200, 0x205, 0xffffffffffffffff)
Jun 10 09:03:54 srv1 nomad: internal/poll/fd_poll_runtime.go:87 +0x45
Jun 10 09:03:54 srv1 nomad: internal/poll.(*pollDesc).waitRead(...)
Jun 10 09:03:54 srv1 nomad: internal/poll/fd_poll_runtime.go:92
Jun 10 09:03:54 srv1 nomad: internal/poll.(*FD).Read(0xc003902280, 0xc000edd8c0, 0x205, 0x205, 0x0, 0x0, 0x0)
Jun 10 09:03:54 srv1 nomad: internal/poll/fd_unix.go:169 +0x19b

@idrennanvmware (Contributor)

@shoenig - in our case we don't see the CPU spike that others have reported. The agent just becomes completely unresponsive. Almost like it's deadlocked.

kung-foo commented Jun 17, 2020

@shoenig I have a Nomad server currently hung (100% CPU, non-responsive). Anything you want from the system before I restart it?

jrasell (Member) commented Jun 17, 2020

Next time this happens (to anyone), it would be super helpful to get a stack trace from while the agent is unresponsive.

This can be done by issuing SIGABRT to the Nomad agent, which will then dump the trace into stderr/stdout (or journald logs, etc.).

e.g. pkill -SIGABRT nomad

https://golang.org/pkg/os/signal/#hdr-Default_behavior_of_signals_in_Go_programs

@kung-foo please see this previous comment; this information would be extremely useful. Thanks!

@kung-foo

@jorgemarey (Contributor)

@kung-foo that seems to match what's described in golang/go#38051 (comment)

@kung-foo

@jorgemarey agreed, especially given some of the output here: #8163 (comment)

rkettelerij (Contributor) commented Jun 17, 2020

See #8163 (comment) for yet another stack trace of a stuck Nomad server running at 100% CPU (Nomad 0.11.1).

rkettelerij (Contributor) commented Jun 18, 2020

See #8163 (comment) for a stack trace of a Nomad client (Nomad 0.11.1). We're going to upgrade to 0.11.3 to determine whether the Go update has any effect.

shoenig (Member) commented Jun 23, 2020

At this point we are documenting the necessity for upgrading to Nomad v0.11.3+ to get past the Go runtime bug.

I'll leave this issue open for a little while longer - it would be great to hear from those who have upgraded to Nomad v0.11.3 and whether the hangs have been resolved.
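
For anyone verifying the rollout, the standard CLI commands are a quick way to confirm which version each agent is actually running (nothing specific to this issue):

# version of the local agent/CLI binary on a node
nomad version

# versions reported by the servers (see the Build column)
nomad server members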

I want to give a big THANK YOU to everyone who helped collect information and track this down. Y'all are awesome!

@idrennanvmware (Contributor)

@shoenig at this stage we have upgraded all our environments to 0.11.3; if we experience the issues described here, we will report back.

Thanks for the update

@rkettelerij (Contributor)

We've been running 0.11.3 for 7 days now without any issues.

@rkettelerij (Contributor)

Still no issues after 14 days of running 0.11.3. IMHO it's safe to say this issue is resolved.

@idrennanvmware (Contributor)

Agreed - we will have @raunak2004 close out the issue. Thanks everyone for all the thoughts and feedback cycles!

@raunak2004 (Author)

Closing the issue since we haven't seen it with 0.11.3. Thanks everyone.

github-actions bot commented Nov 5, 2022

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

github-actions bot locked as resolved and limited conversation to collaborators Nov 5, 2022