
Nomad Client becomes unresponsive after time #8085

Closed
idrennanvmware opened this issue Jun 1, 2020 · 5 comments
idrennanvmware (Contributor) commented Jun 1, 2020

Nomad version

Nomad v0.11.1 (b434570)

Operating system and Environment details

Issue

After some time, Nomad clients can become unresponsive. We have a cluster that clients join and leave with regularity, and after a while a client can become unresponsive even though the nomad agent process is still running.

At this time the following is observed:

  1. systemctl status nomad shows the service as running
    systemctl status nomad
    ● nomad.service - "HashiCorp Nomad - Workload orchestrator"
    Loaded: loaded (/etc/systemd/system/nomad.service; enabled; vendor preset: enabled)
    Active: active (running) since Tue 2020-05-12 17:43:33 UTC; 2 weeks 5 days ago
    Docs: https://nomadproject.io/docs/
    Main PID: 122615 (nomad)
    Tasks: 105
    Memory: 379.1M
    CGroup: /system.slice/nomad.service
    ├─ 50060 /opt/nomad/0.11.1/nomad logmon
    ├─ 50061 /opt/nomad/0.11.1/nomad logmon
    ├─ 50177 /opt/nomad/0.11.1/nomad docker_logger
    ├─ 50212 /opt/nomad/0.11.1/nomad docker_logger
    ├─ 50277 /opt/nomad/0.11.1/nomad logmon
    ├─ 50337 /opt/nomad/0.11.1/nomad docker_logger
    ├─107085 /opt/nomad/0.11.1/nomad logmon
    ├─107086 /opt/nomad/0.11.1/nomad logmon
    ├─107197 /opt/nomad/0.11.1/nomad docker_logger
    ├─107230 /opt/nomad/0.11.1/nomad docker_logger
    └─122615 /opt/nomad/0.11.1/nomad agent -config /etc/nomad.d

  2. The health check on the node is failing for the nomad client (a probe along these lines is sketched after this list)
    Get http://0.0.0.0:4646/v1/agent/health?type=client: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)

  3. The nomad agent has not written any logs for days; the last few entries are:
    2020/05/24 20:30:01.028143 [INFO] (runner) creating new runner (dry: false, once: false)
    2020/05/24 20:30:01.029012 [INFO] (runner) creating watcher
    2020/05/24 20:30:01.032749 [INFO] (runner) starting
    2020/05/24 20:30:01.152990 [INFO] (runner) rendered "(dynamic)" => "/opt/nomad/alloc/51241bf4-c37b-eced-2c43-308cd03c4808/run-consul-backup/secrets/file.env"
    2020-05-24T20:30:01.156Z [INFO] client.driver_mgr.raw_exec: starting task: driver=raw_exec driver_cfg="{Command:/services/consul/assets/scripts/backup.sh Args:[]}"
    2020-05-24T20:30:16.655Z [INFO] client.alloc_runner.task_runner: not restarting task: alloc_id=51241bf4-c37b-eced-2c43-308cd03c4808 task=run-consul-backup reason="Restart unnecessary as task terminated successfully"
    2020-05-24T20:30:16.658Z [INFO] client.gc: marking allocation for GC: alloc_id=51241bf4-c37b-eced-2c43-308cd03c4808
    2020/05/24 20:30:16.661027 [INFO] (runner) stopping
    2020/05/24 20:30:16.661518 [INFO] (runner) received finish
    2020-05-24T20:51:14.883Z [INFO] client.gc: garbage collecting allocation: alloc_id=914fce0e-0996-4e48-aa44-0c87ca4434ce reason="forced collection"
    2020-05-24T21:00:00.444Z [INFO] client.alloc_runner.task_runner.task_hook.logmon.nomad: opening fifo: alloc_id=1523fc28-b77b-9043-2857-99a3e7f9c021 task=run-consul-backup @module=logmon path=/opt/nomad/alloc/1523fc28-b77b-9043-2857-99a3e7f9c021/alloc/logs/.run-consul-backup.stdout.fifo timestamp=2020-05-24T21:00:00.444Z
    2020-05-24T21:00:00.447Z [INFO] client.alloc_runner.task_runner.task_hook.logmon.nomad: opening fifo: alloc_id=1523fc28-b77b-9043-2857-99a3e7f9c021 task=run-consul-backup @module=logmon path=/opt/nomad/alloc/1523fc28-b77b-9043-2857-99a3e7f9c021/alloc/logs/.run-consul-backup.stderr.fifo timestamp=2020-05-24T21:00:00.447Z
    2020/05/24 21:00:01.141007 [INFO] (runner) creating new runner (dry: false, once: false)
    2020/05/24 21:00:01.142009 [INFO] (runner) creating watcher
    2020/05/24 21:00:01.143019 [INFO] (runner) starting
    2020/05/24 21:00:01.248536 [INFO] (runner) rendered "(dynamic)" => "/opt/nomad/alloc/1523fc28-b77b-9043-2857-99a3e7f9c021/run-consul-backup/secrets/file.env"
    2020-05-24T21:00:01.253Z [INFO] client.driver_mgr.raw_exec: starting task: driver=raw_exec driver_cfg="{Command:/services/consul/assets/scripts/backup.sh Args:[]}"
    2020-05-24T21:00:16.806Z [ERROR] client.driver_mgr.raw_exec: error receiving stream from Stats executor RPC, closing stream: alloc_id=1523fc28-b77b-9043-2857-99a3e7f9c021 driver=raw_exec task_name=run-consul-backup error="rpc error: code = Unavailable desc = transport is closing"
    2020-05-24T21:00:16.806Z [ERROR] client.alloc_runner.task_runner.task_hook.stats_hook: failed to start stats collection for task: alloc_id=1523fc28-b77b-9043-2857-99a3e7f9c021 task=run-consul-backup error="rpc error: code = Unavailable desc = connection error: desc = "transport: error while dialing: dial unix /tmp/plugin525470496: connect: no such file or directory""
    2020-05-24T21:00:16.807Z [INFO] client.alloc_runner.task_runner: not restarting task: alloc_id=1523fc28-b77b-9043-2857-99a3e7f9c021 task=run-consul-backup reason="Restart unnecessary as task terminated successfully"
    2020-05-24T21:00:16.809Z [INFO] client.gc: marking allocation for GC: alloc_id=1523fc28-b77b-9043-2857-99a3e7f9c021
    2020/05/24 21:00:16.812760 [INFO] (runner) stopping
    2020/05/24 21:00:16.812991 [INFO] (runner) received finish
    2020-05-24T21:21:14.885Z [INFO] client.gc: garbage collecting allocation: alloc_id=a4870502-4cd3-4404-c977-c2b18fd2f31b reason="forced collection"
    2020-05-24T21:30:00.409Z [INFO] client.alloc_runner.task_runner.task_hook.logmon.nomad: opening fifo: alloc_id=2f327d5b-3f89-ace6-3463-71acd31241aa task=run-consul-backup @module=logmon path=/opt/nomad/alloc/2f327d5b-3f89-ace6-3463-71acd31241aa/alloc/logs/.run-consul-backup.stdout.fifo timestamp=2020-05-24T21:30:00.409Z
    2020-05-24T21:30:00.411Z [INFO] client.alloc_runner.task_runner.task_hook.logmon.nomad: opening fifo: alloc_id=2f327d5b-3f89-ace6-3463-71acd31241aa task=run-consul-backup path=/opt/nomad/alloc/2f327d5b-3f89-ace6-3463-71acd31241aa/alloc/logs/.run-consul-backup.stderr.fifo @module=logmon timestamp=2020-05-24T21:30:00.410Z
    2020/05/24 21:30:01.014612 [INFO] (runner) creating new runner (dry: false, once: false)
    2020/05/24 21:30:01.015174 [INFO] (runner) creating watcher
    2020/05/24 21:30:01.015990 [INFO] (runner) starting
    2020/05/24 21:30:01.148086 [INFO] (runner) rendered "(dynamic)" => "/opt/nomad/alloc/2f327d5b-3f89-ace6-3463-71acd31241aa/run-consul-backup/secrets/file.env"
    2020-05-24T21:30:01.152Z [INFO] client.driver_mgr.raw_exec: starting task: driver=raw_exec driver_cfg="{Command:/services/consul/assets/scripts/backup.sh Args:[]}"
    2020-05-24T21:30:16.795Z [INFO] client.alloc_runner.task_runner: not restarting task: alloc_id=2f327d5b-3f89-ace6-3463-71acd31241aa task=run-consul-backup reason="Restart unnecessary as task terminated successfully"
    2020-05-24T21:30:16.797Z [INFO] client.gc: marking allocation for GC: alloc_id=2f327d5b-3f89-ace6-3463-71acd31241aa
    2020/05/24 21:30:16.800719 [INFO] (runner) stopping
    2020/05/24 21:30:16.801398 [INFO] (runner) received finish
    2020-05-24T21:36:14.880Z [INFO] client.gc: garbage collecting allocation: alloc_id=70a4b1f4-0d38-5eeb-05c8-85e379ad047e reason="forced collection"

  4. Allocations appear to still be running on the node. Previously allocated tasks show as active and working, and hitting one of them (a fabio allocation) gets a response.

  5. Consul and other services are functioning as expected

  6. systemctl restart nomad also hangs indefinitely; the process has to be killed. After killing the process, the Nomad client functions as expected again.
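
For reference, the failing check in observation 2 can be reproduced with a small standalone probe against the client health endpoint. The Go sketch below is not from our tooling; the address 127.0.0.1:4646 and the 5-second timeout are assumptions, but the failure mode on a wedged agent (a client-side timeout rather than an HTTP error) matches what we see above.

```go
// health_probe.go - a minimal sketch of a probe against the Nomad client
// health endpoint. The address and timeout are assumptions for illustration.
package main

import (
	"fmt"
	"net/http"
	"os"
	"time"
)

func main() {
	// Hard timeout so a wedged agent surfaces as an error instead of a hang.
	client := &http.Client{Timeout: 5 * time.Second}

	resp, err := client.Get("http://127.0.0.1:4646/v1/agent/health?type=client")
	if err != nil {
		// A deadlocked client typically shows up here as
		// "Client.Timeout exceeded while awaiting headers".
		fmt.Fprintf(os.Stderr, "nomad client health check failed: %v\n", err)
		os.Exit(1)
	}
	defer resp.Body.Close()

	if resp.StatusCode != http.StatusOK {
		fmt.Fprintf(os.Stderr, "nomad client unhealthy: HTTP %d\n", resp.StatusCode)
		os.Exit(1)
	}
	fmt.Println("nomad client healthy")
}
```

On an affected node we would expect this to fail with the same timeout error shown in observation 2, while a healthy node returns HTTP 200.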

Thoughts

Seems like this may be a deadlock. We're wondering whether the way we call GC fairly regularly could be triggering it; we see this happen regularly, and simply restarting the nomad agent fixes it.
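
For context on the GC theory, below is a minimal sketch of the kind of periodic forced-GC caller we have in mind, assuming the collection is triggered through Nomad's client GC endpoint (PUT /v1/client/gc), which as far as we can tell is what produces the reason="forced collection" entries in the log excerpt. The address, interval, and error handling are illustrative only, not a reproduction of our actual setup.

```go
// gc_trigger.go - an assumed sketch of a periodic forced client GC caller.
// The address and 30-minute interval are made up for illustration.
package main

import (
	"fmt"
	"net/http"
	"os"
	"time"
)

func main() {
	client := &http.Client{Timeout: 10 * time.Second}

	// Force allocation GC on the local client on a fixed interval; this is what
	// would show up as reason="forced collection" in the client logs.
	for range time.Tick(30 * time.Minute) {
		req, err := http.NewRequest(http.MethodPut, "http://127.0.0.1:4646/v1/client/gc", nil)
		if err != nil {
			fmt.Fprintf(os.Stderr, "building GC request failed: %v\n", err)
			continue
		}
		resp, err := client.Do(req)
		if err != nil {
			fmt.Fprintf(os.Stderr, "forced GC request failed: %v\n", err)
			continue
		}
		resp.Body.Close()
		fmt.Println("forced client GC:", resp.Status)
	}
}
```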

@pznamensky

Looks like the same issue as #8038

@idrennanvmware (Contributor, Author)

Just happened again. Notes on #7987

tgross (Member) commented Jun 22, 2020

@shoenig should we close this one as well given #8038?

shoenig (Member) commented Jun 23, 2020

I'm gonna close this out in favor of #7987, @idrennanvmware. Definitely let us know if there are still problems after upgrading to Nomad v0.11.3+!

shoenig closed this as completed Jun 23, 2020
github-actions bot commented Nov 6, 2022

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

github-actions bot locked as resolved and limited conversation to collaborators Nov 6, 2022