
Nomad Client becomes unresponsive after time #8085

Closed
idrennanvmware opened this issue Jun 1, 2020 · 5 comments
idrennanvmware (Contributor) commented Jun 1, 2020

Nomad version

Nomad v0.11.1 (b434570)

Operating system and Environment details

Issue

After some time, Nomad clients can become unresponsive. We have a cluster that clients join and leave with regularity, and after a while a client can become unresponsive even though the nomad agent process is still running.

At this time the following is observed:

  1. systemctl status nomad shows the service as running
    systemctl status nomad
    ● nomad.service - "HashiCorp Nomad - Workload orchestrator"
    Loaded: loaded (/etc/systemd/system/nomad.service; enabled; vendor preset: enabled)
    Active: active (running) since Tue 2020-05-12 17:43:33 UTC; 2 weeks 5 days ago
    Docs: https://nomadproject.io/docs/
    Main PID: 122615 (nomad)
    Tasks: 105
    Memory: 379.1M
    CGroup: /system.slice/nomad.service
    ├─ 50060 /opt/nomad/0.11.1/nomad logmon
    ├─ 50061 /opt/nomad/0.11.1/nomad logmon
    ├─ 50177 /opt/nomad/0.11.1/nomad docker_logger
    ├─ 50212 /opt/nomad/0.11.1/nomad docker_logger
    ├─ 50277 /opt/nomad/0.11.1/nomad logmon
    ├─ 50337 /opt/nomad/0.11.1/nomad docker_logger
    ├─107085 /opt/nomad/0.11.1/nomad logmon
    ├─107086 /opt/nomad/0.11.1/nomad logmon
    ├─107197 /opt/nomad/0.11.1/nomad docker_logger
    ├─107230 /opt/nomad/0.11.1/nomad docker_logger
    └─122615 /opt/nomad/0.11.1/nomad agent -config /etc/nomad.d

  2. The health check on the node is failing for the nomad client (a probe along these lines is sketched after this list)
    Get http://0.0.0.0:4646/v1/agent/health?type=client: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)

  3. The nomad agent has not written any logs for days; the last few entries are:
    2020/05/24 20:30:01.028143 [INFO] (runner) creating new runner (dry: false, once: false)
    2020/05/24 20:30:01.029012 [INFO] (runner) creating watcher
    2020/05/24 20:30:01.032749 [INFO] (runner) starting
    2020/05/24 20:30:01.152990 [INFO] (runner) rendered "(dynamic)" => "/opt/nomad/alloc/51241bf4-c37b-eced-2c43-308cd03c4808/run-consul-backup/secrets/file.env"
    2020-05-24T20:30:01.156Z [INFO] client.driver_mgr.raw_exec: starting task: driver=raw_exec driver_cfg="{Command:/services/consul/assets/scripts/backup.sh Args:[]}"
    2020-05-24T20:30:16.655Z [INFO] client.alloc_runner.task_runner: not restarting task: alloc_id=51241bf4-c37b-eced-2c43-308cd03c4808 task=run-consul-backup reason="Restart unnecessary as task terminated successfully"
    2020-05-24T20:30:16.658Z [INFO] client.gc: marking allocation for GC: alloc_id=51241bf4-c37b-eced-2c43-308cd03c4808
    2020/05/24 20:30:16.661027 [INFO] (runner) stopping
    2020/05/24 20:30:16.661518 [INFO] (runner) received finish
    2020-05-24T20:51:14.883Z [INFO] client.gc: garbage collecting allocation: alloc_id=914fce0e-0996-4e48-aa44-0c87ca4434ce reason="forced collection"
    2020-05-24T21:00:00.444Z [INFO] client.alloc_runner.task_runner.task_hook.logmon.nomad: opening fifo: alloc_id=1523fc28-b77b-9043-2857-99a3e7f9c021 task=run-consul-backup @module=logmon path=/opt/nomad/alloc/1523fc28-b77b-9043-2857-99a3e7f9c021/alloc/logs/.run-consul-backup.stdout.fifo timestamp=2020-05-24T21:00:00.444Z
    2020-05-24T21:00:00.447Z [INFO] client.alloc_runner.task_runner.task_hook.logmon.nomad: opening fifo: alloc_id=1523fc28-b77b-9043-2857-99a3e7f9c021 task=run-consul-backup @module=logmon path=/opt/nomad/alloc/1523fc28-b77b-9043-2857-99a3e7f9c021/alloc/logs/.run-consul-backup.stderr.fifo timestamp=2020-05-24T21:00:00.447Z
    2020/05/24 21:00:01.141007 [INFO] (runner) creating new runner (dry: false, once: false)
    2020/05/24 21:00:01.142009 [INFO] (runner) creating watcher
    2020/05/24 21:00:01.143019 [INFO] (runner) starting
    2020/05/24 21:00:01.248536 [INFO] (runner) rendered "(dynamic)" => "/opt/nomad/alloc/1523fc28-b77b-9043-2857-99a3e7f9c021/run-consul-backup/secrets/file.env"
    2020-05-24T21:00:01.253Z [INFO] client.driver_mgr.raw_exec: starting task: driver=raw_exec driver_cfg="{Command:/services/consul/assets/scripts/backup.sh Args:[]}"
    2020-05-24T21:00:16.806Z [ERROR] client.driver_mgr.raw_exec: error receiving stream from Stats executor RPC, closing stream: alloc_id=1523fc28-b77b-9043-2857-99a3e7f9c021 driver=raw_exec task_name=run-consul-backup error="rpc error: code = Unavailable desc = transport is closing"
    2020-05-24T21:00:16.806Z [ERROR] client.alloc_runner.task_runner.task_hook.stats_hook: failed to start stats collection for task: alloc_id=1523fc28-b77b-9043-2857-99a3e7f9c021 task=run-consul-backup error="rpc error: code = Unavailable desc = connection error: desc = "transport: error while dialing: dial unix /tmp/plugin525470496: connect: no such file or directory""
    2020-05-24T21:00:16.807Z [INFO] client.alloc_runner.task_runner: not restarting task: alloc_id=1523fc28-b77b-9043-2857-99a3e7f9c021 task=run-consul-backup reason="Restart unnecessary as task terminated successfully"
    2020-05-24T21:00:16.809Z [INFO] client.gc: marking allocation for GC: alloc_id=1523fc28-b77b-9043-2857-99a3e7f9c021
    2020/05/24 21:00:16.812760 [INFO] (runner) stopping
    2020/05/24 21:00:16.812991 [INFO] (runner) received finish
    2020-05-24T21:21:14.885Z [INFO] client.gc: garbage collecting allocation: alloc_id=a4870502-4cd3-4404-c977-c2b18fd2f31b reason="forced collection"
    2020-05-24T21:30:00.409Z [INFO] client.alloc_runner.task_runner.task_hook.logmon.nomad: opening fifo: alloc_id=2f327d5b-3f89-ace6-3463-71acd31241aa task=run-consul-backup @module=logmon path=/opt/nomad/alloc/2f327d5b-3f89-ace6-3463-71acd31241aa/alloc/logs/.run-consul-backup.stdout.fifo timestamp=2020-05-24T21:30:00.409Z
    2020-05-24T21:30:00.411Z [INFO] client.alloc_runner.task_runner.task_hook.logmon.nomad: opening fifo: alloc_id=2f327d5b-3f89-ace6-3463-71acd31241aa task=run-consul-backup path=/opt/nomad/alloc/2f327d5b-3f89-ace6-3463-71acd31241aa/alloc/logs/.run-consul-backup.stderr.fifo @module=logmon timestamp=2020-05-24T21:30:00.410Z
    2020/05/24 21:30:01.014612 [INFO] (runner) creating new runner (dry: false, once: false)
    2020/05/24 21:30:01.015174 [INFO] (runner) creating watcher
    2020/05/24 21:30:01.015990 [INFO] (runner) starting
    2020/05/24 21:30:01.148086 [INFO] (runner) rendered "(dynamic)" => "/opt/nomad/alloc/2f327d5b-3f89-ace6-3463-71acd31241aa/run-consul-backup/secrets/file.env"
    2020-05-24T21:30:01.152Z [INFO] client.driver_mgr.raw_exec: starting task: driver=raw_exec driver_cfg="{Command:/services/consul/assets/scripts/backup.sh Args:[]}"
    2020-05-24T21:30:16.795Z [INFO] client.alloc_runner.task_runner: not restarting task: alloc_id=2f327d5b-3f89-ace6-3463-71acd31241aa task=run-consul-backup reason="Restart unnecessary as task terminated successfully"
    2020-05-24T21:30:16.797Z [INFO] client.gc: marking allocation for GC: alloc_id=2f327d5b-3f89-ace6-3463-71acd31241aa
    2020/05/24 21:30:16.800719 [INFO] (runner) stopping
    2020/05/24 21:30:16.801398 [INFO] (runner) received finish
    2020-05-24T21:36:14.880Z [INFO] client.gc: garbage collecting allocation: alloc_id=70a4b1f4-0d38-5eeb-05c8-85e379ad047e reason="forced collection"

  4. Allocations appear to still be running on the node. Previously allocated tasks show as active and working, and hitting one of them (a fabio allocation) gets a response.

  5. Consul and other services are functioning as expected

  6. systemctl restart nomad also hangs indefinitely; the process has to be killed. After killing the process, the Nomad client functions as expected again.
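
For reference, the failing check in observation 2 can be reproduced with a small standalone probe against the client health endpoint. The Go sketch below is not from our tooling; the address 127.0.0.1:4646 and the 5-second timeout are assumptions, but the failure mode on a wedged agent (a client-side timeout rather than an HTTP error) matches what we see above.

```go
// health_probe.go - a minimal sketch of a probe against the Nomad client
// health endpoint. The address and timeout are assumptions for illustration.
package main

import (
	"fmt"
	"net/http"
	"os"
	"time"
)

func main() {
	// Hard timeout so a wedged agent surfaces as an error instead of a hang.
	client := &http.Client{Timeout: 5 * time.Second}

	resp, err := client.Get("http://127.0.0.1:4646/v1/agent/health?type=client")
	if err != nil {
		// A deadlocked client typically shows up here as
		// "Client.Timeout exceeded while awaiting headers".
		fmt.Fprintf(os.Stderr, "nomad client health check failed: %v\n", err)
		os.Exit(1)
	}
	defer resp.Body.Close()

	if resp.StatusCode != http.StatusOK {
		fmt.Fprintf(os.Stderr, "nomad client unhealthy: HTTP %d\n", resp.StatusCode)
		os.Exit(1)
	}
	fmt.Println("nomad client healthy")
}
```

On an affected node we would expect this to fail with the same timeout error shown in observation 2, while a healthy node returns HTTP 200.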

Thoughts

Seems like this may be a deadlock. We're wondering whether the way we call GC fairly regularly could be triggering it; we see this happen regularly, and simply restarting the nomad agent fixes it.
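
For context on the GC theory, below is a minimal sketch of the kind of periodic forced-GC caller we have in mind, assuming the collection is triggered through Nomad's client GC endpoint (PUT /v1/client/gc), which as far as we can tell is what produces the reason="forced collection" entries in the log excerpt. The address, interval, and error handling are illustrative only, not a reproduction of our actual setup.

```go
// gc_trigger.go - an assumed sketch of a periodic forced client GC caller.
// The address and 30-minute interval are made up for illustration.
package main

import (
	"fmt"
	"net/http"
	"os"
	"time"
)

func main() {
	client := &http.Client{Timeout: 10 * time.Second}

	// Force allocation GC on the local client on a fixed interval; this is what
	// would show up as reason="forced collection" in the client logs.
	for range time.Tick(30 * time.Minute) {
		req, err := http.NewRequest(http.MethodPut, "http://127.0.0.1:4646/v1/client/gc", nil)
		if err != nil {
			fmt.Fprintf(os.Stderr, "building GC request failed: %v\n", err)
			continue
		}
		resp, err := client.Do(req)
		if err != nil {
			fmt.Fprintf(os.Stderr, "forced GC request failed: %v\n", err)
			continue
		}
		resp.Body.Close()
		fmt.Println("forced client GC:", resp.Status)
	}
}
```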

@pznamensky

Looks like the same issue as #8038

@idrennanvmware (Contributor, Author)

Just happened again. Notes on #7987

tgross (Member) commented Jun 22, 2020

@shoenig should we close this one as well given #8038?

shoenig (Member) commented Jun 23, 2020

I'm gonna close this out in favor of #7987, @idrennanvmware. Definitely let us know if there are still problems after upgrading to Nomad v0.11.3+!

shoenig closed this as completed Jun 23, 2020
github-actions bot commented Nov 6, 2022

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

github-actions bot locked as resolved and limited conversation to collaborators Nov 6, 2022