Tasks being killed off due to rpc error #9950

thetooth · 2021-02-03T01:08:11Z

Nomad version

Nomad v1.0.1 (c9c68aa)

Operating system and Environment details

Linux magic 5.5.13-arch2-1 #1 SMP PREEMPT Mon, 30 Mar 2020 20:42:41 +0000 x86_64 GNU/Linux

Issue

All tasks are currently being killed off on a particular node shortly after they start, anything scheduled here that needs to run on this node ends up in a permanent pending state.

Restarting nomad allows the task to be rescheduled after purging it but it will quickly fail again. The logs suggest the garbage collector is destroying everything on the node the moment it's catalogued???

    2021-02-03T09:02:30.304+1030 [INFO]  client.alloc_runner.task_runner.task_hook.logmon.nomad: opening fifo: alloc_id=03dbeb0b-5931-6f28-bc22-4eec77745f50 task=recorder @module=logmon path=/var/nomad/alloc/03dbeb0b-5931-6f28-bc22-4eec77745f50/alloc/logs/.recorder.stdout.fifo timestamp=2021-02-03T09:02:30.304+1030
 client.alloc_runner.task_runner.task_hook.logmon.nomad: opening fifo: alloc_id=03dbeb0b-5931-6f28-bc22-4eec77745f50 task=recorder @module=logmon path=/var/nomad/alloc/03dbeb0b-5931-6f28-bc22-4eec77745f50/alloc/logs/.recorder.stdout.fifo timestamp=2021-02-03T09:02:30.304+1030
    2021-02-03T09:02:30.304+1030 [INFO]  client.alloc_runner.task_runner.task_hook.logmon.nomad: opening fifo: alloc_id=03dbeb0b-5931-6f28-bc22-4eec77745f50 task=recorder @module=logmon path=/var/nomad/alloc/03dbeb0b-5931-6f28-bc22-4eec77745f50/alloc/logs/.recorder.stderr.fifo timestamp=2021-02-03T09:02:30.304+1030
 client.alloc_runner.task_runner.task_hook.logmon.nomad: opening fifo: alloc_id=03dbeb0b-5931-6f28-bc22-4eec77745f50 task=recorder @module=logmon path=/var/nomad/alloc/03dbeb0b-5931-6f28-bc22-4eec77745f50/alloc/logs/.recorder.stderr.fifo timestamp=2021-02-03T09:02:30.304+1030
    2021/02/03 09:02:30.360886 [INFO] (runner) creating new runner (dry: false, once: false)
    2021/02/03 09:02:30.360999 [INFO] (runner) creating watcher
(runner) creating new runner (dry: false, once: false)
(runner) creating watcher
    2021/02/03 09:02:30.361067 [INFO] (runner) starting
(runner) starting
    2021/02/03 09:02:30.478630 [INFO] (runner) rendered "(dynamic)" => "/var/nomad/alloc/03dbeb0b-5931-6f28-bc22-4eec77745f50/recorder/secrets/file.env"
(runner) rendered "(dynamic)" => "/var/nomad/alloc/03dbeb0b-5931-6f28-bc22-4eec77745f50/recorder/secrets/file.env"
    2021-02-03T09:02:30.483+1030 [INFO]  client.driver_mgr.raw_exec: starting task: driver=raw_exec driver_cfg="{Command:/bin/bash Args:[-c cd /opt && /opt/recorder -nats nats://192.168.0.244:4222]}"
 client.driver_mgr.raw_exec: starting task: driver=raw_exec driver_cfg="{Command:/bin/bash Args:[-c cd /opt && /opt/recorder -nats nats://192.168.0.244:4222]}"
    2021-02-03T09:06:48.469+1030 [INFO]  client.gc: garbage collecting allocation: alloc_id=fb42bec2-b846-ee41-2d34-9b6d82593ba7 reason="forced collection"
 client.gc: garbage collecting allocation: alloc_id=fb42bec2-b846-ee41-2d34-9b6d82593ba7 reason="forced collection"
    2021-02-03T09:42:42.530+1030 [INFO]  client.gc: marking allocation for GC: alloc_id=03dbeb0b-5931-6f28-bc22-4eec77745f50
    2021-02-03T09:42:42.530+1030 [INFO]  client.gc: garbage collecting allocation: alloc_id=b9711ce8-80de-879f-b86a-767366574fae reason="forced collection"
    2021-02-03T09:42:42.530+1030 [INFO]  client.gc: garbage collecting allocation: alloc_id=abf24944-980e-e088-c558-9f43c02857ab reason="forced collection"
    2021-02-03T09:42:42.530+1030 [INFO]  client.gc: garbage collecting allocation: alloc_id=ebf18f69-9f3d-1fba-02c2-0cd5137855e1 reason="forced collection"
    2021-02-03T09:42:42.530+1030 [INFO]  client.gc: garbage collecting allocation: alloc_id=03dbeb0b-5931-6f28-bc22-4eec77745f50 reason="forced collection"
    2021-02-03T09:42:42.530+1030 [INFO]  client.gc: garbage collecting allocation: alloc_id=46a1a840-def3-108d-eff8-2f3d41f00a5f reason="forced collection"
 client.gc: marking allocation for GC: alloc_id=03dbeb0b-5931-6f28-bc22-4eec77745f50
 client.gc: garbage collecting allocation: alloc_id=b9711ce8-80de-879f-b86a-767366574fae reason="forced collection"
 client.gc: garbage collecting allocation: alloc_id=abf24944-980e-e088-c558-9f43c02857ab reason="forced collection"
 client.gc: garbage collecting allocation: alloc_id=ebf18f69-9f3d-1fba-02c2-0cd5137855e1 reason="forced collection"
 client.gc: garbage collecting allocation: alloc_id=03dbeb0b-5931-6f28-bc22-4eec77745f50 reason="forced collection"
 client.gc: garbage collecting allocation: alloc_id=46a1a840-def3-108d-eff8-2f3d41f00a5f reason="forced collection"
    2021-02-03T09:42:42.595+1030 [ERROR] client.driver_mgr.raw_exec: error receiving stream from Stats executor RPC, closing stream: alloc_id=03dbeb0b-5931-6f28-bc22-4eec77745f50 driver=raw_exec task_name=recorder error="rpc error: code = Unavailable desc = transport is closing"
    2021-02-03T09:42:42.595+1030 [ERROR] client.alloc_runner.task_runner.task_hook.stats_hook: failed to start stats collection for task: alloc_id=03dbeb0b-5931-6f28-bc22-4eec77745f50 task=recorder error="rpc error: code = Canceled desc = grpc: the client connection is closing"
client.driver_mgr.raw_exec: error receiving stream from Stats executor RPC, closing stream: alloc_id=03dbeb0b-5931-6f28-bc22-4eec77745f50 driver=raw_exec task_name=recorder error="rpc error: code = Unavailable desc = transport is closing"
client.alloc_runner.task_runner.task_hook.stats_hook: failed to start stats collection for task: alloc_id=03dbeb0b-5931-6f28-bc22-4eec77745f50 task=recorder error="rpc error: code = Canceled desc = grpc: the client connection is closing"
    2021/02/03 09:42:42.602868 [INFO] (runner) stopping
    2021/02/03 09:42:42.602919 [INFO] (runner) received finish
(runner) stopping
(runner) received finish

The text was updated successfully, but these errors were encountered:

tgross · 2021-02-03T13:24:32Z

Hi @thetooth! The "forced collection" message you're seeing is because the client is full enough that it needs to GC allocations that have failed. ~4 minutes passed between start and finish, and the GC can't reap a running task. The RPC error you're seeing is coming from the stats collector not having anything to collect stats from; it shouldn't be throwing an error on a task stop and that's probably a bug but that's unrelated to why the task has stopped in the first place.

A few things we could use here:

Take a look at your client GC config so as to delay the GC of allocations: gc_max_allocs, gc_disk_usage_threshold, etc.
Once you've done that, you should be able to get the nomad alloc logs :alloc_id for the allocation (don't forget to check with the -stderr flag too)
Turn up the client's logs to debug so that we can see if the client is giving us any more clues as to what's going on.

tgross · 2021-02-26T20:58:18Z

Without additional information there's not much more we can do with this issue. Going to close this one out. Please feel free to re-open or open a new issue if you have additional information.

github-actions · 2022-10-22T02:41:53Z

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

nickethier added theme/client type/bug stage/needs-investigation labels Feb 3, 2021

tgross added theme/driver/raw_exec stage/waiting-reply labels Feb 3, 2021

tgross self-assigned this Feb 12, 2021

tgross closed this as completed Feb 26, 2021

github-actions bot locked as resolved and limited conversation to collaborators Oct 22, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Tasks being killed off due to rpc error #9950

Tasks being killed off due to rpc error #9950

thetooth commented Feb 3, 2021

tgross commented Feb 3, 2021 •

edited

Loading

tgross commented Feb 26, 2021

github-actions bot commented Oct 22, 2022

Tasks being killed off due to rpc error #9950

Tasks being killed off due to rpc error #9950

Comments

thetooth commented Feb 3, 2021

Nomad version

Operating system and Environment details

Issue

tgross commented Feb 3, 2021 • edited Loading

tgross commented Feb 26, 2021

github-actions bot commented Oct 22, 2022

tgross commented Feb 3, 2021 •

edited

Loading