No duplicates 🥲.
I have searched for a similar issue in our bug tracker and didn't find any solutions.
What happened?
A bug happened!
I'm using the Jobs plugin and trying to configure the jobs.pool.destroy_timeout parameter so that long-running tasks have time to finish. I expect the workers to be killed only after the configured timeout, but the killing happens much sooner. The log also shows that the task could not be returned to the queue because the connection was already closed.
Version (rr --version)
rr2 version 2023.1.3 (build time: 2023-05-11T12:32:12+0000, go1.20.4), OS: linux, arch: amd64
How to reproduce the issue?
Create a dummy worker that sleeps for 120 seconds. Add some tasks to the queue test-1. Use a config with destroy_timeout: 300s (5 min):
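The exact config isn't reproduced here; a minimal sketch of the kind of v3 .rr.yaml meant, assuming a PHP worker and a local AMQP broker (the server command, broker address, and pipeline options are illustrative, and whatever produced the 600-second grace timeout in the log below isn't shown), looks like this:

version: "3"

rpc:
  listen: tcp://127.0.0.1:6001

server:
  command: "php worker.php"                    # assumed worker command

amqp:
  addr: amqp://guest:guest@127.0.0.1:5672/     # assumed broker address

logs:
  level: debug

jobs:
  pool:
    num_workers: 2
    destroy_timeout: 300s                      # expect up to 5 minutes for in-flight jobs before workers are stopped
  consume: [ "test-1" ]
  pipelines:
    test-1:
      driver: amqp
      config:
        queue: test-1
        prefetch: 10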
Start ./rr serve, wait for task delivery, and stop with CTRL+C.
Looking at the log output, we have grace timeout is: 600 seconds and destroy signal received {"timeout": "5m0s"}. Looks good.
Next, I see only a 10-second wait before the workers are killed:
2023-05-20T15:37:08+0000 DEBUG server sending stop request to the worker {"pid": 3391}
2023-05-20T15:37:18+0000 WARN server worker doesn't respond on stop command, killing process {"PID": 3391}
Relevant log output
2023-05-20T15:36:35+0000 DEBUG rpc plugin was started {"address": "tcp://127.0.0.1:6001", "list of the plugins with RPC methods:": ["jobs", "app", "resetter", "metrics", "informer", "lock"]}
2023-05-20T15:36:35+0000 DEBUG jobs driver ready {"pipeline": "test-1", "driver": "amqp", "start": "2023-05-20T15:36:35+0000", "elapsed": "4.686321ms"}
2023-05-20T15:36:35+0000 DEBUG amqp pipeline was started {"driver": "amqp", "pipeline": "test-1", "start": "2023-05-20T15:36:35+0000", "elapsed": "1.747318ms"}
[INFO] RoadRunner server started; version: 2023.1.3, buildtime: 2023-05-11T12:32:12+0000
2023-05-20T15:36:35+0000 DEBUG amqp get raw payload {"assigned ID": "46053157-1d06-480e-bd3f-0608d97a84ec"}
2023-05-20T15:36:35+0000 DEBUG amqp get raw payload {"assigned ID": "f68075ec-fb37-46b9-9a1e-85d3e6cb5bcc"}
2023-05-20T15:36:35+0000 DEBUG server worker is allocated {"pid": 3389, "internal_event_name": "EventWorkerConstruct"}
2023-05-20T15:36:35+0000 DEBUG server worker is allocated {"pid": 3391, "internal_event_name": "EventWorkerConstruct"}
2023-05-20T15:36:35+0000 DEBUG jobs job processing was started {"ID": "46053157-1d06-480e-bd3f-0608d97a84ec", "start": "2023-05-20T15:36:35+0000", "elapsed": "44.058µs"}
2023-05-20T15:36:35+0000 DEBUG jobs job processing was started {"ID": "f68075ec-fb37-46b9-9a1e-85d3e6cb5bcc", "start": "2023-05-20T15:36:35+0000", "elapsed": "48.162µs"}
2023-05-20T15:36:35+0000 INFO server [2023-05-20T15:36:35+00:00] [3391] command.INFO Started task
{"task":{"id":"46053157-1d06-480e-bd3f-0608d97a84ec","queue":"test-1","name":"deduced_by_rr","payload":["payload"],"headers":{"Traceparent":["00-9564f7aa2d9aff2c112707d0a20cd932-e55348c3c72bf326-01"],"Uber-Trace-Id":["9564f7aa2d9aff2c112707d0a20cd932:e55348c3c72bf326:0:1"]}}}
2023-05-20T15:36:35+0000 INFO server [2023-05-20T15:36:35+00:00] [3389] command.INFO Started task
{"task":{"id":"f68075ec-fb37-46b9-9a1e-85d3e6cb5bcc","queue":"test-1","name":"deduced_by_rr","payload":["payload"],"headers":{"Traceparent":["00-b90ddf6cf0e4dd80289f0234957731a6-f3ed915eb7dce574-01"],"Uber-Trace-Id":["b90ddf6cf0e4dd80289f0234957731a6:f3ed915eb7dce574:0:1"]}}}
^Cstop signal received, grace timeout is: 600 seconds
2023-05-20T15:36:38+0000 DEBUG amqp pipeline was stopped {"driver": "amqp", "pipeline": "test-1", "start": "2023-05-20T15:36:38+0000", "elapsed": "4.366µs"}
2023-05-20T15:36:38+0000 INFO server destroy signal received {"timeout": "5m0s"}
2023-05-20T15:36:38+0000 DEBUG amqp starting stop routine
2023-05-20T15:36:38+0000 DEBUG amqp delivery channel was closed, leaving the rabbit listener
2023-05-20T15:36:38+0000 DEBUG amqp number of listeners {"listeners": 0}
2023-05-20T15:37:08+0000 DEBUG server sending stop request to the worker {"pid": 3389}
2023-05-20T15:37:08+0000 DEBUG server sending stop request to the worker {"pid": 3391}
2023-05-20T15:37:18+0000 WARN server worker doesn't respond on stop command, killing process {"PID": 3391}
2023-05-20T15:37:18+0000 WARN server worker doesn't respond on stop command, killing process {"PID": 3389}
2023-05-20T15:37:18+0000 WARN server network error {"reason": "network", "pid": 3391, "internal_event_name": "EventWorkerError", "error": "worker_exec: Network:\n\tsync_worker_exec_payload: EOF"}
2023-05-20T15:37:18+0000 ERROR jobs job execute failed {"error": "worker_exec: Network:\n\tsync_worker_exec_payload: EOF"}
2023-05-20T15:37:18+0000 DEBUG jobs ------> job poller was stopped <------
2023-05-20T15:37:18+0000 WARN server network error {"reason": "network", "pid": 3389, "internal_event_name": "EventWorkerError", "error": "worker_exec: Network:\n\tsync_worker_exec_payload: EOF"}
2023-05-20T15:37:18+0000 ERROR jobs job processed with errors {"error": "worker_exec: Network:\n\tsync_worker_exec_payload: EOF", "ID": "f68075ec-fb37-46b9-9a1e-85d3e6cb5bcc", "start": "2023-05-20T15:36:35+0000", "elapsed": "42.538869896s"}
2023-05-20T15:37:18+0000 ERROR jobs negatively acknowledge failed {"ID": "f68075ec-fb37-46b9-9a1e-85d3e6cb5bcc", "error": "Exception (504) Reason: \"channel/connection is not open\""}
2023-05-20T15:37:18+0000 ERROR jobs job execute failed {"error": "worker_exec: Network:\n\tsync_worker_exec_payload: EOF"}
2023-05-20T15:37:18+0000 DEBUG jobs ------> job poller was stopped <------
2023-05-20T15:37:18+0000 DEBUG server worker destroyed {"pid": 3391, "internal_event_name": "EventWorkerDestruct"}
2023-05-20T15:37:18+0000 DEBUG server worker destroyed {"pid": 3389, "internal_event_name": "EventWorkerDestruct"}