[🐛 BUG]: Safe shutdown of workers occurs before the specified time through a process kill #1567

Kaspiman · 2023-05-20T15:45:36Z

No duplicates 🥲.

I have searched for a similar issue in our bug tracker and didn't find any solutions.

What happened?

A bug happened!

I'm using Jobs plugin and trying to configure jobs.pool.destroy_timeout parameter for waiting while long tasks will be finished. I expect the workers to killed after the set timeout, but the killing happens much sooner. The log also shows that the task could not be returned to the queue due to a closed connection.

Version (rr --version)

rr2 version 2023.1.3 (build time: 2023-05-11T12:32:12+0000, go1.20.4), OS: linux, arch: amd64

How to reproduce the issue?

Create dummy worker with sleep 120s. Add some tasks to queue test-1.

Use config like this with destroy_timeout: 300s (5 min):

version: '3'

server:
    command: '/usr/local/bin/php public/worker.php'

amqp:
    addr: 'amqp://guest:guest@rabbitmq:5672'

logs:
    level: debug
    mode: development
    output: stdout

endure:
    grace_period: 600s

jobs:
    num_pollers: 12
    pool:
        num_workers: 2
        max_jobs: 0
        destroy_timeout: 300s
    consume:
        - test-1
    pipelines:
        test-1:
            driver: amqp
            config:
                queue: test-1
                prefetch: 10
                priority: 10
                exchange: default
                redial_timeout: 60
                routing_key: test-1
                durable: true
                exchange_durable: true
                consume_all: true

Start ./rr serve, wait for task delivery and stop with CTRL+C.

Let's see the log output. There we have grace timeout is: 600 seconds and destroy signal received {"timeout": "5m0s"}. Looks good.

Next, i see a 10 second wait and killing of the workers:

2023-05-20T15:37:08+0000        DEBUG   server          sending stop request to the worker      {"pid": 3391}
2023-05-20T15:37:18+0000        WARN    server          worker doesn't respond on stop command, killing process {"PID": 3391}

Relevant log output

2023-05-20T15:36:35+0000        DEBUG   rpc             plugin was started      {"address": "tcp://127.0.0.1:6001", "list of the plugins with RPC methods:": ["jobs", "app", "reset
ter", "metrics", "informer", "lock"]}
2023-05-20T15:36:35+0000        DEBUG   jobs            driver ready    {"pipeline": "test-1", "driver": "amqp", "start": "2023-05-20T15:36:35+0000", "elapsed": "4.686321ms"}
2023-05-20T15:36:35+0000        DEBUG   amqp            pipeline was started    {"driver": "amqp", "pipeline": "test-1", "start": "2023-05-20T15:36:35+0000", "elapsed": "1.747318m
s"}
[INFO] RoadRunner server started; version: 2023.1.3, buildtime: 2023-05-11T12:32:12+0000
2023-05-20T15:36:35+0000        DEBUG   amqp            get raw payload {"assigned ID": "46053157-1d06-480e-bd3f-0608d97a84ec"}
2023-05-20T15:36:35+0000        DEBUG   amqp            get raw payload {"assigned ID": "f68075ec-fb37-46b9-9a1e-85d3e6cb5bcc"}
2023-05-20T15:36:35+0000        DEBUG   server          worker is allocated     {"pid": 3389, "internal_event_name": "EventWorkerConstruct"}
2023-05-20T15:36:35+0000        DEBUG   server          worker is allocated     {"pid": 3391, "internal_event_name": "EventWorkerConstruct"}
2023-05-20T15:36:35+0000        DEBUG   jobs            job processing was started      {"ID": "46053157-1d06-480e-bd3f-0608d97a84ec", "start": "2023-05-20T15:36:35+0000", "elapse
d": "44.058µs"}
2023-05-20T15:36:35+0000        DEBUG   jobs            job processing was started      {"ID": "f68075ec-fb37-46b9-9a1e-85d3e6cb5bcc", "start": "2023-05-20T15:36:35+0000", "elapse
d": "48.162µs"}
2023-05-20T15:36:35+0000        INFO    server          [2023-05-20T15:36:35+00:00] [3391] command.INFO  Started task
{"task":{"id":"46053157-1d06-480e-bd3f-0608d97a84ec","queue":"test-1","name":"deduced_by_rr","payload":["payload"],"headers":{"Traceparent":["00-9564f7aa2d9aff2c112707d0a20cd932-e
55348c3c72bf326-01"],"Uber-Trace-Id":["9564f7aa2d9aff2c112707d0a20cd932:e55348c3c72bf326:0:1"]}}}

2023-05-20T15:36:35+0000        INFO    server          [2023-05-20T15:36:35+00:00] [3389] command.INFO  Started task
{"task":{"id":"f68075ec-fb37-46b9-9a1e-85d3e6cb5bcc","queue":"test-1","name":"deduced_by_rr","payload":["payload"],"headers":{"Traceparent":["00-b90ddf6cf0e4dd80289f0234957731a6-f
3ed915eb7dce574-01"],"Uber-Trace-Id":["b90ddf6cf0e4dd80289f0234957731a6:f3ed915eb7dce574:0:1"]}}}

^Cstop signal received, grace timeout is: 600 seconds
2023-05-20T15:36:38+0000        DEBUG   amqp            pipeline was stopped    {"driver": "amqp", "pipeline": "test-1", "start": "2023-05-20T15:36:38+0000", "elapsed": "4.366µs"}
2023-05-20T15:36:38+0000        INFO    server          destroy signal received {"timeout": "5m0s"}
2023-05-20T15:36:38+0000        DEBUG   amqp            starting stop routine
2023-05-20T15:36:38+0000        DEBUG   amqp            delivery channel was closed, leaving the rabbit listener
2023-05-20T15:36:38+0000        DEBUG   amqp            number of listeners     {"listeners": 0}
2023-05-20T15:37:08+0000        DEBUG   server          sending stop request to the worker      {"pid": 3389}
2023-05-20T15:37:08+0000        DEBUG   server          sending stop request to the worker      {"pid": 3391}
2023-05-20T15:37:18+0000        WARN    server          worker doesn't respond on stop command, killing process {"PID": 3391}
2023-05-20T15:37:18+0000        WARN    server          worker doesn't respond on stop command, killing process {"PID": 3389}
2023-05-20T15:37:18+0000        WARN    server          network error   {"reason": "network", "pid": 3391, "internal_event_name": "EventWorkerError", "error": "worker_exec: Networ
connection is not open\""}
2023-05-20T15:37:18+0000        ERROR   jobs            job execute failed      {"error": "worker_exec: Network:\n\tsync_worker_exec_payload: EOF"}
2023-05-20T15:37:18+0000        DEBUG   jobs            ------> job poller was stopped <------
2023-05-20T15:37:18+0000        WARN    server          network error   {"reason": "network", "pid": 3389, "internal_event_name": "EventWorkerError", "error": "worker_exec: Networ
k:\n\tsync_worker_exec_payload: EOF"}
2023-05-20T15:37:18+0000        ERROR   jobs            job processed with errors       {"error": "worker_exec: Network:\n\tsync_worker_exec_payload: EOF", "ID": "f68075ec-fb37-46
b9-9a1e-85d3e6cb5bcc", "start": "2023-05-20T15:36:35+0000", "elapsed": "42.538869896s"}
2023-05-20T15:37:18+0000        ERROR   jobs            negatively acknowledge failed   {"ID": "f68075ec-fb37-46b9-9a1e-85d3e6cb5bcc", "error": "Exception (504) Reason: \"channel/
connection is not open\""}
2023-05-20T15:37:18+0000        ERROR   jobs            job execute failed      {"error": "worker_exec: Network:\n\tsync_worker_exec_payload: EOF"}
2023-05-20T15:37:18+0000        DEBUG   jobs            ------> job poller was stopped <------
2023-05-20T15:37:18+0000        DEBUG   server          worker destroyed        {"pid": 3391, "internal_event_name": "EventWorkerDestruct"}
2023-05-20T15:37:18+0000        DEBUG   server          worker destroyed        {"pid": 3389, "internal_event_name": "EventWorkerDestruct"}

The text was updated successfully, but these errors were encountered:

rustatian · 2023-05-20T17:37:47Z

Hey @embargo2710 👋🏻
Thanks for the report, fixed 👍🏻

Kaspiman added B-bug Bug: bug, exception F-need-verification labels May 20, 2023

Kaspiman assigned rustatian May 20, 2023

Kaspiman changed the title ~~[🐛 BUG]:~~ [🐛 BUG]: Safe shutdown of workers occurs before the specified time through a process kill May 20, 2023

rustatian added this to General May 20, 2023

github-project-automation bot moved this to Backlog in General May 20, 2023

rustatian moved this from Backlog to In Progress in General May 20, 2023

rustatian modified the milestones: v2023.2.0, v2023.1.4 May 20, 2023

rustatian mentioned this issue May 20, 2023

fix: take into account pool's deadline option roadrunner-server/server#43

Merged

6 tasks

rustatian removed the F-need-verification label May 20, 2023

rustatian closed this as completed in roadrunner-server/server#43 May 20, 2023

github-project-automation bot moved this from In Progress to Unreleased in General May 20, 2023

rustatian mentioned this issue May 20, 2023

fix: improve destroy signal handling roadrunner-server/sdk#72

Merged

6 tasks

rustatian moved this from Unreleased to Done in General May 25, 2023

rustatian mentioned this issue May 25, 2023

release: v2023.1.4 #1575

Merged

6 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[🐛 BUG]: Safe shutdown of workers occurs before the specified time through a process kill #1567

[🐛 BUG]: Safe shutdown of workers occurs before the specified time through a process kill #1567

Kaspiman commented May 20, 2023

rustatian commented May 20, 2023

[🐛 BUG]: Safe shutdown of workers occurs before the specified time through a process kill #1567

[🐛 BUG]: Safe shutdown of workers occurs before the specified time through a process kill #1567

Comments

Kaspiman commented May 20, 2023

No duplicates 🥲.

What happened?

Version (rr --version)

How to reproduce the issue?

Relevant log output

rustatian commented May 20, 2023