Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[🐛 BUG]: Safe shutdown of workers occurs before the specified time through a process kill #1567

Closed
1 task done
Kaspiman opened this issue May 20, 2023 · 1 comment · Fixed by roadrunner-server/server#43
Assignees
Labels
B-bug Bug: bug, exception
Milestone

Comments

@Kaspiman
Copy link

No duplicates 🥲.

  • I have searched for a similar issue in our bug tracker and didn't find any solutions.

What happened?

A bug happened!

I'm using Jobs plugin and trying to configure jobs.pool.destroy_timeout parameter for waiting while long tasks will be finished. I expect the workers to killed after the set timeout, but the killing happens much sooner. The log also shows that the task could not be returned to the queue due to a closed connection.

Version (rr --version)

rr2 version 2023.1.3 (build time: 2023-05-11T12:32:12+0000, go1.20.4), OS: linux, arch: amd64

How to reproduce the issue?

Create dummy worker with sleep 120s. Add some tasks to queue test-1.

Use config like this with destroy_timeout: 300s (5 min):

version: '3'

server:
    command: '/usr/local/bin/php public/worker.php'

amqp:
    addr: 'amqp://guest:guest@rabbitmq:5672'

logs:
    level: debug
    mode: development
    output: stdout

endure:
    grace_period: 600s

jobs:
    num_pollers: 12
    pool:
        num_workers: 2
        max_jobs: 0
        destroy_timeout: 300s
    consume:
        - test-1
    pipelines:
        test-1:
            driver: amqp
            config:
                queue: test-1
                prefetch: 10
                priority: 10
                exchange: default
                redial_timeout: 60
                routing_key: test-1
                durable: true
                exchange_durable: true
                consume_all: true

Start ./rr serve, wait for task delivery and stop with CTRL+C.

Let's see the log output. There we have grace timeout is: 600 seconds and destroy signal received {"timeout": "5m0s"}. Looks good.

Next, i see a 10 second wait and killing of the workers:

2023-05-20T15:37:08+0000        DEBUG   server          sending stop request to the worker      {"pid": 3391}
2023-05-20T15:37:18+0000        WARN    server          worker doesn't respond on stop command, killing process {"PID": 3391}

Relevant log output

2023-05-20T15:36:35+0000        DEBUG   rpc             plugin was started      {"address": "tcp://127.0.0.1:6001", "list of the plugins with RPC methods:": ["jobs", "app", "reset
ter", "metrics", "informer", "lock"]}
2023-05-20T15:36:35+0000        DEBUG   jobs            driver ready    {"pipeline": "test-1", "driver": "amqp", "start": "2023-05-20T15:36:35+0000", "elapsed": "4.686321ms"}
2023-05-20T15:36:35+0000        DEBUG   amqp            pipeline was started    {"driver": "amqp", "pipeline": "test-1", "start": "2023-05-20T15:36:35+0000", "elapsed": "1.747318m
s"}
[INFO] RoadRunner server started; version: 2023.1.3, buildtime: 2023-05-11T12:32:12+0000
2023-05-20T15:36:35+0000        DEBUG   amqp            get raw payload {"assigned ID": "46053157-1d06-480e-bd3f-0608d97a84ec"}
2023-05-20T15:36:35+0000        DEBUG   amqp            get raw payload {"assigned ID": "f68075ec-fb37-46b9-9a1e-85d3e6cb5bcc"}
2023-05-20T15:36:35+0000        DEBUG   server          worker is allocated     {"pid": 3389, "internal_event_name": "EventWorkerConstruct"}
2023-05-20T15:36:35+0000        DEBUG   server          worker is allocated     {"pid": 3391, "internal_event_name": "EventWorkerConstruct"}
2023-05-20T15:36:35+0000        DEBUG   jobs            job processing was started      {"ID": "46053157-1d06-480e-bd3f-0608d97a84ec", "start": "2023-05-20T15:36:35+0000", "elapse
d": "44.058µs"}
2023-05-20T15:36:35+0000        DEBUG   jobs            job processing was started      {"ID": "f68075ec-fb37-46b9-9a1e-85d3e6cb5bcc", "start": "2023-05-20T15:36:35+0000", "elapse
d": "48.162µs"}
2023-05-20T15:36:35+0000        INFO    server          [2023-05-20T15:36:35+00:00] [3391] command.INFO  Started task
{"task":{"id":"46053157-1d06-480e-bd3f-0608d97a84ec","queue":"test-1","name":"deduced_by_rr","payload":["payload"],"headers":{"Traceparent":["00-9564f7aa2d9aff2c112707d0a20cd932-e
55348c3c72bf326-01"],"Uber-Trace-Id":["9564f7aa2d9aff2c112707d0a20cd932:e55348c3c72bf326:0:1"]}}}

2023-05-20T15:36:35+0000        INFO    server          [2023-05-20T15:36:35+00:00] [3389] command.INFO  Started task
{"task":{"id":"f68075ec-fb37-46b9-9a1e-85d3e6cb5bcc","queue":"test-1","name":"deduced_by_rr","payload":["payload"],"headers":{"Traceparent":["00-b90ddf6cf0e4dd80289f0234957731a6-f
3ed915eb7dce574-01"],"Uber-Trace-Id":["b90ddf6cf0e4dd80289f0234957731a6:f3ed915eb7dce574:0:1"]}}}

^Cstop signal received, grace timeout is: 600 seconds
2023-05-20T15:36:38+0000        DEBUG   amqp            pipeline was stopped    {"driver": "amqp", "pipeline": "test-1", "start": "2023-05-20T15:36:38+0000", "elapsed": "4.366µs"}
2023-05-20T15:36:38+0000        INFO    server          destroy signal received {"timeout": "5m0s"}
2023-05-20T15:36:38+0000        DEBUG   amqp            starting stop routine
2023-05-20T15:36:38+0000        DEBUG   amqp            delivery channel was closed, leaving the rabbit listener
2023-05-20T15:36:38+0000        DEBUG   amqp            number of listeners     {"listeners": 0}
2023-05-20T15:37:08+0000        DEBUG   server          sending stop request to the worker      {"pid": 3389}
2023-05-20T15:37:08+0000        DEBUG   server          sending stop request to the worker      {"pid": 3391}
2023-05-20T15:37:18+0000        WARN    server          worker doesn't respond on stop command, killing process {"PID": 3391}
2023-05-20T15:37:18+0000        WARN    server          worker doesn't respond on stop command, killing process {"PID": 3389}
2023-05-20T15:37:18+0000        WARN    server          network error   {"reason": "network", "pid": 3391, "internal_event_name": "EventWorkerError", "error": "worker_exec: Networ
connection is not open\""}
2023-05-20T15:37:18+0000        ERROR   jobs            job execute failed      {"error": "worker_exec: Network:\n\tsync_worker_exec_payload: EOF"}
2023-05-20T15:37:18+0000        DEBUG   jobs            ------> job poller was stopped <------
2023-05-20T15:37:18+0000        WARN    server          network error   {"reason": "network", "pid": 3389, "internal_event_name": "EventWorkerError", "error": "worker_exec: Networ
k:\n\tsync_worker_exec_payload: EOF"}
2023-05-20T15:37:18+0000        ERROR   jobs            job processed with errors       {"error": "worker_exec: Network:\n\tsync_worker_exec_payload: EOF", "ID": "f68075ec-fb37-46
b9-9a1e-85d3e6cb5bcc", "start": "2023-05-20T15:36:35+0000", "elapsed": "42.538869896s"}
2023-05-20T15:37:18+0000        ERROR   jobs            negatively acknowledge failed   {"ID": "f68075ec-fb37-46b9-9a1e-85d3e6cb5bcc", "error": "Exception (504) Reason: \"channel/
connection is not open\""}
2023-05-20T15:37:18+0000        ERROR   jobs            job execute failed      {"error": "worker_exec: Network:\n\tsync_worker_exec_payload: EOF"}
2023-05-20T15:37:18+0000        DEBUG   jobs            ------> job poller was stopped <------
2023-05-20T15:37:18+0000        DEBUG   server          worker destroyed        {"pid": 3391, "internal_event_name": "EventWorkerDestruct"}
2023-05-20T15:37:18+0000        DEBUG   server          worker destroyed        {"pid": 3389, "internal_event_name": "EventWorkerDestruct"}
@Kaspiman Kaspiman added B-bug Bug: bug, exception F-need-verification labels May 20, 2023
@Kaspiman Kaspiman changed the title [🐛 BUG]: [🐛 BUG]: Safe shutdown of workers occurs before the specified time through a process kill May 20, 2023
@github-project-automation github-project-automation bot moved this to Backlog in General May 20, 2023
@rustatian rustatian moved this from Backlog to In Progress in General May 20, 2023
@rustatian rustatian modified the milestones: v2023.2.0, v2023.1.4 May 20, 2023
@github-project-automation github-project-automation bot moved this from In Progress to Unreleased in General May 20, 2023
@rustatian
Copy link
Member

Hey @embargo2710 👋🏻
Thanks for the report, fixed 👍🏻

@rustatian rustatian moved this from Unreleased to Done in General May 25, 2023
@rustatian rustatian mentioned this issue May 25, 2023
6 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
B-bug Bug: bug, exception
Projects
No open projects
Status: Done
Development

Successfully merging a pull request may close this issue.

2 participants