Send QUIT to worker before dying #14913

Merged
merged 1 commit into from
Feb 21, 2024

@TheRealHaoLiu TheRealHaoLiu commented Feb 21, 2024

SUMMARY

Fix a deadlock in which a dispatcher child process stays stuck in its queue-reading loop after the dispatcher parent process has decided to quit.

Steps to reproduce

  • Start a long-running job, such as a long sleep.
  • Shut down postgres while the job is running.
  • Wait more than 40 seconds (DISPATCHER_DB_DOWNTIME_TOLERANCE).
  • Restart postgres.

Symptoms

  • The running job stays stuck in the running state (it should fail).
  • New jobs stay stuck in the pending state (they should run).

Run gdb python -p <pid of dispatcher main process> and then py-bt.
The dispatcher main process is stuck waiting for a child process to terminate:

Traceback (most recent call first):
  <built-in method waitpid of module object at remote 0x7f923d461130>
  File "/usr/lib64/python3.9/multiprocessing/popen_fork.py", line 27, in poll
    pid, sts = os.waitpid(self.pid, flag)
  File "/usr/lib64/python3.9/multiprocessing/popen_fork.py", line 43, in wait
    return self.poll(os.WNOHANG if timeout == 0.0 else 0)
  File "/usr/lib64/python3.9/multiprocessing/process.py", line 149, in join
    res = self._popen.wait(timeout)
  File "/usr/lib64/python3.9/multiprocessing/util.py", line 357, in _exit_function
    p.join()

Run gdb python -p <pid of dispatcher child process associated with the job> and then py-bt.
The dispatcher child process is still in its main wait loop, reading from the queue:

Traceback (most recent call first):
  File "/usr/lib64/python3.9/selectors.py", line 416, in select
    fd_event_list = self._selector.poll(timeout)
  File "/usr/lib64/python3.9/multiprocessing/connection.py", line 935, in wait
    ready = selector.select(timeout)
  File "/usr/lib64/python3.9/multiprocessing/connection.py", line 428, in _poll
    r = wait([self], timeout)
  File "/usr/lib64/python3.9/multiprocessing/connection.py", line 261, in poll
    return self._poll(timeout)
  File "/usr/lib64/python3.9/multiprocessing/queues.py", line 113, in get
    if not self._poll(timeout):
  File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/awx/main/dispatch/worker/base.py", line 279, in read
    return queue.get(block=True, timeout=1)
  File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/awx/main/dispatch/worker/base.py", line 291, in work_loop
    body = self.read(queue)

Sending QUIT to every worker queue before raising the exception signals the workers to quit, so the parent is no longer left waiting on them.
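
As a rough illustration of this idea (a sketch, not the actual AWX code): the worker loop below reads with a one-second timeout, like the `read` and `work_loop` frames in the traceback, and a hypothetical `stop_workers` helper puts a `"QUIT"` sentinel on every worker queue before the parent gives up. The names `QUIT`, `stop_workers`, and `run_demo`, and the use of a plain string as the sentinel, are assumptions made for this example. The `fork` start method is chosen to keep the sketch self-contained and limits it to POSIX systems.

```python
import multiprocessing as mp
import queue

# Assumption: a plain string sentinel; the real message format may differ.
QUIT = "QUIT"

# "fork" avoids re-importing the main module in children (POSIX only).
ctx = mp.get_context("fork")


def work_loop(task_queue, results):
    """Worker loop: poll the queue once per second until QUIT arrives."""
    while True:
        try:
            body = task_queue.get(block=True, timeout=1)
        except queue.Empty:
            # Nothing queued. Without a sentinel this loop never ends,
            # even if the parent has already decided to exit.
            continue
        if body == QUIT:
            break
        results.put(body)  # stand-in for actually running the task


def stop_workers(worker_queues):
    """Hypothetical helper: tell every worker to leave its loop."""
    for q in worker_queues:
        q.put(QUIT)


def run_demo(n_workers=2):
    queues = [ctx.Queue() for _ in range(n_workers)]
    results = ctx.Queue()
    workers = [ctx.Process(target=work_loop, args=(q, results)) for q in queues]
    for w in workers:
        w.start()

    queues[0].put("task-1")
    finished = results.get(timeout=5)

    try:
        # Simulated fatal condition, e.g. the database staying down too long.
        raise RuntimeError("giving up")
    except RuntimeError:
        # Signal the workers before propagating the error; the error is
        # swallowed here only so the demo can report the outcome.
        stop_workers(queues)

    for w in workers:
        w.join(timeout=10)
    return finished, [w.exitcode for w in workers]
```

If `stop_workers` is skipped, the children keep cycling through the `queue.Empty` branch and the parent's final `join` has nothing to wake it, which matches the two tracebacks above.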

ISSUE TYPE
  • Bug, Docs Fix or other nominal change
COMPONENT NAME
  • API
AWX VERSION
awx: 23.8.2.dev33+gaf6a9410bb.d20240221
ADDITIONAL INFORMATION

Fix deadlock scenario where dispatcher child process stuck in reading from queue loop after dispatcher parent process decided to quit

Co-Authored-By: Alan Rominger <arominge@redhat.com>
@TheRealHaoLiu TheRealHaoLiu marked this pull request as ready for review February 21, 2024 20:10

@AlanCoding AlanCoding left a comment

Putting my thoughts here for later:

For later enhancements, we can obviously put this in a method. That method would also be a useful place to add logging. We could even wait to see whether the workers exit, and if they do not, escalate to SIGTERM and then SIGKILL. Shutdown deadlocks like this one have been disruptive, so I think any amount of that hardening is worthwhile.
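
One way the escalation described in this comment might look, as a sketch only: ask politely with a QUIT message, wait a grace period, then fall back to `Process.terminate()` (SIGTERM on POSIX) and finally `Process.kill()` (SIGKILL). The function name `shutdown_worker`, the `grace` parameter, and both demo workers are hypothetical and not part of AWX.

```python
import multiprocessing as mp
import signal
import time

ctx = mp.get_context("fork")  # POSIX only; keeps the sketch self-contained


def cooperative_worker(q):
    """Demo worker that exits as soon as it reads QUIT."""
    while q.get() != "QUIT":
        pass


def stubborn_worker():
    """Demo worker that ignores SIGTERM and never reads its queue."""
    signal.signal(signal.SIGTERM, signal.SIG_IGN)
    while True:
        time.sleep(0.1)


def shutdown_worker(proc, q, grace=1.0):
    """Stop one worker, escalating only as far as needed.

    Returns the strongest action that was required:
    "QUIT", "SIGTERM", or "SIGKILL".
    """
    q.put("QUIT")
    proc.join(timeout=grace)
    if not proc.is_alive():
        return "QUIT"

    proc.terminate()  # sends SIGTERM
    proc.join(timeout=grace)
    if not proc.is_alive():
        return "SIGTERM"

    proc.kill()  # sends SIGKILL, which cannot be ignored
    proc.join()
    return "SIGKILL"
```

Returning the action that was needed gives the caller something concrete to log, which fits the suggestion to use the method as a logging point.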

@TheRealHaoLiu TheRealHaoLiu merged commit 3fb3125 into ansible:devel Feb 21, 2024
23 checks passed
@TheRealHaoLiu TheRealHaoLiu deleted the quit-worker-queue branch February 21, 2024 21:08