"could not reach dispatcher" errors when canceling jobs #12740

Closed

shanemcd opened this issue Aug 25, 2022 · 6 comments

Comments

@shanemcd
Member

Please confirm the following

  • I agree to follow this project's code of conduct.
  • I have checked the current issues for duplicates.
  • I understand that AWX is open source software provided for free and that I might not receive a timely response.

Bug Summary

I noticed this in my logs when canceling jobs:

tools_awx_1     | 2022-08-25 16:15:43,403 ERROR    [b8bab65b] awx.main.models.unified_jobs could not reach dispatcher on awx_1 within 5s

I was seeing this error immediately, not after the 5-second timeout.

To get the underlying traceback, I changed this call from logger.error to logger.exception:

logger.error('could not reach dispatcher on {} within {}s'.format(self.execution_node, timeout))
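So locally the call became (my change, not repo code):

logger.exception('could not reach dispatcher on {} within {}s'.format(self.execution_node, timeout))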

Now I see:

tools_awx_1     | Traceback (most recent call last):
tools_awx_1     |   File "/awx_devel/./awx/main/models/unified_jobs.py", line 1408, in actually_running
tools_awx_1     |     running = self.celery_task_id in ControlDispatcher('dispatcher', self.controller_node or self.execution_node).running(timeout=timeout)
tools_awx_1     |   File "/awx_devel/./awx/main/dispatch/control.py", line 38, in running
tools_awx_1     |     return self.control_with_reply('running', *args, **kwargs)
tools_awx_1     |   File "/awx_devel/./awx/main/dispatch/control.py", line 53, in control_with_reply
tools_awx_1     |     for reply in conn.events(select_timeout=timeout, yield_timeouts=True):
tools_awx_1     |   File "/awx_devel/./awx/main/dispatch/__init__.py", line 35, in events
tools_awx_1     |     raise RuntimeError('Listening for events can only be done in autocommit mode')
tools_awx_1     | RuntimeError: Listening for events can only be done in autocommit mode

AWX version

21.5.1.dev1+gcbef8717a8.d20220825

Select the relevant components

  • UI
  • API
  • Docs
  • Collection
  • CLI
  • Other

Installation method

N/A

Modifications

no

Ansible version

No response

Operating system

No response

Web browser

No response

Steps to reproduce

Try canceling a job

Expected results

No errors

Actual results

An error

Additional information

No response

@shanemcd
Member Author

shanemcd commented Aug 25, 2022

It seems like this was a problem introduced in #12573.

Here is what I've learned so far:

get_autocommit here returns False:

if not pg_connection.get_autocommit():

But if you look at self.conn.autocommit on the underlying psycopg2 connection, it returns True.
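For concreteness, the discrepancy looks roughly like this (a debugging sketch, not repo code; it assumes pg_connection is Django's default connection, as awx/main/dispatch/__init__.py appears to import it):

from django.db import connection as pg_connection

print(pg_connection.get_autocommit())       # Django's tracked autocommit state -> False here
print(pg_connection.connection.autocommit)  # the underlying psycopg2 connection -> True here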

So the first thing I tried was:

diff --git a/awx/main/dispatch/__init__.py b/awx/main/dispatch/__init__.py
index 7fa4bd06f1..970e9bfc09 100644
--- a/awx/main/dispatch/__init__.py
+++ b/awx/main/dispatch/__init__.py
@@ -16,6 +16,7 @@ def get_local_queuename():
 
 class PubSub(object):
     def __init__(self, conn):
+        assert conn.autocommit, "Connection must be in autocommit mode."
         self.conn = conn
 
     def listen(self, channel):
@@ -31,9 +32,6 @@ class PubSub(object):
             cur.execute('SELECT pg_notify(%s, %s);', (channel, payload))
 
     def events(self, select_timeout=5, yield_timeouts=False):
-        if not pg_connection.get_autocommit():
-            raise RuntimeError('Listening for events can only be done in autocommit mode')
-
         while True:
             if select.select([self.conn], [], [], select_timeout) == NOT_READY:
                 if yield_timeouts:
@@ -73,6 +71,8 @@ def pg_bus_conn(new_connection=False):
             raise RuntimeError('Unexpectedly could not connect to postgres for pg_notify actions')
         conn = pg_connection.connection
 
+        assert conn.autocommit, "Connection must be in autocommit mode."
+
     pubsub = PubSub(conn)
     yield pubsub
     if new_connection:

But that now causes:

tools_awx_1     | 2022-08-25 17:04:36,911 ERROR    [a4f65eaa] awx.main.dispatch Worker failed to run task awx.main.scheduler.tasks.task_manager(*[], **{}
tools_awx_1     | Traceback (most recent call last):
tools_awx_1     |   File "/awx_devel/awx/main/dispatch/worker/task.py", line 102, in perform_work
tools_awx_1     |     result = self.run_callable(body)
tools_awx_1     |   File "/awx_devel/awx/main/dispatch/worker/task.py", line 77, in run_callable
tools_awx_1     |     return _call(*args, **kwargs)
tools_awx_1     |   File "/awx_devel/awx/main/scheduler/tasks.py", line 25, in task_manager
tools_awx_1     |     run_manager(TaskManager, "task")
tools_awx_1     |   File "/awx_devel/awx/main/scheduler/tasks.py", line 20, in run_manager
tools_awx_1     |     manager().schedule()
tools_awx_1     |   File "/awx_devel/awx/main/scheduler/task_manager.py", line 132, in schedule
tools_awx_1     |     self._schedule()
tools_awx_1     |   File "/awx_devel/awx/main/scheduler/task_manager.py", line 58, in inner
tools_awx_1     |     result = func(*args, **kwargs)
tools_awx_1     |   File "/awx_devel/awx/main/scheduler/task_manager.py", line 745, in _schedule
tools_awx_1     |     self.process_tasks()
tools_awx_1     |   File "/awx_devel/awx/main/scheduler/task_manager.py", line 714, in process_tasks
tools_awx_1     |     self.process_pending_tasks(pending_tasks)
tools_awx_1     |   File "/awx_devel/awx/main/scheduler/task_manager.py", line 58, in inner
tools_awx_1     |     result = func(*args, **kwargs)
tools_awx_1     |   File "/awx_devel/awx/main/scheduler/task_manager.py", line 635, in process_pending_tasks
tools_awx_1     |     self.start_task(task, self.controlplane_ig, task.get_jobs_fail_chain(), execution_instance)
tools_awx_1     |   File "/awx_devel/awx/main/scheduler/task_manager.py", line 58, in inner
tools_awx_1     |     result = func(*args, **kwargs)
tools_awx_1     |   File "/awx_devel/awx/main/scheduler/task_manager.py", line 563, in start_task
tools_awx_1     |     task_cls.apply_async(
tools_awx_1     |   File "/awx_devel/awx/main/dispatch/publish.py", line 97, in apply_async
tools_awx_1     |     with pg_bus_conn() as conn:
tools_awx_1     |   File "/usr/lib64/python3.9/contextlib.py", line 119, in __enter__
tools_awx_1     |     return next(self.gen)
tools_awx_1     |   File "/awx_devel/awx/main/dispatch/__init__.py", line 74, in pg_bus_conn
tools_awx_1     |     assert conn.autocommit, "Connection must be in autocommit mode."
tools_awx_1     | AssertionError: Connection must be in autocommit mode.

I'm starting to think it may be pretty dangerous to reuse the connection here. What are the implications of trying to reuse a connection that is not in autocommit mode? Is it safe to enable autocommit mode on an existing connection? I wouldn't think so. This code blowing up is in the task manager, which runs inside a transaction...

@shanemcd
Member Author

I put up that PR to revert the broken code ^

I get what you were trying to do there, but it seemed like it wasn't directly related to the thing you were trying to fix in #12573. Can we go ahead and fix this bug and deal with the number of database connections in #11745 or as a separate effort?

@AlanCoding
Member

What are the implications of trying to reuse a connection that is not in autocommit mode?

It's fine generally; it just happens to conflict with certain things that we do. For task notifiers there's no problem. For task listeners there are reasons it doesn't work for us:

  • The dispatcher main process listens for new tasks, but closes its db connection when forking (self.pool.up()). You obviously can't use the existing connection to listen persistently when you close it frequently. Because we need this listener to be as long-lived as possible, it is proper that we create a new connection (although it gives me concern about our error handling... that hasn't changed any).
  • When checking the status of a job (during a cancel) we do control-with-reply, and this doesn't work with autocommit=False, because the NOTIFY isn't delivered until the transaction commits, and we are waiting for the reply inside of the transaction (see the sketch below).
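To illustrate that second point with plain psycopg2 (the DSN and channel name are made up; this is just the Postgres behavior in question, not AWX code):

import psycopg2

conn = psycopg2.connect('dbname=awx')  # hypothetical DSN
conn.autocommit = False                # explicit transaction mode
with conn.cursor() as cur:
    cur.execute("NOTIFY dispatcher_events, 'are you running?'")
    # The notification is queued, not delivered, until this transaction commits.
    # If we block here waiting for a reply on another channel, it can never arrive,
    # because the dispatcher hasn't seen the request yet.
conn.commit()                          # only now do listeners receive the NOTIFY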

Is it safe to enable autocommit mode on an existing connection?

Absolutely not. We should never do that.
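As I understand it, psycopg2 won't even allow it while a transaction is in progress (a quick sketch with a made-up DSN):

import psycopg2

conn = psycopg2.connect('dbname=awx')  # hypothetical DSN; autocommit is False by default
with conn.cursor() as cur:
    cur.execute('SELECT 1')            # a transaction is now open
try:
    conn.autocommit = True             # attempting to flip autocommit mid-transaction
except psycopg2.Error as exc:
    print(exc)                         # psycopg2 refuses to change the session here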

This code blowing up is in the task manager, which runs inside a transaction...

Only because your diff re-inserted the check as an assert in pg_bus_conn, which the task manager hits when publishing tasks. I follow that.

Because I understand the purpose of the task manager, I know that it submits tasks but never uses this for listening over pg_notify. The only exception might be when it cancels jobs inside of a workflow... even then, it would be most correct to create a new connection and catch any Exception, as in #12749 (roughly the shape sketched below).
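A sketch of that shape, based on the pg_bus_conn signature visible in the diff above (not the actual code in #12749):

import logging

from awx.main.dispatch import pg_bus_conn

logger = logging.getLogger(__name__)

try:
    # A dedicated connection, so listening works even if the caller is inside a transaction.
    with pg_bus_conn(new_connection=True) as conn:
        ...  # control-with-reply / listening over pg_notify would happen here
except Exception:
    # Catch broadly so a dispatcher hiccup doesn't take down the task manager.
    logger.exception('could not reach dispatcher')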

Otherwise I feel pretty confident that this was a fairly isolated bug and would prefer to move ahead with #12769 as the fix.

@shanemcd
Member Author

It's fine generally, it just happens to conflict with certain things that we do. For task notifiers there's no problem

This seems to conflict with what it says here: https://www.psycopg.org/docs/advanced.html?highlight=autocommit

"Because of the way sessions interact with notifications (see NOTIFY documentation), you should keep the connection in autocommit mode if you wish to receive or send notifications in a timely manner."

Will pull your new code and play around with it.

@AlanCoding
Member

Acknowledged about the warning, and I'm ok with this, because our prior solution to this problem was to put the NOTIFY actions in Django on_commit callbacks, which was less timely than doing it inside the transaction. As a demonstration case, say the task manager takes 1 minute to run and a job is dispatched at the 30-second mark. It would take another 30 seconds before the NOTIFY is received by another client. That fails the doc's definition of a "timely manner", but it is what we want.
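To spell out the comparison (send_notify and the timings are stand-ins, not actual AWX function names):

from django.db import transaction

def send_notify(msg):  # stand-in for the real pg_notify publish
    print('NOTIFY', msg)

with transaction.atomic():
    # ... roughly 30 seconds of task manager work; a job gets scheduled here ...
    transaction.on_commit(lambda: send_notify('job is ready'))  # prior on_commit approach
    # ... roughly 30 more seconds of work ...
# The on_commit callback only fires here, after COMMIT, so another client sees the
# NOTIFY about 30 seconds after the job was actually dispatched.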

@AlanCoding
Copy link
Member

Linking fallout from the fix for this: #13017

Not a huge deal, but needs some tweaking.
