Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow nesting sync_to_async via asyncio.wait_for #367

Merged
merged 6 commits into from
Sep 18, 2023

Conversation

jthorniley
Copy link
Contributor

Change the order of fallbacks used by SyncToAsync to find the appropriate executor for sync code, so it prefers to use AsyncToSync.loop_thread_executors rather than thread_sensitive_context. Add test case to demonstrate problem.

Resolves #348

Copy link
Member

@andrewgodwin andrewgodwin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These kind of changes are incredibly hard to fully review, but it seems reasonable and also passes all the current tests (which are pretty good at catching bad behaviour!).

@carltongibson
Copy link
Member

carltongibson commented Jan 12, 2023

Hi @jthorniley

Resolves #348

I'm still looking at this. There are two issues it looks like: even with the patch here, @brownan's example from #348 still hits a deadlock.

@carltongibson
Copy link
Member

carltongibson commented Jan 12, 2023

The test-case doesn't quite match the behaviour of the example running under Daphne.

In the test the executors selected in SyncToAsync.call are:

<concurrent.futures.thread.ThreadPoolExecutor object at 0x1064a4730> # Wrapping middleware — this is _max_workers = 1
<asgiref.current_thread_executor.CurrentThreadExecutor object at 0x1064a0850> # in async view

Under Daphne we get the same concurrent.futures.thread.ThreadPoolExecutor selected twice (which causes the timeout/deadlock, depending on which option is selected.)

@jthorniley
Copy link
Contributor Author

Hi @carltongibson thanks for your comments, sorry I haven't had time to look at this for a while.

I've tried an alternative approach which I think makes more sense. Instead of changing how we look up the executor, set the reference to the current thread executor using the contextvars lib - since Python 3.7 is the minimum supported version this seems like its probably an option when it wasn't before.

The asgiref local has to be maintained across sync_to_async/async_to_sync conversions using the launch_map lookups, but these lookup are based on asyncio Task ID, which changes when you use asyncio.gather and friends, whereas contextvars is generally "just works" in asyncio code. Also, we already make the effort to transfer context to the new thread in AsyncToSync so there isn't really much extra logic to take care of.

I've tried this implementation with the example ASGI code in issue #348 - and it seems to work (GET request completes and the page renders "Hello world"). I've tried daphne and uvicorn and they both seem ok with this.

One problem is that one of the deadlock unit tests now fails, I think because there isn't a deadlock anymore! (It just actually works). I think its probably safe to just remove the test?

asgiref/sync.py Outdated Show resolved Hide resolved
@jthorniley
Copy link
Contributor Author

I'd comment that also, while this "fixes" the current thread executor issue, it also "breaks" asgiref.Local, which will not be maintained across a stack of sync_to_async/async_to_sync if theres a asyncio.wait_for etc somewhere in the middle. I'm not sure if thats really fixable except by either reimplementing Local with contextvars, or just deprecating it in favour of contextvars (Django seems to use it in a few fairly disparate places)

@andrewgodwin
Copy link
Member

The intention was always to implement Local via contextvars once we hit the minimum version to support it, so I wouldn't be against doing that.

@jthorniley
Copy link
Contributor Author

@andrewgodwin

The intention was always to implement Local via contextvars once we hit the minimum version to support it, so I wouldn't be against doing that.

That makes sense, contextvars is python 3.7 which seems to be now the min supported version, I did try to actually do that first and I have an implementation that seems to work, but I wasnt too happy with the way it handles deleting keys, see here:

https://github.com/django/asgiref/compare/main...jthorniley:asgiref:local-contextvar-draft?expand=1

(Comments in the diff, I won't repeat but tldr I'm worried I've made a memory leak)

@carltongibson
Copy link
Member

Hi @jthorniley — thanks for swinging back to this.

For life reasons, it's not likely I'll have capacity to look at this closely soon, so let me just comment quickly not to block:

One problem is that one of the deadlock unit tests now fails, I think because there isn't a deadlock anymore! (It just actually works). I think it's probably safe to just remove the test?

Sounds right 😅

The intention was always to implement Local via contextvars once we hit the minimum version to support it, so I wouldn't be against doing that.

That makes sense, contextvars is python 3.7 which seems to be now the min supported version...

This would be great IMO! I briefly glanced at this back last year, but didn't get to progress it. If you could push it through, super! 🎁

There's even a comment about this:

asgiref/asgiref/local.py

Lines 29 to 30 in 0357158

This doesn't use contextvars as it needs to support 3.6. Once it can support
3.7 only, we can then reimplement the storage more nicely.

@jthorniley
Copy link
Contributor Author

Thanks both for the comments. I've updated the PR to use contextvars as an implementation for Local, and reverted the logic that keeps track of the CurrentThreadExecutor to use that new Local implementation rather than a plain contextvar, as I found in testing that plain contextvars are not thread safe when you are passing them between threads as is the case here.

Anyway, I think this implementation is fairly solid from what testing I've managed to do.

Copy link
Member

@andrewgodwin andrewgodwin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You have no idea how delighted I am that this is net-deletion of code! From what I can see it looks good, and honestly I mostly place my faith in the tests for this sort of thing anyway.

If you think this is ready to merge, I will happily land it and we can get around to a new minor release of asgiref.

@andrewgodwin
Copy link
Member

Talking of tests... those and the lint failures will need fixing first!

@jthorniley
Copy link
Contributor Author

The test failure was a good catch - the state was not cleaned up properly so it only occurred on the full test run rather than running that test individually.

Lint issue also fixed.

Whether this is ready to merge... as you say its hard to assess this sort of thing beyond the tests. I'm tempted to wait until I've rolled out this branch at work on Monday just to give some extra confidence from a "real-world" test (I can't see what would go wrong, and its passed all our CI). I will update then.

@andrewgodwin
Copy link
Member

Alright, sounds like a good plan!

@andrewgodwin
Copy link
Member

@jthorniley Did you manage to get any testing of this in this week?

@jthorniley
Copy link
Contributor Author

@andrewgodwin yes and no I'm afraid... We rolled it out but had to roll back due to (unrelated to this change) infrastructure issues and haven't managed to redo this. So yes, this still looks fine still as far as I can see, but didn't get a proper test really.

@andrewgodwin
Copy link
Member

Ah, no problem - let's wait a bit longer to see if you can give it some more burn-in time. Hopefully the other issues are easy to fix!

@jthorniley
Copy link
Contributor Author

Ok, we've tried rolling out again and actually I think there are some issues with this branch so unfortunately it wouldn't be good to merge now.

Using this change (with a Django/ASGI app) we end up with too many database connections getting created, I think because they are held in a local by Django and the behaviour has changed in some unfortunate way (hard to track down exactly).

Additionally (trying to track down something more concrete) I found that running the current Django tests with this asgiref branch results in a failure:

./runtests.py asgi
Testing against Django installed in '/Users/james/repos/django/.venv/lib/python3.10/site-packages/django' with up to 8 processes
Found 16 test(s).
System check identified no issues (0 silenced).
..F.............
======================================================================
FAIL: test_concurrent_async_uses_multiple_thread_pools (asgi.tests.ASGITest)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/Users/james/repos/asgiref/asgiref/sync.py", line 235, in __call__
    return call_result.result()
  File "/Users/james/.pyenv/versions/3.10.7/lib/python3.10/concurrent/futures/_base.py", line 451, in result
    return self.__get_result()
  File "/Users/james/.pyenv/versions/3.10.7/lib/python3.10/concurrent/futures/_base.py", line 403, in __get_result
    raise self._exception
  File "/Users/james/repos/asgiref/asgiref/sync.py", line 297, in main_wrap
    result = await self.awaitable(*args, **kwargs)
  File "/Users/james/repos/django/tests/asgi/tests.py", line 339, in test_concurrent_async_uses_multiple_thread_pools
    self.assertEqual(response_start["status"], 200)
AssertionError: 500 != 200

----------------------------------------------------------------------
Ran 16 tests in 6.927s

FAILED (failures=1)

I still think this approach (changing to contextvars for the implementation of Local) makes sense, but these issues need to be resolved.

@andrewgodwin
Copy link
Member

Hmm, well that's annoying, but not entirely unexpected. I can only presume that the behaviour around synchronous contexts is different enough to make it break - contextvars inherit in a different way to what Local currently does.

@jthorniley
Copy link
Contributor Author

I think this is now a viable solution.

I'll just note that there's a separate issue #399 which is also directly affected by this (i.e. this essentially does that).

So this went stale as I got a bit stuck but I've had time to work on it, so I've rebased and addressed a few issues:

Local(thread_critical=True)

The behaviour of Local(thread_critical=True) has a critical subtlety which I've added a test for in test_local test_thread_critical_local_not_context_dependent_in_sync_thread()

Essentially, when Local(thread_critical=True) behaves more or less identically to threading.local() i.e. "global within the current thread", rather than simulating the contextvar behaviour of being local to a specific call stack. This is important because django uses a Local(thread_critical=True) to store the database connection reference at the point at which it is created.

In an Django ASGIHandler server, there will be one sync thread per request, and it will get a DB connection initialised local to that thread (typically) somewhere inside the async view code, wrapped in a sync_to_async. However, the connection is only closed using a callback attached to the request_finished signal. As a result, its important that the callbacks from the request_finished signal see the same Local() data as the inner sync_to_async code (provided its running in the same thread), otherwise DB connections never get closed.

Its easy to confirm that test_thread_critical_local_not_context_dependent_in_sync_thread() is just testing the existing behaviour of Local - i.e. it passes if copied onto current main branch.

ApplicationCommunicator

The change in behaviour of Local means that a Local is now preserved even when using asyncio.create_task- in the case of ApplicationCommunicator it was actually a benefit that it wasn't - this was the cause of the django test failure that I had above (now resolved).

When the django test code run an async def test_x function, it wraps it in async_to_sync and as a result any sync_to_async wrapped code inside that function will run in the original sync test thread.

However that means when using ApplicationCommunicator any sync_to_async wrapped code inside the ASGI handler will end up running in the single top-level test thread, which is really not what we want - ApplicationCommunicator should simulate the behaviour if you'd started the ASGI server straight from a "normal" async thread - i.e. the first sync_to_async inside an ASGI request should create its own Thread to run in.

The solution was to create a new contextvars.Context() inside ApplicationCommunicator so that the ASGI server doesn't try to run code in the top-level test thread.

@andrewgodwin
Copy link
Member

I really appreciate your work on this!

Looks like the tests are failing in CI, at least - different environment issue to you locally, or something else, you think?

@jthorniley
Copy link
Contributor Author

@andrewgodwin thanks for running it - yes just taking a quick look:

I think CI is running mypy which I didn't run locally, but I'm sure I can tidy that up.

Additionally used py3.11 locally - and the tests do work there in CI, i'm guessing some API changes between 3.10 - 3.11 that I didn't account for, I'll investigate that today.

@andrewgodwin
Copy link
Member

Thanks - looks great now. I'll merge it in, and hopefully when we make a release with this in we won't get any immediate screaming! (The test suite looks good and pretty decently covers this, but there's always the chance of uncaught side effects)

@andrewgodwin andrewgodwin merged commit d920c3c into django:main Sep 18, 2023
6 checks passed
@jthorniley
Copy link
Contributor Author

@andrewgodwin thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

sync_to_async calls hang / deadlock when called in a task
3 participants