Switch to httpx and reuse async client when possible #3000

Closed
wants to merge 5 commits into from

Conversation


@sarayourfriend commented Sep 8, 2023

Fixes

Fixes #2788 by @sarayourfriend

Description

Originally this PR set out to create a new get_aiohttp_session entrypoint for "safely" retrieving a reusable aiohttp.ClientSession object. It turns out this is significantly more complex than it appeared. In the course of discovering the underlying complexities of this change, I finally came to understand the event loop management issues I was having in the original Django ASGI proof-of-concept PR. The related nuances with async_to_sync are described in the new api.utils.httpx module.

The original implementation (which you can see by checking earlier commits) tried to implement the changes using aiohttp. However, aiohttp assumes that its client session will only be closed or garbage collected (run __del__) while the event loop in which the session was created is still available. That is not possible with our application's current implementation: async_to_sync creates a new event loop each time the wrapped function is called. That event loop is quickly disposed of, and any finalizers registered using weakref.finalize or through proxy objects (weakref.proxy(loop, finalize_session)) necessarily run after the loop is finalized. So while it is trivial for us to ensure that client sessions are appropriately finalized after the loop is gone (thereby ensuring that they don't unnecessarily hang around for the entire lifetime of the application), it is quite literally impossible to call close or __del__ on aiohttp.ClientSession without it raising an exception caused by the loop being closed. On top of this, the only way to stop aiohttp.ClientSession from causing a secondary issue with its explicit reference to the loop was to pass loop explicitly as an argument to ClientSession, wrapped in weakref.proxy. Otherwise, the session instance held on to the loop reference forever, preventing both the loop and the sessions themselves from being garbage collected.
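The timing problem above can be demonstrated without aiohttp at all. The following minimal sketch (illustrative only, not Openverse code) registers a finalizer against a short-lived event loop, the way async_to_sync's per-call loops behave, and shows that the finalizer can only ever run after the loop is already closed:

```python
import asyncio
import gc
import weakref

events = []


def finalize_session():
    # Stand-in for closing an aiohttp.ClientSession: by the time this
    # finalizer runs, the loop that "owned" the session is already closed,
    # so any cleanup that needs the loop would raise.
    events.append("finalizer ran (loop already closed)")


def run_once():
    # Mimics async_to_sync: create a fresh loop, run something, dispose of it.
    loop = asyncio.new_event_loop()
    weakref.finalize(loop, finalize_session)
    loop.run_until_complete(asyncio.sleep(0))
    loop.close()
    events.append("loop closed")


run_once()
gc.collect()  # ensure the loop object is collected so the finalizer fires
print(events)  # "loop closed" always precedes the finalizer
```

Because the finalizer fires strictly after loop.close(), there is no window in which session cleanup could still use the loop.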

After struggling with various attempts to get aiohttp to work, I decided to see whether httpx's finalization methods similarly depend on the event loop still being available. It turns out they do. However, httpx's API is much more in line with requests' and a fair bit simpler, and I only figured out the real solution after switching to httpx, so this PR switches us to httpx instead of aiohttp. I can undo this change if we want, as it is not strictly necessary to implement the required changes.

httpx itself doesn't reference the loop, but httpcore, the library httpx is built on, relies on anyio to create the connection pools and anyio's code does hold on to the loop. After struggling with various approaches I thought could work, I finally landed on the realisation that it wasn't going to be possible to make this work in a magical, transparent way that I originally wanted. Instead, I decided to create a wrapper around async_to_sync that explicitly manages the httpx client (and will be extended to manage other resources like the async Redis client we'll use when we implement #2789) and closes it before the async_to_sync created loop closes. There is literally no other way I could find to ensure that the httpx client is closed before the loop closes.

That leaves us with the new api.utils.httpx module, its get_httpx_client function, and the api.utils.asgiref.async_to_sync wrapper around the asgiref.sync.async_to_sync function. There are complexities here and while right now it does not necessarily provide us with the benefits of a reusable client (connection pooling), it will do so in the future. As we move more of our stack over to async routes (which Django runs inside an async_to_sync call in WSGI, or just directly once the app is running under ASGI), we will start actually sharing the client in some cases. For example, under WSGI if an entire view handler is asynchronous and doesn't have async_to_sync calls nested inside sync_to_async, every call to get_httpx_client will happen within the same event loop and will therefore be able to share the client. Under ASGI, any async code running outside a nested sync_to_async -> async_to_sync context will reuse the parent loop and therefore the session as well. In the async route under WSGI case, however, we need to force Django to use our version of async_to_sync so that the resources are correctly managed. That will be handled in #2789.
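The sharing behaviour described above can be sketched as a cache keyed by the running event loop: calls within the same loop get the same client, while the async_to_sync wrapper closes the loop's client just before the loop itself closes. This is a hypothetical sketch modelled on the PR description, using a stand-in class instead of httpx.AsyncClient so it has no third-party dependency; names like get_httpx_client follow the description but the real implementation may differ:

```python
import asyncio

# Cache of one client per event loop. Keying on id(loop) is a
# simplification for this sketch; real code would need to guard
# against id reuse after a loop is garbage collected.
_clients = {}


class FakeClient:
    """Stand-in for httpx.AsyncClient."""

    def __init__(self):
        self.closed = False

    async def aclose(self):
        self.closed = True


def get_httpx_client() -> FakeClient:
    # Calls made within the same running loop share one client;
    # each async_to_sync-created loop gets its own.
    key = id(asyncio.get_running_loop())
    if key not in _clients:
        _clients[key] = FakeClient()
    return _clients[key]


async def close_client_for_current_loop():
    # Called by the async_to_sync wrapper before its loop closes,
    # so the client is closed while the loop is still available.
    key = id(asyncio.get_running_loop())
    client = _clients.pop(key, None)
    if client is not None:
        await client.aclose()


async def demo():
    a = get_httpx_client()
    b = get_httpx_client()
    shared = a is b  # same loop -> same client instance
    await close_client_for_current_loop()
    return shared, a.closed


shared, closed = asyncio.run(demo())
print(shared, closed)
```

The key point is that cleanup runs inside the loop, which is the only ordering that avoids the "loop is closed" exceptions described earlier.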

That is to say: the current approach does not introduce more overhead (other than that of the call to get_httpx_client itself, but this should be trivial), and over time will result in less overhead due to the wider ability to share the client.

An example of how we'll get a benefit from this even before switching to ASGI in #2790 is in #2789. The issue to convert the image proxy route to async under WSGI will result in a shared async context for that entire route. That route sometimes requires multiple HTTP requests (HEAD to get the file type if we don't have it yet, GET always to get the proxied image). In the cases where it would make more than one HTTP request, all requests will share the same client.

Implementation detail description

There are three main parts to the core implementation in this PR:

  1. OpenverseWSGIHandler. This creates a shutdown hook to which our application can register callbacks. The shutdown hook is activated on worker abort and exit, facilitated through the worker hook configuration added in gunicorn.conf.py. Registered callbacks are themselves referenced weakly and those reference objects are automatically cleaned up, meaning there is no long-term memory overhead to registering a shutdown callback. ASGI directly supports shutdown callbacks, so once we switch to an ASGI handler, we'll just need to adapt the calling of the shutdown handlers to match the ASGI lifecycle approach, as opposed to relying on Gunicorn worker lifecycle methods. This isn't strictly necessary now, but once the app is converted to ASGI and everything runs inside a single event loop, it will indeed be necessary to ensure that shared clients like HTTPx's and others are closed before the application shuts down.
  2. get_httpx_client. This is the part that enables sharing the client. The module documentation describes the rationale behind the implementation, in particular the relevant details of async_to_sync.
  3. api.utils.asgiref.async_to_sync. This is the part that prevents sharing the client from throwing event loop closure errors on application shutdown. It does this by explicitly opening and closing the client for the entire lifetime of the event loop.
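The shutdown hook described in point 1 can be sketched as follows. This is illustrative only (class and method names are assumptions, not the actual OpenverseWSGIHandler code): callbacks are held via weakref.ref, so registering one adds no long-term memory overhead, and entries whose owners have been garbage collected are simply skipped at shutdown:

```python
import weakref


class ShutdownHook:
    """Sketch of a worker shutdown hook with weakly-referenced callbacks."""

    def __init__(self):
        self._callbacks = []

    def register(self, callback):
        # Hold only a weak reference; if the callback's owner goes away,
        # the entry silently becomes dead and is skipped at shutdown.
        self._callbacks.append(weakref.ref(callback))

    def run(self):
        # Invoked from the gunicorn worker exit/abort hooks (or, under
        # ASGI, from the lifespan shutdown event).
        for ref in self._callbacks:
            callback = ref()
            if callback is not None:
                callback()
        self._callbacks.clear()


# Usage: register a cleanup callback, then fire the hook at shutdown.
calls = []


def close_client():
    calls.append("closed")


hook = ShutdownHook()
hook.register(close_client)
hook.run()
print(calls)
```

Note that plain weakref.ref is suitable for module-level functions; weakly referencing bound methods would need weakref.WeakMethod instead.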

Of course, there are also relatively minor changes to switch the code from aiohttp's API to httpx's.

Testing Instructions

Unit tests should all pass. Check out the branch locally and run just build web to get the updated dependencies. Then run just down -v && just api/up && just api/init, and make search requests locally with filter_dead=True to force dead link filtering to make requests. Everything should work. You will see new info-level logs referencing client creation and finalization. We can remove these info logs before merging, if preferred, or demote them to debug level.

When testing, make sure you test distinct searches rather than just refreshing. When you refresh the same search, the dead link cache prevents any actual dead link check requests from being made. In that case, the full check_dead_links path with the updated httpx-based code is not being fully exercised! Alternatively, replace _get_cached_statuses's implementation with the following:

def _get_cached_statuses(redis, image_urls):
    return [None for _ in range(len(image_urls))]

That causes the cache to be bypassed and checks to be made even if results are already cached locally.

Checklist

  • My pull request has a descriptive title (not a vague title like Update index.md).
  • My pull request targets the default branch of the repository (main) or a parent feature branch.
  • My commit messages follow best practices.
  • My code follows the established code style of the repository.
  • I added or updated tests for the changes I made (if applicable).
  • I added or updated documentation (if applicable).
  • I tried running the project locally and verified that there are no visible errors.
  • [N/A] I ran the DAG documentation generator (if applicable).

Developer Certificate of Origin

Developer Certificate of Origin
Version 1.1

Copyright (C) 2004, 2006 The Linux Foundation and its contributors.
1 Letterman Drive
Suite D4700
San Francisco, CA, 94129

Everyone is permitted to copy and distribute verbatim copies of this
license document, but changing it is not allowed.


Developer's Certificate of Origin 1.1

By making a contribution to this project, I certify that:

(a) The contribution was created in whole or in part by me and I
    have the right to submit it under the open source license
    indicated in the file; or

(b) The contribution is based upon previous work that, to the best
    of my knowledge, is covered under an appropriate open source
    license and I have the right under that license to submit that
    work with modifications, whether created in whole or in part
    by me, under the same open source license (unless I am
    permitted to submit under a different license), as indicated
    in the file; or

(c) The contribution was provided directly to me by some other
    person who certified (a), (b) or (c) and I have not modified
    it.

(d) I understand and agree that this project and the contribution
    are public and that a record of the contribution (including all
    personal information I submit with it, including my sign-off) is
    maintained indefinitely and may be redistributed consistent with
    this project or the open source license(s) involved.

@github-actions github-actions bot added the 🧱 stack: api Related to the Django API label Sep 8, 2023
@openverse-bot openverse-bot added 🟧 priority: high Stalls work on the project or its dependents 💻 aspect: code Concerns the software code in the repository 🧰 goal: internal improvement Improvement that benefits maintainers, not users labels Sep 8, 2023
@sarayourfriend commented:

It turns out httpx was not a solution for this problem 😢 I'm not sure under what circumstances I had it working locally (I was watching it finalize the clients and the shutdown handler refs in the logs), but now we have roughly the same problem there was with aiohttp under this approach: on application shutdown, each client tries to close itself and raises an unhandled exception about the loop being closed.

I wonder if we can just ignore these. I can't keep digging into this now though, heading out for the weekend. I'll return to this Monday and see if I can sort out a good approach.

@sarayourfriend sarayourfriend force-pushed the refactor/reuse-aiohttp-client-session branch from 2f998d3 to f54556f Compare September 11, 2023 04:41
@sarayourfriend commented:

Okay, it turns out we need to use aiohttp unless we want to re-write unit tests that rely on pook. Pook doesn't have support for httpx (h2non/pook#71) and it's not worth making the change for that, especially if it doesn't actually help fix anything.

However, before doing that, I'd like to try running the app under ASGI and seeing if the client-sharing issues go away. The thinking here being that if the app is running under ASGI, even if every request is still actually synchronous, at least async_to_sync will pick up on the event loop everything is running inside of? My confidence in that hypothesis is like ~50% but I think it is worth a shot to see what happens. If that is the case, then I'll propose changing the order of implementation of the remaining tickets for the Django ASGI conversion milestone.

@sarayourfriend commented:

Okay, indeed, when running the application under uvicorn, get_httpx_client re-uses the same client on every request.

Oddly, that is not the case when running the app under gunicorn with uvicorn workers. I think that's the issue I ended up running into in the original proof-of-concept PR. Now that I understand the problem better, I feel more comfortable digging into what's going on.

@sarayourfriend sarayourfriend mentioned this pull request Sep 11, 2023
6 tasks
@sarayourfriend commented Sep 11, 2023

Closing in favour of #3011. If we can't take that approach for some reason, we can revisit the changes in this PR in a new PR with the new understanding of the situation.

Successfully merging this pull request may close these issues.

Use a shared aiohttp.ClientSession rather than creating a new one per-request