Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Connection hangs forever if the WebSocket server just sends a message and close #7

Closed
frankie567 opened this issue Nov 24, 2022 · 25 comments
Labels
bug Something isn't working help wanted Extra attention is needed

Comments

@frankie567
Copy link
Owner

frankie567 commented Nov 24, 2022

Describe the bug

Connection hangs forever if the WebSocket server just sends a message and then close.

The client waits forever for a message, missing the only message that was sent by the server.

To Reproduce

Server code

import uvicorn
from starlette.applications import Starlette
from starlette.websockets import WebSocket
from starlette.routing import WebSocketRoute


async def websocket_endpoint(websocket: WebSocket):
    await websocket.accept()
    await websocket.send_text("FOO")
    await websocket.close()


routes = [
    WebSocketRoute("/ws", endpoint=websocket_endpoint),
]

app = Starlette(debug=True, routes=routes)

if __name__ == "__main__":
    uvicorn.run(app)

Client code

import httpx
from httpx_ws import connect_ws


def my_client():
    with httpx.Client() as client:
        with connect_ws("http://localhost:8000/ws", client) as ws:
            text = ws.receive_text()
            return text


if __name__ == "__main__":
    print(my_client())

Expected behavior

The client should be able to read the message and close properly.

Configuration

  • Python version: 3.7+
  • httpx-ws version: 0.1.0
  • httpx version: 0.23.1

Additional context

  • Doesn't happen if we add an asyncio.sleep or a receive_* operation on the server side
  • Doesn't happen with websockets library
@frankie567 frankie567 added the bug Something isn't working label Nov 24, 2022
@frankie567 frankie567 changed the title Errors in unit tests if the server WS doesn't wait a bit Connection hangs forever if the WebSocket server just sends a message and close Nov 24, 2022
@frankie567 frankie567 added the help wanted Extra attention is needed label Nov 24, 2022
@frankie567
Copy link
Owner Author

Wasted 1.5 days on this 🥹 If anyone has insights/clues about this, it would be absolutely welcome.

@kousikmitra
Copy link
Contributor

kousikmitra commented Dec 3, 2022

@frankie567 was looking into this. checked with a custom ws server impl could not reproduce. is it possible this is happening only with the test server?

@frankie567
Copy link
Owner Author

frankie567 commented Dec 4, 2022

Yes, highly possible. Actually, if you're looking at the test setup, I'm testing against the two WebSockets implementations of Uvicorn, wsproto and websockets:

@pytest.fixture(params=("wsproto", "websockets"))
def websocket_implementation(request) -> Literal["wsproto", "websockets"]:
return request.param

def create_server(app: Starlette, socket: str):
config = uvicorn.Config(app, uds=socket, ws=websocket_implementation)
return uvicorn.Server(config)

If I remember correctly, the issue arises only with wsproto implementation, not websockets.

The thing is, I'm even not really sure this problem has consequences in a real context other than unit tests 🤔

@tomchristie
Copy link

Data point: I couldn't replicate this as an issue.

Used the example given above, and ran the server both with uvicorn example:app --ws wsproto and uvicorn example:app --ws websockets. Are you able to replicate this stand-alone, or do we think it's an artifact inside the test cases?

@frankie567
Copy link
Owner Author

frankie567 commented Dec 5, 2022

Thank you for having a look into this, Tom! Really appreciate it 🙂

Actually, yes, I was able to reproduce it quite consistently on my local macOS or even a Linux VM on GitHub codespaces. Worth noting that it happens only with wsproto implementation of Uvicorn.

Here is the Uvicorn log with wsproto (the client is hanging forever):

$ uvicorn server:app --ws wsproto --log-level trace
INFO:     Started server process [13670]
INFO:     Waiting for application startup.
TRACE:    ASGI [1] Started scope={'type': 'lifespan', 'asgi': {'version': '3.0', 'spec_version': '2.0'}}
TRACE:    ASGI [1] Receive {'type': 'lifespan.startup'}
TRACE:    ASGI [1] Send {'type': 'lifespan.startup.complete'}
INFO:     Application startup complete.
INFO:     Uvicorn running on http://127.0.0.1:8000 (Press CTRL+C to quit)
TRACE:    127.0.0.1:45048 - HTTP connection made
TRACE:    127.0.0.1:45048 - Upgrading to WebSocket
TRACE:    127.0.0.1:45048 - WebSocket connection made
TRACE:    127.0.0.1:45048 - ASGI [2] Started scope={'type': 'websocket', 'asgi': {'version': '3.0', 'spec_version': '2.3'}, 'http_version': '1.1', 'scheme': 'ws', 'server': ('127.0.0.1', 8000), 'client': ('127.0.0.1', 45048), 'root_path': '', 'path': '/ws', 'raw_path': b'/ws', 'query_string': b'', 'headers': '<...>', 'subprotocols': []}
TRACE:    127.0.0.1:45048 - ASGI [2] Receive {'type': 'websocket.connect'}
TRACE:    127.0.0.1:45048 - ASGI [2] Send {'type': 'websocket.accept', 'subprotocol': None, 'headers': '<...>'}
INFO:     ('127.0.0.1', 45048) - "WebSocket /ws" [accepted]
TRACE:    127.0.0.1:45048 - ASGI [2] Send {'type': 'websocket.send', 'text': '<3 chars>'}
TRACE:    127.0.0.1:45048 - ASGI [2] Send {'type': 'websocket.close', 'code': 1000, 'reason': ''}
TRACE:    127.0.0.1:45048 - ASGI [2] Completed
TRACE:    127.0.0.1:45048 - WebSocket connection lost

And here it is with websockets (the client displays the message and close properly):

$ uvicorn server:app --ws websockets --log-level trace
INFO:     Started server process [20730]
INFO:     Waiting for application startup.
TRACE:    ASGI [1] Started scope={'type': 'lifespan', 'asgi': {'version': '3.0', 'spec_version': '2.0'}}
TRACE:    ASGI [1] Receive {'type': 'lifespan.startup'}
TRACE:    ASGI [1] Send {'type': 'lifespan.startup.complete'}
INFO:     Application startup complete.
INFO:     Uvicorn running on http://127.0.0.1:8000 (Press CTRL+C to quit)
TRACE:    127.0.0.1:37170 - HTTP connection made
TRACE:    127.0.0.1:37170 - Upgrading to WebSocket
DEBUG:    = connection is CONNECTING
TRACE:    127.0.0.1:37170 - WebSocket connection made
DEBUG:    < GET /ws HTTP/1.1
DEBUG:    < host: localhost:8000
DEBUG:    < accept: */*
DEBUG:    < accept-encoding: gzip, deflate
DEBUG:    < user-agent: python-httpx/0.23.1
DEBUG:    < connection: upgrade
DEBUG:    < upgrade: websocket
DEBUG:    < sec-websocket-key: xlgjFww4etMG4awUv2d4dA==
DEBUG:    < sec-websocket-version: 13
TRACE:    127.0.0.1:37170 - ASGI [2] Started scope={'type': 'websocket', 'asgi': {'version': '3.0', 'spec_version': '2.3'}, 'http_version': '1.1', 'scheme': 'ws', 'server': ('127.0.0.1', 8000), 'client': ('127.0.0.1', 37170), 'root_path': '', 'path': '/ws', 'raw_path': b'/ws', 'query_string': b'', 'headers': '<...>', 'subprotocols': []}
TRACE:    127.0.0.1:37170 - ASGI [2] Receive {'type': 'websocket.connect'}
TRACE:    127.0.0.1:37170 - ASGI [2] Send {'type': 'websocket.accept', 'subprotocol': None, 'headers': '<...>'}
INFO:     ('127.0.0.1', 37170) - "WebSocket /ws" [accepted]
TRACE:    127.0.0.1:37170 - ASGI [2] Send {'type': 'websocket.send', 'text': '<3 chars>'}
DEBUG:    > HTTP/1.1 101 Switching Protocols
DEBUG:    > Upgrade: websocket
DEBUG:    > Connection: Upgrade
DEBUG:    > Sec-WebSocket-Accept: ByUvCi3KAKmU1nBfsj34T/RUgbQ=
DEBUG:    > date: Mon, 05 Dec 2022 12:37:51 GMT
DEBUG:    > server: uvicorn
INFO:     connection open
DEBUG:    = connection is OPEN
DEBUG:    > TEXT 'FOO' [3 bytes]
TRACE:    127.0.0.1:37170 - ASGI [2] Send {'type': 'websocket.close', 'code': 1000, 'reason': ''}
DEBUG:    = connection is CLOSING
DEBUG:    > CLOSE 1000 (OK) [2 bytes]
DEBUG:    < CLOSE 1000 (OK) [2 bytes]
DEBUG:    x half-closing TCP connection
TRACE:    127.0.0.1:37170 - WebSocket connection lost
DEBUG:    = connection is CLOSED
TRACE:    127.0.0.1:37170 - ASGI [2] Completed
INFO:     connection closed

EDIT: After trying other things, it seems it happens only when running Uvicorn with wsproto and uvloop. If I force to use asyncio loop, it works 😵‍💫

$ uvicorn server:app --ws wsproto --log-level trace --loop asyncio
INFO:     Started server process [17282]
INFO:     Waiting for application startup.
TRACE:    ASGI [1] Started scope={'type': 'lifespan', 'asgi': {'version': '3.0', 'spec_version': '2.0'}}
TRACE:    ASGI [1] Receive {'type': 'lifespan.startup'}
TRACE:    ASGI [1] Send {'type': 'lifespan.startup.complete'}
INFO:     Application startup complete.
INFO:     Uvicorn running on http://127.0.0.1:8000 (Press CTRL+C to quit)
TRACE:    127.0.0.1:56500 - HTTP connection made
TRACE:    127.0.0.1:56500 - Upgrading to WebSocket
TRACE:    127.0.0.1:56500 - WebSocket connection made
TRACE:    127.0.0.1:56500 - ASGI [2] Started scope={'type': 'websocket', 'asgi': {'version': '3.0', 'spec_version': '2.3'}, 'http_version': '1.1', 'scheme': 'ws', 'server': ('127.0.0.1', 8000), 'client': ('127.0.0.1', 56500), 'root_path': '', 'path': '/ws', 'raw_path': b'/ws', 'query_string': b'', 'headers': '<...>', 'subprotocols': []}
TRACE:    127.0.0.1:56500 - ASGI [2] Receive {'type': 'websocket.connect'}
TRACE:    127.0.0.1:56500 - ASGI [2] Send {'type': 'websocket.accept', 'subprotocol': None, 'headers': '<...>'}
INFO:     ('127.0.0.1', 56500) - "WebSocket /ws" [accepted]
TRACE:    127.0.0.1:56500 - ASGI [2] Send {'type': 'websocket.send', 'text': '<3 chars>'}
TRACE:    127.0.0.1:56500 - ASGI [2] Send {'type': 'websocket.close', 'code': 1000, 'reason': ''}
TRACE:    127.0.0.1:56500 - ASGI [2] Completed
TRACE:    127.0.0.1:56500 - WebSocket connection lost

So it seems we hit weird asyncio bugs because of implementation differences? 🤔

@tomchristie
Copy link

it seems it happens only when running Uvicorn with wsproto and uvloop.

Ug. Yes - confirmed.
After pip install uvloop I was able to reproduce this.

It looks to me like the issue is something to do with the background recieve handling.

I'll give my own perspective on this, which you might or might not find useful...

  • I would start this package off without any ping/pong behaviour at all. "Minimal Viable Product" and all that.
  • Once I was happy that was all working correctly I'd add in ping/pong, but rather than having a background flow control running, I'd hook it into receive() calls, so that receiving data is what triggers ping/pong. You'd have a loop, and inside that you'd have a timeout on the data read operations, so that the control flow can periodically send a ping. I think it might end up a bit simpler as a control flow mechanism, and doesn't require a new flow of control, or new thread in the sync case. You can probably avoid having to use any of the asyncio or threading libraries with this approach. You've got a "timeout" on the network .read() which is enough.

Anyways, don't feel like you need to act on the above, just sharing a different approach, which avoids having to try to get clever with asyncio and threading.

@frankie567
Copy link
Owner Author

Thanks for the insights, Tom!

I'll think about you suggest. Actually, I had an implementation that was approximately like this at some point, but wasn't sure if it was the best thing to do in terms of developer experience. Maybe I should reconsider!

@tomchristie
Copy link

I had an implementation that was approximately like this at some point

Ah yeah, that looks pretty neat.

Could be that just leaving ping/pongs up to the server to initiate would be perfectly okay in practice. 🤷‍♂️

@frankie567
Copy link
Owner Author

Maybe 🤔

Hello, @Kludex 👋 Sorry to summon you like this, but I remember that you asked for the .ping method, so the client could initiate a ping. Did you have a specific use-case in mind? 🙂

@Kludex
Copy link

Kludex commented Dec 5, 2022

At that point... I just wanted to use a simple ping. 🤔

@skewty
Copy link

skewty commented Aug 30, 2023

Was any progress made on this? Have this library's maintainers made a decision to try something like:

I had an implementation that was approximately like this at some point, but wasn't sure if it was the best thing to do

in order to try and get this issue resolved?

@frankie567
Copy link
Owner Author

@skewty Do you experience this behavior in a "real" use-case (i.e. outside an automated unit test)?

@skewty
Copy link

skewty commented Sep 1, 2023

I may have but I refactored so much other code I can't be certain anymore. Also can't reproduce in production so easily anymore.

Long story short.. Only one of two servers that communicated together over websocket was on UPS power.. oops! (not our fault / scope luckily) During a power outage, one server went down (obviously), when it came up again the websocket connection never recovered. That server now plugs into UPS power. Additionally, given the timeline I switched to "websockets" library.

To be clear I can't say for certain it was http-ws' fault as another issue was identified in the code as well.

@MtkN1
Copy link

MtkN1 commented Feb 7, 2024

I've identified the root cause of this issue.

By trying out the fix branch I forked for httpx-ws (https://github.com/MtkN1/httpx-ws/tree/patch-trailing-data), the reproducible code (#7 (comment)) no longer hangs!

You can install it with:

pip install git+https://github.com/MtkN1/httpx-ws.git@patch-trailing-data

Here's the diff for the fix branch: main...MtkN1:httpx-ws:patch-trailing-data


The reason the reproducible code hangs is that httpx-ws does not handle trailing data.

diff --git a/httpx_ws/_api.py b/httpx_ws/_api.py
index ded6b87..1c186cc 100644
--- a/httpx_ws/_api.py
+++ b/httpx_ws/_api.py
@@ -105,9 +105,10 @@ class WebSocketSession:
             float
         ] = DEFAULT_KEEPALIVE_PING_TIMEOUT_SECONDS,
         subprotocol: typing.Optional[str] = None,
+        trailing_data: bytes = b"",
     ) -> None:
         self.stream = stream
-        self.connection = wsproto.connection.Connection(wsproto.ConnectionType.CLIENT)
+        self.connection = wsproto.connection.Connection(wsproto.ConnectionType.CLIENT, trailing_data=trailing_data)
         self.subprotocol = subprotocol

         self._events: queue.Queue[
@@ -1054,6 +1055,15 @@ def _connect_ws(
         if response.status_code != 101:
             raise WebSocketUpgradeError(response)

+        (
+            trailing_data,
+            _,
+        ) = (
+            response.stream._stream._httpcore_stream._status.connection._connection._h11_state.trailing_data
+        )
+        if trailing_data:
+            print("trailing_data:", trailing_data)
+
         subprotocol = response.headers.get("sec-websocket-protocol")

         session = WebSocketSession(
@@ -1063,6 +1073,7 @@ def _connect_ws(
             keepalive_ping_interval_seconds=keepalive_ping_interval_seconds,
             keepalive_ping_timeout_seconds=keepalive_ping_timeout_seconds,
             subprotocol=subprotocol,
+            trailing_data=trailing_data,
         )
         yield session
         session.close()

It seems in the server code with uvicorn and uvloop, all three byte sequences, "Websocket accept", "FOO text frame", and "Close frame", are synchronously sent over the network.

On the client side, while "Websocket accept" is processed as a response header, the remaining "FOO text frame" and "Close frame" are treated as trailing data. Since the current httpx-ws implementation doesn't handle trailing data, ws.receive_text() hangs without processing these bytes.

Without uvloop, it's likely that the three byte sequences are sent asynchronously, which might explain why the current httpx-ws implementation works.

As the fix code shows, h11 and wsproto have implementations for trailing data. Here are their references:

h11: https://h11.readthedocs.io/en/latest/api.html?highlight=trailing_data#h11.Connection.trailing_data
wsproto: https://python-hyper.org/projects/wsproto/en/stable/_modules/wsproto/connection.html?highlight=trailing_data

This issue aligns with what I wrote in the HTTPCore discussion.
encode/httpcore#871

Additionally, the private API mentioned in the fix code (response.stream._stream._httpcore_stream._status.connection._connection._h11_state.trailing_data) is too long, so I'm considering making it available as a public API.
encode/httpcore#872

Next steps:

  1. If this temporary private API is acceptable, I'll remove unnecessary prints from my fix branch and create a pull request.
  2. Wait for the resolution of the public API issue (Support connection Upgrade and CONNECT. encode/httpcore#872) .

@tomchristie
Copy link

Wait for the resolution of the public API issue

This appears to be the best course of action from my POV, except... "wait for". The pull request won't write itself.

I'm in a position to guide someone through implementing and resolving this. I'd suggest starting with a draft pull request in httpcore that (incorrectly) raises an exception whenever a 101 Switching Protocols response or 2xx in response to a CONNECT occur would be a starting point. From there I can help work through how we fix up the network stream to include the trailing data.

https://github.com/encode/httpcore/blob/master/httpcore/_async/http11.py#L118

@frankie567
Copy link
Owner Author

Thank you @MtkN1 for the investigation and detailed explanations! That's very impressive and appreciated 🙏 I wasn't aware of this trailing_data mechanism. I'm of course OK to review a PR for this, even with the private API.


@tomchristie If you're open to this, I'm ready to pledge money via Polar to help solve encode/httpcore#872.

@MtkN1
Copy link

MtkN1 commented Feb 7, 2024

Thanks for the reactions!

I'll try to work on encode/httpcore#872 when I have some time 🙂
Subjectively, I don't feel any urgency with this issue in httpx-ws, so I'll refrain from creating a PR for the Private API.

@tomchristie
Copy link

@frankie567 - That's interesting thanks. We've not committed to Polar yet, but I do like the work you're/they're doing.
I've enabled pledges on that issue.

@MtkN1
Copy link

MtkN1 commented Feb 21, 2024

The fix for encode/httpcore#872 has been merged! 🎉

The trailing data issue I described is now being received via stream.read(). There is no need to fix httpx-ws. I've confirmed that using the master branch of httpcore resolves the problem!

@frankie567
Copy link
Owner Author

That's truly awesome @MtkN1, thanks 🙏

I'll wait for the next release of httpcore before closing this 🙂 I also need to remove the "sleep" workaround in unit tests.

@Kludex
Copy link

Kludex commented Feb 21, 2024

httpcore release will happen today: encode/httpcore#892.

@frankie567
Copy link
Owner Author

Thank you all for your help 🙏

@all-contributors
add @Kludex for bug, research
add @tomchristie for bug, research
add @MtkN1 for bug, research

@frankie567
Copy link
Owner Author

@all-contributors
please add @tomchristie for bug, research.
please add @MtkN1 for bug, research.

Annoying bot...

Copy link
Contributor

@frankie567

@tomchristie already contributed before to bug, research

We had trouble processing your request. Please try again later.

Copy link
Contributor

@frankie567

I couldn't determine any contributions to add, did you specify any contributions?
Please make sure to use valid contribution names.

I've put up a pull request to add @Kludex! 🎉

I've put up a pull request to add @tomchristie! 🎉

I've put up a pull request to add @MtkN1! 🎉

frankie567 added a commit that referenced this issue Feb 22, 2024
Bug fixes and improvements
--------------------------

* Always disable automatic keepalive ping when using ASGI transport. Thanks @dmontagu and @Kludex 🎉
* Bump dependencies:
    * `httpcore>=1.0.4` - Solves #7, thanks to @tomchristie and @MtkN1 🎉
T-256 pushed a commit to T-256/httpx-ws that referenced this issue Jun 5, 2024
Bug fixes and improvements
--------------------------

* Always disable automatic keepalive ping when using ASGI transport. Thanks @dmontagu and @Kludex 🎉
* Bump dependencies:
    * `httpcore>=1.0.4` - Solves frankie567#7, thanks to @tomchristie and @MtkN1 🎉
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working help wanted Extra attention is needed
Projects
None yet
Development

No branches or pull requests

6 participants