Intermittent issue receiving server events on client when switching cellular network #417
Some more information from a recently broken connection after swapping between 4G/5G many, many times. I added a watch to the chsk-state atom to conj its state over time, ordered oldest first:
Here is the series of events from monitoring
@Naomarik Hi Omar, thanks for the detailed info! I haven't had the opportunity to look at this closely yet, but I'd like to confirm my understanding of what you've shared so far. Is the following an accurate description of an example of the behaviour you're seeing?
My questions:
There are at least a few different things that could be going on, and the above info should help lead us to the right general area.
Points 1-4 are accurate, except that client->server communication DOES continue working once in the broken state.
Some additional notes
So, to summarize: what seems to be broken in particular is the server's
Quick update:
Thanks a lot for the extra info; I'm busy going through what you've shared here and on Telegram. A work-in-progress summary so far (will keep updated):
While in broken state:
So this is happening easily just by switching IP. Connection on the first IP; notice the websocket peer port is 49342.
Browser tab is still open; the computer's internet switches from phone tethering -> home wifi. Connection on the second IP; notice the websocket peer port is 54406.
Disconnection from the client a bit over a minute later, with the peer port from the first IP, 49342.
This nils the sch in the [sch udt] vector in the `conns_` map.
Excellent summary and debugging work by @Naomarik:

1. Client connects to the server with a unique client ID. That ID is used as the primary key identifying that specific browser tab. The websocket channel (or sch in sente code) and udt are the key's value, so we have `{<client-id> [<sch> <udt>]}`.
2. The websocket channel maps onto a port on the server, of which 65,535 are available. So each connection's websocket channel has a distinct port assigned to it.
3. When the IP address changes suddenly, no immediate on-close event gets sent. Instead the connection times out after about 60 seconds.
4. At some point in the future, client-side sente reconnects with the SAME client ID as before, but from a new IP address. Now we have `{<client-id> [<new-sch> <udt>]}`.
5. When the timeout for the original connection occurs, the client ID gets dissoc'd, removing the reconnected websocket channel and any hope for the server to initiate contact with its beloved client; the client->server direction still works fine.
6. Later on, upd-conn! occurs to update the udt in the first arity, so there is a `{<client-id> [nil <udt>]}` entry again.

I'll prepare a fix 👍
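The race described above can be modeled with a small sketch. (Python is used here purely as a neutral stand-in for sente's Clojure `conns_` atom; the `on_open`/`on_close_buggy` names are illustrative, not sente's actual API.)

```python
# Model of sente's server-side conns_ map: {client-id: [sch, udt]}.
# sch = server channel (None once closed), udt = last-update timestamp.
conns = {}

def on_open(client_id, sch, udt):
    # A (re)connect stores the new channel under the SAME client id.
    conns[client_id] = [sch, udt]

def on_close_buggy(client_id):
    # Pre-fix behaviour: unconditionally nil the channel for this
    # client id, even if a NEWER connection has since replaced it.
    if client_id in conns:
        conns[client_id][0] = None

# t0: first connection (IP1)
on_open("cid1", "ws-sch:ip1", 100)
# t3: client reconnects after IP change, same client id (IP2)
on_open("cid1", "ws-sch:ip2", 160)
# t4: delayed timeout/close for the ORIGINAL (IP1) connection fires...
on_close_buggy("cid1")
# ...and clobbers the live IP2 channel: server->client pushes now fail.
print(conns)  # {'cid1': [None, 160]}
```

The key point is that both connections key their state mutations by client ID alone, so the stale close handler cannot tell it is destroying someone else's channel.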
Will be fixed in forthcoming v1.18 👍 Leaving issue open until the fix is confirmed.
Fix has been confirmed by reporter, closing 👍
…Naomarik) Huge thanks to @Naomarik for reporting and diagnosing this issue!

BEFORE THIS COMMIT

The following scenario was possible:

t0. Client WebSocket connects to server with unique client ID.
    State in `conns_` atom: `{<client-id1> [<ws-sch:port-for-ip1> <udt>]}`.
t1. Client's IP changes (for example, by switching from wifi to cellular network).
t2. No `:on-close` event is triggered on server.
t3. New client WebSocket connects to server with the SAME client ID but new IP.
    State in `conns_` atom: `{<client-id1> [<ws-sch:port-for-ip2> <udt>]}`.
t4. Server-side connection timeout triggers for ORIGINAL connection (IP1).
    This unintentionally removes sch for the NEW connection (IP2).
    State in `conns_` atom: `{<client-id1> [nil <udt>]}`.
t5. At this point:
    - The client has a working WebSocket connection to server.
    - But the server `conns_` state is bad (has a nil sch for client).
    - Which means that server->client broadcasts all fail.
    - This broken state persists until if/when something causes the client to reconnect. But note that this may not happen anytime soon since the client believes (accurately) that it IS successfully connected.

IMPLEMENTATION DETAILS

Each connection to server establishes the following:
- An `:on-open` handler that modifies state in `conns_` for <client-id>
- An `:on-close` handler that modifies state in `conns_` for <client-id>

The bad behaviour occurs when Conn1's delayed `:on-close` triggers after Conn2's `:on-open`. Because Conn1 and Conn2 share the same client-id, they're mutating the same state.

AFTER THIS COMMIT

We introduce a simple compare-and-swap (CAS) mechanism in the `:on-close` handler so that its state mutations will noop if the current server-ch does not = the server-ch added by the corresponding `:on-open` handler. I.e. a given `:on-close` will now only remove the SAME server-ch added by its corresponding `:on-open`. In the example above, this means that the timeout at t4 will noop.
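The CAS-style guard the commit describes can be sketched as follows. (Again, Python is used only to illustrate the idea; sente's actual fix operates on a Clojure atom, and the helper names below are hypothetical.)

```python
# Model of sente's conns_ map after the fix: {client-id: [sch, udt]}.
conns = {}

def on_open(client_id, sch, udt):
    conns[client_id] = [sch, udt]

def make_on_close(client_id, my_sch):
    # Each connection's close handler captures the channel that its
    # own on_open registered.
    def on_close():
        entry = conns.get(client_id)
        # CAS-style guard: only nil the channel if it is still the
        # SAME channel this handler's on_open added; otherwise noop.
        if entry is not None and entry[0] == my_sch:
            entry[0] = None
    return on_close

on_open("cid1", "ws-sch:ip1", 100)          # t0: connect from IP1
close1 = make_on_close("cid1", "ws-sch:ip1")
on_open("cid1", "ws-sch:ip2", 160)          # t3: reconnect from IP2
close1()                                     # t4: stale close -> noop
print(conns)  # {'cid1': ['ws-sch:ip2', 160]}
```

Because the stale `on_close` compares against the channel it registered, the reconnected IP2 channel survives and server->client pushes keep working.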
I've been able to replicate, though unreliably, an issue that causes the client to no longer receive messages from the server until the page is refreshed.
This has occurred intermittently when my iPhone's network switches between:
It has been intermittent in that I can get the client to stop receiving server messages several times consecutively, but it will also go multiple rounds working fine.
This problem also occurs on my PC when tethered to iPhone for internet and switching between 4G/5G.
I've also tried switching my PC between ethernet <-> wifi, but in that case everything has continued working reliably.
Other info: