Some dead websocket connections are never cleaned up #431
Comments
@krajj7 Hi Jan, I'm having difficulty reproducing this so far. I'm consistently seeing Are you 100% sure that you're running Sente 1.18.1 and not 1.18.0? Have you tried running You can check the
Correct, but it should still work fine.
Correct. The server has a loop that regularly tries to ping the client if the connection has been idle. That ping attempt should trigger a connection close if it fails. If you've confirmed that you're really running 1.18.1 and are still seeing the issue - did you make any other changes to the reference example? E.g. did you change the default web server from http-kit, etc.? Thanks! |
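The idle-ping behaviour described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not Sente's actual implementation: the function name `ping-idle-conns`, the `{conn-id {:udt ...}}` state shape, and the `send-ping!` callback are all hypothetical.

```clojure
;; Hypothetical sketch (not Sente's real code): ping connections that have
;; been idle for at least `idle-ms`, and drop those whose ping fails.
(defn ping-idle-conns
  "Returns `conns` without the idle connections whose ping failed.
  `conns` is assumed to map conn-id -> {:udt <last-activity-ms>}.
  `send-ping!` is a hypothetical fn of conn-id that returns truthy on success."
  [conns now-udt idle-ms send-ping!]
  (reduce-kv
    (fn [acc conn-id {:keys [udt] :as entry}]
      (let [idle? (>= (- now-udt udt) idle-ms)]
        (if (and idle? (not (send-ping! conn-id)))
          acc                          ; ping failed: treat the connection as closed
          (assoc acc conn-id entry)))) ; active, or ping succeeded: keep it
    {}
    conns))
```

In this sketch a connection is only ever removed as the result of a failed ping, which matches the expectation stated above: a dead client should disappear the next time the loop tries to reach it.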
Hi, thank you for looking into this. I re-tried today and found that I also cannot reliably reproduce with the given instructions, I am sorry about that. I can still trigger stuck connections with a few minutes of trying, but frustratingly, I haven't been able to find a reliable trigger. I am 100% sure I'm using 1.18.1 and I didn't make any relevant changes in the example project. The only changes I made were adding the guava dependency and commenting out auto-opening of the browser. I was most successful triggering the problem with these steps (in the example project):
The timing of step 6 seems to be critical. If it's done too late or too early the problem doesn't manifest. The message spam doesn't always seem to be necessary, and I'm not sure whether it matters. I'm sorry about these unreliable instructions. If possible please give it 5 minutes and let me know if you got a stuck connection. If not I will try to dive deeper and see if I can find the cause or something more helpful. |
I'll also add that besides the example project and chromium, I also observed this problem in my larger app which uses http-kit, the connections were non-localhost and it happens with Firefox as well. |
@krajj7 Hi Jan, thanks for the additional info. It's helpful to know if it's intermittent and/or seems to involve some kind of timing issue. That indicates a different kind of problem than if it's consistently reproducible. Will investigate further and come back to you 👍 |
@krajj7 Really struggling to reproduce this on my end. And have gone over the relevant code a few times now without spotting any obvious explanation so far. I'm sorry to prod on this, but would you please humour me and just explicitly confirm that you've run 1.18.0 had a known issue with I need to get some sleep, but will continue digging first thing in the morning - I might be missing something obvious atm. Apologies for all the trouble on this, and appreciate your assistance debugging! Cheers |
Quick update: while I still haven't been able to successfully reproduce on my end yet, I have a couple ideas for some of the logic that could be made more robust. Will prepare a debug build with some improvements and extra logging and come back to you. |
I'm sorry to hear you weren't able to reproduce it. My repro instructions must still be missing something crucial, but I don't know what it is. I can confirm I get I'll try to look "under the hood" a bit more and see if I can discover something. |
Thanks Jan 🙏
I think I've spotted a possible cause of the trouble you're seeing, trying to confirm - will come back to you shortly. |
Possible fix for [#431]?

Before this commit: It was possible for a `conns_` CAS update to fail on conn close, *while still updating* the entry's udt - leading to the responsible gc loop failing its udt check.

After this commit: CAS updates to `conns_` are properly atomic. Either sch and udt are both updated, or neither is.
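The "both or neither" property described in this commit message can be illustrated with a minimal sketch. It assumes a hypothetical state shape of `{conn-id {:sch ... :udt ...}}` held in an atom; the name `close-conn!` and this shape are illustrative only and are not taken from Sente's internals.

```clojure
;; Hypothetical state shape, for illustration only (not Sente's real internals):
;; {conn-id {:sch <server-channel> :udt <last-update-time-ms>}}
(def conns_ (atom {"conn-1" {:sch :channel-a :udt 1000}}))

(defn close-conn!
  "Clears :sch and sets :udt for conn-id in a single swap, but only if the
  entry still holds `expected-sch`. Returns true if the state changed.
  Because both fields are written inside one update fn, a failed check
  leaves :udt untouched as well."
  [conns_ conn-id expected-sch now-udt]
  (let [[old-state new-state]
        (swap-vals! conns_
          (fn [m]
            (if (= (get-in m [conn-id :sch]) expected-sch)
              (assoc m conn-id {:sch nil :udt now-udt})
              m)))]
    (not= old-state new-state)))
```

The design point is that the check and both writes live in the same update function, so a concurrent change to the entry can never result in `:udt` moving forward while `:sch` stays as it was.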
When you get an opportunity, would you please try The dev reference project uses this version. My hope is that the problem should be fixed here, but if it's not - could you please share the server-side logging output from that reference project? Thanks so much for all your time+effort on this! Really sorry about all the trouble. |
No problem at all, I'm glad to assist. I tried "1.18.2-SNAPSHOT" and managed to trigger the issue. The log is here: https://github.com/ptaoussanis/sente/files/12002682/bug.log This is roughly what I did:
A few things to note:
Hope this helps. I'll be happy to do more experiments if needed. |
The new implementation is more explicit, and includes a fix to at least one bug in the previous implementation: it was previously possible for a `conns_` CAS update to fail on conn close, *while still updating* the entry's udt - leading to the responsible gc loop failing its udt check.
That did help, thanks! And much appreciated 🙏 Would you please try again with the current (updated) |
The new example project does Anyway the issue seems to be much easier to trigger now. I get stuck connections just by re-logging in with different user-ids, without doing anything else. I made two logs, the first is really short:
Second log:
|
Finally managed to see this on my end! For some reason took dozens of attempts on my system, still don't know why we've had such different experiences with this. Anyway, it looks like the lingering connections were caused by the connection's I'll get the logic reworked and push another update. Will then ask for your patience to test once again if possible, since it's been so difficult for me to actually generate the necessary conditions on my system. |
Incl. notably some additional debugging tools for [#431]:
- `connected-uids_` and `conns_` now both printed in loop
- New buttons:
  - Print connected uids
  - Test repeated logins
Improvements include:
- New internal "conn-id" concept that:
  1. No longer depends on http server implementations to properly implement identity
  2. Greatly improves logging output, easing debugging
- General logging improvements to ease debugging
- Now expose internal `conns_` state to ease debugging
- Added server-side ping timeout to match client side, and to catch unexpected cases where http server is never able to identify a connection as broken.
- Simplified internal API for updating `conns_` state.
- More robust handling when events fire in unexpected order (e.g. :on-close firing before :on-open handshake).
- Generally improved clarity and robustness.

Note: some relevant additions will also be made to the reference example project to aid debugging related to
Okay, Using this version, I seem to be unable to produce any lingering dead connections on my system. But since I also struggled with the previous version, it would be great to get confirmation from you if possible. I've added some additional relevant debugging tools to the reference example on master, incl.:
Hopefully everything works on your end. |
Tried Here is a log that ends with two stuck websocket connections: Here is a log that ends with one AJAX connection stuck: |
Wow, that's so crazy ^^ Will go through your logs in a moment. In the meantime, is there anything about your system that's unusual? It would be so helpful if I could reproduce this better on my end. What browser and OS are you on? Are you running any browser plugins or other software that could be doing network conditioning? Are you changing any options or software in the reference example? You mentioned earlier that you need to add Guava for your JVM to work - what JVM is that? To clarify: this behaviour definitely shouldn't happen in any environment, it's a bug in Sente if this can occur. I just feel bad repeatedly wasting your time with updates that don't actually solve the problem. It might be helpful if I try to replicate your environment if possible. |
I don't think there's anything unusual about my setup. I'm not changing anything in the project except for logging (added file logger, :trace by default) and the guava lib. I don't use any unusual networking software or plugins. HW: Lenovo Yoga Slim 7 15ITL05 laptop
I don't mind continuing testing new versions as needed. I can see that it would be much better if you could reproduce the issue yourself, but I don't know what could help with that. |
👍 And just on the off-chance that it's relevant, why is Guava necessary in your environment? |
I think it's some dependency conflict. The server runs without the guava dep, but when I try to compile the javascript I get this error:
By default |
So to confirm: if you just clone the Sente repo without making any changes, go to the example project's path - and execute Edit: if so, can you share the contents of your |
Yes.
I do have a bunch of things in
|
…ids`

Big thanks to @krajj7 for the report and huge assistance debugging!

As part of the investigation into this issue, and due to another recent related issue (#429), I decided that we were overdue for a refactor of Sente's connection management system. The old system had grown overly complex, and left too much room for edge cases and timing issues.

This commit introduces a major refactor of the system, with an emphasis on robustness and improved observability. Specific improvements include:

- New internal "conn-id" concept that:
  1. No longer depends on http server implementations to properly implement identity
  2. Greatly improves logging output, easing debugging
- General logging improvements to ease debugging
- Now expose internal `conns_` state to ease debugging
- Added server-side ping timeout to match client side, and to catch unexpected cases where http server is never able to identify a connection as broken. Note that this new feature is currently opt-in[1], but will be enabled by default in a future release.
- Simplified internal API for updating `conns_` state.
- More robust handling when events fire in unexpected order (e.g. :on-close firing before :on-open handshake).
- Generally improved clarity and robustness.

Note: some related additions will also be made to the reference example project in another commit.

[1] Provide a value (e.g. 5000) for the new `:ws-ping-timeout-ms` option to `make-channel-socket-server!`
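As a rough illustration of footnote [1], opting in to the server-side ping timeout might look like the fragment below. The option name and the example value 5000 come from the commit message; the choice of the http-kit adapter and the var name `chsk-server` are assumptions made for this sketch, and it needs Sente 1.19 (or the matching pre-release) plus http-kit on the classpath.

```clojure
(require '[taoensso.sente :as sente]
         '[taoensso.sente.server-adapters.http-kit :refer [get-sch-adapter]])

;; Assumption: http-kit is the web server in use; other adapters are wired
;; in the same way. `chsk-server` is just an illustrative name.
(def chsk-server
  (sente/make-channel-socket-server!
    (get-sch-adapter)
    {:ws-ping-timeout-ms 5000})) ; opt in to the server-side ping timeout (ms)
```

Without this option the feature stays disabled in this release, since the commit message describes it as opt-in for now.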
Hi Jan- thanks for the additional info. Indeed, nothing sticks out as obviously relevant 👍 I've just pushed This includes a rewrite of the last code that I believe could have been causing problems. Could you please try this and let me know? 🙏 I've also updated the reference example again to include some additional debugging tools - notably a toggle to simulate unreliable but unbroken connections. |
Hi Peter, So far I haven't been able to trigger the issue with I'm going to try a longer experiment to confirm. I deployed the new alpha to my app where I originally noticed the problem and will script periodic browser freezes/restarts to happen during the night. That should test a lot of reconnects. So I'll report tomorrow, unless I notice any problem sooner. |
Excellent, thank you so much! I'll wait on word from you then to push |
I'm happy to report that my experiment didn't reveal any problems 👍 I tested re/connections to my app as well as the example project at the same time, more than 1800 sessions in total with random user-ids and none of them got stuck. Thank you for working on this. |
That's great news, thanks so much for all your patience and assistance with debugging this! |
Note that because of the amount of code touched, I decided in hindsight to be cautious and release this work as part of a bigger version bump (1.19 instead of 1.18.2). Closing for now, but please feel free to re-open if you do encounter any more related trouble. Cheers! |
After upgrading from sente 1.17.0 to 1.18.1 I noticed that my app is accumulating dead websocket connections in the `connected-uids` atom that never go away, even after trying to close them by sending `[:chsk/close]`. One case when this seems to happen is when the browser is killed non-gracefully, i.e. `kill -9`.

This can be simulated by using the example project in the repo (I am assuming Linux and using Chromium as the browser):

1. `git clone https://github.com/ptaoussanis/sente.git; cd sente/example-project; lein cljsbuild once` (I had to add `[com.google.guava/guava "25.1-jre"]` in project.clj or it doesn't work)
2. `lein repl`
3. `(-main)`
4. `(:ws @connected-uids)` in the REPL should show that user-id
5. `killall -KILL chromium`
6. `(chsk-send! <uid> [:chsk/close])`
7. `(:ws @connected-uids)`

With the example project from version 1.17.0 the client disappears from `(:ws @connected-uids)` after only a few seconds. With version 1.18.1 the client seems to stay indefinitely.

I realize that using `:chsk/close` is undocumented. As far as I understand, unresponsive clients should eventually be disconnected automatically, but this never seems to happen with 1.18.1 (when the browser is killed non-gracefully as above). With 1.17.0 automatic disconnects mostly worked but I found some cases where an explicit close was needed, which is why I'm using it.