
Connections hang in CLOSED state #1589

Closed

Yozhig opened this issue Nov 22, 2022 · 7 comments

@Yozhig
Contributor

Yozhig commented Nov 22, 2022

We have faced an issue where connections (ports and the processes owning them) hang forever (we use these connections for infinite streams of events).

...
4071 inet_tcp 249       69938937   <0.15260.394>                   *:*                    CLOSED       STREAM
4072 inet_tcp 249       141447159  <0.17014.480>                   *:*                    CLOSED       STREAM
4078 inet_tcp 249       176131209  <0.6376.600>                    *:*                    CLOSED       STREAM
4084 inet_tcp 249       80625928   <0.22290.881>                   *:*                    CLOSED       STREAM
4087 inet_tcp 249       101426932  <0.2095.695>                    *:*                    CLOSED       STREAM
4095 inet_tcp 249       147349243  <0.23210.849>                   *:*                    CLOSED       STREAM
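A listing like the one above can be obtained from a remote shell on the affected node; how the original listing was produced is not stated, but inet:i/0 prints a similar socket table, and erlang:ports/0 gives a rough count:

    %% Print a table of all TCP/UDP sockets with owner pid, state and type.
    inet:i().

    %% Rough count of TCP ports on the node (assumption: the leak shows up
    %% as an ever-growing number of "tcp_inet" driver ports).
    length([P || P <- erlang:ports(),
                 erlang:port_info(P, name) =:= {name, "tcp_inet"}]).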

Our investigation showed that it happens due to a combination of configuration, error handling on Cowboy's side, and inet_drv behaviour. The interesting thing here is that when the send_timeout_close configuration parameter is used and inet_drv closes a connection due to a timeout, it does not send a tcp_closed or tcp_error message to the port owner. The only thing we can do in this case is to always check the return value of Transport:send/2.
UPD: I've made a test for this case in #1590
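To illustrate the inet_drv behaviour described above, here is a minimal sketch with plain gen_tcp rather than Cowboy's transport code (host, port and payload are placeholders): with send_timeout and send_timeout_close set, a blocked send eventually fails, but no {tcp_closed, Socket} or {tcp_error, Socket, Reason} message is delivered to the owner, so the return value of send/2 is the only signal.

    %% Sketch only; not Cowboy's internal code.
    stream(Host, Port) ->
        {ok, Socket} = gen_tcp:connect(Host, Port,
            [binary, {active, true},
             {send_timeout, 5000},          %% give up on a blocked send after 5s
             {send_timeout_close, true}]),  %% inet_drv closes the port on that timeout
        Payload = <<"chunk of stream data">>,
        case gen_tcp:send(Socket, Payload) of
            ok ->
                keep_streaming;
            {error, Reason} ->
                %% Reason may be timeout, closed, enotconn, ...; the port can
                %% already have been closed by inet_drv without any tcp_closed or
                %% tcp_error message, so the caller must stop here rather than
                %% wait for a message that will never arrive.
                {stop, Reason}
        end.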

@stolen

stolen commented Nov 14, 2023

@essen could you take a look at this, please?
We faced this issue too. Long-running connections (MPEG-TS over HTTP in our case) tend to leak.
This leads to a growing number of useless processes consuming CPU and memory.
The attached graph showed a significant leak (about 5K processes, corresponding to about 1.5K cowboy_loop workers, per day) under very low load (about 2 new requests per second and at most 100 concurrent connections).
[graph: node process count over time]

@essen
Member

essen commented Nov 15, 2023

What I don't understand is why the idle/inactivity timeouts don't help.

Anyway, have you tried the patch in the PR? Does it fix your case?

@stolen

stolen commented Nov 15, 2023

why the idle/inactivity timeouts don't help

I should have mentioned that, sorry.
These are long-lived responses. The client connects and receives data as it appears, and that may last for hours (e.g. internet radio via Shoutcast, or SSE), without the client sending anything. Downloading a 2-gigabyte file over a slow 3G link is also affected.
A meaningful idle_timeout kills these connections because there is no client activity on the socket, leading to data loss even under perfect network conditions.
So we have to disable it explicitly with cowboy_req:cast({set_options, #{idle_timeout => infinity}}, Req) for these long requests.
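For context, a rough sketch of such a long-lived streaming handler (hypothetical module name, content type and message shapes; only the set_options cast is taken from the comment above):

    -module(stream_handler).            %% hypothetical
    -behaviour(cowboy_loop).
    -export([init/2, info/3]).

    init(Req0, State) ->
        %% Disable the idle timeout for this request only: the client is not
        %% expected to send anything for hours while it receives the stream.
        ok = cowboy_req:cast({set_options, #{idle_timeout => infinity}}, Req0),
        Req = cowboy_req:stream_reply(200,
            #{<<"content-type">> => <<"video/mp2t">>}, Req0),
        {cowboy_loop, Req, State}.

    info({chunk, Data}, Req, State) ->
        %% Push each chunk to the client as it appears.
        ok = cowboy_req:stream_body(Data, nofin, Req),
        {ok, Req, State};
    info(stop, Req, State) ->
        {stop, Req, State}.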

have you tried the patch in the PR

Not yet. Will check it soon.
I've traced ssl:send/2 -> {error,enotconn}, so the patch seems to address exactly our case.
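For reference, a trace like that can be reproduced on a live node with OTP's dbg (the exact tooling used above is not stated):

    %% Trace calls to ssl:send/2 and their return values on the local node.
    dbg:tracer(),
    dbg:p(all, call),
    dbg:tpl(ssl, send, 2, [{'_', [], [{return_trace}]}]).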

@essen
Member

essen commented Nov 15, 2023

Yes, I've merged a similar patch for Gun; I didn't think there were users disabling these timeouts, so I didn't follow up here (priorities and all that). But it sounds like it'd be useful for the next release.

@stolen

stolen commented Nov 15, 2023

Looking forward to the release. Thank you!

@stolen

stolen commented Nov 16, 2023

The issue is fixed by the PR.
The service was restarted on 11/15 at 12:00 (the large step down), and there has been no growth in the process count since.
[screenshot: process count after the restart]

@essen essen added this to the 2.11 milestone Nov 23, 2023
@essen
Member

essen commented Dec 12, 2023

This will now work properly on current master. I made a slightly different fix, so you may want to test again before I release the new version. Closing, thanks!

@essen essen closed this as completed Dec 12, 2023