
Websocket connections not closed using Kong load balancer #83

Closed
landorg opened this issue Nov 8, 2019 · 11 comments

Comments

@landorg
Contributor

landorg commented Nov 8, 2019

Tell us about your environment

AnyCable-Go version:
v0.6.3
AnyCable gem version:
v0.6.3

What did you do?

Use Kong as a load balancer.

What did you expect to happen?

Websocket connections are closed after some time.

What actually happened?

WebSocket connections add up very fast until Kong reaches its limit (8-10k) and restarts with

[alert] 33#0: 16384 worker_connections are not enough

Before Kong we used the native Kubernetes nginx as a load balancer and the connections got closed properly. We usually had around 1k connections at peak times.
At first we thought the problem was that sticky sessions were not working properly. That should be fixed now, but we still see the same issue. (The first bump in the picture is from before we used sticky sessions with hash_on: ip.)
[screenshot 2019-11-08-141314_592x229_scrot: connection count graph]

EDIT: a graph from before we were using Kong:
[screenshot 2019-11-11-101217_887x232_scrot: connection count graph before Kong]

Are there any special settings that need to be set in the load balancer?
Any other ideas on how to resolve this problem?

Thank You

@sponomarev sponomarev changed the title Websocket connections not closed Websocket connections not closed using Kong load balancer Nov 8, 2019
@palkan
Member

palkan commented Nov 8, 2019

Don't know anything about Kong 🤷🏻‍♂️

we used sticky sessions hash_on: ip

Preferably, you should use "least open connections" for WebSockets; that would result in more uniform balancing.

But as I see here:

'least open connections' does not make sense in a Kong cluster

Anyway, that shouldn't be the reason for this issue.

As far as I understand, WebSockets load balancing works the following way:

  1. Client connects to the LB.
  2. LB connects to the upstream.
  3. Client disconnects from the LB.
  4. LB closes the corresponding connection to the upstream.

What could go wrong? I have two ideas:

  • Kong doesn't detect closed connections properly.
    I can't say anything about Kong internals, but maybe it's something lower, at the OS (or whatever it is in k8s?) layer, related to TCP settings (see, for example, https://docs.anycable.io/#/anycable-go/os_tuning?id=tcp-keepalive, and the sysctl sketch after this list).

  • Kong doesn't close/reap LB->upstream connections.
    Maybe Kong expects a specific closing code or status (which we do not set in anycable-go) and doesn't remove the connection, although the client has left.
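A minimal sketch of the keepalive tuning mentioned above (the exact values are illustrative assumptions, roughly in the spirit of the AnyCable OS tuning doc; run on the hosts that terminate the TCP connections):

# Check the current TCP keepalive settings (Linux)
sysctl net.ipv4.tcp_keepalive_time net.ipv4.tcp_keepalive_intvl net.ipv4.tcp_keepalive_probes

# Lower them so dead peers are detected in minutes instead of hours
sudo sysctl -w net.ipv4.tcp_keepalive_time=300   # first probe after 5 min of idle
sudo sysctl -w net.ipv4.tcp_keepalive_intvl=30   # re-probe every 30 s
sudo sysctl -w net.ipv4.tcp_keepalive_probes=5   # close the socket after 5 failed probes

# Persist across reboots by putting the same keys into /etc/sysctl.d/99-keepalive.conf
# and running: sudo sysctl --system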

Do you know how to reproduce this setup locally? Maybe via a docker-compose configuration with Kong and AnyCable? Anything that we can use for investigation is appreciated.
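Not a verified reproduction, but a rough sketch of what a local setup could look like (the image tags, ports, network name, and the declarative config below are assumptions; a real deployment also needs the AnyCable RPC server and Redis, omitted here):

# kong.yml: DB-less declarative config pointing Kong at anycable-go (hypothetical values)
cat > kong.yml <<'EOF'
_format_version: "1.1"
services:
  - name: anycable
    url: http://anycable-go:8080
    routes:
      - name: cable
        paths:
          - /cable
EOF

docker network create cable-net

# anycable-go (WebSocket server only; the RPC backend is left out of this sketch)
docker run -d --name anycable-go --network cable-net anycable/anycable-go:0.6.3

# Kong in DB-less mode using the declarative config above
docker run -d --name kong --network cable-net \
  -e KONG_DATABASE=off \
  -e KONG_DECLARATIVE_CONFIG=/kong/kong.yml \
  -v "$PWD/kong.yml:/kong/kong.yml" \
  -p 8000:8000 kong:1.4

# Open and close WebSocket clients against ws://localhost:8000/cable and compare
# Kong's open connections with anycable_go_clients_num over time.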

@palkan palkan added the question label Nov 8, 2019
@landorg
Contributor Author

landorg commented Nov 11, 2019

Thanks for your answer.
I'll try to provide you with a minimal example of our setup.
I also asked this question to the Kong folks here: https://discuss.konghq.com/t/anycable-websocket-connections-not-closed/4866

Regarding the TCP keepalive settings:
we use the default settings from Ubuntu:

net.ipv4.tcp_keepalive_time = 7200
net.ipv4.tcp_keepalive_intvl = 75
net.ipv4.tcp_keepalive_probes = 9

As far as I understand, this means an idle WebSocket connection should be closed roughly two hours after the last activity, once all 9 probes fail (7200 s + 9 × 75 s ≈ 2 h 11 min). So there should be a drop in connections at some point if this worked, right? Here is the graph from the weekend with less traffic:
[screenshot 2019-11-11-103131_892x232_scrot: weekend connection count graph]

I'll add a graph of normal operation to my original question.

@le0pard

le0pard commented Mar 11, 2020

We had something similar on a project with the following scheme:

AWS ALB <-> nginx <-> anycable-go

The issue was net.ipv4.tcp_tw_reuse, which had value 1 (meaning enabled). It caused our WebSocket connections not to die, even with reduced TCP keepalive. The WebSocket connections stopped "leaking" after we disabled net.ipv4.tcp_tw_reuse (set it to 0).

Also, check that you don't have net.ipv4.tcp_tw_recycle set to 1. It can also cause issues and needs to be disabled (on newer Linux kernels net.ipv4.tcp_tw_recycle was removed).
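For anyone checking this on their own hosts, a quick sketch (standard Linux sysctls, nothing project-specific; the file under /etc/sysctl.d/ is just an example path):

# Inspect the current values (0 = disabled)
sysctl net.ipv4.tcp_tw_reuse
cat /proc/sys/net/ipv4/tcp_tw_recycle 2>/dev/null || echo "tcp_tw_recycle not present (removed in kernel 4.12+)"

# Disable tw_reuse at runtime
sudo sysctl -w net.ipv4.tcp_tw_reuse=0

# Persist the change across reboots
echo 'net.ipv4.tcp_tw_reuse = 0' | sudo tee /etc/sysctl.d/99-tcp-tw.conf
sudo sysctl --system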

Maybe it will help you @RolandG

@landorg
Contributor Author

landorg commented Apr 7, 2020

Thanks for the tip. We haven't had the issue since we re-enabled the orange cloud (proxying) in Cloudflare. We had it disabled before because of performance problems with it. Since then the connections are closed at a certain point by Cloudflare, I guess. Still, we might need this in the future.

@palkan palkan added stale and removed question labels Jun 10, 2020
@palkan palkan closed this as completed Jun 10, 2020
@landorg
Contributor Author

landorg commented Jul 22, 2020

@le0pard
Currently doing some tests.
Did you set these parameters on the nginx or on the anycable server?

@le0pard

le0pard commented Jul 22, 2020

@RolandG these are parameters for the Linux kernel - https://wiki.archlinux.org/index.php/sysctl

They were applied globally, for all processes on the system.

@scalp42

scalp42 commented Jan 19, 2021

@le0pard at which layer did you make the sysctl change? The Nginx one or the AnyCable one?

@le0pard

le0pard commented Jan 19, 2021

@scalp42 the sysctl changes were made at the system/OS level.

@scalp42

scalp42 commented Jan 19, 2021

@le0pard right, but on which host? The Nginx one or the AnyCable host?

@le0pard

le0pard commented Jan 19, 2021

It was the same host in my case @scalp42

@eneeyac

eneeyac commented Jan 25, 2021

Hi, guys
I have a similar issue with my anycable environment.

The anycable-go server is behind an Nginx reverse proxy, and according to ngx_http_stub_status_module (http://nginx.org/en/docs/http/ngx_http_stub_status_module.html) I see about 80 active connections, which looks close to reality for my app for that period of time.

[screenshot: Nginx active connections (stub_status)]

But at the same time the anycable_go_clients_uniq_num metric for anycable-go shows 360 clients, and anycable_go_clients_num shows more than 2000 clients. After I restarted anycable-go, anycable_go_clients_num became the same as what the Nginx metrics show.

[screenshot: anycable_go_clients_num / anycable_go_clients_uniq_num metrics]

And the anycable_go_mem_sys_bytes metric went down:

[screenshot: anycable_go_mem_sys_bytes metric]

On the other hand, when I restart Nginx, the anycable_go_clients_num and anycable_go_mem_sys_bytes metrics don't change. That makes me think it is anycable-go that keeps the outdated connections, not Nginx. Does that make sense?
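One way to double-check this (a hedged suggestion, assuming anycable-go listens on port 8080; adjust the port and the metrics endpoint to your setup) is to compare the kernel's view of established connections on the anycable-go host with the metric:

# Count TCP connections the kernel still considers established towards anycable-go
ss -Htn state established '( sport = :8080 )' | wc -l

# Show per-socket timers to verify the keepalive settings actually apply
ss -tno state established '( sport = :8080 )' | head

# Compare with what anycable-go itself reports (if the HTTP metrics endpoint is enabled)
curl -s http://localhost:8080/metrics | grep anycable_go_clients_num

If the kernel count is low while anycable_go_clients_num stays high, the stale entries live inside anycable-go; if both are high, the sockets really are being kept open at the TCP level.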

I tried to tune the OS keepalive settings as described in the official doc https://docs.anycable.io/#/v1/anycable-go/os_tuning?id=tcp-keepalive but nothing changed; the anycable_go_clients_num metric still shows a lot of clients.

I checked that net.ipv4.tcp_tw_recycle = 0 and that cat /proc/sys/net/ipv4/tcp_tw_recycle returns "No such file or directory".

I also tried the libkeepalive library (http://libkeepalive.sourceforge.net) as described in
https://tldp.org/HOWTO/html_single/TCP-Keepalive-HOWTO/#libkeepalive but with the same result: the anycable_go_clients_num metric shows a lot of clients.

Could you please advise what may be wrong? I would appreciate any help.
