
Waku V2 prod fleet WebSockify connectivity issues #19

Closed
jm-clius opened this issue Jun 22, 2021 · 41 comments

@jm-clius (Contributor) commented Jun 22, 2021

I've noticed the following possible issues with the Waku v2 prod fleet:

Logging/Kibana issues (Resolved)

  1. Logs aren't available for any Waku v2 prod nodes on Kibana since ~11:13 UTC on the 14th of June. (In fact, the only Waku v2 service that's still logging seems to be the one on host node-01.gc-us-central1-a.wakuv2.test.)
  2. The logs for all services on node-01.do-ams3.wakuv2.prod disappeared even earlier from Kibana (9th of June).

Connectivity to node-01.do-ams3.wakuv2.prod (Resolved)

The prod nodes do not seem to be (stably) connected to node-01.do-ams3.wakuv2.prod. From the available logs:

  1. The other two prod nodes seemed to be trying to reach the peerId for node-01.do-ams3.wakuv2.prod at the wire address for node-01.ac-cn-hongkong-c.wakuv2.prod (/ip4/8.210.222.231/tcp/30303). Perhaps some earlier inconsistency in the connect.yml script? Since these get persisted, we may need to drop the Peer table on each of the prod nodes before running the connect script again.
  2. It seems possible that node-01.do-ams3.wakuv2.prod was not updated along with the other two nodes after deploying to prod. This is based on the other two nodes complaining that it does not support the same protocols as they do. (Though this could just be a side-effect of issue (1) above.)

Possible websockify issues

  1. websockify (on all hosts) frequently logs errors such as "Connection reset", "no shared cipher", "bad syntax", and "Method not allowed".
@jm-clius (Contributor Author)

Hi @jakubgs. I've summarised some of the problems I see on prod above. Some/most of these may be related, but LMK if you want me to create separate issues for each subheading. Thanks!

@arthurk (Contributor) commented Jun 22, 2021

See #20 (comment) for deployment notes

@jakubgs (Member) commented Jun 22, 2021

I restarted the Logstash containers and it seems like the logs are back on the menu:

[screenshot]

But it's worrying that this happened. I'll have to monitor the Logstash services more closely.

@jakubgs (Member) commented Jun 22, 2021

Also, yeah, separate problems require separate issues. But whatever.

@jakubgs (Member) commented Jun 22, 2021

I purged SQLite databases with:

echo 'DELETE FROM Peer;' | sudo sqlite3 data/store.sqlite3

And then re-ran Ansible for all the prod hosts, and the connections look correct:

admin@node-01.do-ams3.wakuv2.prod:/docker/nim-waku-v2 % /docker/nim-waku-v2/rpc.sh get_waku_v2_admin_v1_peers
{
  "jsonrpc": "2.0",
  "id": 1,
  "result": [
    {
      "multiaddr": "/ip4/34.121.100.108/tcp/30303/p2p/16Uiu2HAmVkKntsECaYfefR1V2yCR79CegLATuTPE6B9TxgxBiiiA",
      "protocol": "/vac/waku/relay/2.0.0-beta2",
      "connected": true
    },
    {
      "multiaddr": "/ip4/8.210.222.231/tcp/30303/p2p/16Uiu2HAm4v86W3bmT1BiH6oSPzcsSr24iDQpSN5Qa992BCjjwgrD",
      "protocol": "/vac/waku/relay/2.0.0-beta2",
      "connected": true
    },
    {
      "multiaddr": "/ip4/188.166.135.145/tcp/9000/p2p/16Uiu2HAmUsvgMBF7yhzFMumEcccjWfg2EPCSXtE33Tbs8jNQj2bS",
      "protocol": "/vac/waku/relay/2.0.0-beta2",
      "connected": true
    }
  ]
}
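
For reference, a quick sanity check that the purge took effect before re-running Ansible (a sketch, using the same data/store.sqlite3 path as in the command above) could look like this:

# expect 0 rows in the Peer table right after the purge
echo 'SELECT COUNT(*) FROM Peer;' | sudo sqlite3 data/store.sqlite3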

jakubgs changed the title from "Inquiry: Waku v2 prod fleet health" to "Waku V2 prod fleet WebSockify connectivity issues" on Jun 22, 2021
@jakubgs (Member) commented Jun 22, 2021

There are certainly some logs indicating connectivity issues:

admin@node-01.gc-us-central1-a.wakuv2.prod:/docker/nim-waku-v2 % grep -i err /var/log/docker/websockify/docker.log
2021-06-22T02:01:45.030714+00:00 docker/websockify[767]: handler exception: [Errno 104] Connection reset by peer
2021-06-22T02:01:45.247048+00:00 docker/websockify[767]: handler exception: [Errno 104] Connection reset by peer
2021-06-22T02:01:45.681484+00:00 docker/websockify[767]: handler exception: [Errno 104] Connection reset by peer
2021-06-22T02:01:45.896461+00:00 docker/websockify[767]: handler exception: [Errno 104] Connection reset by peer
2021-06-22T02:01:46.116637+00:00 docker/websockify[767]: handler exception: [Errno 104] Connection reset by peer
2021-06-22T02:01:46.336742+00:00 docker/websockify[767]: handler exception: [Errno 104] Connection reset by peer
2021-06-22T02:01:46.554451+00:00 docker/websockify[767]: handler exception: [Errno 104] Connection reset by peer
2021-06-22T03:05:56.750209+00:00 docker/websockify[767]: handler exception: [Errno 104] Connection reset by peer
2021-06-22T09:09:19.471171+00:00 docker/websockify[767]: handler exception: [Errno 32] Broken pipe
2021-06-22T09:09:22.555589+00:00 docker/websockify[767]: handler exception: [Errno 32] Broken pipe
2021-06-22T12:32:21.489493+00:00 docker/websockify[767]: handler exception: [Errno 32] Broken pipe
2021-06-22T12:33:21.630111+00:00 docker/websockify[767]: handler exception: [Errno 32] Broken pipe

But stuff like that is usually created by clients, not the server.

@jakubgs (Member) commented Jun 22, 2021

There is no proxy in place; the container exposes port 443 directly:

admin@node-01.gc-us-central1-a.wakuv2.prod:/docker/nim-waku-v2 % d inspect websockify | jq '.[].NetworkSettings.Ports'
{
  "443/tcp": [
    {
      "HostIp": "0.0.0.0",
      "HostPort": "443"
    }
  ]
}

So there is no reverse proxy tuning we could possibly do.

@jakubgs (Member) commented Jun 22, 2021

@jm-clius your report about WebSocket connectivity issues is very vague. As far as I can tell, there's nothing wrong with the setup.

You have to give me a way to reproduce this issue so I can start debugging it.

@jm-clius (Contributor Author)

Thanks for the fixes, @jakubgs. I will monitor on our side, but things definitely look better.

You have to give me a way to reproduce this issue so I can start debugging it.

Yeah, I added the websockify error logs to the report simply because I noticed them occurring frequently while investigating the other issue and thought they could be related. It might not be a problem at all, but I'll see if I can come up with something more concrete and reproducible. I'll also hear if @D4nte has any thoughts.

@D4nte commented Jun 24, 2021

I've witnessed connectivity issues from the js-waku side: waku-org/js-waku#185
I attributed them to an issue with js-libp2p not emitting an event when the stream closes.
However, I did not investigate why said stream closes. I guess this could be related to the websockify errors, but I have no idea at this stage.

@jakubgs (Member) commented Aug 3, 2021

@D4nte is this still relevant?

@D4nte commented Aug 5, 2021

All good on my side so this comment #19 (comment) can now be ignored.

@jakubgs (Member) commented Aug 5, 2021

I'm closing this then. Feel free to reopen if this shows up again.

jakubgs closed this as completed on Aug 5, 2021
@jakubgs (Member) commented Sep 14, 2021

We've seen some new issues with Websockify today. One thing I've noticed is a lot of processes in the containers:

admin@node-01.ac-cn-hongkong-c.wakuv2.prod:~ % d top websockify | head -n4        
UID                 PID                 PPID                C                   STIME               TTY                 TIME                CMD
dockrem+            1474738             1474716             0                   Sep13               ?                   00:00:54            python3 -m websockify --cert=/etc/letsencrypt/live/node-01.ac-cn-hongkong-c.wakuv2.prod.statusim.net/fullchain.pem --key=/etc/letsencrypt/live/node-01.ac-cn-hongkong-c.wakuv2.prod.statusim.net/privkey.pem 443 node:30303
dockrem+            1688455             1474738             0                   03:59               ?                   00:00:07            python3 -m websockify --cert=/etc/letsencrypt/live/node-01.ac-cn-hongkong-c.wakuv2.prod.statusim.net/fullchain.pem --key=/etc/letsencrypt/live/node-01.ac-cn-hongkong-c.wakuv2.prod.statusim.net/privkey.pem 443 node:30303
dockrem+            1755707             1474738             0                   12:57               ?                   00:00:03            python3 -m websockify --cert=/etc/letsencrypt/live/node-01.ac-cn-hongkong-c.wakuv2.prod.statusim.net/fullchain.pem --key=/etc/letsencrypt/live/node-01.ac-cn-hongkong-c.wakuv2.prod.statusim.net/privkey.pem 443 node:30303

admin@node-01.ac-cn-hongkong-c.wakuv2.prod:~ % d top websockify | wc -l           
61
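
A rough way to correlate that process count with actual client connections (websockify typically forks one child per proxied connection; filtering ss on port 443 matches the port mapping shown earlier, so counts should be roughly comparable):

# websockify worker processes (skip the header line and the parent process)
d top websockify | tail -n +3 | wc -l
# established TCP connections on port 443 on the host side, for comparison
sudo ss -tn state established '( sport = :443 )' | tail -n +2 | wc -l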

@jakubgs (Member) commented Sep 14, 2021

And we can see that the websocket healthcheck is failing:

admin@node-01.ac-cn-hongkong-c.wakuv2.prod:~ % websocat -vk1E wss://localhost:443
[INFO  websocat::lints] Auto-inserting the line mode
[INFO  websocat::stdio_threaded_peer] get_stdio_peer (threaded)
[INFO  websocat::ws_client_peer] get_ws_client_peer
websocat: WebSocketError: I/O failure
websocat: error running

On some hosts it just takes a really long time to respond.
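
A sketch of the same probe with a hard bound, so a hung handler registers as a failure rather than blocking the check (the 15-second timeout value is arbitrary):

# same websocat probe, bounded; stdin from /dev/null so it exits on connect + EOF
timeout 15 websocat -vk1E wss://localhost:443 </dev/null \
  && echo 'websocket OK' || echo 'websocket FAILED or timed out'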

@jakubgs (Member) commented Sep 14, 2021

Here are some errors we can find in the logs:

admin@node-01.ac-cn-hongkong-c.wakuv2.prod:~ % awk -F':' '/Err/{a[$6]++}END{for(i in a){printf "%2s - %s\n", a[i], i}}' /var/log/docker/websockify/docker.log
 1 -  [Errno 32] Broken pipe
 9 -  [Errno 110] Operation timed out
77 -  [Errno 104] Connection reset by peer

jakubgs reopened this on Sep 14, 2021
@jakubgs (Member) commented Sep 14, 2021

There's noticeable swapping and a LOT of websockify processes:

[screenshot]

@jakubgs (Member) commented Sep 14, 2021

I'm wondering if the processes aren't stuck in some weird state after a connection breaks off uncleanly.

@jakubgs (Member) commented Sep 14, 2021

Look at that, looks like we are hitting a peer limit:

[screenshot]

@jakubgs (Member) commented Sep 14, 2021

It's definitely not a limit of file descriptors:

admin@node-01.do-ams3.wakuv2.prod:~ % d exec -it nim-waku-v2 sh -c 'ulimit -n'  
1048576

admin@node-01.do-ams3.wakuv2.prod:~ % d exec -it nim-waku-v2 cat /proc/sys/fs/file-max
9223372036854775807
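
To see how many sockets the node process itself is holding, something like this could help (it assumes the node runs as PID 1 inside the container, which is an assumption about this image):

# count open file descriptors held by the node process inside the container
d exec nim-waku-v2 sh -c 'ls /proc/1/fd | wc -l'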

@jakubgs (Member) commented Sep 14, 2021

And the restart doesn't appear to free up the peers:

[screenshot]

So we are still close to the limit, wherever that limit comes from.

@jakubgs (Member) commented Sep 14, 2021

Although the RPC call shows only 13 of them:

admin@node-01.do-ams3.wakuv2.prod:~ % /docker/nim-waku-v2/rpc.sh get_waku_v2_admin_v1_peers | jq '.result[].multiaddr'        
"/ip4/34.121.100.108/tcp/30303/p2p/16Uiu2HAmVkKntsECaYfefR1V2yCR79CegLATuTPE6B9TxgxBiiiA"
"/ip4/8.210.222.231/tcp/30303/p2p/16Uiu2HAm4v86W3bmT1BiH6oSPzcsSr24iDQpSN5Qa992BCjjwgrD"
"/ip4/188.166.135.145/tcp/9000/p2p/16Uiu2HAmUsvgMBF7yhzFMumEcccjWfg2EPCSXtE33Tbs8jNQj2bS"
"/ip4/10.0.0.12/tcp/43605/p2p/16Uiu2HAmCc8eEyz7FE1pudYYQVvvzrYLFhZ7NNpC3etj6KiqmPQL"
"/ip4/10.0.0.12/tcp/40709/p2p/16Uiu2HAkvJu5nuDnzKRsAU9W7mw1SuuVsHLY5gfoVVohpq3nDLuD"
"/ip4/192.168.43.253/tcp/41943/p2p/16Uiu2HAmETkTXQJjirrgaFqoQECG491jqDKuqNnJMvF9tPMGhjUx"
"/ip4/192.168.1.19/tcp/41139/p2p/16Uiu2HAm6vBaena2tCqBVPGctzi9q6mjHXNGHeNFs1UmzXEa1nPD"
"/ip4/10.0.0.12/tcp/43507/p2p/16Uiu2HAmVXm8AWQpohDytibp4fgrmekpAPN2VQ2F12zErAczAxqJ"
"/ip4/148.103.183.231/tcp/60210/p2p/16Uiu2HAkxiM8Ut3qpA3zqMgSb327qiAADeBMgQjCeHmGrFPP1Smq"
"/ip4/192.168.1.19/tcp/44859/p2p/16Uiu2HAmPNCdNRPx3isT96Me4UVa6UGrv6C2t32m2Ftfh8PrxQQk"
"/ip4/192.168.43.253/tcp/42031/p2p/16Uiu2HAmCo6hG4JRYJ1KC4MzCtmdGPMHPU8afXS7rWGiGkNv4Afg"
"/ip4/192.168.1.19/tcp/43339/p2p/16Uiu2HAkzkuH3Yn1CHi61UNYEb5cKx7FGShXBAVn8keEB4FCEjDw"
"/ip4/10.0.0.12/tcp/45689/p2p/16Uiu2HAmDmWtc8gm394hwJJjfcHsvzcNbeCA8RMwkHHR1uCVWUSc"

@jakubgs (Member) commented Sep 14, 2021

Wtf:

[screenshot]

@D4nte commented Sep 15, 2021

OK, so this time it's due to overuse of resources.

I continued the discussion there: https://discord.com/channels/864066763682218004/865467749923684403/887493985256497172

Happy to move it to GitHub :)

@jakubgs @jm-clius @oskarth what are your thoughts on the matter re resource usage (see my Discord message)?

@jakubgs (Member) commented Sep 15, 2021

No it's not. The issue is a limit of peers, set at 50.

@jm-clius (Contributor Author)

Indeed, the libp2p maximum number of connections is set to 50: https://github.com/status-im/nim-libp2p/blob/master/libp2p/connmanager.nim#L26. (A way to watch this limit is sketched after the list below.)
Summarizing some points from the Discord discussion:

  • To expand network capacity we'll have to increase the number of discoverable full nodes in the network that e.g. the web client can connect to, especially when bootstrapping.
  • Encouraging user-run nodes is preferable.
  • It should be possible to add such nodes to a discoverable "bootstrap" list for the prod network using DNS-based discovery.
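
As a side note, a minimal sketch for keeping an eye on the live connection count against that 50-connection ceiling could look like this (the metrics port 8008 and the libp2p_peers gauge name are assumptions, not confirmed in this thread):

# watch the node's connection count against the 50-connection ceiling
watch -n 10 "curl -s http://localhost:8008/metrics | grep '^libp2p_peers'"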

@jm-clius (Contributor Author)

Although the RPC call shows only 13 of them:

[peer list as in the comment above]

This does not make sense. The RPC call should only show the peers that were explicitly added using the RPC add_peers command (or that were included as arguments in the startup script). I'd therefore expect each prod cluster node to have:

  • 2 entries in the Peer table, one for each of the other nodes in the cluster
  • 1 entry for the chat2bridge

A total of 3, in other words. Where do these other Peer entries come from?

@jakubgs (Member) commented Sep 15, 2021

@jm-clius I have no idea, but look at the IPs of the peers other than the first 3, which are the cluster ones.

The IPs are from either the 192.168.x.x or the 10.x.x.x subnet, which are the class C and class A private network ranges respectively. That means they are definitely not publicly reachable nodes.
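
Reusing the RPC call from above, a quick sketch to list only the publicly routable peers (the grep pattern covers the RFC 1918 private ranges):

# drop peers whose multiaddr starts with a private (RFC 1918) IPv4 address
/docker/nim-waku-v2/rpc.sh get_waku_v2_admin_v1_peers \
  | jq -r '.result[].multiaddr' \
  | grep -vE '^/ip4/(10\.|192\.168\.|172\.(1[6-9]|2[0-9]|3[01])\.)'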

@jakubgs (Member) commented Sep 15, 2021

I've purged the Peer tables on all 3 hosts, restarted them, and re-connected them:

 > a all -a '/docker/nim-waku-v2/rpc.sh get_waku_v2_admin_v1_peers | jq .result[].multiaddr'
node-01.do-ams3.wakuv2.prod | CHANGED | rc=0 >>
"/ip4/34.121.100.108/tcp/30303/p2p/16Uiu2HAmVkKntsECaYfefR1V2yCR79CegLATuTPE6B9TxgxBiiiA"
"/ip4/8.210.222.231/tcp/30303/p2p/16Uiu2HAm4v86W3bmT1BiH6oSPzcsSr24iDQpSN5Qa992BCjjwgrD"
node-01.gc-us-central1-a.wakuv2.prod | CHANGED | rc=0 >>
"/ip4/188.166.135.145/tcp/30303/p2p/16Uiu2HAmL5okWopX7NqZWBUKVqW8iUxCEmd5GMHLVPwCgzYzQv3e"
"/ip4/8.210.222.231/tcp/30303/p2p/16Uiu2HAm4v86W3bmT1BiH6oSPzcsSr24iDQpSN5Qa992BCjjwgrD"
node-01.ac-cn-hongkong-c.wakuv2.prod | CHANGED | rc=0 >>
"/ip4/188.166.135.145/tcp/30303/p2p/16Uiu2HAmL5okWopX7NqZWBUKVqW8iUxCEmd5GMHLVPwCgzYzQv3e"
"/ip4/34.121.100.108/tcp/30303/p2p/16Uiu2HAmVkKntsECaYfefR1V2yCR79CegLATuTPE6B9TxgxBiiiA"

@D4nte commented Sep 21, 2021

@jm-clius Does the 50 peers limit make sense? Should we increase this limit?

@jm-clius (Contributor Author)

@jm-clius Does the 50 peers limit make sense? Should we increase this limit?

The limit here is a maximum of 50 simultaneous connections (I suspect we can store many more peers), which I think makes sense. (@Menduist, have you found that there is a better practical upper limit for connections for, say, Nimbus?) The solution is to have clients connect to nodes other than just the cluster. If this is difficult to get done in a short time, we can increase the connection limit, but I'd prefer that to be a temporary measure unless we have a good understanding of what impact more connections will have.

@jakubgs (Member) commented Sep 21, 2021

@jm-clius so you're saying we'd prefer to have more hosts? If so we can scale it up pretty easily:

hosts_count = 1 /* number of hosts in each DC */

@jm-clius (Contributor Author)

@jm-clius so you're saying we'd prefer to have more hosts? If so we can scale it up pretty easily:

hosts_count = 1 /* number of hosts in each DC */

Thanks! Ideally we'd like people to run and connect to their own nodes, rather than increasing reliance on our cluster.

@Menduist (Collaborator)

You can have as many peers as you want, but that will generally cost more bandwidth & CPU.
For reference, IPFS can have 6000 peers; Nimbus used to be around 300.

The only "surprising" upper limit is that an ISP's router can crash or overheat if you have too many connections, so for software running at home you should keep the peer count to a minimum. For beefy servers it's not an issue.

@jm-clius (Contributor Author)

You can have as many peers as you want, but that will generally cost more bandwidth & CPU.
For reference, IPFS can have 6000 peers; Nimbus used to be around 300.

OK, interesting! Thanks, @Menduist. Perhaps we can then go up quite a bit from the current default of 50. That will alleviate the immediate problems while we encourage further decentralisation.

@D4nte I think we should make this a configurable CLI parameter and run the cluster nodes with a much higher connection limit (probably 300, following Nimbus). This should form part of the next release, so we can get it rolled out to prod.
I've created an issue: waku-org/nwaku#721
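
A hypothetical invocation once such a parameter exists (the --max-connections name and the 300 value are placeholders; the actual flag is to be defined in the linked issue):

# hypothetical: start the node with a raised connection ceiling
wakunode2 --max-connections=300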

@D4nte commented Sep 28, 2021

Issues are being reported as we have a hackathon at the moment.

I can see a number of exceptions in Elastic:

handler exception: 'str' object has no attribute 'decode'

According to novnc/websockify#439, this happens with a "large amount of binary data", which could fit our scenario depending on the messages being sent/received.

The fix has been released in websockify 0.10.0.

We currently use 0.9.0.

Can we please upgrade websockify? @jakubgs @arthurk

As we have a hackathon in progress, any chance this could be done now?
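
For reference, a quick sketch to confirm which websockify version the container is actually running (assuming pip is available inside the image):

# show the installed websockify package version inside the container
d exec websockify python3 -m pip show websockify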

jakubgs added a commit that referenced this issue Sep 28, 2021
Apparently `0.9.0` has an issue with transferring "large amount of binary data".

This results in an exception like this:
```
handler exception: 'str' object has no attribute 'decode'
```

Issues:
* novnc/websockify#439
* #19

Signed-off-by: Jakub Sokołowski <jakub@status.im>
@jakubgs (Member) commented Sep 28, 2021

I've deployed 0.10.0 in this commit: 8e56bef7

So far we see no issues, but will continue to monitor.

@D4nte commented Oct 1, 2021

I've deployed 0.10.0 in this commit: 8e56bef

So far we see no issues, but will continue to monitor.

Thank you for that. The handler exception... issues have disappeared. I've also received feedback from a user that WebSocket connections are now more stable.

@jakubgs (Member) commented Oct 1, 2021

Nice. Feel free to close this issue if you consider it resolved.

Though as I said, native WebSocket support in the node would be much easier to manage and support.

@D4nte commented Oct 5, 2021

Happy to close. @jm-clius, happy too?

@jm-clius (Contributor Author) commented Oct 5, 2021

Happy to close. I'll go ahead. :)

jm-clius closed this as completed on Oct 5, 2021