
Mamaki testnet tendermint issues #457

Closed · mindstyle85 opened this issue May 29, 2022 · 8 comments

@mindstyle85 commented May 29, 2022

Starting to gather info on Mamaki here, so that all info that could potentially help with debugging is in one place.

  1. Mamaki started with only the new p2p system, where the genesis validators were on version 0.5.0 (using tendermint 0.35.4 and cosmos-sdk v0.46.beta2), had the consensus timeout (timeout-commit) set to 25s, and used one or two bootstrap-peers and about 10 persistent-peers. Tendermint worked ok.
  2. After publicly announcing Mamaki, other users started to join with v0.5.2, which had slight changes to the consensus params; they also used only the new p2p system with persistent-peers and bootstrap-peers. This is where the network started to get flaky.
  3. There were several attempts to solve it, with the end result that the genesis validators and a few others had these settings:

     timeout-commit = "25s"

     # these three are now different across validators; many have lowered them a bit:
     max-connections = 250
     max-num-inbound-peers = 180
     max-num-outbound-peers = 70

     timeout-propose = "3s"
     skip-timeout-commit = false

     [p2p]
     # Enable the legacy p2p layer.
     use-legacy = true

Lastly, the persistent-peers have been added back as a result of enabling the old p2p system. This should mean we are now running both the new and the old p2p systems, which are supposedly interoperable. A consolidated sketch of these settings follows below.
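For reference, here is a minimal consolidated sketch of where these keys live in a tendermint v0.35-style config.toml. The section grouping follows the upstream defaults; the peer entries are placeholders, and the exact values vary per validator as noted above:

    [consensus]
    timeout-commit = "25s"
    timeout-propose = "3s"
    skip-timeout-commit = false

    [p2p]
    # Run the legacy p2p/PEX stack (expected to interoperate with the new one).
    use-legacy = true
    max-connections = 250
    max-num-inbound-peers = 180
    max-num-outbound-peers = 70
    # Peer lists are comma-separated nodeid@host:port entries (placeholders here).
    bootstrap-peers = "<nodeid>@<host>:26656"
    persistent-peers = "<nodeid>@<host>:26656,<nodeid>@<host>:26656"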

The network has these issues/symptoms at the moment:

  • up to 20 rounds before consensus

    [screenshot: consensus rounds]

  • different step durations

    [screenshot: step durations]

  • if disabling the old p2p system again, nodes have issues syncing/finding peers
  • 5 or more blocks in a row made by the same validator (it even happened with a validator with voting power 20, which is essentially 0% of the network)
  • peer overflow issue:

    ERR failed to process message ch_id=0 envelope={"Broadcast":false,"From":"fcff172744c51684aaefc6fd3433eae275a2f31b","Message":{"addresses":[{"id":"30... (long list) ...  err="peer sent too many addresses (max: 100, got: 230)"
    ERR peer error, evicting err="peer sent too many addresses (max: 100, got: 230)" module=p2p peer=fcff172744c51684aaefc6fd3433eae275a2f31b

[screenshots: peer count, rejected peer]

I probably missed something too, so feel free to add to this, as well as potential solutions or reasons why this could be happening.
One solution I saw is to set filter-peers = true to stop the peer overflow, but I am not sure that is a long-term solution here.
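As context for that option: filter-peers is a top-level config.toml setting that makes tendermint query the ABCI application when connecting to a new peer, so it only helps if the application actually implements that filtering. A minimal sketch, assuming the v0.35-style config layout:

    # Top-level section of config.toml.
    # If true, query the ABCI app when connecting to a new peer
    # so the app can decide whether to keep the connection.
    filter-peers = true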

@Northa commented May 30, 2022

In my case the node gets stuck with a pong timeout error.

[screenshot: pong timeout error]

Tried (without success):

  • increasing the pong timeout to 270
  • setting the pong timeout a bit lower than the ping timeout

Observations:

  1. Tested with use-legacy = true and filter-peers = true/false

Once a node freezes, some endpoints also get stuck (not all of them), and as you can see, CPU usage increases drastically.

  • :26657 - /status and /dump_consensus_state not working
  • :1317 - /node_info not working
  • :9090 - gRPC GetNodeInfo not working
  • the rest of the endpoints seem to work fine (not all of :1317 and :9090 were tested)

[screenshot: CPU usage]

  2. Tested with use-legacy = false

  • Node panics if queue-type = wdrr and prometheus = true

    [screenshot: panic output]

  • With queue-type = wdrr and prometheus = false, the node starts but CPU utilization is at 1000%.

    [screenshot: CPU utilization]
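For reference, the two settings involved in that combination live in different sections of config.toml. A minimal sketch of what was being tested, assuming the v0.35-style layout (wdrr is one of the queue types accepted by the new p2p router, alongside the default priority queue):

    [p2p]
    use-legacy = false
    # Queue type used by the new p2p router.
    queue-type = "wdrr"

    [instrumentation]
    # Turning this off avoided the panic above, at the cost of losing metrics.
    prometheus = false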

@evan-forbes (Member) commented:

This appears to be caused by a few different simultaneous issues. The original issue stemmed from a configuration bug in tendermint that wasn't allowing peers to be pruned. When peers would begin to build up, the consensus reactor state would get caught on some mutexes. This was fixed by simply reducing the max peer count during the initialization of the node. tendermint/tendermint#8684

There was also an issue with the node getting stuck in a loop when catching up from peers. That was contributing to the issues above, as well as causing #453. tendermint/tendermint#8651

The issues were compounded when we switched a portion of the network to the legacy p2p network. This caused multiple issues, as tendermint's legacy PEX was not fully compatible with the new PEX in a network with many peers. It would cause peers using the legacy PEX to get evicted by, and not exchange peers with, those using the new PEX. tendermint/tendermint#8657 We made an additional change to our PEX that allows legacy peers to send up to 250 peers instead of 100, so as not to arbitrarily block legacy peers, which was resulting in nodes not being able to sync from scratch.

@evan-forbes (Member) commented:

Before we debug further stability issues with the p2p network, we will want to issue an upgrade to fix as many of the p2p issues as possible.

@packetstracer commented:

Tried the following changes with mostly no success in our case:

  • both modes of use-legacy (false | true)
  • limiting connections by setting low values for max-connections, max-num-inbound-peers, max-num-outbound-peers, and max-incoming-connection-attempts
  • filtering peers with filter-peers = true

The service ends up getting clogged by incoming connections and takes a while even to restart; CPU (35% on 6 cores) and memory (4.4 GB) usage seems high, but not that high.


The only thing that seems to be working is changing the p2p and RPC ports, and even with that the setup is quite unstable in terms of block production.
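For anyone trying the same workaround, the listen addresses live under [rpc] and [p2p] in config.toml. A minimal sketch with example port numbers (the ports shown are assumptions for illustration, not the ones actually used here):

    [rpc]
    # Default is tcp://127.0.0.1:26657.
    laddr = "tcp://127.0.0.1:36657"

    [p2p]
    # Default is tcp://0.0.0.0:26656.
    laddr = "tcp://0.0.0.0:36656"
    # If an external address is advertised, it should use the new port too.
    external-address = "<public-ip>:36656"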

@evan-forbes (Member) commented:

@packetstracer have you tried using the evan/profile-constantly branch? It has a patch to use the latest from tendermint. Provided max connections are limited to around 80, I have not had to restart any of my nodes with use-legacy=false.

The fix you described of changing ports makes sense. As discussed, if peers aren't pruned, then the nodes fall behind. Changing ports would cause all other peers attempting to dial your address to fail, and therefore limit the number of peers connected.
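A minimal sketch of the combination described above, assuming the v0.35-style config layout; 80 is just the ballpark mentioned, not a tuned recommendation:

    [p2p]
    use-legacy = false
    # Keep the total connection count low enough that peers are actually pruned.
    max-connections = 80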

@packetstracer commented:

> @packetstracer have you tried using the evan/profile-constantly branch? It has a patch to use the latest from tendermint. Provided max connections are limited to around 80, I have not had to restart any of my nodes with use-legacy=false.
>
> The fix you described of changing ports makes sense. As discussed, if peers aren't pruned, then the nodes fall behind. Changing ports would cause all other peers attempting to dial your address to fail, and therefore limit the number of peers connected.

Not yet @evan-forbes, will compile it tomorrow, give it a try and report the outcome.

@packetstracer commented:

> @packetstracer have you tried using the evan/profile-constantly branch? It has a patch to use the latest from tendermint. Provided max connections are limited to around 80, I have not had to restart any of my nodes with use-legacy=false.
>
> The fix you described of changing ports makes sense. As discussed, if peers aren't pruned, then the nodes fall behind.

Seems like the fix on that branch is working perfectly: no blocks missed, and no issues with hardware performance either 👌

Using the same config as described by Mzonder in this post: https://discord.com/channels/638338779505229824/979040398376849479/983613749694980186


@packetstracer commented:

The issues came back again; I needed to restart the node.
