
Mamaki testnet tendermint issues #457

Closed · mindstyle85 opened this issue May 29, 2022 · 8 comments

@mindstyle85 commented May 29, 2022

Starting to gather info on Mamaki here, so that all info that could potentially help with debugging is in one place.

  1. Mamaki started with only the new p2p system, where the genesis validators were on version 0.5.0 (using tendermint 0.35.4 and cosmos-sdk v0.46.beta2), had the consensus timeout (timeout-commit) set to 25s, and used one or two bootstrap-peers and about 10 persistent-peers. Tendermint worked ok.
  2. After publicly announcing Mamaki, other users started to join with v0.5.2, which had slight changes to the consensus params; they also used only the new p2p system with persistent-peers and bootstrap-peers. This is where the network started to get flaky.
  3. There were several attempts to solve it, with the end result that the genesis validators and a few others had these settings:

     timeout-commit = "25s"

     # these three are now different across validators; many have lowered them a bit:
     max-connections = 250
     max-num-inbound-peers = 180
     max-num-outbound-peers = 70

     timeout-propose = "3s"
     skip-timeout-commit = false

     [p2p]
     # Enable the legacy p2p layer.
     use-legacy = true

Lastly, the persistent-peers have been added back as a result of enabling the old p2p system. This should mean we are now running both the new and the old p2p systems, which are supposedly interoperable. A consolidated sketch of these settings follows below.
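For reference, here is a minimal consolidated sketch of where these keys live in a tendermint v0.35-style config.toml. The section grouping follows the upstream defaults; the peer entries are placeholders, and the exact values vary per validator as noted above:

    [consensus]
    timeout-commit = "25s"
    timeout-propose = "3s"
    skip-timeout-commit = false

    [p2p]
    # Run the legacy p2p/PEX stack (expected to interoperate with the new one).
    use-legacy = true
    max-connections = 250
    max-num-inbound-peers = 180
    max-num-outbound-peers = 70
    # Peer lists are comma-separated nodeid@host:port entries (placeholders here).
    bootstrap-peers = "<nodeid>@<host>:26656"
    persistent-peers = "<nodeid>@<host>:26656,<nodeid>@<host>:26656"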

The network has these issues/symptoms at the moment:

  • up to 20 rounds before consensus

    [screenshot: consensus rounds]

  • different step durations

    [screenshot: step durations]

  • if disabling the old p2p system again, nodes have issues syncing/finding peers
  • 5 or more blocks in a row made by the same validator (it even happened with a validator with voting power 20, which is essentially 0% of the network)
  • peer overflow issue:

    ERR failed to process message ch_id=0 envelope={"Broadcast":false,"From":"fcff172744c51684aaefc6fd3433eae275a2f31b","Message":{"addresses":[{"id":"30... (long list) ...  err="peer sent too many addresses (max: 100, got: 230)"
    ERR peer error, evicting err="peer sent too many addresses (max: 100, got: 230)" module=p2p peer=fcff172744c51684aaefc6fd3433eae275a2f31b

[screenshots: peer count, rejected peer]

I probably missed something too, so feel free to add to this, as well as potential solutions or reasons why this could be happening.
One solution I saw is to set filter-peers = true to stop the peer overflow, but I am not sure that is a long-term solution here.
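As context for that option: filter-peers is a top-level config.toml setting that makes tendermint query the ABCI application when connecting to a new peer, so it only helps if the application actually implements that filtering. A minimal sketch, assuming the v0.35-style config layout:

    # Top-level section of config.toml.
    # If true, query the ABCI app when connecting to a new peer
    # so the app can decide whether to keep the connection.
    filter-peers = true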

@Northa commented May 30, 2022

In my case the node gets stuck with a pong timeout error.

[screenshot: pong timeout error]

Tried (without success):

  • increasing the pong timeout to 270
  • setting the pong timeout a bit lower than the ping timeout

Observations:

  1. Tested with use-legacy = true and filter-peers = true/false

Once a node freezes, some endpoints also get stuck (not all of them), and as you can see, CPU usage increases drastically.

  • :26657 - /status and /dump_consensus_state not working
  • :1317 - /node_info not working
  • :9090 - gRPC GetNodeInfo not working
  • the rest of the endpoints seem to work fine (not all of :1317 and :9090 were tested)

[screenshot: CPU usage]

  2. Tested with use-legacy = false

  • Node panics if queue-type = wdrr and prometheus = true

    [screenshot: panic output]

  • With queue-type = wdrr and prometheus = false, the node starts but CPU utilization is at 1000%.

    [screenshot: CPU utilization]
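For reference, the two settings involved in that combination live in different sections of config.toml. A minimal sketch of what was being tested, assuming the v0.35-style layout (wdrr is one of the queue types accepted by the new p2p router, alongside the default priority queue):

    [p2p]
    use-legacy = false
    # Queue type used by the new p2p router.
    queue-type = "wdrr"

    [instrumentation]
    # Turning this off avoided the panic above, at the cost of losing metrics.
    prometheus = false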

@evan-forbes (Member) commented:

This appears to be caused by a few different simultaneous issues. The original issue stemmed from a configuration bug in tendermint that wasn't allowing peers to be pruned. When peers would begin to build up, the consensus reactor state would get caught on some mutexes. This was fixed by simply reducing the max peer count during the initialization of the node. tendermint/tendermint#8684

There was also an issue with the node getting stuck in a loop when catching up from peers. That was contributing to the issues above, as well as causing #453. tendermint/tendermint#8651

The issues were compounded when we switched a portion of the network to the legacy p2p network. This caused multiple issues, as tendermint's legacy PEX was not fully compatible with the new PEX in a network with many peers. It would cause peers using the legacy PEX to get evicted by, and not exchange peers with, those using the new PEX. tendermint/tendermint#8657 We made an additional change to our PEX that allows legacy peers to send up to 250 peers instead of 100, so as not to arbitrarily block legacy peers, which was resulting in nodes not being able to sync from scratch.

@evan-forbes (Member) commented:

Before we debug further stability issues with the p2p network, we will want to issue an upgrade to fix as many of the p2p issues as possible.

@packetstracer commented:

Tried the following changes with mostly no success in our case:

  • both modes of use-legacy (false | true)
  • limiting connections by setting low values for max-connections, max-num-inbound-peers, max-num-outbound-peers, and max-incoming-connection-attempts
  • filtering peers with filter-peers = true

The service ends up getting clogged by incoming connections and takes a while even to restart; CPU (35% on 6 cores) and memory (4.4 GB) usage seems high, but not that high.


The only thing that seems to be working is changing the p2p and RPC ports, and even with that the setup is quite unstable in terms of block production.
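For anyone trying the same workaround, the listen addresses live under [rpc] and [p2p] in config.toml. A minimal sketch with example port numbers (the ports shown are assumptions for illustration, not the ones actually used here):

    [rpc]
    # Default is tcp://127.0.0.1:26657.
    laddr = "tcp://127.0.0.1:36657"

    [p2p]
    # Default is tcp://0.0.0.0:26656.
    laddr = "tcp://0.0.0.0:36656"
    # If an external address is advertised, it should use the new port too.
    external-address = "<public-ip>:36656"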

@evan-forbes (Member) commented:

@packetstracer have you tried using the evan/profile-constantly branch? It has a patch to use the latest from tendermint. Provided max connections are limited to around 80, I have not had to restart any of my nodes with use-legacy=false.

The fix you described of changing ports makes sense. As discussed, if peers aren't pruned, then the nodes fall behind. Changing ports would cause all other peers attempting to dial your address to fail, and therefore limit the number of peers connected.
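A minimal sketch of the combination described above, assuming the v0.35-style config layout; 80 is just the ballpark mentioned, not a tuned recommendation:

    [p2p]
    use-legacy = false
    # Keep the total connection count low enough that peers are actually pruned.
    max-connections = 80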

@packetstracer commented:

> @packetstracer have you tried using the evan/profile-constantly branch? It has a patch to use the latest from tendermint. Provided max connections are limited to around 80, I have not had to restart any of my nodes with use-legacy=false.
>
> The fix you described of changing ports makes sense. As discussed, if peers aren't pruned, then the nodes fall behind. Changing ports would cause all other peers attempting to dial your address to fail, and therefore limit the number of peers connected.

Not yet @evan-forbes, will compile it tomorrow, give it a try and report the outcome.

@packetstracer commented:

> @packetstracer have you tried using the evan/profile-constantly branch? It has a patch to use the latest from tendermint. Provided max connections are limited to around 80, I have not had to restart any of my nodes with use-legacy=false.
>
> The fix you described of changing ports makes sense. As discussed, if peers aren't pruned, then the nodes fall behind.

Seems like the fix on that branch is working perfectly: no blocks missed, and no issues with hardware performance either 👌

Using the same config as described by Mzonder in this post: https://discord.com/channels/638338779505229824/979040398376849479/983613749694980186


@packetstracer commented:

The issues came back again; I needed to restart the node.
