Mamaki testnet tendermint issues #457
In my case the node is getting stuck with a pong timeout error. Tried (without success):

Observations: once a node freezes, some endpoints (not all) also get stuck, and as you can see CPU usage increases drastically.
This appears to be caused by a few different simultaneous issues.

The original issue stemmed from a configuration bug in tendermint that wasn't allowing peers to be pruned. When peers would begin to build up, the consensus reactor state would get caught on some mutexes. This was fixed by simply reducing the max peer count during the initialization of the node. tendermint/tendermint#8684

There was also an issue with the node getting stuck in a loop when catching up from peers, which was contributing to the issues above.

The issues were compounded when we switched a portion of the network to the legacy p2p network. This caused multiple problems, as tendermint's legacy PEX was not fully compatible with the new PEX in a network with many peers: peers using the legacy PEX would get evicted and would not exchange peers with those using the new PEX. tendermint/tendermint#8657

We made an additional change to our PEX that allows legacy peers to send up to 250 peers instead of 100, so as not to arbitrarily block legacy peers, which was resulting in nodes not being able to sync from scratch.
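For anyone who wants to lower the peer count themselves while waiting for a release: the limits live in the `[p2p]` section of `config.toml`. A hedged sketch, assuming tendermint v0.35-style key names (the legacy and new p2p stacks use different keys, and the values here are illustrative, not recommendations):

```toml
[p2p]
# New p2p stack: a single cap on total connections
max-connections = 40

# Legacy p2p stack uses separate inbound/outbound limits instead:
# max-num-inbound-peers = 20
# max-num-outbound-peers = 10
```

Lowering these reduces how many peers the consensus reactor has to service, which is what the fix in tendermint/tendermint#8684 effectively does at initialization.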
Before we debug further stability issues with the p2p network, we will want to issue an upgrade to fix as many of the p2p issues as possible.
Tried the following changes with mostly no success in our case:
The service ends up getting clogged by incoming connections, and takes a while even to restart (CPU (35% on 6 cores) and memory (4.4 GB) usage seem high, but not that high). The only thing that seems to work is changing the p2p and rpc ports, and even with that the node is quite unstable in terms of block production.
@packetstracer have you tried the branch with the fix? The workaround you described of changing ports makes sense. As discussed, if peers aren't pruned then the nodes fall behind. Changing ports causes all other peers attempting to dial your address to fail, and therefore limits the number of peers connected.
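For anyone trying the port-change workaround: both listen addresses are set in `config.toml`. A sketch assuming default tendermint ports (the new port numbers below are arbitrary examples; pick any free ports on your host):

```toml
[rpc]
# default is tcp://127.0.0.1:26657
laddr = "tcp://127.0.0.1:26667"

[p2p]
# default is tcp://0.0.0.0:26656; peers that cached the old
# address keep dialing the old port and fail, shedding load
laddr = "tcp://0.0.0.0:26666"
```

Remember to update any externally advertised address and firewall rules to match the new p2p port.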
Not yet @evan-forbes, will compile it tomorrow, give it a try and report the outcome. |
The fix on that branch seems to be working perfectly: no blocks missed, and no issues with hardware performance either 👌 Using the same config as described by Mzonder in this post https://discord.com/channels/638338779505229824/979040398376849479/983613749694980186
Starting to gather info on Mamaki here, so that all info that could potentially help with debugging is in one place.
timeout-commit = "25s"
these three are now different across validators; many have lowered them a bit:
Lastly, the persistent-peers have been added back due to enabling the old p2p system. This should mean we are now running both the new and old p2p systems, which are supposedly interoperable.
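As a reference for the setup described above, running the old p2p system alongside the new one is toggled in `config.toml`. A hedged sketch assuming tendermint v0.35-style keys (the peer entries are placeholders, not real addresses):

```toml
[p2p]
# run the legacy p2p stack instead of the new one
use-legacy = true

# persistent peers are needed again once the old stack is enabled;
# format is <node-id>@<host>:<port>, comma-separated (placeholders below)
persistent-peers = "nodeid1@peer1.example.com:26656,nodeid2@peer2.example.com:26656"
```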
The network has these issues/symptoms at the moment:
- different step durations
- if the old p2p system is disabled again, nodes have issues syncing/finding peers
- 5 or more blocks in a row made by the same validator (it even happened with a validator with voting power 20, which is essentially 0% of the network)
- peer overflow issue:
I probably missed something too, so feel free to add to this, as well as potential solutions or explanations for why this is happening.
One solution I saw is to set filter-peers = true to stop the peer overflow, but I am not sure that is a long-term solution.
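For context, filter-peers is a top-level option in `config.toml`, not under `[p2p]` (in v0.34 it is spelled `filter_peers`). When enabled, the node asks the ABCI application to accept or reject each new peer, which is why it can cap the overflow but also adds a dependency on the app answering those queries quickly:

```toml
# top-level option in config.toml (not in the [p2p] section);
# the node queries the ABCI app (paths like /p2p/filter/...) before
# accepting each inbound peer
filter-peers = true
```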