Kusama Validators Litep2p - Monitoring and Feedback #7076
This is a concern for me: #7077, we should pay attention to this.
And this is the impact. I assume it was because validators restarted to use litep2p, but we should keep an eye on this if it keeps repeating.
Confirmed with paranodes: he had some validators that were constantly restarting, so that was the reason for these finality delays.
And what was the reason for the constant restarting?
Paranode had a script that restarted on low connectivity, which is exactly what #7077 will produce. Nevertheless, even after the script was stopped we are still seeing occasional, lower spikes in finality because of
Did a bit more investigation on this path: for this list of candidates, which are slow to be approved, they induce a finality lag of around ~16 blocks.
For these particular candidates, around 20-30 random validators (different polkadot versions) are no-shows. Those validators aren't no-shows on any other candidate before or after, so it is a one-off for these particular candidates. What these candidates have in common is that all of them (9 of 9) have been backed in a group that contains STKD.IO/01 https://apps.turboflakes.io/?chain=kusama#/validator/5FKStTNJCk5J3EuuYcvJpNn8CxbkzW1J7mst3aayWCT8XrXh, which seems to be one of the nodes that enabled litep2p. So, my theory is that the presence of this node in the backing group might make others slow on availability-recovery, which results in no-shows and finality lag; however, I don't have definitive proof of where this happens. Next
Confirmed STKD.IO/01 runs litep2p; a reboot back to libp2p will happen soon.
STKD.IO/01 was restarted with the litep2p flag around 2025-01-08 04:02:20 (at the start of the log file). It ran and output errors for about 25-30 min, which cleared up around ~2025-01-08 04:30:00. I restarted the service a couple of times at the beginning. The flag was removed 2025-01-14 14:43:47. https://public-logs-stkd.s3.us-west-2.amazonaws.com/extracted-messages.txt If you need any more info or have any questions, let me know.
This PR rejects inbound requests from banned peers (reputation is below the banned threshold). This mirrors the request-response implementation from the libp2p side. I don't expect this to get triggered too often, but we'll monitor this metric. While at it, I have registered a new inbound failure metric to have visibility into this. Discovered during the investigation of: #7076 (comment) cc @paritytech/networking --------- Signed-off-by: Alexandru Vasile <alexandru.vasile@parity.io>
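As an illustration of the mechanism described in that PR, here is a minimal, self-contained sketch of rejecting inbound requests from banned peers and counting the rejections in an inbound-failure metric. The names (`PeerSet`, `BANNED_THRESHOLD`, `InboundFailureMetric`) are hypothetical and do not correspond to the actual substrate/litep2p types.

```rust
use std::collections::HashMap;

/// Hypothetical reputation threshold below which a peer counts as banned.
const BANNED_THRESHOLD: i32 = -100;

/// Hypothetical counter standing in for the new inbound-failure metric.
#[derive(Default)]
struct InboundFailureMetric {
    rejected_banned_peer: u64,
}

#[derive(Default)]
struct PeerSet {
    /// Reputation per peer id; unknown peers default to 0.
    reputation: HashMap<String, i32>,
}

impl PeerSet {
    /// Decide whether an inbound request from `peer` should be accepted.
    /// Requests from banned peers are rejected and the rejection is
    /// recorded in the metric, mirroring the behaviour described above.
    fn accept_inbound_request(&self, peer: &str, metrics: &mut InboundFailureMetric) -> bool {
        let reputation = self.reputation.get(peer).copied().unwrap_or(0);
        if reputation < BANNED_THRESHOLD {
            metrics.rejected_banned_peer += 1;
            return false;
        }
        true
    }
}

fn main() {
    let mut metrics = InboundFailureMetric::default();
    let mut peers = PeerSet::default();
    peers.reputation.insert("banned-peer".to_owned(), -1_000);

    assert!(peers.accept_inbound_request("honest-peer", &mut metrics));
    assert!(!peers.accept_inbound_request("banned-peer", &mut metrics));
    assert_eq!(metrics.rejected_banned_peer, 1);
    println!("rejected {} inbound request(s) from banned peers", metrics.rejected_banned_peer);
}
```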
Triage report from the provided logs, thanks again @Sudo-Whodo 🙏
Found some data on one of our validators running with libp2p: https://grafana.teleport.parity.io/goto/volIfyDHR?orgId=1. It looks like sometimes a node running libp2p gets stuck while fetching the full PoV from a backer running litep2p, and that is causing them to be a no-show.
The receiver is dropped after 2 minutes of waiting for the PoV; normally, fetching from the backer should either succeed or time out after 2 seconds, and if it fails
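To illustrate the expected behaviour described above (fail fast instead of holding the receiver for minutes), here is a small sketch that bounds a full-PoV fetch with an explicit deadline. `fetch_full_pov_from_backer` and the 2-second deadline are assumptions for the example, not the actual availability-recovery code.

```rust
use std::time::Duration;
use tokio::time::{sleep, timeout};

/// Hypothetical stand-in for requesting the full PoV from a backer.
/// It never completes, simulating a stuck peer.
async fn fetch_full_pov_from_backer() -> Result<Vec<u8>, &'static str> {
    sleep(Duration::from_secs(3600)).await;
    Ok(Vec::new())
}

// Requires the tokio crate with the "macros", "rt" and "time" features.
#[tokio::main]
async fn main() {
    // Bound the fetch with a short deadline so a stuck backer cannot hold
    // the receiver for minutes; on timeout, fall back (e.g. to chunk recovery).
    match timeout(Duration::from_secs(2), fetch_full_pov_from_backer()).await {
        Ok(Ok(pov)) => println!("fetched PoV ({} bytes)", pov.len()),
        Ok(Err(e)) => println!("backer returned an error: {e}, falling back"),
        Err(_) => println!("fetch timed out after 2s, falling back to chunk recovery"),
    }
}
```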
Libp2p Issue
There is an issue in libp2p v0.52.4 (on stable2412) with request-response protocols. Our current origin/master runs libp2p v0.54.1, which includes the following fix for the issue. Extracted from the issue's description:
The first attempt to solve the libp2p issue was made in rust-libp2p/pull/5419, which confirms that libp2p was not tracking the request timeout properly. From protocols/request-response/src/handler.rs:
```rust
// If timeout is already reached then there is no need to proceed further.
if message.time.elapsed() >= self.request_timeout {
    self.pending_events
        .push_back(Event::OutboundTimeout(message.request_id));
    return;
}
```

libp2p -> litep2p

It might be possible that litep2p closes the connection to the libp2p node during the substream negotiation, causing a
I suspect this happens because litep2p deems the connection as inactive just before the libp2p node initiates the request. When this happens, the connection is downgraded and then closed due to a lack of substream activity or a keep-alive connection timeout:

This might not happen on the libp2p -> libp2p path because libp2p might have taken a different approach to keep-alive connections. There is also the possibility that litep2p closes the connection because of a yamux error / bug. Litep2p runs a fairly outdated version of the yamux implementation. The following PR aims to improve the stability and performance of the yamux component, bringing it up to date:

Stable2412

This version of libp2p does not track timeouts properly for request-response, causing libp2p to postpone failures of
In the meanwhile, we can patch the substrate request-response by performing periodic checks and cancelling already-timed-out requests, similar to this deployment test PR (a rough sketch of such a periodic check is included after the Appendix below):

Summary

I believe that we are hitting the following edge-case in libp2p, causing the request-response protocol to not track the timeout properly:
The issue happens on libp2p -> litep2p communication, either due to a difference in implementing connection keep-alive mechanisms (paritytech/litep2p#260) or due to running a very outdated yamux multiplexer -- i.e. the component that provides substreams (paritytech/litep2p#256).

Appendix

The request is submitted from availability-recovery:
polkadot-sdk/polkadot/node/network/availability-recovery/src/task/strategy/full.rs Lines 87 to 89 in e889d18
The request arrives next in the polkadot networking bridge: polkadot-sdk/polkadot/node/network/bridge/src/tx/mod.rs Lines 323 to 325 in e889d18
The authority discovery provides the peerID on
The request arrives at substrate's Behaviour for RequestResponse protocols, the point after which the request moves to rust-libp2p: polkadot-sdk/substrate/client/network/src/request_responses.rs Lines 466 to 467 in e889d18
Unconfirmed: There might be a small race between these lines. The behavior reports an outdated view via
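As referenced in the Stable2412 section above, here is a rough sketch of the workaround idea: periodically scan pending outbound requests and cancel any that have exceeded their protocol timeout. The `PendingRequest` type and the timings are made up for illustration and do not mirror the actual substrate patch.

```rust
use std::time::{Duration, Instant};

/// Hypothetical record of an in-flight outbound request.
struct PendingRequest {
    id: u64,
    started: Instant,
    timeout: Duration,
}

/// Remove and report requests that have exceeded their timeout.
/// The returned ids would be failed back to the caller with a timeout error.
fn cancel_timed_out(pending: &mut Vec<PendingRequest>, now: Instant) -> Vec<u64> {
    let mut timed_out = Vec::new();
    pending.retain(|req| {
        if now.duration_since(req.started) >= req.timeout {
            timed_out.push(req.id);
            false
        } else {
            true
        }
    });
    timed_out
}

fn main() {
    let now = Instant::now();
    let mut pending = vec![
        PendingRequest { id: 1, started: now - Duration::from_secs(5), timeout: Duration::from_secs(2) },
        PendingRequest { id: 2, started: now, timeout: Duration::from_secs(2) },
    ];

    // In a real service this check would run on a periodic tick
    // (e.g. every second) inside the network event loop.
    let expired = cancel_timed_out(&mut pending, now);
    assert_eq!(expired, vec![1]);
    assert_eq!(pending.len(), 1);
    println!("force-timed-out requests: {expired:?}");
}
```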
Great investigation @lexnv, I think you nailed it! On top of that, even on the paths where the timeouts work, it seems that the timeouts in stable2412 and before are not respecting the timeout values configured here, because the parameters to these calls are mistakenly reversed in libp2p v0.52 (Inbound request & Outbound request), so you end up with the default 10s timeout for all requests. This seems to be fixed in master with the upgrade to rust-libp2p v0.54.1. So, I second introducing this option in the stable versions.
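As a side illustration of why this class of bug slips through (this is not the libp2p code itself): two parameters of the same `Duration` type can be passed in the wrong order and the code still compiles, only misbehaving at runtime. All names below are invented for the example.

```rust
use std::time::Duration;

/// Hypothetical config holding separate inbound and outbound request timeouts.
struct RequestTimeouts {
    inbound: Duration,
    outbound: Duration,
}

/// Both parameters are plain `Duration`s, so passing them in the wrong
/// order type-checks and only shows up as wrong runtime behaviour.
fn configure(inbound: Duration, outbound: Duration) -> RequestTimeouts {
    RequestTimeouts { inbound, outbound }
}

fn main() {
    let inbound = Duration::from_secs(30);
    let outbound = Duration::from_secs(2);

    // Intended: configure(inbound, outbound). Reversed arguments still compile:
    let misconfigured = configure(outbound, inbound);
    assert_eq!(misconfigured.inbound, Duration::from_secs(2));
    assert_eq!(misconfigured.outbound, Duration::from_secs(30));
    println!("inbound = {:?}, outbound = {:?}", misconfigured.inbound, misconfigured.outbound);
}
```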
Side note: there is still the open question of why litep2p triggers these errors in libp2p, and even with the timeouts fixed we probably want to root-cause the exact reason. So I think once we fix the timeouts, we should enable these debug logs on our validators before we try to re-enable litep2p: https://github.com/libp2p/rust-libp2p/blob/51070dae6395821c5ab45014b7208f15975c9101/protocols/request-response/src/handler.rs#L151
This PR enforces that outbound requests are finished within the specified protocol timeout. The stable2412 version running libp2p 0.52.4 contains a bug which does not track request timeouts properly:
- libp2p/rust-libp2p#5429

The issue has been detected while submitting libp2p -> litep2p requests in kusama. This aims to check that pending outbound requests have not timed out. Although the issue has been fixed in libp2p, there might be other cases where this may happen. For example:
- libp2p/rust-libp2p#5417

For more context see: #7076 (comment)

1. Ideally, the force-timeout mechanism in this PR should never be triggered in production. However, origin/stable2412 occasionally encounters this issue. When this happens, 2 warnings may be generated:
   - one warning introduced by this PR wrt the force timeout terminating the request
   - possibly one warning when libp2p decides (if at all) to provide the response back to substrate (as mentioned by @alexggh [here](https://github.com/paritytech/polkadot-sdk/pull/7222/files#diff-052aeaf79fef3d9a18c2cfd67006aa306b8d52e848509d9077a6a0f2eb856af7L769) and [here](https://github.com/paritytech/polkadot-sdk/pull/7222/files#diff-052aeaf79fef3d9a18c2cfd67006aa306b8d52e848509d9077a6a0f2eb856af7L842))
2. This implementation does not propagate the `RequestFinished { error: .. }` event to the substrate service. That event is only used internally by substrate to increment metrics. However, we don't have the peer information available to propagate the event properly when we force-timeout the request. Considering this should most likely not happen in production (origin/master) and that we'll be able to extract information from the warnings, I would say this is a good tradeoff for code simplicity: https://github.com/paritytech/polkadot-sdk/blob/06e3b5c6a7696048d65f1b8729f16b379a16f501/substrate/client/network/src/service.rs#L1543

### Testing

Added a new test to ensure the timeout is reached properly, even if libp2p does not produce a response in due time. I've also transitioned the tests to using `tokio::test` due to a limitation of [CI](https://github.com/paritytech/polkadot-sdk/actions/runs/12832055737/job/35784043867):

```
--- TRY 1 STDERR: sc-network request_responses::tests::max_response_size_exceeded ---
thread 'request_responses::tests::max_response_size_exceeded' panicked at /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.40.0/src/time/interval.rs:139:26:
there is no reactor running, must be called from the context of a Tokio 1.x runtime
```

cc @paritytech/networking

---------

Signed-off-by: Alexandru Vasile <alexandru.vasile@parity.io>
Co-authored-by: Bastian Köcher <git@kchr.de>
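For context on the test change mentioned above: `tokio::time` primitives such as `interval` require a running Tokio reactor, which `#[tokio::test]` provides. A minimal sketch, unrelated to the actual sc-network tests:

```rust
use std::time::Duration;

// Requires the tokio crate with the "macros", "rt" and "time" features enabled.
#[tokio::test]
async fn timer_needs_a_tokio_reactor() {
    // tokio::time::interval panics with "there is no reactor running" when
    // called outside a Tokio runtime; #[tokio::test] sets that runtime up.
    let mut tick = tokio::time::interval(Duration::from_millis(10));
    tick.tick().await; // the first tick completes immediately
}
```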
This is a placeholder issue for the community (kusama validators) to share their feedback, monitoring and logs.
We’re excited to announce the next step in improving the Kusama network with the introduction of litep2p—a more resource-efficient network backend. We need your help to make this transition successful!
Enable Litep2p Backend
We’re gradually rolling out litep2p across all validators. Here’s how you can help:
Rollout Plan
Monitoring & Feedback
Please keep an eye on your node after restarting and report any warnings or errors you encounter. In the first 15–30 minutes after the restart, you may see some temporary warnings, such as:
We'd like to pay special attention to at least the following metrics:
Tasks