Kusama Validators Litep2p - Monitoring and Feedback #7076

Open · lexnv opened this issue Jan 7, 2025 · 12 comments

lexnv (Contributor) commented Jan 7, 2025

This is a placeholder issue for the community (Kusama validators) to share their feedback, monitoring data, and logs.

We’re excited to announce the next step in improving the Kusama network with the introduction of litep2p—a more resource-efficient network backend. We need your help to make this transition successful!

Enable Litep2p Backend

We’re gradually rolling out litep2p across all validators. Here’s how you can help:

  1. Ensure you're running the latest Polkadot release (version 2412 or newer).
  2. Restart your node with the following flag (a full example command is shown below):

--network-backend litep2p
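
For reference, a complete invocation might look like the sketch below; the binary path, chain, and node name are placeholders for your own setup, and the only new flag this rollout requires is `--network-backend litep2p`:

```bash
# Illustrative example only -- substitute your own binary path, name, and existing flags.
polkadot \
  --chain kusama \
  --validator \
  --name "<your-node-name>" \
  --network-backend litep2p
```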

Rollout Plan

  • Phase 1: We need around 100 validators to start the transition.
  • Phase 2 (in a few days): Increase to 500 validators running litep2p.
  • Phase 3: Full rollout—inviting all validators to switch.

Monitoring & Feedback

Please keep an eye on your node after restarting and report any warnings or errors you encounter. In the first 15–30 minutes after the restart, you may see some temporary warnings, such as:

    Some network error occurred when fetching erasure chunk
    Low connectivity

We'd like to pay special attention to at least the following metrics (a quick way to check them from the command line is shown after this list):

  • Sync peers (substrate_sync_peers)
  • Block height (substrate_block_height)
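
As a quick sanity check after restarting, these metrics can be scraped directly from the node's Prometheus endpoint. The sketch below assumes the default exporter port 9615; adjust it if you run a custom `--prometheus-port`:

```bash
# Assumes the default Substrate Prometheus exporter on 127.0.0.1:9615.
curl -s http://127.0.0.1:9615/metrics | grep -E 'substrate_sync_peers|substrate_block_height'
```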

lexnv added this to Networking on Jan 7, 2025
alexggh (Contributor) commented Jan 7, 2025

This is a concern for me: #7077. We should pay attention to it.

alexggh (Contributor) commented Jan 8, 2025

> This is a concern for me: #7077. We should pay attention to it.

And this is the impact. I assume it was because validators restarted to use litep2p, but we should keep an eye on this in case it keeps repeating.

(screenshot attached: impact observed on 2025-01-08 at 11:50:49)

https://grafana.teleport.parity.io/goto/-FyJtnvNg?orgId=1

alexggh (Contributor) commented Jan 9, 2025

> This is a concern for me: #7077. We should pay attention to it.
>
> And this is the impact. I assume it was because validators restarted to use litep2p, but we should keep an eye on this in case it keeps repeating.

Confirmed with paranodes: he had some validators that were constantly restarting, which was the reason for these finality delays.

eskimor (Member) commented Jan 10, 2025

And what was the reason for constantly restarting?

alexggh (Contributor) commented Jan 10, 2025

> And what was the reason for constantly restarting?

Paranodes had a script that restarted on low connectivity, which is exactly what #7077 will produce.

Nevertheless, even after the script was stopped we are still seeing occasional spikes in finality lag because of no-shows on around ~20 validators. I'm working with @lexnv to understand what might cause that, because it correlates perfectly with the enablement of litep2p on validators.

alexggh (Contributor) commented Jan 14, 2025

> And what was the reason for constantly restarting?
>
> Paranodes had a script that restarted on low connectivity, which is exactly what #7077 will produce.
>
> Nevertheless, even after the script was stopped we are still seeing occasional spikes in finality lag because of no-shows on around ~20 validators. I'm working with @lexnv to understand what might cause that, because it correlates perfectly with the enablement of litep2p on validators.

I did a bit more investigation on this path. The following candidates were slow to be approved and induced a finality lag of around ~16 blocks:

0x8fe297cb881a48611829b911b9dfc4c176d5a540b5fd0ab4a2114b6b65e04d71
0x16a13885e900a4afc18d1689a0197602ac64176d70ccd8608f9a260de3b3a22e
0x869c13acf8857fc2df09bd3991cfda24f81b639373bdc68ba7a10436940c4a4d
0x83a323ec47de77df87e9e2dfc49650169622fd641304e62f8f83db085fc1822c
0x7881ad19ad2cd502c7af16313bf569c9c5ac5d42cd47b9b2bede8df71f63cd56
0xb062a7c221fcd1ecfe33f1aabc1b48abf3949a9ab09713e8ca6a86800388103c
0xd233f3836a3a4bc7703c7b211d8da49bd7455e7b2d264b6cfd053f42e73948f7
0x8f811f8b2d246badb748a46156fc0a992316616e36f8779f32a0606853f68df8
0x477c4030533603182240063e80b1169fac6674bdd1b207c719d5363b231a28ba
0x756973880ea6a9a08d3ab881ef7e908a4779c5c2f1d4e4aa3f01f2c2510171ec

For these particular candidates, around 20-30 random validators (running different polkadot versions) are no-shows. Those validators aren't no-shows on any other candidate before or after, so it is a one-off for these particular candidates.

What these candidates have in common is that all of them (9 of 9) have been backed in a group that contains STKD.IO/01 (https://apps.turboflakes.io/?chain=kusama#/validator/5FKStTNJCk5J3EuuYcvJpNn8CxbkzW1J7mst3aayWCT8XrXh), which seems to be one of the nodes that enabled litep2p.

So, my theory is that the presence of this node in the backing group might make others slow on availability-recovery, which results in no-shows and finality lag; however, I don't have definitive proof of where this happens.

Next

  • Confirm STKD.IO/01 is rolled back from litep2p to the default networking backend and that the occasional lag goes away.
  • Root-cause why the above happens

lexnv (Contributor, Author) commented Jan 14, 2025

Confirmed STKD.IO/01 runs litep2p; a reboot to libp2p will happen soon.

Sudo-Whodo (Contributor) commented:
STKD.IO/01 was restarted with the litep2p flag around 2025-01-08 04:02:20 (at the start of the log file). It ran and output errors for about 25-30 minutes, clearing up around 2025-01-08 04:30:00. I restarted the service a couple of times at the beginning. The flag was removed on 2025-01-14 14:43:47.

https://public-logs-stkd.s3.us-west-2.amazonaws.com/extracted-messages.txt

If you need any more info or have any questions let me know.

github-merge-queue bot pushed a commit that referenced this issue Jan 15, 2025
This PR rejects inbound requests from banned peers (reputation is below
the banned threshold).

This mirrors the request-response implementation from the libp2p side.
I don't expect this to get triggered too often, but we'll monitor this
metric.

While at it, I have registered a new inbound failure metric to have
visibility into this.

Discovered during the investigation of:
#7076 (comment)

cc @paritytech/networking

---------

Signed-off-by: Alexandru Vasile <alexandru.vasile@parity.io>
github-actions bot pushed the same commit to release branches on Jan 15, 2025 (cherry picked from commit ef064a3).
lexnv (Contributor, Author) commented Jan 15, 2025

Triage report from the provided logs, thanks again @Sudo-Whodo 🙏

| Repo | Count | Level | Triage report |
| --- | --- | --- | --- |
| https://github.com/paritytech/polkadot-sdk/ | 13399 | warn_if_frequent | Some network error occurred when fetching erasure chunk |
| https://github.com/paritytech/polkadot-sdk/ | 1238 | warn | Report .: . to .. Reason: .. Banned, disconnecting. ( Same block request multiple times. Banned, disconnecting.) |
| https://github.com/paritytech/polkadot-sdk/ | 805 | warn | Recovery of available data failed. |
| https://github.com/paritytech/polkadot-sdk/ | 805 | warn | Data unavailable for candidate .* |
| https://github.com/paritytech/polkadot-sdk/ | 325 | warn | Report .: . to .. Reason: .. Banned, disconnecting. ( A collator provided a collation for the wrong para. Banned, disconnecting.) |
| https://github.com/paritytech/polkadot-sdk/ | 120 | warn | fetch_pov_job |
| https://github.com/paritytech/polkadot-sdk/ | 108 | warn | Low connectivity - authority lookup failed for too many validators. |
| https://github.com/paritytech/polkadot-sdk/ | 11 | warn | Importing locally an already known assignment |
| https://github.com/paritytech/polkadot-sdk/ | 4 | error | Protocol controllers receiver stream has returned None. Ignore this error if the node is shutting down. |
| https://github.com/paritytech/polkadot-sdk/ | 4 | warn | Report .: . to .. Reason: .. Banned, disconnecting. ( A collator advertising a collation for an async backing relay parent using V1. Banned, disconnecting.) |
| https://github.com/paritytech/polkadot-sdk/ | 3 | error | 💔 Called on_validated_block_announce with a bad peer ID .* |
| https://github.com/paritytech/polkadot-sdk/ | 1 | warn | 💔 Error importing block .: . ( Parent block of 0x6da5…6146 has no associated weight) |
| https://github.com/paritytech/polkadot-sdk/ | 1 | warn | .: . is already a reserved peer |
| https://github.com/paritytech/litep2p/ | 1 | warn | Refusing to add known address that corresponds to a different peer ID |
| https://github.com/paritytech/polkadot-sdk/ | 1 | warn | 💔 Verification failed for block ..: .* |

alexggh (Contributor) commented Jan 16, 2025

Found some data on one of our validators running with libp2p: https://grafana.teleport.parity.io/goto/volIfyDHR?orgId=1. It looks like a node running libp2p sometimes gets stuck while fetching the full PoV from a backer running litep2p, and that causes it to be a no-show in approval-voting.


2025-01-12 23:53:24.508 parachain::availability-recovery: Starting `Full recovery from backers` strategy candidate_hash=0x5830d4fd4b920c81c3db20db9c1714281dc01afa1a69138dff6bf061588a2d40

2025-01-12 23:55:24.493 parachain::availability-recovery: Receiver for available data dropped. candidate_hash=0x5830d4fd4b920c81c3db20db9c1714281dc01afa1a69138dff6bf061588a2d40 

2025-01-12 23:55:24.493 DEBUG parachain::availability-recovery: All receivers for available data dropped. candidate_hash=0x5830d4fd4b920c81c3db20db9c1714281dc01afa1a69138dff6bf061588a2d40

The receiver is dropped after 2 minutes of waiting for the PoV. Normally, fetching from the backer should either succeed or time out after 2 seconds, and if it fails, recovery from chunks should be started; but in this case that doesn't happen. A minimal sketch of the expected flow is shown below.
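
For illustration, a rough sketch of that expected flow, assuming hypothetical helpers (`fetch_pov_from_backer`, `recover_from_chunks`) rather than the actual availability-recovery code:

```rust
// Illustrative sketch only: the helper functions are hypothetical stand-ins,
// not the real polkadot-sdk availability-recovery API.
use std::time::Duration;
use tokio::time::timeout;

async fn fetch_pov_from_backer(_candidate: [u8; 32]) -> Result<Vec<u8>, &'static str> {
    Err("backer request failed") // placeholder
}

async fn recover_from_chunks(_candidate: [u8; 32]) -> Result<Vec<u8>, &'static str> {
    Ok(Vec::new()) // placeholder
}

async fn recover_available_data(candidate: [u8; 32]) -> Result<Vec<u8>, &'static str> {
    // Fetching the full PoV from a backer is expected to succeed or fail within ~2 seconds.
    match timeout(Duration::from_secs(2), fetch_pov_from_backer(candidate)).await {
        Ok(Ok(pov)) => Ok(pov),
        // On error *or* timeout, fall back to chunk recovery instead of waiting
        // minutes for the response receiver to be dropped.
        _ => recover_from_chunks(candidate).await,
    }
}

#[tokio::main]
async fn main() {
    let _ = recover_available_data([0u8; 32]).await;
}
```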

lexnv (Contributor, Author) commented Jan 16, 2025

Libp2p Issue

  1. The AvailableDataFetchingV1 request is initiated from polkadot with the option IfDisconnected::ImmediateError (i.e., return an error immediately if the remote peer is not connected to us).

  2. Libp2p notifies the connection handler about this request.

  3. The libp2p handler cannot obtain a new outbound substream for the request because of a SubstreamUpgradeError::Io error.

There is an issue in libp2p v0.52.4 (on stable2412) with request-response protocols hitting SubstreamUpgradeError::Io. This can cause potential infinite loops and delay the propagation of the request failure.

Our current origin/master runs libp2p v0.54.1, which includes a fix for the issue.

Extracted from the issue's description:

> This fixes a potential infinite retrying when dialing bad peers. The error is now reported to the user and they should handle it as they see fit for their case.

The first attempt to solve the libp2p issue was made in rust-libp2p/pull/5419, which confirms that libp2p was not tracking the request timeout properly:

From protocols/request-response/src/handler.rs:

```rust
// If timeout is already reached then there is no need to proceed further.
if message.time.elapsed() >= self.request_timeout {
    self.pending_events
        .push_back(Event::OutboundTimeout(message.request_id));
    return;
}
```

libp2p -> litep2p

It might be possible that litep2p closes the connection to the libp2p node during substream negotiation, causing a SubstreamUpgradeError::Io.

I suspect this happens because litep2p deems the connection inactive just before the libp2p node initiates the request. When this happens, the connection is downgraded and then closed due to a lack of substream activity or a keep-alive connection timeout.

This might not happen in libp2p -> libp2p communication because libp2p might have taken a different approach to keep-alive connections.

There is also the possibility that litep2p closes the connection because of a yamux error/bug. Litep2p runs a fairly outdated version of the yamux implementation. The following PR aims to improve the stability and performance of the yamux component, bringing it up to date:

Stable2412

This version of libp2p does not track timeouts properly for request-response, causing libp2p to postpone failures of SubstreamUpgradeError::Io. The issue should be solved by updating libp2p to the latest version (origin/master). However, we have noticed a few other issues when running the origin/master libp2p on kusama validators:

In the meantime, we can patch the substrate request-response layer by performing periodic checks and cancelling already-timed-out requests, similar to this deployment test PR (a rough sketch of the idea follows the link):

https://github.com/paritytech/polkadot-sdk/pull/7154/files#diff-052aeaf79fef3d9a18c2cfd67006aa306b8d52e848509d9077a6a0f2eb856af7R666-R677
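
A rough sketch of that periodic-check idea; the `PendingRequest` type and its fields are illustrative, not the actual sc-network implementation:

```rust
use std::{
    collections::HashMap,
    time::{Duration, Instant},
};

use futures::channel::oneshot;

// Illustrative stand-in for a pending outbound request tracked by the
// request-response layer; not the real sc-network type.
struct PendingRequest {
    started_at: Instant,
    timeout: Duration,
    response_tx: Option<oneshot::Sender<Result<Vec<u8>, &'static str>>>,
}

/// Intended to be called periodically (e.g. from an interval in the protocol's
/// poll loop): any request that outlived its configured timeout is failed
/// locally instead of waiting for libp2p to (maybe) report the failure.
fn cancel_timed_out_requests(pending: &mut HashMap<u64, PendingRequest>) {
    pending.retain(|_, req| {
        if req.started_at.elapsed() >= req.timeout {
            if let Some(tx) = req.response_tx.take() {
                let _ = tx.send(Err("request force-timed out"));
            }
            false // drop the timed-out entry
        } else {
            true
        }
    });
}
```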

Summary

I believe that we are hitting the following edge-case in libp2p, causing the request-response protocol to not track the timeout properly:

The issue happens in libp2p -> litep2p communication, either due to a difference in how connection keep-alive mechanisms are implemented (paritytech/litep2p#260) or due to running a very outdated yamux multiplexer -- i.e., the component that provides substreams (paritytech/litep2p#256).

Appendix

The request is submitted from availability-recovery:

```rust
.send_message(NetworkBridgeTxMessage::SendRequests(
    vec![Requests::AvailableDataFetchingV1(req)],
    IfDisconnected::ImmediateError,
```

The request arrives next in the polkadot networking bridge:

```rust
network_service
    .start_request(
        &mut authority_discovery_service,
```

The authority discovery service provides the peer ID for Recipient::Authority, then forwards the message to the networking layer:

```rust
<dyn NetworkService>::start_request(
```

The request arrives at substrate's Behaviour for request-response protocols, the point after which the request moves to rust-libp2p:

```rust
if behaviour.is_connected(target) || connect.should_connect() {
    let request_id = behaviour.send_request(target, request);
```

Unconfirmed: there might be a small race between these lines. The behaviour may report an outdated view via is_connected because it hasn't caught up on the swarm's ConnectionClosed events. This would cause the IfDisconnected option to be ignored and libp2p to dial the peer anyway. A toy illustration of this check-then-act race is sketched below.
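
To make the suspected race concrete, a toy model with invented types (this is not the real libp2p or substrate code):

```rust
// Toy model of the suspected stale-view race; types and methods are invented
// for illustration and are not the real libp2p/substrate APIs.
use std::collections::{HashSet, VecDeque};

enum SwarmEvent {
    ConnectionClosed(u64),
}

struct Behaviour {
    connected: HashSet<u64>,              // view consulted by `is_connected`
    pending_events: VecDeque<SwarmEvent>, // swarm events not yet processed
}

impl Behaviour {
    fn is_connected(&self, peer: u64) -> bool {
        self.connected.contains(&peer)
    }

    fn process_events(&mut self) {
        while let Some(SwarmEvent::ConnectionClosed(peer)) = self.pending_events.pop_front() {
            self.connected.remove(&peer);
        }
    }
}

fn main() {
    let peer = 42;
    let mut behaviour = Behaviour {
        connected: HashSet::from([peer]),
        pending_events: VecDeque::from([SwarmEvent::ConnectionClosed(peer)]),
    };

    // The connection is already closed, but the ConnectionClosed event has not
    // been processed yet, so the check still passes: the request is sent as if
    // the peer were connected and IfDisconnected::ImmediateError never triggers.
    if behaviour.is_connected(peer) {
        println!("sending request to peer {peer} based on a stale view");
    }
    behaviour.process_events();
    assert!(!behaviour.is_connected(peer));
}
```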

alexggh (Contributor) commented Jan 17, 2025

Great investigation @lexnv, I think you nailed it!

On top of that, even on the paths where the timeouts work, it seems that stable2412 and earlier do not respect the timeout values configured here, because the parameters to these calls are mistakenly reversed in libp2p v0.52 (inbound request & outbound request), so you end up with the default 10s timeout for all requests. This seems to be fixed on master with the upgrade to rust-libp2p v0.54.1.

So, I second introducing this option in the stable versions.

> In the meantime, we can patch the substrate request-response layer by performing periodic checks and cancelling already-timed-out requests, similar to this deployment test PR.

Side note: there is still the open question of why litep2p triggers these errors in libp2p. Even with the timeouts fixed, we probably want to root-cause the exact reason, so once we fix the timeouts (and before we try to re-enable litep2p) we should enable these debug logs on our validators: https://github.com/libp2p/rust-libp2p/blob/51070dae6395821c5ab45014b7208f15975c9101/protocols/request-response/src/handler.rs#L151

Nathy-bajo pushed a commit to Nathy-bajo/polkadot-sdk that referenced this issue Jan 21, 2025: the same banned-peers change (paritytech#7158) as above.
github-merge-queue bot pushed a commit that referenced this issue Jan 22, 2025
This PR enforces that outbound requests are finished within the
specified protocol timeout.

The stable2412 version running libp2p 0.52.4 contains a bug which does
not track request timeouts properly:
- libp2p/rust-libp2p#5429

The issue has been detected while submitting libp2p -> litep2p requests
in kusama. This aims to check that pending outbound requests have not
timed out. Although the issue has been fixed in libp2p, there might be
other cases where this may happen. For example:
- libp2p/rust-libp2p#5417

For more context see:
#7076 (comment)


1. Ideally, the force-timeout mechanism in this PR should never be
triggered in production. However, origin/stable2412 occasionally
encounters this issue. When this happens, 2 warnings may be generated:
- one warning introduced by this PR wrt force timeout terminating the
request
- possibly one warning when libp2p decides (if at all) to provide
the response back to substrate (as mentioned by @alexggh
[here](https://github.com/paritytech/polkadot-sdk/pull/7222/files#diff-052aeaf79fef3d9a18c2cfd67006aa306b8d52e848509d9077a6a0f2eb856af7L769)
and
[here](https://github.com/paritytech/polkadot-sdk/pull/7222/files#diff-052aeaf79fef3d9a18c2cfd67006aa306b8d52e848509d9077a6a0f2eb856af7L842))

2. This implementation does not propagate to the substrate service the
`RequestFinished { error: .. }`. That event is only used internally by
substrate to increment metrics. However, we don't have the peer
information available to propagate the event properly when we
force-timeout the request. Considering this should most likely not
happen in production (origin/master) and that we'll be able to extract
information by warnings, I would say this is a good tradeoff for code
simplicity:


https://github.com/paritytech/polkadot-sdk/blob/06e3b5c6a7696048d65f1b8729f16b379a16f501/substrate/client/network/src/service.rs#L1543


### Testing

Added a new test to ensure the timeout is reached properly, even if
libp2p does not produce a response in due time.

I've also transitioned the tests to using `tokio::test` due to a
limitation of
[CI](https://github.com/paritytech/polkadot-sdk/actions/runs/12832055737/job/35784043867)

```
--- TRY 1 STDERR:        sc-network request_responses::tests::max_response_size_exceeded ---
thread 'request_responses::tests::max_response_size_exceeded' panicked at /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.40.0/src/time/interval.rs:139:26:
there is no reactor running, must be called from the context of a Tokio 1.x runtime
```



cc @paritytech/networking

---------

Signed-off-by: Alexandru Vasile <alexandru.vasile@parity.io>
Co-authored-by: Bastian Köcher <git@kchr.de>