Kusama Validators Litep2p - Monitoring and Feedback #7076

Open · lexnv opened this issue Jan 7, 2025 · 12 comments

lexnv (Contributor) commented Jan 7, 2025

This is a placeholder issue for the community (Kusama validators) to share their feedback, monitoring data, and logs.

We’re excited to announce the next step in improving the Kusama network with the introduction of litep2p—a more resource-efficient network backend. We need your help to make this transition successful!

Enable Litep2p Backend

We’re gradually rolling out litep2p across all validators. Here’s how you can help:

  1. Ensure you're running the latest Polkadot release (version 2412 or newer).
  2. Restart your node with the following flag (a full example command is shown below):

--network-backend litep2p
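
For reference, a complete invocation might look like the sketch below; the binary path, chain, and node name are placeholders for your own setup, and the only new flag this rollout requires is `--network-backend litep2p`:

```bash
# Illustrative example only -- substitute your own binary path, name, and existing flags.
polkadot \
  --chain kusama \
  --validator \
  --name "<your-node-name>" \
  --network-backend litep2p
```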

Rollout Plan

  • Phase 1: We need around 100 validators to start the transition.
  • Phase 2 (in a few days): Increase to 500 validators running litep2p.
  • Phase 3: Full rollout—inviting all validators to switch.

Monitoring & Feedback

Please keep an eye on your node after restarting and report any warnings or errors you encounter. In the first 15–30 minutes after the restart, you may see some temporary warnings, such as:

    Some network error occurred when fetching erasure chunk
    Low connectivity

We'd like to pay special attention to at least the following metrics (a quick way to check them from the command line is shown after this list):

  • Sync peers (substrate_sync_peers)
  • Block height (substrate_block_height)
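
As a quick sanity check after restarting, these metrics can be scraped directly from the node's Prometheus endpoint. The sketch below assumes the default exporter port 9615; adjust it if you run a custom `--prometheus-port`:

```bash
# Assumes the default Substrate Prometheus exporter on 127.0.0.1:9615.
curl -s http://127.0.0.1:9615/metrics | grep -E 'substrate_sync_peers|substrate_block_height'
```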

lexnv added this to Networking on Jan 7, 2025
alexggh (Contributor) commented Jan 7, 2025

This is a concern for me: #7077. We should pay attention to it.

alexggh (Contributor) commented Jan 8, 2025

> This is a concern for me: #7077. We should pay attention to it.

And this is the impact. I assume it was because validators restarted to use litep2p, but we should keep an eye on this in case it keeps repeating.

(screenshot attached: impact observed on 2025-01-08 at 11:50:49)

https://grafana.teleport.parity.io/goto/-FyJtnvNg?orgId=1

alexggh (Contributor) commented Jan 9, 2025

> This is a concern for me: #7077. We should pay attention to it.
>
> And this is the impact. I assume it was because validators restarted to use litep2p, but we should keep an eye on this in case it keeps repeating.

Confirmed with paranodes: he had some validators that were constantly restarting, which was the reason for these finality delays.

eskimor (Member) commented Jan 10, 2025

And what was the reason for constantly restarting?

alexggh (Contributor) commented Jan 10, 2025

> And what was the reason for constantly restarting?

Paranodes had a script that restarted on low connectivity, which is exactly what #7077 will produce.

Nevertheless, even after the script was stopped we are still seeing occasional spikes in finality lag because of no-shows on around ~20 validators. I'm working with @lexnv to understand what might cause that, because it correlates perfectly with the enablement of litep2p on validators.

alexggh (Contributor) commented Jan 14, 2025

> And what was the reason for constantly restarting?
>
> Paranodes had a script that restarted on low connectivity, which is exactly what #7077 will produce.
>
> Nevertheless, even after the script was stopped we are still seeing occasional spikes in finality lag because of no-shows on around ~20 validators. I'm working with @lexnv to understand what might cause that, because it correlates perfectly with the enablement of litep2p on validators.

I did a bit more investigation on this path. The following candidates were slow to be approved and induced a finality lag of around ~16 blocks:

0x8fe297cb881a48611829b911b9dfc4c176d5a540b5fd0ab4a2114b6b65e04d71
0x16a13885e900a4afc18d1689a0197602ac64176d70ccd8608f9a260de3b3a22e
0x869c13acf8857fc2df09bd3991cfda24f81b639373bdc68ba7a10436940c4a4d
0x83a323ec47de77df87e9e2dfc49650169622fd641304e62f8f83db085fc1822c
0x7881ad19ad2cd502c7af16313bf569c9c5ac5d42cd47b9b2bede8df71f63cd56
0xb062a7c221fcd1ecfe33f1aabc1b48abf3949a9ab09713e8ca6a86800388103c
0xd233f3836a3a4bc7703c7b211d8da49bd7455e7b2d264b6cfd053f42e73948f7
0x8f811f8b2d246badb748a46156fc0a992316616e36f8779f32a0606853f68df8
0x477c4030533603182240063e80b1169fac6674bdd1b207c719d5363b231a28ba
0x756973880ea6a9a08d3ab881ef7e908a4779c5c2f1d4e4aa3f01f2c2510171ec

For these particular candidates, around 20-30 random validators (running different polkadot versions) are no-shows. Those validators aren't no-shows on any other candidate before or after, so it is a one-off for these particular candidates.

What these candidates have in common is that all of them (9 of 9) have been backed in a group that contains STKD.IO/01 (https://apps.turboflakes.io/?chain=kusama#/validator/5FKStTNJCk5J3EuuYcvJpNn8CxbkzW1J7mst3aayWCT8XrXh), which seems to be one of the nodes that enabled litep2p.

So, my theory is that the presence of this node in the backing group might make others slow on availability-recovery, which results in no-shows and finality lag; however, I don't have definitive proof of where this happens.

Next

  • Confirm STKD.IO/01 is rolled back from litep2p to the default networking backend and that the occasional lag goes away.
  • Root-cause why the above happens

lexnv (Contributor, Author) commented Jan 14, 2025

Confirmed STKD.IO/01 runs litep2p; a reboot to libp2p will happen soon.

Sudo-Whodo (Contributor) commented:
STKD.IO/01 was restarted with the litep2p flag around 2025-01-08 04:02:20 (at the start of the log file). It ran and output errors for about 25-30 minutes, clearing up around 2025-01-08 04:30:00. I restarted the service a couple of times at the beginning. The flag was removed on 2025-01-14 14:43:47.

https://public-logs-stkd.s3.us-west-2.amazonaws.com/extracted-messages.txt

If you need any more info or have any questions let me know.

github-merge-queue bot pushed a commit that referenced this issue Jan 15, 2025
This PR rejects inbound requests from banned peers (reputation is below
the banned threshold).

This mirrors the request-response implementation from the libp2p side.
I don't expect this to get triggered too often, but we'll monitor this
metric.

While at it, I have registered a new inbound failure metric to have
visibility into this.

Discovered during the investigation of:
#7076 (comment)

cc @paritytech/networking

---------

Signed-off-by: Alexandru Vasile <alexandru.vasile@parity.io>
github-actions bot pushed the same commit to release branches on Jan 15, 2025 (cherry picked from commit ef064a3).
lexnv (Contributor, Author) commented Jan 15, 2025

Triage report from the provided logs, thanks again @Sudo-Whodo 🙏

| Repo | Count | Level | Triage report |
| --- | --- | --- | --- |
| https://github.com/paritytech/polkadot-sdk/ | 13399 | warn_if_frequent | Some network error occurred when fetching erasure chunk |
| https://github.com/paritytech/polkadot-sdk/ | 1238 | warn | Report .: . to .. Reason: .. Banned, disconnecting. ( Same block request multiple times. Banned, disconnecting.) |
| https://github.com/paritytech/polkadot-sdk/ | 805 | warn | Recovery of available data failed. |
| https://github.com/paritytech/polkadot-sdk/ | 805 | warn | Data unavailable for candidate .* |
| https://github.com/paritytech/polkadot-sdk/ | 325 | warn | Report .: . to .. Reason: .. Banned, disconnecting. ( A collator provided a collation for the wrong para. Banned, disconnecting.) |
| https://github.com/paritytech/polkadot-sdk/ | 120 | warn | fetch_pov_job |
| https://github.com/paritytech/polkadot-sdk/ | 108 | warn | Low connectivity - authority lookup failed for too many validators. |
| https://github.com/paritytech/polkadot-sdk/ | 11 | warn | Importing locally an already known assignment |
| https://github.com/paritytech/polkadot-sdk/ | 4 | error | Protocol controllers receiver stream has returned None. Ignore this error if the node is shutting down. |
| https://github.com/paritytech/polkadot-sdk/ | 4 | warn | Report .: . to .. Reason: .. Banned, disconnecting. ( A collator advertising a collation for an async backing relay parent using V1. Banned, disconnecting.) |
| https://github.com/paritytech/polkadot-sdk/ | 3 | error | 💔 Called on_validated_block_announce with a bad peer ID .* |
| https://github.com/paritytech/polkadot-sdk/ | 1 | warn | 💔 Error importing block .: . ( Parent block of 0x6da5…6146 has no associated weight) |
| https://github.com/paritytech/polkadot-sdk/ | 1 | warn | .: . is already a reserved peer |
| https://github.com/paritytech/litep2p/ | 1 | warn | Refusing to add known address that corresponds to a different peer ID |
| https://github.com/paritytech/polkadot-sdk/ | 1 | warn | 💔 Verification failed for block ..: .* |

alexggh (Contributor) commented Jan 16, 2025

Found some data on one of our validators running with libp2p: https://grafana.teleport.parity.io/goto/volIfyDHR?orgId=1. It looks like a node running libp2p sometimes gets stuck while fetching the full PoV from a backer running litep2p, and that causes it to be a no-show in approval-voting.


2025-01-12 23:53:24.508 parachain::availability-recovery: Starting `Full recovery from backers` strategy candidate_hash=0x5830d4fd4b920c81c3db20db9c1714281dc01afa1a69138dff6bf061588a2d40

2025-01-12 23:55:24.493 parachain::availability-recovery: Receiver for available data dropped. candidate_hash=0x5830d4fd4b920c81c3db20db9c1714281dc01afa1a69138dff6bf061588a2d40 

2025-01-12 23:55:24.493 DEBUG parachain::availability-recovery: All receivers for available data dropped. candidate_hash=0x5830d4fd4b920c81c3db20db9c1714281dc01afa1a69138dff6bf061588a2d40

The receiver is dropped after 2 minutes of waiting for the PoV. Normally, fetching from the backer should either succeed or time out after 2 seconds, and if it fails, recovery from chunks should be started; but in this case that doesn't happen. A minimal sketch of the expected flow is shown below.
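
For illustration, a rough sketch of that expected flow, assuming hypothetical helpers (`fetch_pov_from_backer`, `recover_from_chunks`) rather than the actual availability-recovery code:

```rust
// Illustrative sketch only: the helper functions are hypothetical stand-ins,
// not the real polkadot-sdk availability-recovery API.
use std::time::Duration;
use tokio::time::timeout;

async fn fetch_pov_from_backer(_candidate: [u8; 32]) -> Result<Vec<u8>, &'static str> {
    Err("backer request failed") // placeholder
}

async fn recover_from_chunks(_candidate: [u8; 32]) -> Result<Vec<u8>, &'static str> {
    Ok(Vec::new()) // placeholder
}

async fn recover_available_data(candidate: [u8; 32]) -> Result<Vec<u8>, &'static str> {
    // Fetching the full PoV from a backer is expected to succeed or fail within ~2 seconds.
    match timeout(Duration::from_secs(2), fetch_pov_from_backer(candidate)).await {
        Ok(Ok(pov)) => Ok(pov),
        // On error *or* timeout, fall back to chunk recovery instead of waiting
        // minutes for the response receiver to be dropped.
        _ => recover_from_chunks(candidate).await,
    }
}

#[tokio::main]
async fn main() {
    let _ = recover_available_data([0u8; 32]).await;
}
```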

lexnv (Contributor, Author) commented Jan 16, 2025

Libp2p Issue

  1. The AvailableDataFetchingV1 request is initiated from polkadot with the option IfDisconnected::ImmediateError (i.e., return an error immediately if the remote peer is not connected to us).

  2. Libp2p notifies the connection handler about this request.

  3. The libp2p handler cannot obtain a new outbound substream for the request because of a SubstreamUpgradeError::Io error.

There is an issue in libp2p v0.52.4 (on stable2412) with request-response protocols hitting SubstreamUpgradeError::Io. This can cause potential infinite loops and delay the propagation of the request failure.

Our current origin/master runs libp2p v0.54.1, which includes a fix for the issue.

Extracted from the issue's description:

> This fixes a potential infinite retrying when dialing bad peers. The error is now reported to the user and they should handle it as they see fit for their case.

The first attempt to solve the libp2p issue was made in rust-libp2p/pull/5419, which confirms that libp2p was not tracking the request timeout properly:

From protocols/request-response/src/handler.rs:

```rust
// If timeout is already reached then there is no need to proceed further.
if message.time.elapsed() >= self.request_timeout {
    self.pending_events
        .push_back(Event::OutboundTimeout(message.request_id));
    return;
}
```

libp2p -> litep2p

It might be possible that litep2p closes the connection to the libp2p node during substream negotiation, causing a SubstreamUpgradeError::Io.

I suspect this happens because litep2p deems the connection inactive just before the libp2p node initiates the request. When this happens, the connection is downgraded and then closed due to a lack of substream activity or a keep-alive connection timeout.

This might not happen in libp2p -> libp2p communication because libp2p might have taken a different approach to keep-alive connections.

There is also the possibility that litep2p closes the connection because of a yamux error/bug. Litep2p runs a fairly outdated version of the yamux implementation. The following PR aims to improve the stability and performance of the yamux component, bringing it up to date:

Stable2412

This version of libp2p does not track timeouts properly for request-response, causing libp2p to postpone failures of SubstreamUpgradeError::Io. The issue should be solved by updating libp2p to the latest version (origin/master). However, we have noticed a few other issues when running the origin/master libp2p on kusama validators:

In the meantime, we can patch the substrate request-response layer by performing periodic checks and cancelling already-timed-out requests, similar to this deployment test PR (a rough sketch of the idea follows the link):

https://github.com/paritytech/polkadot-sdk/pull/7154/files#diff-052aeaf79fef3d9a18c2cfd67006aa306b8d52e848509d9077a6a0f2eb856af7R666-R677
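
A rough sketch of that periodic-check idea; the `PendingRequest` type and its fields are illustrative, not the actual sc-network implementation:

```rust
use std::{
    collections::HashMap,
    time::{Duration, Instant},
};

use futures::channel::oneshot;

// Illustrative stand-in for a pending outbound request tracked by the
// request-response layer; not the real sc-network type.
struct PendingRequest {
    started_at: Instant,
    timeout: Duration,
    response_tx: Option<oneshot::Sender<Result<Vec<u8>, &'static str>>>,
}

/// Intended to be called periodically (e.g. from an interval in the protocol's
/// poll loop): any request that outlived its configured timeout is failed
/// locally instead of waiting for libp2p to (maybe) report the failure.
fn cancel_timed_out_requests(pending: &mut HashMap<u64, PendingRequest>) {
    pending.retain(|_, req| {
        if req.started_at.elapsed() >= req.timeout {
            if let Some(tx) = req.response_tx.take() {
                let _ = tx.send(Err("request force-timed out"));
            }
            false // drop the timed-out entry
        } else {
            true
        }
    });
}
```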

Summary

I believe that we are hitting the following edge-case in libp2p, causing the request-response protocol to not track the timeout properly:

The issue happens in libp2p -> litep2p communication, either due to a difference in how connection keep-alive mechanisms are implemented (paritytech/litep2p#260) or due to running a very outdated yamux multiplexer -- i.e., the component that provides substreams (paritytech/litep2p#256).

Appendix

The request is submitted from availability-recovery:

```rust
.send_message(NetworkBridgeTxMessage::SendRequests(
    vec![Requests::AvailableDataFetchingV1(req)],
    IfDisconnected::ImmediateError,
```

The request arrives next in the polkadot networking bridge:

```rust
network_service
    .start_request(
        &mut authority_discovery_service,
```

The authority discovery service provides the peer ID for Recipient::Authority, then forwards the message to the networking layer:

```rust
<dyn NetworkService>::start_request(
```

The request arrives at substrate's Behaviour for request-response protocols, the point after which the request moves to rust-libp2p:

```rust
if behaviour.is_connected(target) || connect.should_connect() {
    let request_id = behaviour.send_request(target, request);
```

Unconfirmed: there might be a small race between these lines. The behaviour may report an outdated view via is_connected because it hasn't caught up on the swarm's ConnectionClosed events. This would cause the IfDisconnected option to be ignored and libp2p to dial the peer anyway. A toy illustration of this check-then-act race is sketched below.
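
To make the suspected race concrete, a toy model with invented types (this is not the real libp2p or substrate code):

```rust
// Toy model of the suspected stale-view race; types and methods are invented
// for illustration and are not the real libp2p/substrate APIs.
use std::collections::{HashSet, VecDeque};

enum SwarmEvent {
    ConnectionClosed(u64),
}

struct Behaviour {
    connected: HashSet<u64>,              // view consulted by `is_connected`
    pending_events: VecDeque<SwarmEvent>, // swarm events not yet processed
}

impl Behaviour {
    fn is_connected(&self, peer: u64) -> bool {
        self.connected.contains(&peer)
    }

    fn process_events(&mut self) {
        while let Some(SwarmEvent::ConnectionClosed(peer)) = self.pending_events.pop_front() {
            self.connected.remove(&peer);
        }
    }
}

fn main() {
    let peer = 42;
    let mut behaviour = Behaviour {
        connected: HashSet::from([peer]),
        pending_events: VecDeque::from([SwarmEvent::ConnectionClosed(peer)]),
    };

    // The connection is already closed, but the ConnectionClosed event has not
    // been processed yet, so the check still passes: the request is sent as if
    // the peer were connected and IfDisconnected::ImmediateError never triggers.
    if behaviour.is_connected(peer) {
        println!("sending request to peer {peer} based on a stale view");
    }
    behaviour.process_events();
    assert!(!behaviour.is_connected(peer));
}
```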

alexggh (Contributor) commented Jan 17, 2025

Great investigation @lexnv, I think you nailed it!

On top of that, even on the paths where the timeouts work, it seems that stable2412 and earlier do not respect the timeout values configured here, because the parameters to these calls are mistakenly reversed in libp2p v0.52 (inbound request & outbound request), so you end up with the default 10s timeout for all requests. This seems to be fixed on master with the upgrade to rust-libp2p v0.54.1.

So, I second introducing this option in the stable versions.

> In the meantime, we can patch the substrate request-response layer by performing periodic checks and cancelling already-timed-out requests, similar to this deployment test PR.

Side note: there is still the open question of why litep2p triggers these errors in libp2p. Even with the timeouts fixed, we probably want to root-cause the exact reason, so once we fix the timeouts (and before we try to re-enable litep2p) we should enable these debug logs on our validators: https://github.com/libp2p/rust-libp2p/blob/51070dae6395821c5ab45014b7208f15975c9101/protocols/request-response/src/handler.rs#L151

Nathy-bajo pushed a commit to Nathy-bajo/polkadot-sdk that referenced this issue Jan 21, 2025: the same banned-peers change (paritytech#7158) as above.
github-merge-queue bot pushed a commit that referenced this issue Jan 22, 2025
This PR enforces that outbound requests are finished within the
specified protocol timeout.

The stable2412 version running libp2p 0.52.4 contains a bug which does
not track request timeouts properly:
- libp2p/rust-libp2p#5429

The issue has been detected while submitting libp2p -> litep2p requests
in kusama. This aims to check that pending outbound requests have not
timed out. Although the issue has been fixed in libp2p, there might be
other cases where this may happen. For example:
- libp2p/rust-libp2p#5417

For more context see:
#7076 (comment)


1. Ideally, the force-timeout mechanism in this PR should never be
triggered in production. However, origin/stable2412 occasionally
encounters this issue. When this happens, 2 warnings may be generated:
- one warning introduced by this PR wrt force timeout terminating the
request
- possibly one warning when libp2p decides (if at all) to provide
the response back to substrate (as mentioned by @alexggh
[here](https://github.com/paritytech/polkadot-sdk/pull/7222/files#diff-052aeaf79fef3d9a18c2cfd67006aa306b8d52e848509d9077a6a0f2eb856af7L769)
and
[here](https://github.com/paritytech/polkadot-sdk/pull/7222/files#diff-052aeaf79fef3d9a18c2cfd67006aa306b8d52e848509d9077a6a0f2eb856af7L842))

2. This implementation does not propagate to the substrate service the
`RequestFinished { error: .. }`. That event is only used internally by
substrate to increment metrics. However, we don't have the peer
information available to propagate the event properly when we
force-timeout the request. Considering this should most likely not
happen in production (origin/master) and that we'll be able to extract
information by warnings, I would say this is a good tradeoff for code
simplicity:


https://github.com/paritytech/polkadot-sdk/blob/06e3b5c6a7696048d65f1b8729f16b379a16f501/substrate/client/network/src/service.rs#L1543


### Testing

Added a new test to ensure the timeout is reached properly, even if
libp2p does not produce a response in due time.

I've also transitioned the tests to using `tokio::test` due to a
limitation of
[CI](https://github.com/paritytech/polkadot-sdk/actions/runs/12832055737/job/35784043867)

```
--- TRY 1 STDERR:        sc-network request_responses::tests::max_response_size_exceeded ---
thread 'request_responses::tests::max_response_size_exceeded' panicked at /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.40.0/src/time/interval.rs:139:26:
there is no reactor running, must be called from the context of a Tokio 1.x runtime
```



cc @paritytech/networking

---------

Signed-off-by: Alexandru Vasile <alexandru.vasile@parity.io>
Co-authored-by: Bastian Köcher <git@kchr.de>