This repository has been archived by the owner on Nov 15, 2023. It is now read-only.

Network Bridge error: notification queue size > 100k messages #6904

Closed
Tracked by #26 ...
sandreim opened this issue Mar 17, 2023 · 6 comments
Labels
I3-bug Fails to follow expected behavior. T4-parachains_engineering This PR/Issue is related to Parachains performance, stability, maintenance.

Comments

@sandreim
Contributor

Seen on Kusama validators:

2023-03-11 01:28:21.277 ERROR tokio-runtime-worker sc_network::service::out_events: The number of unprocessed events in channel `polkadot-network-bridge` reached 100000.
The channel was created at:
   0: sc_network::service::out_events::channel
   1: polkadot_overseer::spawn
   2: polkadot_overseer::OverseerBuilder<polkadot_overseer::Init<S>,polkadot_overseer::Init<CandidateValidation>,polkadot_overseer::Init<PvfChecker>,polkadot_overseer::Init<CandidateBacking>,polkadot_overseer::Init<StatementDistribution>,polkadot_overseer::Init<AvailabilityDistribution>,polkadot_overseer::Init<AvailabilityRecovery>,polkadot_overseer::Init<BitfieldSigning>,polkadot_overseer::Init<BitfieldDistribution>,polkadot_overseer::Init<Provisioner>,polkadot_overseer::Init<RuntimeApi>,polkadot_overseer::Init<AvailabilityStore>,polkadot_overseer::Init<NetworkBridgeRx>,polkadot_overseer::Init<NetworkBridgeTx>,polkadot_overseer::Init<ChainApi>,polkadot_overseer::Init<CollationGeneration>,polkadot_overseer::Init<CollatorProtocol>,polkadot_overseer::Init<ApprovalDistribution>,polkadot_overseer::Init<ApprovalVoting>,polkadot_overseer::Init<GossipSupport>,polkadot_overseer::Init<DisputeCoordinator>,polkadot_overseer::Init<DisputeDistribution>,polkadot_overseer::Init<ChainSelection>,polkadot_overseer::Init<std::collections::hash::map::HashMap<primitive_types::H256,alloc::vec::Vec<futures_channel::oneshot::Sender<core::result::Result<(),polkadot_node_subsystem_types::errors::SubsystemError>>>>>,polkadot_overseer::Init<std::collections::hash::map::HashMap<primitive_types::H256,alloc::sync::Arc<polkadot_node_jaeger::spans::Span>>>,polkadot_overseer::Init<alloc::vec::Vec<(primitive_types::H256,u32)>>,polkadot_overseer::Init<std::collections::hash::map::HashMap<primitive_types::H256,u32>>,polkadot_overseer::Init<SupportsParachains>,polkadot_overseer::Init<lru::LruCache<primitive_types::H256,()>>,polkadot_overseer::Init<polkadot_overseer::metrics::Metrics>>::build_with_connector
   3: <polkadot_service::overseer::RealOverseerGen as polkadot_service::overseer::OverseerGen>::generate
   4: polkadot_service::new_full
   5: <core::future::from_generator::GenFuture<T> as core::future::future::Future>::poll
   6: sc_cli::runner::Runner<C>::run_node_until_exit
   7: polkadot_cli::command::run
   8: polkadot::main
   9: std::sys_common::backtrace::__rust_begin_short_backtrace
  10: main
  11: __libc_start_main
  12: _start

Related to #6614. I've observed this frequently in my scaling tests, even after implementing #6614. The problem was that the bridge rx worker task was not run in a separate thread.
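To illustrate the failure mode described above, here is a minimal sketch, not the actual `sc_network` or network bridge code. It assumes plain `std` threads and `std::sync::mpsc` in place of the node's async runtime, and every name in it (`TrackedSender`, `tracked_channel`, `spawn_dedicated_consumer`, `WARN_THRESHOLD`) is hypothetical. It shows an unbounded channel whose backlog is counted, and a consumer that drains it on its own thread so it does not compete with other work for execution time.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::mpsc::{self, Receiver, Sender};
use std::sync::Arc;
use std::thread::{self, JoinHandle};

// Illustrative threshold, chosen to match the figure in the log message above.
const WARN_THRESHOLD: usize = 100_000;

/// Hypothetical sender wrapper that counts events not yet processed.
pub struct TrackedSender<T> {
    tx: Sender<T>,
    pending: Arc<AtomicUsize>,
}

impl<T> TrackedSender<T> {
    /// Queues an event; returns `true` while the backlog is at or above the threshold.
    pub fn send(&self, event: T) -> bool {
        // Count the event before sending so the consumer never decrements first.
        let backlog = self.pending.fetch_add(1, Ordering::SeqCst) + 1;
        // A send error only means the receiver is gone; ignored in this sketch.
        let _ = self.tx.send(event);
        backlog >= WARN_THRESHOLD
    }
}

/// Creates the tracked sender, the receiver, and a handle to the shared backlog counter.
pub fn tracked_channel<T>() -> (TrackedSender<T>, Receiver<T>, Arc<AtomicUsize>) {
    let (tx, rx) = mpsc::channel();
    let pending = Arc::new(AtomicUsize::new(0));
    let sender = TrackedSender { tx, pending: Arc::clone(&pending) };
    (sender, rx, pending)
}

/// Drains the channel on a dedicated OS thread and returns the number of events handled.
pub fn spawn_dedicated_consumer<T: Send + 'static>(
    rx: Receiver<T>,
    pending: Arc<AtomicUsize>,
) -> JoinHandle<usize> {
    thread::spawn(move || {
        let mut handled = 0;
        // The loop ends once every sender has been dropped.
        for _event in rx {
            pending.fetch_sub(1, Ordering::SeqCst);
            handled += 1;
        }
        handled
    })
}
```

If nothing drains the receiver, the counter only grows and `send` starts returning `true` at the threshold, which corresponds to the condition the log line reports. Once a dedicated consumer is running, the counter falls back toward zero.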

@sandreim sandreim added I3-bug Fails to follow expected behavior. T4-parachains_engineering This PR/Issue is related to Parachains performance, stability, maintenance. labels Mar 17, 2023
@sandreim sandreim self-assigned this Mar 17, 2023
@sandreim
Contributor Author

Fixed in #6905

@h33p

h33p commented Mar 29, 2023

@sandreim I believe we're affected by this too. We had this exact same trace on one of the Kusama nodes we are running, and at some point the node would start idling for a few brief periods before completely flatlining. I suspect this is related. I was wondering, when will the fix be released?

@sandreim
Contributor Author

It should be present in the next client release (0.9.41).

@kucharskim

When is the next client release, 0.9.41, scheduled? As in, when can we expect it? I am asking because we run multiple nodes for Kusama, and with our current number of Kusama instances we are hitting this problem daily and our on-call engineers are suffering. An estimated release time for the fix would help us prioritize workarounds for this problem.

@bkchr
Member

bkchr commented Mar 29, 2023

@kucharskim maybe best to downgrade to 0.9.39

@sandreim
Contributor Author

sandreim commented Mar 29, 2023

I subscribe to what @bkchr recommended, since 0.9.41 is most likely 2 weeks away. And please let us know if you still get the notification queue size error. I am not convinced that it is related to the issues seen in 0.9.40, maybe it just made the issue described here worse ...
