
Substream limits exceeded #1504

Closed
mikevoronov opened this issue Mar 23, 2020 · 6 comments

mikevoronov commented Mar 23, 2020

I have run into the problem of reaching the maximum number of substreams inside the multiplexers.

I have a rather complex test and will try to describe it: from 20 nodes that use libp2p-kad for connectivity I constructed 176 pairs (with tuple_combinations), then created 176 clients connected to these nodes. In each pair one client sends some info to the other client. Each node runs two separate services based on different libp2p swarms: one for clients and one for the other nodes in the network. The node swarm behaviour combines kademlia, identify, ping and my own protocol (roughly like the sketch below). My protocol is pretty simple and is based on the OneShotHandler. The client swarm just manages clients and forwards their requests to the node swarm. Clients are separate entities and are also built on rust-libp2p.
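
The node swarm behaviour is assembled approximately like this (a structural sketch only, written against the libp2p version I'm on, so module paths and the derive's requirements may differ for you; my OneShotHandler-based protocol is omitted, it would just be one more field):

```rust
use libp2p::identify::{Identify, IdentifyEvent};
use libp2p::kad::{record::store::MemoryStore, Kademlia, KademliaEvent};
use libp2p::ping::{Ping, PingEvent};
use libp2p::swarm::NetworkBehaviourEventProcess;
use libp2p::NetworkBehaviour;

// Sketch of the node-swarm behaviour: kademlia + identify + ping.
// My OneShotHandler-based protocol would be an additional field here.
#[derive(NetworkBehaviour)]
struct NodeBehaviour {
    kademlia: Kademlia<MemoryStore>,
    identify: Identify,
    ping: Ping,
}

impl NetworkBehaviourEventProcess<KademliaEvent> for NodeBehaviour {
    fn inject_event(&mut self, event: KademliaEvent) {
        // React to GetProviders results here and forward the found client
        // to the client-facing swarm.
        log::debug!("kademlia: {:?}", event);
    }
}

impl NetworkBehaviourEventProcess<IdentifyEvent> for NodeBehaviour {
    fn inject_event(&mut self, _: IdentifyEvent) {}
}

impl NetworkBehaviourEventProcess<PingEvent> for NodeBehaviour {
    fn inject_event(&mut self, _: PingEvent) {}
}
```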

After receiving a request from a client, the node swarm uses the get providers capability of libp2p-kad to find the other client (the second one in the corresponding pair). After that it connects to it via my own protocol and sends a short message. So the flow looks like this:
client_1 -> node_1 {client swarm -> node swarm (FindProviders-Connect-Send)} -> node_2 {node swarm -> client swarm} -> client_2.

The first client of each pair sends 100 messages to the second client. All pairs are placed in a FuturesUnordered, so sending is parallel across pairs but sequential within a pair: the test waits for the second client to receive each message with a 5 second timeout (see the sketch below).
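
The structure of the test driver is roughly this (a simplified sketch, not the actual test; `Client`, `send` and `next_message` are placeholders for my own test helpers, and the timeout uses async-std just for illustration):

```rust
use std::time::Duration;
use futures::stream::{FuturesUnordered, StreamExt};

// Sketch of the test driver: pairs run concurrently, the 100 messages inside
// one pair are sent sequentially, and each message must be received by the
// second client within 5 seconds.
// `Client` is a placeholder for my test client wrapper; `send` and
// `next_message` are its (hypothetical) methods.
async fn run_all_pairs(pairs: Vec<(Client, Client)>) {
    let mut tasks: FuturesUnordered<_> = pairs
        .into_iter()
        .map(|(sender, receiver)| async move {
            for i in 0..100 {
                sender.send(format!("message {}", i)).await;
                async_std::future::timeout(Duration::from_secs(5), receiver.next_message())
                    .await
                    .expect("second client did not receive the message in 5 seconds");
            }
        })
        .collect();

    // Drive all pairs to completion.
    while tasks.next().await.is_some() {}
}
```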

All of this works fine (without any timeouts) on my 176 pairs only if I increase the maximum number of substreams on the multiplexers (both mplex and yamux, see the sketch below). But after several runs it gets stuck again, apparently after reaching the configured substream maximum or something like that - I have seen several corresponding errors about reaching the substream limit from mplex in the logs. For example, if I raise the mplex substream limit from the default 128 to 1024, the test completes successfully the first time. The logs also show that the message is successfully passed to the node swarm of node_1, the node connected to client_1.
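
Raising the limit looks roughly like this on my side (a sketch, not my exact code; I believe `max_substreams` is the right knob in the libp2p-mplex version I'm using, and yamux has an analogous setting):

```rust
use libp2p::mplex::MplexConfig;

// Sketch: raise the mplex substream limit from the default 128 to 1024.
// The resulting config is then passed to the transport's `.multiplex(...)`
// step as usual.
fn mplex_with_bigger_limit() -> MplexConfig {
    let mut cfg = MplexConfig::new();
    cfg.max_substreams(1024);
    cfg
}
```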

I have instrumented mplex a little and it seems that finding providers opens several new substreams (5-6 on my network), but only one of them ever gets closed.

All of this suggests the root cause is in the substream management on the libp2p-kad side, but after a very quick analysis I couldn't pinpoint it.

Also, very rarely I get the following panic:

thread 'libp2p-task-1' panicked at 'Sink::poll_ready applied to a closed sink.', <::std::macros::panic macros>:2:4
stack backtrace:
   0: backtrace::backtrace::libunwind::trace
             at ./cargo/registry/src/github.com-1ecc6299db9ec823/backtrace-0.3.40/src/backtrace/libunwind.rs:88
   1: backtrace::backtrace::trace_unsynchronized
             at ./cargo/registry/src/github.com-1ecc6299db9ec823/backtrace-0.3.40/src/backtrace/mod.rs:66
   2: std::sys_common::backtrace::_print_fmt
             at src/libstd/sys_common/backtrace.rs:77
   3: <std::sys_common::backtrace::_print::DisplayBacktrace as core::fmt::Display>::fmt
             at src/libstd/sys_common/backtrace.rs:59
   4: core::fmt::write
             at src/libcore/fmt/mod.rs:1052
   5: std::io::Write::write_fmt
             at src/libstd/io/mod.rs:1426
   6: std::sys_common::backtrace::_print
             at src/libstd/sys_common/backtrace.rs:62
   7: std::sys_common::backtrace::print
             at src/libstd/sys_common/backtrace.rs:49
   8: std::panicking::default_hook::{{closure}}
             at src/libstd/panicking.rs:204
   9: std::panicking::default_hook
             at src/libstd/panicking.rs:224
  10: std::panicking::rust_panic_with_hook
             at src/libstd/panicking.rs:472
  11: std::panicking::begin_panic
  12: <quicksink::SinkImpl<S,F,T,A,E> as futures_sink::Sink<A>>::poll_ready
  13: <rw_stream_sink::RwStreamSink<S> as futures_io::if_std::AsyncWrite>::poll_write
  14: futures_util::compat::compat03as01::io::<impl std::io::Write for futures_util::compat::compat03as01::Compat<W>>::write
  15: tokio_io::async_write::AsyncWrite::poll_write
  16: futures::task_impl::std::set
  17: <futures_codec::framed_write::FramedWrite2<T> as futures_sink::Sink<<T as futures_codec::encoder::Encoder>::Item>>::poll_flush
  18: <libp2p_mplex::Multiplex<C> as libp2p_core::muxing::StreamMuxer>::flush_substream
  19: <libp2p_core::muxing::Wrap<T> as libp2p_core::muxing::StreamMuxer>::flush_substream
  20: futures_util::compat::compat03as01::io::<impl std::io::Write for futures_util::compat::compat03as01::Compat<W>>::flush
  21: tokio_io::async_write::AsyncWrite::poll_flush
  22: futures::task_impl::std::set
  23: <futures_codec::framed_write::FramedWrite2<T> as futures_sink::Sink<<T as futures_codec::encoder::Encoder>::Item>>::poll_flush
  24: <futures_util::sink::with::With<Si,Item,U,Fut,F> as futures_sink::Sink<U>>::poll_flush
  25: <libp2p_kad::handler::KademliaHandler<TUserData> as libp2p_swarm::protocols_handler::ProtocolsHandler>::poll
  26: <libp2p_swarm::protocols_handler::node_handler::NodeHandlerWrapper<TProtoHandler> as libp2p_core::nodes::handled_node::NodeHandler>::poll
  27: libp2p_core::nodes::handled_node::HandledNode<TMuxer,THandler>::poll
  28: <libp2p_core::nodes::tasks::task::Task<F,M,H,I,O,E,C> as core::future::future::Future>::poll
  29: futures_executor::thread_pool::PoolState::work
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.
@twittner
Contributor

It would be tremendously helpful if you could provide some sort of test case that makes it possible to reproduce your problems. Could you also provide the specific errors you see or some logs? And about the panic ― do you see this with both multiplexer implementations or only with mplex?


mikevoronov commented Mar 29, 2020

Hi, thank you for the comment, but unfortunately I can't share the project right now, and extracting a minimal reproduction would take a lot of time. I thought that since there seem to be no projects that use the get providers feature explicitly (maybe I am wrong, but I can't find any on GitHub or in the list of notable users), and since there are some quickly detectable inconveniences (#1507, #1526), the problem here might also be easy to discover...

On default libp2p with the debug log level I see `Refused substream; reached maximum number of substreams` together with the corresponding error.

The panic appears only with mplex, and I now believe it isn't linked to reaching the maximum substream count.

twittner added a commit to twittner/rust-libp2p that referenced this issue Mar 30, 2020
Issues libp2p#1504 and libp2p#1523 reported panics caused by polling the sink of
`secio::LenPrefixCodec` after it had entered its terminal state, i.e.
after it had previously encountered an error or was closed. According
to the reports this happened only when using mplex as a stream
multiplexer. It seems that because mplex always stores and keeps the
`Waker` when polling, a wakeup of any of those wakers will resume the
polling even for those cases where the previous poll did not return
`Poll::Pending` but resolved to a value.

To prevent polling after the connection was closed or an error
happened we check for those conditions prior to every poll.
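
In other words, the terminal state is remembered and later polls short-circuit instead of reaching the inner sink again. A rough, self-contained illustration of that pattern (not the actual quicksink/`LenPrefixCodec` code; `GuardedSink` and the `io::Error` bound are made up for the sketch):

```rust
use futures::sink::Sink;
use std::io;
use std::pin::Pin;
use std::task::{Context, Poll};

// Sketch only, not the actual patch: a wrapper that remembers when the inner
// sink has failed or was closed and refuses to poll it again afterwards,
// reporting an error instead.
struct GuardedSink<S> {
    inner: S,
    terminated: bool,
}

impl<S> GuardedSink<S> {
    fn terminal_err() -> io::Error {
        io::Error::new(io::ErrorKind::Other, "sink already closed or failed")
    }
}

impl<S, Item> Sink<Item> for GuardedSink<S>
where
    S: Sink<Item, Error = io::Error> + Unpin,
{
    type Error = io::Error;

    fn poll_ready(self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<io::Result<()>> {
        let this = self.get_mut();
        if this.terminated {
            return Poll::Ready(Err(Self::terminal_err()));
        }
        match Pin::new(&mut this.inner).poll_ready(cx) {
            Poll::Ready(Err(e)) => {
                this.terminated = true; // remember the terminal state
                Poll::Ready(Err(e))
            }
            other => other,
        }
    }

    fn start_send(self: Pin<&mut Self>, item: Item) -> io::Result<()> {
        let this = self.get_mut();
        if this.terminated {
            return Err(Self::terminal_err());
        }
        Pin::new(&mut this.inner).start_send(item)
    }

    fn poll_flush(self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<io::Result<()>> {
        let this = self.get_mut();
        if this.terminated {
            return Poll::Ready(Err(Self::terminal_err()));
        }
        match Pin::new(&mut this.inner).poll_flush(cx) {
            Poll::Ready(Err(e)) => {
                this.terminated = true;
                Poll::Ready(Err(e))
            }
            other => other,
        }
    }

    fn poll_close(self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<io::Result<()>> {
        let this = self.get_mut();
        if this.terminated {
            return Poll::Ready(Err(Self::terminal_err()));
        }
        match Pin::new(&mut this.inner).poll_close(cx) {
            // Whether close succeeded or failed, the sink is now terminal.
            Poll::Ready(res) => {
                this.terminated = true;
                Poll::Ready(res)
            }
            Poll::Pending => Poll::Pending,
        }
    }
}
```
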
@twittner
Contributor

> The panic appears only with mplex, and I now believe it isn't linked to reaching the maximum substream count.

@michaelvoronov: Would you mind testing #1529 to see if the panic is gone?

@mikevoronov
Contributor Author

@twittner thanks, will try to do so today or tomorrow.

tomaka added a commit that referenced this issue Apr 1, 2020
* mplex: Check for error and shutdown.

Issues #1504 and #1523 reported panics caused by polling the sink of
`secio::LenPrefixCodec` after it had entered its terminal state, i.e.
after it had previously encountered an error or was closed. According
to the reports this happened only when using mplex as a stream
multiplexer. It seems that because mplex always stores and keeps the
`Waker` when polling, a wakeup of any of those wakers will resume the
polling even for those cases where the previous poll did not return
`Poll::Pending` but resolved to a value.

To prevent polling after the connection was closed or an error
happened we check for those conditions prior to every poll.

* Keep error when operations fail.

Co-authored-by: Pierre Krieger <pierre.krieger1708@gmail.com>
@mikevoronov
Contributor Author

@twittner I've run the complex test with the fix several times - there weren't any panics with mplex.


tomaka commented Apr 29, 2020

I guess this is resolved? 🎉
