
Random upgrade timeouts between peers #1629

Closed
BlackYoup opened this issue Jun 25, 2020 · 21 comments

@BlackYoup
Contributor

Hello everyone,

I'm having a weird issue where, sometimes, a substream will time out during its upgrade. I have little knowledge of Rust futures and of this project's internals, so I'm sorry if I misinterpret some things. You can find a commit with my debug logs here.

Context

I'm running 3 services on my local machine (let's call them nodes A / B / C), all of them using rust-libp2p. The swarm uses a custom NetworkBehaviour which combines 3 behaviours:

  • Kademlia
  • Identify
  • A custom behaviour I made

The transport is plaintext TCP with the mplex stream multiplexer. I have some other issues with yamux, which is why I use mplex.

My mplex configuration:

let mut mplex = MplexConfig::default();
mplex.max_buffer_len_behaviour(MaxBufferBehaviour::Block);
mplex.split_send_size(1024 * 1024);
mplex.max_substreams(100000); // Yes, this is insane, it's to work around another issue I had and I didn't take the time to lower it yet
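
For reference, here's a rough sketch (not my actual code; the function name, keypair handling and module paths are just assumptions based on the libp2p 0.20-era transport builder) of how this plaintext TCP + mplex transport fits together:

use libp2p::{
    core::upgrade, identity::Keypair, mplex::MplexConfig,
    plaintext::PlainText2Config, tcp::TcpConfig, Transport,
};

// Hypothetical helper: plaintext TCP, upgraded with the mplex configuration
// built above (passed in as `mplex`).
fn build_transport(keypair: &Keypair, mplex: MplexConfig) -> impl Transport {
    TcpConfig::new()
        // Multistream-select 1.0 for protocol negotiation.
        .upgrade(upgrade::Version::V1)
        // No encryption: the plaintext handshake only exchanges public keys.
        .authenticate(PlainText2Config { local_public_key: keypair.public() })
        // Multiplex each connection with mplex.
        .multiplex(mplex)
}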

I'm using rust-libp2p 0.20.1 but had the same issue on 0.19.1. The issue may be timing-related: it might not occur in debug mode with enough logs to "slow down" the services, but release mode almost always triggers it. Also, I started seeing it after a refactor of my code which seems to have brought a performance boost as well.

The 3 nodes connect to each other on my local machine using 127.0.0.1:port. Node B then advertises 1-2k provider records into the DHT. Node C then receives requests from another external service and performs provider lookups in the DHT. The timeout randomly appears during those lookups. From time to time I get something like Request to PeerId("12D3KooWAZ1gXB3ZokKzEnkXvWDK6RWj38rNkkeVTsqyevL5AYqV") in query QueryId(67) failed with Upgrade(Timeout). The timeout is the default one, 10 seconds.
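
To illustrate the two operations involved (a hypothetical sketch, not my actual code; the helper names are placeholders and return values are ignored to stay version-agnostic), the provider announcements on node B and the lookups on node C boil down to calls like these:

use libp2p::kad::{record::store::MemoryStore, record::Key, Kademlia};

// Node B: advertise this node as a provider for an artifact path.
fn announce(kademlia: &mut Kademlia<MemoryStore>, key: &str) {
    let _ = kademlia.start_providing(Key::new(&key));
}

// Node C: start a GET_PROVIDERS query for the same key; the result arrives
// asynchronously as a behaviour event when the swarm is polled.
fn lookup(kademlia: &mut Kademlia<MemoryStore>, key: &str) {
    let _ = kademlia.get_providers(Key::new(&key));
}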

Upgrade timeout

I ran with RUST_LOG=trace to get all the logs I could and started tracing the issue. It seems that node C successfully creates the substream and starts sending data to node B. On node B, I can see the created substream:

Substream 68 logs
2020-06-25T14:46:00.419Z TRACE libp2p_mplex                                  > Received message: Open { substream_id: 68 }
2020-06-25T14:46:00.419Z TRACE libp2p_mplex                                  > Filter pass for Open { substream_id: 68 }
2020-06-25T14:46:00.419Z DEBUG libp2p_mplex                                  > Successfully opened inbound substream 68
2020-06-25T14:46:00.419Z TRACE libp2p_swarm::protocols_handler::node_handler > We have 1 negotiating_in
2020-06-25T14:46:00.419Z TRACE libp2p_swarm::protocols_handler::node_handler > Poll timeout for 0
2020-06-25T14:46:00.419Z TRACE libp2p_swarm::protocols_handler::node_handler > Poll stream for 0
2020-06-25T14:46:00.419Z TRACE libp2p_swarm::protocols_handler::node_handler > Got Poll::Pending
2020-06-25T14:46:00.419Z TRACE libp2p_swarm::protocols_handler::node_handler > We have 1 negotiating_in
2020-06-25T14:46:00.419Z TRACE libp2p_swarm::protocols_handler::node_handler > Poll timeout for 0
2020-06-25T14:46:00.419Z TRACE libp2p_swarm::protocols_handler::node_handler > Poll stream for 0
2020-06-25T14:46:00.419Z TRACE libp2p_swarm::protocols_handler::node_handler > Got Poll::Pending
2020-06-25T14:46:00.419Z TRACE libp2p_mplex                                  > Received message: Data { substream_id: 68, endpoint: Dialer, data: b"\x13/multistream/1.0.0\n\x10/ipfs/kad/1.0.0\n" }
2020-06-25T14:46:00.419Z TRACE libp2p_mplex                                  > Filtering. elem=Data { substream_id: 68, endpoint: Dialer, data: b"\x13/multistream/1.0.0\n\x10/ipfs/kad/1.0.0\n" }, substream_num=2, endpoint=Listener
2020-06-25T14:46:09.000Z TRACE libp2p_swarm::protocols_handler::node_handler > We have 0 negotiating_in
2020-06-25T14:46:10.419Z TRACE libp2p_swarm::protocols_handler::node_handler > We have 1 negotiating_in
2020-06-25T14:46:10.419Z TRACE libp2p_swarm::protocols_handler::node_handler > Poll timeout for 0    

I added a few logs in node_handler and mplex; the whole commit is my debugging attempt.

From what I see here, the newly created substream never leaves the negotiating_in "state" and I'm not sure why. For example, here's another substream successfully created right before substream 68:

Substream 67 logs
2020-06-25T14:46:00.373Z TRACE libp2p_mplex                                  > Received message: Open { substream_id: 67 }
2020-06-25T14:46:00.373Z TRACE libp2p_mplex                                  > Filter pass for Open { substream_id: 67 }
2020-06-25T14:46:00.373Z DEBUG libp2p_mplex                                  > Successfully opened inbound substream 67
2020-06-25T14:46:00.373Z TRACE libp2p_swarm::protocols_handler::node_handler > We have 1 negotiating_in
2020-06-25T14:46:00.373Z TRACE libp2p_swarm::protocols_handler::node_handler > Poll timeout for 0
2020-06-25T14:46:00.373Z TRACE libp2p_swarm::protocols_handler::node_handler > Poll stream for 0
2020-06-25T14:46:00.373Z TRACE libp2p_mplex                                  > Received message: Data { substream_id: 67, endpoint: Dialer, data: b"\x13/multistream/1.0.0\n\x10/ipfs/kad/1.0.0\n" }
2020-06-25T14:46:00.373Z TRACE libp2p_mplex                                  > Filtering. elem=Data { substream_id: 67, endpoint: Dialer, data: b"\x13/multistream/1.0.0\n\x10/ipfs/kad/1.0.0\n" }, substream_num=67, endpoint=Listener
2020-06-25T14:46:00.373Z TRACE libp2p_mplex                                  > Filter pass for Data { substream_id: 67, endpoint: Dialer, data: b"\x13/multistream/1.0.0\n\x10/ipfs/kad/1.0.0\n" }
2020-06-25T14:46:00.373Z TRACE libp2p_mplex                                  > Poll::Ready for substream 67
2020-06-25T14:46:00.373Z TRACE libp2p_mplex                                  > Poll::Ready for substream 67
2020-06-25T14:46:00.373Z TRACE multistream_select::protocol                  > Received message: Header(V1)
2020-06-25T14:46:00.373Z TRACE libp2p_mplex                                  > Poll::Ready for substream 67
2020-06-25T14:46:00.373Z TRACE libp2p_mplex                                  > Poll::Ready for substream 67
2020-06-25T14:46:00.374Z TRACE multistream_select::protocol                  > Received message: Protocol(Protocol(b"/ipfs/kad/1.0.0"))
2020-06-25T14:46:00.374Z DEBUG multistream_select::listener_select           > Listener: confirming protocol: /ipfs/kad/1.0.0
2020-06-25T14:46:00.374Z DEBUG multistream_select::listener_select           > Listener: sent confirmed protocol: /ipfs/kad/1.0.0
2020-06-25T14:46:00.374Z DEBUG libp2p_core::upgrade::apply                   > Successfully applied negotiated protocol
2020-06-25T14:46:00.374Z TRACE libp2p_swarm::protocols_handler::node_handler > Got Poll::Ready(Ok(upgrade))
2020-06-25T14:46:00.374Z TRACE libp2p_swarm::protocols_handler::node_handler > We have 0 negotiating_in
2020-06-25T14:46:00.374Z TRACE libp2p_mplex                                  > Received message: Data { substream_id: 67, endpoint: Dialer, data: b"E\x08\x03\x12?/org/scala-lang/scala-reflect/2.12.10/scala-reflect-2.12.10.jarP\n" }
2020-06-25T14:46:00.374Z TRACE libp2p_swarm::protocols_handler::node_handler > We have 0 negotiating_in
2020-06-25T14:46:00.374Z TRACE libp2p_mplex                                  > Filtering. elem=Data { substream_id: 67, endpoint: Dialer, data: b"E\x08\x03\x12?/org/scala-lang/scala-reflect/2.12.10/scala-reflect-2.12.10.jarP\n" }, substream_num=2, endpoint=Listener
2020-06-25T14:46:00.374Z TRACE libp2p_mplex                                  > Filtering. elem=Data { substream_id: 67, endpoint: Dialer, data: b"E\x08\x03\x12?/org/scala-lang/scala-reflect/2.12.10/scala-reflect-2.12.10.jarP\n" }, substream_num=67, endpoint=Listener
2020-06-25T14:46:00.375Z TRACE libp2p_mplex                                  > Poll::Ready for substream 67
2020-06-25T14:46:00.375Z TRACE libp2p_swarm::protocols_handler::node_handler > We have 0 negotiating_in
2020-06-25T14:46:00.375Z TRACE libp2p_swarm::protocols_handler::node_handler > We have 0 negotiating_in
2020-06-25T14:46:00.376Z TRACE libp2p_mplex                                  > Received message: Close { substream_id: 67, endpoint: Dialer }

As you can see, the difference between substreams 67 and 68 is that the "filtering" function (logs starting with Filtering.) successfully returns when called on substream 67, but for substream 68 it tries to match the message to a different substream and never tries again later. I've seen substreams where the filter function "fails" and then "succeeds" in matching the substream at some point.

I can't know for sure if that's the issue, but that's where I am right now and I don't have many more ideas. I'm not even sure that my issue comes from here, which is why I opened this issue.

Unfortunately I can't share a reproducer as I'm not sure how to reproduce it exactly. Like I said at the start, I first add ~1-2k provider records to the DHT, which creates that many substreams too, and that works fine.

Would you happen to have any ideas as to where I should look next to continue debugging this?

Thanks!

@tomaka
Member

tomaka commented Jun 25, 2020

Hi,

To give some insight: the "upgrade", in the case of Kademlia, also includes transferring the Kademlia request and response. The word "upgrade" is a bit misleading. The substream doesn't leave negotiating_in until after the response to the query has been sent back.

Could you maybe look at the libp2p_kad logs on node B?

@BlackYoup
Contributor Author

Thank you for the explanation and your fast response. If I understood correctly, it may be that Kademlia on node B doesn't look up the record then?

libp2p_kad doesn't seem to log a lot, the only logs I have on node B are:

2020-06-25T14:45:59.383Z TRACE libp2p_mplex                                  > Received message: Close { substream_id: 53, endpoint: Dialer }
2020-06-25T14:45:59.383Z TRACE libp2p_kad::handler                           > Inbound substream: EOF

When node B first starts adding the provider records, it logs a bit more, like:

2020-06-25T16:32:27.486Z TRACE libp2p_kad::behaviour                         > Request to PeerId("QmanhXTSM44jJamXWjegs94nYLWUy9rT5JcaW1BK5a26m8") in query QueryId(878) succeeded.
2020-06-25T16:32:27.486Z TRACE libp2p_kad::behaviour                         > Peer KadPeer { node_id: PeerId("12D3KooWHMdn8cNUQBbHQPWyH1WYEirYadF4YsEafZuD8BeH1EjW"), multiaddrs: ["/ip4/127.0.0.1/tcp/4042", "/ip4/172.17.0.1/tcp/4042", "/ip4/192.168.1.5/tcp/4042", "/ip6/::1/tcp/4042", "/ip6/2a01:e35:2fd2:4920:3648:1d8f:7550:efe8/tcp/4042"], connection_ty: Connected } reported by PeerId("QmanhXTSM44jJamXWjegs94nYLWUy9rT5JcaW1BK5a26m8") in query QueryId(878).
2020-06-25T16:32:27.486Z TRACE libp2p_kad::behaviour                         > Query QueryId(878) finished.
2020-06-25T16:32:27.487Z TRACE libp2p_kad::behaviour                         > Query QueryId(878) finished

But that's about it, no apparent error or anything else. I'll double-check those (that's about 9k lines of logs).

Maybe this part of the code doesn't have many logs? I'll try to add some and see what happens!

If it can help, here is my Kademlia configuration:

let mut store_config = MemoryStoreConfig::default();
store_config.max_records = 1024 * 1024;
store_config.max_provided_keys = 1024 * 1024;
store_config.max_providers_per_key = 1024; // TODO: check again
let store = MemoryStore::with_config(peer_id.clone(), store_config);
let mut kademlia_config = KademliaConfig::default();
let kademlia = Kademlia::with_config(peer_id.clone(), store, kademlia_config);

@tomaka
Member

tomaka commented Jun 29, 2020

My intuition is that you're not properly polling the Swarm. It's necessary to do so for it to respond to incoming requests.

@BlackYoup
Contributor Author

Thanks for the hint. I'm using the exact same pattern as this example https://github.com/libp2p/rust-libp2p/blob/master/examples/chat.rs#L150 to poll the swarm. I also have the stdin pattern to read debug commands I might need at runtime.

When I'm reproducing the timeout, hitting <enter> on both node B and node C polls stdin and then the swarm, but that's it: the timeout is still hit a few seconds later. No libp2p logs are emitted following this poll (still using RUST_LOG=trace). My understanding is that if I were not properly polling the Swarm, my DHT request should have been "unlocked" when I hit <enter> and the swarm got polled?
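
For reference, a minimal sketch of that polling pattern (not my actual code; the swarm and stdin are abstracted as generic streams here):

use futures::prelude::*;
use std::task::{Context, Poll};

// First drain pending stdin lines (debug commands), then drive the swarm
// until it returns Pending. Each call to poll_next_unpin advances
// connections, substream upgrades and behaviours as far as they can go.
fn poll_loop<L, S>(stdin: &mut L, swarm: &mut S, cx: &mut Context<'_>) -> Poll<()>
where
    L: Stream<Item = String> + Unpin,
    S: Stream + Unpin,
    S::Item: std::fmt::Debug,
{
    while let Poll::Ready(Some(line)) = stdin.poll_next_unpin(cx) {
        println!("command: {}", line);
    }
    loop {
        match swarm.poll_next_unpin(cx) {
            Poll::Ready(Some(event)) => println!("swarm event: {:?}", event),
            Poll::Ready(None) => return Poll::Ready(()),
            Poll::Pending => return Poll::Pending,
        }
    }
}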

Node B logs
2020-06-29T15:05:46.553Z TRACE libp2p_mplex                                  > Received message: Open { substream_id: 257 }
2020-06-29T15:05:46.553Z TRACE libp2p_mplex                                  > Filter pass for Open { substream_id: 257 }
2020-06-29T15:05:46.553Z DEBUG libp2p_mplex                                  > Successfully opened inbound substream 257
2020-06-29T15:05:46.553Z TRACE libp2p_swarm::protocols_handler::node_handler > We have 1 negotiating_in
2020-06-29T15:05:46.553Z TRACE libp2p_swarm::protocols_handler::node_handler > Poll timeout for 0
2020-06-29T15:05:46.553Z TRACE libp2p_swarm::protocols_handler::node_handler > Poll stream for 0
2020-06-29T15:05:46.553Z TRACE libp2p_swarm::protocols_handler::node_handler > Got Poll::Pending
2020-06-29T15:05:46.553Z TRACE libp2p_swarm::protocols_handler::node_handler > We have 1 negotiating_in
2020-06-29T15:05:46.553Z TRACE libp2p_swarm::protocols_handler::node_handler > Poll timeout for 0
2020-06-29T15:05:46.553Z TRACE libp2p_swarm::protocols_handler::node_handler > Poll stream for 0
2020-06-29T15:05:46.553Z TRACE libp2p_swarm::protocols_handler::node_handler > Got Poll::Pending
2020-06-29T15:05:46.554Z TRACE libp2p_mplex                                  > Received message: Data { substream_id: 257, endpoint: Dialer, data: b"\x13/multistream/1.0.0\n\x10/ipfs/kad/1.0.0\n" }
2020-06-29T15:05:46.554Z TRACE libp2p_swarm::protocols_handler::node_handler > We have 1 negotiating_in
2020-06-29T15:05:46.554Z TRACE libp2p_swarm::protocols_handler::node_handler > Poll timeout for 0
2020-06-29T15:05:46.554Z TRACE libp2p_swarm::protocols_handler::node_handler > Poll stream for 0
2020-06-29T15:05:46.554Z TRACE libp2p_mplex                                  > Filtering. elem=Data { substream_id: 257, endpoint: Dialer, data: b"\x13/multistream/1.0.0\n\x10/ipfs/kad/1.0.0\n" }, substream_num=257, endpoint=Listener
2020-06-29T15:05:46.554Z TRACE libp2p_mplex                                  > Poll::Ready for substream 257
2020-06-29T15:05:46.554Z TRACE libp2p_mplex                                  > Poll::Ready for substream 257
2020-06-29T15:05:46.554Z TRACE multistream_select::protocol                  > Received message: Header(V1)
2020-06-29T15:05:46.554Z TRACE libp2p_mplex                                  > Poll::Ready for substream 257
2020-06-29T15:05:46.554Z TRACE libp2p_mplex                                  > Poll::Ready for substream 257
2020-06-29T15:05:46.554Z TRACE multistream_select::protocol                  > Received message: Protocol(Protocol(b"/ipfs/kad/1.0.0"))
2020-06-29T15:05:46.554Z DEBUG multistream_select::listener_select           > Listener: confirming protocol: /ipfs/kad/1.0.0
2020-06-29T15:05:46.554Z DEBUG multistream_select::listener_select           > Listener: sent confirmed protocol: /ipfs/kad/1.0.0
2020-06-29T15:05:46.554Z TRACE libp2p_kad::protocol                          > upgrade_inbound
2020-06-29T15:05:46.554Z DEBUG libp2p_core::upgrade::apply                   > Successfully applied negotiated protocol
2020-06-29T15:05:46.554Z TRACE libp2p_swarm::protocols_handler::node_handler > Got Poll::Ready(Ok(upgrade))
2020-06-29T15:05:46.554Z TRACE libp2p_kad::handler                           > inject_fully_negociated_inbound

// the line above is the line where I hit <enter>
2020-06-29T15:05:51.584Z ERROR p2p_deps                                      > Unknown command
2020-06-29T15:05:51.584Z INFO  p2p_deps                                      > polled swarm(): Pending
2020-06-29T15:05:52.837Z TRACE libp2p_swarm::protocols_handler::node_handler > We have 0 negotiating_in
2020-06-29T15:05:56.575Z TRACE libp2p_mplex                                  > Received message: Open { substream_id: 258 }
Node C logs
2020-06-29T15:05:46.552Z DEBUG libp2p_mplex                                  > Successfully opened outbound substream 257
2020-06-29T15:05:46.553Z TRACE libp2p_swarm::protocols_handler::node_handler > We have 0 negotiating_in
2020-06-29T15:05:46.553Z DEBUG libp2p_core::upgrade::apply                   > 71aa9965-6f65-4b99-9694-98fb934e7bc9 init still pending
2020-06-29T15:05:46.553Z TRACE libp2p_swarm::protocols_handler::node_handler > We have 0 negotiating_in
2020-06-29T15:05:46.553Z DEBUG libp2p_core::upgrade::apply                   > 71aa9965-6f65-4b99-9694-98fb934e7bc9 init still pending
2020-06-29T15:05:46.554Z TRACE libp2p_mplex                                  > Received message: Data { substream_id: 256, endpoint: Listener, data: b"\x13/multistream/1.0.0\n" }
2020-06-29T15:05:46.554Z TRACE libp2p_mplex                                  > Received message: Data { substream_id: 256, endpoint: Listener, data: b"\x10/ipfs/kad/1.0.0\n" }
2020-06-29T15:05:46.554Z TRACE libp2p_swarm::protocols_handler::node_handler > We have 0 negotiating_in
2020-06-29T15:05:46.554Z TRACE libp2p_mplex                                  > Filtering. elem=Data { substream_id: 256, endpoint: Listener, data: b"\x13/multistream/1.0.0\n" }, substream_num=256, endpoint=Dialer
2020-06-29T15:05:46.554Z TRACE libp2p_mplex                                  > Poll::Ready for substream 256
2020-06-29T15:05:46.554Z TRACE libp2p_mplex                                  > Poll::Ready for substream 256
2020-06-29T15:05:46.554Z TRACE libp2p_mplex                                  > Filtering. elem=Data { substream_id: 256, endpoint: Listener, data: b"\x10/ipfs/kad/1.0.0\n" }, substream_num=256, endpoint=Dialer
2020-06-29T15:05:46.554Z TRACE libp2p_mplex                                  > Poll::Ready for substream 256
2020-06-29T15:05:46.554Z TRACE libp2p_mplex                                  > Poll::Ready for substream 256
2020-06-29T15:05:46.554Z DEBUG libp2p_core::upgrade::apply                   > 84806ec6-c319-4a9b-bb0f-99b43b5de3da Init ready
2020-06-29T15:05:46.554Z DEBUG libp2p_core::upgrade::apply                   > 84806ec6-c319-4a9b-bb0f-99b43b5de3da Successfully applied negotiated protocol
2020-06-29T15:05:46.554Z TRACE libp2p_mplex                                  > Received message: Data { substream_id: 257, endpoint: Listener, data: b"\x13/multistream/1.0.0\n" }
2020-06-29T15:05:46.554Z TRACE libp2p_swarm::protocols_handler::node_handler > We have 0 negotiating_in
2020-06-29T15:05:46.554Z TRACE libp2p_mplex                                  > Filtering. elem=Data { substream_id: 257, endpoint: Listener, data: b"\x13/multistream/1.0.0\n" }, substream_num=257, endpoint=Dialer
2020-06-29T15:05:46.554Z TRACE libp2p_mplex                                  > Poll::Ready for substream 257
2020-06-29T15:05:46.554Z TRACE libp2p_mplex                                  > Poll::Ready for substream 257
2020-06-29T15:05:46.554Z DEBUG libp2p_core::upgrade::apply                   > 71aa9965-6f65-4b99-9694-98fb934e7bc9 init still pending
2020-06-29T15:05:46.554Z TRACE libp2p_mplex                                  > Received message: Data { substream_id: 257, endpoint: Listener, data: b"\x10/ipfs/kad/1.0.0\n" }
2020-06-29T15:05:46.554Z TRACE libp2p_mplex                                  > Filtering. elem=Data { substream_id: 257, endpoint: Listener, data: b"\x10/ipfs/kad/1.0.0\n" }, substream_num=3, endpoint=Dialer
2020-06-29T15:05:46.554Z TRACE libp2p_mplex                                  > Received message: Data { substream_id: 256, endpoint: Listener, data: b"p\x08\x03Bj\n&\0$\x08\x01\x12 \xae \xc4\x8cO\xc69\x85l[\xef\xcb$\xa7\xee\x0c\xb1Rlbux\x015Zi\xde\xdd\xb2\xb0\x97\xed\x12\x08\x04\x7f\0\0\x01\x06\x0f\xc9\x12\x08\x04\xc0\xa8\x01\x05\x06\x0f\xc9\x12\x14)\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\x01\x06\x0f\xc9\x12\x14)*\x01\x0e5/\xd2I 6H\x1d\x8fuP\xef\xe8\x06\x0f\xc9\x18\x01P\t" }
2020-06-29T15:05:46.554Z TRACE libp2p_swarm::protocols_handler::node_handler > We have 0 negotiating_in
2020-06-29T15:05:46.554Z TRACE libp2p_mplex                                  > Filtering. elem=Data { substream_id: 256, endpoint: Listener, data: b"p\x08\x03Bj\n&\0$\x08\x01\x12 \xae \xc4\x8cO\xc69\x85l[\xef\xcb$\xa7\xee\x0c\xb1Rlbux\x015Zi\xde\xdd\xb2\xb0\x97\xed\x12\x08\x04\x7f\0\0\x01\x06\x0f\xc9\x12\x08\x04\xc0\xa8\x01\x05\x06\x0f\xc9\x12\x14)\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\x01\x06\x0f\xc9\x12\x14)*\x01\x0e5/\xd2I 6H\x1d\x8fuP\xef\xe8\x06\x0f\xc9\x18\x01P\t" }, substream_num=256, endpoint=Dialer
2020-06-29T15:05:46.554Z TRACE libp2p_mplex                                  > Poll::Ready for substream 256
2020-06-29T15:05:46.554Z TRACE libp2p_swarm::protocols_handler::node_handler > We have 0 negotiating_in
2020-06-29T15:05:46.554Z INFO  p2p_deps                                      > polled swarm(): Pending

// the line above is the line I hit <enter>
2020-06-29T15:05:50.356Z ERROR p2p_deps                                      > Unknown command
2020-06-29T15:05:50.356Z INFO  p2p_deps                                      > polled swarm(): Pending
2020-06-29T15:05:56.553Z TRACE libp2p_swarm::protocols_handler::node_handler > We have 0 negotiating_in
2020-06-29T15:05:56.553Z ERROR libp2p_swarm::protocols_handler::node_handler > ProcolsHandlerUpgrErr::Timeout. Handler id: 71aa9965-6f65-4b99-9694-98fb934e7bc9
2020-06-29T15:05:56.553Z TRACE libp2p_swarm::protocols_handler::node_handler > We have 0 negotiating_in
2020-06-29T15:05:56.553Z DEBUG libp2p_kad::behaviour                         > Request to PeerId("12D3KooWMY664xNU35FnMYkW8xHCeT6eg3bJk7ENmebouk3HjNe8") in query QueryId(255) failed with Upgrade(Timeout). Query=Some(GetProviders { key: Key(b"/org/scala-sbt/test-interface/1.0/test-interface-1.0.pom.sha1"), providers: {} })

The log line polled swarm(): is the result of let polled = swarm.poll_next_unpin(cx);

I haven't had the time since last Thursday to get back into debugging this, so I don't have more information yet. Also, if you have ideas as to where I could add logs to try to figure out what's happening, I will gladly add them.

As a side note, my DHT requests are not concurrent. I mean that in this case I only have one DHT request at a time, but I may make multiple requests per second. Once one is finished, another might happen.

@BlackYoup
Contributor Author

Hey,

So I've debugged this a bit more but still haven't found the root cause. Here's what I noticed: it seems like the problem comes from node C instead of node B after all. Node B successfully (at least now) calls inject_fully_negotiated_inbound. Its state is then set to SubstreamState::InWaitingMessage(connec_unique_id, protocol).

Node C is the node that initiates the DHT requests.

On node C, inject_fully_negotiated_outbound is never reached for the substream that times out. If I understood correctly, when a DHT request is made, the Kademlia behaviour (or whatever is doing it) sends the request to all the peers it knows, or something like that, which means a request is sent to both node A and node B. I'm not entirely sure which node times out, though, as I couldn't find a way to know which handler handled which node.

Anyway, at least one of the two requests always gets a response from one of the nodes. From there, here's what happens with the request that fails:

  • The substream is opened
  • The connection handler is polled a few times until we receive a message: BlackYoup@eefd240#diff-0618f4d473dc8dc779adc1ad14f3eed7R263
  • We receive a message: TRACE libp2p_mplex > Received message: Data { substream_id: 8, endpoint: Listener, data: b"\x13/multistream/1.0.0\n" }
  • The connection handler gets polled again: BlackYoup@eefd240#diff-0618f4d473dc8dc779adc1ad14f3eed7R263
  • Thus, we poll the IO object and the "upgrade state" is advanced.
  • We then receive another message: TRACE libp2p_mplex > Received message: Data { substream_id: 8, endpoint: Listener, data: b"\x10/ipfs/kad/1.0.0\n" }
  • This time, the connection handler doesn't get polled and will never be polled again until the timeout occurs.

The second request sent goes through the same process, but I guess we already had both messages in the network buffer so they get polled only once. On previous requests, I've noticed the same behaviour: the last stream to be opened is the first one to be polled, and the first message is already there but not the second (yet). BUT in that case, when the second one is received, the connection handler gets polled again, which is not the case here.

Here are some logs. They're quite verbose, but I didn't want to clean them up so you'd have the exact same view as I do; they are annotated, though. I've deleted the timestamps, but everything happens within the same 800 milliseconds except for the few lines at the end once the timeout is reached. Annotated lines are prefixed with ##. Once again, you can find all those added logs in BlackYoup@eefd240

Detailed logs
Ids:
- 8689112a-274f-4952-8d0d-51f58ccae855 and fd17a8b5-14ca-4b85-9e32-d60579f73c13: kademlia handlers for node A or B.
- 7a5e24c0-97f7-4fad-ac35-cc49d887834b and c41a702c-c3bd-4571-8196-90edfb023f7d: substreams IDs the handlers are handling. Each handler handles one substream
- 7 or 8: the substream "num" seen by mplex

Here, the substream num 8 with ID c41a702c-c3bd-4571-8196-90edfb023f7d will fail to upgrade.
substream 7 with ID 7a5e24c0-97f7-4fad-ac35-cc49d887834b will successfully upgrade.


## The substream will start to either node A or B
TRACE libp2p_kad::handler                           > 8689112a-274f-4952-8d0d-51f58ccae855 advancing substream. state=OutPendingOpen(GetProviders { key: Key(b"/org/scala-sbt/sbt/1.3.7/sbt-1.3.7.pom.sha1") }), stream_id=None
TRACE libp2p_kad::handler                           > 8689112a-274f-4952-8d0d-51f58ccae855 Advanced without new_state. empty=true
TRACE libp2p_core::connection                       > Polling ConnectionHandler. io_pending=true, handler_negotiating_out=Some([])
TRACE libp2p_swarm::protocols_handler::node_handler > We have 0 negotiating_in
TRACE libp2p_swarm::protocols_handler::node_handler > We have 0 negotiating_out. outs=[]
DEBUG libp2p_mplex                                  > Successfully opened outbound substream 7
TRACE libp2p_kad::handler                           > fd17a8b5-14ca-4b85-9e32-d60579f73c13 poll. substreams=1
## The substream will start to either node A or B
TRACE libp2p_kad::handler                           > fd17a8b5-14ca-4b85-9e32-d60579f73c13 advancing substream. state=OutPendingOpen(GetProviders { key: Key(b"/org/scala-sbt/sbt/1.3.7/sbt-1.3.7.pom.sha1") }), stream_id=None
TRACE libp2p_kad::handler                           > fd17a8b5-14ca-4b85-9e32-d60579f73c13 Advanced without new_state. empty=true
TRACE libp2p_core::connection                       > Polling ConnectionHandler. io_pending=false, handler_negotiating_out=Some([7a5e24c0-97f7-4fad-ac35-cc49d887834b])
TRACE libp2p_swarm::protocols_handler::node_handler > We have 0 negotiating_in
TRACE libp2p_swarm::protocols_handler::node_handler > We have 1 negotiating_out. outs=[7a5e24c0-97f7-4fad-ac35-cc49d887834b]
TRACE libp2p_swarm::protocols_handler::node_handler > Poll timeout for n=0, id=7a5e24c0-97f7-4fad-ac35-cc49d887834b
TRACE libp2p_swarm::protocols_handler::node_handler > Poll future for n=0, id=7a5e24c0-97f7-4fad-ac35-cc49d887834b
DEBUG libp2p_mplex                                  > Successfully opened outbound substream 8
DEBUG libp2p_core::upgrade::apply                   > 7a5e24c0-97f7-4fad-ac35-cc49d887834b init still pending
TRACE libp2p_swarm::protocols_handler::node_handler > Got pending for n=0, id=7a5e24c0-97f7-4fad-ac35-cc49d887834b
TRACE libp2p_kad::handler                           > 8689112a-274f-4952-8d0d-51f58ccae855 poll. substreams=0
TRACE libp2p_core::connection                       > Polling ConnectionHandler. io_pending=false, handler_negotiating_out=Some([c41a702c-c3bd-4571-8196-90edfb023f7d])
TRACE libp2p_swarm::protocols_handler::node_handler > We have 0 negotiating_in
TRACE libp2p_core::connection                       > Polling ConnectionHandler. io_pending=true, handler_negotiating_out=Some([7a5e24c0-97f7-4fad-ac35-cc49d887834b])
TRACE libp2p_swarm::protocols_handler::node_handler > We have 1 negotiating_out. outs=[c41a702c-c3bd-4571-8196-90edfb023f7d]
TRACE libp2p_swarm::protocols_handler::node_handler > We have 0 negotiating_in
TRACE libp2p_swarm::protocols_handler::node_handler > We have 1 negotiating_out. outs=[7a5e24c0-97f7-4fad-ac35-cc49d887834b]
TRACE libp2p_swarm::protocols_handler::node_handler > Poll timeout for n=0, id=7a5e24c0-97f7-4fad-ac35-cc49d887834b
TRACE libp2p_swarm::protocols_handler::node_handler > Poll future for n=0, id=7a5e24c0-97f7-4fad-ac35-cc49d887834b
DEBUG libp2p_core::upgrade::apply                   > 7a5e24c0-97f7-4fad-ac35-cc49d887834b init still pending
TRACE libp2p_swarm::protocols_handler::node_handler > Poll timeout for n=0, id=c41a702c-c3bd-4571-8196-90edfb023f7d
TRACE libp2p_swarm::protocols_handler::node_handler > Poll future for n=0, id=c41a702c-c3bd-4571-8196-90edfb023f7d
TRACE libp2p_swarm::protocols_handler::node_handler > Got pending for n=0, id=7a5e24c0-97f7-4fad-ac35-cc49d887834b
TRACE libp2p_kad::handler                           > 8689112a-274f-4952-8d0d-51f58ccae855 poll. substreams=0
DEBUG libp2p_core::upgrade::apply                   > c41a702c-c3bd-4571-8196-90edfb023f7d init still pending
TRACE libp2p_swarm::protocols_handler::node_handler > Got pending for n=0, id=c41a702c-c3bd-4571-8196-90edfb023f7d
TRACE libp2p_kad::handler                           > fd17a8b5-14ca-4b85-9e32-d60579f73c13 poll. substreams=0
TRACE libp2p_core::connection                       > Polling ConnectionHandler. io_pending=true, handler_negotiating_out=Some([c41a702c-c3bd-4571-8196-90edfb023f7d])
TRACE libp2p_swarm::protocols_handler::node_handler > We have 0 negotiating_in
TRACE libp2p_swarm::protocols_handler::node_handler > We have 1 negotiating_out. outs=[c41a702c-c3bd-4571-8196-90edfb023f7d]
TRACE libp2p_swarm::protocols_handler::node_handler > Poll timeout for n=0, id=c41a702c-c3bd-4571-8196-90edfb023f7d
TRACE libp2p_swarm::protocols_handler::node_handler > Poll future for n=0, id=c41a702c-c3bd-4571-8196-90edfb023f7d
DEBUG libp2p_core::upgrade::apply                   > c41a702c-c3bd-4571-8196-90edfb023f7d init still pending
TRACE libp2p_swarm::protocols_handler::node_handler > Got pending for n=0, id=c41a702c-c3bd-4571-8196-90edfb023f7d
TRACE libp2p_kad::handler                           > fd17a8b5-14ca-4b85-9e32-d60579f73c13 poll. substreams=0
## We receive the first message for our faulty stream upgrade
TRACE libp2p_mplex                                  > Received message: Data { substream_id: 8, endpoint: Listener, data: b"\x13/multistream/1.0.0\n" }
## The connection handler gets polled so we "get" the message and we poll the appropriate node handler
TRACE libp2p_core::connection                       > Polling ConnectionHandler. io_pending=true, handler_negotiating_out=Some([c41a702c-c3bd-4571-8196-90edfb023f7d])
TRACE libp2p_swarm::protocols_handler::node_handler > We have 0 negotiating_in
TRACE libp2p_swarm::protocols_handler::node_handler > We have 1 negotiating_out. outs=[c41a702c-c3bd-4571-8196-90edfb023f7d]
TRACE libp2p_swarm::protocols_handler::node_handler > Poll timeout for n=0, id=c41a702c-c3bd-4571-8196-90edfb023f7d
TRACE libp2p_swarm::protocols_handler::node_handler > Poll future for n=0, id=c41a702c-c3bd-4571-8196-90edfb023f7d
## Our message seems to pass this filter and so will return Ready right below
TRACE libp2p_mplex                                  > Filtering. elem=Data { substream_id: 8, endpoint: Listener, data: b"\x13/multistream/1.0.0\n" }, substream_num=8, endpoint=Dialer
TRACE libp2p_mplex                                  > Poll::Ready for substream 8
TRACE libp2p_mplex                                  > Poll::Ready for substream 8
## We are still pending since we are waiting for the /ipfs/kad message
DEBUG libp2p_core::upgrade::apply                   > c41a702c-c3bd-4571-8196-90edfb023f7d init still pending
TRACE libp2p_swarm::protocols_handler::node_handler > Got pending for n=0, id=c41a702c-c3bd-4571-8196-90edfb023f7d
## Here it is
TRACE libp2p_mplex                                  > Received message: Data { substream_id: 8, endpoint: Listener, data: b"\x10/ipfs/kad/1.0.0\n" }
## We filter but it seems that we are not polling the correct substream yet, and we never poll the ConnectionHandler for substream c41a702c-c3bd-4571-8196-90edfb023f7d ever again
TRACE libp2p_mplex                                  > Filtering. elem=Data { substream_id: 8, endpoint: Listener, data: b"\x10/ipfs/kad/1.0.0\n" }, substream_num=4, endpoint=Dialer
TRACE libp2p_kad::handler                           > fd17a8b5-14ca-4b85-9e32-d60579f73c13 poll. substreams=0
## We receive the first message of the second stream, that will work as expected
TRACE libp2p_mplex                                  > Received message: Data { substream_id: 7, endpoint: Listener, data: b"\x13/multistream/1.0.0\n" }
## We poll its connection handler
TRACE libp2p_core::connection                       > Polling ConnectionHandler. io_pending=true, handler_negotiating_out=Some([7a5e24c0-97f7-4fad-ac35-cc49d887834b])
TRACE libp2p_swarm::protocols_handler::node_handler > We have 0 negotiating_in
TRACE libp2p_swarm::protocols_handler::node_handler > We have 1 negotiating_out. outs=[7a5e24c0-97f7-4fad-ac35-cc49d887834b]
TRACE libp2p_swarm::protocols_handler::node_handler > Poll timeout for n=0, id=7a5e24c0-97f7-4fad-ac35-cc49d887834b
TRACE libp2p_swarm::protocols_handler::node_handler > Poll future for n=0, id=7a5e24c0-97f7-4fad-ac35-cc49d887834b
TRACE libp2p_mplex                                  > Filtering. elem=Data { substream_id: 7, endpoint: Listener, data: b"\x13/multistream/1.0.0\n" }, substream_num=7, endpoint=Dialer
TRACE libp2p_mplex                                  > Poll::Ready for substream 7
TRACE libp2p_mplex                                  > Poll::Ready for substream 7
## We receive a new message on our successful substream and somehow we are already polling the right stream so it matches directly
TRACE libp2p_mplex                                  > Received message: Data { substream_id: 7, endpoint: Listener, data: b"\x10/ipfs/kad/1.0.0\n" }
TRACE libp2p_mplex                                  > Filtering. elem=Data { substream_id: 7, endpoint: Listener, data: b"\x10/ipfs/kad/1.0.0\n" }, substream_num=7, endpoint=Dialer
TRACE libp2p_mplex                                  > Filter pass for Data { substream_id: 7, endpoint: Listener, data: b"\x10/ipfs/kad/1.0.0\n" }
TRACE libp2p_mplex                                  > Poll::Ready for substream 7
TRACE libp2p_mplex                                  > Poll::Ready for substream 7
## This stream becomes successful
DEBUG libp2p_core::upgrade::apply                   > 7a5e24c0-97f7-4fad-ac35-cc49d887834b Init ready
DEBUG libp2p_core::upgrade::apply                   > 7a5e24c0-97f7-4fad-ac35-cc49d887834b Successfully applied negotiated protocol
TRACE libp2p_swarm::protocols_handler::node_handler > Got Ready for n=0, id=7a5e24c0-97f7-4fad-ac35-cc49d887834b
TRACE libp2p_kad::handler                           > 8689112a-274f-4952-8d0d-51f58ccae855 inject_fully_negotiated_outbound. Number of substreams=1, ids=[None], msg=GetProviders { key: Key(b"/org/scala-sbt/sbt/1.3.7/sbt-1.3.7.pom.sha1") }
TRACE libp2p_kad::handler                           > 8689112a-274f-4952-8d0d-51f58ccae855 poll. substreams=1
TRACE libp2p_kad::handler                           > 8689112a-274f-4952-8d0d-51f58ccae855 advancing substream. state=OutPendingSend(GetProviders { key: Key(b"/org/scala-sbt/sbt/1.3.7/sbt-1.3.7.pom.sha1") }), stream_id=None
TRACE libp2p_kad::handler                           > 8689112a-274f-4952-8d0d-51f58ccae855 Advanced with new state and poll again. new_state=OutPendingFlush
TRACE libp2p_kad::handler                           > 8689112a-274f-4952-8d0d-51f58ccae855 advancing substream. state=OutPendingFlush, stream_id=None
TRACE libp2p_kad::handler                           > 8689112a-274f-4952-8d0d-51f58ccae855 Advanced with new state and poll again. new_state=OutWaitingAnswer
TRACE libp2p_kad::handler                           > 8689112a-274f-4952-8d0d-51f58ccae855 advancing substream. state=OutWaitingAnswer, stream_id=None
TRACE libp2p_kad::handler                           > 8689112a-274f-4952-8d0d-51f58ccae855 OutWaitingAnswer Pending
TRACE libp2p_kad::handler                           > 8689112a-274f-4952-8d0d-51f58ccae855 Advanced with new state and don't poll again. new_state=OutWaitingAnswer
TRACE libp2p_core::connection                       > Polling ConnectionHandler. io_pending=true, handler_negotiating_out=Some([])
TRACE libp2p_swarm::protocols_handler::node_handler > We have 0 negotiating_in
TRACE libp2p_swarm::protocols_handler::node_handler > We have 0 negotiating_out. outs=[]
TRACE libp2p_kad::handler                           > 8689112a-274f-4952-8d0d-51f58ccae855 poll. substreams=1
TRACE libp2p_kad::handler                           > 8689112a-274f-4952-8d0d-51f58ccae855 advancing substream. state=OutWaitingAnswer, stream_id=None
TRACE libp2p_kad::handler                           > 8689112a-274f-4952-8d0d-51f58ccae855 OutWaitingAnswer Pending
TRACE libp2p_kad::handler                           > 8689112a-274f-4952-8d0d-51f58ccae855 Advanced with new state and don't poll again. new_state=OutWaitingAnswer
TRACE libp2p_mplex                                  > Received message: Data { substream_id: 7, endpoint: Listener, data: b"p\x08\x03Bj\n&\0$\x08\x01\x12 X\xf4\xa1\x90g\xc4\x90\x19\x8e\xf7\xe6\x98\xcd\xd8pw\xdcQ\xe6 \xcan\xbe\xe9\x8aQ\xe7\xe5\xc4\x1c\tY\x12\x08\x04\x7f\0\0\x01\x06\x0f\xc9\x12\x08\x04\xc0\xa8\x01\x05\x06\x0f\xc9\x12\x14)\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\x01\x06\x0f\xc9\x12\x14)*\x01\x0e5/\xd2I 6H\x1d\x8fuP\xef\xe8\x06\x0f\xc9\x18\x01P\t" }
TRACE libp2p_core::connection                       > Polling ConnectionHandler. io_pending=true, handler_negotiating_out=Some([])
TRACE libp2p_swarm::protocols_handler::node_handler > We have 0 negotiating_in
TRACE libp2p_swarm::protocols_handler::node_handler > We have 0 negotiating_out. outs=[]
TRACE libp2p_kad::handler                           > 8689112a-274f-4952-8d0d-51f58ccae855 poll. substreams=1
TRACE libp2p_kad::handler                           > 8689112a-274f-4952-8d0d-51f58ccae855 advancing substream. state=OutWaitingAnswer, stream_id=None
TRACE libp2p_mplex                                  > Filtering. elem=Data { substream_id: 7, endpoint: Listener, data: b"p\x08\x03Bj\n&\0$\x08\x01\x12 X\xf4\xa1\x90g\xc4\x90\x19\x8e\xf7\xe6\x98\xcd\xd8pw\xdcQ\xe6 \xcan\xbe\xe9\x8aQ\xe7\xe5\xc4\x1c\tY\x12\x08\x04\x7f\0\0\x01\x06\x0f\xc9\x12\x08\x04\xc0\xa8\x01\x05\x06\x0f\xc9\x12\x14)\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\x01\x06\x0f\xc9\x12\x14)*\x01\x0e5/\xd2I 6H\x1d\x8fuP\xef\xe8\x06\x0f\xc9\x18\x01P\t" }, substream_num=7, endpoint=Dialer
TRACE libp2p_mplex                                  > Poll::Ready for substream 7
TRACE libp2p_kad::handler                           > 8689112a-274f-4952-8d0d-51f58ccae855 OutWaitingAnswer Ready: GetProviders { closer_peers: [KadPeer { node_id: PeerId("12D3KooWFocQ7BiCfXQV3h3mJwH7j2f2gzPwA7i1opz6uhRrJUuE"), multiaddrs: ["/ip4/127.0.0.1/tcp/4041", "/ip4/192.168.1.5/tcp/4041", "/ip6/::1/tcp/4041", "/ip6/2a01:e35:2fd2:4920:3648:1d8f:7550:efe8/tcp/4041"], connection_ty: Connected }], provider_peers: [] }
TRACE libp2p_kad::handler                           > 8689112a-274f-4952-8d0d-51f58ccae855 Advanced with new_state. new_state=OutClosing
TRACE libp2p_core::connection                       > Polling ConnectionHandler. io_pending=true, handler_negotiating_out=Some([])
TRACE libp2p_swarm::protocols_handler::node_handler > We have 0 negotiating_in
TRACE libp2p_swarm::protocols_handler::node_handler > We have 0 negotiating_out. outs=[]
TRACE libp2p_kad::handler                           > 8689112a-274f-4952-8d0d-51f58ccae855 poll. substreams=1
TRACE libp2p_kad::handler                           > 8689112a-274f-4952-8d0d-51f58ccae855 advancing substream. state=OutClosing, stream_id=None
INFO  p2p_deps                                      > polled swarm(): Pending
## Timeout gets hit. Yes the swarm getting polled and returning Pending is the last log until the timeout
TRACE libp2p_core::connection                       > Polling ConnectionHandler. io_pending=true, handler_negotiating_out=Some([c41a702c-c3bd-4571-8196-90edfb023f7d])
TRACE libp2p_swarm::protocols_handler::node_handler > We have 0 negotiating_in
TRACE libp2p_swarm::protocols_handler::node_handler > We have 1 negotiating_out. outs=[c41a702c-c3bd-4571-8196-90edfb023f7d]
TRACE libp2p_swarm::protocols_handler::node_handler > Poll timeout for n=0, id=c41a702c-c3bd-4571-8196-90edfb023f7d
ERROR libp2p_swarm::protocols_handler::node_handler > ProcolsHandlerUpgrErr::Timeout. Handler id: c41a702c-c3bd-4571-8196-90edfb023f7d

And from here, I'm not sure where to look or what to do :)

Again, thanks for your time on this!

@romanb
Contributor

romanb commented Jul 1, 2020

Unfortunately I can't share a reproducer as I'm not sure how to reproduce exactly.

It really doesn't have to be minimal or reliably reproduce the problem all the time, but if we had a way to run your scenario ourselves with some reasonable probability of reproducing the problem, I think that would help us greatly. In any case, thanks for your detailed write-ups.

@romanb
Contributor

romanb commented Jul 1, 2020

I am also wondering if there is some relation to #1504 which, albeit closed because the panic was resolved, also seemed to be primarily about problems with provider lookups, as this excerpt shows:

I have instrumented mplex a little bit and it seems that finding providers generates several new substreams (5-6 on my network) but only one of them will be closed.
[..]
All of this looks like the root case in the stream management on the libp2p-kad side, but after very quick analysis couldn't find the cause.

I can try to take a close look at the substream management in libp2p-kad in general and see if anything is suspicious (or rewrite it; there is a related ticket in #1104), but it would still be great to first reproduce the problem(s).

@twittner
Contributor

twittner commented Jul 2, 2020

I have some other issues with yamux which made me use mplex.

Once the issue here is resolved I would love to hear about the problems you had with yamux. Maybe you could open an issue at https://github.com/paritytech/yamux or write me an e-mail?

BlackYoup added a commit to BlackYoup/libp2p-handler-timeout-reproducer that referenced this issue Jul 2, 2020
@BlackYoup
Contributor Author

Hello,

So I've come up with a reproducer. Since I knew more about the issue, I knew what to do to trigger it again. You can find the repository here: https://github.com/BlackYoup/libp2p-handler-timeout-reproducer

It's still a handler upgrade timeout, but during multiple PUTs of providers into the DHT, not during GET PROVIDERS requests. The PUTs follow the same pattern, one at a time.

I unexpectedly ran into it when iterating on the reproducer and, conversely, couldn't reproduce my issue when using GET PROVIDERS requests. Anyway, hopefully the same thing happens.

The README contains most of what's needed to reproduce it and how I start it on my machine, so ask me any questions about it if needed. I'm using Rust 1.44.1.

Once the issue here is resolved I would love to hear about the problems you had with yamux. Maybe you could open an issue at https://github.com/paritytech/yamux or write me an e-mail?

Sure. It's on my todo list to report this issue. I hit it 2-3 months ago and my knowledge was more limited then, so I would have to retry. If I remember correctly, it was about the fact that I wanted to send big chunks over the network to another peer and yamux would not handle it correctly, leading to data not being sent. When I switched to mplex, I used

let mut mplex = MplexConfig::default();
mplex.max_buffer_len_behaviour(MaxBufferBehaviour::Block);

This solved my problem. I believe it is the same issue as explained in #1257 (comment), which is what made me move to mplex.

@twittner
Contributor

twittner commented Jul 2, 2020

If I remember correctly, it was about the fact that I wanted to send big chunks over the network to another peer and yamux would not handle it correctly, leading to data not being sent.

I do not want to hijack this issue thread so it would be best to discuss this elsewhere but I would be keen to know what exactly was not handled correctly.

@romanb
Contributor

romanb commented Jul 6, 2020

So I've come up with a reproducer. [..]

I'm sorry to say that so far I've been unable to reproduce a problem. All 1500 provider records are always successfully stored when I run the provided code, i.e. it always counts down to remaining=0 and I see all queries succeeding. I will keep playing with it though, let me know if you have an idea on what else could help trigger the problem.

@twittner
Contributor

twittner commented Jul 13, 2020

I could reproduce the problem using https://github.com/BlackYoup/libp2p-handler-timeout-reproducer. It looks like a missing task wakeup to me. Usually a stalling output resumes later, e.g. because identify kicks in:

2020-07-13T10:06:30.280Z DEBUG libp2p_reproducer > PUT provider eV3sBWzR3OzmAyOPktMh6sMHZz2aRx. remaining=259
2020-07-13T10:06:30.280Z DEBUG libp2p_mplex      > Successfully opened outbound substream 2479
2020-07-13T10:06:30.280Z INFO  libp2p_reproducer > polled swarm(): Pending
2020-07-13T10:06:30.280Z TRACE libp2p_mplex      > Received message: Data { substream_id: 2479, endpoint: Listener, data: b"\x13/multistream/1.0.0\n" }
2020-07-13T10:06:30.280Z TRACE libp2p_mplex      > Received message: Data { substream_id: 2479, endpoint: Listener, data: b"\x10/ipfs/kad/1.0.0\n" }
2020-07-13T10:06:30.280Z DEBUG libp2p_mplex      > Successfully opened outbound substream 2480
2020-07-13T10:06:30.280Z TRACE libp2p_mplex      > Received message: Data { substream_id: 2480, endpoint: Listener, data: b"\x13/multistream/1.0.0\n" }
2020-07-13T10:06:30.280Z TRACE libp2p_mplex      > Received message: Data { substream_id: 2480, endpoint: Listener, data: b"\x10/ipfs/kad/1.0.0\n" }
2020-07-13T10:11:15.148Z DEBUG libp2p_mplex      > Successfully opened outbound substream 2481
2020-07-13T10:11:15.148Z TRACE libp2p_mplex      > Received message: Data { substream_id: 2481, endpoint: Listener, data: b"\x13/multistream/1.0.0\n" }
2020-07-13T10:11:15.148Z TRACE libp2p_mplex      > Received message: Data { substream_id: 2480, endpoint: Listener, data: b"\x04\x08\x04P\t" }
2020-07-13T10:11:15.148Z TRACE libp2p_mplex      > Received message: Data { substream_id: 2481, endpoint: Listener, data: b"\x0f/ipfs/id/1.0.0\n" }
2020-07-13T10:11:15.148Z TRACE libp2p_mplex      > Received message: Data { substream_id: 2481, endpoint: Listener, data: b"\xbb\x03" }
2020-07-13T10:11:15.148Z TRACE libp2p_mplex      > Received message: Data { substream_id: 2481, endpoint: Listener, data: b"\n\xab\x02\x08\0\x12\xa6\x020\x82\x01\"0\r\x06\t*\x86H\x86\xf7\r\x01\x01\x01\x05\0\x03\x82\x01\x0f\00\x82\x01\n\x02\x82\x01\x01\0\xe9\x0e\xdf\xf0V[\xd1LF\xbc*\xda\x8b\xe4\x95\x8c\xd5\x81\xbe\x96\xcf\x16~\xb1\x83d\x82\x82;\xd4g\xf7\xd7<\x1bm\x8a\"4\xadz\x90\xe6\xe4\xb2\x84&\ng\x0bha@9\xcf\xbc/\xaf\x02\xc5&\xa5\xdf\x9eE]i}\x9e=wm\xc3UE\xfb\xa1\xa8\x02\xadOn\x80\xab\x9d;\r\xdc?\xc7#\xc3\x91\xb8\xf35R\x8f\x1a\n\x1c\x88ih\xadK\x85\xde-\xaf\xe1\x81\t\xb9\xc8\xc1\x1d\xd4\xd2\xb4\xb4\x12 \xb5\xc5\x80\xdb\xe4\xd2\xde\x8a\xa0\xc4\xf5^]\xf6\xf8\xf1\xf1\xbaR\xe0&\x01\xcd'B\x18\xbc\xfa\xa0\xd0\xd8@c>-a\xca\xa7W\xac<]\xce\xbb\x8d\xf5\x85\x0b\x86H\xa5i\xd2\xad\x9eM\xa9\x8eH\xf7\xa5U\xff\x89\x93\x1dDM\xe0\xc2\xfe\x8a@\x83\x03\xb43\x91k\xf1\x83_C\x16\xbcHD\x16\xfeX\x9b\xb1{\x05l\xc1\xcb\xb6\xfa-%\x91\x83\x05o\x82\xbc\xa3nb|\xd9\xbb\xef\xde\xb4X~H\x94\x1dV8\xa4\x14\xa7\xf3\x9f\xc0\x18Q=?\x02\x03\x01\0\x01\x12\x08\x04\x7f\0\0\x01\x06\x0f\xc8\x12\x08\x04\x7f\0\0\x01\x06\x0f\xc8\x12\x08\x04\xc0\xa8\xb2,\x06\x0f\xc8\x12\x14)\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\x01\x06\x0f\xc8\x1a\x0f/ipfs/kad/1.0.0\x1a\x0e/ipfs/id/1.0.0\"\x08\x04\x7f\0\0\x01\x06\xb0\xea*\x0f/identify/1.0.02\x1b/clevercloud-identify/1.0.0" }
2020-07-13T10:11:15.148Z TRACE libp2p_mplex      > Received message: Close { substream_id: 2481, endpoint: Listener }
2020-07-13T10:11:15.148Z TRACE libp2p_mplex      > Received message: Open { substream_id: 1 }
2020-07-13T10:11:15.148Z DEBUG libp2p_mplex      > Successfully opened inbound substream 1
2020-07-13T10:11:15.148Z TRACE libp2p_mplex      > Received message: Data { substream_id: 1, endpoint: Dialer, data: b"\x13/multistream/1.0.0\n\x0f/ipfs/id/1.0.0\n" }
2020-07-13T10:11:15.149Z DEBUG libp2p_mplex      > Successfully opened outbound substream 2482
2020-07-13T10:11:15.149Z INFO  libp2p_reproducer > polled swarm(): Pending
2020-07-13T10:11:15.149Z DEBUG libp2p_reproducer > PUT provider rXWV9oxXGpNZO7zi3HuXIDeFThJYSw. remaining=258

Sometimes, there are shorter pauses:

2020-07-13T10:16:43.058Z INFO  libp2p_reproducer > polled swarm(): Pending
2020-07-13T10:16:43.058Z DEBUG libp2p_reproducer > PUT provider ZcLLBbOXa9o2PMlZDj7iLH5IxXcREz. remaining=753
2020-07-13T10:16:43.058Z DEBUG libp2p_mplex      > Successfully opened outbound substream 1492
2020-07-13T10:16:43.058Z INFO  libp2p_reproducer > polled swarm(): Pending
2020-07-13T10:16:43.058Z DEBUG libp2p_mplex      > Successfully opened outbound substream 1493
2020-07-13T10:16:53.058Z INFO  libp2p_reproducer > polled swarm(): Pending
2020-07-13T10:16:53.058Z DEBUG libp2p_reproducer > PUT provider YHtUVuC4maeBoh0y8xdOpgtnbXUlca. remaining=752
2020-07-13T10:16:53.058Z INFO  libp2p_reproducer > polled swarm(): Pending
2020-07-13T10:16:53.058Z DEBUG libp2p_mplex      > Successfully opened outbound substream 1494
2020-07-13T10:16:53.059Z TRACE libp2p_mplex      > Received message: Data { substream_id: 1492, endpoint: Listener, data: b"\x13/multistream/1.0.0\n" }
2020-07-13T10:16:53.059Z TRACE libp2p_mplex      > Received message: Data { substream_id: 1492, endpoint: Listener, data: b"\x10/ipfs/kad/1.0.0\n" }
2020-07-13T10:16:53.059Z TRACE libp2p_mplex      > Received message: Data { substream_id: 1493, endpoint: Listener, data: b"\x13/multistream/1.0.0\n" }
2020-07-13T10:16:53.059Z TRACE libp2p_mplex      > Received message: Data { substream_id: 1493, endpoint: Listener, data: b"\x10/ipfs/kad/1.0.0\n" }
2020-07-13T10:16:53.059Z TRACE libp2p_mplex      > Received message: Data { substream_id: 1494, endpoint: Listener, data: b"\x13/multistream/1.0.0\n" }
2020-07-13T10:16:53.059Z TRACE libp2p_mplex      > Received message: Data { substream_id: 1494, endpoint: Listener, data: b"\x10/ipfs/kad/1.0.0\n" }
2020-07-13T10:16:53.059Z TRACE libp2p_mplex      > Received message: Data { substream_id: 1494, endpoint: Listener, data: b"\x04\x08\x04P\t" }
2020-07-13T10:16:53.059Z INFO  libp2p_reproducer > polled swarm(): Pending

It may be related to mplex, but I am not sure. Maybe it just has different timing characteristics that trigger the bug. Anyway, I tried the test with yamux and could no longer reproduce it. @BlackYoup, could you try with yamux and see if the problem persists for you? If not, we may focus on mplex; otherwise we may need to look elsewhere.

Edit: I should add that I was using libp2p 0.21.1 from crates.io and async-std 1.6.2
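
To illustrate what I mean by a missing task wakeup, here is a generic sketch (not mplex's actual code) of the contract that has to hold when a poll returns Pending:

use std::task::{Context, Poll, Waker};

// If polling substream A returns Pending because the frame that arrived
// belongs to substream B, A's task must be woken once a frame for A is
// routed to it. Forgetting the wake() produces exactly this kind of stall
// until some unrelated event (e.g. identify) polls the connection again.
struct SubstreamState {
    buffered: Vec<Vec<u8>>,
    waker: Option<Waker>,
}

impl SubstreamState {
    fn poll_frame(&mut self, cx: &mut Context<'_>) -> Poll<Vec<u8>> {
        if let Some(frame) = self.buffered.pop() {
            return Poll::Ready(frame);
        }
        // Remember whom to wake before returning Pending.
        self.waker = Some(cx.waker().clone());
        Poll::Pending
    }

    fn deliver(&mut self, frame: Vec<u8>) {
        self.buffered.push(frame);
        if let Some(waker) = self.waker.take() {
            waker.wake();
        }
    }
}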

@BlackYoup
Contributor Author

Hello!

I was on holiday last week, so I couldn't update this thread.

I'm sorry to say that so far I've been unable to reproduce a problem.

Oh, alright then. Did you use my fork of libp2p or did you switch to the official repository? If you did not use my fork, then it's expected that all the records are correctly created, because no errors are emitted: the pauses still happen but the process normally resumes after them. With my fork, a panic occurs which points directly at the issue. Other than that, I don't have many ideas. Maybe it comes down to processor speed, I don't really know.

Sometimes, there are shorter pauses:

Yeah, I've seen the longer pauses too. On my side, I mostly hit the 10-second pause. The longer ones happened here and there but were very rare, so rare that I forgot about them, to be honest. And if I recall correctly, I did not have them with this reproducer.

Sure, I'll try with yamux and I'll keep you posted. I'll try to find time today or at some point during the week.

@BlackYoup
Contributor Author

Hello everyone,

I tested my reproducer with yamux instead of mplex. It doesn't seem to trigger the bug in the ~10 tests I ran. I quickly tried it on my internal project but unfortunately I'm having another issue, probably due to how I send messages between peers. I'm not sure if it is due to yamux or my debug code, but I won't have much time to dig into it over the next few days. Sorry about that.

@romanb
Contributor

romanb commented Sep 28, 2020

Since #1769 addressed some problems with mplex, it would be great if you could check whether the issues you ran into with mplex still occur with your test setup when using rust-libp2p/master. Since I personally wasn't very successful at reproducing these issues earlier, I'm not giving much weight to the fact that it "works for me", so your feedback would be very valuable.

@BlackYoup
Contributor Author

@romanb It seems like I'm still hitting the 10-second timeout with my reproducer. I used commit f66f40b and I hit it every time I tried (5 tests).

Another weird thing is that, on my reproducer, with both yamux and mplex on the current master, the swarm sometimes doesn't work properly. When I start my two instances, both peers seem to connect to each other fine, but some things are off:

  • The Identify network behaviour doesn't trigger. By that I mean I don't see the events sent/received in the logs
  • If I start to "populate" the providers, the first put_providers() call never triggers a QueryResult::StartProviding event

Those weird issues do not appear at all on v0.28.1.

I pushed this branch with my up-to-date setup (completely deleted Cargo.lock and target folder): https://github.com/BlackYoup/libp2p-handler-timeout-reproducer/tree/mplex-fix

@romanb
Contributor

romanb commented Sep 28, 2020

Thanks for the feedback, I will take another look at your setup.

@romanb
Contributor

romanb commented Sep 30, 2020

So the good news is that I think I found the reasons for the sporadic upgrade timeouts (see #1781). Regarding the other two new issues you describe, I'm still looking into them, but I've been confused so far by the fact that there was a period where I could trigger them rather easily, but only with async-std, not with tokio, and periods where I cannot reproduce them at all. During the time I could reproduce them, it looked like there was a Pending poll_read at the TCP transport level, which resulted in the still-"pending" connection background task being Pending as well, waiting for a multistream-select message without being woken up again, despite the message visibly being delivered at the TCP level via tcpdump/Wireshark. I will continue to look into that. Thanks again for your test setup.

@romanb
Contributor

romanb commented Oct 26, 2020

My current status here is that I can no longer reproduce any of the issues mentioned with libp2p-0.29. The patch for the test code for updating libp2p can be found here. Whenever you find the time, it would be great to get brief feedback again w.r.t. which problems you are still observing with the latest libp2p release 0.29, just so I know if there are still issues to be solved even if I may not be able to reproduce them. Thanks in advance!

@romanb
Contributor

romanb commented Nov 10, 2020

Another weird thing is that, on my reproducer, both using yamux and mplex on the current master, the swarm is sometimes not properly working. When I start my two instances, both peers seem to connect fine to each other but some things are off: [..]

I've probably stumbled upon the cause for these new early connection negotiation stalls, which seem related to the plaintext protocol. See #1831. I hope that this was the last problem to resolve here 🙏.

@romanb
Contributor

romanb commented Nov 13, 2020

I think all issues here have been resolved. Feel free to re-open if you still encounter the problems discussed here so far or open new issues. Thanks again for your detailed reports and sample code.

@romanb romanb closed this as completed Nov 13, 2020