Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

FloodSub is not working as of b8a312f7d512c25f86bc16989cb588900483946d #843

Closed
echupriyanov opened this issue Jan 10, 2019 · 18 comments
Closed
Labels
bug difficulty:moderate priority:important The changes needed are critical for libp2p, or are blocking another project
Milestone

Comments

@echupriyanov
Copy link
Contributor

At present master (commit b8a312f)
FloodSub doesn't seem to work properly.

When I run provided chat example, I don't see messages being delivered between nodes.
Running example with debug enabled, I see, that dialed TCP connection is dropped immediately after successful negotiation.

Changing FloodsubHandler::connection_keep_alive() to always return true resolves this issue.

My guess is that this happens because:

  1. libp2p closes idle (useless?) TCP connections
  2. floodsub closes the substream after successful send.

So, we end up with no TCP connection between nodes.

@tomaka
Copy link
Member

tomaka commented Jan 10, 2019

Oh, I didn't think of that in #816
This is also strongly related to #713

@tomaka tomaka added the priority:important The changes needed are critical for libp2p, or are blocking another project label Jan 10, 2019
@tomaka
Copy link
Member

tomaka commented Jan 11, 2019

One possible fix that would make sense to me is to have connection_keep_alive() always return true if the remote is subscribed to a topic that we are also subscribed to.

@echupriyanov
Copy link
Contributor Author

But do we have such information on ProtocolHandler level?

@tomaka
Copy link
Member

tomaka commented Jan 11, 2019

We can have the NetworkBehaviour indicate this to the protocol handler.

@echupriyanov
Copy link
Contributor Author

Maybe it makes sense to move connection_keep_alive() to the NetworkBehaviour? It seems to me more logical

@tomaka
Copy link
Member

tomaka commented Jan 11, 2019

At the lower level each connection to a remote gets its own tokio task, supposedly asynchronous from the network behaviours.

If we put connection_keep_alive() in the NetworkBehaviour, then we need additional roundtrips between the behaviour and the protocol handler. Performance-wise, this is the same as if we do this manually specifically for floodsub.

@echupriyanov
Copy link
Contributor Author

I see. In that case, I'd suggest to add something like keep_alive flag to the FloodsubHandler, which is initialized to true on start, and can be turned off by NetworkBehaviour with SendEvent action

@drozdziak1
Copy link

@tomaka Couldn't this perhaps be implemented at the TcpConfig level? I mean, all TCP connections need to be kept alive no matter the layer above, is there a use case where someone might want the TCP transport to just destroy one?

@tomaka
Copy link
Member

tomaka commented Jan 15, 2019

@drozdziak1 The issues right now is that we purposefully destroy TCP connections when they become unused. We don't want to stay connected to everyone forever, especially if you are in a network that is made of thousands of nodes.

@drozdziak1
Copy link

Yeah, I mean having the TCP transport implement some form of smarter connection GC? Like a "last_used" timestamp with a configurable time window, like 15 minutes? Or a connection cap where you only keep up to N connections alive at a time and destroy the one with oldest "last_used" when you want to open another (N+1)th? Perhaps we could have some different modes for this, and maybe let the user specify a filter closure that could sort connections by a priority metric of their choosing?

@tomaka
Copy link
Member

tomaka commented Jan 15, 2019

The "last used" thing is already implemented, except that it's configured to 5 seconds right now.

@drozdziak1
Copy link

So when I run the chat example with RUST_LOG=debug, the connection appears to be broken up regardless of activity. Is there a way I could try and maybe bump that timeout upon Floodsub sending something over?

My general impression is that this is hard to fix mainly because of the mental overhead of the architecture. Is there any refactor options you find worth considering, which could help this issue?

@echupriyanov
Copy link
Contributor Author

@drozdziak1
For a time being, until better solution is designed, I'm using fork of libp2p where FloodsubHandler::connection_keep_alive() always returns true.

Here s the fork - https://github.com/stegos/rust-libp2p/tree/stegos

@drozdziak1
Copy link

Thanks @echupriyanov! I'm more looking to help fix this properly for everyone else though.

@tomaka
Copy link
Member

tomaka commented Jan 16, 2019

I think that a proper fix goes in pair with #854.
The reason why the chat example works at all is because Mdns asks the swarm to connect to the nodes it discovers. The fact that mDNS automatically connects is debatable by itself, and should definitely not be the backbone of the example.

@drozdziak1
Copy link

@tomaka Thanks! I'll try to understand that when I find the time.

@tomaka
Copy link
Member

tomaka commented Mar 22, 2019

Should work fine now.

@tomaka tomaka closed this as completed Mar 22, 2019
@cbarraford
Copy link

@tomaka what is the solution to this problem?? The code has changed quite a bit since stegos's fork. Thanks in advance!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug difficulty:moderate priority:important The changes needed are critical for libp2p, or are blocking another project
Projects
None yet
Development

No branches or pull requests

4 participants