Gossipsub interop issue with go-ipfs #1671
Comments
@AgeManning would really appreciate your help. I can get detailed info about this scenario.
We are doing this too, subscribing on startup. Interested to hear the result of this thread.
If this issue on GitHub is correct libp2p/rust-libp2p#1671 (comment) then we should have an open connection to another node on the network before subscribing. Add a method to the orderbook and all the layers above it to facilitate subscribing. Call it after a successful dial is made.
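A rough sketch of the "subscribe only after a successful dial" idea, in case it helps. It uses only the standard library; `DeferredSubscriptions`, `request` and `on_connected` are hypothetical names made up for illustration and are not part of rust-libp2p. The caller would pass the returned topic names to the real gossipsub subscribe call.

```rust
use std::collections::BTreeSet;

/// Hypothetical helper (not a rust-libp2p type): holds back topic
/// subscriptions until at least one peer connection exists.
#[derive(Debug, Default)]
pub struct DeferredSubscriptions {
    connected: bool,
    pending: BTreeSet<String>,
}

impl DeferredSubscriptions {
    /// Request a subscription. Returns `Some(topic)` if the caller may
    /// subscribe right away, or `None` if the topic was queued.
    pub fn request(&mut self, topic: &str) -> Option<String> {
        if self.connected {
            Some(topic.to_string())
        } else {
            self.pending.insert(topic.to_string());
            None
        }
    }

    /// Call after a successful dial. Returns the queued topics (sorted,
    /// deduplicated) that should now be subscribed, and empties the queue.
    pub fn on_connected(&mut self) -> Vec<String> {
        self.connected = true;
        std::mem::take(&mut self.pending).into_iter().collect()
    }
}
```

This only moves the delay to a well-defined event instead of a fixed timer; it does not explain why subscribing before the connection fails.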
I have implemented a debug instance for the Gossipsub behaviour (https://github.com/Actyx/rust-libp2p/tree/rkl/gossipsub-debug-instances), and discovered a few weird things. Here is the state of the gossipsub in the case where I added a delay fudge to make things work:
Never mind the topic names. What I found strange is that the mesh for the topic
I definitely think the duplicate peer ids in the mesh are wrong and will cause trouble... See #1674
Hey @rklaehn - Thanks for raising this. Sorry I've not been more vocal, there are a few known bugs in gossipsub. I'm trying to resolve them all in #1583. Let me enumerate known issues as some may apply to you:
I've not seen the duplicate peer ids in the mesh before. I'll look into it and try to add the fix to #1583. #1583 is almost complete (however not usable in its current form). For it to be completed we need to add some configurable signing options, such that it can match other implementations. Specifically around whether to sign or not sign and validate or not validate signed messages.
To test, you could try adding this and seeing if it helps your subscription issue: 7d678f5
I tried the diff in 7d678f5 , and it does not seem to help. The only thing that helps is that cursed fudge delay after connect. However, I have been banging my head against this all day. I will try again with a clear mind tomorrow. Regarding the duplicate peer ids: the mesh for a topic is a set of peers, so what is the reason to use Here is a draft PR that changes all Vec sets to proper (BTree)Sets: |
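To illustrate why a set type rules out the duplicate peer ids, here is a minimal standard-library sketch. `PeerId` below is a hypothetical stand-in for the real libp2p type, and `Mesh` is an illustrative structure, not the actual gossipsub field layout.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// Hypothetical stand-in for libp2p's `PeerId`; only ordering and
/// equality matter for this sketch.
#[derive(Debug, Clone, PartialEq, Eq, PartialOrd, Ord)]
pub struct PeerId(pub String);

/// Topic name -> set of mesh peers. With a set, a duplicate entry
/// cannot be represented at all.
#[derive(Debug, Default)]
pub struct Mesh {
    peers: BTreeMap<String, BTreeSet<PeerId>>,
}

impl Mesh {
    /// Adds `peer` to the mesh for `topic`.
    /// Returns `false` if the peer was already present.
    pub fn add_peer(&mut self, topic: &str, peer: PeerId) -> bool {
        self.peers.entry(topic.to_string()).or_default().insert(peer)
    }

    /// Number of distinct peers in the mesh for `topic`.
    pub fn peer_count(&self, topic: &str) -> usize {
        self.peers.get(topic).map_or(0, |set| set.len())
    }
}
```

With a `Vec`, every insertion site has to remember to check for an existing entry; with a `BTreeSet` the second `add_peer` for the same peer is simply a no-op that reports `false`.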
Ok, I'm curious about your scenario. I have applications which subscribe on startup and work as expected. Could you collect some gossipsub logs for the scenario you are describing? Perhaps your peer is not being added to the mesh and the gossip is not working as expected. On a new peer connection you should share your subscription with the newly connected peers. If you can paste some gossipsub logs of the node starting up and having the new node connect, I might be able to see something.
Yes, me too. This does not happen all the time. In fact I did not see this before. But it is reproducible with this particular node.
You mean from my side (just dumping the traffic with some println! statements) or from the go-ipfs side? Not sure how to get the latter. I will try to get you logs asap.
Here is some traffic, just dumping msgs as they go in or out: No initial delay between connect and subscribe - not working
Here is the exact same code as above, but with a delay between the connection to the peer and the pubsub subscribe. As you can see, now we do get data:
|
This is a manually instrumented debug branch, but it does include the changes you suggested in 7d678f5. Here is the branch: https://github.com/Actyx/rust-libp2p/tree/rkl/gossip-debug
I have run this with the message signing branch. Same behaviour. Works with the delay, but not without.
FYI I have made a PR with a few minor improvements against your message signing branch: sigp#32
Hey awesome. Thanks for the help. And use the environment variable It should output the gossipsub logs, which might help us diagnose this. |
Got another PR, which you may or may not want to merge. I hope you don't mind: sigp#33 Will get you the tracing logs. There is something wrong with the tracing setup in my app though, so this might take a bit longer.
OK, here are the tracing logs with both delay and no delay. https://gist.github.com/rklaehn/db7b167987a37a3898c07adedba4a3e4 There is some application data mixed into the "works" case, but just ignore that. Let me know if you need more. One thing that is already interesting is that apparently some proto msgs from go-ipfs cannot be parsed.
BTW this is against the gossipsub-signing branch, which seems to be your current dev branch.
OK, now this is weird. I looked at the proto msgs in detail. E.g.
I am getting message ids in the control msgs which are not utf8 strings, causing the parsing of the entire msg to fail...
I changed MessageId to be an arbitrary |
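For anyone following along, a small standard-library sketch of the failure mode: a message id decoded as a UTF-8 string rejects arbitrary bytes, while an opaque byte id accepts them. The `MessageId` struct and the two helper functions here are illustrative assumptions, not the actual rust-libp2p definitions or the code from the workaround commit.

```rust
/// Message id modelled as opaque bytes. Assumption for illustration:
/// this mirrors a protobuf `bytes` field rather than a `string` field.
#[derive(Debug, Clone, PartialEq, Eq, Hash)]
pub struct MessageId(pub Vec<u8>);

/// Strict decoding, as a `string` field requires: fails on any byte
/// sequence that is not valid UTF-8.
pub fn decode_as_string(raw: &[u8]) -> Result<String, std::string::FromUtf8Error> {
    String::from_utf8(raw.to_vec())
}

/// Lenient decoding: every byte sequence is a valid id.
pub fn decode_as_bytes(raw: &[u8]) -> MessageId {
    MessageId(raw.to_vec())
}
```

If one id in a control message fails the strict path, the whole enclosing message is rejected, which would match the symptom of control msgs from go-ipfs failing to parse.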
Issue in go-libp2p-gossipsub: libp2p/go-libp2p-pubsub#361 Workaround: Actyx@536ce88 (requires changing the .proto, which we probably don't want to do)
Ok, awesome. Thanks for tracking this down. Seems like we should relax the string condition in the protobuf and generalise the message-id. I'll update this and merge your commit. Thanks again. If you're happy with this solution, I'll make these changes and finish off #1583, and this should resolve this as well as #1674
Yep, both issues should be fixed. I might have a few other small things, now that I am a bit more familiar with the code base. The PR to the specs has been merged, so messageid is now officially bytes... libp2p/specs@c5b328a
This should be resolved in #1583
I have the following scenario: I create a libp2p based program using gossipsub and connect to a go-ipfs peer.
When I subscribe to a topic immediately at startup, sometimes I don't ever get any messages for that topic despite the remote nodes clearly publishing on that topic.
When adding a small delay (tokio::time::delay_for(Duration::from_secs(2)).await) to allow for some time to pass before subscribing, everything works as expected.
I tried to work around this issue by regularly unsubscribing and subscribing to all topics that somebody is interested in at the moment, but that did not work either.
Just calling subscribe again will not work, since subscribe will abort early with a false return value, indicating that the topic is already subscribed.
Calling unsubscribe immediately followed by subscribe returned true for both calls, but after the first unsubscribe/subscribe I did not get any pubsub msgs anymore.
Am I using the API correctly? It seems that I should be able to call subscribe once for a topic and then rely on the topic being subscribed despite not yet having any peers...