-
Notifications
You must be signed in to change notification settings - Fork 276
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
relay/DCUtR: Add Direct Connection Upgrade through Relay protocol #173
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A really good start, @vyzo! Happy to sit on the Interest Group for this.
relay/DCUtR.md
Outdated
obtained from the `Connect` message. | ||
- Upon expiry of the timer, `B` starts a direct dial to `A` using the addresses obtained | ||
from the `Connect` message. | ||
6. If the connection is successful, then it is prioritized over the relay connection, which |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We need to cover the stream migration procedure in this spec.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We don't have any yet...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, it needs to be specified, otherwise this whole thing is incomplete. We can incubate it in this spec and then spin it off.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We don't necessarily need any stream migration for the protocol to work.
We can simply open all new streams in the direct connection and garbage collect the relay connection when it no longer has any streams.
Or we can just close it after a grace period and force new streams to be created in the direct connection.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can simply open all new streams in the direct connection
This behaviour needs to be specified. Failing to specify the overall choreography makes this spec unactionable. "We now established a direct connection, now what?"
Maybe it's a naming thing. "Upgrade" implies the existing connection will evolve. If all we're intending to cover is the signalling and synchronisation, then this spec should be named accordingly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Referencing stream migration protocol here #328.
The protocol starts with the completion of a relay connection from `A` | ||
to `B`. Upon observing the new connection, the inbound peer (here `B`) | ||
checks the addresses advertised by `A` via identify. If that set | ||
includes public addresses, then `A` _may_ be reachable by a direct |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Isn't it possible that A
may also be directly reachable at a private address if A
and B
are on the same local network?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes it is possible, but that would have been dialed directly as the private addresses are still advertised with relay addresses.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think @albrow has a point. @vyzo: while that should be the case, if we want to be resilient and robust, this protocol should not make assumptions about how any other part of the system behaves. Usually those implicit assumptions make systems brittle.
Luckily our spec lifecycle process allows us to add this topic as an active discussion:
To facilitate open progress tracking and observability, as the
Working Draft
evolves, the author(s) SHOULD assemble a checklist of items that are pending
specification, explicitly stating which items are compulsory for promoting the
spec to aCandidate Recommendation
.
from: https://github.com/libp2p/specs/blob/master/00-framework-01-spec-lifecycle.md
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not making this assumption will make us dial private addresses in vain multiple times.
We already have a problem with that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
At best, we can consider dialing them in the bidirectional part of the protocol.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, if A is public and B is private, we can't possibly be behind the same NAT.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Furthermore, for the bidirectional part of the protocol we could check the public address of the other node. If that doesn't match our own, we can't possibly be behind the same NAT and dialing private addrs is pointless.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would be nice to avoid dialing private addrs if we can avoid it though. Perhaps we could still exchange them, but in a separate field. Then they can be ignored unless your public address matches the other node and you infer that you're behind the same NAT. Or your implementation may be able to always ignore them, since they would have been dialed previously.
Anyway, I agree that we could punt on this for this round and discuss when we promote to candidate rec.
@albrow and the 0x guys pointed us to the Trickle ICE spec, which seems like a relevant background read for us. I do expect our specs to be influenced by ICE – as a real-life, successful technology for coordinating hole punching between any two peers. We should aim to reference ICE WG material to back up our ideas and routines. |
@raulk there is a reference to the ICE RFC already. |
@vyzo great, I missed that. What are the parallels between the algo we propose and the ICE procedure? |
It's like ICE without a signalling server, and distributed STUN - we rely on public peers to tell us our observed addresses instead of using STUN servers. |
I think the Trickle ICE spec @raulk linked to is an iteration on ICE that incrementally exchanges candidates instead of sending them all at once. Apparently this lets you start testing connectivity sooner |
related previous discussion - #64 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am proposing to merge this specification in its current state as a Working Draft.
Reviews welcome!
Below I am highlighting the two most notable recent changes.
If the unilateral connection upgrade attempt fails or if `A` is itself a NATed | ||
peer that doesn't advertise public address, then `B` initiates the direct | ||
connection upgrade protocol as follows: | ||
1. `B` opens a stream to `A` using the `/libp2p/dcutr` protocol. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note the protocol name /libp2p/dcutr
.
relay/DCUtR.md
Outdated
6. On failure go back to step (2), reusing the same stream opened in (1). | ||
Inbound peers (here `B`) SHOULD retry twice (thus a total of 3 attempts) | ||
before considering the upgrade as failed. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note the added retry logic. Also see FAQ further below for additional reasoning.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That change makes a lot of sense to me.
relay/DCUtR.md
Outdated
relies on the two peers synchronizing and simultaneously opening | ||
connections to each other to their predicted external address. It | ||
works well for UDP, with an estimated 80% success rate, and reasonably | ||
well for TCP, with an estimated 60% success rate. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Didn't we see much better numbers than this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think they have been in the same ballpark, but I might as well be mistaken. Unfortunately I am unable to access the data from project flare phase 1. Either the data or my access seems to be removed.
@vyzo do you know more here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Discussion continued on #173 (comment).
relay/DCUtR.md
Outdated
6. On failure go back to step (2), reusing the same stream opened in (1). | ||
Inbound peers (here `B`) SHOULD retry twice (thus a total of 3 attempts) | ||
before considering the upgrade as failed. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That change makes a lot of sense to me.
Co-authored-by: Marten Seemann <martenseemann@gmail.com>
We had much better experimental success rate, 90% for UDP/QUIC.
Maybe NATs have gotten better behaved in the 10-15 years since the
literature, or at least our sample had better behaved ones.
…On Mon, Aug 23, 2021, 14:17 Max Inden ***@***.***> wrote:
***@***.**** commented on this pull request.
------------------------------
In relay/DCUtR.md
<#173 (comment)>:
> + Packets should be sent in random intervals between 10 and 200 ms to each
+ address.
Done via 6f558f1
<6f558f1>
.
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#173 (comment)>, or
unsubscribe
<https://github.com/notifications/unsubscribe-auth/AAAI4SQL56BDHB6WDFOSCWDT6IU6NANCNFSM4HQM3A6Q>
.
Triage notifications on the go with GitHub Mobile for iOS
<https://apps.apple.com/app/apple-store/id1477376905?ct=notification-email&mt=8&pt=524675>
or Android
<https://play.google.com/store/apps/details?id=com.github.android&utm_campaign=notification-email>
.
|
Thanks @vyzo. cab60cc removes the concrete (outdated) success rate statements. My reasoning for removing them is the following: The results of Project Flare Phase 1 were convincing enough that we consider Project Flare worth finishing. We can only measure the real success rates once the protocols are widely deployed. Instead of outdated numbers in this spec from early on, I am in favor of removing them, maybe bringing them back once we have more data. |
Co-authored-by: Marten Seemann <martenseemann@gmail.com>
Thanks to the many people involved here! 🙏 |
5. Simultaneous Connect. The two nodes follow the steps below in parallel for | ||
every address obtained from the `Connect` message: | ||
- For a TCP address: | ||
- Upon receiving the `Sync`, `A` immediately dials the address to `B`. | ||
- Upon expiry of the timer, `B` dials the address to `A`. | ||
- This will result in a TCP Simultaneous Connect. For the purpose of all | ||
protocols run on top of this TCP connection, `A` is assumed to be the | ||
client and `B` the server. | ||
- For a QUIC address: | ||
- Upon receiving the `Sync`, `A` immediately dials the address to `B`. | ||
- Upon expiry of the timer, `B` starts to send UDP packets filled with | ||
random bytes to `A`'s address. Packets should be sent repeatedly in | ||
random intervals between 10 and 200 ms. | ||
- This will result in a QUIC connection where `A` is the client and `B` is | ||
the server. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
From what I see, this whole mechanism would also fit nicely upgrading the relay connection to a direct WebRTC connection, if the peers would be allowed to exchange their SDP data here.
Would you be open in amending the spec?
(cc @mxinden)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, good point. We had this in mind, but as you said, it isn't mentioned anywhere. Given that the protocol uses protocol buffers, we could easily extend the messages to include additional data such as SDP payloads, or derive an SDP payload based on the information exchanged through the protocol.
Unfortunately there is no uniform way of speaking WebRTC across the many libp2p libraries (yet). In addition there is no specification yet (see #220 and #159). This is not to say that the project is not interested in adding WebRTC support in the future. Quite the opposite (see https://github.com/libp2p/specs/blob/master/connections/hole-punching.md and https://github.com/libp2p/specs/blob/master/ROADMAP.md#-unprecedented-global-connectivity).
With the above in mind, I am not sure whether it makes much sense to extend this paragraph with a section on WebRTC quite yet.
@wngr what do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think DCUtR would be a great way to add support for upgrading relayed connections to a direct WebRTC connection -- this just feels like the right abstraction, and the the alternative proposals so far appear inferior. Now I acknowledge that the big downside of this approach is that this requires a valid TLS certificate for the peer offering a WS endpoint, but I think that is a pill that can be swallowed, but that's orthogonal to the relayed connection upgrade.
In other words, I think DCUtR is the right way to add support for upgrades to WebRTC (or allow exchanging arbitrary payloads here?), and I don't want to let the current opportunity window slide ;-).
(By the way, I hacked on an experimental webrtc transport for rust-libp2p which supports both browser apis (through wasm) and native; signalling is currently done via p2p-webrtc-star
.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In other words, I think DCUtR is the right way to add support for upgrades to WebRTC (or allow exchanging arbitrary payloads here?), and I don't want to let the current opportunity window slide ;-).
👍
(By the way, I hacked on an experimental webrtc transport for rust-libp2p which supports both browser apis (through wasm) and native; signalling is currently done via p2p-webrtc-star.)
🚀 that is great to hear. Mind opening a work-in-progress pull request on rust-libp2p @wngr?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My current WIP is at https://github.com/wngr/libp2p-webrtc; however I really want to replace the WS signalling server with a libp2p relay node; this is why I started adding my own custom (behaviour, transport) tuple on top of rust-libp2p, which very much is similar to dcutr on a higher level.
What's the state of your dcutr branch? Maybe it makes more sense to prototype it ontop of that?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's the state of your dcutr branch? Maybe it makes more sense to prototype it ontop of that?
You could leverage both libp2p/rust-libp2p#2059 and libp2p/rust-libp2p#2076. In case my understanding of WebRTC and SDP is correct, it solely needs to exchange a payload. If so (at least for now) you could just extend the Protobuf definition of the DCUTR protocol by a single field for that payload.
Happy to talk through this in person if that is preferred. Feel free to reach out via mail @wngr.
still early draft, but it's an important subject that needs to gain some momentum.Status: Ready for review.