relay/DCUtR: Add Direct Connection Upgrade through Relay protocol #173

Merged on Aug 23, 2021 (29 commits)

Contributor

@vyzo vyzo commented May 29, 2019

still early draft, but it's an important subject that needs to gain some momentum.

In this specification, we describe a synchronization protocol for direct
connectivity with hole punching that eschews signaling servers and utilizes
existing relay connections instead.

Status: Ready for review.

Member

@raulk raulk left a comment

A really good start, @vyzo! Happy to sit on the Interest Group for this.

relay/DCUtR.md Outdated
obtained from the `Connect` message.
- Upon expiry of the timer, `B` starts a direct dial to `A` using the addresses obtained
from the `Connect` message.
6. If the connection is successful, then it is prioritized over the relay connection, which
Member

We need to cover the stream migration procedure in this spec.

Contributor Author

We don't have any yet...

Member

Yes, it needs to be specified, otherwise this whole thing is incomplete. We can incubate it in this spec and then spin it off.

Contributor Author

@vyzo vyzo May 29, 2019

We don't necessarily need any stream migration for the protocol to work.
We can simply open all new streams in the direct connection and garbage collect the relay connection when it no longer has any streams.
Or we can just close it after a grace period and force new streams to be created in the direct connection.
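
The first option described above (new streams go to the direct connection, and the relay connection is garbage collected once it has no streams) can be sketched roughly as follows. This is only an illustration of the idea, not libp2p code: `Conn`, `Peer`, and all method names are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Conn:
    """Hypothetical connection record; not a libp2p API."""
    relayed: bool
    open_streams: int = 0
    closed: bool = False


@dataclass
class Peer:
    """Hypothetical per-peer connection set."""
    conns: list = field(default_factory=list)

    def conn_for_new_stream(self):
        """Prefer a live direct connection; fall back to a relayed one."""
        live = [c for c in self.conns if not c.closed]
        if not live:
            raise RuntimeError("no live connection to peer")
        # False sorts before True, so direct connections win.
        return min(live, key=lambda c: c.relayed)

    def open_stream(self):
        conn = self.conn_for_new_stream()
        conn.open_streams += 1
        return conn

    def close_stream(self, conn):
        conn.open_streams -= 1
        self.gc_relay()

    def gc_relay(self):
        """Close idle relayed connections once a direct connection exists."""
        has_direct = any(not c.relayed and not c.closed for c in self.conns)
        for c in self.conns:
            if has_direct and c.relayed and c.open_streams == 0:
                c.closed = True
```

The second option (closing the relay connection after a grace period) would replace the stream-count check in `gc_relay` with a timer.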

Member

We can simply open all new streams in the direct connection

This behaviour needs to be specified. Failing to specify the overall choreography makes this spec unactionable. "We now established a direct connection, now what?"

Maybe it's a naming thing. "Upgrade" implies the existing connection will evolve. If all we're intending to cover is the signalling and synchronisation, then this spec should be named accordingly.

Member

Referencing stream migration protocol here #328.

raulk previously requested changes May 29, 2019
@vyzo vyzo dismissed raulk’s stale review May 29, 2019 18:12

addressed.

@vyzo vyzo requested a review from raulk May 29, 2019 18:13
The protocol starts with the completion of a relay connection from `A`
to `B`. Upon observing the new connection, the inbound peer (here `B`)
checks the addresses advertised by `A` via identify. If that set
includes public addresses, then `A` _may_ be reachable by a direct

Isn't it possible that A may also be directly reachable at a private address if A and B are on the same local network?

Contributor Author

Yes it is possible, but that would have been dialed directly as the private addresses are still advertised with relay addresses.

Member

I think @albrow has a point. @vyzo: while that should be the case, if we want to be resilient and robust, this protocol should not make assumptions about how any other part of the system behaves. Usually those implicit assumptions make systems brittle.

Luckily our spec lifecycle process allows us to add this topic as an active discussion:

To facilitate open progress tracking and observability, as the Working Draft
evolves, the author(s) SHOULD assemble a checklist of items that are pending
specification, explicitly stating which items are compulsory for promoting the
spec to a Candidate Recommendation.

from: https://github.com/libp2p/specs/blob/master/00-framework-01-spec-lifecycle.md

Contributor Author

Not making this assumption will make us dial private addresses in vain multiple times.
We already have a problem with that.

Contributor Author

At best, we can consider dialing them in the bidirectional part of the protocol.

Contributor Author

Also, if A is public and B is private, we can't possibly be behind the same NAT.

Contributor Author

Furthermore, for the bidirectional part of the protocol we could check the public address of the other node. If that doesn't match our own, we can't possibly be behind the same NAT and dialing private addrs is pointless.

Contributor

It would be nice to avoid dialing private addrs if we can avoid it though. Perhaps we could still exchange them, but in a separate field. Then they can be ignored unless your public address matches the other node and you infer that you're behind the same NAT. Or your implementation may be able to always ignore them, since they would have been dialed previously.

Anyway, I agree that we could punt on this for this round and discuss when we promote to candidate rec.
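
The same-NAT heuristic discussed in this thread (only consider private addresses when both peers are observed at the same public address) could look something like the sketch below. The function name and the use of bare IP strings instead of multiaddrs are assumptions made for brevity. A matching public IP is only a hint, since peers behind carrier-grade NAT can share one without sharing a local network.

```python
import ipaddress


def addrs_to_dial(own_public_ip, peer_public_ip, peer_ips):
    """Return the subset of peer_ips worth dialing (hypothetical helper).

    Private addresses are kept only when both peers were observed at the
    same public IP, which suggests they may sit behind the same NAT.
    """
    same_nat = (
        own_public_ip is not None
        and peer_public_ip is not None
        and ipaddress.ip_address(own_public_ip)
        == ipaddress.ip_address(peer_public_ip)
    )
    return [
        ip for ip in peer_ips
        if same_nat or not ipaddress.ip_address(ip).is_private
    ]
```

For example, with `peer_ips = ["192.168.1.10", "8.8.8.8"]`, a node observed at a different public IP would dial only `8.8.8.8`, while a node observed at the same public IP would try both.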

Member

raulk commented May 30, 2019

@vyzo

@albrow and the 0x guys pointed us to the Trickle ICE spec, which seems like a relevant background read for us. I do expect our specs to be influenced by ICE – as a real-life, successful technology for coordinating hole punching between any two peers.

We should aim to reference ICE WG material to back up our ideas and routines.

Contributor Author

vyzo commented May 30, 2019

@raulk there is a reference to the ICE RFC already.

Member

raulk commented May 30, 2019

@vyzo great, I missed that. What are the parallels between the algo we propose and the ICE procedure?

Contributor Author

vyzo commented May 30, 2019

What are the parallels between the algo we propose and the ICE procedure?

It's like ICE without a signalling server, and distributed STUN - we rely on public peers to tell us our observed addresses instead of using STUN servers.
Also note that ICE mainly caters to UDP, while we very much care for TCP.

@yusefnapora
Contributor

I think the Trickle ICE spec @raulk linked to is an iteration on ICE that incrementally exchanges candidates instead of sending them all at once. Apparently this lets you start testing connectivity sooner.

Member

dryajov commented Dec 17, 2019

related previous discussion - #64

@mxinden mxinden changed the title RFC: Direct Connection Upgrade through Relay relay/DCUtR: Add specification for Direct Connection Upgrade through Relay protocol Aug 17, 2021
Member

@mxinden mxinden left a comment

I am proposing to merge this specification in its current state as a Working Draft.

Reviews welcome!

Below I am highlighting the two most notable recent changes.

If the unilateral connection upgrade attempt fails or if `A` is itself a NATed
peer that doesn't advertise public address, then `B` initiates the direct
connection upgrade protocol as follows:
1. `B` opens a stream to `A` using the `/libp2p/dcutr` protocol.
Member

Note the protocol name /libp2p/dcutr.

relay/DCUtR.md Outdated
Comment on lines 92 to 94
6. On failure go back to step (2), reusing the same stream opened in (1).
Inbound peers (here `B`) SHOULD retry twice (thus a total of 3 attempts)
before considering the upgrade as failed.
Member

Note the added retry logic. Also see FAQ further below for additional reasoning.

Contributor

That change makes a lot of sense to me.
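
The retry rule quoted above (two retries, so three attempts in total, all on the same stream) amounts to a small bounded loop. In this sketch `attempt_hole_punch` is a hypothetical callable standing in for steps (2) onward on the already-open stream; it is not an API from any libp2p implementation.

```python
MAX_ATTEMPTS = 3  # one initial attempt plus two retries, per the quoted text


def upgrade_with_retries(attempt_hole_punch, max_attempts=MAX_ATTEMPTS):
    """Call attempt_hole_punch until it returns True or attempts run out.

    Returns the 1-based number of the attempt that succeeded, or None if
    the upgrade is considered failed.
    """
    for attempt in range(1, max_attempts + 1):
        if attempt_hole_punch():
            return attempt
    return None
```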

@mxinden mxinden changed the title relay/DCUtR: Add specification for Direct Connection Upgrade through Relay protocol relay/DCUtR: Add Direct Connection Upgrade through Relay protocol Aug 19, 2021
relay/DCUtR.md Outdated
relies on the two peers synchronizing and simultaneously opening
connections to each other to their predicted external address. It
works well for UDP, with an estimated 80% success rate, and reasonably
well for TCP, with an estimated 60% success rate.
Contributor

Didn't we see much better numbers than this?

Member

I think they have been in the same ballpark, but I might well be mistaken. Unfortunately I am unable to access the data from Project Flare Phase 1. Either the data or my access seems to have been removed.

@vyzo do you know more here?

Member

Discussion continued on #173 (comment).

Contributor Author

vyzo commented Aug 23, 2021 via email

Member

mxinden commented Aug 23, 2021

Thanks @vyzo.

cab60cc removes the concrete (outdated) success rate statements. My reasoning for removing them is the following: The results of Project Flare Phase 1 were convincing enough that we consider Project Flare worth finishing. We can only measure the real success rates once the protocols are widely deployed. Instead of outdated numbers in this spec from early on, I am in favor of removing them, maybe bringing them back once we have more data.

Co-authored-by: Marten Seemann <martenseemann@gmail.com>
@mxinden mxinden merged commit 689e5cb into master Aug 23, 2021
Member

mxinden commented Aug 23, 2021

Thanks to the many people involved here! 🙏

Comment on lines +89 to +103
5. Simultaneous Connect. The two nodes follow the steps below in parallel for
every address obtained from the `Connect` message:
- For a TCP address:
- Upon receiving the `Sync`, `A` immediately dials the address to `B`.
- Upon expiry of the timer, `B` dials the address to `A`.
- This will result in a TCP Simultaneous Connect. For the purpose of all
protocols run on top of this TCP connection, `A` is assumed to be the
client and `B` the server.
- For a QUIC address:
- Upon receiving the `Sync`, `A` immediately dials the address to `B`.
- Upon expiry of the timer, `B` starts to send UDP packets filled with
random bytes to `A`'s address. Packets should be sent repeatedly in
random intervals between 10 and 200 ms.
- This will result in a QUIC connection where `A` is the client and `B` is
the server.
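
As a rough illustration of `B`'s side of the QUIC branch above, the sketch below generates the send schedule: random-byte payloads spaced at random intervals between 10 and 200 ms. Two things here are assumptions rather than content of the quoted text: the 64-byte payload size, and `sync_timer_delay`, which assumes `B`'s timer is half the measured round-trip time.

```python
import os
import random


def sync_timer_delay(rtt_seconds):
    """Assumed timer length for B: half the measured round-trip time."""
    return rtt_seconds / 2


def quic_punch_schedule(count, payload_size=64, rng=random):
    """Yield (delay_seconds, payload) pairs for B's UDP hole-punch packets.

    Each delay is drawn uniformly from 10 ms to 200 ms and each payload is
    filled with random bytes. payload_size is an illustrative assumption.
    """
    for _ in range(count):
        yield rng.uniform(0.010, 0.200), os.urandom(payload_size)
```

A caller would sleep for each delay and then send the payload over a UDP socket to `A`'s address, stopping once the QUIC connection from `A` arrives or the attempt times out.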

From what I see, this whole mechanism would also fit nicely upgrading the relay connection to a direct WebRTC connection, if the peers would be allowed to exchange their SDP data here.
Would you be open to amending the spec?
(cc @mxinden)

Member

Yes, good point. We had this in mind, but as you said, it isn't mentioned anywhere. Given that the protocol uses protocol buffers, we could easily extend the messages to include additional data such as SDP payloads, or derive an SDP payload based on the information exchanged through the protocol.

Unfortunately there is no uniform way of speaking WebRTC across the many libp2p libraries (yet). In addition there is no specification yet (see #220 and #159). This is not to say that the project is not interested in adding WebRTC support in the future. Quite the opposite (see https://github.com/libp2p/specs/blob/master/connections/hole-punching.md and https://github.com/libp2p/specs/blob/master/ROADMAP.md#-unprecedented-global-connectivity).

With the above in mind, I am not sure whether it makes much sense to extend this paragraph with a section on WebRTC quite yet.

@wngr what do you think?

@wngr wngr Sep 27, 2021

I think DCUtR would be a great way to add support for upgrading relayed connections to a direct WebRTC connection -- this just feels like the right abstraction, and the alternative proposals so far appear inferior. I acknowledge that the big downside of this approach is that it requires a valid TLS certificate for the peer offering a WS endpoint. I think that is a pill that can be swallowed, and it is orthogonal to the relayed connection upgrade.
In other words, I think DCUtR is the right way to add support for upgrades to WebRTC (or allow exchanging arbitrary payloads here?), and I don't want to let the current opportunity window slide ;-).

(By the way, I hacked on an experimental webrtc transport for rust-libp2p which supports both browser apis (through wasm) and native; signalling is currently done via p2p-webrtc-star.)

Member

In other words, I think DCUtR is the right way to add support for upgrades to WebRTC (or allow exchanging arbitrary payloads here?), and I don't want to let the current opportunity window slide ;-).

👍

(By the way, I hacked on an experimental webrtc transport for rust-libp2p which supports both browser apis (through wasm) and native; signalling is currently done via p2p-webrtc-star.)

🚀 that is great to hear. Mind opening a work-in-progress pull request on rust-libp2p @wngr?


My current WIP is at https://github.com/wngr/libp2p-webrtc. However, I really want to replace the WS signalling server with a libp2p relay node; this is why I started adding my own custom (behaviour, transport) tuple on top of rust-libp2p, which is very similar to dcutr at a higher level.
What's the state of your dcutr branch? Maybe it makes more sense to prototype it on top of that?

Member

What's the state of your dcutr branch? Maybe it makes more sense to prototype it on top of that?

You could leverage both libp2p/rust-libp2p#2059 and libp2p/rust-libp2p#2076. In case my understanding of WebRTC and SDP is correct, it solely needs to exchange a payload. If so (at least for now) you could just extend the Protobuf definition of the DCUTR protocol by a single field for that payload.

Happy to talk through this in person if that is preferred. Feel free to reach out via mail @wngr.
