refactor(iroh-net): remove random choice of direct addr #2509
Conversation
Description

When we had some direct addresses but not yet a best address, we would randomly pick one and send to it on the off chance it worked. This required always picking the same direct addr to send to, which was not happening: a bug. However, the scenarios in which this would speed up holepunching were very unlikely: only when a single direct addr already known to work was added manually would this improve things.

Simply removing this feature is rather nice:

- The direct connection still happens very quickly in the best-case scenario: a single ping-pong roundtrip is all it costs. No relay server or discovery mechanism is needed, so no functionality change.
- In all other scenarios there is no change. Connection establishment behaves exactly the same as before.
- There are no more confusing connection type changes at startup. This used to go relay -> mixed (-> mixed a few more times due to the bug) -> direct. Now it just goes relay -> direct, which is intuitively what people would expect.

The simplest way to test this is by running doctor accept and connect with direct addrs only and observing the holepunching events emitted.

Breaking Changes

None

Notes & open questions

This is an alternative to #2487, and my preferred solution.
Fixes #2480.
See #2512 for the flaky test: test_two_devices_roundtrip_network_change managed to hit a timeout on cross CI. Give it some more time.
In case no relay is available (e.g. an offline network), will this still work? My worry is that if we can't use the relay we will error out in this case, even though we have potentially usable direct addresses.
In case there is no relay available, this will send a ping to the direct address and use it as soon as there is a pong. I tested this with iroh doctor to make sure, and all looks good. This delays this scenario by one round trip, but for the simplicity we get in return that is a nice tradeoff to me.
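To make the fallback being discussed concrete, here is a minimal sketch of the send-address selection after this change. The types and names (RelayUrl, NodeState, best_addr, relay_url, send_addrs) are simplified stand-ins for illustration, not the actual iroh-net API:

```rust
use std::net::SocketAddr;

// Simplified stand-in for illustration only.
struct RelayUrl(String);

struct NodeState {
    /// A direct address confirmed by a DISCO ping-pong, if any.
    best_addr: Option<SocketAddr>,
    /// The relay URL this node can be reached on, if any.
    relay_url: Option<RelayUrl>,
}

impl NodeState {
    /// Returns the addresses to use for sending the next packet.
    ///
    /// After this PR only a pong-verified address is used directly;
    /// unconfirmed candidates are no longer tried at random.
    fn send_addrs(&self) -> (Option<SocketAddr>, Option<&RelayUrl>) {
        match self.best_addr {
            // A confirmed direct path exists: use it.
            Some(addr) => (Some(addr), None),
            // No confirmed direct path yet: fall back to the relay and
            // wait for a DISCO ping-pong to confirm a candidate.
            None => (None, self.relay_url.as_ref()),
        }
    }
}
```

With no relay configured, both halves of the tuple stay None until the pong arrives, which matches the one-round-trip delay described above.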
It is worth pointing out what the observable behaviour change amounts to. Before, an arbitrary number of packets, but not all of them, might have made it to a reachable node, so essentially you got a significant amount of packet loss; then the ping-pong would get through and all packets would make it. Now you get a bit more predictability: all packets will be lost until the ping-pong makes it through.
If it works, I am happy.
This looks good from my perspective, so I'm approving, but I'd still like the others to chime in, as I'm super new to these parts ✌️
Just want to be sure I'm following: since assign_best_addr_from_candidates_if_empty (damn that's long) picks only from addresses with recent pongs, and when the best address is not set (so we have no recent pongs, or no direct addresses at all) we go with the relay, we will not use any direct address for app-level packets unless it is first pong-verified. Is this correct? If so then I guess this makes sense, and the (None, None) case for best address and relay is actually recoverable.
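For illustration, a sketch of the selection rule being described, under the assumption (taken from the comment above) that only addresses with a recent pong are eligible. The names and the timeout value are hypothetical, not taken from the codebase:

```rust
use std::net::SocketAddr;
use std::time::{Duration, Instant};

/// How recent a pong must be for an address to count as alive
/// (illustrative value, not the real constant).
const PONG_VALIDITY: Duration = Duration::from_secs(30);

/// Per-candidate state, reduced to what matters here.
struct PathState {
    addr: SocketAddr,
    last_pong: Option<Instant>,
}

/// The idea behind assign_best_addr_from_candidates_if_empty: a
/// candidate may only become the best address if it has a recent pong.
/// If none qualifies, the caller falls back to the relay.
fn pick_best_addr(candidates: &[PathState], now: Instant) -> Option<SocketAddr> {
    candidates
        .iter()
        .filter(|p| matches!(p.last_pong, Some(t) if now.duration_since(t) < PONG_VALIDITY))
        .map(|p| p.addr)
        .next()
}
```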
```diff
@@ -259,11 +258,7 @@ impl NodeState {
     /// Returns the address(es) that should be used for sending the next packet.
     ///
     /// Any or all of the UDP and relay addrs may be non-zero.
```
Unrelated, I guess, but what does this even mean? An address being "non-zero"? More Go baggage?
I read this as "may be Some". The wording has probably survived from Go.
I re-worded this a bit.
```diff
-            .map(|ipp| SocketAddr::from(*ipp));
-        trace!(udp_addr = ?addr, "best_addr is unset, use candidate addr and relay");
-        (addr, self.relay_url())
+        trace!("best_addr is unset, use relay");
```
Will it be clear, from somewhere above, which node this relay-only attempt is being made for? Or should we include self.me here (or whatever the name of that field around here is)?
2024-07-18T11:31:49.239813Z TRACE poll_send{me=xamcuaerdkwkwwjj}:get_send_addrs{node=dgjnkpaczxaekzxf}: iroh_net::magicsock::node_map::node_state: 285: best_addr is unset, use relay
This is how it shows up, so I think this information is already suitably captured in the spans.
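As an aside for readers unfamiliar with the tracing crate: the me and node fields in that line come from the enclosing spans, not from the event itself. A minimal sketch of the pattern, with hypothetical functions rather than the actual iroh-net instrumentation (assumes a subscriber such as tracing_subscriber::fmt has been installed):

```rust
use tracing::{info_span, trace};

fn poll_send(me: &str) {
    // The span attaches `me` to everything logged while it is entered.
    let _span = info_span!("poll_send", me).entered();
    get_send_addrs("dgjnkpaczxaekzxf");
}

fn get_send_addrs(node: &str) {
    let _span = info_span!("get_send_addrs", node).entered();
    // The event itself only carries the message; `me` and `node` show
    // up in the output because the enclosing spans are active.
    trace!("best_addr is unset, use relay");
}
```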
Yes, this is correct. Note also that
Yes, this will drop packets for a little while until it is recovered. I think that is the desired behaviour.
Description

In #2509 (c1c3539) we removed choosing a random unconfirmed direct address and sending through both it and the relay if we did not yet have any confirmed direct address. This required any direct address passed in via a NodeAddr to complete a DISCO ping-pong round trip before it would be used. It turns out this is relied on a lot in one common scenario: when you already know an IP is reachable. Requiring a ping means the first QUIC packets are dropped, Quinn waits to detect this before retrying, and by the end of it all the connection is delayed by an observable second. Not great for this scenario, even if it can be argued that outside of tests it is not a common one.

So this brings back the random selection of unconfirmed addresses. It tries to do this without adding more complexity to the NodeState. Instead it consolidates the path information and best address into the NodeUdpState struct, but without any logic changes. It then adds just this extra caching of the chosen address to that struct.

This is a huge simplification of a much larger refactor I was trying to do first. I believe that was the right track, but it resulted in too many changes to do in one large PR. So the long-term goal here is to continue moving state manipulation into the NodeUdpState struct so that it gains more control of how to manage the UDP paths. This will help the logic of how UDP paths are chosen to be based entirely on the PathState, which is the right model for this.

Breaking Changes

None

Notes & open questions

Fixes #2546

Change checklist

- [x] Self-review.
- [x] Documentation updates following the style guide (https://rust-lang.github.io/rfcs/1574-more-api-documentation-conventions.html#appendix-a-full-conventions-text), if relevant.
- [x] Tests if relevant.
- [x] All breaking changes documented.
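A rough sketch of the caching idea described above, with hypothetical field and method names; the real NodeUdpState will differ:

```rust
use std::net::SocketAddr;

struct NodeUdpState {
    /// A pong-confirmed direct address, if any.
    best_addr: Option<SocketAddr>,
    /// All known direct address candidates.
    candidates: Vec<SocketAddr>,
    /// The unconfirmed candidate we committed to. Caching it ensures
    /// every send uses the same address, avoiding the original bug of
    /// picking a different random candidate on each send.
    chosen_candidate: Option<SocketAddr>,
}

impl NodeUdpState {
    fn send_addr(&mut self) -> Option<SocketAddr> {
        if let Some(addr) = self.best_addr {
            // A confirmed path always wins.
            return Some(addr);
        }
        if self.chosen_candidate.is_none() {
            // Commit to one candidate and stick with it until it is
            // confirmed or replaced (here simply the first one; the
            // real selection may be random).
            self.chosen_candidate = self.candidates.first().copied();
        }
        self.chosen_candidate
    }
}
```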