Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

refactor(iroh-net): remove random choice of direct addr #2509

Merged
merged 4 commits into from
Jul 18, 2024

Conversation

flub
Copy link
Contributor

@flub flub commented Jul 16, 2024

Description

When we had any direct addresses but not yet a best address we would
randomly try to pick one and send to it on the off chance it worked.
This required to always pick the same direct addr to send to which was
not happening and a bug.

However the scenarios in which this would speed up holepunching were
very unlikely: only really if a single direct addr known to work was
added manually would this improve things.

Simply removing this feature is rather nice:

  • The direct connection still happens very quickly in the best-case
    scenario: a single ping-pong roundtrip is all it costs. No relay
    server or discovery mechanism needed so no functionality change.

  • In all other scenarios there is no change. Connection establishment
    behaves exactly the same as before.

  • There are no more confusing connection type changes at startup.
    This would go relay -> mixed (-> mixed a few times more due to the
    bug) -> direct. Now it just goes relay -> direct which is intuitively
    what people would expect.

The simplest way to test this is by running doctor accept and connect
with direct addrs only and observe the holepunching events emitted.

Breaking Changes

none

Notes & open questions

This is an alternative to #2487. And my preferred solution.

Fixes #2480.

See #2512 for the flaky test.

test_two_devices_roundtrip_network_change managed to hit a timeout on
cross CI. Give it some more time.

Change checklist

  • Self-review.
  • Documentation updates following the style guide, if relevant.
  • Tests if relevant.
  • All breaking changes documented.

When we had any direct addresses but not yet a best address we would
randomly try to pick one and send to it on the off chance it worked.
This required to always pick the same direct addr to send to which was
not happening and a bug.

However the scenarios in which this would speed up holepunching were
very unlikely: only really if a single direct addr known to work was
added manually would this improve things.

Simply removing this feature is rather nice:

- The direct connection still happens very quickly in the best-case
  scenario: a single ping-pong roundtrip is all it costs.  No relay
  server or discovery mechanism needed so no functionality change.

- In all other scenarios there is no change.  Connection establishment
  behaves exactly the same as before.

- There are no more confusing connection type changes at startup.
  This would go relay -> mixed (-> mixed a few times more due to the
  bug) -> direct.  Now it just goes relay -> direct which is intuitively
  what people would expect.

The simplest way to test this is by running doctor accept and connect
with direct addrs only and observe the holepunching events emitted.
@flub flub added hole punching c-iroh-net bug Something isn't working labels Jul 16, 2024
@flub flub added this to the v0.21.0 milestone Jul 16, 2024
@dignifiedquire
Copy link
Contributor

In case no relay is available (eg offline network), will this still work? My worry is that if we can't use the relay, that we will error out in this case, even though we have potential available direct addresses

@flub
Copy link
Contributor Author

flub commented Jul 16, 2024

In case no relay is available (eg offline network), will this still work? My worry is that if we can't use the relay, that we will error out in this case, even though we have potential available direct addresses

In case there is no relay available this will do a ping to the direct address and use it as soon as there's a pong. I tested this to make sure using iroh doctor and all looks good.

This delays using this scenario by 1rtt but for the simplicity we get in return that's a nice tradeoff to me.

@flub
Copy link
Contributor Author

flub commented Jul 16, 2024

It should be pointed out that this is no observable behaviour change. Before an arbitrary number of packets, but not all of them, might have made it to a reachable node. So essentially you get a significant amount of packet loss. Then the ping-pong would get through and then all packets make it. Now you get a bit more predictability in that all packets will be lost until the ping-pong makes it through.

@dignifiedquire
Copy link
Contributor

if it works I am happy

Copy link
Contributor

@matheus23 matheus23 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good from my perspective, so I'm approving, but would like the others to chime in, still, as I'm super new to these parts ✌️

Copy link
Contributor

@divagant-martian divagant-martian left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just want to be sure I'm following:

Since assign_best_addr_from_candidates_if_empty (damn that's long) picks only from addresses with recent pongs, and in case the best address is not set (so, we have no recent pongs, or no direct addresses at all) we go with relay, we will not use for app-level packages any direct address unless it's first pong-verified. Is this correct? If so then I guess this makes sense and the (None, None) case for best address and relay is actually recoverable.

@@ -259,11 +258,7 @@ impl NodeState {
/// Returns the address(es) that should be used for sending the next packet.
///
/// Any or all of the UDP and relay addrs may be non-zero.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

unrelated I guess, but does this even mean? an address being "non zero"? more go baggage?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I read this as "may be Some". Probably the wording has survived from go.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i re-worded this a bit.

.map(|ipp| SocketAddr::from(*ipp));
trace!(udp_addr = ?addr, "best_addr is unset, use candidate addr and relay");
(addr, self.relay_url())
trace!("best_addr is unset, use relay");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

will we get here from somewhere above the node for which a relay only attempt will be made? Or should we have here self.me (or whatever the name of that field around here is)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

2024-07-18T11:31:49.239813Z TRACE poll_send{me=xamcuaerdkwkwwjj}:get_send_addrs{node=dgjnkpaczxaekzxf}: iroh_net::magicsock::node_map::node_state: 285: best_addr is unset, use relay

is how this shows up. so i think this information is already suitably captured in the spans.

@flub
Copy link
Contributor Author

flub commented Jul 18, 2024

Just want to be sure I'm following:

Since assign_best_addr_from_candidates_if_empty (damn that's long) picks only from addresses with recent pongs, and in case the best address is not set (so, we have no recent pongs, or no direct addresses at all) we go with relay, we will not use for app-level packages any direct address unless it's first pong-verified. Is this correct?

Yes, this is correct.

Note also that assign_best_addr_from_candidates_if_empty (lol, yeah) will normally be a no-op as that should have been done already at pong-handling time. It's only there because of a combination of legacy and backup-in-case-of-a-bug.

If so then I guess this makes sense and the (None, None) case for best address and relay is actually recoverable.

Yes, this will drop packets for a little while until it is recovered. I think that is the desired behaviour.

@flub flub enabled auto-merge July 18, 2024 11:47
@flub flub added this pull request to the merge queue Jul 18, 2024
Merged via the queue into main with commit c1c3539 Jul 18, 2024
26 checks passed
@flub flub deleted the flub/no-random-direct-addr branch July 18, 2024 12:23
Arqu added a commit that referenced this pull request Jul 23, 2024
Arqu added a commit that referenced this pull request Jul 23, 2024
github-merge-queue bot pushed a commit that referenced this pull request Aug 5, 2024
## Description

In #2509 (c1c3539) we removed chosing
a random unconfirmed direct address and through both it and the relay
if we did not yet have any confirmed direct address.  This required
any direct address passed in via a NodeAddr to do a DISCO Ping-Pong
round trip before it would be used.

It turns out this is used a lot in the sense that a common socket
address is used: when you know an IP is reachable.  Requiring a ping
would mean the first QUIC packets would be dropped, Quinn would wait
to detect this before retrying and connections are delayed by an
observable second by the end of it all.  Not great for this scenario,
even if it can be argued that outside of tests this is not a common
scenario.

So this brings back the random selection of unconfirmed addresses.

It tries to do this without adding more complexity to the NodeState.
Instead it consolidates the path information and best address into the
NodeUdpState struct.  But **without any logic changes.**  It then adds
just this extra caching of the chosen address to this struct.

This is a huge simplification of a much larger refactor I was trying
to do first.  I believe that was the right track but resulted in too
many changes to do in one large PR.  So the long-term goal here is to
continue removing state manipulation outside of NodeUdpState struct so
that it gains more control of how to manage the UDP paths.  This will
help the logic of how UDP paths are chosen to be based entirely on the
PathState - which is the right model for this.

## Breaking Changes

None

## Notes & open questions

Fixed #2546

## Change checklist

- [x] Self-review.
- [x] Documentation updates following the [style
guide](https://rust-lang.github.io/rfcs/1574-more-api-documentation-conventions.html#appendix-a-full-conventions-text),
if relevant.
- [x] Tests if relevant.
- [x] All breaking changes documented.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working c-iroh-net hole punching
Projects
Archived in project
Development

Successfully merging this pull request may close these issues.

iroh-net: When guessing at a direct address we do not always send to the same one
4 participants