refactor(iroh-net): remove random choice of direct addr #2509
Conversation
Description

When we had some direct addresses but not yet a best address, we would randomly pick one and send to it on the off chance it worked. This required always picking the same direct addr to send to, which was not happening: a bug. However, the scenarios in which this would speed up holepunching were very unlikely: only when a single direct addr already known to work was added manually would this improve things.

Simply removing this feature is rather nice:

- The direct connection still happens very quickly in the best-case scenario: a single ping-pong roundtrip is all it costs. No relay server or discovery mechanism is needed, so no functionality change.
- In all other scenarios there is no change. Connection establishment behaves exactly the same as before.
- There are no more confusing connection type changes at startup. This used to go relay -> mixed (-> mixed a few more times due to the bug) -> direct. Now it just goes relay -> direct, which is intuitively what people would expect.

The simplest way to test this is by running doctor accept and connect with direct addrs only and observing the holepunching events emitted.

Breaking Changes

None

Notes & open questions

This is an alternative to #2487, and my preferred solution.
Fixes #2480.
See #2512 for the flaky test: test_two_devices_roundtrip_network_change managed to hit a timeout on cross CI. Give it some more time.
In case no relay is available (e.g. an offline network), will this still work? My worry is that if we can't use the relay we will error out in this case, even though we have potentially usable direct addresses.
In case there is no relay available, this will send a ping to the direct address and use it as soon as there is a pong. I tested this with iroh doctor to make sure, and all looks good. This delays this scenario by one round trip, but for the simplicity we get in return that is a nice tradeoff to me.
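To make the fallback being discussed concrete, here is a minimal sketch of the send-address selection after this change. The types and names (RelayUrl, NodeState, best_addr, relay_url, send_addrs) are simplified stand-ins for illustration, not the actual iroh-net API:

```rust
use std::net::SocketAddr;

// Simplified stand-in for illustration only.
struct RelayUrl(String);

struct NodeState {
    /// A direct address confirmed by a DISCO ping-pong, if any.
    best_addr: Option<SocketAddr>,
    /// The relay URL this node can be reached on, if any.
    relay_url: Option<RelayUrl>,
}

impl NodeState {
    /// Returns the addresses to use for sending the next packet.
    ///
    /// After this PR only a pong-verified address is used directly;
    /// unconfirmed candidates are no longer tried at random.
    fn send_addrs(&self) -> (Option<SocketAddr>, Option<&RelayUrl>) {
        match self.best_addr {
            // A confirmed direct path exists: use it.
            Some(addr) => (Some(addr), None),
            // No confirmed direct path yet: fall back to the relay and
            // wait for a DISCO ping-pong to confirm a candidate.
            None => (None, self.relay_url.as_ref()),
        }
    }
}
```

With no relay configured, both halves of the tuple stay None until the pong arrives, which matches the one-round-trip delay described above.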
It is worth pointing out what the observable behaviour change amounts to. Before, an arbitrary number of packets, but not all of them, might have made it to a reachable node, so essentially you got a significant amount of packet loss; then the ping-pong would get through and all packets would make it. Now you get a bit more predictability: all packets will be lost until the ping-pong makes it through.
If it works, I am happy.
This looks good from my perspective, so I'm approving, but I'd still like the others to chime in, as I'm super new to these parts ✌️
Just want to be sure I'm following: since assign_best_addr_from_candidates_if_empty (damn that's long) picks only from addresses with recent pongs, and when the best address is not set (so we have no recent pongs, or no direct addresses at all) we go with the relay, we will not use any direct address for app-level packets unless it is first pong-verified. Is this correct? If so then I guess this makes sense, and the (None, None) case for best address and relay is actually recoverable.
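For illustration, a sketch of the selection rule being described, under the assumption (taken from the comment above) that only addresses with a recent pong are eligible. The names and the timeout value are hypothetical, not taken from the codebase:

```rust
use std::net::SocketAddr;
use std::time::{Duration, Instant};

/// How recent a pong must be for an address to count as alive
/// (illustrative value, not the real constant).
const PONG_VALIDITY: Duration = Duration::from_secs(30);

/// Per-candidate state, reduced to what matters here.
struct PathState {
    addr: SocketAddr,
    last_pong: Option<Instant>,
}

/// The idea behind assign_best_addr_from_candidates_if_empty: a
/// candidate may only become the best address if it has a recent pong.
/// If none qualifies, the caller falls back to the relay.
fn pick_best_addr(candidates: &[PathState], now: Instant) -> Option<SocketAddr> {
    candidates
        .iter()
        .filter(|p| matches!(p.last_pong, Some(t) if now.duration_since(t) < PONG_VALIDITY))
        .map(|p| p.addr)
        .next()
}
```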
```diff
@@ -259,11 +258,7 @@ impl NodeState {
     /// Returns the address(es) that should be used for sending the next packet.
     ///
     /// Any or all of the UDP and relay addrs may be non-zero.
```
Unrelated, I guess, but what does this even mean? An address being "non-zero"? More Go baggage?
I read this as "may be Some". The wording has probably survived from Go.
I re-worded this a bit.
```diff
-            .map(|ipp| SocketAddr::from(*ipp));
-        trace!(udp_addr = ?addr, "best_addr is unset, use candidate addr and relay");
-        (addr, self.relay_url())
+        trace!("best_addr is unset, use relay");
```
Will it be clear, from somewhere above, which node this relay-only attempt is being made for? Or should we include self.me here (or whatever the name of that field around here is)?
2024-07-18T11:31:49.239813Z TRACE poll_send{me=xamcuaerdkwkwwjj}:get_send_addrs{node=dgjnkpaczxaekzxf}: iroh_net::magicsock::node_map::node_state: 285: best_addr is unset, use relay
This is how it shows up, so I think this information is already suitably captured in the spans.
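As an aside for readers unfamiliar with the tracing crate: the me and node fields in that line come from the enclosing spans, not from the event itself. A minimal sketch of the pattern, with hypothetical functions rather than the actual iroh-net instrumentation (assumes a subscriber such as tracing_subscriber::fmt has been installed):

```rust
use tracing::{info_span, trace};

fn poll_send(me: &str) {
    // The span attaches `me` to everything logged while it is entered.
    let _span = info_span!("poll_send", me).entered();
    get_send_addrs("dgjnkpaczxaekzxf");
}

fn get_send_addrs(node: &str) {
    let _span = info_span!("get_send_addrs", node).entered();
    // The event itself only carries the message; `me` and `node` show
    // up in the output because the enclosing spans are active.
    trace!("best_addr is unset, use relay");
}
```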
Yes, this is correct. Note also that
Yes, this will drop packets for a little while until it is recovered. I think that is the desired behaviour.
Description

In #2509 (c1c3539) we removed choosing a random unconfirmed direct address and sending through both it and the relay if we did not yet have any confirmed direct address. This required any direct address passed in via a NodeAddr to complete a DISCO ping-pong round trip before it would be used. It turns out this is relied on a lot in one common scenario: when you already know an IP is reachable. Requiring a ping means the first QUIC packets are dropped, Quinn waits to detect this before retrying, and by the end of it all the connection is delayed by an observable second. Not great for this scenario, even if it can be argued that outside of tests it is not a common one.

So this brings back the random selection of unconfirmed addresses. It tries to do this without adding more complexity to the NodeState. Instead it consolidates the path information and best address into the NodeUdpState struct, but without any logic changes. It then adds just this extra caching of the chosen address to that struct.

This is a huge simplification of a much larger refactor I was trying to do first. I believe that was the right track, but it resulted in too many changes to do in one large PR. So the long-term goal here is to continue moving state manipulation into the NodeUdpState struct so that it gains more control of how to manage the UDP paths. This will help the logic of how UDP paths are chosen to be based entirely on the PathState, which is the right model for this.

Breaking Changes

None

Notes & open questions

Fixes #2546

Change checklist

- [x] Self-review.
- [x] Documentation updates following the style guide (https://rust-lang.github.io/rfcs/1574-more-api-documentation-conventions.html#appendix-a-full-conventions-text), if relevant.
- [x] Tests if relevant.
- [x] All breaking changes documented.
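A rough sketch of the caching idea described above, with hypothetical field and method names; the real NodeUdpState will differ:

```rust
use std::net::SocketAddr;

struct NodeUdpState {
    /// A pong-confirmed direct address, if any.
    best_addr: Option<SocketAddr>,
    /// All known direct address candidates.
    candidates: Vec<SocketAddr>,
    /// The unconfirmed candidate we committed to. Caching it ensures
    /// every send uses the same address, avoiding the original bug of
    /// picking a different random candidate on each send.
    chosen_candidate: Option<SocketAddr>,
}

impl NodeUdpState {
    fn send_addr(&mut self) -> Option<SocketAddr> {
        if let Some(addr) = self.best_addr {
            // A confirmed path always wins.
            return Some(addr);
        }
        if self.chosen_candidate.is_none() {
            // Commit to one candidate and stick with it until it is
            // confirmed or replaced (here simply the first one; the
            // real selection may be random).
            self.chosen_candidate = self.candidates.first().copied();
        }
        self.chosen_candidate
    }
}
```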