
AutoRelay + HolePunching + AutoNAT regressions #2965

Closed
burdiyan opened this issue Sep 17, 2024 · 5 comments


@burdiyan
Contributor

Recently we upgraded the version of libp2p we are using to v0.36.3. We started having a lot of problems with hole punching, to the point that it just doesn't work.

I spent a few days digging into the issues, and I'd like to share my findings, because I believe there are a few bugs here that may be regressions compared to an older version of libp2p.

The setup I used to reproduce these issues is the following:

  1. A server on DigitalOcean with a public IP and all the ports open which acts as a relay. It uses libp2p with hole punching, relay service, and nat service enabled. It also has public reachability forced.
  2. A macOS laptop on a home NAT-ed network, with private reachability forced, hole punching enabled, and AutoRelay configured with the static address of the relay in step 1.
  3. A Linux computer on a totally different network, also behind NAT, with a similar configuration to the laptop from step 2.

I manually run the relay, then run the first node, wait until it connects to the relay, and then copy its addresses. I then spin up the second node on the other computer and make it connect using the addresses I previously copied. A connection is established, but it gets stuck in the Limited state and never gets upgraded to the Connected state, so I'm never able to open any streams unless I use the AllowLimitedConn option.

I tried doing the same thing without forcing reachability on the NAT-ed nodes, letting them figure it out using AutoNAT. It didn't help. Using AutoNAT v2 doesn't seem to make any difference either. Both machines correctly determine that they are private, then connect to the relay, but they never figure out their own public IPs. Sometimes I see a lot of random AutoNAT dialing failures in the logs.

After spending a lot of time tweaking the code and enabling all sorts of log messages, I figured out the following:

Regardless of whether reachability is forced, and regardless of whether AutoNAT v2 is used, in both of my totally separate networks the libp2p node is unable to discover its public IP address. As a result, the hole-punching service never starts, which is why the relayed connection never gets upgraded into a direct one.

I tried to fix this problem by manually detecting my public IP using STUN, and then adding it to the list of my addresses using a custom AddrFactory option.

That didn't fix the problem, because the hole-punching code doesn't use host.Addrs() to detect its public IP when performing the DCUtR protocol. It only takes observed addresses plus network interface addresses.

See this code:

```go
func (s *Service) getPublicAddrs() []ma.Multiaddr {
	addrs := removeRelayAddrs(s.ids.OwnObservedAddrs())
	interfaceListenAddrs, err := s.host.Network().InterfaceListenAddresses()
	if err != nil {
		log.Debugf("failed to get to get InterfaceListenAddresses: %s", err)
	} else {
		addrs = append(addrs, interfaceListenAddrs...)
	}
	addrs = ma.Unique(addrs)
	publicAddrs := make([]ma.Multiaddr, 0, len(addrs))
	for _, addr := range addrs {
		if manet.IsPublicAddr(addr) {
			publicAddrs = append(publicAddrs, addr)
		}
	}
	return publicAddrs
}

// DirectConnect is only exposed for testing purposes.
// TODO: find a solution for this.
func (s *Service) DirectConnect(p peer.ID) error {
	<-s.hasPublicAddrsChan
	s.holePuncherMx.Lock()
	holePuncher := s.holePuncher
	s.holePuncherMx.Unlock()
```

And this line here:

```go
obsAddrs := removeRelayAddrs(hp.ids.OwnObservedAddrs())
```

In my case, the observed addresses are always empty, because for some reason AutoNAT doesn't seem to be doing its job. And because host.Addrs() is not called there, my custom AddrFactory is not being used either.

I forked libp2p and made the necessary changes to use host.Addrs() to collect all the addresses. Unfortunately, that didn't work either, because AutoRelay seems to overwrite my custom AddrFactory. I created a separate issue for this: #2964.

This is where I realized that no amount of duct tape would fix the problem for me, so I decided to create this issue.

To summarize:

  1. AutoNAT doesn't seem to detect the public IP of the node.
  2. AutoRelay breaks custom AddrFactory.
  3. Even if custom AddrFactory worked, the hole punching code wouldn't notice it.
@sukunrt
Member

sukunrt commented Sep 17, 2024

I think this is happening because hole punching only uses addresses obtained from identify; this is the hp.ids.OwnObservedAddrs() bit that you've highlighted.

To confirm whether this is the issue, can you try bootstrapping your private nodes with a DHT? Try using the IPFS DHT (https://github.com/libp2p/go-libp2p-kad-dht). This will connect you to a bunch of peers who will provide you your public addresses.

@burdiyan
Contributor Author

At one point I did try connecting to the known DHT peers, although I didn't actually initialize the DHT itself. I will try that and let you know.

@burdiyan
Contributor Author

I tried bootstrapping the DHT and repeated the test, and it worked. Which is, to be honest, a bit frustrating :) because it clearly was not working before, even with bootstrapping. Although I had previously been seeing some errors during bootstrapping.

Could some temporary failure on the IPFS bootstrap nodes cause something like this?

@burdiyan
Contributor Author

I created a separate issue about the use of addresses in hole punching: #2966

I guess this issue could be closed, unless it is useful to keep it open for discovering and tracking the issues with AutoNAT being unreliable.

@sukunrt
Member

sukunrt commented Sep 19, 2024

Let's use #2966. Please reopen this if you run into this again:

> Because it clearly was not working before, even with bootstrapping.

@sukunrt sukunrt closed this as completed Sep 19, 2024