
Bugfix/feature: Allow multiple nodes behind a single public IP address, e.g. remote nodes behind NAT. #2073

Conversation

@ludost ludost commented Oct 4, 2024

Description

Although K8s, Flannel, and WireGuard all support running k8s nodes in a private network behind a NAT-applying firewall, the implementation of the WireGuard backend in Flannel explicitly breaks this setup by over-zealously cleaning up remote peers. The resulting system behavior: when a new node is started, the existing node in the same private network gets its routing broken, without warning or user feedback, leading to unwanted and unexpected network timeouts. This PR removes the call to that cleanup code in the WireGuard backend.

I consider this change a bug fix, but maybe it should be treated as a new feature given the potential impact. Please advise me on what level of configurability should be added around this new behavior.

We (at Asimovo B.V., www.asimovo.com) run a production setup with K3s client nodes for end users, acting as a kind of "road warrior" VPN client to our cloud solution. With this fix applied, we can support multiple clients from the same private network (e.g. several employees of the same company). We have been running this setup without encountering any issues with old or stale routes.

Discussion

For background, here is a description of the current identification of nodes in the related components:

  • K3s assigns a subnet to each node, using the node name as identifier/unique key.
  • Flannel sets up the routing for these nodes, using the subnet address as identifier/unique key.
  • A WireGuard endpoint is set up (by Flannel), using the node's publicKey as identifier/unique key.

No layer here treats the remote public IP address as a unique key, and therefore this IP address does not need to be unique. WireGuard is excellent at solving exactly this NAT-traversal challenge, in particular at tracking the changed/reassigned remote UDP port. So there is no need for Flannel to remove WireGuard peers when multiple publicKeys are found behind the same IP address; by doing so, Flannel breaks an otherwise fully working feature.
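To make the identification scheme above concrete, here is a minimal sketch of configuring two WireGuard peers that share one public IP and are distinguished only by their public keys. It uses the wgctrl Go library; the device name, keys, ports, and subnets are placeholder assumptions, and this is illustrative rather than Flannel's actual backend code.

```go
// Illustrative sketch: two peers behind the same NAT gateway IP, keyed only by
// their public keys. Device name, keys, ports, and subnets are placeholders.
package main

import (
	"log"
	"net"

	"golang.zx2c4.com/wireguard/wgctrl"
	"golang.zx2c4.com/wireguard/wgctrl/wgtypes"
)

func main() {
	client, err := wgctrl.New()
	if err != nil {
		log.Fatal(err)
	}
	defer client.Close()

	// In a real cluster these keys come from the remote nodes; generated here
	// only so the sketch is self-contained.
	privA, _ := wgtypes.GeneratePrivateKey()
	privB, _ := wgtypes.GeneratePrivateKey()

	sharedIP := net.ParseIP("203.0.113.10")        // the NAT gateway's public IP
	_, subnetA, _ := net.ParseCIDR("10.42.1.0/24") // pod subnet assigned to node A
	_, subnetB, _ := net.ParseCIDR("10.42.2.0/24") // pod subnet assigned to node B

	cfg := wgtypes.Config{
		ReplacePeers: false, // keep existing peers; each peer is identified by its public key
		Peers: []wgtypes.PeerConfig{
			{
				PublicKey:         privA.PublicKey(),
				Endpoint:          &net.UDPAddr{IP: sharedIP, Port: 51820},
				ReplaceAllowedIPs: true,
				AllowedIPs:        []net.IPNet{*subnetA},
			},
			{
				// Same public IP, different NAT-assigned port: WireGuard keeps
				// the peers apart purely by public key, and updates the port on
				// its own when the NAT mapping changes.
				PublicKey:         privB.PublicKey(),
				Endpoint:          &net.UDPAddr{IP: sharedIP, Port: 51821},
				ReplaceAllowedIPs: true,
				AllowedIPs:        []net.IPNet{*subnetB},
			},
		},
	}

	// "flannel-wg" is assumed here as the device name.
	if err := client.ConfigureDevice("flannel-wg", cfg); err != nil {
		log.Fatal(err)
	}
}
```

Both peers coexist on the device because WireGuard keys its peer table on the public key, not on the endpoint address.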

Todos

As mentioned, we might want to place the new behavior behind a feature gate in the configuration, unless you agree that this is just a bugfix and the behavior should have been enabled all along.

Depending on this choice between fix and feature, the documentation impact is either:

  • Bugfix: no major impact, just a release note: "The default setup provided by the WireGuard backend now supports running multiple nodes behind the same public IP address."
  • Feature: we need to add a configuration flag, with associated documentation. I would also suggest adding a note to the main WireGuard backend documentation warning users about the cleanup behavior and referring to the new flag.

Release Note

The default setup provided by the WireGuard backend now supports running multiple nodes behind the same public IP address.

@ludost changed the title from "Feature allow multiple nodes behind a single ip, remote nodes behind NAT." to "Bugfix/feature: Allow multiple nodes behind a single public IP address, e.g. remote nodes behind NAT." on Oct 4, 2024
ludost commented Oct 4, 2024

Hi @thomasferrandiz, I saw the failed linter run and fixed it with the latest commit. If you decide this should be placed behind a feature flag, I can easily reintroduce the method.

@manuelbuil (Collaborator) commented:

Thanks for this PR and the good explanation, and sorry for the late review.

I have nothing against the general idea. I just wonder why the author of the cleanupEndpointPeers function decided to always use it when adding a new peer. Perhaps he was deploying Flannel very often and, by always removing old peers, made sure that the new configuration was always in place. Hopefully he is still around and can remember! @andreek

I also see it as a bugfix; let's wait for others to chime in.

andreek commented Oct 18, 2024

> I have nothing against the general idea. I just wonder why the author of the cleanupEndpointPeers function decided to always use it when adding a new peer. Perhaps he was deploying Flannel very often and, by always removing old peers, made sure that the new configuration was always in place. Hopefully he is still around and can remember! @andreek

Yes, that was the reason. In a cluster with dynamic node scaling, some WireGuard devices accumulated many inactive peers because the public key behind a given public IP had changed. I don't know whether this still occurs today, but the cleanup rests on the assumption that one public IP maps to only one public key, so it should be fine to remove it to support the use case of this PR.
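For readers following along: below is a rough, hypothetical sketch of what an endpoint-IP-keyed cleanup such as cleanupEndpointPeers conceptually does, again using the wgctrl library. It is not the code removed by this PR, and the function and device names are assumptions. It prunes every peer that shares the new peer's public IP but has a different public key, which is exactly what breaks the multiple-nodes-behind-one-NAT setup.

```go
// Hypothetical sketch of an endpoint-IP-keyed cleanup; not the actual Flannel code.
package main

import (
	"log"
	"net"

	"golang.zx2c4.com/wireguard/wgctrl"
	"golang.zx2c4.com/wireguard/wgctrl/wgtypes"
)

// removePeersSharingEndpointIP removes every peer on dev whose endpoint IP equals
// publicIP but whose public key differs from keep. Under the "one public IP ==
// one public key" assumption this only prunes stale peers; with several nodes
// behind one NAT gateway it also removes perfectly healthy ones.
func removePeersSharingEndpointIP(client *wgctrl.Client, dev string, publicIP net.IP, keep wgtypes.Key) error {
	device, err := client.Device(dev)
	if err != nil {
		return err
	}
	var removals []wgtypes.PeerConfig
	for _, peer := range device.Peers {
		if peer.Endpoint != nil && peer.Endpoint.IP.Equal(publicIP) && peer.PublicKey != keep {
			removals = append(removals, wgtypes.PeerConfig{PublicKey: peer.PublicKey, Remove: true})
		}
	}
	if len(removals) == 0 {
		return nil
	}
	return client.ConfigureDevice(dev, wgtypes.Config{Peers: removals})
}

func main() {
	client, err := wgctrl.New()
	if err != nil {
		log.Fatal(err)
	}
	defer client.Close()

	// Placeholder values: the NAT gateway's public IP and the public key of the
	// peer being added (and therefore kept).
	newPeer, _ := wgtypes.GeneratePrivateKey()
	if err := removePeersSharingEndpointIP(client, "flannel-wg", net.ParseIP("203.0.113.10"), newPeer.PublicKey()); err != nil {
		log.Fatal(err)
	}
}
```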

@thomasferrandiz (Contributor) commented:

@ludost I think we're all OK to merge the PR now.
Unfortunately we had an issue with the CI, which is now fixed in the master branch.
Could you squash your commits and rebase on top of it to pick up the fix? Then we'll be able to merge the PR.

Thanks!

@ludost force-pushed the Feature-multiNodeBehindSingleIP branch from 6ab2456 to 233e281 on October 18, 2024
@thomasferrandiz merged commit 4ec57ab into flannel-io:master on Oct 22, 2024
8 checks passed