Bugfix/feature: Allow multiple nodes behind a single public IP address, e.g. remote nodes behind NAT. #2073
Description
Although K8s, Flannel, and Wireguard all support running k8s nodes in a private network behind a NAT-applying firewall, the Wireguard backend in Flannel explicitly breaks this setup by over-zealously cleaning up remote nodes. The resulting behavior: when a new node is started, the existing node in the same private network has its routing broken without any warning or user feedback, leading to unexpected network timeouts. This PR removes the call to that cleanup code in the Wireguard backend.
I consider this change a bug fix, but maybe it should be treated as a new feature given the potential impact. Please advise me on what level of configurability should be added around this new behavior.
We (@ Asimovo B.V., www.asimovo.com) have a production setup with K3s client nodes for end users, running as a sort of "roadwarrior VPN" clients to our cloud solution. With this fix applied, we can support multiple clients from the same private network (e.g. several employees of the same company). We are running this setup without encountering any issues regarding old/stale routes, etc.
Discussion
For background, here is a description of the current identification of nodes in the related components:
No layer here considers the remote public IP address as a unique key, and therefore this IP address doesn't need to be unique. Wireguard is brilliant at solving this specific NAT-traversal challenge, especially at figuring out the changed/reassigned remote UDP port. So there is no need for Flannel to remove Wireguard endpoints when multiple publicKeys are found behind the same IP address. In doing so, Flannel breaks an otherwise fully working feature.
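To illustrate the point, here is a minimal, self-contained sketch of a peer table keyed by public key. It is not Flannel's actual code: `Peer`, `PeerTable`, `AddPeer`, `AddPeerWithIPCleanup`, and `hostOf` are hypothetical names, and the cleanup function only approximates the behavior described in this PR. The addresses are documentation-range examples.

```go
package main

import (
	"fmt"
	"net"
)

// Peer is a simplified stand-in for a Wireguard peer entry
// (hypothetical type, not Flannel's or wgctrl's actual struct).
type Peer struct {
	PublicKey string // unique identity of the peer
	Endpoint  string // public "ip:port" as seen after NAT
}

// PeerTable keys peers by public key, so endpoint IPs may repeat.
type PeerTable map[string]Peer

// hostOf returns the IP part of an "ip:port" endpoint, or "" if it is malformed.
func hostOf(endpoint string) string {
	host, _, err := net.SplitHostPort(endpoint)
	if err != nil {
		return ""
	}
	return host
}

// AddPeer inserts or updates a peer without touching any other peer.
func (t PeerTable) AddPeer(p Peer) {
	t[p.PublicKey] = p
}

// AddPeerWithIPCleanup approximates the problematic behavior: before adding
// the new peer, it drops every other peer that shares the same public IP.
func (t PeerTable) AddPeerWithIPCleanup(p Peer) {
	ip := hostOf(p.Endpoint)
	for key, other := range t {
		if key != p.PublicKey && hostOf(other.Endpoint) == ip {
			delete(t, key) // deleting during range is safe in Go
		}
	}
	t[p.PublicKey] = p
}

func main() {
	// Two nodes behind the same NAT: same public IP, different UDP ports.
	a := Peer{PublicKey: "keyA", Endpoint: "203.0.113.7:51820"}
	b := Peer{PublicKey: "keyB", Endpoint: "203.0.113.7:40123"}

	withCleanup := PeerTable{}
	withCleanup.AddPeerWithIPCleanup(a)
	withCleanup.AddPeerWithIPCleanup(b)

	withoutCleanup := PeerTable{}
	withoutCleanup.AddPeer(a)
	withoutCleanup.AddPeer(b)

	fmt.Println("peers with IP-based cleanup:", len(withCleanup))
	fmt.Println("peers keyed by public key only:", len(withoutCleanup))
}
```

With the IP-based cleanup, adding the second node silently evicts the first; keyed by public key alone, both peers remain.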
Todos
As mentioned, we might want to place the new behavior behind a feature gate in the configuration, unless you agree this is just a bug fix and the behavior should have been enabled all along.
Depending on this choice between fix and feature, the documentation impact is either:
Release Note