Pod can't access clusterip service for another pod with endpoint on the same node #1702
To reproduce: after creating a 3-node kubeadm cluster (1.24.8, in case that matters) with flannel 0.20.2, I apply the manifest in bugtest.yaml ($ cat bugtest.yaml):
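The contents of bugtest.yaml aren't reproduced here; a minimal sketch of a manifest that exercises the same path might look like this (pod/service names, images, and node hostnames are illustrative, not the reporter's originals):

```bash
# Hypothetical stand-in for bugtest.yaml: one server pod, a ClusterIP service in
# front of it, and a client pod on the same node plus one on another node.
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: web
  labels:
    app: web
spec:
  nodeName: node1              # endpoint pod pinned to node1
  containers:
  - name: web
    image: nginx:1.23
    ports:
    - containerPort: 80
---
apiVersion: v1
kind: Service
metadata:
  name: web
spec:
  selector:
    app: web
  ports:
  - port: 80
    targetPort: 80
---
apiVersion: v1
kind: Pod
metadata:
  name: client-node1
spec:
  nodeName: node1              # client on the same node as the endpoint
  containers:
  - name: client
    image: busybox:1.36
    command: ["sleep", "3600"]
---
apiVersion: v1
kind: Pod
metadata:
  name: client-node2
spec:
  nodeName: node2              # client on a different node
  containers:
  - name: client
    image: busybox:1.36
    command: ["sleep", "3600"]
EOF
```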
This works (node2 pod to pod IP).
This works (node2 pod to ClusterIP).
This works (node1 pod to pod IP).
But this does not (node1 pod to ClusterIP); it just hangs. (Rough equivalents of these four checks are sketched below.)
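Assuming the pod/service names from the sketch above, the four checks amount to:

```bash
POD_IP=$(kubectl get pod web -o jsonpath='{.status.podIP}')
SVC_IP=$(kubectl get svc web -o jsonpath='{.spec.clusterIP}')

kubectl exec client-node2 -- wget -qO- -T 5 "http://$POD_IP"   # works
kubectl exec client-node2 -- wget -qO- -T 5 "http://$SVC_IP"   # works
kubectl exec client-node1 -- wget -qO- -T 5 "http://$POD_IP"   # works
kubectl exec client-node1 -- wget -qO- -T 5 "http://$SVC_IP"   # hangs (the bug)
```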
This could be a bug, maybe related to #1703, or it could be CentOS-related. I'll try your setup.
Could you try to do the following?
Thanks @rbrtbnfgl. With a clean setup, before the above pods/service are created:
Then creating the pods/service as per my steps:
And here's the iptables-save after reproducing it:
Please ignore the extra entry in the nat table; I had added it while troubleshooting, as per another incident I saw on the issues page, but realized later that that issue was fixed in 0.20.2. The behavior is the same without that entry as well.
I have a couple of other clusters running older versions of k8s (1.21.x) and flannel (0.12 and 0.17) which don't exhibit this issue. When I use the older flannel versions (0.12 and 0.17) against this 1.24.8 cluster, the issue remains. So perhaps this isn't flannel-related, but maybe kube-proxy?
It could be related to kube-proxy and CentOS, because I tested your env with Ubuntu and it worked.
I am hitting this issue only for UDP datagrams, when sending traffic from Node1 to a pod on Node2 via a ClusterIP service. My environment:
I am running a Kubespray installation in AWS, which uses flannel:
Traffic description: Node1 --> ClusterIP (for pod on Node2). Node1 is sending traffic to a ClusterIP covering a pod on Node2. TCP works fine; UDP datagrams get dropped. I was able to trace the datagram to Node2: it arrives there fine with proper IP addresses and port (vxlan 8472). The first header has a destination IP set to the ClusterIP of the service; the inside header has the destination IP of the pod (expected scenario). After stripping the first header, the packet/datagram disappears in the kernel. I made sure there are no rules dropping this in iptables (a couple of people verified this).
I have the setup automated in case more details are needed, and I also have a tcpdump trace of the packet, but I am reluctant to share that without permission. I have tried several suggestions mentioned in similar bugs, either here, on the Kubespray page, or on the AWS page (like disabling source/destination address checking); none of them worked.
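The exact capture commands aren't shown above; a trace along these lines (interface name is illustrative) is one way to confirm that the encapsulated datagram reaches Node2 and to watch what happens after decapsulation:

```bash
# On Node2: outer vxlan traffic on the physical interface, then the
# decapsulated packets on the flannel vxlan interface.
sudo tcpdump -ni eth0 udp port 8472
sudo tcpdump -ni flannel.1 udp
```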
If you test this:
Does it fix it?
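The suggested command is elided above; judging from the later mentions of "the ethtool fix" and issue #1279, it is presumably the vxlan checksum-offload workaround, something like:

```bash
# Disable TX checksum offload on flannel's vxlan interface (run on each node).
sudo ethtool -K flannel.1 tx-checksum-ip-generic off
```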
Yes, it does! Please feel free to educate me and/or send links about the reason behind this; I would like to understand it better in order to avoid it in the future. Or if there is another ticket where this was already discussed, I will look there. Thank you in advance. EDIT: I guess this is the issue where it is discussed more in depth, right? #1279
Flannel v0.20.2 should fix this bug.
One last question: will v0.20.2 have a "general" fix that works regardless of the kernel version?
Yes.
My issue, which I was able to consistently reproduce, no longer reproduces after I rebuilt the entire cluster. If this happens again, I'll report it, but for the time being I think this was just my environment.
Thanks for reporting.
@rbrtbnfgl This issue came back, unfortunately. After rebooting all nodes in the cluster, I was able to see the same problem again. I rebuilt the cluster once more and things went back to working; then I rebooted all the nodes, and it was back to the broken state.
Are you using flannel v0.20.2 or only the ethtool fix?
I'm on 0.20.2 and had checked the ethtool fix as well just in case, but no difference.
Now that I can reproduce the problem again by rebooting a node after a cluster rebuild, I did some iptables-save runs before the reboot (when it works) and after (when it doesn't) and found some differences. Better yet, I confirmed that after a rebuild, and after any number of reboots, I can get back to a working state with a simple iptables-save | iptables-restore, which is really odd.
Again, I don't know why this happens; perhaps it's a race condition between when flannel first starts up and when it detects and corrects things. The new nat table has two additional entries. Specifically:
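For reference, a before/after comparison along these lines (paths illustrative) is one way to capture the difference described above, with the save/restore workaround at the end:

```bash
iptables-save -t nat > /tmp/nat-before-reboot.txt
# ... reboot the node and wait for flannel to come back up ...
iptables-save -t nat > /tmp/nat-after-reboot.txt
diff -u /tmp/nat-before-reboot.txt /tmp/nat-after-reboot.txt

# The oddly effective workaround observed above: re-apply the saved rules.
iptables-save | iptables-restore
```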
@rbrtbnfgl could you please re-open this issue? I don't see a way to do it myself. Some new developments.
And according to https://github.com/coreos/go-iptables/blob/d2b8608923d15b0800af7d9f4bb6dea90e03b7d5/iptables/iptables.go#L661 it should not support --random-fully, and I've confirmed that it doesn't by trying to create a rule with that option. Unfortunately, flannel seems to want to use this flag when it's trying to reset/restore rules, which is why I think I get the higher-ordered MASQUERADE rules: I think it's not properly deleting those before re-adding them. I'll post some verbose logs when I have a better understanding of what's going on. For now, I changed the code so that it always takes the NOT has_random_fully code path, and also changed:
to
and now, on a reboot, pod->service on the same node works correctly.
I haven't wrapped my head around all these rules, so there's a high chance I broke something else as a result, but I wanted to provide that detail. I should have more logs and details tomorrow, but if you see something obvious, please let me know. Thank you.
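A quick way to check whether the host's iptables accepts --random-fully, as described above (the scratch chain is created and removed immediately; its name is arbitrary):

```bash
iptables --version   # --random-fully needs iptables >= 1.6.2 plus kernel support

iptables -t nat -N RANDOM_FULLY_TEST 2>/dev/null
if iptables -t nat -A RANDOM_FULLY_TEST -j MASQUERADE --random-fully 2>/dev/null; then
  echo "--random-fully supported"
else
  echo "--random-fully NOT supported"
fi
iptables -t nat -F RANDOM_FULLY_TEST 2>/dev/null
iptables -t nat -X RANDOM_FULLY_TEST 2>/dev/null
```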
Thanks @oe-hbk for finding this. I'll check it and make a PR to fix it.
Thanks @rbrtbnfgl, I have a question. For the --random-fully option, the iptables version detection happens inside the flannel container, and the version in the standard flannel image supports --random-fully, while my hosts have a version which doesn't. How does this work? Can the iptables client in the container do what it needs to do on a host kernel that might not support it?
I'll check this too. That's strange behaviour.
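One way to compare the two iptables versions in question (namespace, label, and container name assume the stock kube-flannel manifest):

```bash
# iptables shipped inside the flannel image
POD=$(kubectl -n kube-flannel get pods -l app=flannel -o jsonpath='{.items[0].metadata.name}')
kubectl -n kube-flannel exec "$POD" -c kube-flannel -- iptables --version

# iptables on the node itself
iptables --version
```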
I think I have an explanation for why things work right after the cluster build but not after a reboot. After cluster install, because the cni-plugin in kube-flannel.yaml enables the portmap plugin, the POSTROUTING chain has:
After a reboot, however, the chain created by the CNI portmap plugin is no longer there.
I am not sure why the rules created by the CNI portmap plugin don't survive a reboot. Without them, the FLANNEL-POSTRTG rules are not in the correct order; the mark rule has to happen after the first two listed.
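To see the ordering being discussed, listing the relevant nat chains with their rule specs is enough (FLANNEL-POSTRTG is flannel's chain; CNI-HOSTPORT-MASQ is the one the portmap plugin normally creates):

```bash
iptables -t nat -S POSTROUTING
iptables -t nat -S FLANNEL-POSTRTG
iptables -t nat -S CNI-HOSTPORT-MASQ 2>/dev/null || echo "portmap chain missing"
```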
All the iptables rules are deleted on reboot. I don't think the ordering is related to portmap. Flannel always checks that every rule is created, and when something is missing it should delete all the flannel-related rules and re-create them in the same order.
I get that the iptables rules aren't saved across reboots (unless you run a save/restore, as used to be the case), but I still don't know how the CNI portmap-created rules are supposed to persist. They do seem to play a role, at least in my environment, in flannel working correctly after the cluster is initialized, and when they're gone after the reboot, the issue appears. Anyway, thank you so much for looking into this. Please let me know if I can help test in any way.
So it could be related to the two missing rules:
Somehow they should be responsible for the cluster-IP forwarding. Could you also provide the flannel pod logs?
Here are 3 pod logs. After cluster create, before adding bugtest.yaml:
After applying bugtest.yaml (things work at this point because we haven't rebooted yet), nothing new gets added to the pod log. Now after a reboot, at which point the bug is triggered:
The logs on the flannel side seem right. I'm checking whether it could be an issue related to portmap.
Are you using containerd? And which version?
Yes I am.
It could be containerd-related: containerd/containerd#7843
Upgraded to 1.6.15 (latest) and rebuilt the cluster, but no luck; broken once again after a reboot.
I will also try upgrading k8s to 1.24.10.
Same thing with 1.24.10.
Could you check the kubelet logs too? Maybe we can find something there.
Nothing in the kubelet logs. Should this pod-to-service traffic (to a pod on the same node) be masqueraded? I am trying to follow the packet flow and having a hard time figuring out which path it should take.
If you are using the ClusterIP, it's not masquerading the source IP but only changing the destination IP. There isn't any interface listening on the service IP; iptables should change the destination IP to the IP of a pod that is backing that service.
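With kube-proxy in iptables mode, that DNAT can be inspected in kube-proxy's nat chains, for example (the address is illustrative; substitute the service's ClusterIP):

```bash
iptables -t nat -S KUBE-SERVICES | grep 10.96.0.100
iptables -t nat -S | grep KUBE-SVC-
```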
Thank you @rbrtbnfgl. I finally found the issue. After comparing so many before/after-reboot iptables dumps, config files, tcpdumps, and sysctls, one thing I hadn't checked was loaded kernel modules: br_netfilter was not loaded after the reboot. Hopefully this helps someone. The docs lay this out, but I missed it: https://kubernetes.io/docs/setup/production-environment/container-runtimes/#forwarding-ipv4-and-letting-iptables-see-bridged-traffic
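A quick check for the module and the bridge sysctl it provides:

```bash
lsmod | grep br_netfilter                  # is the module loaded?
sysctl net.bridge.bridge-nf-call-iptables  # should be 1; the key only exists once br_netfilter is loaded
```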
One final thing before I close the issue, in case it helps someone else in the future. It's not that the br_netfilter module is newly required; it has long been. I was config-managing the cluster node builds through Ansible and had used the https://docs.ansible.com/ansible/latest/collections/community/general/modprobe_module.html module to load br_netfilter. In my older clusters with older flannel I did not run into any networking issues, so I never bothered to check, but that module does not persist the module load across reboots, which one would expect it to do, either by default or via an option. Anyway, with the new k8s version + flannel and with the module missing after reboot, things broke. So if you're using that Ansible module to load kernel modules, be careful; a persistent setup is sketched below.
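For anyone hitting the same thing, the persistent setup from the linked Kubernetes doc amounts to:

```bash
# Load the modules now and persist them across reboots.
cat <<'EOF' | sudo tee /etc/modules-load.d/k8s.conf
overlay
br_netfilter
EOF
sudo modprobe overlay
sudo modprobe br_netfilter

# Required sysctl params; these persist across reboots.
cat <<'EOF' | sudo tee /etc/sysctl.d/k8s.conf
net.bridge.bridge-nf-call-iptables  = 1
net.bridge.bridge-nf-call-ip6tables = 1
net.ipv4.ip_forward                 = 1
EOF
sudo sysctl --system
```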
Flannel with VXLAN suffers from a bug[1] where pods on the same node are unable to send traffic to a service's ClusterIP when the endpoint is on the same node. This is due to improper NATTing of the return traffic. The fix is to load the br_netfilter module as specified in the kubernetes doc.[2] [1] flannel-io/flannel#1702 [2] https://kubernetes.io/docs/setup/production-environment/container-runtimes/#forwarding-ipv4-and-letting-iptables-see-bridged-traffic Change-Id: Ic182bba9d480421c2cb581558ebde8dfb20421c8
This is with the latest flannel (0.20.2) using vxlan (with and without DirectRouting enabled) and firewalld disabled, on identical nodes running the latest CentOS 7.
When node1 is running pod1 and pod2, and there is a ClusterIP service fronting pod2, pod1 cannot route traffic to the ClusterIP, but it can route to pod2's IP.
If pod2 moves to node2 and pod1 remains on node1, pod1 is then able to route traffic to both pod2's IP and the ClusterIP.
Expected Behavior
Traffic should flow from a pod to a ClusterIP even when the ClusterIP's endpoint is a pod on the same node as the originating pod.
Current Behavior
Traffic from pod1 does not flow to pod2 when pod2 is an endpoint for a ClusterIP service and pod1 tries to communicate with this ClusterIP. This is only broken if pod1 and pod2 are on the same node; pod1 is able to contact pod2 directly via pod2's IP.
Possible Solution
Steps to Reproduce (for bugs)
Context
Your Environment