
Disable tx and rx offloading on VXLAN interfaces #1282

Closed
aojea wants to merge 1 commit

Conversation

aojea
Contributor

@aojea aojea commented Apr 13, 2020

Description

It seems that some kernel versions have issues with VXLAN checksum
offloading, causing flannel to stop working in scenarios where the
traffic is encapsulated but the checksum is wrong and the packet is
discarded by the receiver.

A known workaround is to disable offloading on the flannel interface:

ethtool --offload flannel.1 rx off tx off

This PR forces offloading to always be disabled on VXLAN interfaces.
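For illustration only, here is a minimal Go sketch of automating that workaround by shelling out to ethtool (assuming ethtool is available where flannel runs). This is not the code in this PR, which presumably applies the equivalent setting through the ethtool/netlink interfaces when it creates the VXLAN device; treat it purely as a sketch of the idea.

```go
// disableoffload.go - hedged sketch, not the code in this PR.
package main

import (
	"fmt"
	"log"
	"os/exec"
)

// disableChecksumOffload mirrors `ethtool --offload <iface> rx off tx off`.
func disableChecksumOffload(iface string) error {
	out, err := exec.Command("ethtool", "--offload", iface, "rx", "off", "tx", "off").CombinedOutput()
	if err != nil {
		return fmt.Errorf("disabling checksum offload on %s: %v (%s)", iface, err, out)
	}
	return nil
}

func main() {
	// flannel's VXLAN interface is conventionally named flannel.1.
	if err := disableChecksumOffload("flannel.1"); err != nil {
		log.Fatal(err)
	}
}
```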

Todos

  • Tests
  • Documentation
  • Release note

Release Note

disable rx and tx checksum offloading on VXLAN interfaces

@aojea
Contributor Author

aojea commented Apr 13, 2020

Fixes #1282

It seems that some kernel versions have issues with VXLAN checksum
offloading, causing flannel to stop working in scenarios where the
traffic is encapsulated but the checksum is wrong and the packet is
discarded by the receiver.

A known workaround is to disable offloading on the flannel interface:

ethtool --offload flannel.1 rx off tx off

Flannel disables tx and rx offloading on VXLAN interfaces.
@aojea
Contributor Author

aojea commented Apr 13, 2020

@aojea changed the title from "Make VXLAN checksum configurable" to "Disable tx and rx offloading on VXLAN interfaces" on Apr 13, 2020
@Capitrium

@aojea I just tested this on my dev cluster, still running into #1243. Seems like this might be more related to #1279, but I'm not seeing that on my cluster so I can't be sure.

@aojea
Contributor Author

aojea commented Apr 14, 2020

@Capitrium it seems that other people who have tested the patch have it working;
maybe you need to restart flannel so the interfaces are recreated and the workaround is applied?

@aojea
Contributor Author

aojea commented Apr 14, 2020

ping @rajatchopra :-)

@Miouge1

Miouge1 commented Apr 14, 2020

@aojea any idea on which kernel versions are affected?

@aojea
Contributor Author

aojea commented Apr 14, 2020

No idea, but this issue created an explosion of issues opened against Kubernetes and related projects. It was also discussed on the sig-network mailing list: https://groups.google.com/d/msg/kubernetes-sig-network/JxkTLd4M8WM/EW8O1E0PAgAJ
The information is scattered across multiple places, sorry 😄

@Capitrium

@aojea Yeah, looks like it was an issue with configuration or stale pods on my end - rebuilt and redeployed, seems like it's working now.

I was still seeing networking issues with pods on one node after deploying the patch and had to kill the node, but I was doing a fair amount of testing with different kube-proxy/kube-router/flannel versions and probably broke something else in the process. Most existing nodes and all new nodes are working properly. 👍

@hakman
Contributor

hakman commented Apr 16, 2020

Very cool @aojea. This created a lot of discussion in the Kops project as well. Really happy to see that there is a solution :).

Contributor

@rajatchopra rajatchopra left a comment

Good work @aojea
Thanks.

Who is tracking the 'real' fix in the kernel? @dcbw Any idea?
Is there a RedHat bug on it?
This fix, while great, is only patchwork to hide the actual issue.

@rajatchopra
Contributor

@aojea I am not sure if we should disable rx/tx checksum always. A config parameter maybe? This is a temporary hack, right?

@aojea
Contributor Author

aojea commented Apr 21, 2020

Who is tracking the 'real' fix in the kernel? @dcbw Any idea?

@dcbw posted more info here: kubernetes/kubernetes#88986 (comment)
There is a patch upstream in the kernel.

@aojea I am not sure if we should disable rx/tx checksum always. A config parameter maybe? This is a temporary hack, right?

@rajatchopra I've considered that, but

   -  UDP Checksum: It SHOULD be transmitted as zero.  When a packet
     is received with a UDP checksum of zero, it MUST be accepted
     for decapsulation.  Optionally, if the encapsulating end point
     includes a non-zero UDP checksum, it MUST be correctly
     calculated across the entire packet including the IP header,
     UDP header, VXLAN header, and encapsulated MAC frame.  When a
     decapsulating end point receives a packet with a non-zero
     checksum, it MAY choose to verify the checksum value.  If it
     chooses to perform such verification, and the verification
     fails, the packet MUST be dropped.  If the decapsulating
     destination chooses not to perform the verification, or
     performs it successfully, the packet MUST be accepted for
     decapsulation.

and even if we make it configurable, we would have to disable offloading by default or people will keep hitting the bug ... is it worth the effort? I personally don't think anybody is going to enable it afterwards, and this will protect flannel from future breakages or odd scenarios like the one described in the weave issue 🤷

@rikatz

rikatz commented Apr 21, 2020

@aojea any idea on which kernel versions are affected?

Kernel 3.10.0, which is used in RHEL 7/CentOS 7 with minor variations (I've checked, and this happens in CentOS from 7.0 to 7.7 and with the latest RH-provided kernel).

@hakman
Contributor

hakman commented Apr 22, 2020

@rajatchopra any plans for a bugfix release containing this?
The Kops 1.17 release is almost ready and we would like to know whether we should wait for this or pick it up in the next release. Thanks! :)
kubernetes/kops#8614

@hakman
Contributor

hakman commented May 1, 2020

The "proper" fix for this was accepted by netfilter.

The comment says that the problem seems to be there from day one, although it was probably not visible before UDP tunnels were implemented.
This makes this fix the best solution for now for anyone wanting to upgrade to k8s 1.17 and use flannel.

@zhangguanzhang
Contributor

zhangguanzhang commented May 27, 2020

Oddly enough, if I run VXLAN using the flannel binary directly, it works, whereas if I deploy it using kubeadm, SVC access times out after 63 seconds.

@zhangguanzhang
Contributor

zhangguanzhang commented May 28, 2020

I did a controlled comparison and found the cause of the problem. The OS is CentOS 7.6.

| deploy type | flannel version | flannel is running in a pod? | 63 sec delay? |
| --- | --- | --- | --- |
| kubeadm | v0.11.0 | yes | yes |
| kubeadm | v0.12.0 | yes | yes |
| kubeadm | v0.11.0 | no | yes |
| kubeadm | v0.12.0 | no | yes |
| ansible | v0.11.0 | yes | no |
| ansible | v0.12.0 | yes | no |
| ansible | v0.11.0 | no | no |
| ansible | v0.12.0 | no | no |

The comparison above proved that flannel was not the source of the problem, so I made a second comparison:

| type (kubeadm or ansible) | kube-proxy version | kube-proxy is running in a pod? | 63 sec delay? |
| --- | --- | --- | --- |
| kubeadm | v1.17.5 | yes | yes |
| kubeadm | v1.17.5 | no | no |
| kubeadm | v1.16.9 | yes | no |
| kubeadm | v1.16.9 | no | no |
| ansible | v1.17.5 | yes | yes |
| ansible | v1.17.5 | no | no |

So, if kube-proxy is running in a pod, it will trigger the kernel bug. I found this by comparing commits on GitHub:
kubernetes/kubernetes@fed5823
So I hacked the Docker image; here is the Dockerfile:

FROM k8s.gcr.io/kube-proxy:v1.17.5
RUN rm -f /usr/sbin/iptables && \
    clean-install iptables

After I configured kube-proxy to use the hacked image, it never caused 63-second delays in VXLAN mode.

@danwinship please take a look at why iptables is not installed in the Docker image.

@aojea
Contributor Author

aojea commented May 28, 2020

@zhangguanzhang impressive work

iptables is installed in that image; however, due to another bug, it has to use the latest version and a script to detect whether it should use the nft or legacy backend:

# Install latest iptables package from buster-backports
RUN echo deb http://deb.debian.org/debian buster-backports main >> /etc/apt/sources.list; \
    apt-get update; \
    apt-get -t buster-backports -y --no-install-recommends install iptables

so, do you think that the iptables version is the trigger?
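As an aside, the backend-detection idea mentioned above can be sketched as follows. This is a hedged illustration, not the actual iptables-wrapper script shipped in the kube-proxy image: it simply guesses the backend by counting the rules reported by the real iptables-legacy-save and iptables-nft-save binaries.

```go
// backendprobe.go - hedged sketch of backend detection, not the real wrapper script.
package main

import (
	"fmt"
	"os/exec"
	"strings"
)

// countRules returns how many rule lines the given save command reports.
func countRules(saveCmd string) int {
	out, err := exec.Command(saveCmd).Output()
	if err != nil {
		return 0 // treat a missing or failing backend as having no rules
	}
	n := 0
	for _, line := range strings.Split(string(out), "\n") {
		if strings.HasPrefix(line, "-A ") {
			n++
		}
	}
	return n
}

func main() {
	legacy := countRules("iptables-legacy-save")
	nft := countRules("iptables-nft-save")
	backend := "legacy"
	if nft > legacy {
		backend = "nft"
	}
	fmt.Printf("legacy rules: %d, nft rules: %d -> using %s backend\n", legacy, nft, backend)
}
```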

@jhohertz

I do. The recent Kubernetes 1.16.10 release is now affecting 1.16 for the first time, so I studied the diffs yesterday. The only thing I see is that there is an iptables container image whose version was bumped from 11.x to 12.x, which 1.17 also did.

@zhangguanzhang
Contributor

The problem, in terms of docker image modification, is with iptables

@danwinship

danwinship commented May 28, 2020

The iptables packaging bump triggered the --random-fully bug because the old (11.x) kube images had a version of iptables that was too old to support --random-fully, while the new (12.x) images, in addition to having the whole legacy-vs-nft wrapper thing, also have a much newer version of iptables that does support --random-fully. Unfortunately it turns out that just because the iptables binary supports it doesn't mean the kernel supports it. There's a PR to fix this. But anyway, this is a completely separate problem from the vxlan offload bug.

[EDIT:
Narrator: It wasn't a completely separate problem.
]
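To make the "binary supports it, kernel may not" point concrete, here is a hedged Go sketch of one way to probe kernel support at runtime: try to load a MASQUERADE --random-fully rule into a throwaway chain and treat a failure as lack of support. This is not how kube-proxy's detection actually works, and the chain name is made up for the example.

```go
// randomfullyprobe.go - hedged sketch of probing kernel support for
// MASQUERADE --random-fully; not kube-proxy's actual detection logic.
package main

import (
	"fmt"
	"os/exec"
)

const probeChain = "PROBE-RANDOM-FULLY" // hypothetical scratch chain name

func kernelSupportsRandomFully() bool {
	// Create a scratch chain in the nat table; ignore failure if it already exists.
	_ = exec.Command("iptables", "-t", "nat", "-N", probeChain).Run()
	defer func() {
		// Clean up the scratch chain regardless of the probe result.
		_ = exec.Command("iptables", "-t", "nat", "-F", probeChain).Run()
		_ = exec.Command("iptables", "-t", "nat", "-X", probeChain).Run()
	}()
	// If the kernel lacks support, inserting the rule fails even though
	// the userspace iptables binary understands the flag.
	err := exec.Command("iptables", "-t", "nat", "-A", probeChain,
		"-j", "MASQUERADE", "--random-fully").Run()
	return err == nil
}

func main() {
	fmt.Println("kernel accepts MASQUERADE --random-fully:", kernelSupportsRandomFully())
}
```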

@jhohertz

Yet the workaround often suggested seems to be disabling offload, and it does have an effect on the issue

@MarkRose

Stable kernels 5.6.13, 5.4.41, 4.19.123, 4.14.181 and later have the checksum patch included.

@zhangguanzhang
Contributor

@MarkRose Which kernels in CentOS?

@jhohertz

Trying to test a Flatcar image with one of those kernels, but issues at Quay are making for a fun day...

@jhohertz

jhohertz commented May 28, 2020

Just going to summarise what I know about this for sure now:

  1. Kubernetes updating its iptables image to 12.x breaks it (nothing else networking-related in the 1.16.10 diff)
  2. The systemd workaround that disables offloading on flannel.1 works OK in superficial testing
  3. The kernel patch, now in the stable trees, does nothing to help matters (the latest Flatcar has a sufficiently recent kernel as of a couple of days ago and was used to test)

Which says that while it's related to VXLAN offload in some manner (because the workaround seems to work...), it is not that exact kernel bug (because the patch makes no difference when tested).

Looking to understand this random-fully thing now.

@danwinship

@jhohertz and just to confirm, you're running RHEL/CentOS 7 with kernel 3.10.something?

@jhohertz

Not here, I have been using coreos/flatcar container linux

@rikatz

rikatz commented May 28, 2020

@danwinship just as a follow-up from our sig-network meeting, I'll post the test results from the scenarios with --random-fully disabled in kube-proxy and flannel, and also the generated rules for each scenario.

This is the same environment I used to confirm that disabling tx offload solves the problem (CentOS 7 + Flannel 0.11), but in this case I'm re-enabling tx offload everywhere:

  • Before each test, iptables rules are cleaned in the hosts so flannel and kube-proxy recreate them
  • In NodePort scenarios, I'm calling my own IP:Port as part of the cluster to reach a Pod running outside of it

Scenario 1 - Both kube-proxy and flannel with MASQUERADE rules created with --random-fully

NodePort: Exactly 63s
ClusterIP: Exactly 63s

IPTables rules containing --random-fully:

-A POSTROUTING -s 10.244.0.0/16 ! -d 224.0.0.0/4 -j MASQUERADE --random-fully
-A POSTROUTING ! -s 10.244.0.0/16 -d 10.244.0.0/16 -j MASQUERADE --random-fully
-A KUBE-POSTROUTING -m comment --comment "kubernetes service traffic requiring SNAT" -m mark --mark 0x4000/0x4000 -j MASQUERADE --random-fully

Scenario 2 - Original kube-proxy with random-fully and flannel recompiled without it
NodePort: Exactly 63s
ClusterIP: Exactly 63s

IPTables rules containing --random-fully:

-A KUBE-POSTROUTING -m comment --comment "kubernetes service traffic requiring SNAT" -m mark --mark 0x4000/0x4000 -j MASQUERADE --random-fully

Scenario 3 - kube-proxy without random-fully and original flannel with random-fully enabled
NodePort: 0.03s
ClusterIP: 0.007s

IPTables rules containing --random-fully:

-A POSTROUTING -s 10.244.0.0/16 ! -d 224.0.0.0/4 -j MASQUERADE --random-fully
-A POSTROUTING ! -s 10.244.0.0/16 -d 10.244.0.0/16 -j MASQUERADE --random-fully

Scenario 4 - kube-proxy and flannel without random-fully
NodePort: 0.005s
ClusterIP: 0.03s

IPTables rules containing --random-fully: NONE


This way, we can say almost for certain that the insertion of --random-fully into the MASQUERADE rules triggered an already existing kernel bug in the VXLAN code path :)

I'll also post this in the original issue in the k/k repo; please let me know if I can help with any further tests.
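A side note on the "Exactly 63s" numbers: they line up with Linux's TCP SYN retransmission backoff, where a dropped SYN is retried at roughly 1, 3, 7, 15, 31 and 63 seconds after the initial attempt. Below is a hedged Go sketch of the kind of timing probe used for these measurements; the target address is a placeholder, not taken from this thread.

```go
// svctiming.go - hedged sketch of a Service connection-timing probe.
package main

import (
	"fmt"
	"log"
	"net"
	"time"
)

func main() {
	target := "10.96.0.10:443" // hypothetical ClusterIP:port; substitute your own Service

	start := time.Now()
	conn, err := net.DialTimeout("tcp", target, 90*time.Second)
	elapsed := time.Since(start)
	if err != nil {
		log.Fatalf("connect to %s failed after %v: %v", target, elapsed, err)
	}
	conn.Close()

	fmt.Printf("connected to %s in %v\n", target, elapsed)
	if elapsed > 60*time.Second {
		fmt.Println("connection stalled ~63s: consistent with the VXLAN checksum / --random-fully issue")
	}
}
```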

@rikatz

rikatz commented Jun 15, 2020

Also, I've seen some discussion here about iptables-legacy vs. nft, and this is starting to hit CentOS 8 users: kubernetes/kubernetes#91331

I was going to ask what the effort and gain would be of getting the iptables-wrapper to work here, but as noted in its README it seems it wouldn't solve the CentOS 8 case :/

@aojea
Contributor Author

aojea commented Jun 16, 2020

/close

This is no longer needed thanks to @danwinship 👏
kubernetes/kubernetes#92035

@aojea aojea closed this Jun 16, 2020
@aojea aojea deleted the vxlancsum branch June 16, 2020 07:56
@hakman
Contributor

hakman commented Jun 16, 2020

@aojea will kubernetes/kubernetes#92035 be cherry-picked to older k8s releases?

@aojea
Contributor Author

aojea commented Jun 16, 2020

@aojea will kubernetes/kubernetes#92035 be cherry-picked to older k8s releases?

It should be backported to the supported releases. Just ping the author in the PR to ask if he can do it; if he can't, the process is described here: https://github.com/kubernetes/community/blob/master/contributors/devel/sig-release/cherry-picks.md

@hakman
Contributor

hakman commented Jun 16, 2020

Thanks @aojea and @danwinship 😄


@Fluxay-666

Thanks, I used the latest version v0.21.5 plus this PR patch to successfully fix the 63-second connection delay problem.
