-
Notifications
You must be signed in to change notification settings - Fork 4.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Disable TX checksum offload for Flannel VXLAN #9074
Disable TX checksum offload for Flannel VXLAN #9074
Conversation
46b0cb2
to
52e8f14
Compare
const serviceName = "flannel-tx-checksum-offload-disable.service" | ||
|
||
manifest := &systemd.Manifest{} | ||
manifest.Set("Unit", "Description", "Disable TX checksum offload on flannel.1") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
does canal also use the name flannel.1
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, I tested this with Canal
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah we run canal
and flannel.1
is the interface name
ubuntu@ip-10-129-0-88:~$ ifconfig | grep flannel
flannel.1: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 8951
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not in my setup. By the time flannel-tx-checksum-offload-disable
is started, flannel device is not ready.
admin@ip-10-63-80-157:~$ ip addr
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group
...
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9001 qdisc mq state UP group
...
3: docker0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state
...
This change breaks my cluster. The service flannel-tx-checksum-offload-disable
returns an error that prevents kops-configuration
to finish its job.
admin@ip-10-63-80-157:~$ sudo /sbin/ethtool -K flannel.1 tx-checksum-ip-generic off
Cannot get device feature names: No such device
I'm running on AWS using canal
networking:
canal: {}
I can't find sys-devices-virtual-net-flannel.1.device
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How does this block kops-configuration
from finishing the job? Worst case scenario it stays as the last job.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry. My bad. Looks like flannel pod wasn't started because the node wasn't able to join the cluster. As a result, the device wasn't created and kops-configuration service log was spammed by the messages from flannel-tx-checksum-offload-disable service. I had to manually fix the issue with etcd manager certificate first(my cert was already expired and rolling upgrade to the latest kops 1.17.0 failed).
52e8f14
to
3f86323
Compare
/lgtm |
Fixes #8562 |
@johngmyers do we want to also bump Flannel to 0.12? |
I would like to do the least amount of change needed to get 1.17 out the door. So unless bumping Flannel is necessary for 1.17, let's do that in a separate PR. |
/approve looking forward to seeing how this changes our testgrid dashboard |
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: hakman, rifelpet The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
@johngmyers Agreed, flannel stays 0.11 for now and we will consider 0.12 for 1.18. |
…pstream-release-1.17 Automated cherry pick of #9074: Disable TX checksum offload for Flannel VXLAN
Looks like this has now been fixed in Kubernetes (kubernetes/kubernetes#92035), so we may be able to remove this once it has made its way into the relevant releases (k8s v1.16.13, v1.17.9, v1.18.6). |
Agreed. Those releases should be out in about 2 weeks. Probably would be good to remove it now for 1.18 and 1.19. Do we want to remove it from earlier releases also? |
For 1.16 and 1.17 I think that would make sense yeah 👍 |
Tx checksum offloading is buggy for NAT-ed VXLAN endpoints, leading to an invalid checksum sent and causing Flannel to stop to working as the traffic is being discarded by the receiver.
flannel-io/flannel#1279
/cc @johngmyers @justinsb