[BUG] CIS scan on k3s clusters running for too long before it gets completed. #39839
Comments
This issue is not present on Kubernetes v1.24.7+k3s1. It occurs because of the following error in the sonobuoy pods on some of the nodes:
On one of the nodes the sonobuoy pod succeeds, and only that node's result appears in the generated scan report. |
Can you compare which version of the sonobuoy image is used by the k3s profiles on v1.24.7+k3s1 vs. v1.24.8+k3s1? |
Looks like a timeout accessing the DNS service. Assuming there are no problems with the coredns pods, this usually indicates that the kernel is dropping CNI traffic between nodes; see rancher/rke2#1541 (comment). What sort of infra is this on (EC2, vSphere, etc.), and which Linux distribution? Could you please try disabling the offloading on all nodes? Execute this command on all the nodes before running the scan: |
The sonobuoy version is the same, v0.56.7. The released chart version 2.1.0 works fine on 1.24.7 but not on 1.24.8. I tested the older versions 2.0.3 and 2.0.4 as well, so I think it is an issue with the k8s version only. |
I created an Amazon EC2 RKE1 cluster with the Ubuntu image that Rancher uses by default. I will try running the suggested command, but with 1.24.7 no additional steps are required after creating the cluster from the UI. So ideally, I think, a user should also be able to run the scans after creating a cluster from Rancher with 1.24.8. |
Tested this with the Kubernetes December patch version v1.24.9+k3s1. The chart works fine on that version, so it seems there was some issue with the November patches. |
I don't see the results from testing running the |
Tried the
|
OK. So that confirms that there is a kernel bug on these hosts that is corrupting the ip checksum on vxlan packets when offload is enabled. See flannel-io/flannel#1279 for context. If you cannot update to a kernel that does not contain this bug, you should probably apply this |
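The exact command referenced in these comments was not preserved in the thread text. As a hedged illustration only: the workaround discussed in flannel-io/flannel#1279 turns off TX checksum offload on the flannel VXLAN interface with ethtool. The sketch below only builds and prints that command (a dry run). The interface name `flannel.1`, the feature name `tx-checksum-ip-generic`, and the helper names `build_offload_command` and `render` are assumptions for illustration, not taken from this thread.

```python
import shlex
from typing import List

# Assumptions (not from this thread): the flannel VXLAN device is named
# "flannel.1" and the offload feature to disable is "tx-checksum-ip-generic",
# as in the workaround discussed in flannel-io/flannel#1279.
DEFAULT_INTERFACE = "flannel.1"
OFFLOAD_FEATURE = "tx-checksum-ip-generic"


def build_offload_command(interface: str = DEFAULT_INTERFACE) -> List[str]:
    """Return the ethtool argv that turns off TX checksum offload."""
    # Reject empty names and names containing whitespace.
    if not interface or any(ch.isspace() for ch in interface):
        raise ValueError(f"invalid interface name: {interface!r}")
    return ["ethtool", "-K", interface, OFFLOAD_FEATURE, "off"]


def render(interface: str = DEFAULT_INTERFACE) -> str:
    """Render the command as a shell-safe string for copy/paste or logging."""
    return shlex.join(build_offload_command(interface))


if __name__ == "__main__":
    # Dry run only: print the command instead of executing it.
    print(render())
```

The printed command would need to be run as root on every node, and ethtool settings of this kind are generally not persistent across reboots, so treat this as a temporary mitigation rather than a fix.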
@manuelbuil @thomasferrandiz did Flannel have a workaround for this at some point that got reverted? I'm wondering why this is showing up more frequently again. |
I'm not sure why I was assigned this - is someone expecting me to fix something? This is an OS issue. |
Should this issue be documented? It does not look like we can fix it from the chart or K3s. |
Since it is an OS issue, it should be OK to document the problem and the workaround found. Does that sound OK, @brandond? |
Sure, but I'm not sure which docs it should go in. The issue is with the underlying kernel, and there are issues discussing it across multiple repos (kubernetes, rke2, flannel, ubuntu, RHEL...) |
Based on this comment: flannel-io/flannel#1679 (comment) |
2.6.11 released |
@mitulshah-suse Can we retest this, considering that upstream might have closed the issue? |
@prachidamle tested this on 2.6-head,
|
@cwayne18 @brandond As noted above, the team has revalidated now that the upstream fix is available, and the issue is not found in k8s 1.24.9+k3s2. It is still present in 1.24.8+k3s1, though. Just confirming: is there any fix available from k3s for 1.24.8+k3s1, or should this stay in the release notes for that version? |
@prachidamle the fix was made available in the new version. The older release will always be affected by that issue. |
Thanks for confirming, @brandond. I am moving this to to-test for the Q2 2.6 milestone to verify with the latest versions. |
Rancher Server Setup
Information about the Cluster
User Information
Describe the bug
k3s-cis-1.23-profile scan gets stuck in the running state for about 25 minutes before it completes.
security-scan-runner-scan-* pod logs:
To Reproduce
v1.24.8+k3s1 cluster (1-cp, 1-etcd, 1-w).
2.1.1-rc1
k3s-cis-1.23-profile scans.
Result