[BUG] CIS scan on k3s clusters runs too long before it completes #39839

Closed
rishabhmsra opened this issue Dec 9, 2022 · 22 comments

@rishabhmsra
Contributor

Rancher Server Setup

  • Rancher version: v2.6-head(8048eee)
  • Installation option (Docker install/Helm Chart): Docker
    • If Helm Chart, Kubernetes Cluster and version (RKE1, RKE2, k3s, EKS, etc):

Information about the Cluster

  • Kubernetes version: v1.24.8+k3s1
  • Cluster Type (Local/Downstream): Downstream
    • If downstream, what type of cluster? (Custom/Imported or specify provider for Hosted/Infrastructure Provider): Custom

User Information

  • What is the role of the user logged in? (Admin/Cluster Owner/Cluster Member/Project Owner/Project Member/Custom) Admin
    • If custom, define the set of permissions:

Describe the bug

  • The CIS k3s-cis-1.23-profile scan gets stuck in the running state for about 25 minutes before it completes.
  • security-scan-runner-scan-* pod logs (a pod-inspection sketch follows the excerpt):
level=warning msg="retrying with deprecated label \"run=sonobuoy-master\""
level=warning msg="no pods found with label \"sonobuoy-component=aggregator\" in namespace cis-operator-system"
level=warning msg="retrying with deprecated label \"run=sonobuoy-master\""
level=warning msg="no pods found with label \"sonobuoy-component=aggregator\" in namespace cis-operator-system"
level=warning msg="retrying with deprecated label \"run=sonobuoy-master\""
level=warning msg="no pods found with label \"sonobuoy-component=aggregator\" in namespace cis-operator-system"
level=warning msg="retrying with deprecated label \"run=sonobuoy-master\""
level=warning msg="no pods found with label \"sonobuoy-component=aggregator\" in namespace cis-operator-system"
level=warning msg="retrying with deprecated label \"run=sonobuoy-master\""
level=warning msg="no pods found with label \"sonobuoy-component=aggregator\" in namespace cis-operator-system"
level=warning msg="retrying with deprecated label \"run=sonobuoy-master\""

To Reproduce

  • Provision a k3s v1.24.8+k3s1 cluster (1 control plane, 1 etcd, 1 worker).
  • Install CIS Benchmark chart version 2.1.1-rc1.
  • Run the k3s-cis-1.23-profile scan (a CLI sketch of these steps follows).
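
For completeness, a rough CLI equivalent of these steps (a sketch; the Helm repo URL, chart names, and ClusterScan fields are assumptions based on the standard Rancher CIS Benchmark charts, not taken from this report):

# Install the CIS Benchmark CRD and application charts (repo URL and chart
# names are assumptions based on the standard Rancher charts)
helm repo add rancher-charts https://charts.rancher.io
helm install rancher-cis-benchmark-crd rancher-charts/rancher-cis-benchmark-crd \
  -n cis-operator-system --create-namespace
helm install rancher-cis-benchmark rancher-charts/rancher-cis-benchmark \
  -n cis-operator-system

# Trigger the scan with the profile named in this issue (CR fields assumed
# from the cis-operator ClusterScan type)
cat <<'EOF' | kubectl apply -f -
apiVersion: cis.cattle.io/v1
kind: ClusterScan
metadata:
  name: k3s-cis-scan
spec:
  scanProfileName: k3s-cis-1.23-profile
EOF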

Result

  • The scan gets stuck in the running state for around 25 minutes before it completes.
    (screenshot: k3s-long)
@rishabhmsra rishabhmsra added the kind/bug-qa and team/area3 labels Dec 9, 2022
@rishabhmsra rishabhmsra added this to the 2023-Q1-v2.6x milestone Dec 9, 2022
@rishabhmsra rishabhmsra self-assigned this Dec 9, 2022
@vardhaman22 vardhaman22 self-assigned this Dec 13, 2022
@vardhaman22
Contributor

vardhaman22 commented Dec 13, 2022

This issue is not present on Kubernetes v1.24.7+k3s1.
On the current version, v1.24.8+k3s1, older versions of the CIS chart (tested on 2.1.0 and 2.0.4) have this issue as well, so it seems to be related to k3s itself. It also occurs on Kubernetes v1.23.14+k3s1.

The issue occurs due to the following error in the sonobuoy pods on some of the nodes:

time="2022-12-13T12:32:19Z" level=trace msg="Invoked command single-node with args [] and flags [level=trace logtostderr=true sleep=-1 v=6]"
time="2022-12-13T12:32:19Z" level=info msg="Waiting for waitfile" waitfile=/tmp/sonobuoy/done
time="2022-12-13T12:32:19Z" level=info msg="Starting to listen on port 8099 for progress updates and will relay them to https://service-rancher-cis-benchmark/api/v1/progress/by-node/vardhaman-pool1-826cdb4a-jkp97/rancher-kube-bench"
time="2022-12-13T12:49:22Z" level=trace msg="Detected done file but sleeping for 5s then checking again for file. This allows other containers to intervene if desired."
time="2022-12-13T12:49:27Z" level=info msg="Detected done file, transmitting result file" resultFile=/tmp/sonobuoy/kb.tar.gz
time="2022-12-13T12:50:17Z" level=error msg="error entry for attempt: 1, verb: PUT, time: 2022-12-13 12:50:17.121624279 +0000 UTC m=+1078.028491965, URL: https://service-rancher-cis-benchmark/api/v1/results/by-node/vardhaman-pool1-826cdb4a-jkp97/rancher-kube-bench: Put \"https://service-rancher-cis-benchmark/api/v1/results/by-node/vardhaman-pool1-826cdb4a-jkp97/rancher-kube-bench\": dial tcp: lookup service-rancher-cis-benchmark on 10.43.0.10:53: read udp 172.31.9.213:34582->10.43.0.10:53: i/o timeout" trace="Put \"https://service-rancher-cis-benchmark/api/v1/results/by-node/vardhaman-pool1-826cdb4a-jkp97/rancher-kube-bench\": dial tcp: lookup service-rancher-cis-benchmark on 10.43.0.10:53: read udp 172.31.9.213:34582->10.43.0.10:53: i/o timeout\nerror entry for attempt: 1, verb: PUT, time: 2022-12-13 12:50:17.121624279 +0000 UTC m=+1078.028491965, URL: https://service-rancher-cis-benchmark/api/v1/results/by-node/vardhaman-pool1-826cdb4a-jkp97/rancher-kube-bench\ngit.luolix.top/vmware-tanzu/sonobuoy/pkg/worker.DoRequest.func1\n\t/go/src/github.com/vmware-tanzu/sonobuoy/pkg/worker/request.go:44\ngit.luolix.top/sethgrid/pester.(*Client).log\n\t/go/pkg/mod/github.com/sethgrid/pester@v0.0.0-20190127155807-68a33a018ad0/pester.go:398\ngit.luolix.top/sethgrid/pester.(*Client).pester.func2\n\t/go/pkg/mod/github.com/sethgrid/pester@v0.0.0-20190127155807-68a33a018ad0/pester.go:289\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1571"

The sonobuoy pod on one of the nodes succeeds, and only that node's results appear in the generated scan report; the other nodes have no records in the report.
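
A minimal sketch for confirming the DNS/UDP path from an affected node, using the service and cluster DNS address from the error above (the pod name and node name are placeholders):

# Schedule a throwaway pod on the affected node and resolve the benchmark
# service; a timeout here reproduces the failure outside sonobuoy
kubectl -n cis-operator-system run dns-check --rm -it --restart=Never \
  --image=busybox:1.36 \
  --overrides='{"apiVersion":"v1","spec":{"nodeName":"<affected-node>"}}' \
  -- nslookup service-rancher-cis-benchmark.cis-operator-system.svc.cluster.local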
cc: @prachidamle

@prachidamle
Member

Can you compare which version of the sonobuoy image is used by the k3s profiles on v1.24.7+k3s1 vs v1.24.8+k3s1?

@brandond
Member

brandond commented Dec 15, 2022

Put \"https://service-rancher-cis-benchmark/api/v1/results/by-node/vardhaman-pool1-826cdb4a-jkp97/rancher-kube-bench\": dial tcp: lookup service-rancher-cis-benchmark on 10.43.0.10:53: read udp 172.31.9.213:34582->10.43.0.10:53: i/o timeout

Looks like a timeout accessing the DNS service. Assuming there are no problems with the coredns pods, this usually indicates a kernel issue dropping CNI traffic between nodes; see rancher/rke2#1541 (comment). What sort of infra is this on? EC2, vSphere, etc. And what Linux distribution?

Could you please try disabling offloading on all nodes? Execute this command on all the nodes before running the scan: sudo ethtool -K flannel.1 tx-checksum-ip-generic off
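
A sketch of checking and applying this on every node before the scan (node names and SSH user are placeholders for this environment; the change is not persistent across reboots):

for node in node-cp node-etcd node-worker; do
  # Show the current offload state on the flannel VXLAN interface
  ssh ubuntu@"$node" 'sudo ethtool -k flannel.1 | grep tx-checksum-ip-generic'
  # Disable checksum offload (lowercase -k queries, uppercase -K sets)
  ssh ubuntu@"$node" 'sudo ethtool -K flannel.1 tx-checksum-ip-generic off'
done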

@vardhaman22
Contributor

Can you compare which version of the sonobuoy image is used by the k3s profiles on v1.24.7+k3s1 vs v1.24.8+k3s1?

The sonobuoy version is the same, v0.56.7. The released chart version 2.1.0 works fine on 1.24.7 but not on 1.24.8. I also tested older chart versions (2.0.3 and 2.0.4), so I think it is an issue with the Kubernetes version only.
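
One way to check which sonobuoy image the scan pods are using (a sketch; the namespace is taken from the logs above):

kubectl -n cis-operator-system get pods \
  -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.containers[*].image}{"\n"}{end}'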

@vardhaman22
Contributor

vardhaman22 commented Dec 15, 2022

Put \"https://service-rancher-cis-benchmark/api/v1/results/by-node/vardhaman-pool1-826cdb4a-jkp97/rancher-kube-bench\": dial tcp: lookup service-rancher-cis-benchmark on 10.43.0.10:53: read udp 172.31.9.213:34582->10.43.0.10:53: i/o timeout

Looks like a timeout accessing the DNS service. Assuming there are no problems with the coredns pods, this usually indicates a kernel issue dropping CNI traffic between nodes; see rancher/rke2#1541 (comment). What sort of infra is this on? EC2, vSphere, etc. And what Linux distribution?

Could you please try disabling offloading on all nodes? Execute this command on all the nodes before running the scan: sudo ethtool -K flannel.1 tx-checksum-ip-generic off

I created an Amazon EC2 RKE1 cluster with the Ubuntu image that Rancher uses by default. I will try running the suggested command, but with 1.24.7 no additional steps are required after creating the cluster from the UI, so ideally users should also be able to run the scans right after creating a 1.24.8 cluster from Rancher.

@vardhaman22
Contributor

vardhaman22 commented Dec 22, 2022

Tested this with the Kubernetes December patch version, v1.24.9+k3s1. The chart works fine on that version, so it seems there was some issue with the November patches.
@prachidamle @brandond what should be done next for this?

@brandond
Member

brandond commented Dec 22, 2022

I will try running the suggested command, but with 1.24.7 no additional steps are required after creating the cluster from the UI.

I don't see the results from running the ethtool command on the nodes before executing the scan. The results of that test will suggest where we look next. Please do that and report back.

@vardhaman22
Contributor

vardhaman22 commented Dec 23, 2022

I will try running the suggested command, but with 1.24.7 no additional steps are required after creating the cluster from the UI.

I don't see the results from running the ethtool command on the nodes before executing the scan. The results of that test will suggest where we look next. Please do that and report back.

I tried the ethtool command on all the nodes before running the scans. It resolves the issue (the scan completes within a minute).
Sample output of the command on one of the nodes:

ubuntu@vardhaman-cis-test-pool1-b513cb13-kqb88:~$ sudo ethtool -K flannel.1 tx-checksum-ip-generic off
Actual changes:
tx-checksumming: off
	tx-checksum-ip-generic: off
tcp-segmentation-offload: off
	tx-tcp-segmentation: off [requested on]
	tx-tcp-ecn-segmentation: off [requested on]
	tx-tcp-mangleid-segmentation: off [requested on]
	tx-tcp6-segmentation: off [requested on]
ubuntu@vardhaman-cis-test-pool1-b513cb13-kqb88:~$ 

@brandond

@brandond
Member

brandond commented Dec 23, 2022

OK. So that confirms there is a kernel bug on these hosts that corrupts the IP checksum on VXLAN packets when offload is enabled. See flannel-io/flannel#1279 for context. If you cannot update to a kernel that does not contain this bug, you should probably apply this ethtool change to the nodes as part of the provisioning process.
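
One possible way to bake the workaround into provisioning (a sketch, assuming systemd and a k3s install; flannel.1 only exists after flannel starts, hence the wait loop, and the ethtool path may differ per distribution):

cat <<'EOF' | sudo tee /etc/systemd/system/flannel-offload-off.service
[Unit]
Description=Disable tx-checksum-ip-generic on flannel.1 (VXLAN checksum kernel bug workaround)
# Use k3s-agent.service here on agent/worker nodes
After=k3s.service

[Service]
Type=oneshot
RemainAfterExit=true
# Wait until flannel has created the VXLAN interface, then disable offload
ExecStartPre=/bin/sh -c 'until ip link show flannel.1 >/dev/null 2>&1; do sleep 5; done'
ExecStart=/usr/sbin/ethtool -K flannel.1 tx-checksum-ip-generic off

[Install]
WantedBy=multi-user.target
EOF
sudo systemctl daemon-reload
sudo systemctl enable --now flannel-offload-off.service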

@brandond
Member

@manuelbuil @thomasferrandiz did Flannel have a workaround for this at some point that got reverted? I'm wondering why this is showing up more frequently again.

@brandond
Member

brandond commented Jan 5, 2023

I'm not sure why I was assigned this - is someone expecting me to fix something? This is an OS issue.

@mitulshah-suse

Should this issue be documented? It does not look like we can fix it from the chart or K3s.
The only option is to use the workaround sudo ethtool -K flannel.1 tx-checksum-ip-generic off on all the nodes before running the scan.
@brandond @prachidamle

@zube zube bot added the release-note label Jan 10, 2023
@prachidamle
Member

Since it is an OS issue, it should be OK to document the problem and the workaround found. Does that sound OK, @brandond?

@brandond
Member

brandond commented Jan 10, 2023

Sure, but I'm not sure which docs it should go in. The issue is with the underlying kernel, and there are issues discussing it across multiple repos (kubernetes, rke2, flannel, ubuntu, RHEL...).

@zube zube bot removed the team/area3 label Jan 19, 2023
@niusmallnan
Contributor

Based on this comment: flannel-io/flannel#1679 (comment)
I think K3s versions with Flannel 0.20.2 embedded, such as v1.24.9+k3s2, have fixed this issue.

@ronhorton ronhorton modified the milestones: v2.6.11, 2023-Q2-v2.6.x Mar 8, 2023
@ronhorton

2.6.11 released.
This ticket still needs to be documented (location to be determined).
Moving to the next milestone for tracking.

@prachidamle
Member

@mitulshah-suse Can we retest this, considering that upstream might have closed the issue?

@vardhaman22
Contributor

vardhaman22 commented Mar 10, 2023

@prachidamle Tested this on 2.6-head:

  1. Custom cluster with k8s 1.24.8+k3s1 (EC2 instances), the issue is present.
  2. Custom cluster with k8s 1.24.9+k3s2 (EC2 instances), the issue is not present.

@prachidamle
Member

@cwayne18 @brandond As noted above, the team has revalidated after the upstream fix became available, and the issue is not found in k8s 1.24.9+k3s2. But it is still present for 1.24.8+k3s1. Just confirming: is there any fix available for the 1.24.8+k3s1 version from k3s, or should this stay in the release notes for that version?

@brandond
Member

@prachidamle the fix was made available in the new version. The older release will always be affected by that issue.

@prachidamle
Member

prachidamle commented Mar 28, 2023

Thanks for confirming @brandond

I am moving this to-test for the Q2 2.6 milestone to verify with the latest versions.

@prachidamle prachidamle modified the milestones: 2023-Q2-v2.6.x, 2.6.12 Mar 28, 2023
@prachidamle prachidamle removed the release-note label Mar 28, 2023
@vivek-shilimkar
Contributor

Validated this issue on Rancher v2.6.11.

  1. Provisioned a k3s v1.24.10+k3s1 cluster (1 control plane, 1 etcd, 1 worker).
  2. Installed CIS chart version 2.1.1.
  3. Ran the k3s-cis-1.23-profile scan.

Scan passed successfully.
(screenshot: 2023-03-29 14-45-13)
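
A quick CLI check of the same result (a sketch, assuming the cis.cattle.io CRDs installed by the chart):

# The ClusterScan status and the generated report show pass/fail without the UI
kubectl get clusterscans.cis.cattle.io
kubectl get clusterscanreports.cis.cattle.io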

@MKlimuszka MKlimuszka modified the milestones: 2.6.12, 2023-Q2-v2.6.x Apr 17, 2023
@zube zube bot removed the [zube]: Done label Jun 27, 2023