[BUG] CIS scan on k3s clusters runs too long before it completes #39839

Closed
rishabhmsra opened this issue Dec 9, 2022 · 22 comments

@rishabhmsra
Contributor

Rancher Server Setup

  • Rancher version: v2.6-head(8048eee)
  • Installation option (Docker install/Helm Chart): Docker
    • If Helm Chart, Kubernetes Cluster and version (RKE1, RKE2, k3s, EKS, etc):

Information about the Cluster

  • Kubernetes version: v1.24.8+k3s1
  • Cluster Type (Local/Downstream): Downstream
    • If downstream, what type of cluster? (Custom/Imported or specify provider for Hosted/Infrastructure Provider): Custom

User Information

  • What is the role of the user logged in? (Admin/Cluster Owner/Cluster Member/Project Owner/Project Member/Custom) Admin
    • If custom, define the set of permissions:

Describe the bug

  • The CIS k3s-cis-1.23-profile scan gets stuck in the running state for about 25 minutes before it completes.
  • security-scan-runner-scan-* pod logs (a pod-inspection sketch follows the excerpt):
level=warning msg="retrying with deprecated label \"run=sonobuoy-master\""
level=warning msg="no pods found with label \"sonobuoy-component=aggregator\" in namespace cis-operator-system"
level=warning msg="retrying with deprecated label \"run=sonobuoy-master\""
level=warning msg="no pods found with label \"sonobuoy-component=aggregator\" in namespace cis-operator-system"
level=warning msg="retrying with deprecated label \"run=sonobuoy-master\""
level=warning msg="no pods found with label \"sonobuoy-component=aggregator\" in namespace cis-operator-system"
level=warning msg="retrying with deprecated label \"run=sonobuoy-master\""
level=warning msg="no pods found with label \"sonobuoy-component=aggregator\" in namespace cis-operator-system"
level=warning msg="retrying with deprecated label \"run=sonobuoy-master\""
level=warning msg="no pods found with label \"sonobuoy-component=aggregator\" in namespace cis-operator-system"
level=warning msg="retrying with deprecated label \"run=sonobuoy-master\""

To Reproduce

  • Provision a k3s v1.24.8+k3s1 cluster (1 control plane, 1 etcd, 1 worker).
  • Install CIS Benchmark chart version 2.1.1-rc1.
  • Run the k3s-cis-1.23-profile scan (a CLI sketch of these steps follows).
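
For completeness, a rough CLI equivalent of these steps (a sketch; the Helm repo URL, chart names, and ClusterScan fields are assumptions based on the standard Rancher CIS Benchmark charts, not taken from this report):

# Install the CIS Benchmark CRD and application charts (repo URL and chart
# names are assumptions based on the standard Rancher charts)
helm repo add rancher-charts https://charts.rancher.io
helm install rancher-cis-benchmark-crd rancher-charts/rancher-cis-benchmark-crd \
  -n cis-operator-system --create-namespace
helm install rancher-cis-benchmark rancher-charts/rancher-cis-benchmark \
  -n cis-operator-system

# Trigger the scan with the profile named in this issue (CR fields assumed
# from the cis-operator ClusterScan type)
cat <<'EOF' | kubectl apply -f -
apiVersion: cis.cattle.io/v1
kind: ClusterScan
metadata:
  name: k3s-cis-scan
spec:
  scanProfileName: k3s-cis-1.23-profile
EOF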

Result

  • The scan gets stuck in the running state for around 25 minutes before it completes.
    (screenshot: k3s-long)
@rishabhmsra rishabhmsra added the kind/bug-qa and team/area3 labels Dec 9, 2022
@rishabhmsra rishabhmsra added this to the 2023-Q1-v2.6x milestone Dec 9, 2022
@rishabhmsra rishabhmsra self-assigned this Dec 9, 2022
@vardhaman22 vardhaman22 self-assigned this Dec 13, 2022
@vardhaman22
Contributor

vardhaman22 commented Dec 13, 2022

This issue is not present on Kubernetes v1.24.7+k3s1.
On the current version, v1.24.8+k3s1, older versions of the CIS chart (tested on 2.1.0 and 2.0.4) have this issue as well, so it seems to be related to k3s itself. It also occurs on Kubernetes v1.23.14+k3s1.

The issue occurs due to the following error in the sonobuoy pods on some of the nodes:

time="2022-12-13T12:32:19Z" level=trace msg="Invoked command single-node with args [] and flags [level=trace logtostderr=true sleep=-1 v=6]"
time="2022-12-13T12:32:19Z" level=info msg="Waiting for waitfile" waitfile=/tmp/sonobuoy/done
time="2022-12-13T12:32:19Z" level=info msg="Starting to listen on port 8099 for progress updates and will relay them to https://service-rancher-cis-benchmark/api/v1/progress/by-node/vardhaman-pool1-826cdb4a-jkp97/rancher-kube-bench"
time="2022-12-13T12:49:22Z" level=trace msg="Detected done file but sleeping for 5s then checking again for file. This allows other containers to intervene if desired."
time="2022-12-13T12:49:27Z" level=info msg="Detected done file, transmitting result file" resultFile=/tmp/sonobuoy/kb.tar.gz
time="2022-12-13T12:50:17Z" level=error msg="error entry for attempt: 1, verb: PUT, time: 2022-12-13 12:50:17.121624279 +0000 UTC m=+1078.028491965, URL: https://service-rancher-cis-benchmark/api/v1/results/by-node/vardhaman-pool1-826cdb4a-jkp97/rancher-kube-bench: Put \"https://service-rancher-cis-benchmark/api/v1/results/by-node/vardhaman-pool1-826cdb4a-jkp97/rancher-kube-bench\": dial tcp: lookup service-rancher-cis-benchmark on 10.43.0.10:53: read udp 172.31.9.213:34582->10.43.0.10:53: i/o timeout" trace="Put \"https://service-rancher-cis-benchmark/api/v1/results/by-node/vardhaman-pool1-826cdb4a-jkp97/rancher-kube-bench\": dial tcp: lookup service-rancher-cis-benchmark on 10.43.0.10:53: read udp 172.31.9.213:34582->10.43.0.10:53: i/o timeout\nerror entry for attempt: 1, verb: PUT, time: 2022-12-13 12:50:17.121624279 +0000 UTC m=+1078.028491965, URL: https://service-rancher-cis-benchmark/api/v1/results/by-node/vardhaman-pool1-826cdb4a-jkp97/rancher-kube-bench\ngit.luolix.top/vmware-tanzu/sonobuoy/pkg/worker.DoRequest.func1\n\t/go/src/github.com/vmware-tanzu/sonobuoy/pkg/worker/request.go:44\ngit.luolix.top/sethgrid/pester.(*Client).log\n\t/go/pkg/mod/github.com/sethgrid/pester@v0.0.0-20190127155807-68a33a018ad0/pester.go:398\ngit.luolix.top/sethgrid/pester.(*Client).pester.func2\n\t/go/pkg/mod/github.com/sethgrid/pester@v0.0.0-20190127155807-68a33a018ad0/pester.go:289\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1571"

The sonobuoy pod on one of the nodes succeeds, and only that node's results appear in the generated scan report; the other nodes have no records in the report.
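
A minimal sketch for confirming the DNS/UDP path from an affected node, using the service and cluster DNS address from the error above (the pod name and node name are placeholders):

# Schedule a throwaway pod on the affected node and resolve the benchmark
# service; a timeout here reproduces the failure outside sonobuoy
kubectl -n cis-operator-system run dns-check --rm -it --restart=Never \
  --image=busybox:1.36 \
  --overrides='{"apiVersion":"v1","spec":{"nodeName":"<affected-node>"}}' \
  -- nslookup service-rancher-cis-benchmark.cis-operator-system.svc.cluster.local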
cc: @prachidamle

@prachidamle
Member

Can you compare which version of the sonobuoy image is used by the k3s profiles on v1.24.7+k3s1 vs v1.24.8+k3s1?

@brandond
Member

brandond commented Dec 15, 2022

Put \"https://service-rancher-cis-benchmark/api/v1/results/by-node/vardhaman-pool1-826cdb4a-jkp97/rancher-kube-bench\": dial tcp: lookup service-rancher-cis-benchmark on 10.43.0.10:53: read udp 172.31.9.213:34582->10.43.0.10:53: i/o timeout

Looks like a timeout accessing the DNS service. Assuming there are no problems with the coredns pods, this usually indicates a kernel issue dropping CNI traffic between nodes; see rancher/rke2#1541 (comment). What sort of infra is this on? EC2, vSphere, etc. And what Linux distribution?

Could you please try disabling offloading on all nodes? Execute this command on all the nodes before running the scan: sudo ethtool -K flannel.1 tx-checksum-ip-generic off
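
A sketch of checking and applying this on every node before the scan (node names and SSH user are placeholders for this environment; the change is not persistent across reboots):

for node in node-cp node-etcd node-worker; do
  # Show the current offload state on the flannel VXLAN interface
  ssh ubuntu@"$node" 'sudo ethtool -k flannel.1 | grep tx-checksum-ip-generic'
  # Disable checksum offload (lowercase -k queries, uppercase -K sets)
  ssh ubuntu@"$node" 'sudo ethtool -K flannel.1 tx-checksum-ip-generic off'
done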

@vardhaman22
Contributor

Can you compare which version of the sonobuoy image is used by the k3s profiles on v1.24.7+k3s1 vs v1.24.8+k3s1?

The sonobuoy version is the same, v0.56.7. The released chart version 2.1.0 works fine on 1.24.7 but not on 1.24.8. I also tested older chart versions (2.0.3 and 2.0.4), so I think it is an issue with the Kubernetes version only.
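
One way to check which sonobuoy image the scan pods are using (a sketch; the namespace is taken from the logs above):

kubectl -n cis-operator-system get pods \
  -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.containers[*].image}{"\n"}{end}'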

@vardhaman22
Contributor

vardhaman22 commented Dec 15, 2022

Put \"https://service-rancher-cis-benchmark/api/v1/results/by-node/vardhaman-pool1-826cdb4a-jkp97/rancher-kube-bench\": dial tcp: lookup service-rancher-cis-benchmark on 10.43.0.10:53: read udp 172.31.9.213:34582->10.43.0.10:53: i/o timeout

Looks like a timeout accessing the DNS service. Assuming there are no problems with the coredns pods, this usually indicates a kernel issue dropping CNI traffic between nodes; see rancher/rke2#1541 (comment). What sort of infra is this on? EC2, vSphere, etc. And what Linux distribution?

Could you please try disabling offloading on all nodes? Execute this command on all the nodes before running the scan: sudo ethtool -K flannel.1 tx-checksum-ip-generic off

I created an Amazon EC2 RKE1 cluster with the Ubuntu image that Rancher uses by default. I will try running the suggested command, but with 1.24.7 no additional steps are required after creating the cluster from the UI, so ideally users should also be able to run the scans right after creating a 1.24.8 cluster from Rancher.

@vardhaman22
Contributor

vardhaman22 commented Dec 22, 2022

Tested this with the Kubernetes December patch version, v1.24.9+k3s1. The chart works fine on that version, so it seems there was some issue with the November patches.
@prachidamle @brandond what should be done next for this?

@brandond
Member

brandond commented Dec 22, 2022

I will try running the suggested command, but with 1.24.7 no additional steps are required after creating the cluster from the UI.

I don't see the results from running the ethtool command on the nodes before executing the scan. The results of that test will suggest where we look next. Please do that and report back.

@vardhaman22
Contributor

vardhaman22 commented Dec 23, 2022

I will try running the suggested command, but with 1.24.7 no additional steps are required after creating the cluster from the UI.

I don't see the results from running the ethtool command on the nodes before executing the scan. The results of that test will suggest where we look next. Please do that and report back.

I tried the ethtool command on all the nodes before running the scans. It resolves the issue (the scan completes within a minute).
Sample output of the command on one of the nodes:

ubuntu@vardhaman-cis-test-pool1-b513cb13-kqb88:~$ sudo ethtool -K flannel.1 tx-checksum-ip-generic off
Actual changes:
tx-checksumming: off
	tx-checksum-ip-generic: off
tcp-segmentation-offload: off
	tx-tcp-segmentation: off [requested on]
	tx-tcp-ecn-segmentation: off [requested on]
	tx-tcp-mangleid-segmentation: off [requested on]
	tx-tcp6-segmentation: off [requested on]
ubuntu@vardhaman-cis-test-pool1-b513cb13-kqb88:~$ 

@brandond

@brandond
Member

brandond commented Dec 23, 2022

OK. So that confirms there is a kernel bug on these hosts that corrupts the IP checksum on VXLAN packets when offload is enabled. See flannel-io/flannel#1279 for context. If you cannot update to a kernel that does not contain this bug, you should probably apply this ethtool change to the nodes as part of the provisioning process.
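
One possible way to bake the workaround into provisioning (a sketch, assuming systemd and a k3s install; flannel.1 only exists after flannel starts, hence the wait loop, and the ethtool path may differ per distribution):

cat <<'EOF' | sudo tee /etc/systemd/system/flannel-offload-off.service
[Unit]
Description=Disable tx-checksum-ip-generic on flannel.1 (VXLAN checksum kernel bug workaround)
# Use k3s-agent.service here on agent/worker nodes
After=k3s.service

[Service]
Type=oneshot
RemainAfterExit=true
# Wait until flannel has created the VXLAN interface, then disable offload
ExecStartPre=/bin/sh -c 'until ip link show flannel.1 >/dev/null 2>&1; do sleep 5; done'
ExecStart=/usr/sbin/ethtool -K flannel.1 tx-checksum-ip-generic off

[Install]
WantedBy=multi-user.target
EOF
sudo systemctl daemon-reload
sudo systemctl enable --now flannel-offload-off.service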

@brandond
Member

@manuelbuil @thomasferrandiz did Flannel have a workaround for this at some point that got reverted? I'm wondering why this is showing up more frequently again.

@brandond
Member

brandond commented Jan 5, 2023

I'm not sure why I was assigned this - is someone expecting me to fix something? This is an OS issue.

@mitulshah-suse

Should this issue be documented? It does not look like we can fix it from the chart or K3s.
The only option is to use the workaround sudo ethtool -K flannel.1 tx-checksum-ip-generic off on all the nodes before running the scan.
@brandond @prachidamle

@zube zube bot added the release-note label Jan 10, 2023
@prachidamle
Member

Since it is an OS issue, it should be OK to document the problem and the workaround found. Does that sound OK, @brandond?

@brandond
Member

brandond commented Jan 10, 2023

Sure, but I'm not sure which docs it should go in. The issue is with the underlying kernel, and there are issues discussing it across multiple repos (kubernetes, rke2, flannel, ubuntu, RHEL...).

@zube zube bot removed the team/area3 label Jan 19, 2023
@niusmallnan
Contributor

Based on this comment: flannel-io/flannel#1679 (comment)
I think K3s versions with Flannel 0.20.2 embedded, such as v1.24.9+k3s2, have fixed this issue.

@ronhorton ronhorton modified the milestones: v2.6.11, 2023-Q2-v2.6.x Mar 8, 2023
@ronhorton

2.6.11 released.
This ticket still needs to be documented (location to be determined).
Moving to the next milestone for tracking.

@prachidamle
Member

@mitulshah-suse Can we retest this, considering that upstream might have closed the issue?

@vardhaman22
Contributor

vardhaman22 commented Mar 10, 2023

@prachidamle Tested this on 2.6-head:

  1. Custom cluster with k8s 1.24.8+k3s1 (EC2 instances), the issue is present.
  2. Custom cluster with k8s 1.24.9+k3s2 (EC2 instances), the issue is not present.

@prachidamle
Member

@cwayne18 @brandond As noted above, the team has revalidated after the upstream fix became available, and the issue is not found in k8s 1.24.9+k3s2. But it is still present for 1.24.8+k3s1. Just confirming: is there any fix available for the 1.24.8+k3s1 version from k3s, or should this stay in the release notes for that version?

@brandond
Member

@prachidamle the fix was made available in the new version. The older release will always be affected by that issue.

@prachidamle
Member

prachidamle commented Mar 28, 2023

Thanks for confirming @brandond

I am moving this to-test for the Q2 2.6 milestone to verify with the latest versions.

@prachidamle prachidamle modified the milestones: 2023-Q2-v2.6.x, 2.6.12 Mar 28, 2023
@prachidamle prachidamle removed the release-note label Mar 28, 2023
@vivek-shilimkar
Contributor

Validated this issue on Rancher v2.6.11.

  1. Provisioned a k3s v1.24.10+k3s1 cluster (1 control plane, 1 etcd, 1 worker).
  2. Installed CIS chart version 2.1.1.
  3. Ran the k3s-cis-1.23-profile scan.

Scan passed successfully.
(screenshot: 2023-03-29 14-45-13)
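
A quick CLI check of the same result (a sketch, assuming the cis.cattle.io CRDs installed by the chart):

# The ClusterScan status and the generated report show pass/fail without the UI
kubectl get clusterscans.cis.cattle.io
kubectl get clusterscanreports.cis.cattle.io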

@MKlimuszka MKlimuszka modified the milestones: 2.6.12, 2023-Q2-v2.6.x Apr 17, 2023
@zube zube bot removed the [zube]: Done label Jun 27, 2023