
Update default windows.kubelet value to be compatible with containerd runtime #521

Merged
merged 6 commits on Aug 2, 2022

Conversation

luborpetr
Contributor

What type of PR is this?

/kind bug

What this PR does / why we need it:

This PR addresses malfunctioning liveness probes on Windows Kubernetes nodes with the containerd runtime.

Which issue(s) this PR fixes:

When the previous default value of windows.kubelet == C:\var\lib\kubelet was used
and a liveness probe was triggered on a Windows containerd host,
containerd interpreted the probe command as follows (notice the missing backslash):

/csi-node-driver-registrar.exe --kubelet-registration-path=C:\var\lib\kubelet\\plugins\\smb.csi.k8s.io\\csi.sock --mode=kubelet-registration-probe

When this command is passed into the node-driver-registrar container, the probe fails.

Example error from kubelet log:

E0729 09:36:05.100541    4448 remote_runtime.go:394] "ExecSync cmd from runtime service failed" err="rpc error: code = Unknown desc = failed to exec in container: failed to start exec
\"d85d05ac9a0871ddf8e6a79d40a8ca1a9cd288cdde257aabcc3972534255735c\": hcs::System::CreateProcess 2e431d1020565485c98bf768f800ebc25f30fe3152d59d04a6ad06d6320eab0d: The RPC server is unavailable.:
unknown" containerID="2e431d1020565485c98bf768f800ebc25f30fe3152d59d04a6ad06d6320eab0d" cmd=[/csi-node-driver-registrar.exe
--kubelet-registration-path=C:\var\lib\kubelet\\plugins\\smb.csi.k8s.io\\csi.sock --mode=kubelet-registration-probe]
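
For context, the probe command above is defined on the node-driver-registrar container in the Windows DaemonSet. A minimal sketch of that wiring, assuming the common CSI sidecar pattern where the registration path is supplied through the DRIVER_REG_SOCK_PATH env var (not copied verbatim from this chart):

          livenessProbe:
            exec:
              command:
                - /csi-node-driver-registrar.exe
                - --kubelet-registration-path=$(DRIVER_REG_SOCK_PATH)
                - --mode=kubelet-registration-probe

Kubernetes expands the $(DRIVER_REG_SOCK_PATH) reference before handing the command to the runtime, which is how the windows.kubelet default ends up inside the probe arguments.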

When I ran the command manually to see the exact reason for the failure,
I noticed that csi-node-driver-registrar.exe misinterprets single backslashes: each lone backslash is consumed as an escape character, so C:\var\lib\kubelet collapses to C:varlibkubelet, while each escaped \\ collapses to a single backslash (visible in the file= path below).

k exec -it csi-smb-node-windows-6jfw7 -n kube-system -c node-driver-registrar -- /csi-node-driver-registrar.exe --kubelet-registration-path=C:\var\lib\kubelet\\plugins\\smb.csi.k8s.io\\csi.sock --mode=kubelet-registration-probe

F0729 11:57:16.262829  235816 main.go:159] Kubelet plugin registration hasn't succeeded yet, file=C:varlibkubelet\plugins\smb.csi.k8s.io\registration doesn't exist. 
goroutine 1 [running]:
k8s.io/klog/v2.stacks(0x1)
        /workspace/vendor/k8s.io/klog/v2/klog.go:1038 +0x8a
k8s.io/klog/v2.(*loggingT).output(0x1681b60, 0x3, 0x0, 0xc000300d20, 0x0, {0x1342787, 0x1}, 0xc000394d90, 0x0)  
        /workspace/vendor/k8s.io/klog/v2/klog.go:987 +0x5fd
k8s.io/klog/v2.(*loggingT).printf(0x1155ee0, 0x4, 0x0, {0x0, 0x0}, {0x117ef3b, 0x48}, {0xc000394d90, 0x1, 0x1}) 
        /workspace/vendor/k8s.io/klog/v2/klog.go:753 +0x1c5
k8s.io/klog/v2.Fatalf(...)
        /workspace/vendor/k8s.io/klog/v2/klog.go:1532
main.main()
        /workspace/cmd/csi-node-driver-registrar/main.go:159 +0x48e

goroutine 34 [chan receive]:
k8s.io/klog/v2.(*loggingT).flushDaemon(0x0)
        /workspace/vendor/k8s.io/klog/v2/klog.go:1181 +0x6a
created by k8s.io/klog/v2.init.0
        /workspace/vendor/k8s.io/klog/v2/klog.go:420 +0xfb
command terminated with exit code 255

Fixes #

Fortunately the fix is pretty simple: we need to change windows.kubelet to C:\\var\\lib\\kubelet, and that is all this PR is about.
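
In chart terms, this is a one-line default change. A minimal sketch of the values.yaml entry, assuming the surrounding key layout (single-quoted so YAML keeps the backslashes literal):

windows:
  kubelet: 'C:\\var\\lib\\kubelet'   # previously 'C:\var\lib\kubelet'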

Special notes for your reviewer:

The issue was discovered on Windows GKE v1.21, where (in combination with another internal containerd issue) it led to cascading failures that brought down entire nodes.

Release note:

Update default `windows.kubelet` value to be compatible with containerd runtime

@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. kind/bug Categorizes issue or PR as related to a bug. labels Jul 29, 2022
@linux-foundation-easycla

linux-foundation-easycla bot commented Jul 29, 2022

CLA Signed

The committers listed above are authorized under a signed CLA.

@k8s-ci-robot k8s-ci-robot added the cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. label Jul 29, 2022
@k8s-ci-robot
Contributor

Welcome @luborpetr!

It looks like this is your first PR to kubernetes-csi/csi-driver-smb 🎉. Please refer to our pull request process documentation to help your PR have a smooth ride to approval.

You will be prompted by a bot to use commands during the review process. Do not be afraid to follow the prompts! It is okay to experiment. Here is the bot commands documentation.

You can also check if kubernetes-csi/csi-driver-smb has its own contribution guidelines.

You may want to refer to our testing guide if you run into trouble with your tests not passing.

If you are having difficulty getting your pull request seen, please follow the recommended escalation practices. Also, for tips and tricks in the contribution process you may want to read the Kubernetes contributor cheat sheet. We want to make sure your contribution gets all the attention it needs!

Thank you, and welcome to Kubernetes. 😃

@k8s-ci-robot
Contributor

Hi @luborpetr. Thanks for your PR.

I'm waiting for a kubernetes-csi member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Jul 29, 2022
@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. and removed cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. labels Jul 29, 2022
@andyzhangx andyzhangx (Member) left a comment

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. approved Indicates a PR has been approved by an approver from all required OWNERS files. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Jul 29, 2022
@coveralls

coveralls commented Jul 29, 2022

Pull Request Test Coverage Report for Build 2777248508

  • 0 of 0 changed or added relevant lines in 0 files are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage remained the same at 85.131%

Totals Coverage Status
Change from base Build 2704965701: 0.0%
Covered Lines: 813
Relevant Lines: 955

💛 - Coveralls

@andyzhangx
Member

/retest

@andyzhangx
Member

thanks for the contribution.
@luborpetr what's your containerd runtime on Windows? We have an e2e test on containerd Windows (containerd://1.5.11+azure-2) and the default path (C:\var\lib\kubelet\) works well there. I would like to know which containerd version is broken, thanks.

NAME                    STATUS   ROLES    AGE   VERSION   INTERNAL-IP    EXTERNAL-IP   OS-IMAGE                         KERNEL-VERSION     CONTAINER-RUNTIME
3019k8s000              Ready    agent    30m   v1.23.9   10.240.0.35    <none>        Windows Server 2019 Datacenter   10.0.17763.1879    containerd://1.4.4+unknown
3019k8s001              Ready    agent    30m   v1.23.9   10.240.0.4     <none>        Windows Server 2019 Datacenter   10.0.17763.1879    containerd://1.4.4+unknown
k8s-master-30190497-0   Ready    master   30m   v1.23.9   10.255.255.5   <none>        Ubuntu 18.04.6 LTS               5.4.0-1074-azure   containerd://1.5.11+azure-2

@luborpetr
Contributor Author

Hello @andyzhangx, we are currently using containerd 1.6.2, cluster v1.21.14-gke.700 and Windows build 10.0.17763.3046.

NAME                                                  STATUS   ROLES    AGE    VERSION            INTERNAL-IP     EXTERNAL-IP   OS-IMAGE                         KERNEL-VERSION    CONTAINER-RUNTIME
gke-79ac90-6nny                                       Ready    <none>   3d8h   v1.21.14-gke.700   x.23.48.95    <none>        Windows Server 2019 Datacenter   10.0.17763.3046   containerd://1.6.2-gke.2
gke-79ac90-xj8y                                       Ready    <none>   3d8h   v1.21.14-gke.700   x.23.48.96    <none>        Windows Server 2019 Datacenter   10.0.17763.3046   containerd://1.6.2-gke.2
gke-b4d464-5l2i                                       Ready    <none>   3d8h   v1.21.14-gke.700   x.23.48.49    <none>        Windows Server 2019 Datacenter   10.0.17763.3046   containerd://1.6.2-gke.2

@andyzhangx
Member

/test pull-csi-driver-smb-e2e-windows

@andyzhangx
Member

/retest

@andyzhangx
Member

andyzhangx commented Jul 31, 2022

this config does not work on a Windows docker node:

Events:
  Type     Reason          Age                   From               Message
  ----     ------          ----                  ----               -------
  Normal   Scheduled       15m                   default-scheduler  Successfully assigned kube-system/csi-smb-node-win-qgx9x to 3575k8s001
  Normal   Pulling         14m                   kubelet            Pulling image "registry.k8s.io/sig-storage/livenessprobe:v2.7.0"
  Normal   Pulled          14m                   kubelet            Successfully pulled image "registry.k8s.io/sig-storage/livenessprobe:v2.7.0" in 24.9767576s
  Normal   Pulling         14m                   kubelet            Pulling image "registry.k8s.io/sig-storage/csi-node-driver-registrar:v2.5.1"
  Normal   Pulled          14m                   kubelet            Successfully pulled image "registry.k8s.io/sig-storage/csi-node-driver-registrar:v2.5.1" in 3.9732219s
  Normal   Pulling         14m                   kubelet            Pulling image "k8sprow.azurecr.io/smb-csi:e2e-88df0e2d4e8eb46cd50893baa59dec9fcbcd5af7"
  Normal   Pulled          13m                   kubelet            Successfully pulled image "k8sprow.azurecr.io/smb-csi:e2e-88df0e2d4e8eb46cd50893baa59dec9fcbcd5af7" in 25.7766282s
  Normal   SandboxChanged  13m                   kubelet            Pod sandbox changed, it will be killed and re-created.
  Warning  Failed          13m (x3 over 14m)     kubelet            Error: Error response from daemon: invalid volume specification: 'C:\\var\\lib\\kubelet\\plugins\\smb.csi.k8s.io\\:C:\csi'
  Normal   Pulled          13m (x2 over 13m)     kubelet            Container image "k8sprow.azurecr.io/smb-csi:e2e-88df0e2d4e8eb46cd50893baa59dec9fcbcd5af7" already present on machine
  Warning  Failed          13m (x3 over 13m)     kubelet            Error: Error response from daemon: invalid volume specification: 'C:\\var\\lib\\kubelet\\:C:\\var\\lib\\kubelet\'
  Normal   Pulled          13m (x3 over 13m)     kubelet            Container image "registry.k8s.io/sig-storage/csi-node-driver-registrar:v2.5.1" already present on machine
  Warning  Failed          13m (x4 over 14m)     kubelet            Error: Error response from daemon: invalid volume specification: 
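
For reference, the failures above come from dockerd's own parsing of the host:container volume specification on Windows, which accepts single-backslash paths but rejects doubled ones. An illustrative contrast built from the host paths in the events above (a sketch, not taken from the chart):

    # Accepted by dockerd on Windows (single backslashes):
    hostPath:
      path: C:\var\lib\kubelet
    # Rejected with "invalid volume specification" (doubled backslashes):
    hostPath:
      path: C:\\var\\lib\\kubelet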

@luborpetr
Contributor Author

Hello @andyzhangx,

I don't know the test case, but why is it trying to mount C:\\var\\lib\\kubelet\\:C:\\var\\lib\\kubelet\ ? That does not look like something my change could introduce...

@andyzhangx
Member

Hello @andyzhangx,

I don't know the test case, but why is it trying to mount C:\\var\\lib\\kubelet\\:C:\\var\\lib\\kubelet\ ? That does not look like something my change could introduce...

@luborpetr the e2e test only runs against changes in charts/latest; right now the e2e test only fails on the Windows docker node:
https://github.com/kubernetes-csi/csi-driver-smb/pull/521/files#diff-f3812b85232f025db59dd4dd62074c651ec8eeaf0bdf899a2d5d1dccd6489bd4

@andyzhangx
Member

andyzhangx commented Jul 31, 2022

btw, I have made a new commit to apply a similar change on charts/latest in this PR.

@luborpetr
Contributor Author

Hello @andyzhangx,
I did reproduce the issue internally.
The containerd host accepts the new value properly and everything works fine, but the dockerd host completely misinterprets it.

The problem is that dockerd on Windows rejects volume and mount definitions containing double backslashes in the path. This is the reason for the invalid volume specification: 'C:\\var\\lib\\kubelet\\plugins kind of errors.

Because of that limitation and the backward-compatibility requirement, I can think of only one suitable solution now:
revert the default windows.kubelet value back to C:\var\lib\kubelet and template only the problematic definition of DRIVER_REG_SOCK_PATH.

The solution would look like (added replace function):

          livenessProbe:
            # ...
          env:
            - name: CSI_ENDPOINT
              value: unix://C:\\csi\\csi.sock
            - name: DRIVER_REG_SOCK_PATH
              value: {{ .Values.windows.kubelet | replace "\\" "\\\\" }}\\plugins\\{{ .Values.driver.name }}\\csi.sock
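
With windows.kubelet reverted to C:\var\lib\kubelet, that replace pipeline would render the env entry as follows (a sketch of the expected output, assuming driver.name is smb.csi.k8s.io as in the probe command above):

            - name: DRIVER_REG_SOCK_PATH
              value: C:\\var\\lib\\kubelet\\plugins\\smb.csi.k8s.io\\csi.sock

The doubled backslashes land only in the env var the registrar reads, while the plain single-backslash default keeps dockerd's volume parser happy.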

Tell me what you think @andyzhangx


Btw, I have noticed another stability issue on containerd Windows hosts: the node-driver-registrar and liveness-probe containers are OOM-killed from time to time. Very likely it would be resolved by increasing the current 100Mi limit to 128Mi. I'll open a separate PR for the memory limit; see the sketch below.
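
A sketch of what that follow-up could look like in the chart values; the field path is assumed, and only the 100Mi-to-128Mi change comes from this thread:

  resources:
    limits:
      memory: 128Mi   # raised from the current 100Mi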

@andyzhangx
Member

@luborpetr that lgtm, pls make changes under charts/latest; we have both docker and containerd Windows e2e tests to cover.
About the OOM issue: as far as I know, k8s won't kill a Windows pod when it exceeds its resource limits; that's a limitation of k8s on Windows.

@luborpetr
Contributor Author

You are right, I was not precise in my statement. Windows is not restarting pods, but the individual containers do get killed.
I'll share details in the new PR.

@luborpetr
Contributor Author

/retest

@luborpetr
Contributor Author

@andyzhangx can you please point me to the script you used for the Helm package creation yesterday?
I was surprised that the pipeline was still failing; it turned out to be because I did not push the latest packages.

@luborpetr
Contributor Author

Never mind, I did that manually; looks like the tests are passing now...

@andyzhangx andyzhangx (Member) left a comment

/lgtm
/approve

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Aug 2, 2022
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: andyzhangx, luborpetr, vilovgh

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot merged commit 50166e6 into kubernetes-csi:master Aug 2, 2022