High cpu usage of powershell processes triggered by csi-proxy #193

dazhiw · 2022-01-13T23:51:24Z

What happened:
We are running performance tests for vsphere-csi-driver (https://github.com/kubernetes-sigs/vsphere-csi-driver) on a k8s cluster with 20 Windows worker nodes. Each node has 4 CPUs and 16GB of memory. The k8s version is 1.22.3 and the Windows version is Windows Server 2019. The vSphere CSI node plugin uses csi-proxy for privileged operations on host devices. The test repeatedly issues CreatePod followed by DeletePod. Each Pod starts a single container with the image "mcr.microsoft.com/oss/kubernetes/pause:1.4.1" and mounts one persistent volume.

During the test we observed high CPU usage on the Windows worker nodes, and most of it came from the PowerShell processes triggered by csi-proxy. With 1.4 Pod creations per second, the average CPU usage on each node is about 75% of the total CPU capacity, and the PowerShell processes triggered by csi-proxy account for about 45% of the total CPU capacity. With 2.4 Pod creations per second, the CPU usage on each node reached almost 100%, and the PowerShell processes triggered by csi-proxy account for about 68%. As a comparison, for vsphere-csi-driver on Linux nodes, with 3 Pod creations per second the CPU usage on each node is consistently below 10%.

What you expected to happen:
Reduce the cpu cost of csi-proxy.

How to reproduce it:
Deploy vsphere-csi-driver on a k8s cluster with Windows nodes, create a Pod with a PV mount, then delete the Pod.

Anything else we need to know?:

Environment:

CSI Driver version: vsphere CSI 2.5
Kubernetes version (use kubectl version): 1.22.3
OS (e.g. from /etc/os-release): Windows Server 2019
Kernel (e.g. uname -a):
Install tools:
Others:


divyenpatel · 2022-06-30T18:17:32Z

/reopen

k8s-ci-robot · 2022-06-30T18:17:36Z

@divyenpatel: You can't reopen an issue/PR unless you authored it or you are a collaborator.

In response to this:

/reopen

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

divyenpatel · 2022-06-30T18:38:47Z

@dazhiw

@pradeep-hegde's change only partially fixes the issue and does not deliver the much-needed performance improvement.
Can we keep this issue open?

k8s-ci-robot · 2022-06-30T19:10:56Z

@wdazhi2020: You can't reopen an issue/PR unless you authored it or you are a collaborator.

In response to this:

/reopen

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

dazhiw · 2022-06-30T19:13:57Z

/reopen

k8s-ci-robot · 2022-06-30T19:14:02Z

@dazhiw: Reopened this issue.

In response to this:

/reopen

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

msau42 · 2022-07-20T03:01:57Z

@divyenpatel did we compare with in-tree Windows? Windows is known to consume more resources than Linux. The question is whether we think there's a regression between in-tree and CSI on Windows.

k8s-triage-robot · 2022-10-18T03:33:48Z

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

After 90d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

Mark this issue or PR as fresh with /remove-lifecycle stale
Mark this issue or PR as rotten with /lifecycle rotten
Close this issue or PR with /close
Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

mauriciopoppe · 2022-10-21T21:31:18Z

/remove-lifecycle stale

mauriciopoppe · 2022-11-19T21:21:32Z

@alexander-ding please post the results of your experiments here

alexander-ding · 2022-11-21T04:34:44Z

> @alexander-ding please post the results of your experiments here

The experiments I ran were not related to resource usage unfortunately. I was measuring latency and throughput for CSI Proxy. The resource usage related experiments were on Linux nodes.

k8s-triage-robot · 2023-02-19T04:52:43Z

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

After 90d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

Mark this issue as fresh with /remove-lifecycle stale
Close this issue with /close
Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot · 2023-03-21T05:10:04Z

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

After 90d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

Mark this issue as fresh with /remove-lifecycle rotten
Close this issue with /close
Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

k8s-triage-robot · 2023-04-20T05:32:35Z

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

After 90d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

Reopen this issue with /reopen
Mark this issue as fresh with /remove-lifecycle rotten
Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

k8s-ci-robot · 2023-04-20T05:32:38Z

@k8s-triage-robot: Closing this issue, marking it as "Not Planned".

In response to this:

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

After 90d of inactivity, lifecycle/stale is applied

After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied

After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

Reopen this issue with /reopen

Mark this issue as fresh with /remove-lifecycle rotten

Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

mauriciopoppe · 2024-04-15T16:56:08Z

/remove-lifecycle rotten
/lifecycle frozen

knabben · 2024-04-16T18:06:35Z

Circling back to this. I tracked it down to a large number of calls to (Get-Item -Path $Env:mountpath).Target that appear to hang in Process Explorer, even after all the pods with PVCs are deleted.

I0416 10:23:52.948472    9092 server.go:274] GetVolumeIDFromTargetPath: Request: &{TargetPath:c:\var\lib\kubelet\pods\83a8a951-2b74-4cee-bafc-b6cb6790ddbd\volumes\kubernetes.io~csi\pvc-45bca5d0-2244-43de-983a-90e6dbc3322e\mount}
I0416 10:23:52.948937    9092 server.go:274] GetVolumeIDFromTargetPath: Request: &{TargetPath:c:\var\lib\kubelet\pods\0d746ae0-6a44-4b78-9a5b-d05fa146fd06\volumes\kubernetes.io~csi\pvc-45bca5d0-2244-43de-983a-90e6dbc3322e\mount}
I0416 10:23:52.953077    9092 utils.go:15] Executing command: "C:\\Windows\\System32\\WindowsPowerShell\\v1.0\\powershell.exe -Mta -NoProfile -Command (Get-Item -Path $Env:mountpath).Target"
I0416 10:23:52.955659    9092 utils.go:15] Executing command: "C:\\Windows\\System32\\WindowsPowerShell\\v1.0\\powershell.exe -Mta -NoProfile -Command (Get-Item -Path $Env:mountpath).Target"
I0416 10:23:53.322869    9092 utils.go:15] Executing command: "C:\\Windows\\System32\\WindowsPowerShell\\v1.0\\powershell.exe -Mta -NoProfile -Command (Get-Item -Path $Env:mountpath).Target"
I0416 10:23:53.483584    9092 utils.go:15] Executing command: "C:\\Windows\\System32\\WindowsPowerShell\\v1.0\\powershell.exe -Mta -NoProfile -Command (Get-Item -Path $Env:mountpath).Target"
I0416 10:23:53.486178    9092 utils.go:15] Executing command: "C:\\Windows\\System32\\WindowsPowerShell\\v1.0\\powershell.exe -Mta -NoProfile -Command (Get-Item -Path $Env:mountpath).Target"
I0416 10:23:53.519391    9092 utils.go:15] Executing command: "C:\\Windows\\System32\\WindowsPowerShell\\v1.0\\powershell.exe -Mta -NoProfile -Command (Get-Item -Path $Env:mountpath).Target"
I0416 10:23:55.167002    9092 utils.go:15] Executing command: "C:\\Windows\\System32\\WindowsPowerShell\\v1.0\\powershell.exe -Mta -NoProfile -Command (Get-Item -Path $Env:mountpath).Target"
I0416 10:23:56.050764    9092 utils.go:15] Executing command: "C:\\Windows\\System32\\WindowsPowerShell\\v1.0\\powershell.exe -Mta -NoProfile -Command (Get-Item -Path $Env:mountpath).Target"
I0416 10:23:57.028808    9092 utils.go:15] Executing command: "C:\\Windows\\System32\\WindowsPowerShell\\v1.0\\powershell.exe -Mta -NoProfile -Command (Get-Item -Path $Env:mountpath).Target"
I0416 10:23:58.569111    9092 utils.go:15] Executing command: "C:\\Windows\\System32\\WindowsPowerShell\\v1.0\\powershell.exe -Mta -NoProfile -Command (Get-Item -Path $Env:mountpath).Target"
I0416 10:23:59.037196    9092 server.go:204] VolumeStats: returned: Capacity 5350879232 Used 29339648
I0416 10:23:59.052682    9092 server.go:204] VolumeStats: returned: Capacity 5350879232 Used 29339648
I0416 10:23:59.142338    9092 utils.go:15] Executing command: "C:\\Windows\\System32\\WindowsPowerShell\\v1.0\\powershell.exe -Mta -NoProfile -Command (Get-Item -Path $Env:mountpath).Target"
I0416 10:23:59.402459    9092 server.go:204] VolumeStats: returned: Capacity 5350879232 Used 29339648
I0416 10:23:59.629165    9092 utils.go:15] Executing command: "C:\\Windows\\System32\\WindowsPowerShell\\v1.0\\powershell.exe -Mta -NoProfile -Command (Get-Item -Path $Env:mountpath).Target"
I0416 10:24:00.037973    9092 utils.go:15] Executing command: "C:\\Windows\\System32\\WindowsPowerShell\\v1.0\\powershell.exe -Mta -NoProfile -Command (Get-Item -Path $Env:mountpath).Target"

Dumping the stack trace shows a very odd chain of calls to the same function, which is very different from the normal behavior.

runtime/debug.Stack()
        /home/aknabben/.asdf/installs/golang/1.22.1/go/src/runtime/debug/stack.go:24 +0x5e
github.com/kubernetes-csi/csi-proxy/pkg/utils.RunPowershellCmd({0x1756b3d, 0x26}, {0xc0002e3c40, 0x1, 0x1?})
        /home/aknabben/go/src/kubernetes-csi/csi-proxy/pkg/utils/utils.go:17 +0xee
github.com/kubernetes-csi/csi-proxy/pkg/os/volume.getTarget({0xc0000eb200?, 0xc000406400?})
        /home/aknabben/go/src/kubernetes-csi/csi-proxy/pkg/os/volume/api.go:283 +0xa5
github.com/kubernetes-csi/csi-proxy/pkg/os/volume.getTarget({0xc0000eaf00?, 0xc0000d1c00?})
        /home/aknabben/go/src/kubernetes-csi/csi-proxy/pkg/os/volume/api.go:289 +0x205
github.com/kubernetes-csi/csi-proxy/pkg/os/volume.getTarget({0xc000218c00?, 0xc0001de400?})
        /home/aknabben/go/src/kubernetes-csi/csi-proxy/pkg/os/volume/api.go:289 +0x205
github.com/kubernetes-csi/csi-proxy/pkg/os/volume.getTarget({0xc0000eac00?, 0xc0000d1400?})
        /home/aknabben/go/src/kubernetes-csi/csi-proxy/pkg/os/volume/api.go:289 +0x205
github.com/kubernetes-csi/csi-proxy/pkg/os/volume.getTarget({0xc0003ca300?, 0xc00007d000?})
        /home/aknabben/go/src/kubernetes-csi/csi-proxy/pkg/os/volume/api.go:289 +0x205
github.com/kubernetes-csi/csi-proxy/pkg/os/volume.getTarget({0xc000218900?, 0xc000117c00?})
        /home/aknabben/go/src/kubernetes-csi/csi-proxy/pkg/os/volume/api.go:289 +0x205
github.com/kubernetes-csi/csi-proxy/pkg/os/volume.getTarget({0xc0000ea900?, 0xc0000d0c00?})
        /home/aknabben/go/src/kubernetes-csi/csi-proxy/pkg/os/volume/api.go:289 +0x205
github.com/kubernetes-csi/csi-proxy/pkg/os/volume.getTarget({0xc000218600?, 0xc000117400?})
        /home/aknabben/go/src/kubernetes-csi/csi-proxy/pkg/os/volume/api.go:289 +0x205
...
github.com/kubernetes-csi/csi-proxy/pkg/os/volume.VolumeAPI.GetVolumeIDFromTargetPath({}, {0xc0000f3290, 0x85})
        /home/aknabben/go/src/kubernetes-csi/csi-proxy/pkg/os/volume/api.go:271 +0x25
github.com/kubernetes-csi/csi-proxy/pkg/server/volume.(*Server).GetVolumeIDFromTargetPath(0xc0003b80a0, {0x0?, 0xc000161008?}, 0xc000415360, {0x1, 0x2, 0x0, {0x173b627, 0x2}})
        /home/aknabben/go/src/kubernetes-csi/csi-proxy/pkg/server/volume/server.go:281 +0x133
github.com/kubernetes-csi/csi-proxy/pkg/server/volume/impl/v1.(*versionedAPI).GetVolumeIDFromTargetPath(0xc0003b81c0, {0x180b870, 0xc0004f45d0}, 0xc0005c0c40)
        /home/aknabben/go/src/kubernetes-csi/csi-proxy/pkg/server/volume/impl/v1/server_generated.go:74 +0xbc
github.com/kubernetes-csi/csi-proxy/client/api/volume/v1._Volume_GetVolumeIDFromTargetPath_Handler({0x170c140, 0xc0003b81c0}, {0x180b870, 0xc0004f45d0}, 0xc00011ee00, 0x0)
        /home/aknabben/go/src/kubernetes-csi/csi-proxy/vendor/github.com/kubernetes-csi/csi-proxy/client/api/volume/v1/api.pb.go:1799 +0x1a6
google.golang.org/grpc.(*Server).processUnaryRPC(0xc0003f1a40, {0x180b870, 0xc0004f4420}, {0x180f1c0, 0xc0003feb60}, 0xc0001417a0, 0xc0003b34d0, 0x1c09520, 0x0)
        /home/aknabben/go/src/kubernetes-csi/csi-proxy/vendor/google.golang.org/grpc/server.go:1343 +0xdd1
google.golang.org/grpc.(*Server).handleStream(0xc0003f1a40, {0x180f1c0, 0xc0003feb60}, 0xc0001417a0)
        /home/aknabben/go/src/kubernetes-csi/csi-proxy/vendor/google.golang.org/grpc/server.go:1737 +0xc47
google.golang.org/grpc.(*Server).serveStreams.func1.1()
        /home/aknabben/go/src/kubernetes-csi/csi-proxy/vendor/google.golang.org/grpc/server.go:986 +0x86
created by google.golang.org/grpc.(*Server).serveStreams.func1 in goroutine 122
        /home/aknabben/go/src/kubernetes-csi/csi-proxy/vendor/google.golang.org/grpc/server.go:997 +0x136

andyzhangx · 2024-05-13T04:05:04Z

For VolumeStats, we could use the windows.GetDiskFreeSpaceEx API instead, which would avoid this issue. See the refined PR here:
kubernetes-sigs/azuredisk-csi-driver#2287

randomvariable · 2024-10-10T13:42:50Z

The real answer to this is to stop using PowerShell, and use Win32 APIs where possible, falling back to WMI invocation. I have POC code for some of this that could be ported. Will take a while to complete, but happy to start taking this on?

mauriciopoppe · 2024-10-10T17:45:15Z

Moving from PowerShell to the Win32 API would be a nice change, at the expense of making the code harder to read. For example,

csi-proxy/pkg/os/disk/api.go

Line 190 in 07be14d

func (DiskAPI) GetDiskPage83ID(disk syscall.Handle) (string, error) {

is harder to read than a PowerShell command. In this case I think there wasn't a PowerShell command, which is why the team implemented it with the Win32 API directly.

For the current issue, @knabben did a great analysis of the current implementation in #193 (comment) and opened a PR fixing it in #336, but we haven't been able to merge it because the presubmit tests are broken: a Windows module was disabled in GitHub Actions. I tried to fix that in #348 but had no luck.

For the next steps, the plan of action was to:

Fix the presubmit error (following up on Update the github action workflow to run the integration tests #348). I asked for help from someone that's more familiar with enabling a module in Github Actions
Then we'd be able to submit Failing closed after maximum retry is achieved to avoid inf recursion #336
Make a new 1.x patch release
Possibly backport it to the library branch https://github.com/kubernetes-csi/csi-proxy/tree/v1.x (shouldn't be that hard)

> The real answer to this is to stop using PowerShell, and use Win32 APIs where possible, falling back to WMI invocation. I have POC code for some of this that could be ported. Will take a while to complete, but happy to start taking this on?

Changes would be blocked by the presubmit error, so we should try to fix the SMB tests first.

randomvariable · 2024-10-10T20:27:21Z

If we're not keen on using Win32 API directly because of readability, then we should strongly consider using the MSFT_Disk & Win32_DiskPartition WMI classes, which is what PowerShell is wrapping, but without the .NET overhead.

laozc · 2024-10-14T04:08:23Z

You can see what switching to WMI classes would look like in this WIP branch.
laozc@b7483ac

This was referenced Feb 2, 2022

Reduce CSI proxy CPU usage pradeep-hegde/csi-proxy#1

Merged

Reduce CSI proxy CPU usage #197

Merged

k8s-ci-robot closed this as completed in #197 Feb 7, 2022

k8s-ci-robot reopened this Jun 30, 2022

k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Oct 18, 2022

k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Oct 21, 2022

k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Feb 19, 2023

k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Mar 21, 2023

k8s-ci-robot closed this as not planned Won't fix, can't repro, duplicate, stale Apr 20, 2023

mauriciopoppe reopened this Apr 15, 2024

k8s-ci-robot added lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. and removed lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. labels Apr 15, 2024

This was referenced Apr 16, 2024

Bumping grpc and protobuf modules #335

Closed

Failing closed after maximum retry is achieved to avoid inf recursion #336

Merged

laozc mentioned this issue Oct 22, 2024

Use WMI instead of PowerShell for OS operations #360

Open

k8s-ci-robot closed this as completed in #336 Oct 28, 2024
