Windows nodes become Not Ready occasionally #1765
Triage required from @Azure/aks-pm
@keikhara is this something you can help with?
I found a similar issue
@zhiweiv I suggest you file an issue in https://github.com/kubernetes/kubernetes
For the kubelet panic issue, I will file an issue in k/k. But for this error, I guess it is related to Windows, or hcsshim?
From the logs, it seems that RemoveContainer is called multiple times for the same container ID.
From the remarks here, the operating system will return ERROR_ACCESS_DENIED for every request after the first one. I guess CreateFile is called by RemoveContainer inside Docker (see the sketch below).
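To make that concrete, here is a minimal sketch of the documented "delete pending" behaviour on Windows: once a file has been marked for deletion while another handle is still open, any further CreateFile call fails with ERROR_ACCESS_DENIED. This is my own illustration written for this discussion, not code from kubelet, dockerd, or hcsshim; it assumes the golang.org/x/sys/windows package, and the temp-file path is a placeholder.

```go
//go:build windows

// Illustration of the Windows delete-pending behaviour referenced above.
package main

import (
	"fmt"
	"log"
	"os"
	"path/filepath"

	"golang.org/x/sys/windows"
)

func main() {
	name := filepath.Join(os.TempDir(), "pending-delete-demo.log")
	path, err := windows.UTF16PtrFromString(name)
	if err != nil {
		log.Fatal(err)
	}

	// Create the file and keep the handle open. FILE_SHARE_DELETE lets a
	// later DeleteFile call succeed while this handle is still open.
	h, err := windows.CreateFile(path,
		windows.GENERIC_READ|windows.GENERIC_WRITE,
		windows.FILE_SHARE_READ|windows.FILE_SHARE_WRITE|windows.FILE_SHARE_DELETE,
		nil, windows.CREATE_ALWAYS, windows.FILE_ATTRIBUTE_NORMAL, 0)
	if err != nil {
		log.Fatal(err)
	}
	defer windows.CloseHandle(h)

	// First delete request: marks the file for deletion on close and succeeds.
	if err := windows.DeleteFile(path); err != nil {
		log.Fatal(err)
	}

	// While deletion is pending, any further attempt to open the file fails
	// with ERROR_ACCESS_DENIED -- the same error dockerd surfaces as
	// "Access is denied." in the kubelet log below.
	h2, err := windows.CreateFile(path, windows.GENERIC_READ,
		windows.FILE_SHARE_READ|windows.FILE_SHARE_WRITE|windows.FILE_SHARE_DELETE,
		nil, windows.OPEN_EXISTING, windows.FILE_ATTRIBUTE_NORMAL, 0)
	if err == windows.ERROR_ACCESS_DENIED {
		fmt.Println("second open failed with ERROR_ACCESS_DENIED, as documented")
	} else if err == nil {
		windows.CloseHandle(h2)
	}
}
```

So if dockerd's RemoveContainer path tries to open the container's json.log again after a first delete request has already been issued, an "Access is denied." error like the one in the kubelet log would be the expected outcome.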
@craiglpeters FYI
The fix has been merged to the k8s 1.17 branch (kubernetes/kubernetes#94042). The next release should be 1.17.12 this month; do you know when this will be included in AKS?
Does Microsoft have any update on when this fix will be included in AKS? It regularly causes wasted node resources that require manual clean-up on my AKS cluster. This bug is still present on 1.18.6.
1.18.9 fixed this. I guess it should arrive soon; it has been 20 days since the last release.
Correct: we plan to release it with the Windows 10C patch.
@AbelHu |
@zhiweiv new Kubernetes versions were not enabled in the 20201019 release due to some issues. We are trying to add 1.17.13, 1.18.10, and 1.19.3 in 20201026.
New k8s versions are available now. |
What happened:
We have Windows pools that run job pods: a lot of job pods run, complete, and are then deleted on them. Windows nodes in these pools occasionally become Not Ready, and pods on those nodes hang in the Terminating state.
I can see a lot of errors in the kubelet logs, for example:
E0803 17:19:44.164137 4368 remote_runtime.go:261] RemoveContainer "2a57a5795b0b4d8fb673696b5525f36070ad20e5cbc09a95dceda6c90272fb64" from runtime service failed: rpc error: code = Unknown desc = failed to remove container "2a57a5795b0b4d8fb673696b5525f36070ad20e5cbc09a95dceda6c90272fb64": Error response from daemon: unable to remove filesystem for 2a57a5795b0b4d8fb673696b5525f36070ad20e5cbc09a95dceda6c90272fb64: CreateFile C:\ProgramData\docker\containers\2a57a5795b0b4d8fb673696b5525f36070ad20e5cbc09a95dceda6c90272fb64\2a57a5795b0b4d8fb673696b5525f36070ad20e5cbc09a95dceda6c90272fb64-json.log: Access is denied.
Full logs:
kubelet.err.log.zip
What you expected to happen:
Windows nodes keep working normally.
How to reproduce it (as minimally and precisely as possible):
Run a high volume of pods on Windows nodes and delete them after they complete; a rough client-go sketch of this churn is below.
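The following is a rough reproduction sketch, not code from the original report. It assumes client-go is available and that the default kubeconfig points at a cluster with a Windows node pool; the namespace, pod count, container image, and sleep duration are placeholders to adjust for your cluster.

```go
// Rough reproduction sketch: churn many short-lived pods on Windows nodes and
// then delete them, approximating the job workload described in this issue.
package main

import (
	"context"
	"fmt"
	"log"
	"time"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		log.Fatal(err)
	}
	client, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		log.Fatal(err)
	}

	ctx := context.Background()
	const ns, count = "default", 200

	// Create a batch of pods that run a trivial command and exit.
	for i := 0; i < count; i++ {
		pod := &corev1.Pod{
			ObjectMeta: metav1.ObjectMeta{Name: fmt.Sprintf("churn-%d", i)},
			Spec: corev1.PodSpec{
				NodeSelector:  map[string]string{"kubernetes.io/os": "windows"},
				RestartPolicy: corev1.RestartPolicyNever,
				Containers: []corev1.Container{{
					Name:    "work",
					Image:   "mcr.microsoft.com/windows/servercore:ltsc2019",
					Command: []string{"cmd", "/c", "echo done"},
				}},
			},
		}
		if _, err := client.CoreV1().Pods(ns).Create(ctx, pod, metav1.CreateOptions{}); err != nil {
			log.Printf("create %s: %v", pod.Name, err)
		}
	}

	// Give the pods time to complete, then delete them all, mimicking the
	// clean-up phase of the job workload.
	time.Sleep(2 * time.Minute)
	for i := 0; i < count; i++ {
		name := fmt.Sprintf("churn-%d", i)
		if err := client.CoreV1().Pods(ns).Delete(ctx, name, metav1.DeleteOptions{}); err != nil {
			log.Printf("delete %s: %v", name, err)
		}
	}
}
```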
Anything else we need to know?:
I guess this may be caused by concurrent pod creation or deletion.
Environment:
Kubernetes version (use kubectl version): 1.17.7
Size of cluster (how many worker nodes are in the cluster?): 10 nodes with autoscaling
General description of workloads in the cluster (e.g. HTTP microservices, Java app, Ruby on Rails, machine learning, etc.): .NET Framework app to process background tasks