Pods stuck "Terminating" on Windows nodes #106
Comments
If you manually remove the json.log file, the container can be removed and the pod can then be terminated. |
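For reference, a minimal Go sketch of this manual cleanup, assuming the default dockerd data root on Windows and a hypothetical container ID (both based on the error messages later in this thread). It simply deletes the stuck JSON log files so Docker can finish removing the container:

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

func main() {
	// Hypothetical container ID: substitute the ID from the
	// "unable to remove filesystem" error for the stuck pod.
	containerID := "0123456789abcdef"

	// Default dockerd data root on Windows, as seen in the errors in this issue.
	dir := filepath.Join(`C:\ProgramData\docker\containers`, containerID)

	// Match the JSON log file and any rotated copies (-json.log, -json.log.1, ...).
	matches, err := filepath.Glob(filepath.Join(dir, "*-json.log*"))
	if err != nil {
		panic(err)
	}
	for _, m := range matches {
		if err := os.Remove(m); err != nil {
			fmt.Printf("could not remove %s: %v\n", m, err)
			continue
		}
		fmt.Printf("removed %s\n", m)
	}
}
```

In practice people just delete the file in Explorer or PowerShell; the point is only that removing the -json.log file lets container removal, and then pod termination, proceed.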
It is seen in GKE. |
We took another look. It seems that the failure was caused by our logging agent's handling of the symlinked log file, and symlink commands result in an access-denied error in Windows containers. So I think the issue I reported here is a duplicate of #97. |
More logs for this issue.
2021-04-15T23:06:15Z Handler for DELETE /v1.40/containers/[CONTAINER-ID] returned error: unable to remove filesystem for [CONTAINER-ID]: CreateFile C:\ProgramData\docker\containers\[CONTAINER-ID]\[CONTAINER-ID]-json.log: Access is denied.
2021-04-15T23:06:15.472679Z SyncLoop (PLEG): "[POD-NAME]_NAME", event: &pleg.PodLifecycleEvent{ID:"[ID]", Type:"ContainerDied", Data:"[DATA]"}
2021-04-15T23:06:15.472679Z [fake cpumanager] RemoveContainer (container id: [CONTAINER-ID])
2021-04-15T23:06:16.024224Z detected rotation of /var/log/containers/[POD-NAME]_[NAMESPACE-NAME]_[CONTAINER-NAME]-[CONTAINER-ID].log; waiting 5 seconds |
Any update on this issue? |
Below is the docker version used: |
What OS version is your host? Can run |
It was reproduced on 10.0.17763.1577, but the issue is gone now after upgrading to 10.0.17763.1757. Yeah, we can close the bug now. |
Sad to reopen this issue, but we see it happening again in our customer's cluster. And this time the cluster has both the kubelet fix and the OS fix. Kernel version: 10.0.17763.1757 |
Is this reproduced in later versions of kubernetes? |
I saw a related issue in moby: moby/moby#39274. The error seems to also cause our logging agent to fail to read the container logs. |
Yes, this also repros in 1.19. See the versions for the latest repro in our internal test: |
apiVersion: apps/v1
kind: Deployment
metadata:
  name: windows-log-test
  labels:
    app: windows-log-test
spec:
  selector:
    matchLabels:
      app: windows-log-test
  replicas: 2
  template:
    metadata:
      labels:
        app: windows-log-test
    spec:
      containers:
      - name: winloggertest
        image: gcr.io/gke-release/pause-win:1.6.1
        command:
        - powershell.exe
        - -command
        - Write-Host('log some message...')
      nodeSelector:
        kubernetes.io/os: windows
      tolerations:
      - key: "node.kubernetes.io/os"
        operator: "Equal"
        value: "windows"
        effect: "NoSchedule"
In the kubelet logs you'll see:
After a few moments you'll see this:
|
This repros in 1.20.5 with Docker 19.3.14. @jsturtevant |
I've been able to reproduce with @jeremyje's deployment spec in Azure with 1.19 and 19.3.14.
|
Can reproduce with AKS 1.20.2 / Node Image Version "AKSWindows-2019-17763.1817.210310". Thanks in advance! |
@kevpar do you have any insights here? |
It sounds like this issue is about getting access-denied errors when removing the container log file. The only thing that comes to mind is that we may be trying to open a symlink pointing to a directory where the symlink was created without the directory flag. |
The file is C:\ProgramData\docker\containers\[ID]\[ID]-json.log, so it seems to be a log file created by the docker daemon. |
@kevpar can you provide more details about the workaround, e.g. the steps to test it? |
I don't know if there is a viable workaround here. I was asking about symlinks because of the earlier comment that this issue seemed like a duplicate of #97. Reading through this again, though, it seems like the issue is internal to Docker: it is failing to delete the container log file at C:\ProgramData\docker\containers\[ID]\[ID]-json.log.
Assuming my understanding above is correct, how is the symlink in /var/log/containers involved here? It's not clear what could cause Docker to get an access-denied error when deleting the container log file. If it's due to something else holding the file open without sharing, I would expect the error to be a sharing violation rather than access denied. |
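To illustrate the distinction being drawn (this is not Docker's code; the scratch file and flags are made up for the demo), here is a small Windows-only Go sketch using golang.org/x/sys/windows: holding a file open without FILE_SHARE_DELETE makes a concurrent delete fail with ERROR_SHARING_VIOLATION, which is a different error from ERROR_ACCESS_DENIED:

```go
//go:build windows

package main

import (
	"errors"
	"fmt"
	"os"
	"path/filepath"

	"golang.org/x/sys/windows"
)

func main() {
	// Arbitrary scratch file standing in for a container's -json.log.
	path := filepath.Join(os.TempDir(), "demo-json.log")
	if err := os.WriteFile(path, []byte("log line\n"), 0o644); err != nil {
		panic(err)
	}

	// Open the file the "unfriendly" way: allow other readers, but do not
	// grant FILE_SHARE_DELETE, so deletes/renames by other processes fail.
	name, err := windows.UTF16PtrFromString(path)
	if err != nil {
		panic(err)
	}
	h, err := windows.CreateFile(name,
		windows.GENERIC_READ,
		windows.FILE_SHARE_READ, // no FILE_SHARE_DELETE, no FILE_SHARE_WRITE
		nil,
		windows.OPEN_EXISTING,
		windows.FILE_ATTRIBUTE_NORMAL,
		0)
	if err != nil {
		panic(err)
	}
	defer windows.CloseHandle(h)

	// Deleting while that handle is open reports a sharing violation,
	// not "Access is denied".
	err = os.Remove(path)
	fmt.Println("remove error:", err)
	fmt.Println("sharing violation?", errors.Is(err, windows.ERROR_SHARING_VIOLATION))
	fmt.Println("access denied?", errors.Is(err, windows.ERROR_ACCESS_DENIED))
}
```

That is the point of the comment: a handle held open without delete sharing normally surfaces as a sharing violation, not as access denied.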
Maybe the previous fix doesn't fully make the issue described in this comment go away? |
Is this related? moby/moby@2502db6 Will that fix the issue? That fix is only in Docker 20.10 |
I will check if this issue repros for docker 20.10 |
Confirmed the issue still repros for v1.20.7 with Container Runtime Version: docker://20.10.4 |
If the issue still repros then, as far as I can understand, docker is just calling the standard Go library to remove the files. This looks like either a golang implementation issue or an underlying Windows issue (bad error reporting?). In our case, the container directory actually is already gone (at least not visible) on the node when we get this error. |
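Not Docker's actual code path, but a hedged sketch of the kind of handling this comment implies: remove the container directory with the standard library and, on failure, re-check whether the directory is in fact already gone before retrying (the directory path and retry policy are illustrative):

```go
package main

import (
	"errors"
	"fmt"
	"os"
	"time"
)

// removeWithRetry removes dir with os.RemoveAll, retrying a few times when
// Windows reports "Access is denied" (often a transient handle held by
// another process, e.g. a log collector). If the directory turns out to be
// gone already, it treats that as success.
func removeWithRetry(dir string, attempts int) error {
	var err error
	for i := 0; i < attempts; i++ {
		if err = os.RemoveAll(dir); err == nil {
			return nil
		}
		// The situation described in this comment: the error says the
		// removal failed, yet the directory is no longer visible.
		if _, statErr := os.Stat(dir); errors.Is(statErr, os.ErrNotExist) {
			return nil
		}
		time.Sleep(500 * time.Millisecond)
	}
	return fmt.Errorf("giving up on %s: %w", dir, err)
}

func main() {
	// Hypothetical container directory matching the errors in this thread.
	dir := `C:\ProgramData\docker\containers\0123456789abcdef`
	if err := removeWithRetry(dir, 5); err != nil {
		fmt.Println(err)
	}
}
```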
Confirmed the issue repros for v1.20.6-gke.1000 but doesn't repro for 1.21.1-gke.1800 and 1.21.1-gke.2200. The main difference between those versions is the docker runtime: v1.20.6-gke.1000 has docker://19.3.14, while 1.21.1-gke.1800 and 1.21.1-gke.2200 have docker://20.10.4. The issue also doesn't repro for containerd. Lantao's guess was right. My previous comment about reproing for v1.20.7 was wrong. |
Closing this as the issue is fixed in moby/moby@2502db6. Anyone who experiences this should upgrade to docker://20.10.4 or switch to containerd. |
The issue came back. I think last time I reproed I didn't start and restart pods frequently enough. I will try switching to 20.10.6. |
We have an MS support case opened for this issue and plan to have it transferred to Mirantis support so the Mirantis team can take a look. |
If we disable the logging agent (fluentd or fluentbit) on the node, the "Access is denied" error goes away. So the logging agent might be the one holding the container log file handle. |
Both fluentd and fluentbit have fixes for this issue. The fluentbit fix is in version 1.5+. Some edge cases may still trigger the access-is-denied error, but it happens much less often with the fix, and a restart of fluentbit unsticks the pods stuck in Terminating. The fluentd fix has just been merged and will be out in its next release. |
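For context, the general shape of such a fix on Windows (a sketch only; fluentd and fluent-bit are written in Ruby and C, so this Go version is purely illustrative) is to tail the log file with all three share flags, including FILE_SHARE_DELETE, so dockerd can rotate or delete the file while it is being read:

```go
//go:build windows

package main

import (
	"fmt"
	"os"
	"path/filepath"

	"golang.org/x/sys/windows"
)

// openForTail opens path for reading with FILE_SHARE_READ|WRITE|DELETE so the
// writer can keep appending and the owner can rotate or delete the file while
// we hold it open. Illustrative only; not fluentd/fluent-bit code.
func openForTail(path string) (windows.Handle, error) {
	name, err := windows.UTF16PtrFromString(path)
	if err != nil {
		return windows.InvalidHandle, err
	}
	return windows.CreateFile(name,
		windows.GENERIC_READ,
		windows.FILE_SHARE_READ|windows.FILE_SHARE_WRITE|windows.FILE_SHARE_DELETE,
		nil,
		windows.OPEN_EXISTING,
		windows.FILE_ATTRIBUTE_NORMAL,
		0)
}

func main() {
	// Hypothetical container log path, matching the errors in this thread.
	path := filepath.Join(`C:\ProgramData\docker\containers`,
		"0123456789abcdef", "0123456789abcdef-json.log")
	h, err := openForTail(path)
	if err != nil {
		fmt.Fprintln(os.Stderr, "open failed:", err)
		return
	}
	defer windows.CloseHandle(h)
	fmt.Println("file opened without blocking deletion or rotation")
}
```

Opened this way, the tailer no longer blocks dockerd from deleting the log during container removal, which matches the earlier observation that disabling the logging agent makes the error disappear.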
I tested fluentd v1.13.3, which is the latest fluentd release that just came out and has the fix we need, and found things have much improved. Before the fix, "RemoveContainer" got stuck because of the "access is denied" error; after the fix, "RemoveContainer" eventually succeeds after several retries and the pod can proceed to terminate. |
Closing this issue as the fix is now in the latest releases of fluentd and fluentbit. |
We have the same issue with fluentbit 1.8.10 and EKS:
|
@Norfolc - do you have a repro? I have an edge case repro for the issue logged at fluent/fluent-bit#3892 |
@lizhuqi I have a repro:
|
We're seeing pods that fail to terminate and have to manually run "kubectl delete --force" to force terminate the pods.
See the error message:
RemoveContainer "[ID]" from runtime service failed: rpc error: code = Unknown desc = failed to remove container "[ID]": Error response from daemon: unable to remove filesystem for [ID]: CreateFile C:\ProgramData\docker\containers\[ID]\[ID]-json.log.4: Access is denied.
A similar issue was reported in AKS #1765 and fixed by an upstream PR which is available in 1.17.12+. However, the issue still repros in 1.17.15.