
Fluent Bit keeps Windows k8s pods from shutting down #2027

Closed
fujimotos opened this issue Mar 18, 2020 · 17 comments

@fujimotos
Member

Bug Report

Describe the bug

When a Windows pod gets deleted, the pod stays in the Terminating state forever until we delete the Fluent Bit pod running on the same node.

To Reproduce

[SERVICE] 
        Flush        5s
        Daemon       Off
        Log_Level    info
        Parsers_File parsers.conf
        Plugins_File plugins.conf
        HTTP_Server  On
        HTTP_Listen  0.0.0.0
        HTTP_Port    2020
        
[INPUT] 
        Name             tail
        Path             C:\k\kubelet.err.log
        Parser           raw
        Path_Key         kubelet_err_log
        DB               C:\tail2.db
        Tag              host.* 
        Refresh_Interval 60
        Rotate_Wait      5
        Skip_Long_Lines  On
        DB.Sync          Normal

[FILTER]
        Name modify
        Match host.*
        Add _HOSTNAME ${NODE_NAME}

[INPUT]
        Name             tail
        Path             C:\var\log\containers\*
        Parser           docker
        DB               C:\tail.db
        Tag              containers.*
        Refresh_Interval 60
        Rotate_Wait      5
        Skip_Long_Lines  On 
        DB.Sync          Normal

[FILTER]
        Name             kubernetes
        Match            containers.*
        tls.verify       On
        Kube_URL         https://kubernetes.default.svc.cluster.local:443
        Kube_CA_File     /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
        Kube_Token_File  /var/run/secrets/kubernetes.io/serviceaccount/token
        Kube_Tag_Prefix  kube.var.log.containers.
        Merge_Log        On
        Merge_Log_Key    log_processed

[OUTPUT]
        Name forward
        Match *
        Host receiver.svc.cluster.local
        Port 24321
        tls off
        tls.verify on

Expected behavior

No locking

Additional context

This issue was originally reported by @djsly and @titilambert in #1159.

@vishiy

vishiy commented Mar 18, 2020

Seeing the same issue with the 1.4 dev branch.

@bragi92

bragi92 commented Mar 18, 2020

Yes, I see the issue with the following fluent-bit version.

#1888 (comment)

@djsly

djsly commented Mar 24, 2020

Do we know if this is a kubelet issue or a Fluent Bit issue at this point?

@vishiy / @bragi92 are any of you working with the MSFT AKS Windows team to identify whether this could be AKS/Windows pod (kubelet) specific?

@fujimotos any chance you were able to reproduce this with the provided configuration?

@vishiy

vishiy commented Mar 24, 2020

@mikkelhegn

@mikkelhegn

Kubelet logs or K8 events from the pods stuck in terminating would be good to start getting a hint as to where to look next.

@vishiy

vishiy commented Mar 25, 2020

Kubelet logs or K8 events from the pods stuck in terminating would be good to start getting a hint as to where to look next.

@mikkelhegn - Below is the kubelet error from the node. There are no Kubernetes events specific to this issue.

E0325 00:32:14.667270 4400 remote_runtime.go:261] RemoveContainer "cdce17ef7168b13b58d9409524324d0067f46a554979cfea0db7f6a2fcc0627d" from runtime service failed: rpc error: code = Unknown desc = failed to remove container "cdce17ef7168b13b58d9409524324d0067f46a554979cfea0db7f6a2fcc0627d": Error response from daemon: unable to remove filesystem for cdce17ef7168b13b58d9409524324d0067f46a554979cfea0db7f6a2fcc0627d: CreateFile C:\ProgramData\docker\containers\cdce17ef7168b13b58d9409524324d0067f46a554979cfea0db7f6a2fcc0627d\cdce17ef7168b13b58d9409524324d0067f46a554979cfea0db7f6a2fcc0627d-json.log: Access is denied.

@djsly

djsly commented Mar 25, 2020 via email

If you look closely. From what I recall, the file is gone. Seems like a lock is kept on the file by the fluentbit process. But from a docker point of view, everything was removed.

@vishiy

vishiy commented Mar 25, 2020

@djsly - file exists, but access is denied to do anything on it (including read).

[screenshot]

@djsly

djsly commented Mar 25, 2020

OK, thanks. I will try to see whether this is always the case I hit. From the looks of it, it seems that Fluent Bit is keeping a lock, since as soon as we delete the Fluent Bit pod, the files disappear and kubelet continues with the deletion process (the pod disappears).

I'm guessing we have an issue with the way Fluent Bit opens the file?

@vishiy

vishiy commented Mar 25, 2020

@djsly Yeah. I see the below in the open code. The dwFlagsAndAttributes being 0 seems weird; I would expect FILE_ATTRIBUTE_NORMAL.
h = CreateFileA(path,
                GENERIC_READ,
                FILE_SHARE_READ|FILE_SHARE_WRITE|FILE_SHARE_DELETE,
                NULL,          /* lpSecurityAttributes   */
                OPEN_EXISTING, /* dwCreationDisposition  */
                0,             /* dwFlagsAndAttributes   */
                NULL);         /* hTemplateFile          */
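
For illustration only, here is a hypothetical variant of that call with FILE_ATTRIBUTE_NORMAL passed explicitly (this is not the actual Fluent Bit source). Note that the existing call already passes FILE_SHARE_DELETE, and Windows largely ignores attribute flags when opening an existing file, so the 0 by itself may not explain the lock:

/* Hypothetical variant for illustration: same open call, but with
 * FILE_ATTRIBUTE_NORMAL instead of 0 for dwFlagsAndAttributes. */
h = CreateFileA(path,
                GENERIC_READ,
                FILE_SHARE_READ|FILE_SHARE_WRITE|FILE_SHARE_DELETE,
                NULL,                   /* lpSecurityAttributes  */
                OPEN_EXISTING,          /* dwCreationDisposition */
                FILE_ATTRIBUTE_NORMAL,  /* dwFlagsAndAttributes  */
                NULL);                  /* hTemplateFile         */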

@djsly

djsly commented Mar 25, 2020

From the error, I think the issue might be in Docker...

This is the kubelet part; it just relays the RemoveContainer call to the lower-level runtime engine (in this case Docker):

E0325 00:32:14.667270 4400 remote_runtime.go:261] RemoveContainer "cdce17ef7168b13b58d9409524324d0067f46a554979cfea0db7f6a2fcc0627d" from runtime service failed: rpc error: code = Unknown desc = failed to remove container "cdce17ef7168b13b58d9409524324d0067f46a554979cfea0db7f6a2fcc0627d":

from here: https://github.com/kubernetes/kubernetes/blob/393bb44ea95e4690775bbf7ea73fc695f6ce5e6c/pkg/kubelet/dockershim/docker_container.go#L306-L309

Docker has this issue:

Error response from daemon: unable to remove filesystem for cdce17ef7168b13b58d9409524324d0067f46a554979cfea0db7f6a2fcc0627d: 

The client code returns the daemon error ("Error response from daemon") based on the following ("unable to remove filesystem for"):
https://github.com/moby/moby/blob/8921b59d5df10b33f5fd037b8af99aed08e6736a/daemon/delete.go#L130-L133

We see that EnsureRemoveAll is called.

The remaining error seems to be coming from Go:

CreateFile C:\ProgramData\docker\containers\cdce17ef7168b13b58d9409524324d0067f46a554979cfea0db7f6a2fcc0627d\cdce17ef7168b13b58d9409524324d0067f46a554979cfea0db7f6a2fcc0627d-json.log: Access is denied.

I cannot find many CreateFile calls to the system in the Go os library, except around symlink calls. I'm not sure whether this is a hint back to the fact that the container logs are indeed symlinks, given our INPUT path:

C:\var\log\containers\*

I will try to find time to test with

C:\var\log\pods\*

@fujimotos
Member Author

@djsly @titilambert Hmm. I spent several hours this evening trying to
reproduce this issue, but I couldn't.

In particular, I asked Kubernetes to create two pods on the same
Windows node:

$ kubectl get pod -o wide
NAME                          READY   STATUS    RESTARTS   AGE     IP            NODE             NOMINATED NODE   READINESS GATES
fluent-bit-84ff694f4f-c7h6x   1/1     Running   0          4m36s   10.240.0.17   akswin32000000   <none>           <none>
dotnet-app-746c4444fd-4mcrr   1/1     Running   0          3m50s   10.240.0.10   akswin32000000   <none>           <none>

And when I deleted the .NET app, it just worked:

$ kubectl delete pod dotnet-app-746c4444fd-4mcrr
pod "dotnet-app-746c4444fd-4mcrr" deleted
$ kubectl get pod
NAME                                         READY   STATUS    RESTARTS   AGE
fluent-bit-84ff694f4f-c7h6x   1/1     Running   0          11m
dotnet-app-746c4444fd-8qg8l   1/1     Running   0          52s

I used Fluent Bit v1.5.0 (master head) and k8s 1.16.7 for the testing.

I think something is missing from my setup; I will recheck tomorrow.
If you have additional info on your environment (e.g. the k8s YAML),
I'd appreciate it if you could share it.

@fujimotos
Member Author

On my second try, I successfully reproduced the issue using v1.4.3.

If you look closely. From what I recall, the file is gone. Seems like a lock is kept on the file by the fluentbit process. But from a docker point of view, everything was removed.

So the root issue was handle management. Fluent Bit was keeping open
handles on files in /var/log/containers, and that kept NTFS from
deleting them. Since the pod cannot finish terminating until that
deletion completes, this caused a deadlock.

A proper fix would be teaching Fluent Bit about the DeletePending flag
so that it can swiftly close handles on to-be-deleted files...

... and then I realized that I had already implemented that feature in #2133.
And indeed, I could confirm that the deadlock issue is gone as of Fluent
Bit v1.4.4.
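
For anyone curious about the idea behind that fix, below is a minimal sketch, not the actual #2133 code, of how a Win32 tail reader could poll the flag and release its handle so NTFS can complete the delete. The helper name is hypothetical; GetFileInformationByHandleEx and FileStandardInfo are standard Win32 APIs (Vista and later).

#include <windows.h>
#include <stdio.h>

/* Hypothetical helper for illustration: close our handle as soon as NTFS
 * reports that a delete is pending on the file, so the container runtime
 * can finish removing it and the pod can terminate. */
static int release_if_delete_pending(HANDLE *h)
{
    FILE_STANDARD_INFO info;

    if (!GetFileInformationByHandleEx(*h, FileStandardInfo,
                                      &info, sizeof(info))) {
        fprintf(stderr, "GetFileInformationByHandleEx failed: %lu\n",
                GetLastError());
        return -1;
    }

    if (info.DeletePending) {
        /* Another process (e.g. dockerd) requested deletion; the file only
         * disappears once the last open handle is closed, so drop ours. */
        CloseHandle(*h);
        *h = INVALID_HANDLE_VALUE;
        return 1;
    }

    return 0; /* file is still live; keep tailing it */
}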

# Testing with Fluent Bit v1.4.4
$ kubectl delete pod dotnet-app-64db59b858-86qz9
pod "dotnet-app-64db59b858-86qz9" deleted
$

So in short, please use Fluent Bit v1.4.4 or later, and kubectl delete
should just work.

@djsly @titilambert @vishiy I am closing this ticket now. Please feel free
to reopen it if you experience this issue on newer versions.

@lizhuqi

lizhuqi commented Jul 30, 2021

Hi, I am still seeing the error mentioned in #issuecomment-603576125 for fluentbit 1.7.7

Will the fix in fluentd#3340 also apply to the next fluent-bit release?

@fujimotos
Member Author

Hi, I am still seeing the error mentioned in #issuecomment-603576125 for fluentbit 1.7.7

@lizhuqi Fluent Bit already had that issue fixed in v1.5.

If you can reproduce the same issue in Fluent Bit v1.7.7, then it must be another bug.
Please post a new bug report with your configuration and log files.

@djsly

djsly commented Jul 30, 2021 via email

@lizhuqi

lizhuqi commented Jul 30, 2021

I opened #3892 to report the edge case which can trigger the error mentioned in #issuecomment-603576125
