Polling of prometheus metrics randomly causes SIGSEGV #1755
Comments
Added valgrind to fluent-bit container. Found this output related to cb_metrics_prometheus:
|
@maglun2 I experimented with
|
@maglun2 I also took a deep dive into the metrics module. I found a potential memory leak by inspection, but no obvious race conditions. It seems to me that metrics is using both TLS (thread local storage) |
@fujimotos yes looks like it's always the same stacktrace. Here is our config:
|
@nigels-com I appreciate you digging into the problem though I hope the question isn't directed at me 😄 |
@maglun2. Sometimes the process of putting things into words can help narrow or pinpoint a problem. But not this time. |
Hm. I'm not entirely sure why this is happening.
The stacktrace suggests that metrics_list somehow contains an invalid [...]
The only possible route I can think of is [...]
I posted #1762, which attempts to fix the issue based on this hypothesis. @maglun2 Can you check out the version and see if it solves your problem? |
Thanks @fujimotos will try it now |
@fujimotos initial tests look very promising. No SIGSEGV since I deployed your version. Will deploy to more test clusters and let it run during the day. |
I realize now that your branch is based off master so there might be other changes affecting the behaviour I guess? |
For metrics.c, we have no particular diff between v1.3.2 and master.
So if the patch fixes the issue, that's it. |
Looks really good, no SIGSEGV at all since I deployed your patched version! The previous version got multiple SIGSEGVs a day. I would say that your patch fixed the problem. Thank you so much for looking into and fixing the problem @fujimotos! Should I make a comment in the PR? |
Nice! I'd appreciate it if you made a comment in the PR for reference. Now we can just wait for @edsiper to merge that patch into the main line. |
Fix merged into v1.3 branch also for the next v1.3.4 release (this week). |
Bug Report
Describe the bug
Fluent-bit randomly crashes with SIGSEGV and from the stacktrace it looks like it's the polling of prometheus metrics that causes it:
Additional context
Fluent-bit version: 1.3.2
Running in Kubernetes
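For anyone trying to reproduce this outside the cluster, here is a rough sketch of a polling loop that hits the Prometheus metrics endpoint repeatedly and reports the first poll that fails. It is not taken from the original report. It assumes the built-in HTTP server is enabled and reachable at 127.0.0.1:2020 (adjust `DEFAULT_URL` for your setup), and the helper names `fetch_metrics` and `poll_until_failure` are made up for illustration. A failed poll only means the endpoint stopped answering; check the Fluent Bit logs to confirm it was actually a SIGSEGV.

```python
import http.client
import urllib.request
from typing import Callable, Optional

# Assumption: Fluent Bit's HTTP server is enabled and listening here.
DEFAULT_URL = "http://127.0.0.1:2020/api/v1/metrics/prometheus"


def fetch_metrics(url: str = DEFAULT_URL, timeout: float = 2.0) -> str:
    """Fetch one Prometheus-format metrics snapshot as text."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read().decode("utf-8", errors="replace")


def poll_until_failure(fetch: Callable[[], str], max_polls: int) -> Optional[int]:
    """Call fetch() up to max_polls times.

    Returns the 0-based index of the first poll that raised a network or
    HTTP error, or None if every poll succeeded.
    """
    for i in range(max_polls):
        try:
            fetch()
        except (OSError, http.client.HTTPException):
            # URLError, connection refused/reset and timeouts are all OSError.
            return i
    return None


if __name__ == "__main__":
    failed_at = poll_until_failure(fetch_metrics, max_polls=10000)
    if failed_at is None:
        print("all polls succeeded")
    else:
        print(f"poll {failed_at} failed; check whether fluent-bit crashed")
```

Passing the fetch function in as a parameter keeps the loop easy to try against a stub before pointing it at a real instance.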