Polling of prometheus metrics randomly causes SIGSEGV #1755

maglun2 · 2019-11-19T09:53:15Z

Bug Report

Describe the bug
Fluent-bit randomly crashes with SIGSEGV and from the stacktrace it looks like it's the polling of prometheus metrics that causes it:

[engine] caught signal (SIGSEGV)
#0  0x7f719373053e      in  ???() at ???:0
#1  0x55a0f2c6eb29      in  string_cmp() at src/http_server/api/v1/metrics.c:142
#2  0x7f71936d5d20      in  ???() at ???:0
#3  0x7f71936d5a97      in  ???() at ???:0
#4  0x7f71936d5a81      in  ???() at ???:0
#5  0x7f71936d5a97      in  ???() at ???:0
#6  0x7f71936d5a97      in  ???() at ???:0
#7  0x7f71936d5fde      in  ???() at ???:0
#8  0x55a0f2c6f248      in  cb_metrics_prometheus() at src/http_server/api/v1/metrics.c:287
#9  0x55a0f2f50da5      in  thread_cb_init_vars() at lib/monkey/mk_server/mk_http_thread.c:68
#10 0x55a0f2f5bd86      in  co_init() at lib/monkey/deps/flb_libco/amd64.c:117
#11 0xffffffffffffffff  in  ???() at ???:0
2019-11-19 09:45:47,071 INFO exited: fluentbit (terminated by SIGABRT (core dumped); not expected)

Additional context
Fluent-bit version: 1.3.2
Running in kubernetes

The text was updated successfully, but these errors were encountered:

maglun2 · 2019-11-20T09:07:32Z

Added valgrind to fluent-bit container. Found this output related to cb_metrics_prometheus:

==15== Conditional jump or move depends on uninitialised value(s)
==15==    at 0x27785D: metrics_get_latest (metrics.c:46)
==15==    by 0x277C51: cb_metrics_prometheus (metrics.c:196)
==15==    by 0x559DA5: thread_cb_init_vars (mk_http_thread.c:68)
==15==    by 0x564D86: co_init (amd64.c:117)
==15==  Uninitialised value was created by a stack allocation
==15==    at 0x563D83: co_swap_function (in /opt/td-agent-bit/bin/td-agent-bit)
==15==
==15== Use of uninitialised value of size 8
==15==    at 0x277881: metrics_get_latest (metrics.c:54)
==15==    by 0x277C51: cb_metrics_prometheus (metrics.c:196)
==15==    by 0x559DA5: thread_cb_init_vars (mk_http_thread.c:68)
==15==    by 0x564D86: co_init (amd64.c:117)
==15==  Uninitialised value was created by a stack allocation
==15==    at 0x563D83: co_swap_function (in /opt/td-agent-bit/bin/td-agent-bit)

fujimotos · 2019-11-20T10:25:08Z

@maglun2 I experimented with /api/v1/metrics/prometheus today and could
not reproduce your issue, nor could I find the root cause by reading the related
part of the code.

Can you post the configuration you're using?
Does the segfault always occur with the exact same stacktrace?

nigels-com · 2019-11-20T10:50:57Z

@maglun2 I also took a deep dive into the metrics module, I found a potential memory leak by inspection, but no obvious race conditions. It seems to me that metrics is using both TLS (thread local storage) metrics_list = pthread_getspecific(hs_metrics_key) and reference counting buf->users++ which seemed a curious on the face of it. My impression is batches of metrics arrive from a queue and the queue is flushed taking care to check the user count is zero. But it's not clear to me how multiple consumers are all getting copies of the data unless there is pub/sub going on on the queuing side. And if so, why also reference counting?

maglun2 · 2019-11-20T11:27:13Z

@fujimotos yes looks like it's always the same stacktrace. Here is our config:

  fluent-bit.conf: |
    [SERVICE]
        Flush         1
        Log_Level     info
        Daemon        off
        Parsers_File  parsers.conf
        HTTP_Server   On
        HTTP_Listen   0.0.0.0
        HTTP_Port     2020
        Merge_JSON_Log Off

    @INCLUDE input-kubernetes.conf
    @INCLUDE input-tcp.conf
    @INCLUDE filter-kubernetes.conf
    @INCLUDE output-forward.conf

  filter-kubernetes.conf: |
    # merge log must be off otherwise messages logged as json will bloat elasticsearch with lots of fields
    [FILTER]
        Name                kubernetes
        Match               kube_file.*
        Kube_Tag_Prefix     kube_file.var.log.containers.
        Kube_URL            https://kubernetes.default.svc.cluster.local:443
        Merge_Log           Off
        K8S-Logging.Parser  On
        K8S-Logging.Exclude On
        Labels              On
        Annotations         Off

    # we need merge log on for nginx ingress controller access logs to work
    [FILTER]
        Name                kubernetes
        Match               kube_file.*nginx-ingress-controller*
        Kube_Tag_Prefix     kube_file.var.log.containers.
        Kube_URL            https://kubernetes.default.svc.cluster.local:443
        Merge_Log           On
        Merge_Log_Trim      On
        K8S-Logging.Parser  On
        K8S-Logging.Exclude On
        Labels              On
        Annotations         Off

    # add cluster name
    [FILTER]
        Name   record_modifier
        Match  *
        Record cluster ${K8S_CLUSTER_NAME}

    # we dont know what records from tcp forward source will be tagged with since it is set by the sender
    # assume all records come from tcp forward source
    [FILTER]
        Name   record_modifier
        Match  *
        Record log_source fluentbit_tcp

    # and then fix those that we know come from other source(s)
    [FILTER]
        Name   record_modifier
        Match  kube_file.*
        Record log_source fluentbit_tail

  input-kubernetes.conf: |
    [INPUT]
        Name              tail
        Flush             2
        Path              /var/log/containers/*.log
        #Exclude_Path      /var/log/containers/fluent*.log
        Tag               kube_file.*
        Parser            docker
        DB                /var/log/flb_kube.db
        Buffer_Chunk_Size 64k
        Buffer_Max_Size   256k
        Mem_Buf_Limit     100m
        Skip_Long_Lines   Off
        Refresh_Interval  10
        Merge_JSON_Log    Off
        Multiline         Off

  input-tcp.conf: |
    [INPUT]
        Name              forward
        Flush             2
        # pretty sure tag isn't supported for forward input
        Tag               kube_tcp.*
        Listen            0.0.0.0
        Log_Level         info
        Port              5500
        Buffer_Chunk_Size 64k
        Buffer_Max_Size   256k
        Mem_Buf_Limit     100m

  output-forward.conf: |
    [OUTPUT]
        Name                forward
        Match               *
        Host                ${FLUENT_HOST}
        Port                ${FLUENT_PORT}
        Retry_Limit         False

  parsers.conf: |
    [PARSER]
        Name        docker
        Format      json
        Time_Key    time
        Time_Format %Y-%m-%dT%H:%M:%S.%L
        Time_Keep   On
        # Command      |  Decoder | Field | Optional Action
        # =============|==================|=================
        Decode_Field_As   escaped    log

maglun2 · 2019-11-20T13:13:04Z

@nigels-com I appreciate you digging into the problem though I hope the question isn't directed at me 😄

nigels-com · 2019-11-20T21:30:29Z

@maglun2. Sometimes the process of putting things into words can help narrow or pinpoint a problem. But not this time.

fujimotos · 2019-11-21T04:59:17Z

Hm. I'm not entirely sure why this is happening.

#1  0x55a0f2c6eb29      in  string_cmp() at src/http_server/api/v1/metrics.c:142
...
#8  0x55a0f2c6f248      in  cb_metrics_prometheus() at src/http_server/api/v1/metrics.c:287

The stacktrace suggests that metrics_list somehow contains an invalid
string and it makes strcmp() run over the buffer ... but it's unlikely
since we use flb_sds_cat() that guarantees null termination.

The only possible route I can think of is flb_sds_cat() returning NULL
at metrics.c#L279, and this causes the SIGSEGV in qsort().

I posted #1762 that attempts to fix the issue based on this hypothesis.

@maglun2 Can you check out the version and see if it solves your problem?

maglun2 · 2019-11-21T08:00:39Z

Thanks @fujimotos will try it now

maglun2 · 2019-11-21T10:53:28Z

@fujimotos initial tests looks very promising. No sigsegv since I deployed your version. Will deploy to more test clusters and let it run during the day.

maglun2 · 2019-11-21T14:49:23Z

I realize now that your branch is based off master so there might be other changes affecting the behaviour I guess?

fujimotos · 2019-11-21T22:48:45Z

I realize now that your branch is based off master so there might be other changes affecting the behaviour I guess?

For metrics.c, we have no particular diff between v1.3.2 and master.
You can confirm it by running

$ git log v1.3.2..master http_server/api/v1/metrics.c

So if the patch fixes the issue, that's it.

maglun2 · 2019-11-22T06:47:03Z

Looks really good, no SIGSEGV at all since I deployed your patched version! Previous version got multiple SIGSEGV a day. I would say that your patch fixed the problem. Thanks you so much for looking into and fixing the problem @fujimotos! Should I make a comment in the PR?

fujimotos · 2019-11-22T06:56:53Z

Looks really good, no SIGSEGV at all since I deployed your patched version! Previous version got multiple SIGSEGV a day. I would say that your patch fixed the problem.

Nice! I appreciate if you make a comment in the PR for reference.

Now we can just wait for @edsiper to merge that patch into the main line.

edsiper · 2019-12-10T23:14:22Z

Fix merged into v1.3 branch also for the next v1.3.4 release (this week).

fujimotos mentioned this issue Nov 21, 2019

http_server: fix potential SIGSEGV due to memory allocation errors #1762

Merged

robertodauria mentioned this issue Dec 2, 2019

Consider using fluent-bit instead of fluentd m-lab/k8s-support#233

Closed

edsiper assigned fujimotos Dec 10, 2019

edsiper added bug fixed labels Dec 10, 2019

edsiper closed this as completed Dec 10, 2019

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Polling of prometheus metrics randomly causes SIGSEGV #1755

Polling of prometheus metrics randomly causes SIGSEGV #1755

maglun2 commented Nov 19, 2019 •

edited

Loading

maglun2 commented Nov 20, 2019

fujimotos commented Nov 20, 2019

nigels-com commented Nov 20, 2019

maglun2 commented Nov 20, 2019

maglun2 commented Nov 20, 2019

nigels-com commented Nov 20, 2019

fujimotos commented Nov 21, 2019

maglun2 commented Nov 21, 2019

maglun2 commented Nov 21, 2019

maglun2 commented Nov 21, 2019

fujimotos commented Nov 21, 2019

maglun2 commented Nov 22, 2019 •

edited

Loading

fujimotos commented Nov 22, 2019

edsiper commented Dec 10, 2019

Polling of prometheus metrics randomly causes SIGSEGV #1755

Polling of prometheus metrics randomly causes SIGSEGV #1755

Comments

maglun2 commented Nov 19, 2019 • edited Loading

Bug Report

maglun2 commented Nov 20, 2019

fujimotos commented Nov 20, 2019

nigels-com commented Nov 20, 2019

maglun2 commented Nov 20, 2019

maglun2 commented Nov 20, 2019

nigels-com commented Nov 20, 2019

fujimotos commented Nov 21, 2019

maglun2 commented Nov 21, 2019

maglun2 commented Nov 21, 2019

maglun2 commented Nov 21, 2019

fujimotos commented Nov 21, 2019

maglun2 commented Nov 22, 2019 • edited Loading

fujimotos commented Nov 22, 2019

edsiper commented Dec 10, 2019

maglun2 commented Nov 19, 2019 •

edited

Loading

maglun2 commented Nov 22, 2019 •

edited

Loading