
Fluent-Bit crashed with SIGSEGV error #6897

Closed
ezienecker opened this issue Feb 22, 2023 · 10 comments
Labels
Stale, waiting-for-release (This has been fixed/merged but it's waiting to be included in a release.)

Comments

@ezienecker

We recently updated to fluent-bit 2.0.9 (Helm Chart version 0.24.0). Since then, Fluent Bit regularly crashes with exit code 139 (SIGSEGV).

The following error message appears in the logs:

[2023/02/22 07:27:34] [ info] [filter:kubernetes:kubernetes.0]  token updated
[2023/02/22 07:29:06] [engine] caught signal (SIGSEGV)
#0  0x55c299f28772      in  edata_arena_ind_get() at lib/jemalloc-5.3.0/include/jemalloc/internal/edata.h:258
#1  0x55c299f28772      in  tcache_bin_flush_impl() at lib/jemalloc-5.3.0/src/tcache.c:350
#2  0x55c299f28772      in  tcache_bin_flush_bottom() at lib/jemalloc-5.3.0/src/tcache.c:519
#3  0x55c299f28772      in  je_tcache_bin_flush_small() at lib/jemalloc-5.3.0/src/tcache.c:529
#4  0x55c299f29cb9      in  tcache_gc_small() at lib/jemalloc-5.3.0/src/tcache.c:148
#5  0x55c299f2bd71      in  ???() at lib/jemalloc-5.3.0/src/tcache.c:414
#6  0x55c299f2e62f      in  je_te_event_trigger() at lib/jemalloc-5.3.0/src/thread_event.c:299
#7  0x55c299ebf6ac      in  te_event_advance() at lib/jemalloc-5.3.0/include/jemalloc/internal/thread_event.h:287
#8  0x55c299ebf6ac      in  thread_dalloc_event() at lib/jemalloc-5.3.0/include/jemalloc/internal/thread_event.h:293
#9  0x55c299ebf6ac      in  ifree() at lib/jemalloc-5.3.0/src/jemalloc.c:2896
#10 0x55c299ebf6ac      in  je_free_default() at lib/jemalloc-5.3.0/src/jemalloc.c:3021
#11 0x55c29a497053      in  map_metric_destroy() at lib/cmetrics/src/cmt_map.c:160
#12 0x55c29a4973f3      in  cmt_map_destroy() at lib/cmetrics/src/cmt_map.c:273
#13 0x55c29a480110      in  cmt_counter_destroy() at lib/cmetrics/src/cmt_counter.c:94
#14 0x55c29a4a57ff      in  cmt_destroy() at lib/cmetrics/src/cmetrics.c:101
#15 0x55c29a016e23      in  collect_metrics() at src/flb_metrics_exporter.c:201
#16 0x55c29a016f57      in  flb_me_fd_event() at src/flb_metrics_exporter.c:253
#17 0x55c299f9d7d0      in  flb_engine_handle_event() at src/flb_engine.c:497
#18 0x55c299f9d7d0      in  flb_engine_start() at src/flb_engine.c:853
#19 0x55c299f44b24      in  flb_lib_worker() at src/flb_lib.c:629
#20 0x7f181e43bea6      in  ???() at ???:0
#21 0x7f181dcefa2e      in  ???() at ???:0
#22 0xffffffffffffffff  in  ???() at ???:0

On other instances, we see the following error:

[2023/02/21 20:33:54] [ info] [filter:kubernetes:kubernetes.0]  token updated
[2023/02/21 20:43:54] [engine] caught signal (SIGSEGV)
#0  0x55c4622aae03      in  atomic_load_p() at lib/jemalloc-5.3.0/include/jemalloc/internal/atomic.h:83
#1  0x55c4622aae03      in  arena_get_from_edata() at lib/jemalloc-5.3.0/include/jemalloc/internal/arena_inlines_b.h:16
#2  0x55c4622aae03      in  je_large_dalloc() at lib/jemalloc-5.3.0/src/large.c:271
#3  0x55c46224e700      in  arena_dalloc_large() at lib/jemalloc-5.3.0/include/jemalloc/internal/arena_inlines_b.h:297
#4  0x55c46224e700      in  arena_dalloc() at lib/jemalloc-5.3.0/include/jemalloc/internal/arena_inlines_b.h:334
#5  0x55c46224e700      in  idalloctm() at lib/jemalloc-5.3.0/include/jemalloc/internal/jemalloc_internal_inlines_c.h:120
#6  0x55c46224e700      in  ifree() at lib/jemalloc-5.3.0/src/jemalloc.c:2887
#7  0x55c46224e700      in  je_free_default() at lib/jemalloc-5.3.0/src/jemalloc.c:3014
#8  0x55c4622e967a      in  flb_free() at include/fluent-bit/flb_mem.h:121
#9  0x55c4622ea913      in  flb_sds_destroy() at src/flb_sds.c:470
#10 0x55c46261b220      in  pack_record() at plugins/out_loki/loki.c:1233
#11 0x55c46261b6de      in  loki_compose_payload() at plugins/out_loki/loki.c:1381
#12 0x55c46261b7bd      in  cb_loki_flush() at plugins/out_loki/loki.c:1408
#13 0x55c4623079ae      in  output_pre_cb_flush() at include/fluent-bit/flb_output.h:528
#14 0x55c462d6e3a6      in  co_init() at lib/monkey/deps/flb_libco/amd64.c:117

This error also occurs in the following versions:

  • Helm Chart version 0.23.0 (fluent-bit version 2.0.8)
  • Helm Chart version 0.22.0 (fluent-bit version 2.0.8)
  • Helm Chart version 0.21.0 (fluent-bit version not checked)

As a temporary measure, I have downgraded to version 1.9.9 (Helm Chart version 0.20.11). Everything has worked so far.

Perhaps a bug was introduced in the 2.x series?

@patrick-stephens
Contributor

Can you provide the full configuration you're using that triggers the error?
More of the Fluent Bit logs would also help.

@ezienecker
Author

This is the current configuration:

image:
  pullPolicy: IfNotPresent

resources:
  limits:
    cpu: 100m
    memory: 256Mi
  requests:
    cpu: 100m
    memory: 128Mi

logLevel: info

config:
  inputs: |
    [INPUT]
        Name                    tail
        Path                    /var/log/containers/*.log
        Parser                  cri
        Tag                     kube.*
        Mem_Buf_Limit           5MB
        Buffer_Chunk_Size       64KB
        Buffer_Max_Size         128KB
        Skip_Long_Lines         On

  filters: |
    [FILTER]
        Name                    kubernetes
        Match                   kube.*
        K8S-Logging.Parser      On
        K8S-Logging.Exclude     On
        Buffer_Size             256KB

    # Only keep logs from namespaces containing 'test1' or 'test2'
    [FILTER]
        Name                    grep
        Match                   kube.*
        Regex                   $kubernetes['namespace_name'] ((?:.+-)?(test1|test2)(?:infra-.+)?)
    
    # Append environment to tag
    [FILTER]
        Name                    rewrite_tag
        Match                   kube.*
        Rule                    $kubernetes['namespace_name'] ((?:.+-)?(test1|test2)(?:infra-.+)?) $2.$TAG false

  outputs: |
    [OUTPUT]
        Name                    loki
        Match                   test1.kube.*
        Host                    loki
        port                    3100
        labels                  environment=test1
        remove_keys             stream,logtag,kubernetes
        drop_single_key         on
        auto_kubernetes_labels  on
    
    [OUTPUT]
        Name                    loki
        Match                   test2.kube.*
        Host                    loki
        port                    3100
        labels                  environment=test2
        line_format             key_value
        remove_keys             stream,logtag,kubernetes,kubernetes_namespace
        drop_single_key         on
        auto_kubernetes_labels  on


  customParsers: |
    [PARSER]
        Name                    cri
        Format                  regex
        Regex                   ^(?<time>[^ ]+) (?<stream>stdout|stderr) (?<logtag>[^ ]*) (?<message>.*)$
        Time_Key                time
        Time_Format             %Y-%m-%dT%H:%M:%S.%L%z
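
As a side note for anyone reproducing this locally, the cri parser regex above can be sanity-checked with a short script. This is just an illustrative sketch: Python's re module requires the (?P<name>...) named-group syntax, while Fluent Bit's Onigmo engine accepts the (?<name>...) form used in the config, so the pattern is transliterated accordingly. The sample log line is made up.

```python
import re

# Same pattern as the cri parser above, rewritten with Python-style
# named groups ((?P<name>...) instead of Onigmo's (?<name>...)).
CRI_REGEX = re.compile(
    r"^(?P<time>[^ ]+) (?P<stream>stdout|stderr) (?P<logtag>[^ ]*) (?P<message>.*)$"
)

# Hypothetical CRI-formatted container log line.
line = "2023-02-22T07:27:34.123456789+00:00 stdout F hello from the container"

m = CRI_REGEX.match(line)
assert m is not None
print(m.group("time"))     # 2023-02-22T07:27:34.123456789+00:00
print(m.group("stream"))   # stdout
print(m.group("logtag"))   # F
print(m.group("message"))  # hello from the container
```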

@Valt25

Valt25 commented Feb 22, 2023

This seems to be a similar error, but with a different stack trace:

[2023/02/22 15:14:07] [engine] caught signal (SIGSEGV)
#0  0x7f2227844ad8      in  ???() at 4/multiarch/strlen-evex.S:77
#1  0x7f222773af75      in  __vfprintf_internal() at fprintf-internal.c:1688
#2  0x7f222774c9c5      in  __vsnprintf_internal() at f.c:114
#3  0x55dd178d36f2      in  flb_sds_printf() at src/flb_sds.c:429
#4  0x55dd17a7b452      in  debug_event_mask() at plugins/in_tail/tail_fs_inotify.c:69
#5  0x55dd17a7b924      in  tail_fs_event() at plugins/in_tail/tail_fs_inotify.c:199
#6  0x55dd178e2d4a      in  flb_input_collector_fd() at src/flb_input.c:1882
#7  0x55dd179157aa      in  flb_engine_handle_event() at src/flb_engine.c:490
#8  0x55dd179157aa      in  flb_engine_start() at src/flb_engine.c:853
#9  0x55dd178bcb24      in  flb_lib_worker() at src/flb_lib.c:629
#10 0x7f2227f1aea6      in  start_thread() at reate.c:477
#11 0x7f22277cea2e      in  ???() at sysv/linux/x86_64/clone.S:95
#12 0xffffffffffffffff  in  ???() at ???:0

This happens only when the FLB_LOG_LEVEL environment variable is set to debug.

Everything works fine with fluent-bit 2.0.8.

@nokute78
Collaborator

@Valt25 Your error seems to be different from the original issue and the same as #6797. Could you check it?

@ezienecker
Author

Is there any progress on this topic?

@leonardo-albertovich
Collaborator

@nokute78 fixed an issue that could be related to this. Yesterday I generated the test containers for master, which I think you should be able to grab from ghcr.io/fluent/fluent-bit/test/master, or you could build them yourself. It would be great if you could give it a try.

I'm also working on a PR to fix issue #6911 so if you use the dynamic tenant id feature I would really appreciate your input. The branch name is leonardo-master-loki-tenant_id-race-fix but I can build test containers for it once I'm done with the improvement I'm currently working on.
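
For anyone who wants to try that master build through the Helm chart, a values override along these lines should do it. The repository path comes from the comment above, but the tag ("latest") is an assumption; check the registry for the actual tags.

```yaml
# Sketch of a Helm values override to trial the master test image.
# The tag is an assumption, not confirmed in this thread.
image:
  repository: ghcr.io/fluent/fluent-bit/test/master
  tag: latest
  pullPolicy: Always
```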

@payparain

That happens only when the FLB_LOG_LEVEL environment variable is set to debug.

Everything works fine with fluent-bit 2.0.8.

This was the situation for our Fluent-Bit installation. Setting the logLevel to info resolved the segfault.
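
In chart-values terms, the workaround described here is simply keeping the log level at info (and not overriding it to debug via FLB_LOG_LEVEL). A minimal sketch against the values layout shown earlier in this thread:

```yaml
# Workaround sketch: avoid the debug-only crash path in tail_fs_inotify
# by keeping the log level at info until a fixed release is available.
logLevel: info
```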

@patrick-stephens
Contributor

See #6958 as well

@patrick-stephens patrick-stephens added waiting-for-release This has been fixed/merged but it's waiting to be included in a release. and removed status: waiting-for-triage labels Mar 15, 2023
@github-actions
Contributor

This issue is stale because it has been open 90 days with no activity. Remove stale label or comment or this will be closed in 5 days. Maintainers can add the exempt-stale label.

@github-actions github-actions bot added the Stale label Jun 14, 2023
@github-actions
Contributor

This issue was closed because it has been stalled for 5 days with no activity.

@github-actions github-actions bot closed this as not planned (won't fix, can't repro, duplicate, stale) Jun 20, 2023

6 participants