
S3 output key formatting inconsistent rendering of $TAG[n] #376

Closed · ssplatt opened this issue Jun 30, 2022 · 27 comments

@ssplatt commented Jun 30, 2022

Bug Report

Describe the bug
The S3 output has an s3_key_format property which allows using $TAG or $TAG[n] (along with s3_key_format_tag_delimiters) to build object paths from tag parts. Sometimes $TAG[n] in the path renders correctly, other times the part for $TAG[n-1] is rendered instead, and still other times the tag isn't parsed at all, so the literal full.tag.output.without.parsing[n] appears in the key.

To Reproduce

  • use the s3 output plugin
  • have logs with a tag like my.awesome.experimental.tag
  • set s3_key_format_tag_delimiters to .
  • set s3_key_format to something like /$TAG[2]/some_value=$TAG[3]/year=%Y/month=%m/day=%d/%H%M%S-$UUID.gz
  • observe that sometimes it works perfectly, sometimes it treats $TAG[2] as $TAG[3], and sometimes it doesn't parse the tag at all, so the path renders literally as /my.awesome.experimental.tag[3]/, all without changing the Fluent Bit version, the app version, the tags, or anything else

Expected behavior
Given my.awesome.experimental.tag, $TAG[2] should always parse to experimental, and the S3 key path should then always render as /experimental/some_value=tag/.
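
For reference, tag-part indexing is zero-based (the S3 output accepts $TAG[0] through $TAG[9]), so for this tag and delimiter the expected mapping is:

    tag = my.awesome.experimental.tag, delimiter = .
    $TAG[0] = my
    $TAG[1] = awesome
    $TAG[2] = experimental
    $TAG[3] = tag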

Your Environment
aws-for-fluent-bit 2.25.1 (Fluent Bit 1.9.3), running on Fargate with FireLens.

Duplicate of fluent/fluent-bit#5662. Since we're using aws-for-fluent-bit, I wanted to raise awareness here as well.

@zhonghui12 (Contributor) commented:

@ssplatt Do you mean the issue occurs randomly? Could you share how frequently it happens?

@ssplatt (Author) commented Jun 30, 2022

I mean that over the last month since we upgraded to 2.25.1, some logs have made it to the correct location ($TAG[2] parsed correctly), some have gone to fulltag[2], some have gone to $TAG[3], and some have gone to $TAG[1]. Nothing else has changed in that month; the output tag from our app is the same. On some days all four variants appear; on other days only one of them shows up. It seems random because I cannot find any pattern to it other than the upgrade to 2.25.1.

@ssplatt (Author) commented Jul 1, 2022

I have several thousand examples of this. I can try to get counts of all the different permutations, but it might be more difficult for some. For example, in the case where $TAG[2] actually parsed as $TAG[3], ~100 different subdirectories were created.

@raxidex commented Jul 1, 2022

I can confirm it's happening to me too. In my case it happens when I use more than one OUTPUT; it seems the plugin gets lost and mixes up the tags.

@ssplatt (Author) commented Jul 1, 2022

I have 6 or 8 outputs using similar patterns. And like I said, it worked fine until we upgraded to 2.25.1.

@zhonghui12 (Contributor) commented:

Thanks for all the reports. We will try to reproduce the issue and get back to you. In the meantime, it would be much appreciated if anyone could test our latest version (2.26.0) to see whether the problem still happens.

Thanks

@ssplatt (Author) commented Jul 1, 2022

I can share more details in an AWS support case if that'd be helpful.

@ssplatt (Author) commented Jul 5, 2022

> Thanks for all the reports. We will try to reproduce the issue and get back to you. In the meantime, it would be much appreciated if anyone could test our latest version (2.26.0) to see whether the problem still happens.
>
> Thanks

The issue does still appear in 2.26.0.

@ssplatt (Author) commented Jul 6, 2022

To add more general detail: we are splitting logs into a bucket for use with Athena and a "raw log" bucket for backup, in case we decide to repartition or need to re-ingest. In one instance we are also forwarding all logs to our "legacy" Fluentd cluster, to make sure we have a backup in case the "new" FireLens Fluent Bit path doesn't work.

The tag takes the form filtered.application.logtype.clientid, so one output writes S3 object paths of the form bucket/${application}/${ENV}/<logtype>/<clientid>, and a second output sends to athenabucket/${application}/${ENV}/type/clientid=<clientid>/year=<year>/month=<month>/day=<day>. ${application} and ${ENV} are set from environment variables; <logtype> is $TAG[2] and <clientid> is $TAG[3]. It appears that wherever we have $TAG[n], it can render as $TAG, $TAG[n-1], $TAG[n], or $TAG[n+1]. A correct example would look like bucket/myapp/prod/errorlog/clientid=myclient, but I see examples like:

  • bucket/myapp/prod/myclient/clientid=errorlog
  • bucket/myapp/prod/filtered.myapp.errorlog.myclient[2]/clientid=filtered.myapp.errorlog.myclient[3]
  • bucket/myapp/prod/myclient/clientid=filtered.myapp.errorlog.myclient[3]
  • bucket/myapp/prod/myapp/clientid=filtered.myapp.errorlog.myclient[3]
  • bucket/myapp/prod/myapp/clientid=errorlog
  • bucket/myapp/prod/myapp/clientid=clientid

So it seems that $TAG sometimes doesn't parse at all, and other times the actual tag components get shifted to a different index.

@ssplatt (Author) commented Jul 12, 2022

@PettitWesley any update on this? We're using the output to seed an Athena database, so the broken object paths are hurting our partitioning.

@DrewZhang13 (Contributor) commented Jul 13, 2022

@ssplatt I am on call this week and will take a look at this issue.
Follow-up actions:

  1. Try to reproduce it locally.
  2. Debug and see if I can find the root cause.

I will have an update by end of tomorrow based on my findings.

@DrewZhang13 (Contributor) commented Jul 14, 2022

@ssplatt I tried to reproduce it locally but couldn't.
Below is the config I set up.

[SERVICE]
    flush        1
    daemon       Off
    log_level    info
[INPUT]
    name cpu
    tag  my.awesome.experimental.tag
    Interval_Sec 0
    Interval_NSec 10
[OUTPUT]
    Name                         s3
    Match                        *
    bucket                       fluent-bit-output-s3
    region                       us-east-1
    total_file_size              1M
    s3_key_format                /$TAG[2]/some_value=$TAG[3]/year=%Y/month=%m/day=%d/%H%M%S-$UUID.gz
    s3_key_format_tag_delimiters .

I ran Fluent Bit for 1.5 hours without any inconsistent rendering of $TAG[n] appearing on my side.
I may need more details from your task definition and the whole Fluent Bit config file to check for anything suspicious.
Can you create an AWS support case, put those details there, and help us reproduce it?

@ssplatt (Author) commented Jul 14, 2022

According to one of the other comments, it only appears when there is more than one output in the same configuration using tag sub-indexes as part of the key format. Can you add another S3 output (or two or three) to your test setup?

@raxidex commented Jul 14, 2022

I'm hitting the same error, and it happens when I configure multiple OUTPUTs. Here's a sample:

    [OUTPUT]
        Name           s3
        Match          kube-infra*
        bucket         bucket_logs
        s3_key_format  /logs-infra/$TAG[1]/$TAG[3]/$TAG[2]/%Y%m%d-%H%M%S-$UUID.gz
        s3_key_format_tag_delimiters _
        endpoint       https://s3.i02.estaleiro.serpro.gov.br
        upload_timeout 10s
        total_file_size 15M
        json_date_key false
        compression gzip
        use_put_object true
    [OUTPUT]
        Name           s3
        Match          kube-systemd.*
        bucket         bucket_logs
        s3_key_format  /systemd/$TAG[1]/%Y%m%d-%H%M%S-$UUID.gz
        s3_key_format_tag_delimiters ._
        endpoint       https://s3.i02.estaleiro.serpro.gov.br
        upload_timeout 10s
        total_file_size 15M
        json_date_key false
        compression gzip
        use_put_object true

@DrewZhang13 (Contributor) commented:

I see. I'll try multiple outputs today. Thanks for the clarification.

@ssplatt (Author) commented Jul 14, 2022

Here are a few examples from my configuration, highlighting that I have numerous outputs and many different sub-indexes used across them.

[OUTPUT]
    Match                       applicationtwo.logger.*
    Name                        s3
    Retry_Limit                 False
    bucket                      ${ATHENA_LOG_BUCKET}
    s3_key_format               /${SERVICE_NAME}/${ENVIRONMENT}/$TAG[3]/client_id=$TAG[4]/year=%Y/month=%m/day=%d/%H%M%S-$UUID.gz
    s3_key_format_tag_delimiters .
    compression                 gzip
    total_file_size             1M
    upload_timeout              30s
    use_put_object              On
    auto_retry_requests         true
    net.dns.mode                TCP
[OUTPUT]
    Match                       applicationone.*
    Name                        s3
    Retry_Limit                 False
    bucket                      ${ATHENA_LOG_BUCKET}
    s3_key_format               /${SERVICE_NAME}/${ENVIRONMENT}/$TAG[2]/client_id=$TAG[3]/year=%Y/month=%m/day=%d/%H%M%S-$UUID.gz
    s3_key_format_tag_delimiters .
    compression                 gzip
    total_file_size             1M
    upload_timeout              30s
    use_put_object              On
    auto_retry_requests         true
    net.dns.mode                TCP
[OUTPUT]
    Match                       special_logger*
    Name                        s3
    Retry_Limit                 False
    bucket                      ${RAW_LOG_BUCKET}
    s3_key_format               /${SERVICE_NAME}/${ENVIRONMENT}/$TAG/%Y%m%d%H%M%S-$UUID.gz
    s3_key_format_tag_delimiters .
    compression                 gzip
    total_file_size             1M
    upload_timeout              30s
    use_put_object              On
    auto_retry_requests         true
    net.dns.mode                TCP
[OUTPUT]
    Match                       special_logger*
    Name                        s3
    Retry_Limit                 False
    bucket                      ${ATHENA_LOG_BUCKET}
    s3_key_format               /${SERVICE_NAME}/${ENVIRONMENT}/$TAG/year=%Y/month=%m/day=%d/%H%M%S-$UUID.gz
    s3_key_format_tag_delimiters .
    compression                 gzip
    total_file_size             1M
    upload_timeout              30s
    use_put_object              On
    auto_retry_requests         true
    net.dns.mode                TCP

@DrewZhang13 (Contributor) commented Jul 14, 2022

Cool, I can reproduce it locally now with the configuration below.


[SERVICE]
    flush        1
    daemon       Off
    log_level    info
[INPUT]
    name cpu
    tag  my.awesome.experimental.tag
    Interval_Sec 0
    Interval_NSec 1
[OUTPUT]
    Name                         s3
    Match                        *
    bucket                       fluent-bit-output-s3
    region                       us-east-1
    total_file_size              1M
    s3_key_format                /$TAG[2]/some_value=$TAG[3]/year=%Y/month=%m/day=%d/%H%M%S-$UUID.gz
    s3_key_format_tag_delimiters .
[OUTPUT]
    Match                       *
    Name                        s3
    region                      us-east-1
    Retry_Limit                 False
    bucket                      fluent-bit-output-s3
    s3_key_format               /$TAG[2]/some_value=$TAG[1]/year=%Y/month=%m/day=%d/%H%M%S-$UUID.gz
    s3_key_format_tag_delimiters .
    compression                 gzip
    total_file_size             1M
    upload_timeout              30s
    use_put_object              On
[OUTPUT]
    Match                       *
    Name                        s3
    region                      us-east-1
    Retry_Limit                 False
    bucket                      fluent-bit-output-s3
    s3_key_format               /$TAG[2]/some_value=$TAG[0]/year=%Y/month=%m/day=%d/%H%M%S-$UUID.gz
    s3_key_format_tag_delimiters .
    compression                 gzip
    total_file_size             1M
    upload_timeout              30s
    use_put_object              On
[OUTPUT]
    Match                       *
    Name                        s3
    region                      us-east-1
    Retry_Limit                 False
    bucket                      fluent-bit-output-s3
    s3_key_format               /$TAG[2]/some_value=$TAG[2]/year=%Y/month=%m/day=%d/%H%M%S-$UUID.gz
    s3_key_format_tag_delimiters .
    compression                 gzip
    total_file_size             1M
    upload_timeout              30s
    use_put_object              On

The resulting top-level folders in the bucket look like this, which isn't right:

  • awesome/
  • cpu/
  • experimental/
  • my.awesome.experimental.tag[2]/

I'll do some debugging and see what I can find. I don't have an ETA for now, but I will keep this thread posted if I find anything.

@PettitWesley (Contributor) commented:

@ssplatt @DrewZhang13 I am now working on a repro and investigation for this bug. Do you folks see lots of these warnings when it fails? Because I do:

[2022/07/21 22:21:35] [ warn] [s3_key] Invalid / Out of bounds tag part: At most 10 tag parts ($TAG[0] - $TAG[9]) can be processed. tag=my.awesome.experimental.tag, format=/$TAG[2]/some_value=$TAG[0]/year=%Y/month=%m/day=%d/%H%M%S-$UUID.gz, delimiters=.
[2022/07/21 22:21:35] [ warn] [s3_key] Invalid / Out of bounds tag part: At most 10 tag parts ($TAG[0] - $TAG[9]) can be processed. tag=my.awesome.experimental.tag, format=/$TAG[2]/some_value=$TAG[2]/year=%Y/month=%m/day=%d/%H%M%S-$UUID.gz, delimiters=.
[2022/07/21 22:21:35] [ info] [output:s3:s3.2] Successfully uploaded object /my.awesome.experimental.tag[2]/some_value=my/year=2022/month=07/day=21/222105-NhlnhFSp.gz

@ssplatt (Author) commented Jul 21, 2022

I'll double-check tomorrow, but I do not believe I've seen logs like that.


@PettitWesley (Contributor) commented:

I also think there must be a way to repro this with a single output... all of the code is entirely thread safe and has no context outside of a single call...

@PettitWesley (Contributor) commented Jul 22, 2022

Oooooo I think I might have figured out what it is... the strtok function we use to parse the tag parts isn't thread safe... there's a strtok_r for that...

https://man7.org/linux/man-pages/man3/strtok.3.html

I remember skimming that man page when we built that feature and thinking that this function was way too magical sounding.

This is one of those times when we must remember that all AWS integrations with Fluent Bit were led/reviewed/written by an engineer (me) who was not a real C programmer before starting on Fluent Bit, and who had previously mostly written really simple single-threaded Golang tools.
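
For anyone following along, here's a minimal sketch of the difference (illustrative only, not Fluent Bit's actual parsing code): strtok keeps its scan position in hidden static state shared by every caller in the process, so two outputs tokenizing different tags at the same time can clobber each other mid-parse, while strtok_r keeps that state in a caller-owned pointer.

    /* Minimal illustration of thread-safe tag splitting with strtok_r().
     * print_tag_parts() is a hypothetical helper, not Fluent Bit code. */
    #define _POSIX_C_SOURCE 200809L   /* exposes strtok_r() */
    #include <stdio.h>
    #include <string.h>

    static void print_tag_parts(const char *tag, const char *delims)
    {
        char buf[256];
        char *saveptr = NULL;   /* per-call parser state; strtok() instead
                                 * keeps this in shared static storage */
        int i = 0;

        snprintf(buf, sizeof(buf), "%s", tag);  /* strtok_r modifies its input */
        for (char *part = strtok_r(buf, delims, &saveptr);
             part != NULL;
             part = strtok_r(NULL, delims, &saveptr)) {
            printf("$TAG[%d] = %s\n", i++, part);
        }
    }

    int main(void)
    {
        /* prints $TAG[0] = my ... $TAG[3] = tag */
        print_tag_parts("my.awesome.experimental.tag", ".");
        return 0;
    }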

@PettitWesley (Contributor) commented:

fluent/fluent-bit#5771

@PettitWesley (Contributor) commented:

The fix can be used via this image:

144718711470.dkr.ecr.us-west-2.amazonaws.com/strtok_r_s3_tag_part_fix:1.9

@kyos3 commented Aug 1, 2022

@PettitWesley gmtime is also thread-unsafe; the pointer it returns refers to static storage shared across threads.
https://github.com/fluent/fluent-bit/blob/v1.9.6/src/aws/flb_aws_util.c#L813
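
Same class of bug, for reference: gmtime returns a pointer into static storage shared across threads, while gmtime_r fills a struct tm the caller owns. A minimal sketch of the safe pattern (illustrative only, not the actual fix):

    #define _POSIX_C_SOURCE 200809L   /* exposes gmtime_r() */
    #include <stdio.h>
    #include <time.h>

    int main(void)
    {
        time_t now = time(NULL);
        struct tm tm_buf;   /* caller-owned storage, one per thread/call */
        char key[64];

        /* gmtime(&now) would return a pointer to shared static storage;
         * gmtime_r() writes into the buffer we pass in instead. */
        if (gmtime_r(&now, &tm_buf) == NULL)
            return 1;

        strftime(key, sizeof(key), "year=%Y/month=%m/day=%d", &tm_buf);
        printf("%s\n", key);
        return 0;
    }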

@PettitWesley (Contributor) commented:

@kyos3 Thanks!

@zwj102030 (Contributor) commented:

The change is included in release v2.28.0: https://github.com/aws/aws-for-fluent-bit/releases/tag/v2.28.0
