
S3 output intermittently fails with errors SignatureDoesNotMatch, broken pipe, or HTTP version error #6933

Closed
jackgill opened this issue Feb 28, 2023 · 25 comments · Fixed by #6988

Comments

@jackgill

Bug Report

Describe the bug
I have Fluent Bit deployed in a Kubernetes cluster sending a large volume of logs to an S3 bucket. Most logs are transmitted successfully; however, Fluent Bit regularly logs a "PutObject request failed" error. (This is odd because use_put_object is set to false.) Fluent Bit logs the HTTP 403 response it received from S3, which contains this error text: "The request signature we calculated does not match the signature you provided. Check your key and signing method."

There are also a lot of "broken pipe" errors, although it is unclear whether they are related.

Even more bizarrely, there are intermittent HTTP 505 Version Not Supported errors. I have no idea how these could be intermittent; surely the HTTP version is always the same?

To Reproduce

  • Rubular link if applicable: N/A
  • Example log message if applicable:
[2023/02/28 23:16:52] [error] [output:s3:s3.1] PutObject request failed
[2023/02/28 23:16:52] [error] [output:s3:s3.1] PutObject API responded with error='SignatureDoesNotMatch', message='The request signature we calculated does not match the signature you provided. Check your key and signing method.'

[2023/02/28 23:16:52] [error] [/src/fluent-bit/src/flb_http_client.c:1201 errno=32] Broken pipe

[2023/02/28 23:31:05] [error] [output:s3:s3.1] PutObject API responded with error='HttpVersionNotSupported', message='The HTTP version specified is not supported.'
[2023/02/28 23:31:05] [error] [output:s3:s3.1] Raw PutObject response: HTTP/1.1 505 HTTP Version not supported
  • Steps to reproduce the problem:
    I'm afraid I don't have a concise set of steps to reproduce; our environment is fairly large and complex, and this issue only seems to appear under heavy load.

Expected behavior
Fluent Bit sends log files to the S3 bucket.

Screenshots
N/A

Your Environment

  • Version used: 2.0.9
  • Configuration:
    [OUTPUT]
        Name s3
        Match *
        bucket ${log_bucket_name}
        region ${log_bucket_region}
        total_file_size 10M
        s3_key_format /${cluster_name}/$TAG/%Y/%m/%d/%H/%M/%S
        use_put_object false
  • Environment name and version (e.g. Kubernetes? What version?): AWS EKS cluster, Kubernetes 1.22
  • Server type and version: EC2 instances of various types
  • Operating System and version: Bottlerocket 1.12.0
  • Filters and plugins: tail input, modify filter, kubernetes filter

Additional context
I know that Fluent Bit retries requests to S3 but I am seeing occasional messages like this:

[2023/02/28 23:30:56] [ warn] [output:s3:s3.1] Chunk file failed to send 5 times, will not retry

So I am concerned that I am losing log messages.

I realize that these could be three different issues; however, they seem to occur together, and I'm wondering if they could have a common cause.

Incidentally, I found this comment (over a year old) which reports the same behavior: #4505 (comment)

@jackgill
Author

jackgill commented Mar 7, 2023

I've stopped seeing these errors since I changed the s3_key_format config value. I originally had /${cluster_name}/$TAG/%Y/%m/%d/%H/%M/%S, which is very close to the default value (/fluent-bit-logs/$TAG/%Y/%m/%d/%H/%M/%S). After I changed this to /${cluster_name}/%Y/%m/%d/%H/$TAG[4]/$UUID.gz, all of the errors went away. (I also tested this without compression to eliminate that variable.)
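
For reference, the revised [OUTPUT] section would look roughly like this (the same settings and deployment variables as in the original report, with only s3_key_format changed; the compression setting mentioned above is omitted since it was tested both ways):

[OUTPUT]
    Name             s3
    Match            *
    bucket           ${log_bucket_name}
    region           ${log_bucket_region}
    total_file_size  10M
    s3_key_format    /${cluster_name}/%Y/%m/%d/%H/$TAG[4]/$UUID.gz
    use_put_object   false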

It occurs to me that the primary effect of the change to s3_key_format was to make the keys somewhat shorter. The keys are around 200 characters, well under the S3 maximum key length of 1024 characters. However, the nature of the error messages suggests that the S3 output plugin was somehow sending malformed HTTP requests.

The cause of these problems seems likely to be difficult to track down, and apparently it isn't very common, since others aren't reporting it. If I have time, I will try to capture some of the raw HTTP requests to see if they shed any light.

@georgantasp

georgantasp commented Mar 8, 2023

I'm seeing the same issue. I also suspect the s3_key_format, and more specifically, the use of $TAG within it.

[OUTPUT]
    Name             s3
    Match            *
    region           ${AWS_REGION}
    bucket           ${ORGANIZATION}-${ENVIRONMENT}-event-data
    total_file_size  32M
    upload_timeout   1m
    retry_limit      2
    use_put_object   On
    compression      gzip
    json_date_key    false
    s3_key_format    /${ORGANIZATION}-${ENVIRONMENT}-ads-server-$TAG/%Y/%m/%d/%H.%M.%S.$UUID.gz

My suspicion about $TAG comes from seeing these results; click and impression are the only expected tags. Could other control characters be getting introduced into the object prefix, making the S3 calls fail?

org-prod-ads-server-click/
org-prod-ads-server-clickn/
org-prod-ads-server-impression%02%01/ <- actually control chars
org-prod-ads-server-impression/
org-prod-ads-server-impression00Z/
org-prod-ads-server-impression28Z/
org-prod-ads-server-impression59Z/

@johnl

johnl commented Mar 9, 2023

I've just come across the same issue as @georgantasp, and am getting similar upload problems with an S3-compatible endpoint. I have:

s3_key_format /$TAG/%Y/%m/%d/%H_%M_%S.$UUID.gz

$TAG should be project.example, but sometimes it is written out by the s3 plugin as project.exampler, project.exampled, project.exampleost, etc. I have the same output going to the file plugin, which also uses $TAG, and none of these corrupted tags end up on disk, so my tag variable does actually hold the expected value! Looks to me like a memory corruption bug!

@leonardo-albertovich
Collaborator

I wonder if the issue is that s3_put_object treats tag as a NULL terminated string whereas its callers clearly receive a buffer with its length explicitly stated.

I think modifying that function to expect the caller to provide the length and make a local copy would fix the problem.

At the end of the chain there's flb_get_s3_key, which also expects the buffer to be NULL terminated, so we could fix it at the root or in the middle; it doesn't really matter in this case because it's a PoC.
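
For illustration, here is a minimal sketch of the suspected pattern, using hypothetical names (build_key is a stand-in, not the real API): the caller holds a (pointer, length) pair, but the callee walks the buffer as if it were NUL-terminated, so whatever bytes happen to follow the tag get pulled into the S3 key.

    #include <stdio.h>
    #include <string.h>

    /* Hypothetical callee: assumes the tag is NUL-terminated, which is the
     * behaviour suspected of s3_put_object / flb_get_s3_key above. */
    void build_key(const char *tag)
    {
        printf("key = /prefix/%s/file.gz\n", tag);
    }

    int main(void)
    {
        /* The caller really holds a (pointer, length) pair; the bytes after
         * the logical tag are whatever sits next in the chunk metadata. */
        char meta[] = "project.example2023-03-09T00:00:28Z";
        size_t tag_len = strlen("project.example");

        /* Passing only the pointer drags the trailing bytes into the key
         * (compare the corrupted prefixes reported above, e.g. the
         * "...impression28Z" objects). A correct call would pass tag_len as
         * well, or copy tag_len bytes and append '\0' first. */
        build_key(meta);

        (void) tag_len;  /* unused in the buggy path, shown for contrast */
        return 0;
    }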

I doubt I'll be able to dedicate time to this issue myself, but if anyone wants to take a stab, feel free to contact me for assistance, either in the issue or in Slack.

@georgantasp

I wonder if the issue is that s3_put_object treats tag as a NULL terminated string whereas its callers clearly receive a buffer with its length explicitly stated.

I think this tracks. The control characters and date fragments looked like they were coming from the raw Fluent Bit msgpack data. I originally suspected the encoding in my client.

For what it's worth, I switched back to the aws-for-fluent-bit image in the meantime.

@leonardo-albertovich
Collaborator

@PettitWesley would you be able to patch that?

@johnl

johnl commented Mar 9, 2023

I wonder if the issue is that s3_put_object treats tag as a NULL terminated string whereas its callers clearly receive a buffer with its length explicitly stated.

That looks right to me, yeah. Presumably the only reason this works at all in some circumstances is that the buffer is initially zeroed. But that suggests the buffer is reused at some point and not zeroed, right? It looks to be using calloc, so that doesn't seem to be the case.

Is this actually related to rewrite_tag? Is everyone seeing this using rewrite_tag?

I think modifying that function to expect the caller to provide the length and make a local copy would fix the problem.

If I'm following it correctly, the tag buffer is expected to be NULL terminated in a few other places too, such as error messages:

fluent-bit/plugins/out_s3/s3.c, lines 1211 to 1214 in 6ee3b8a:

    flb_plg_warn(ctx->ins,
                 "Chunk for tag %s failed to send %i times, "
                 "will not retry",
                 (char *) fsf->meta_buf, MAX_UPLOAD_ERRORS);

So this is going to be a little more complicated.

@PettitWesley
Contributor

@leonardo-albertovich Ack.

@PettitWesley
Contributor

I will investigate if this report is related: aws/aws-for-fluent-bit#541

@PettitWesley self-assigned this Mar 9, 2023
@PettitWesley
Contributor

@leonardo-albertovich since we switched to event_chunk, tags are now always null terminated because they are SDS strings.

I think something else in the flb_get_s3_key code is probably making a wrong assumption... I haven't figured out what it is yet...

@leonardo-albertovich
Collaborator

Could it be that when a chunk is added to the upload queue by add_to_queue, a raw copy of the tag is made on line 1584 with a buffer that's insufficient to hold the terminator?
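
A minimal sketch of that hypothesis (illustrative only; copy_tag_buggy and copy_tag_fixed are hypothetical helpers, not the actual add_to_queue code):

    #include <stdlib.h>
    #include <string.h>

    /* Suspected buggy pattern: the copy has no room for the terminator, so
     * anything that later treats it as a C string reads past the end. */
    char *copy_tag_buggy(const char *tag, size_t tag_len)
    {
        char *copy = malloc(tag_len);      /* no +1 for '\0' */
        if (!copy) {
            return NULL;
        }
        memcpy(copy, tag, tag_len);        /* copy is not NUL-terminated */
        return copy;
    }

    /* Fixed pattern: allocate tag_len + 1 and terminate explicitly
     * (using calloc also guarantees the extra byte starts out zeroed). */
    char *copy_tag_fixed(const char *tag, size_t tag_len)
    {
        char *copy = calloc(tag_len + 1, sizeof(char));
        if (!copy) {
            return NULL;
        }
        memcpy(copy, tag, tag_len);
        return copy;
    }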

@PettitWesley
Contributor

@leonardo-albertovich Oh my... thanks for finding this for me... I would have spent hours and probably not found it...

I will submit a fix ASAP.

@leonardo-albertovich
Collaborator

Glad to be of help. It seems like we need to patch this in master, 2.0, and 1.9; will you send three patches, or do you want someone else to backport it once you send it to master?

@PettitWesley
Contributor

@leonardo-albertovich I will submit both PRs ASAP

@PettitWesley
Contributor

On a slightly related note, I'm also working on completely deleting and rewriting the s3 upload queue code... since it's buggy and could be simplified a lot.

@leonardo-albertovich
Collaborator

Awesome, let me know if you need anything on my end =)

@PettitWesley
Contributor

Oh so this came up in 2.x because I made preserve_data_ordering default to On.

@leonardo-albertovich
Collaborator

leonardo-albertovich commented Mar 9, 2023

Yes, but I manually checked the code in 1.9 and that part matches, so I think we should patch it as well.

Edit: Sorry, I think I misunderstood you; it's too late, I should be sleeping.

@PettitWesley
Contributor

PettitWesley commented Mar 9, 2023

@leonardo-albertovich Here you go.

I also bundled in some other minor improvements that I thought of; let me know if you want me to remove them. They are:

  • always use flb_calloc instead of flb_malloc
  • use strncpy instead of strcpy in flb_get_s3_key code

Pull Requests:

@leonardo-albertovich
Collaborator

Quick question: will you backport from master to 2.0?

@leonardo-albertovich
Collaborator

Sorry, the PR merge automatically closed this issue, but I'd like to have confirmation first if possible, so I'll leave the same message I left in #6981.

PRs #6987, #6988, and #6990, which fix what we suspect to be the root cause of this in the master, 1.9, and 2.0 branches, have been merged, and the test containers are scheduled for creation. Please let us know if you are interested in testing them; once they are ready, you will be able to fetch them from:

  • ghcr.io/fluent/fluent-bit/test/master
  • ghcr.io/fluent/fluent-bit/test/1.9
  • ghcr.io/fluent/fluent-bit/test/2.0

Any feedback would be greatly appreciated.

@johnl

johnl commented Mar 12, 2023

I'm currently testing "version=2.0.10, commit=cf57255c59" on a staging system. Previously, this would break within a few hours of running, so if it's still all good come Monday, I'd say it's fixed!

@johnl

johnl commented Mar 13, 2023

36 hours of runtime, 4500+ uploads, 3 tags, not one corrupted tag and not one upload error. So LGTM!

No recurrences of #6981 either. That is a bit of a strange error message for a situation like this, but I suppose a memory corruption bug can do strange things, so that issue can probably be closed too.

Thanks everyone!


@leonardo-albertovich
Collaborator

Tip top! On our side, this fix will be available in the next release, which should be sooner rather than later.

I'll close this issue now, then. Once again, thanks a lot, folks!
