
S3 output intermittently fails with errors SignatureDoesNotMatch, broken pipe, or HTTP version error #6933

Closed
jackgill opened this issue Feb 28, 2023 · 25 comments · Fixed by #6988

Comments

@jackgill

Bug Report

Describe the bug
I have Fluent Bit deployed in a Kubernetes cluster sending a large volume of logs to an S3 bucket. Most logs are transmitted successfully; however, Fluent Bit regularly logs a "PutObject request failed" error. (This is odd because use_put_object is set to false.) Fluent Bit logs the HTTP 403 response it received from S3, which contains this error text: "The request signature we calculated does not match the signature you provided. Check your key and signing method."

There are also a lot of "broken pipe" errors, although it is unclear whether they are related.

Even more bizarrely, there are intermittent HTTP 505 Version Not Supported errors. I have no idea how these could be intermittent; surely the HTTP version is always the same?

To Reproduce

  • Rubular link if applicable: N/A
  • Example log message if applicable:
[2023/02/28 23:16:52] [error] [output:s3:s3.1] PutObject request failed
[2023/02/28 23:16:52] [error] [output:s3:s3.1] PutObject API responded with error='SignatureDoesNotMatch', message='The request signature we calculated does not match the signature you provided. Check your key and signing method.'

[2023/02/28 23:16:52] [error] [/src/fluent-bit/src/flb_http_client.c:1201 errno=32] Broken pipe

[2023/02/28 23:31:05] [error] [output:s3:s3.1] PutObject API responded with error='HttpVersionNotSupported', message='The HTTP version specified is not supported.'
[2023/02/28 23:31:05] [error] [output:s3:s3.1] Raw PutObject response: HTTP/1.1 505 HTTP Version not supported
  • Steps to reproduce the problem:
    I'm afraid I don't have a concise set of steps to reproduce; our environment is fairly large and complex, and this issue only seems to appear under heavy load.

Expected behavior
Fluent Bit sends log files to the S3 bucket.

Screenshots
N/A

Your Environment

  • Version used: 2.0.9
  • Configuration:
    [OUTPUT]
        Name s3
        Match *
        bucket ${log_bucket_name}
        region ${log_bucket_region}
        total_file_size 10M
        s3_key_format /${cluster_name}/$TAG/%Y/%m/%d/%H/%M/%S
        use_put_object false
  • Environment name and version (e.g. Kubernetes? What version?): AWS EKS cluster, Kubernetes 1.22
  • Server type and version: EC2 instances of various types
  • Operating System and version: Bottlerocket 1.12.0
  • Filters and plugins: tail input, modify filter, kubernetes filter

Additional context
I know that Fluent Bit retries requests to S3 but I am seeing occasional messages like this:

[2023/02/28 23:30:56] [ warn] [output:s3:s3.1] Chunk file failed to send 5 times, will not retry

So I am concerned that I am losing log messages.

I realize that these could be three different issues; however, they seem to occur together, and I'm wondering if they could have a common cause.

Incidentally, I found this comment (over a year old) which reports the same behavior: #4505 (comment)

@jackgill
Author

jackgill commented Mar 7, 2023

I've stopped seeing these errors since I changed the s3_key_format config value. I originally had /${cluster_name}/$TAG/%Y/%m/%d/%H/%M/%S, which is very close to the default value (/fluent-bit-logs/$TAG/%Y/%m/%d/%H/%M/%S). After I changed this to /${cluster_name}/%Y/%m/%d/%H/$TAG[4]/$UUID.gz, all of the errors went away. (I also tested this without compression to eliminate that variable.)
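
For reference, the revised [OUTPUT] section would look roughly like this (the same settings and deployment variables as in the original report, with only s3_key_format changed; the compression setting mentioned above is omitted since it was tested both ways):

[OUTPUT]
    Name             s3
    Match            *
    bucket           ${log_bucket_name}
    region           ${log_bucket_region}
    total_file_size  10M
    s3_key_format    /${cluster_name}/%Y/%m/%d/%H/$TAG[4]/$UUID.gz
    use_put_object   false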

It occurs to me that the primary effect of the change to s3_key_format was to make the keys somewhat shorter. The keys are around 200 characters, well under the S3 maximum key length of 1024 characters. However, the nature of the error messages suggests that the S3 output plugin was somehow sending malformed HTTP requests.

The cause of these problems seems likely to be difficult to track down, and apparently it isn't very common, since others aren't reporting it. If I have time, I will try to capture some of the raw HTTP requests to see if they shed any light.

@georgantasp

georgantasp commented Mar 8, 2023

I'm seeing the same issue. I also suspect the s3_key_format, and more specifically, the use of $TAG within it.

[OUTPUT]
    Name             s3
    Match            *
    region           ${AWS_REGION}
    bucket           ${ORGANIZATION}-${ENVIRONMENT}-event-data
    total_file_size  32M
    upload_timeout   1m
    retry_limit      2
    use_put_object   On
    compression      gzip
    json_date_key    false
    s3_key_format    /${ORGANIZATION}-${ENVIRONMENT}-ads-server-$TAG/%Y/%m/%d/%H.%M.%S.$UUID.gz

My suspicion about $TAG comes from seeing these results; click and impression are the only expected tags. Could other control characters be getting introduced into the object prefix, making the S3 calls fail?

org-prod-ads-server-click/
org-prod-ads-server-clickn/
org-prod-ads-server-impression%02%01/ <- actually control chars
org-prod-ads-server-impression/
org-prod-ads-server-impression00Z/
org-prod-ads-server-impression28Z/
org-prod-ads-server-impression59Z/

@johnl

johnl commented Mar 9, 2023

I've just come across the same issue as @georgantasp, and am getting similar upload problems with an S3-compatible endpoint. I have:

s3_key_format /$TAG/%Y/%m/%d/%H_%M_%S.$UUID.gz

$TAG should be project.example, but sometimes it is written out by the s3 plugin as project.exampler, project.exampled, project.exampleost, etc. I have the same output going to the file plugin, which also uses $TAG, and none of these corrupted tags end up on disk, so my tag variable does actually hold the expected value! Looks to me like a memory corruption bug!

@leonardo-albertovich
Collaborator

I wonder if the issue is that s3_put_object treats tag as a NULL terminated string whereas its callers clearly receive a buffer with its length explicitly stated.

I think modifying that function to expect the caller to provide the length and make a local copy would fix the problem.

At the end of the chain there's flb_get_s3_key, which also expects the buffer to be NULL terminated, so we could fix it at the root or in the middle; it doesn't really matter in this case because it's a PoC.
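
For illustration, here is a minimal sketch of the suspected pattern, using hypothetical names (build_key is a stand-in, not the real API): the caller holds a (pointer, length) pair, but the callee walks the buffer as if it were NUL-terminated, so whatever bytes happen to follow the tag get pulled into the S3 key.

    #include <stdio.h>
    #include <string.h>

    /* Hypothetical callee: assumes the tag is NUL-terminated, which is the
     * behaviour suspected of s3_put_object / flb_get_s3_key above. */
    void build_key(const char *tag)
    {
        printf("key = /prefix/%s/file.gz\n", tag);
    }

    int main(void)
    {
        /* The caller really holds a (pointer, length) pair; the bytes after
         * the logical tag are whatever sits next in the chunk metadata. */
        char meta[] = "project.example2023-03-09T00:00:28Z";
        size_t tag_len = strlen("project.example");

        /* Passing only the pointer drags the trailing bytes into the key
         * (compare the corrupted prefixes reported above, e.g. the
         * "...impression28Z" objects). A correct call would pass tag_len as
         * well, or copy tag_len bytes and append '\0' first. */
        build_key(meta);

        (void) tag_len;  /* unused in the buggy path, shown for contrast */
        return 0;
    }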

I doubt I'll be able to dedicate time to this issue myself, but if anyone wants to take a stab, feel free to contact me for assistance, either in the issue or in Slack.

@georgantasp

I wonder if the issue is that s3_put_object treats tag as a NULL terminated string whereas its callers clearly receive a buffer with its length explicitly stated.

I think this tracks. The control characters and date fragments looked like they were coming from the raw Fluent Bit msgpack data. I originally suspected the encoding in my client.

For what it's worth, I switched back to the aws-for-fluent-bit image in the meantime.

@leonardo-albertovich
Collaborator

@PettitWesley would you be able to patch that?

@johnl

johnl commented Mar 9, 2023

I wonder if the issue is that s3_put_object treats tag as a NULL terminated string whereas its callers clearly receive a buffer with its length explicitly stated.

That looks right to me, yeah. Presumably the only reason this works at all in some circumstances is that the buffer is initially zeroed. But that suggests the buffer is reused at some point and not zeroed, right? It looks to be using calloc, so that doesn't seem to be the case.

Is this actually related to rewrite_tag? Is everyone seeing this using rewrite_tag?

I think modifying that function to expect the caller to provide the length and make a local copy would fix the problem.

If I'm following it correctly, the tag buffer is expected to be NULL terminated in a few other places too, such as error messages:

fluent-bit/plugins/out_s3/s3.c, lines 1211 to 1214 in 6ee3b8a:

    flb_plg_warn(ctx->ins,
                 "Chunk for tag %s failed to send %i times, "
                 "will not retry",
                 (char *) fsf->meta_buf, MAX_UPLOAD_ERRORS);

So this is going to be a little more complicated.

@PettitWesley
Contributor

@leonardo-albertovich Ack.

@PettitWesley
Contributor

I will investigate if this report is related: aws/aws-for-fluent-bit#541

@PettitWesley self-assigned this Mar 9, 2023
@PettitWesley
Contributor

@leonardo-albertovich since we switched to event_chunk, tags are now always null terminated because they are SDS strings.

I think something else in the flb_get_s3_key code is probably making a wrong assumption... I haven't figured out what it is yet...

@leonardo-albertovich
Collaborator

Could it be that when a chunk is added to the upload queue by add_to_queue, a raw copy of the tag is made on line 1584 with a buffer that's insufficient to hold the terminator?
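
A minimal sketch of that hypothesis (illustrative only; copy_tag_buggy and copy_tag_fixed are hypothetical helpers, not the actual add_to_queue code):

    #include <stdlib.h>
    #include <string.h>

    /* Suspected buggy pattern: the copy has no room for the terminator, so
     * anything that later treats it as a C string reads past the end. */
    char *copy_tag_buggy(const char *tag, size_t tag_len)
    {
        char *copy = malloc(tag_len);      /* no +1 for '\0' */
        if (!copy) {
            return NULL;
        }
        memcpy(copy, tag, tag_len);        /* copy is not NUL-terminated */
        return copy;
    }

    /* Fixed pattern: allocate tag_len + 1 and terminate explicitly
     * (using calloc also guarantees the extra byte starts out zeroed). */
    char *copy_tag_fixed(const char *tag, size_t tag_len)
    {
        char *copy = calloc(tag_len + 1, sizeof(char));
        if (!copy) {
            return NULL;
        }
        memcpy(copy, tag, tag_len);
        return copy;
    }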

@PettitWesley
Contributor

@leonardo-albertovich Oh my... thanks for finding this for me... I would have spent hours and probably not found it...

I will submit a fix ASAP.

@leonardo-albertovich
Collaborator

Glad to be of help. It seems like we need to patch this in master, 2.0, and 1.9; will you send three patches, or do you want someone else to backport it once you send it to master?

@PettitWesley
Contributor

@leonardo-albertovich I will submit both PRs ASAP

@PettitWesley
Contributor

On a slightly related note, I'm also working on completely deleting and rewriting the s3 upload queue code... since it's buggy and could be simplified a lot.

@leonardo-albertovich
Collaborator

Awesome, let me know if you need anything on my end =)

@PettitWesley
Contributor

Oh so this came up in 2.x because I made preserve_data_ordering default to On.

@leonardo-albertovich
Collaborator

leonardo-albertovich commented Mar 9, 2023

Yes, but I manually checked the code in 1.9 and that part matches, so I think we should patch it as well.

Edit: Sorry, I think I misunderstood you; it's too late, I should be sleeping.

@PettitWesley
Contributor

PettitWesley commented Mar 9, 2023

@leonardo-albertovich Here you go.

I also bundled in some other minor improvements that I thought of; let me know if you want me to remove them. They are:

  • always use flb_calloc instead of flb_malloc
  • use strncpy instead of strcpy in flb_get_s3_key code

Pull Requests:

@leonardo-albertovich
Collaborator

Quick question: will you backport from master to 2.0?

@leonardo-albertovich
Collaborator

Sorry, the PR merge automatically closed this issue, but I'd like to have confirmation first if possible, so I'll leave the same message I left in #6981.

PRs #6987, #6988, and #6990, which fix what we suspect to be the root cause of this in the master, 1.9, and 2.0 branches, have been merged, and the test containers are scheduled for creation. Please let us know if you are interested in testing them; once they are ready, you will be able to fetch them from:

  • ghcr.io/fluent/fluent-bit/test/master
  • ghcr.io/fluent/fluent-bit/test/1.9
  • ghcr.io/fluent/fluent-bit/test/2.0

Any feedback would be greatly appreciated.

@johnl

johnl commented Mar 12, 2023

I'm currently testing "version=2.0.10, commit=cf57255c59" on a staging system. Previously, this would break within a few hours of running, so if it's still all good come Monday, I'd say it's fixed!

@johnl

johnl commented Mar 13, 2023

36 hours of runtime, 4500+ uploads, 3 tags, not one corrupted tag and not one upload error. So LGTM!

No recurrences of #6981 either. That is a bit of a strange error message for a situation like this, but I suppose a memory corruption bug can do strange things, so that issue can probably be closed too.

Thanks everyone!


@leonardo-albertovich
Collaborator

Tip top! On our side, this fix will be available in the next release, which should be sooner rather than later.

I'll close this issue now, then. Once again, thanks a lot, folks!
