S3 output intermittently fails with errors SignatureDoesNotMatch, broken pipe, or HTTP version error #6933
I've stopped seeing these errors after I made a change to the [...]. It occurs to me that the primary effect of the change to [...]. The cause of these problems seems likely to be difficult to track down, and apparently it isn't very common, since others aren't reporting it. If I have time I will try to capture some of the raw HTTP requests to see if they shed any light.
I'm seeing the same issue. I'm also suspecting the [...].
My suspicion on [...].
I've just come across the same issue as @georgantasp too, and am getting similar upload problems to an S3-compatible endpoint. I have: [...]
$TAG should be [...]
I wonder if the issue is that s3_put_object treats the tag as a NULL-terminated string, whereas its callers clearly receive a buffer with its length explicitly stated. I think modifying that function to expect the caller to provide the length, and to make a local copy, would fix the problem (a sketch of the idea follows below). At the end of the chain there's [...]. I doubt I'll be able to dedicate time to this issue myself, but if anyone wants to take a stab, feel free to contact me for assistance either in the issue or in Slack.
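A minimal sketch of that idea, with a hypothetical signature (the real s3_put_object takes a plugin context and more parameters): the caller passes the tag length explicitly, and the function makes its own NULL-terminated copy before building the request.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical shape of the fix: take an explicit tag length and make a
 * terminated local copy, instead of trusting the caller's buffer to be
 * NULL terminated. */
int s3_put_object_sketch(const char *tag, size_t tag_len)
{
    char *tag_copy = malloc(tag_len + 1);   /* +1 reserves room for '\0' */
    if (!tag_copy) {
        return -1;
    }
    memcpy(tag_copy, tag, tag_len);
    tag_copy[tag_len] = '\0';

    /* ... format the S3 key and headers using tag_copy ... */

    free(tag_copy);
    return 0;
}
```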
I think this tracks. The control chars and date fragments seemed like they were coming from raw Fluent Bit msgpack. I originally suspected the encoding in my client. For what it's worth, I switched back to the aws-for-fluent-bit image in the meantime.
@PettitWesley would you be able to patch that?
That looks right to me yeah - presumably the only reason this works at all in some circumstances is because the buffer is initially zeroed. But that suggests that the buffer is reused at some point and not zeroed, right? It looks to be using calloc, so doesn't seem so. Is this actually related to rewrite_tag? Is everyone seeing this using rewrite_tag?
If I'm following it correctly, the tag buffer is expected to be NULL terminated in a few other places too, such as error messages (fluent-bit/plugins/out_s3/s3.c, lines 1211 to 1214 at 6ee3b8a).
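A self-contained illustration (not the actual s3.c code) of what that assumption does when the terminator is missing: "%s" keeps reading whatever bytes happen to sit after the tag, which matches the control characters and date fragments reported above.

```c
#include <stdio.h>
#include <string.h>

int main(void)
{
    char buf[32];
    memcpy(buf, "app.logs", 8);                /* tag bytes, no '\0' written  */
    memcpy(buf + 8, "2023/03/07 12:00", 16);   /* unrelated bytes that follow */
    buf[24] = '\0';

    /* Prints "tag=app.logs2023/03/07 12:00" instead of "tag=app.logs". */
    printf("tag=%s\n", buf);
    return 0;
}
```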
I will investigate if this report is related: aws/aws-for-fluent-bit#541
@leonardo-albertovich since we switched to event_chunk, tags are now always null terminated because they are SDS strings: [...]
I think something else in the [...]
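For context, a conceptual sketch of why an SDS-style string is always safe to treat as a C string, as stated above. This is only the idea behind flb_sds, not its actual implementation: the length lives in a header in front of the buffer, and the buffer is always allocated with, and given, a trailing '\0' (freeing would go back through the header, omitted here).

```c
#include <stdlib.h>
#include <string.h>

struct sds_header {
    size_t len;     /* payload length, tracked explicitly       */
    char   buf[];   /* payload bytes, always followed by a '\0' */
};

char *sds_create_len(const char *data, size_t len)
{
    struct sds_header *h = malloc(sizeof(*h) + len + 1);
    if (!h) {
        return NULL;
    }
    h->len = len;
    memcpy(h->buf, data, len);
    h->buf[len] = '\0';   /* the terminator is part of the invariant */
    return h->buf;        /* callers get a plain, terminated char*   */
}
```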
Could it be that when a chunk is added to the upload queue by add_to_queue, a raw copy of the tag is made at line 1584 with a buffer that's insufficient to hold the terminator?
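If that is the case, the fix is a one-byte change in the allocation. A hedged reconstruction follows; the function and variable names are illustrative, not the exact add_to_queue code.

```c
#include <stdlib.h>
#include <string.h>

/* Suspected bug: the copy is exactly tag_len bytes, so nothing guarantees
 * a terminator and any later strlen()/"%s" on it reads past the end. */
char *copy_tag_buggy(const char *tag, size_t tag_len)
{
    char *copy = calloc(1, tag_len);
    if (!copy) return NULL;
    memcpy(copy, tag, tag_len);
    return copy;
}

/* Fix: allocate one extra byte and terminate explicitly. */
char *copy_tag_fixed(const char *tag, size_t tag_len)
{
    char *copy = calloc(1, tag_len + 1);
    if (!copy) return NULL;
    memcpy(copy, tag, tag_len);
    copy[tag_len] = '\0';
    return copy;
}
```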
@leonardo-albertovich Oh my... thanks for finding this for me... I would have spent hours and probably not found it... I will submit a fix ASAP.
Glad to be of help. It seems like we need to patch this in master, 2.0 and 1.9; will you send three patches, or do you want someone else to backport it once you send it to master?
@leonardo-albertovich I will submit both PRs ASAP
On a slightly related note, I'm also working on completely deleting and rewriting the S3 upload queue code, since it's buggy and could be simplified a lot.
Awesome, let me know if you need anything on my end =)
Oh, so this came up in 2.x because I made [...]
Yes, but I manually checked the code in 1.9 and that part matches, so I think we should patch it as well. Edit: Sorry, I think I misunderstood you; it's too late, I should be sleeping.
@leonardo-albertovich Here you go. I also bundled in some other minor improvements that I thought of; let me know if you want me to remove them. They are: [...]
Pull Requests: [...]
Quick question, will you backport from master to 2.0?
Sorry, the PR merge automatically closed this issue, but I'd like to have confirmation first if possible, so I'll leave the same message I left in #6981. PRs #6987, #6988 and #6990, which fix what we suspect to be the root cause of this in the master, 1.9 and 2.0 branches, have been merged and the test containers are scheduled for creation. Please let us know if you are interested in testing them; once they are ready you will be able to fetch them from: [...]
Any feedback would be greatly appreciated.
I'm currently testing "version=2.0.10, commit=cf57255c59" on a staging system. Previously, this would break within a few hours of running, so if it's still all good come Monday I'd say it's fixed!
36 hours of runtime, 4500+ uploads, 3 tags, not one corrupted tag and not one upload error. So LGTM! No recurrences of #6981 either. That is a bit of a strange error message for a situation like this, but I suppose a memory corruption bug can do strange things, so that can probably be closed too. Thanks everyone!
Fixed in AWS release: https://github.com/aws/aws-for-fluent-bit/releases/tag/v2.31.7
Tip top! On our side, this fix will be available in the next release, which should be sooner rather than later. I'll close this issue now then; once again, thanks a lot, folks!
Bug Report
Describe the bug
I have Fluent Bit deployed in a Kubernetes cluster sending a large volume of logs to an S3 bucket. Most logs are transmitted successfully; however, Fluent Bit regularly logs the "PutObject request failed" error. (This is odd because `use_put_object` is set to `false`.) Fluent Bit logs the HTTP 403 response it got from S3, which has this error text: "The request signature we calculated does not match the signature you provided. Check your key and signing method." There are also a lot of "broken pipe" errors, although it is unclear if they are related.
Even more bizarrely, there are intermittent HTTP 505 "Version not supported" errors. I have no idea how these could be intermittent; surely the HTTP version is always the same?
To Reproduce
I'm afraid I don't have a concise set of steps to reproduce; our environment is fairly large and complex, and this issue only seems to appear under heavy load.
Expected behavior
Fluent Bit sends log files to S3 bucket
Screenshots
N/A
Your Environment
Additional context
I know that Fluent Bit retries requests to S3, but I am seeing occasional messages like this: [...]
So I am concerned that I am losing log messages.
I realize that these could be three different issues; however, they seem to occur together, and I'm wondering if they could have a common cause.
Incidentally, I found this comment (over a year old) which reports the same behavior: #4505 (comment)