core: minimal synchronous scheduler #6339

Merged
merged 2 commits into from
Nov 25, 2022

@matthewfala matthewfala commented Nov 3, 2022

Minimal Synchronous Scheduler - 1.9x

Master branch PR #6413
Please leave review comments on this 1.9x branch.

Summary

Implement a synchronous task scheduler option that the cloudwatch_logs plugin can opt into, allowing it to migrate to the asynchronous network stack.

Issue

Because the CloudWatch API requires PutLogEvents requests to be processed sequentially, Fluent Bit sends data to CloudWatch using a less supported “synchronous” networking stack. This stack is prone to indefinite hangs and segfaults, and works well only with a fine-tuned configuration.
See: #6140 and #6329

Investigative Efforts

We first tried to resolve the hangs in the synchronous network stack by adding OpenSSL error handling, adding DNS timeouts, and enabling unidirectional TLS shutdowns. When tested under a high-load failure case, these changes only took Fluent Bit from failing about once every 5 minutes to failing about once every 5 hours. We determined that isolating the remaining synchronous network stack issues would take too much effort, and decided to invest instead in switching to the widely used Fluent Bit asynchronous network stack.

Solution

Our proposed solution is to migrate the CloudWatch Logs output plugin to Fluent Bit’s asynchronous network stack.

Switching to Async Network Stack

CloudWatch API Synchronous Usage Requirements
The CloudWatch plugin relies on the synchronous networking stack to ensure that CloudWatch Logs PutLogEvents requests are made sequentially.

Normally, when the asynchronous network stack is used, Fluent Bit context-switches to processing the next batch of logs whenever the previous batch yields on a network call. This defeats the sequential PutLogEvents execution that CloudWatch requires.

Existing Core Synchronous Scheduler
To enforce sequential processing of log data when the asynchronous network stack is used, we opt the CloudWatch Logs plugin into a Fluent Bit core synchronous task scheduler, which limits processing to one batch of logs at a time, essentially using the asynchronous networking stack in a synchronous manner.

However, a bottleneck was discovered in the existing Fluent Bit core synchronous scheduler: it limits log processing to 1 batch per second (or per flush interval).
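To make the cost of that gating concrete, here is a rough back-of-the-envelope sketch in C. The helper names and the 50 ms request latency used in the note below are illustrative assumptions, not Fluent Bit code and not measurements from this PR.

```c
#include <assert.h>

/* Hypothetical helpers for illustration; not part of Fluent Bit. */

/* Upper bound on batches per second when at most one batch is
 * released per flush interval. */
static double interval_gated_rate(double flush_interval_s)
{
    assert(flush_interval_s > 0.0);
    return 1.0 / flush_interval_s;
}

/* Upper bound when batches are still sent one at a time, but the
 * next one starts as soon as the previous request completes. */
static double back_to_back_rate(double request_latency_s)
{
    assert(request_latency_s > 0.0);
    return 1.0 / request_latency_s;
}
```

With a 1 second flush interval the first bound is 1 batch per second. If a request took an assumed 50 ms, sending batches back to back would allow roughly 20 batches per second while still keeping them strictly sequential.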

New Performant Core Synchronous Scheduler
The FireLens team wrote a new, more performant core scheduler that removes the 1-batch-per-second restriction while keeping the one-batch-at-a-time restriction in place. The CloudWatch plugin opts into this synchronous scheduler implementation and uses the asynchronous network stack.

Plugins that opt in by setting FLB_OUTPUT_SYNCHRONOUS as a plugin flag are limited to 1 task per output_instance worker group.
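The one-task-at-a-time behavior can be pictured as a small FIFO queue with a single in-flight slot. The following is a minimal, self-contained sketch of that idea only. Every name in it (`sync_queue`, `sync_queue_submit`, and so on) is hypothetical and does not reflect the actual implementation in src/flb_output.c.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical sketch; these names are NOT the Fluent Bit API. */
#define QUEUE_CAP 64

struct sync_queue {
    int pending[QUEUE_CAP]; /* FIFO ring of waiting batch ids (>= 0) */
    size_t head;            /* index of the oldest waiting batch */
    size_t count;           /* number of waiting batches */
    int in_flight;          /* id of the running batch, or -1 if idle */
};

static void sync_queue_init(struct sync_queue *q)
{
    q->head = 0;
    q->count = 0;
    q->in_flight = -1;
}

/* Start the oldest waiting batch if nothing is running.
 * Returns the id that was started, or -1 if nothing was started. */
static int sync_queue_dispatch(struct sync_queue *q)
{
    if (q->in_flight != -1 || q->count == 0) {
        return -1;
    }
    q->in_flight = q->pending[q->head];
    q->head = (q->head + 1) % QUEUE_CAP;
    q->count--;
    return q->in_flight;
}

/* Queue a batch and start it right away if the slot is free.
 * Returns the id started, -1 if the batch only queued, -2 if full. */
static int sync_queue_submit(struct sync_queue *q, int batch_id)
{
    if (q->count == QUEUE_CAP) {
        return -2;
    }
    q->pending[(q->head + q->count) % QUEUE_CAP] = batch_id;
    q->count++;
    return sync_queue_dispatch(q);
}

/* Called when the running batch finishes. The next batch starts
 * immediately instead of waiting for a flush interval. */
static int sync_queue_complete(struct sync_queue *q)
{
    assert(q->in_flight != -1);
    q->in_flight = -1;
    return sync_queue_dispatch(q);
}
```

In this sketch, submitting batches 1, 2, and 3 in a row starts only batch 1. Batches 2 and 3 start, in order, as each predecessor completes, so requests stay sequential without a per-interval wait.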

Testing and Results

Unit Testing

A series of 24-hour tests was conducted on Fluent Bit 1.9x with the patch, both with and without Valgrind. No network hangs were observed on 1.9x, and the patch introduced no memory leaks.

A 24-hour test was conducted on Fluent Bit 2.0x with the patch. No network hangs were observed.

Parallel Long Running Durability Tests

To simulate customers’ long-running execution of Fluent Bit, 40-100 ECS FireLens test tasks were run in parallel per test to accumulate running time and gain confidence in the patch.

The following stability matrix outlines the patch’s impact on Fluent Bit’s durability rating, expressed as a lower-bounded average hours to failure (HTF).

Fluent Bit 1.9x (AWS for Fluent Bit official release)

| | Patch | No Patch |
|---|---|---|
| Keepalive On | [2] Very Stable (+3500h) | [1,8] Segfault on some network errors after throttling limits (~80h) |
| Keepalive Off | [3] Somewhat stable (~2000h) | CloudWatch Hang (~0.08h) |
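The PR does not spell out how HTF is computed. One plausible reading, shown here purely as an assumption, is total accumulated task-hours divided by the number of observed failures, where a run with no failures reports its total hours as a lower bound. The function name and the inputs mentioned in the note below are hypothetical.

```c
#include <assert.h>

/* Hypothetical helper; an assumed reading of "lower-bounded average
 * hours to failure", not a formula taken from the PR. */
static double htf_lower_bound(double total_task_hours, unsigned failures)
{
    assert(total_task_hours >= 0.0);
    /* With zero failures, all accumulated hours count as a lower
     * bound, so divide by 1 rather than by 0. */
    return total_task_hours / (failures > 0 ? (double) failures : 1.0);
}
```

For example, 50 hypothetical tasks running 24 hours each accumulate 1200 task-hours. With no failures this reading gives at least 1200 h, and with 3 failures it gives 400 h.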

Fluent Bit is licensed under Apache 2.0, by submitting this pull request I understand that this code will be released under the terms of that license.

@matthewfala matthewfala marked this pull request as draft November 11, 2022 01:24
  */
- upstream->flags &= ~(FLB_IO_ASYNC);
+ // upstream->flags &= ~(FLB_IO_ASYNC);
Contributor

Let's just remove this commented code? And put the comment over the FLB_OUTPUT_SYNCHRONOUS invocation at the bottom of the file?

Contributor Author

This sounds good!

PettitWesley previously approved these changes Nov 11, 2022
@matthewfala matthewfala marked this pull request as ready for review November 16, 2022 23:10
Signed-off-by: Matthew Fala <falamatt@amazon.com>
… cloudwatch

Signed-off-by: Matthew Fala <falamatt@amazon.com>