Improve Outputs throughput handling #963

Closed
aleksmaus opened this issue Aug 9, 2024 · 5 comments

@aleksmaus
Contributor

Hi 👋

I was experimenting with Falco Sidekick, specifically with the Elasticsearch Output, but I think this issue is common to all HTTP outputs and possibly others.

Currently there is no limit on the number of outgoing requests from the Sidekick:
for each document from Falco a goroutine is created
https://github.com/falcosecurity/falcosidekick/blob/master/handlers.go#L267
which creates an HTTP request and a connection to the server.

In many Outputs a mutex lock is used to guard the auth headers, for example:
https://github.com/falcosecurity/falcosidekick/blob/master/outputs/elasticsearch.go#L60

So the runtime characteristics differ depending on how the Output is configured.
For the Elasticsearch output:

  1. If the username and password are specified, all requests are serialized and run one after another. This limits the number of outgoing requests to one at a time, which is probably good for stability but hurts data throughput.
  2. If customHeaders are specified for API key auth, all requests are executed at the "same time". With a high rate of incoming data, the number of connections to Elasticsearch is unbounded, which leads to a large number of outgoing connections and to errors such as TLS handshake or I/O timeouts.

Overall, depending on the rate of incoming data and the configuration of the output, it is possible to destabilize the Sidekick and the environment it runs on. We need more predictable, configurable resource and network utilization.

Possible steps to address.

Short term

  1. A PR for HTTP client reuse is proposed here. This should help a bit with throughput and connection utilization.
  2. Refactor the HTTP header handling so that it doesn't require locking, and at the same time maybe limit the number of concurrent outgoing requests. In many cases it is already limited to one, but this can be improved by making the limit configurable. In the short term this can be as simple as a configurable semaphore on the HTTP request code path. I can look into it.

Mid term

  1. For the Elasticsearch output specifically, start supporting configurable batching. I am currently working on this and seeing a 1000x throughput improvement so far, depending on the batching configuration.

Long term

  1. Better management of outgoing requests from Falco Sidekick: a configurable queue/ring buffer for incoming data, and request parallelization.

Please let me know if you have any thoughts/feedback or if somebody is already addressing these issues. Meanwhile I'll start working on this.

@aleksmaus aleksmaus added the kind/bug Something isn't working label Aug 9, 2024
@Issif Issif self-assigned this Aug 17, 2024
@Issif Issif added this to the 2.30 milestone Aug 17, 2024
@Issif
Member

Issif commented Aug 21, 2024

Hi @aleksmaus,

Thanks for this issue and your very clear comments and feedback.

I was already aware of most of the issues you mention. I will comment here to explain the "why", to avoid polluting the PR. The config and client were developed with a "naive" vision from the beginning. First, because when I started the project years ago my Go skills were much weaker (are they better now? not so sure...), and second, because I wanted a very explicit and easy-to-understand code base that would allow anyone, even with little knowledge of Go, to contribute and add a new output.

I still have in mind to create a v3 some day, up to date with best practices, but as Falcosidekick is very stable and gets very few bug reports, my focus is on other projects (I still maintain it and will continue to; the project lives on) like https://github.com/falco-talon/falco-talon, which is way better designed (I hope) and may be used as the base for the next Falcosidekick.

Anyway, your PRs are very welcome and valuable, and I'm happy to review them. Thanks again.

@aleksmaus
Contributor Author

Hi @Issif,
Thank you for the explanation and understanding. I'm more than happy to help with refactoring for v3.

@Issif
Member

Issif commented Aug 21, 2024

> Hi @Issif, Thank you for the explanation and understanding. I'm more than happy to help with refactoring for v3.

The project has not started, and I don't know if it will, so your improvements for v2.x are welcome 😉

@aleksmaus
Contributor Author

I'll close this issue now, since the short- and mid-term goals have been addressed. The long-term one is probably a candidate for the v3 that you mentioned, so we can revisit the details when it comes to that.

@github-project-automation github-project-automation bot moved this from To do to Done in Falcosidekick 2.x Sep 3, 2024
@Issif
Member

Issif commented Sep 3, 2024

Thanks a lot for your help 🙏
