Log Collection Proof of Concept: Validate Chained Approach #955

Closed
8 tasks done
tigrannajaryan opened this issue May 12, 2020 · 15 comments

@tigrannajaryan
Member

tigrannajaryan commented May 12, 2020

We want to validate that chaining FluentBit and OpenTelemetry Collector
is a viable approach to support collection of logs in formats that FluentBit
supports and which we would like to avoid implementing directly in Collector.

The chained approach implies that FluentBit collects logs from log files (and
possibly from other sources) and sends them to the Collector, which then sends
the logs to the backend.

We will use Fluentd Forward Protocol v1 to send logs from FluentBit to Collector.
We will use OTLP to send logs from Collector to the backend.
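
To make the FluentBit-to-Collector hop more concrete, here is a minimal sketch of what a single Forward Protocol v1 event in Message mode looks like on the wire: a MessagePack array of `[tag, time, record]`. The encoder below is a hand-rolled subset of MessagePack written only for illustration (it is not what FluentBit or the Collector use), and the tag `app.log` and the record contents are made-up values.

```python
import struct


def pack(obj):
    """Encode a tiny subset of MessagePack: short strings, uint32, small maps/arrays."""
    if isinstance(obj, str):
        raw = obj.encode("utf-8")
        if len(raw) > 31:
            raise ValueError("sketch only handles strings up to 31 bytes")
        return bytes([0xA0 | len(raw)]) + raw  # fixstr
    if isinstance(obj, int) and not isinstance(obj, bool):
        if not 0 <= obj < 2**32:
            raise ValueError("sketch only handles unsigned 32-bit integers")
        return b"\xce" + struct.pack(">I", obj)  # uint 32, big-endian
    if isinstance(obj, dict):
        if len(obj) > 15:
            raise ValueError("sketch only handles maps with up to 15 entries")
        out = bytes([0x80 | len(obj)])  # fixmap
        for key, value in obj.items():
            out += pack(key) + pack(value)
        return out
    if isinstance(obj, (list, tuple)):
        if len(obj) > 15:
            raise ValueError("sketch only handles arrays with up to 15 items")
        out = bytes([0x90 | len(obj)])  # fixarray
        for item in obj:
            out += pack(item)
        return out
    raise TypeError(f"unsupported type: {type(obj).__name__}")


def forward_message(tag, unix_seconds, record):
    """Build one Message-mode entry: [tag, time, record] (options omitted)."""
    return pack([tag, unix_seconds, record])


# Hypothetical event: tag and record are illustrative only.
payload = forward_message("app.log", 1589241600, {"message": "hello"})
print(payload.hex())
```

A real sender would also handle the Forward and PackedForward modes, the EventTime extension type, and the optional acknowledgement handshake, none of which are shown here.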

The primary concerns that need to be clarified are:

  • What is the performance impact of the chained approach? How much more CPU and RAM
    is used by Collector when logs are passed through it, compared to the scenario
    where the same logs are collected by FluentBit and sent directly to the backend?

  • How much latency (delay in log delivery) is added by Collector?

  • What is the impact of Collector crashing? How many logs are queued in memory
    and will be lost? Is queuing necessary, or can it be minimized/avoided to minimize
    the losses? Can we run with a 0-sized queue in Collector and rely on queuing and
    batching done by FluentBit?

  • If queuing is necessary what is the impact of adding persistent queues that
    survive the crash and restart of Collector?
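
As a rough way to reason about the crash-loss concern above, the number of records at risk can be bounded by the ingest rate multiplied by how long records sit in Collector memory before being exported. The sketch below is a back-of-the-envelope estimate only; the function name and the example numbers are hypothetical and are not measurements from this PoC.

```python
def records_at_risk(records_per_sec, buffered_seconds):
    """Estimate how many records live only in memory when the process crashes.

    Assumes a steady ingest rate and that every record waits
    `buffered_seconds` in an in-memory queue before it is exported.
    """
    if records_per_sec < 0 or buffered_seconds < 0:
        raise ValueError("rate and buffering time must be non-negative")
    return round(records_per_sec * buffered_seconds)


# Hypothetical numbers: 10,000 records/sec held for 2 seconds before export.
print(records_at_risk(10_000, 2))
# With a 0-sized queue nothing is held in Collector memory.
print(records_at_risk(10_000, 0))
```

Under these assumptions, shrinking the in-memory buffering time shrinks the potential loss proportionally, which is the motivation for asking whether FluentBit's own queuing and batching can be relied on instead.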

In order to clarify these concerns the following tasks need to be performed:

Note: logsproto is the experimental version of OTLP Logs Protocol. Receiver and exporter for this protocol will be added to existing otlp receiver and exporter, but the implementation is experimental and subject to change, so we will not document for end users how it is configured and enabled (developer documentation is still necessary).

@PettitWesley

One option for log file tail benchmarking: https://github.com/awslabs/amazon-log-agent-benchmark-tool

@PettitWesley

Fluent Forward can be used over TCP or a unix socket. We'd probably want the latter, since it is supposed to be more efficient. fluent/fluent-bit#2181

@tigrannajaryan
Member Author

We may still want to use TCP since unix sockets are not available on Windows (or we can choose depending on the platform).
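
A minimal sketch of the "choose depending on the platform" idea: prefer a unix socket when the runtime exposes one and fall back to TCP otherwise. The function name, socket path, host, and port below are all hypothetical, and feature detection via `socket.AF_UNIX` is just one way to make the decision.

```python
import socket


def forward_endpoint(unix_path, tcp_host, tcp_port, unix_available=None):
    """Pick the transport for the Forward connection.

    Returns ("unix", path) when unix sockets are usable, otherwise
    ("tcp", (host, port)). `unix_available` can be forced for testing.
    """
    if unix_available is None:
        # Feature-detect rather than hard-coding a platform name.
        unix_available = hasattr(socket, "AF_UNIX")
    if unix_available:
        return ("unix", unix_path)
    return ("tcp", (tcp_host, tcp_port))


# Hypothetical path, host, and port for illustration only.
print(forward_endpoint("/var/run/fluent.sock", "127.0.0.1", 8006))
```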

@jkowall

jkowall commented May 20, 2020

Vector seems like a better and more flexible lightweight forwarder than fluentbit IMO. Is there a reason we want to go with fluentbit?

@PettitWesley

Is there a reason we want to go with fluentbit?

@jkowall I’m not an OpenTelemetry maintainer; I’m not in charge here. I’m just trying to help. But I think that the reasoning is that Fluent Bit is part of an established, graduated CNCF project, is widely used in production (160 million image downloads on Docker Hub for the main distro, considerable usage in other builds too), and has a good set of logging inputs that can be used.

As I understand it, the plan here is not to certify Fluent Bit as the OT Logging implementation. The OT Collector will be the unified telemetry agent for OT. However, right now it is primarily focused on metrics and tracing, and that work will continue to be prioritized for some time.

We still want it to be a unified collector for telemetry on the timeline of the OT GA at the end of this year. So the proposal is that we implement support for the log data model/data type in the OT collector, but use Fluent Bit for its input plugins (forward to the collector). That way, we don’t have to spend effort on writing a bunch of log receivers right away.

I suspect (and propose) that the OT Collector Logging implementation will fall into phases:

  1. The Collector will implement the OT log data model and an OTLP log receiver to support the OT SDKs. There will be one or more builds of the Collector that bundle Fluent Bit and allow folks to collect logs from common sources. It's uncertain how many people will choose that build, and how many will use other builds of the Collector that are just for traces/metrics at GA.
  2. As the Collector gains adoption, the most used logging inputs (file tailing and syslog) will be implemented as native receivers. Fewer users will need the build that bundles Fluent Bit.
  3. In the very long term, the integration with Fluent Bit will probably be entirely deprecated. If the Collector achieves widespread adoption, folks will contribute a lot of logging features. I think this will happen: lots of key players are converging on OT in a big way. Recently in one of the OT Collector SIGs, some folks from Amazon presented the results of an agent investigation and benchmark that shows that a Golang agent can achieve our performance requirements. It’ll never be as efficient as Fluent Bit, but it can get close enough.

All of that being said, Tigran and the other OT folks are very data driven. I have immense respect for everyone whom I have met thus far in this community. I am sure they will happily review and consider alternate proposals. This idea is just a proposal; it may or may not actually happen.

@PettitWesley

We will use Fluentd Forward Protocol v1 to send logs from FluentBit to Collector

I quite like the idea of using Fluent forward protocol. I think that's very useful as a receiver even if Fluent Bit is not bundled with the collector. It means that you could run Fluent Bit or Fluentd on a set of hosts, and forward to the Collector running as an "Aggregator" on another host.

It also means you can integrate with the Fluentd Docker Log Driver.

@tigrannajaryan
Member Author

@jkowall FluentBit was selected as the most suitable currently available option as a companion to OpenTelemetry Collector for multiple reasons (Wesley listed several). This is not intended to be a logging agent comparison initiative. We do not have the capacity to do a PoC with multiple logging agents in parallel, so we went with the one that looks most promising based on preliminary research. If the PoC shows FluentBit is not a good fit we will consider another agent.

We also have no desire or intent to prevent any other logging agents from being integrated and used in conjunction with OpenTelemetry Collector. Collector is explicitly built with extensibility in mind. Anyone who wishes can replicate the PoC and the steps listed using a different logging agent, and is welcome to share the results.

@jkowall

jkowall commented May 22, 2020

Sounds reasonable to me, @tigrannajaryan and @PettitWesley. I just think that having better data pipelines available in a heavy forwarder (fluentd) versus a lightweight forwarder (fluent bit) would be a beneficial tradeoff. Keeping things in CNCF is fine, but it is not a good technical reason to standardize on one technology over another. I think both solutions would work fine and would likely fit into this pipeline easily.

@rahulchheda

rahulchheda commented May 22, 2020

Did anybody take a look at Grafana/loki (https://grafana.com/oss/loki/) with Grafana/Promtail?
Just a suggestion.

@bogdandrutu
Member

@tigrannajaryan I think the majority of this is done. Any progress update here? Maybe consider closing this and opening smaller issues if only small things are left.

@bogdandrutu bogdandrutu added this to the Backlog milestone Aug 4, 2020
@keitwb
Contributor

keitwb commented Aug 4, 2020

I'm going to do a bit more performance testing on the fluentbit standalone vs fluentbit+collector combination to get some more exact numbers on performance. That is the main thing remaining from the list in the description.

@keitwb
Contributor

keitwb commented Aug 25, 2020

Performance test results are outlined in https://docs.google.com/document/d/1uMO0DRlesOMGTjla4Ucq1W0wlnepc8DLe3tPJBvUZo4/edit?usp=sharing. Comments welcome.

@tigrannajaryan
Member Author

Just to summarize a few important points from the test results document:

  • Addition of Fluent Bit to the Collector increases the total memory usage by about 20MB. This may be a concern for very resource-limited use cases which process very low log volumes. In these use cases we can potentially do better if this functionality is implemented in the Collector instead. We will need to look into this, but only if we hear concerns about it from end users. For most other use cases the additional memory usage is either not an issue or we cannot do much better even if Fluent Bit is eliminated and we do all processing in the Collector (we still need memory for queuing/batching/etc). I suggest that we consider the additional memory usage acceptable for now and address this as needed. For each use case where this is a problem we will consider whether supporting that use case directly in the Collector (without Fluent Bit) is feasible.

  • The CPU usage by Fluent Bit is likely close to the increase we would see in the Collector if it performed the same processing without Fluent Bit.

  • Overall, Fluent Bit+Collector is capable of handling very high rates of logs (up to 150,000 records/sec tested).

I believe the tests prove the validity of the approach. It certainly is usable for many use cases and we can recommend it as the default OpenTelemetry approach for now. Over time we may gradually add more log collection capabilities directly in the Collector, thus eliminating the need to use Fluent Bit in certain cases where it is important.

@sl1316

sl1316 commented Nov 17, 2021

Hi,
I am interested in combining fluentbit (as the log collection agent in the application) with the OpenTelemetry Collector, sending the logs to the Collector for preprocessing and redirection. I saw there is already a fluentforwardreceiver in the opentelemetry library https://github.com/open-telemetry/opentelemetry-collector-contrib/tree/main/receiver/fluentforwardreceiver but I didn't see how to export data on the fluentbit side. Would you mind sharing what libraries/settings you were using on the client side (fluentbit)?

@PettitWesley
