Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[libbeat] Implement early event encoding for the Elasticsearch output #38572

Merged
merged 69 commits into from
Apr 10, 2024

Conversation

faec
Copy link
Contributor

@faec faec commented Mar 23, 2024

Add early-encoding support to the queue and the Elasticsearch output.

Early event encoding lets outputs provide helpers that can perform output serialization on an event while it is still in the queue. This early encoding is done in the memory queue's producers, and in the disk queue's reader loop. Benchmarks while writing to Elasticsearch showed significant improvements (reported numbers were measured on Filebeat with a filestream input).

Memory reduction in each preset relative to past versions:

Preset main 8.13.2 8.12.2
balanced 21% 31% 41%
throughput 43% 47% 56%
scale 23% 35% 46%
latency 24% 24% 41%

CPU reduction for each preset relative to main (earlier versions had negligible difference):

Preset CPU reduction
balanced 7%
throughput 19%
scale 7%
latency 9%

Checklist

  • My code follows the style guidelines of this project
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have made corresponding change to the default configuration files
  • I have added tests that prove my fix is effective or that my feature works
  • I have added an entry in CHANGELOG.next.asciidoc or CHANGELOG-developer.next.asciidoc.

How to test this PR locally

Any Beats or Agent configuration that uses the Elasticsearch output goes through the early encoding process and tests this PR (and other outputs still indirectly test the API changes). Some special cases to exercise are:

  • Enable the disk queue
  • Disabling output compression
  • Configure Elasticsearch with a low limit on batch size (to test retries / batch splitting)

faec added 29 commits February 28, 2024 14:59
@faec faec marked this pull request as ready for review April 8, 2024 20:49
@faec faec requested a review from a team as a code owner April 8, 2024 20:49
@faec faec requested review from rdner and leehinman April 8, 2024 20:49
@elasticmachine
Copy link
Collaborator

Pinging @elastic/elastic-agent (Team:Elastic-Agent)

@faec
Copy link
Contributor Author

faec commented Apr 8, 2024

There's still one integration test I'm looking at, in the logstash output, which calls through to the elasticsearch output with unencoded events... I'm just making sure there's no real dependency there. Otherwise things should be ready though. resolved


batch := outest.NewBatch(beat.Event{
batch := encodeBatch(client, outest.NewBatch(beat.Event{
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This linter error (and the similar ones below) is mistaken, there is no typecheck error here.

What's worse, if I add a nolint directive to skip it, it gives an error because the directive is "unused", so something is maybe wrong with the linter config...

Copy link
Contributor

@leehinman leehinman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add some of the benchmark data? It would be very nice to have for historical purposes.

Since we can't toggle back and forth for this change, do you have any tests that show documents ingested with it are exactly the same as without?

@faec
Copy link
Contributor Author

faec commented Apr 9, 2024

@leehinman

Can you add some of the benchmark data?

I'm refreshing the benchmark data right now and I'll add it to the PR when done

do you have any tests that show documents ingested with it are exactly the same as without?

Only inspection on manual tests... I will try something more systematic, though I'm not sure how much variation arises just from ingestion metadata etc. (Do you have any suggestions for how to test this? I know you did something similar with the shipper at one point.)

@leehinman
Copy link
Contributor

Only inspection on manual tests... I will try something more systematic, though I'm not sure how much variation arises just from ingestion metadata etc. (Do you have any suggestions for how to test this? I know you did something similar with the shipper at one point.)

https://github.com/leehinman/docdiff is what I did for shipper. I don't think we need a ton of testing here, the encoding function is the same between the two. I'm probably just being paranoid.

@faec
Copy link
Contributor Author

faec commented Apr 9, 2024

@leehinman Ok I rigged up some generated data in python (depth-3 json objects with many random keys of varying datatypes), ingested with and without this PR and otherwise identical filebeat configs (except an identifying es_encoding field added via processor, to distinguish the two cases). Then I rigged up some jq calls to sort them by an id field and take the diff. The only event fields that differed between the two versions (on 100 events with ~100 fields each) were:

@timestamp
fields.es_encoding [intentional by configuration]
agent.ephemeral_id
agent.id

I'm probably just being paranoid

Better a little paranoia than an escalation :-)

@pierrehilbert
Copy link
Collaborator

Better a little paranoia than an escalation :-)

I really like this

Copy link
Contributor

@leehinman leehinman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@faec faec merged commit c9e768d into elastic:main Apr 10, 2024
205 of 215 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement Team:Elastic-Agent Label for the Agent team
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants