Cherry-pick #23484 to 7.x: [libbeat] Fix Kafka output "circuit breaker is open" errors #23528
Cherry-pick of PR #23484 to 7.x branch. Original message:
What does this PR do?
This is a near-fix for #22437. It doesn't strictly respect exponential backoff configuration during a connection error (since the whole nature of the bug is that Sarama in some contexts ignores exponential backoff configuration), but it brings our error reporting and backoff behavior in line with Sarama's and prevents the CPU explosion we were seeing on connection-level Sarama errors.
The approach to applying back pressure to Sarama is a little questionable: we sleep on the error reporting thread when we detect that Sarama's circuit breaker has tripped. Most of this PR is an extended comment explaining why that works and why I settled on that approach. Ideally, Sarama's error handling behavior will become better defined and documented, so that a future release can use a more official, supported API mechanism.
Checklist
I have made corresponding changes to the documentation
I have made corresponding change to the default configuration files
I have added tests that prove my fix is effective or that my feature works
I have added an entry in CHANGELOG.next.asciidoc or CHANGELOG-developer.next.asciidoc.
How to test this PR locally
Run Filebeat with any input and the following output configuration (and no Kafka server):
Before this PR is applied, this produces an infinite loop of error messages, consuming as much CPU as the system allows. With the PR applied, this merely produces the following error at ~10s intervals: