Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Cherry-pick #23484 to 7.x: [libbeat] Fix Kafka output "circuit breaker is open" errors #23528

Merged
merged 1 commit into from
Jan 15, 2021

Conversation

faec
Copy link
Contributor

@faec faec commented Jan 15, 2021

Cherry-pick of PR #23484 to 7.x branch. Original message:

What does this PR do?

This is a near-fix for #22437. It doesn't strictly respect exponential backoff configuration during a connection error (since the whole nature of the bug is that Sarama in some contexts ignores exponential backoff configuration), but it brings our error reporting and backoff behavior in line with Sarama's and prevents the CPU explosion we were seeing on connection-level Sarama errors.

The particular approach of applying back pressure to Sarama is a little questionable: sleeping on the error reporting thread when we detect that Sarama's circuit breaker has gone off. Most of this PR is an extended comment explaining why that works and why I settled on that approach. Ideally in the future we can get Sarama's error handling behavior better defined / documented, so we can use a more official / supported API mechanism in a future release.

Checklist

  • My code follows the style guidelines of this project
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have made corresponding change to the default configuration files
  • I have added tests that prove my fix is effective or that my feature works
  • I have added an entry in CHANGELOG.next.asciidoc or CHANGELOG-developer.next.asciidoc.

How to test this PR locally

Run Filebeat with any input and the following output configuration (and no kafka server):

output.kafka:
  hosts: ["localhost:9092"]
  topic: 'foo'

Before this PR is applied, this produces an infinite loop of error messages, consuming as much CPU as the system allows. With the PR applied, this merely produces the following error at ~10s intervals:

2021-01-13T09:44:17.329-0500    ERROR   [kafka] kafka/client.go:314     Kafka (topic=foo): kafka: client has run out of available brokers to talk to (Is your cluster reachable?)

@faec faec added [zube]: In Review backport Team:Elastic-Agent Label for the Agent team Team:Integrations Label for the Integrations team labels Jan 15, 2021
@elasticmachine
Copy link
Collaborator

Pinging @elastic/agent (Team:Agent)

@elasticmachine
Copy link
Collaborator

Pinging @elastic/integrations (Team:Integrations)

@botelastic botelastic bot added needs_team Indicates that the issue/PR needs a Team:* label and removed needs_team Indicates that the issue/PR needs a Team:* label labels Jan 15, 2021
@faec faec requested review from urso and kvch January 15, 2021 13:21
Copy link
Member

@ChrsMark ChrsMark left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

backport lgtm

@elasticmachine
Copy link
Collaborator

💚 Build Succeeded

the below badges are clickable and redirect to their specific view in the CI or DOCS
Pipeline View Test View Changes Artifacts preview

Expand to view the summary

Build stats

  • Build Cause: Pull request #23528 opened

    • Start Time: 2021-01-15T13:21:46.020+0000
  • Duration: 97 min 36 sec

  • Commit: ddc21eb

Test stats 🧪

Test Results
Failed 0
Passed 17343
Skipped 1372
Total 18715

💚 Flaky test report

Tests succeeded.

Expand to view the summary

Test stats 🧪

Test Results
Failed 0
Passed 17343
Skipped 1372
Total 18715

@faec faec merged commit 162c4fc into elastic:7.x Jan 15, 2021
@zube zube bot removed the [zube]: Done label Apr 16, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport Team:Elastic-Agent Label for the Agent team Team:Integrations Label for the Integrations team
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants