
[libbeat] Fix Kafka output "circuit breaker is open" errors #23484

Merged — 6 commits merged into elastic:master from kafka-circuit-breaker on Jan 15, 2021

Conversation

@faec (Contributor) commented on Jan 13, 2021

What does this PR do?

This is a near-fix for #22437. It doesn't strictly respect exponential backoff configuration during a connection error (since the whole nature of the bug is that Sarama in some contexts ignores exponential backoff configuration), but it brings our error reporting and backoff behavior in line with Sarama's and prevents the CPU explosion we were seeing on connection-level Sarama errors.

The particular approach of applying back pressure to Sarama is a little questionable: sleeping on the error reporting thread when we detect that Sarama's circuit breaker has gone off. Most of this PR is an extended comment explaining why that works and why I settled on that approach. Ideally in the future we can get Sarama's error handling behavior better defined / documented, so we can use a more official / supported API mechanism in a future release.
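
As a rough illustration of that approach (a sketch only, not the code in this PR; errorWorker, the done channel, and the exact breaker check are assumptions), the error-reporting goroutine can block when it sees the go-resiliency "circuit breaker is open" error that Sarama surfaces:

package main

import (
    "errors"
    "log"
    "time"

    "github.com/Shopify/sarama"
    "github.com/eapache/go-resiliency/breaker"
)

// errorWorker drains the async producer's error channel. When it sees the
// "circuit breaker is open" error, it blocks for roughly the breaker's
// 10-second open interval, applying back pressure to the retry loop
// instead of letting it spin at full CPU.
func errorWorker(producer sarama.AsyncProducer, done <-chan struct{}) {
    for prodErr := range producer.Errors() {
        log.Printf("Kafka (topic=%v): %v", prodErr.Msg.Topic, prodErr.Err)
        if errors.Is(prodErr.Err, breaker.ErrBreakerOpen) {
            select {
            case <-time.After(10 * time.Second):
            case <-done: // shut down without waiting out the backoff
            }
        }
    }
}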

Checklist

  • My code follows the style guidelines of this project
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have made corresponding changes to the default configuration files
  • I have added tests that prove my fix is effective or that my feature works
  • I have added an entry in CHANGELOG.next.asciidoc or CHANGELOG-developer.next.asciidoc.

How to test this PR locally

Run Filebeat with any input and the following output configuration (and no Kafka server running):

output.kafka:
  hosts: ["localhost:9092"]
  topic: 'foo'

Before this PR is applied, this produces an infinite loop of error messages, consuming as much CPU as the system allows. With the PR applied, this merely produces the following error at ~10s intervals:

2021-01-13T09:44:17.329-0500    ERROR   [kafka] kafka/client.go:314     Kafka (topic=foo): kafka: client has run out of available brokers to talk to (Is your cluster reachable?)

@faec added the bug, libbeat, and needs_backport (PR is waiting to be backported to other branches) labels on Jan 13, 2021
@faec requested reviews from urso and kvch on January 13, 2021 15:05
@botelastic bot added the needs_team label (indicates that the issue/PR needs a Team:* label) on Jan 13, 2021
@faec added the Team:Integrations label (label for the Integrations team) on Jan 13, 2021
@elasticmachine (Collaborator) commented:

Pinging @elastic/integrations (Team:Integrations)

@botelastic bot removed the needs_team label on Jan 13, 2021
@elasticmachine (Collaborator) commented on Jan 13, 2021

💚 Build Succeeded


Build stats
  • Build Cause: Pull request #23484 updated
  • Start Time: 2021-01-14T20:04:46.784+0000
  • Duration: 101 min 41 sec
  • Commit: 574f5c0

Test stats 🧪
  • Failed: 0
  • Passed: 17387
  • Skipped: 1345
  • Total: 18732

💚 Flaky test report

Tests succeeded.


// a circuit breaker error, because it might be an old error still
// sitting in the channel from 10 seconds ago. So we only end up
// sleeping every _other_ reported breaker error.
if breakerOpen {
A reviewer commented:

Can there be multiple breaker errors in the channel? If so, we might consider adding a timestamp in fail and ignoring errors based on the timestamp.

@faec (Author) replied:

I considered that, but no, the channel is synchronous, and there's no way to tell when the error was "really" sent except to read from it continually. The timestamp of a breaker error would need to be tracked in the client struct itself rather than, say, msgRef, which would require more synchronization when updating it since it's shared between all batches. And if we didn't want to block the error channel at the end of a batch anyway, we'd need to spawn a new goroutine to dispatch delayed retries, and in the meantime there would be nothing stopping Publish from continuing to spam new data / errors...

The upshot is, we can't just ignore the errors (because reading them at all is what enables the infinite loop), and adding a timestamp doesn't help much since, invariably, the only thing we ever want to do is wait precisely 10 seconds before allowing the next input, starting as soon as we become aware of the error. I tried timestamp-based approaches while diagnosing this, and ended up with worse performance and more complicated code.
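
To see the synchronous-channel point in isolation, here is a toy sketch (not Sarama or libbeat code): the sender of an unbuffered channel blocks until the receiver reads, so pausing between reads directly throttles whoever is producing the errors.

package main

import (
    "fmt"
    "time"
)

func main() {
    errs := make(chan error) // unbuffered: each send blocks until it is read

    go func() {
        for i := 0; ; i++ {
            errs <- fmt.Errorf("error %d", i) // stalls here while the reader sleeps
        }
    }()

    for i := 0; i < 3; i++ {
        fmt.Println(<-errs)     // read one error...
        time.Sleep(time.Second) // ...then pause; the sender makes no progress meanwhile
    }
}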

if msg.ref.err != nil {
c.log.Errorf("Kafka (topic=%v): %v", msg.topic, msg.ref.err)
}
time.Sleep(10 * time.Second)
A reviewer commented:

In case sleeps accumulate, we would like some mechanism to return early if the input is stopped. Right now we only return once the error channel is closed.

e.g. https://github.com/elastic/go-concert/blob/master/timed/timed.go#L39
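
A cancellable wait in that style might look roughly like the following sketch (waitFor is a hypothetical helper shown for illustration, not go-concert's exact API):

package main

import (
    "context"
    "time"
)

// waitFor blocks for d, but returns early (with ctx.Err()) if the
// context is cancelled, e.g. because the output is being stopped.
func waitFor(ctx context.Context, d time.Duration) error {
    select {
    case <-time.After(d):
        return nil
    case <-ctx.Done():
        return ctx.Err()
    }
}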

@faec (Author) replied:

done

c.log.Errorf("Kafka (topic=%v): %v", msg.topic, msg.ref.err)
}
select {
case <-time.After(10 * time.Second):
@kvch (Contributor) commented on Jan 13, 2021:

Isn't using time.After going to cause memory issues? According to its documentation:

The underlying Timer is not recovered by the garbage collector until the timer fires. If efficiency is a concern, use NewTimer instead and call Timer.Stop if the timer is no longer needed.

I assume no, but I am wondering if you have considered this.

@faec (Author) replied:

That isn't a concern here since this can happen at most once globally every 10sec, after multiple other failures, and heap-allocating a timer at that granularity is negligible.
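
For completeness, if the allocation ever did matter, the alternative named in the quoted documentation would look roughly like this sketch (waitBackoff and the done channel are illustrative, not code from this PR):

package main

import "time"

// waitBackoff waits for d but returns early when done is closed.
// Stopping the timer on return releases it instead of leaving it live
// until it fires, which is the concern the time.After docs raise.
func waitBackoff(d time.Duration, done <-chan struct{}) {
    timer := time.NewTimer(d)
    defer timer.Stop()
    select {
    case <-timer.C: // backoff elapsed
    case <-done: // shutting down early
    }
}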

@faec changed the title from Kafka circuit breaker to [libbeat] Fix Kafka output "circuit breaker is open" errors on Jan 13, 2021
@andresrc added the Team:Elastic-Agent label (label for the Agent team) on Jan 13, 2021
@elasticmachine (Collaborator) commented:

Pinging @elastic/agent (Team:Agent)

@kvch (Contributor) left a comment:

This change will make our lives much easier. Thank you so much for working on it!

@faec merged commit c11d12c into elastic:master on Jan 15, 2021
@faec deleted the kafka-circuit-breaker branch on January 15, 2021
@faec added the v7.11.0 label and removed the needs_backport (PR is waiting to be backported to other branches) label on Jan 15, 2021
@st4r-fish commented on Oct 19, 2022

@faec I just installed Filebeat 7.11.2 (I need a couple of months before upgrading to 8.x) to avoid this issue. However, it seems that while Kafka (or the target topic) isn't available, each new log entry still triggers a warning, and all of those events are skipped once Kafka (or the target topic) becomes available again.
Is there any solution for that?
Thank you

P.S.: I followed up in this PR because the Elastic community forum doesn't answer Kafka-related questions, and I never got an answer on GitHub to my original issue before finding this one.

Edit: ACKs are set to "-1".
I tried v7.17.6 too, and Filebeat logs that the connection was established even with a bogus hostname, and I am still losing events.

Labels: bug, libbeat, Team:Elastic-Agent, Team:Integrations, v7.11.0, v7.12.0