[libbeat] Fix Kafka output "circuit breaker is open" errors #23484
Conversation
Pinging @elastic/integrations (Team:Integrations)
💚 Build Succeeded
💚 Flaky test report: Tests succeeded.
// a circuit breaker error, because it might be an old error still
// sitting in the channel from 10 seconds ago. So we only end up
// sleeping every _other_ reported breaker error.
if breakerOpen {
Can there be multiple breaker errors in the channel? If so, we might consider adding a timestamp in fail and ignoring errors based on the timestamp.
I considered that, but no, the channel is synchronous, and there's no way to tell when the error was "really" sent except to read from it continually. The timestamp of a breaker error would need to be tracked in the client struct itself rather than, say, msgRef, which would require more synchronization when updating it, since it's shared between all batches. And if we didn't want to block the error channel anyway at the end of a batch, we'd need to spawn a new goroutine to dispatch delayed retries, and in the meantime there would be nothing stopping Publish from continuing to spam new data / errors...
The upshot is, we can't just ignore the errors (because reading them at all is what enables the infinite loop), and adding a timestamp doesn't help much since, invariably, the only thing we ever want to do is wait precisely 10 seconds before allowing the next input, starting as soon as we become aware of the error. I tried timestamp-based approaches while diagnosing this, and ended up with worse performance and more complicated code.
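For context, here is a simplified sketch (not the exact libbeat code) of the pattern under discussion: the error loop toggles a flag so it only sleeps on every other breaker error it reads, since the first one it sees may be stale. `errChan` and `isBreakerError` are hypothetical stand-ins.

```go
import "time"

// drainErrors sketches the back-pressure pattern: errChan stands in for the
// producer's error feedback channel, isBreakerError for a check against
// Sarama's "circuit breaker is open" error.
func drainErrors(errChan <-chan error, isBreakerError func(error) bool) {
	breakerOpen := false
	for err := range errChan {
		if isBreakerError(err) {
			if breakerOpen {
				// Second consecutive breaker error: Sarama is failing fast
				// right now, so block this goroutine to apply back pressure
				// before reading (and thereby triggering) more errors.
				time.Sleep(10 * time.Second)
				breakerOpen = false
			} else {
				// Might be stale from a previous breaker window: remember it,
				// but don't sleep yet.
				breakerOpen = true
			}
		} else {
			breakerOpen = false
		}
	}
}
```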
libbeat/outputs/kafka/client.go
Outdated
if msg.ref.err != nil {
    c.log.Errorf("Kafka (topic=%v): %v", msg.topic, msg.ref.err)
}
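// Note: sleeping here blocks the error-reporting goroutine, which is what
// applies back pressure to Sarama (see the PR description below).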
time.Sleep(10 * time.Second)
In case sleeps accumulate, we would like some mechanism to return early if the input is stopped. Right now we only return once the error channel is closed.
e.g. https://github.com/elastic/go-concert/blob/master/timed/timed.go#L39
done
c.log.Errorf("Kafka (topic=%v): %v", msg.topic, msg.ref.err)
}
select {
case <-time.After(10 * time.Second):
Isn't using time.After going to cause memory issues? According to its documentation:
The underlying Timer is not recovered by the garbage collector until the timer fires. If efficiency is a concern, use NewTimer instead and call Timer.Stop if the timer is no longer needed.
I assume no, but I am wondering if you have considered this.
That isn't a concern here since this can happen at most once globally every 10sec, after multiple other failures, and heap-allocating a timer at that granularity is negligible.
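For reference, a minimal sketch of the NewTimer alternative the quoted documentation describes, in case early reclamation ever mattered; the names here are illustrative, not libbeat's actual code:

```go
import "time"

// waitOrCancel blocks for d, or returns early once done is closed.
func waitOrCancel(d time.Duration, done <-chan struct{}) {
	timer := time.NewTimer(d)
	defer timer.Stop() // reclaims the timer if done fires before it does
	select {
	case <-timer.C: // backoff elapsed
	case <-done: // caller is shutting down
	}
}
```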
Pinging @elastic/agent (Team:Agent)
This change will make our lives much easier. Thank you so much for working on it!
@faec I just installed Filebeat 7.11.2 (I need a couple of months before upgrading to 8.x) to avoid this issue. However, it seems that while Kafka (or the target topic) isn't available, each new log entry still triggers a warning, and all those events are skipped once Kafka (or the target topic) becomes available again. P.S.: I followed up in this issue because the Elastic community doesn't answer any Kafka-related questions, nor did I get an answer on GitHub to my original issue until I found this one. Edit: ACKs are set to "-1"
What does this PR do?
This is a near-fix for #22437. It doesn't strictly respect exponential backoff configuration during a connection error (since the whole nature of the bug is that Sarama in some contexts ignores exponential backoff configuration), but it brings our error reporting and backoff behavior in line with Sarama's and prevents the CPU explosion we were seeing on connection-level Sarama errors.
The particular approach of applying back pressure to Sarama is a little questionable: sleeping on the error reporting thread when we detect that Sarama's circuit breaker has gone off. Most of this PR is an extended comment explaining why that works and why I settled on that approach. Ideally in the future we can get Sarama's error handling behavior better defined / documented, so we can use a more official / supported API mechanism in a future release.
Checklist
- I have made corresponding changes to the documentation
- I have made corresponding changes to the default configuration files
- I have added tests that prove my fix is effective or that my feature works
- I have added an entry in CHANGELOG.next.asciidoc or CHANGELOG-developer.next.asciidoc

How to test this PR locally
Run Filebeat with any input and the following output configuration (and no Kafka server):
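(The configuration block was elided in this capture; what follows is a minimal sketch of a Kafka output pointing at a broker that isn't running, with hypothetical host and topic values.)

```yaml
output.kafka:
  # No broker is listening on this address; that's the point of the test.
  hosts: ["localhost:9092"]
  topic: "filebeat"  # hypothetical topic name
```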
Before this PR is applied, this produces an infinite loop of error messages, consuming as much CPU as the system allows. With the PR applied, this merely produces the following error at ~10s intervals:
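(The log line itself was also elided; this is a plausible reconstruction from the Errorf format in the diff above and the breaker error named in the PR title, with the timestamp and log prefix omitted and a hypothetical topic name.)

```
Kafka (topic=filebeat): circuit breaker is open
```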