[Metricbeat][Kibana] Apply backoff when errored at getting usage stats #20772
Conversation
Pinging @elastic/stack-monitoring (Stack monitoring)
Pinging @elastic/integrations-services (Team:Services)
@afharo Thanks for the PR. Adding a backoff is a good idea! I have a couple of recommendations for the implementation and one food for thought.

On the implementation

Food for thought
Wow! Thank you @ycombinator for all the insightful comments! I'll go ahead and implement the

Regarding Kibana, it is currently quite dumb and, to be fair, it's not failing, just taking too long to respond, so
Sounds good to me 👍
@ycombinator, after reverting the changes, all the tests have passed this time (flaky tests 😅)
Sorry for the late answer.
Almost all the legacy plugins are gone, and the legacy status API only depends on those, so it might be causing the status to go green faster, before the initial metrics are actually collected. Were you able to measure the approximate delay between the status turning green and the endpoint actually returning metrics?
@pgayvallet thank you for coming back with these answers.
I didn't measure it exactly. Just a couple of seconds (but enough for these tests to fail sometimes; the health check starts after 60s and then polls every second the

AFAIK, elastic/kibana#76054 will solve that issue, guaranteeing the metrics exist (using a
LGTM. Thanks for the fix!
What does this PR do?
When metricbeat collects stats from Kibana, if the attempt to collect usage fails, it backs off the next usage collection attempt by 1h, so the next attempt will not collect usage (only metrics).

Why is it important?

Usage stats collection is a CPU-intensive process in Kibana. We've noticed some clusters timing out, with metricbeat repeatedly trying again after 10s, increasing the load on Kibana and making it take even longer to respond. This gets the platform stuck in a loop where Kibana is overloaded but Monitoring doesn't receive any data to alert about it, because metricbeat can't get even the basic metrics.

Checklist
CHANGELOG.next.asciidoc or CHANGELOG-developer.next.asciidoc.

Author's Checklist
usageCollectionBackoff for the tests to force a retry (maybe there's a way, but I'm not a Go dev myself 😅)

How to test this PR locally
Deploy a local Kibana and add a delay to any usage collector so that it takes longer than 10s to respond. You'll notice the metricbeat requests will time out; metricbeat will then skip usage collection on the next attempt and try again in 1 hour.

Related issues
Use cases
This change will affect the consistency with which we ship Kibana usage data. On the other hand, it will improve the consistency with which we deliver Kibana metrics, which, to me, is more important for monitoring purposes.
Screenshots
Logs
No new logs added in this implementation