
[Metricbeat][Kibana] Apply backoff when errored at getting usage stats #20772

Merged: 14 commits merged into elastic:master from afharo:metricbeat/kibana/usage_collection on Sep 18, 2020

Conversation

afharo (Member) commented Aug 25, 2020

What does this PR do?

When metricbeat collects stats from Kibana and the attempt to collect Usage fails, it backs off the next Usage collection attempt by 1h, so subsequent attempts within that window will not collect usage (only metrics).
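
For illustration, here is a minimal Go sketch of that behavior. The type, field, and function names are assumptions chosen for readability, not the PR's actual diff:

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// usageCollectionBackoff matches the 1h backoff described above.
const usageCollectionBackoff = time.Hour

type statsCollector struct {
	usageBackoffUntil time.Time // zero value: no backoff in effect
}

// fetch always collects the cheap metrics, but skips the expensive usage
// collection while a backoff window is active.
func (c *statsCollector) fetch(now time.Time) {
	collectMetrics() // metrics are collected on every period

	if now.Before(c.usageBackoffUntil) {
		return // still backing off: skip usage, try again later
	}
	if err := collectUsage(); err != nil {
		// Usage failed (e.g. the request timed out): don't retry for 1h
		// instead of hammering an overloaded Kibana every 10s.
		c.usageBackoffUntil = now.Add(usageCollectionBackoff)
	}
}

func collectMetrics()     {}
func collectUsage() error { return errors.New("usage request timed out") }

func main() {
	c := &statsCollector{}
	c.fetch(time.Now())
	fmt.Println("usage backed off until:", c.usageBackoffUntil)
}
```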

Why is it important?

Usage stats collection is a CPU-intensive process in Kibana. We've noticed some clusters timing out and metricbeat repeatedly trying again after 10s, increasing the load on Kibana and making it take even longer to respond. This gets the platform stuck in a loop where Kibana is under heavy load, but Monitoring doesn't receive any data to alert about it because metricbeat can't even get the basic metrics.

Checklist

  • My code follows the style guidelines of this project
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have made corresponding changes to the default configuration files
  • I have added tests that prove my fix is effective or that my feature works
  • I have added an entry in CHANGELOG.next.asciidoc or CHANGELOG-developer.next.asciidoc.

Author's Checklist

  • I wish I could overwrite usageCollectionBackoff for the tests to force a retry (maybe there's a way, but I'm not a Go dev myself 😅)

How to test this PR locally

Deploy a local Kibana and add a delay to any Usage Collector so that it takes longer than 10s to respond. You'll notice that the metricbeat request times out, that usage is skipped on the following attempts, and that usage collection is retried again in 1 hour.

Related issues

Use cases

This change will affect the consistency with which we ship Kibana usage. On the other hand, it will improve the consistency with which we deliver Kibana metrics, which, to me, is more important for monitoring purposes.

Screenshots

Logs

No new logs added in this implementation

botelastic bot added and then removed the needs_team label (indicates that the issue/PR needs a Team:* label) on Aug 25, 2020
@afharo afharo force-pushed the metricbeat/kibana/usage_collection branch 2 times, most recently from a67cbc4 to 68a59f2 Compare August 25, 2020 10:19
elasticmachine (Collaborator) commented Aug 25, 2020

💚 Build Succeeded


Build stats

  • Build Cause: [Pull request #20772 updated]

  • Start Time: 2020-09-17T17:49:05.255+0000

  • Duration: 75 min 12 sec

Test stats 🧪

  • Failed: 0
  • Passed: 3176
  • Skipped: 689
  • Total: 3865

@afharo afharo force-pushed the metricbeat/kibana/usage_collection branch from 68a59f2 to 21b1720 Compare August 25, 2020 11:05
@afharo afharo marked this pull request as ready for review August 25, 2020 12:20
afharo added the review, Feature:Stack Monitoring, and Team:Services (Deprecated; former Integrations-Services team) labels on Aug 25, 2020
elasticmachine (Collaborator) commented:

Pinging @elastic/stack-monitoring (Stack monitoring)

elasticmachine (Collaborator) commented:

Pinging @elastic/integrations-services (Team:Services)

ycombinator (Contributor) commented Aug 25, 2020

@afharo Thanks for the PR. Adding a backoff is a good idea!

I have a couple of recommendations for the implementation and one piece of food for thought.

On the implementation

  1. m.usageLastCollectedOn is meant to be the time when usage was last collected, so manipulating it when there is an error response from Kibana, just so the shouldCollectUsage() method returns false for the next usageCollectionBackoff duration, seems a bit hacky. Instead, I would suggest storing a new field in the stats.MetricSet struct (as a sibling of usageLastCollectedOn), something like usageNextCollectOn. If there is an error response from Kibana, I would set this field to now + usageCollectionBackoff. Then, in the shouldCollectUsage() method, add another condition to check if now >= m.usageNextCollectOn (see the sketch after this list).

  2. Given the use case for this PR, in an ideal world Kibana would return an HTTP 503 (Service Unavailable) or 429 (Too Many Requests) when it is overloaded (maybe this is already the case — I didn't check). If so, then I would check for this specific error response and only then set m.usageNextCollectOn.
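
A rough sketch of suggestion 1, assuming a simplified stats.MetricSet; the usageCollectionPeriod constant and the onUsageError helper are illustrative additions, not existing code:

```go
package stats

import "time"

const (
	usageCollectionPeriod  = 24 * time.Hour // assumed regular usage cadence
	usageCollectionBackoff = time.Hour      // backoff after an error, per this PR
)

// MetricSet is reduced here to the two fields relevant to the suggestion.
type MetricSet struct {
	usageLastCollectedOn time.Time // when usage was actually last collected
	usageNextCollectOn   time.Time // don't attempt usage collection before this
}

// shouldCollectUsage keeps the regular cadence check and adds the suggested
// backoff condition (now >= m.usageNextCollectOn).
func (m *MetricSet) shouldCollectUsage(now time.Time) bool {
	return now.Sub(m.usageLastCollectedOn) >= usageCollectionPeriod &&
		!now.Before(m.usageNextCollectOn)
}

// onUsageError would run on an error response from Kibana.
func (m *MetricSet) onUsageError(now time.Time) {
	m.usageNextCollectOn = now.Add(usageCollectionBackoff)
}
```

This keeps usageLastCollectedOn truthful: it is only ever set when usage is actually collected.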

Food for thought

  1. In an ideal world, Kibana would be telling any client how long to back off for, rather than the client hardcoding a value or even deciding on a dynamic backoff strategy. This is because Kibana will always have at least as much, if not more, information than any client about how long Kibana might need to recover from its current state. So it would be ideal if Kibana could just tell the client how long to back off as part of a 503 or 429 response. Indeed, the HTTP protocol provides a header for exactly this purpose: Retry-After. Even in a basic implementation, Kibana could just hardcode this value to now + 1h and then, in the future, make it more sophisticated based on load, etc., if desired (see the client-side sketch below).

    All that said, I'm okay not going down this path right now. This is more just something for you to consider in the medium/long-term.
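
For reference, a hypothetical sketch of what honoring Retry-After could look like on the metricbeat side. None of this exists in either codebase as of this PR; the package and function names are made up:

```go
package kibanaclient

import (
	"net/http"
	"strconv"
	"time"
)

// retryAfter extracts a backoff hint from a 429/503 response, falling back
// to a default (e.g. the 1h from this PR) when the header is absent.
func retryAfter(resp *http.Response, fallback time.Duration) time.Duration {
	if resp.StatusCode != http.StatusTooManyRequests &&
		resp.StatusCode != http.StatusServiceUnavailable {
		return 0 // not an overload signal: no backoff needed
	}
	v := resp.Header.Get("Retry-After")
	if secs, err := strconv.Atoi(v); err == nil {
		return time.Duration(secs) * time.Second // delta-seconds form
	}
	if t, err := http.ParseTime(v); err == nil {
		return time.Until(t) // HTTP-date form
	}
	return fallback // header missing or unparsable
}
```

With something like this in place, Kibana hardcoding Retry-After: 3600 on a 503 would already be enough for clients to back off for the hour suggested above.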

afharo (Member, Author) commented Aug 25, 2020

Wow! Thank you @ycombinator for all the insightful comments! I'll go ahead and implement the m.usageNextCollectOn suggestion.

Regarding Kibana, it is currently quite dumb, and, to be fair, it's not failing, just taking too long to respond, so metricbeat is timing out on its end. I've created an issue in the Kibana repo to be more self-protective in that regard (and added a comment about the Retry-After suggestion).

afharo (Member, Author) commented Sep 2, 2020

Sounds good to me 👍

ycombinator (Contributor) commented:

Based on recent comments in #20956, I think we can revert 428b503 from this PR once #20956 is closed (merged). At that point, this PR should be ready for a final review.

afharo (Member, Author) commented Sep 9, 2020

@ycombinator, after reverting the changes, all the tests have gone through this time (flaky tests 😅)

pgayvallet (Contributor) commented:

> @elastic/kibana-platform might be able to help on identifying if there's anything that has changed on Kibana that might cause this effect?

Sorry for the late answer.

> I've noticed Kibana may return "green" status before having any "metrics" information, so the tests were failing.

Almost all the legacy plugins are no more, and the legacy status API only depends on them, so that might be causing the status to go green faster, before the initial metrics are actually collected.

Were you able to see the approximate duration between when the status goes green and when the endpoint actually returns metrics?

afharo (Member, Author) commented Sep 16, 2020

@pgayvallet thank you for coming back with these answers.

> Were you able to see the approximate duration between when the status goes green and when the endpoint actually returns metrics?

I didn't measure it exactly; just a couple of seconds (but enough for these tests to fail sometimes). The health check starts after 60s and then polls /api/status every second, waiting for the green status. Kibana usually boots up in less than 60s, so by the time the check starts polling, everything is safely in place. But when Kibana takes a little longer to start, these flaky tests will likely pass the health check as soon as /api/status returns green and then fail because they expect the metrics to already be there.

AFAIK, elastic/kibana#76054 will solve that issue, guaranteeing the metrics exist (using a ReplayObserver).

ycombinator (Contributor) left a comment


LGTM. Thanks for the fix!

ycombinator added the needs_backport (PR is waiting to be backported to other branches), v7.10.0, v8.0.0, and v7.9.2 labels on Sep 17, 2020
@afharo afharo merged commit 5332b2d into elastic:master Sep 18, 2020
@afharo afharo deleted the metricbeat/kibana/usage_collection branch September 18, 2020 09:41
afharo added a commit to afharo/beats that referenced this pull request Sep 18, 2020
[Metricbeat][Kibana] Apply backoff when errored at getting usage stats (elastic#20772)

Co-authored-by: Shaunak Kashyap <ycombinator@gmail.com>
afharo added a commit that referenced this pull request Sep 18, 2020
[Metricbeat][Kibana] Apply backoff when errored at getting usage stats (#20772) (#21163)

Co-authored-by: Shaunak Kashyap <ycombinator@gmail.com>

Co-authored-by: Shaunak Kashyap <ycombinator@gmail.com>
v1v added a commit to v1v/beats that referenced this pull request Sep 18, 2020
…ne-2.0

* upstream/master:
  Add Cloud Foundry dashboards for metricbeat (elastic#21124)
  [Metricbeat][Kibana] Apply backoff when errored at getting usage stats (elastic#20772)
  Update input-log.asciidoc (elastic#20965) (elastic#21153)
  Remove redirects page (elastic#19574)
  [Ingest Manager] Fixed input types for filebeat (elastic#21131)
  docs: add beat specific install widget (elastic#21050)
  docs: link to APM privs for API keys (elastic#20911)
  Fix index out of range error when getting AWS account name (elastic#21101)
  Agent info docs and ci test pr (elastic#19805)
  Handling missing counters in application_pool metricset  (elastic#21071)
afharo added a commit to afharo/beats that referenced this pull request Sep 18, 2020
[Metricbeat][Kibana] Apply backoff when errored at getting usage stats (elastic#20772)

Co-authored-by: Shaunak Kashyap <ycombinator@gmail.com>
afharo added a commit that referenced this pull request Sep 18, 2020
[Metricbeat][Kibana] Apply backoff when errored at getting usage stats (#20772) (#21162)

Co-authored-by: Shaunak Kashyap <ycombinator@gmail.com>

Co-authored-by: Shaunak Kashyap <ycombinator@gmail.com>
leweafan pushed a commit to leweafan/beats that referenced this pull request Apr 28, 2023
[Metricbeat][Kibana] Apply backoff when errored at getting usage stats (elastic#20772) (elastic#21162)

Co-authored-by: Shaunak Kashyap <ycombinator@gmail.com>

Co-authored-by: Shaunak Kashyap <ycombinator@gmail.com>
Labels
bug, Feature:Stack Monitoring, Metricbeat, needs_backport (PR is waiting to be backported to other branches), review, Team:Services (Deprecated; label for the former Integrations-Services team), v7.9.2, v7.10.0, v8.0.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Kibana stats: less aggressive usage collection if failed
4 participants