[Metricbeat][Kibana] Apply backoff when errored at getting usage stats #20772
Conversation
Pinging @elastic/stack-monitoring (Stack monitoring)
Pinging @elastic/integrations-services (Team:Services)
@afharo Thanks for the PR. Adding a backoff is a good idea! I have a couple of recommendations for the implementation and one food for thought.

On the implementation

Food for thought
Wow! Thank you @ycombinator for all the insightful comments! I'll go ahead and implement the

Regarding Kibana, it is currently quite dumb and, to be fair, it's not failing, just taking too long to respond, so
Sounds good to me 👍
@ycombinator, after reverting the changes, all the tests have passed this time (flaky tests 😅)
Sorry for the late answer.
Almost all the legacy plugins are gone, and the legacy status API only depends on those, so it might be causing the status to go green faster, before the initial metrics are actually collected. Were you able to measure the approximate delay between the status turning green and the endpoint actually returning metrics?
@pgayvallet thank you for coming back with these answers.
I didn't measure it exactly. Just a couple of seconds (but enough for these tests to fail sometimes; the health check starts after 60s and then polls every second the

AFAIK, elastic/kibana#76054 will solve that issue, guaranteeing the metrics exist (using a
LGTM. Thanks for the fix!
What does this PR do?
When metricbeat collects stats from Kibana, if the attempt to collect usage fails, it backs off the next usage collection attempt by 1h, so the next attempt will not collect usage (only metrics).

Why is it important?

Usage stats collection is a CPU-intensive process in Kibana. We've noticed some clusters timing out, with metricbeat repeatedly trying again after 10s, increasing the load on Kibana and making it take even longer to respond. This gets the platform stuck in a loop where Kibana is overloaded but Monitoring doesn't receive any data to alert about it, because metricbeat can't get even the basic metrics.

Checklist
CHANGELOG.next.asciidoc or CHANGELOG-developer.next.asciidoc.

Author's Checklist
usageCollectionBackoff for the tests to force a retry (maybe there's a way, but I'm not a Go dev myself 😅)

How to test this PR locally
Deploy a local Kibana and add a delay to any usage collector so that it takes longer than 10s to respond. You'll notice the metricbeat requests will time out; metricbeat will then skip usage collection on the next attempt and try again in 1 hour.

Related issues
Use cases
This change will affect the consistency with which we ship Kibana usage data. On the other hand, it will improve the consistency with which we deliver Kibana metrics, which, to me, is more important for monitoring purposes.
Screenshots
Logs
No new logs added in this implementation