-
Notifications
You must be signed in to change notification settings - Fork 8.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Monitoring] Added cgroup option for APM cpu usage #90873
Conversation
Pinging @elastic/stack-monitoring (Team:Monitoring) |
cc @graphaelli @simitt to help with testing the functionality and verifying the changes look good! |
@cyrille-leclerc and @katrin-freihofner this PR changes the order of the metrics - you previously expressed concerns (#90157 (comment)). Could you please let us know if this design is fine with you. @cyrille-leclerc could you please also review the used terminology (e.g. Elastic Agent group). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for implementing this!
From my description in #90157:
For the detection of whether or not cgroup values should be used @chrisronline mentioned that other apps set a flag in the Kibana config options. We could do something similar for APM. I am wondering how this works when using a dedicated monitoring cluster, to which data from multiple other clusters are shipped, where other clusters could partially be running inside containers, partially directly on a host system?
In case there is a dedicated monitoring cluster, I assume the Kibana config option needs to be set in the monitoring cluster, right? In this case, it could happen that multiple APM server ship to the same Kibana, and some of them are running inside a Docker container, while others are't. Have you considered this?
For the hosted Elastic Agent scenario this will be a problem, as there will be a migration phase where it is expected that some deployments run with Elastic Agent, others not. We will probably need to come up with a different solution for deciding whether to show Elastic Agent Group or APM Server.
Elastic Agent will not be supported in cloud for version 7.12
, and we only release beta support for APM Server managed by Elastic Agent. I suggest to remove the Elastic Agent group related changes from this PR, so we can get the cgroup changes in for 7.12
and further look into how to detect Elastic Agent metrics.
usageField: 'cpuacct.total.ns', | ||
periodsField: 'cpu.stats.periods', | ||
quotaField: 'cpu.cfs.quota.us', | ||
field: 'beats_stats.metrics.beat.cpu.total.value', // backup field if quota is not configured |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What does this backup field actually do? I don't see any data showing up when no quota is configured, which I believe is not a great experience, especially since this is the only CPU related graph for APM.
For ES I see CPU data without having any quota configured - does it fall back to system metrics?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's a good point. I think ES monitoring documents always contain the backup field, but perhaps APM metrics work differently so we'd have to vary our approach here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@igoristic the mapping was added for 7.11
elastic/elasticsearch#65997. What exactly are you missing from the mapping?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, missed that ticket
I'm missing the whole cgroup
object:
https://gist.github.com/igoristic/dcf8e5257628c7212553a53b17ead60c
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@igoristic maybe pull a new image? Or use 7.11.0 GA? I have a 7.11.0 deployment in Cloud, and it definitely has the cgroup object: https://gist.github.com/axw/c0e0db05ba8756b26a45f40c4f7f697c
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yea I see it on cloud too. I'd double check your local ES is a fresh copy
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think the backup logic is actually working so we might need a change like this:
--- a/x-pack/plugins/monitoring/server/lib/metrics/classes/quota_metric.js
+++ b/x-pack/plugins/monitoring/server/lib/metrics/classes/quota_metric.js
@@ -55,6 +55,7 @@ export class QuotaMetric extends Metric {
const quota = get(bucket, 'quota.value');
const deltaUsageDerivNormalizedValue = get(bucket, 'usage_deriv.normalized_value');
const periodsDerivNormalizedValue = get(bucket, 'periods_deriv.normalized_value');
+ const backupMetricDerivNormalizedValue = get(bucket, 'metric_deriv.normalized_value');
if (deltaUsageDerivNormalizedValue && periodsDerivNormalizedValue && quota > 0) {
// if throttling is configured
@@ -63,7 +64,7 @@ export class QuotaMetric extends Metric {
return factor * 100; // convert factor to percentage
}
// if throttling is NOT configured, show nothing. The user should see that something is not configured correctly
- return null;
+ return backupMetricDerivNormalizedValue || null;
};
}
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please find inlined requests for changes discussed with @simitt :
x-pack/plugins/monitoring/public/components/apm/apm_metrics.tsx
Outdated
Show resolved
Hide resolved
x-pack/plugins/monitoring/public/components/apm/instance/instance.js
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The CPU graph is still empty as mentioned in #90873 (comment) - is this supposed to be fixed?
@simitt What environment are you checking this on? Can you please give me the response to the following: GET _template/.monitoring-beats
GET .monitoring-beats*/_mapping |
Yes, I think that would be the best fallback. |
@chrisronline The only reason why I decided not to implement the backup, was because your solution in #90873 (comment) won't work, since
quota_metric.js is also used by ES to calculate its cgroup utilization which is done differently.
I'm guessing this was meant to work like this by design to avoid connotation. Because in this case the label would show cgroup, but the metrics would be showing none container usage @simitt We can address/discuss this in a different PR to where we have a more consistent backup solution for all three products that use cgroup (ES, LS, APM) |
@igoristic That's fair, but is it possible to implement a new APM-specific metric class to handle this custom logic? And maybe we can possibly update the tooltip to reflect the possibility that we're not showing proper cgroup data? We should only need to show that specific tooltip when the user has explicitly configured to show cgroup data. |
Yes, all of that is possible, but like I said I think we should keep it consistent for all products. I don't want to create separate logic for each product just to accommodate this feature for APM Since, this isn't a regression (but by design), and involves further discussion/investigation, I think we should scope it outside of this PR |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for adding the functionality. Since users have to enable the cgroup
visualization, I'm fine with merging as is. Could you please create a follow up issue and link it.
FYI, elastic/elasticsearch#69317 should resolve the stale template issue. |
@elasticmachine merge upstream |
@cyrille-leclerc could you unblock merging this PR please (the changes you requested will go into a separate PR as the elastic-agent related changes were removed). |
@igoristic could you please update the description - as the PR now only parts of #90157, but doesn't actually resolve it (as per my request to keep PRs separated and put up a dedicated one for the elastic-agent changes). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
MAny thanks for the changes!
* Added APM cpu cgroup * Fixed tests * Fixed i18n * Removed agent logic and fixed tests * Fixed test * Fixed tests and backup field * Removed backup field fix * Fixed cluster tests Co-authored-by: Kibana Machine <42973632+kibanamachine@users.noreply.github.com> # Conflicts: # x-pack/plugins/monitoring/public/components/apm/instance/instance.js # x-pack/plugins/monitoring/public/components/apm/overview/index.js
Backport: |
💛 Build succeeded, but was flaky
Test FailuresKibana Pipeline / general / Chrome UI Functional Tests.test/functional/apps/dashboard/dashboard_filtering·ts.dashboard app using current data dashboard filtering disabling a filter unfilters the data on "before all" hook for "pie charts"Standard Out
Stack Trace
Metrics [docs]Module Count
Async chunks
History
To update your PR or re-run it, just comment with: |
Thank you @igoristic! 🙏 |
Resolves: #79050
This PR adds config options to monitor container metrics:
monitoring.ui.container.apm.enabled: true
With some minor UI/UX adjustments to indicate current state and emphasize resource usage
Docs will be added as part of: #18448 (comment)