Reporting and showing cgroup-based metrics #291
Comments
What speaks against defaulting to |
A thought: do you think the special handling of the |
Nothing, it's either one way or the other 🙂. The implementation effort is the same and the performance cost is negligible (assuming it is done only once), so it could go either way. |
My thought was that, because you pointed out that the cgroup-v1 regex might need some iteration and we don't have one for cgroup-v2 yet, the chance of false positives is lower if we first try the path that the majority of systems will use anyway. However, now I'm not sure a false positive is even possible. |
It depends on what you define as a false positive. Extracting an invalid path with our regex is quite likely, at least at the start. However, I don't consider those false positives, since we should check that such a path exists and contains the expected files. |
Probably fine, but I'm not 100% sure how the UI should handle this. Afaiu My mental model in code:
const unboundedMemSize = 9223372036854771712
if ( mem.limit.bytes === unboundedMemSize ) {
  mem.limit.bytes = ??
} |
Could we convert this issue into a spec PR? See also #192. |
@sqren my mental model is something like:
var total = system.process.cgroup.memory.mem.limit.bytes;
if (total == 9223372036854771712 || total == NA) {
  total = system.memory.total;
}
var used = system.process.cgroup.memory.mem.usage.bytes;
if (used == NA) {
  used = system.memory.total - system.memory.actual.free;
} else if (system.process.cgroup.memory.stats.inactive_file.bytes != NA) {
  used -= system.process.cgroup.memory.stats.inactive_file.bytes;
}
var usage = used / total;
Makes sense? |
Sure. Since there are discussions already ongoing here, let's conclude them and then create a corresponding spec PR. |
UPDATE: after talking to @sqren offline, we agreed that UI will do the special handling for the |
UPDATE II: @elastic/apm-agent-devs I just learned that in cgroup v2 the special value used for representing unlimited memory for the cgroup is represented by a string - |
Closing as there are follow-up issues for each component |
UPDATE - removed |
Description of the issue
APM agents currently send system metrics that are aligned with Metricbeat's metricset keys, as well as values. These cover system. metricsets and some specific platform-related metrics (see Java agent documentation for example).
However, these system metrics are inaccurate when monitoring containers. The most obvious miscalculation comes from the fact that agents currently collect host total memory rather than the effective cgroup limitation, but there are also considerable differences in the used bytes, depending on how they are retrieved, as well as CPU usage per cgroup quota.
Proposed solution
Introducing new cgroup metrics
As a first step, the new metrics will include:
system.process.cgroup.memory.mem.limit.bytes
system.process.cgroup.memory.mem.usage.bytes
Both are optional.
Both are numeric and represent a number of bytes.
When not available, these metrics should not be sent.
In the future, we may extend to collect and show additional memory metrics, as well as cpu metrics.
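The "should not be sent" rule can be illustrated with a minimal TypeScript sketch of how an agent might interpret the raw contents of the cgroup limit file. The function name `parseCgroupMemoryLimit` is hypothetical and not taken from any agent. The sketch assumes the limit is read from `memory.limit_in_bytes` (cgroup v1) or `memory.max` (cgroup v2), and that an unlimited cgroup shows up as `9223372036854771712` or the literal string `max` respectively. Returning `undefined` stands for "do not send the metric".

```typescript
// Sentinels that mean "no explicit memory limit" (assumptions of this sketch).
// They are compared as strings so the large v1 value never goes through
// floating-point parsing.
const UNLIMITED_CGROUP_V1 = "9223372036854771712"; // 0x7ffffffffffff000
const UNLIMITED_CGROUP_V2 = "max";

// Hypothetical helper: turns the raw file contents into a byte count,
// or undefined when the metric should be omitted.
function parseCgroupMemoryLimit(raw: string): number | undefined {
  const value = raw.trim();
  if (value === UNLIMITED_CGROUP_V1 || value === UNLIMITED_CGROUP_V2) {
    return undefined; // unlimited: omit the limit metric
  }
  if (!/^\d+$/.test(value)) {
    return undefined; // empty or unexpected content: treat as unavailable
  }
  return Number(value);
}
```

With this shape, a caller only adds `system.process.cgroup.memory.mem.limit.bytes` to the metricset when the result is not `undefined`.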
APM UI
System memory usage values will be calculated based on cgroup metrics if such are available, using mem.usage.bytes / mem.limit.bytes. Otherwise, use the existing system.memory metrics.
NOTE: whenever a cgroup is not explicitly limited in memory, the limit read from the corresponding file may be set to 9223372036854771712 (equivalent to 0x7ffffffffffff000), which basically means unlimited. Agents conforming to the spec should not send this value (they should omit the max cgroup metric in such a case).
Formalizing that in pseudocode:
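A minimal TypeScript sketch of this calculation, under stated assumptions: the `MemorySample` interface and the `memoryUsageRatio` function are hypothetical names, the fields simply mirror the metric keys discussed in this issue, and the host-level fallback for used memory (`system.memory.total - system.memory.actual.free`) is taken from the discussion in the comments rather than from a finalized spec.

```typescript
// Hypothetical shape of one metrics sample as seen by the UI.
interface MemorySample {
  cgroupLimitBytes?: number;     // system.process.cgroup.memory.mem.limit.bytes
  cgroupUsageBytes?: number;     // system.process.cgroup.memory.mem.usage.bytes
  systemTotalBytes: number;      // system.memory.total
  systemActualFreeBytes: number; // system.memory.actual.free
}

// 0x7ffffffffffff000, written as 2^63 - 4096, which a double represents exactly.
const UNLIMITED_CGROUP_MEMORY = 2 ** 63 - 4096;

function memoryUsageRatio(sample: MemorySample): number {
  const limit = sample.cgroupLimitBytes;
  // Use the cgroup limit only when it is present and not the "unlimited" value.
  const total =
    limit !== undefined && limit !== UNLIMITED_CGROUP_MEMORY
      ? limit
      : sample.systemTotalBytes;

  const usage = sample.cgroupUsageBytes;
  // Fall back to host-level used memory when cgroup usage is missing.
  const used =
    usage !== undefined
      ? usage
      : sample.systemTotalBytes - sample.systemActualFreeBytes;

  return used / total;
}
```

The UI-side check for the unlimited value is kept even though conforming agents omit it, so that data from agents that still send it is handled the same way.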
Agent implementation details
https://github.com/elastic/apm/blob/master/specs/agents/metrics.md#cgroup-metrics
Related issues