Add new JVM runtime environment metrics #3352
| | | | | | | pool | Name of pool [1] | | Required |
| process.runtime.jvm.memory.allocation | Size of object allocated by thread | Bytes | `By` | Histogram | Int64 | | | JDK 17+ | Required |
| | | | | | | thread | thread ID | | Opt-In |
| | | | | | | class | Fully qualified class name | | Opt-In |
The existing implementation includes an `arena` attribute instead of `class`. The class is accessible, but it's the class of the object allocated, not the class in which the allocation occurred, which isn't clear in the description. This could be too high a cardinality even for `opt-in`.
OK, I'm in favor of replacing the `class` attribute with `arena`. I think `arena` should be required.
With respect to `process.runtime.jvm.cpu.monitor.blocked` and `process.runtime.jvm.cpu.monitor.wait`, the `class` attribute references the monitor class. Do you agree that `class` can remain opt-in here?
| process.runtime.jvm.cpu.monitor.wait | Time thread time spend waiting at a monitor | Seconds | `s` | Histogram | Int64 | | | JDK 17+ | Required |
| | | | | | | thread | thread ID | | Opt-In |
| | | | | | | class | Fully qualified class name | | Opt-In |
| process.runtime.jvm.cpu.monitor.blocked | Time thread spend blocked at a monitor | Seconds | `s` | Histogram | Int64 | | | JDK 17+ | Required |
I don't see this in the current implementation. Which JFR event produces this? (This question also applies to `process.runtime.jvm.cpu.monitor.wait`.)
| process.runtime.jvm.cpu.monitor.blocked | Time thread spend blocked at a monitor | Seconds | `s` | Histogram | Int64 | | | JDK 17+ | Required |
| process.runtime.jvm.cpu.monitor.blocked | Time thread was blocked at a monitor | Seconds | `s` | Histogram | Int64 | | | JDK 17+ | Required |
`process.runtime.jvm.cpu.monitor.wait` is actually in the implementation already, just under a different name. I've renamed it here because I think it could reduce confusion. `process.runtime.jvm.cpu.monitor.blocked` is not in the current implementation. I have added it here because I feel it could be useful. `jdk.JavaMonitorWait` and `jdk.JavaMonitorEnter` produce those metrics. If others agree, I can add them to the implementation.
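For readers following along, here is a minimal sketch of how the two JFR events named above could be turned into per-event durations with JFR streaming. It is not the existing instrumentation code: the class name `MonitorEventSketch`, the `sink` callback, and the 10 ms threshold are illustrative assumptions, and the metric names are the ones proposed at this point in the review.

```java
import java.time.Duration;
import java.util.function.BiConsumer;
import jdk.jfr.consumer.RecordedEvent;
import jdk.jfr.consumer.RecordingStream;

/** Sketch only: derives the two proposed monitor metrics from JFR streaming events. */
final class MonitorEventSketch {
    static final String WAIT_EVENT = "jdk.JavaMonitorWait";
    static final String ENTER_EVENT = "jdk.JavaMonitorEnter";

    /** Maps a JFR event name to the metric name proposed in this PR. */
    static String metricNameFor(String eventName) {
        switch (eventName) {
            case WAIT_EVENT:
                return "process.runtime.jvm.cpu.monitor.wait";
            case ENTER_EVENT:
                return "process.runtime.jvm.cpu.monitor.blocked";
            default:
                throw new IllegalArgumentException("unexpected event: " + eventName);
        }
    }

    /** Converts an event duration to fractional seconds (the unit `s`). */
    static double toSeconds(Duration d) {
        return d.toNanos() / 1_000_000_000.0;
    }

    /** Opens a stream that forwards (metric name, seconds) pairs to a hypothetical sink. */
    static RecordingStream open(BiConsumer<String, Double> sink) {
        RecordingStream rs = new RecordingStream();
        for (String name : new String[] {WAIT_EVENT, ENTER_EVENT}) {
            // Assumed threshold: events shorter than 10 ms are not emitted.
            rs.enable(name).withThreshold(Duration.ofMillis(10));
            rs.onEvent(name, (RecordedEvent e) ->
                    sink.accept(metricNameFor(name), toSeconds(e.getDuration())));
        }
        return rs;
    }

    public static void main(String[] args) throws InterruptedException {
        try (RecordingStream rs = open((metric, s) -> System.out.println(metric + " " + s))) {
            rs.startAsync();     // consume events on a background thread
            Thread.sleep(1_000); // let the application run for a while
        }
    }
}
```

In a real integration the sink would record into a histogram instrument rather than print.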
82b88b4 to 7e57099
Can you add `note` sections and link to the Java APIs that would (typically) be used to collect these? (Related to #3418.)
- id: metric.process.runtime.jvm.memory.allocation
  type: metric
  metric_name: process.runtime.jvm.memory.allocation
  brief: "Size of object allocated by thread. Only available in JDK 17+."
I think(?) this could be implemented in Java 8 using https://docs.oracle.com/javase/8/docs/jre/api/management/extension/com/sun/management/ThreadMXBean.html#getThreadAllocatedBytes-long:A-
I think that's a little bit different. `ThreadMXBean` returns the cumulative allocation per thread, while the JFR event `ObjectAllocationSample` describes a single allocation instance (sampled to reduce overhead; sampling only happens on the TLAB slow path). But now that I think about it, it might be more useful to know the total allocation per thread than to have statistical data on allocation sizes per thread. Additionally, the statistical data would be skewed, because sampling is only done on the slow path, when a new TLAB is required or an allocation won't fit into a TLAB (this is because the event's purpose is to show where the allocations are happening, not how big they are).
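To make the cumulative-per-thread alternative concrete, here is a minimal sketch that reads the counter through the `com.sun.management.ThreadMXBean` extension linked above. The class name `ThreadAllocationSketch` and the `-1` fallback for JVMs without the extension are assumptions made for illustration.

```java
import java.lang.management.ManagementFactory;

/** Sketch only: cumulative bytes allocated by a thread via com.sun.management.ThreadMXBean. */
final class ThreadAllocationSketch {
    /** Returns the cumulative allocated bytes for a thread ID, or -1 if unsupported or disabled. */
    static long allocatedBytes(long threadId) {
        java.lang.management.ThreadMXBean base = ManagementFactory.getThreadMXBean();
        if (!(base instanceof com.sun.management.ThreadMXBean)) {
            return -1; // this JVM does not expose the extension
        }
        com.sun.management.ThreadMXBean bean = (com.sun.management.ThreadMXBean) base;
        if (!bean.isThreadAllocatedMemorySupported() || !bean.isThreadAllocatedMemoryEnabled()) {
            return -1;
        }
        return bean.getThreadAllocatedBytes(threadId);
    }

    public static void main(String[] args) {
        long id = Thread.currentThread().getId();
        long before = allocatedBytes(id);
        byte[] junk = new byte[1 << 20]; // allocate roughly 1 MiB on this thread
        long after = allocatedBytes(id);
        System.out.println("delta=" + (after - before) + " bytes, array length=" + junk.length);
    }
}
```

Because the value is a running total rather than a per-allocation sample, it would map more naturally onto a counter than onto the histogram proposed in the table.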
> I think(?) this could be implemented in Java 8 using https://docs.oracle.com/javase/8/docs/jre/api/management/extension/com/sun/management/ThreadMXBean.html#getThreadAllocatedBytes-long:A-

That would be cool.

> the JFR event ObjectAllocationSample describes a single allocation instance (sampled to reduce overhead. Sampling only happens on the TLAB slow path).

If we continue to report this in JFR, we'll want to somehow communicate to users that these allocations are sampled.

> this is because the events purpose is to show where the allocations are happening, not how big they are

Presumably for building out a profile?
> Presumably for building out a profile?

Yup, you can generate flame graphs from the stack traces and other useful things like that.

> If we continue to report this in JFR

I think that we should not report allocations with JFR, because the purpose of those events is actually a little different from what we want to use them for. Also, the current implementation (`jdk.ObjectAllocationInNewTLAB` and `jdk.ObjectAllocationOutsideTLAB`) would result in too high an overhead for people to use in production. Those events are turned off by default in both the monitoring and profiling JFR configurations, because they aren't throttled like `jdk.ObjectAllocationSample` is.
Co-authored-by: Trask Stalnaker <trask.stalnaker@gmail.com>
Hi @trask, do you mean in the note section of attributes? I'm having trouble figuring out how to get the notes to show for the metrics themselves. Additionally, I wasn't sure of the best way to denote which metrics are available in JDK 17+ only. Based on https://github.com/open-telemetry/build-tools/blob/v0.17.0/semantic-conventions/syntax.md, maybe …
extends: attributes.process.runtime.jvm.cpu.monitor
brief: "Time thread was waiting at a monitor. Only available in JDK 17+."
instrument: histogram
unit: "ms"
Will want to use the `s` unit for all durations.
Should we add a bucket recommendation at the same time?
instrument: histogram
unit: "ms"

- id: metric.process.runtime.jvm.cpu.monitor.blocked
Is it ever useful to sum together the time a monitor was blocked and waiting? Trying to think about whether blocked vs waiting makes sense as an attribute rather than a separate metric.
Seems similar to `process.cpu.time`, which has the attribute `state` (if specified, SHOULD be one of: `system`, `user`, `wait`), so maybe `process.runtime.jvm.cpu.monitor.time` with attribute `state`?
Yup I think that's a good idea
updated with suggestion applied
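As a rough illustration of why a single metric with a `state` attribute answers the "sum them together" question, here is a sketch that keeps one set of measurements keyed by state. The class name `MonitorTimeSketch` and the attribute values `BLOCKED` and `WAIT` are assumptions for illustration; the thread above only settles on the attribute name `state`.

```java
import java.util.EnumMap;
import java.util.Map;

/** Sketch only: one monitor-time measurement stream, split by a `state` attribute. */
final class MonitorTimeSketch {
    /** Hypothetical attribute values; the thread above does not fix them. */
    enum State { BLOCKED, WAIT }

    private final Map<State, Double> secondsByState = new EnumMap<>(State.class);

    /** Adds a measured duration (in seconds) under the given state. */
    void record(State state, double seconds) {
        secondsByState.merge(state, seconds, Double::sum);
    }

    /** Time attributed to one state, like querying the metric with a filter on `state`. */
    double seconds(State state) {
        return secondsByState.getOrDefault(state, 0.0);
    }

    /** Time across all states, like querying the metric without the `state` attribute. */
    double totalSeconds() {
        return secondsByState.values().stream().mapToDouble(Double::doubleValue).sum();
    }

    public static void main(String[] args) {
        MonitorTimeSketch m = new MonitorTimeSketch();
        m.record(State.BLOCKED, 0.25);
        m.record(State.WAIT, 0.5);
        System.out.println("blocked=" + m.seconds(State.BLOCKED) + " total=" + m.totalSeconds());
    }
}
```

With two separate metrics, the combined view would have to be computed by the consumer instead of falling out of dropping an attribute.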
Oh, you're right. Let's just add a manual "Note" at the end of each metric section in the markdown for now, and I'll open an issue in build-tools about adding "note" to metrics in YAML.
…elemetry-specification into runtime-metrics-jfr
- id: metric.process.runtime.jvm.cpu.context_switch
  type: metric
  metric_name: process.runtime.jvm.cpu.context_switch
Can you check if there's a difference between this and the `process.context_switches` metric?
metric_name: process.runtime.jvm.cpu.context_switch
metric_name: process.runtime.jvm.context_switches
Hi @trask, I checked the HotSpot code, and it seems to me that the JFR source of this metric does not account for virtual threads, only platform threads. However, it does look like `process.runtime.jvm.context_switches` is a little different, because it reports a rate in Hz rather than a count like `process.context_switches` does.
Also, the description for `process.context_switches` says: "Number of times the process has been context switched." Does this mean it's referring to process context switches rather than thread context switches? The metrics derived from JFR refer to threads specifically.
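To illustrate the rate-versus-count difference, here is a minimal sketch that reads the rate from JFR streaming and shows the arithmetic needed to approximate a count from it. The event name `jdk.ThreadContextSwitchRate`, its `switchRate` field, the one-second period, and the class name `ContextSwitchRateSketch` are assumptions for illustration; the thread above does not name the JFR source.

```java
import java.time.Duration;
import java.util.function.DoubleConsumer;
import jdk.jfr.consumer.RecordingStream;

/** Sketch only: a context-switch rate in Hz versus an approximate count. */
final class ContextSwitchRateSketch {
    // Assumed JFR event and field names; not confirmed by the discussion above.
    static final String EVENT = "jdk.ThreadContextSwitchRate";
    static final String RATE_FIELD = "switchRate";

    /** Approximates the number of switches in an interval, assuming a constant rate. */
    static double approximateSwitches(double rateHz, Duration interval) {
        return rateHz * (interval.toMillis() / 1000.0);
    }

    /** Opens a stream that forwards each sampled rate (Hz) to a hypothetical sink. */
    static RecordingStream open(DoubleConsumer rateSink) {
        RecordingStream rs = new RecordingStream();
        rs.enable(EVENT).withPeriod(Duration.ofSeconds(1)); // assumed sampling period
        rs.onEvent(EVENT, e -> rateSink.accept(e.getDouble(RATE_FIELD)));
        return rs;
    }

    public static void main(String[] args) throws InterruptedException {
        try (RecordingStream rs = open(hz -> System.out.println(hz + " Hz"))) {
            rs.startAsync();
            Thread.sleep(3_000);
        }
    }
}
```

A rate sampled this way is a gauge-like value, so it cannot be compared directly with a cumulative count such as `process.context_switches` without an integration step like `approximateSwitches`.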
attributes:
  - ref: thread.id
    requirement_level: opt_in
  - id: class
Check out the `code.namespace` field as an alternative to defining a new attribute.
attributes:
  - ref: thread.id
    requirement_level: opt_in
  - id: mode
Once #3431 lands, should change this to `network.direction`.
I think ok to change it proactively (that PR could take a while...)
OK, I changed it to `network.direction`.
@@ -70,7 +94,7 @@ groups:
 metric_name: process.runtime.jvm.gc.duration
 brief: "Duration of JVM garbage collection actions."
 instrument: histogram
-unit: "ms"
+unit: "s"
Covered in #3458.
Co-authored-by: Trask Stalnaker <trask.stalnaker@gmail.com>
This PR was marked stale due to lack of activity. It will be closed in 7 days.
@roberttoyonaga heads up: most likely this PR will be closed, and we'll ask you to resubmit the PR in a new repo. Please refer to #3474 (comment).
Can you move this to https://github.com/open-telemetry/semantic-conventions?
I've copied this PR over to the new repo here: open-telemetry/semantic-conventions#44 @trask @jack-berg @mateuszrzeszutek
Thanks @roberttoyonaga. Closing this PR and picking up the convo over there!
Changes
This PR adds the `process.runtime.jvm.cpu.monitor.wait`, `process.runtime.jvm.cpu.monitor.blocked`, `process.runtime.jvm.network.io`, and `process.runtime.jvm.cpu.context_switch` metrics to the runtime environment metrics.
Metric gathering implementations for these new metrics already exist in a basic form in https://github.com/open-telemetry/opentelemetry-java-instrumentation/tree/main/instrumentation/runtime-telemetry-jfr/library
Once the details around these new metrics are decided, the implementations can be updated.
JFR streaming would be used to gather these metrics. This feature has only been available since JDK 14, so these metrics would only be supported for JDK 17+.
Please see original discussion in this PR and at the Java + Instrumentation SIG.
Related issues: open-telemetry/semantic-conventions#1222