Add new JVM runtime environment metrics #3352

Closed

Contributor

@roberttoyonaga roberttoyonaga commented Mar 30, 2023

Changes

This PR adds the process.runtime.jvm.cpu.monitor.wait, process.runtime.jvm.cpu.monitor.blocked, process.runtime.jvm.network.io, and process.runtime.jvm.cpu.context_switch metrics to the runtime environment metrics.

Metric gathering implementations for these new metrics already exist in a basic form in https://github.com/open-telemetry/opentelemetry-java-instrumentation/tree/main/instrumentation/runtime-telemetry-jfr/library
Once the details around these new metrics are decided, the implementations can be updated.

JFR streaming would be used to gather these metrics. JFR streaming has only been available since JDK 14, so these metrics would only be supported on JDK 17+.

Please see original discussion in this PR and at the Java + Instrumentation SIG.

Related issues open-telemetry/semantic-conventions#1222

| | | | | | | pool | Name of pool [1] | | Required |
| process.runtime.jvm.memory.allocation | Size of object allocated by thread | Bytes | `By` | Histogram | Int64 | | | JDK 17+ | Required |
| | | | | | | thread | thread ID | | Opt-In |
| | | | | | | class | Fully qualified class name | | Opt-In |
Member

The existing implementation includes an arena attribute instead of class. Class is accessible, but it's the class of the object allocated, not the class in which the allocation occurred, which isn't clear in the description. This could be too high a cardinality even for opt-in.

Contributor Author

@roberttoyonaga roberttoyonaga Apr 5, 2023

OK, I'm in favor of replacing the class attribute with arena. I think arena should be required.

Contributor Author

With respect to process.runtime.jvm.cpu.monitor.blocked and process.runtime.jvm.cpu.monitor.wait, the class attribute references the monitor class. Do you agree that class can remain opt-in here?

| process.runtime.jvm.cpu.monitor.wait | Time thread time spend waiting at a monitor | Seconds | `s` | Histogram | Int64 | | | JDK 17+ | Required |
| | | | | | | thread | thread ID | | Opt-In |
| | | | | | | class | Fully qualified class name | | Opt-In |
| process.runtime.jvm.cpu.monitor.blocked | Time thread spend blocked at a monitor | Seconds | `s` | Histogram | Int64 | | | JDK 17+ | Required |
Member

I don't see this in the current implementation. Which JFR event produces this? (This question also applies to process.runtime.jvm.cpu.monitor.wait.)

Suggested change
- | process.runtime.jvm.cpu.monitor.blocked | Time thread spend blocked at a monitor | Seconds | `s` | Histogram | Int64 | | | JDK 17+ | Required |
+ | process.runtime.jvm.cpu.monitor.blocked | Time thread was blocked at a monitor | Seconds | `s` | Histogram | Int64 | | | JDK 17+ | Required |

Contributor Author

@roberttoyonaga roberttoyonaga Apr 5, 2023

process.runtime.jvm.cpu.monitor.wait is actually in the implementation already, just under a different name. I've renamed it here, because I think it could reduce confusion. process.runtime.jvm.cpu.monitor.blocked is not in the current implementation. I have added it here because I feel it could be useful. jdk.JavaMonitorWait and jdk.JavaMonitorEnter produce those metrics. If others agree, I can add them to the implementation.
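
For readers unfamiliar with JFR streaming, the following is a minimal sketch of how the two events named above could be consumed with `jdk.jfr.consumer.RecordingStream` (JDK 14+). It is not the code from the instrumentation repository: the class name `MonitorEventStream`, the `create` method, and the sink callback are hypothetical, and removing the event threshold is an assumption made so that short waits show up in a demo.

```java
import java.time.Duration;
import java.util.function.BiConsumer;
import jdk.jfr.consumer.RecordingStream;

/** Illustrative sketch only; names are hypothetical and not taken from the PR. */
final class MonitorEventStream {

    /**
     * Builds a JFR stream that forwards the duration (in seconds) of monitor
     * wait and monitor enter events to the sink, keyed by JFR event name.
     * The caller is responsible for starting and closing the stream.
     */
    static RecordingStream create(BiConsumer<String, Double> sink) {
        RecordingStream stream = new RecordingStream();
        for (String eventName : new String[] {"jdk.JavaMonitorWait", "jdk.JavaMonitorEnter"}) {
            // Assumption for the demo: drop the duration threshold so that
            // short waits are reported too. This increases overhead.
            stream.enable(eventName).withoutThreshold();
            stream.onEvent(eventName, event -> {
                Duration d = event.getDuration();
                sink.accept(eventName, d.toNanos() / 1e9);
            });
        }
        return stream;
    }

    public static void main(String[] args) throws InterruptedException {
        try (RecordingStream stream = create((name, seconds) ->
                System.out.printf("%s %.6f s%n", name, seconds))) {
            stream.startAsync();
            Object lock = new Object();
            synchronized (lock) {
                lock.wait(100); // intended to trigger a jdk.JavaMonitorWait event
            }
            Thread.sleep(2000); // events are delivered asynchronously after a flush
        }
    }
}
```

In a real instrumentation the sink would record into a histogram instead of printing; that wiring is left out here to keep the sketch free of external dependencies.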

@roberttoyonaga roberttoyonaga requested review from a team April 20, 2023 21:26
@linux-foundation-easycla
linux-foundation-easycla bot commented Apr 20, 2023

CLA Signed

The committers listed above are authorized under a signed CLA.

  • ✅ login: roberttoyonaga / name: Robert Toyonaga (7e57099)

Member

@trask trask left a comment

can you add note sections and link to the Java APIs that would (typically) be used to collect these? (related to #3418)

- id: metric.process.runtime.jvm.memory.allocation
type: metric
metric_name: process.runtime.jvm.memory.allocation
brief: "Size of object allocated by thread. Only available in JDK 17+."
Member

Contributor Author

I think that's a little bit different. ThreadMXBean returns the cumulative allocation per thread, while the JFR event ObjectAllocationSample describes a single allocation instance (sampled to reduce overhead; sampling only happens on the TLAB slow path). But now that I think about it, it might be more useful to know the total allocation per thread rather than have statistical data on allocation sizes per thread. Additionally, the statistical data would be skewed because sampling is only done on the slow path, when a new TLAB is required or an allocation won't fit into a TLAB (this is because the event's purpose is to show where the allocations are happening, not how big they are).
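
To make the cumulative-per-thread alternative concrete, here is a small sketch that reads the allocation counter through `com.sun.management.ThreadMXBean`. The class name `ThreadAllocationProbe` and its method are hypothetical; the sketch assumes a HotSpot-style JVM that exposes the `com.sun.management` extension and returns -1 where that support is missing or disabled.

```java
import java.lang.management.ManagementFactory;

/** Illustrative sketch only; the class and method names are hypothetical. */
final class ThreadAllocationProbe {

    /**
     * Returns the number of bytes the current thread allocated while the task ran,
     * or -1 if the JVM does not support (or has disabled) thread allocation tracking.
     */
    static long allocatedBytesDuring(Runnable task) {
        java.lang.management.ThreadMXBean base = ManagementFactory.getThreadMXBean();
        if (!(base instanceof com.sun.management.ThreadMXBean)) {
            return -1; // extension interface not available on this JVM
        }
        com.sun.management.ThreadMXBean bean = (com.sun.management.ThreadMXBean) base;
        if (!bean.isThreadAllocatedMemorySupported() || !bean.isThreadAllocatedMemoryEnabled()) {
            return -1;
        }
        long threadId = Thread.currentThread().getId();
        long before = bean.getThreadAllocatedBytes(threadId);
        task.run();
        long after = bean.getThreadAllocatedBytes(threadId);
        if (before < 0 || after < 0) {
            return -1; // the bean reports -1 when the value is unavailable
        }
        return after - before;
    }

    public static void main(String[] args) {
        long bytes = allocatedBytesDuring(() -> {
            byte[] buffer = new byte[1 << 20];
            buffer[0] = 1; // touch the array so it is actually used
        });
        System.out.println("Allocated bytes (or -1 if unsupported): " + bytes);
    }
}
```

Because the counter is cumulative, it would map naturally onto a monotonic sum per thread rather than a histogram of individual allocation sizes.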

Member

> I think(?) this could be implemented in Java 8 using https://docs.oracle.com/javase/8/docs/jre/api/management/extension/com/sun/management/ThreadMXBean.html#getThreadAllocatedBytes-long:A-

That would be cool.

> the JFR event ObjectAllocationSample describes a single allocation instance (sampled to reduce overhead. Sampling only happens on the TLAB slow path).

If we continue to report this in JFR, we'll want to somehow communicate to users that these allocations are sampled.

> this is because the events purpose is to show where the allocations are happening, not how big they are

Presumably for building out a profile?

Contributor Author

> Presumably for building out a profile?

Yup, you can generate flame graphs from the stack traces and other useful things like that.

> If we continue to report this in JFR

I think that we should not report allocations with JFR because the purpose of those events is actually a little different than what we want to use them for. Also, the current implementation (jdk.ObjectAllocationInNewTLAB and jdk.ObjectAllocationOutsideTLAB) would result in too high an overhead for people to use in production. Those events are turned off by default in both monitoring and profiling JFR configurations. This is because they aren't throttled like jdk.ObjectAllocationSample is.

roberttoyonaga and others added 2 commits April 21, 2023 09:06
Co-authored-by: Trask Stalnaker <trask.stalnaker@gmail.com>
Co-authored-by: Trask Stalnaker <trask.stalnaker@gmail.com>
@roberttoyonaga
Contributor Author

> can you add note sections and link to the Java APIs that would (typically) be used to collect these? (related to #3418)

Hi @trask, do you mean in the note section of attributes? I'm having trouble figuring out how to get the notes to show for the metrics themselves.

Additionally, I wasn't sure of the best way to denote which metrics are available in JDK 17+ only. Based on https://github.com/open-telemetry/build-tools/blob/v0.17.0/semantic-conventions/syntax.md, maybe note? (But notes only seem to be generated for attributes?)

extends: attributes.process.runtime.jvm.cpu.monitor
brief: "Time thread was waiting at a monitor. Only available in JDK 17+."
instrument: histogram
unit: "ms"
Member

We will want to use the `s` unit for all durations.

Member

Should we add a bucket recommendation at the same time?
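
As a small illustration of the unit change discussed in this thread, the sketch below converts a `java.time.Duration` (for example, one reported by a JFR event) into fractional seconds before it is recorded. The helper name `DurationUnits.toSeconds` is hypothetical and not part of any OpenTelemetry API.

```java
import java.time.Duration;

/** Illustrative sketch only; the helper name is hypothetical. */
final class DurationUnits {

    /** Converts a duration to fractional seconds, the unit requested for all duration metrics. */
    static double toSeconds(Duration duration) {
        // Going through nanoseconds keeps sub-millisecond precision.
        return duration.toNanos() / 1e9;
    }

    public static void main(String[] args) {
        System.out.println(toSeconds(Duration.ofMillis(250)));
    }
}
```

Recording in seconds means any histogram bucket boundaries would also be expressed in seconds, which is worth keeping in mind if a bucket recommendation is added.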

instrument: histogram
unit: "ms"

- id: metric.process.runtime.jvm.cpu.monitor.blocked
Member

Is it ever useful to sum together the time a monitor was blocked and waiting? Trying to think about whether blocked vs waiting makes sense as an attribute rather than a separate metric.

Member

seems similar to process.cpu.time which has attribute

> state, if specified, SHOULD be one of: system, user, wait

so maybe process.runtime.jvm.cpu.monitor.time with attribute state?

Contributor Author

Yup I think that's a good idea

Contributor Author

updated with suggestion applied
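
The single-metric-with-`state` idea from this thread can be sketched without any OpenTelemetry dependency. In the sketch below, the class `MonitorTimeByState`, the attribute values `blocked` and `wait`, and the mapping from JFR event names to states are all assumptions made for illustration; the thread does not fix the final attribute values.

```java
import java.util.EnumMap;
import java.util.Map;

/** Illustrative sketch only; names and attribute values are assumptions. */
final class MonitorTimeByState {

    /** Hypothetical values for a "state" attribute on one combined monitor time metric. */
    enum State {
        BLOCKED("blocked"),
        WAIT("wait");

        final String attributeValue;

        State(String attributeValue) {
            this.attributeValue = attributeValue;
        }
    }

    /** Maps the JFR event that produced a measurement to the state it represents. */
    static State stateFor(String jfrEventName) {
        switch (jfrEventName) {
            case "jdk.JavaMonitorEnter":
                return State.BLOCKED;
            case "jdk.JavaMonitorWait":
                return State.WAIT;
            default:
                throw new IllegalArgumentException("Not a monitor event: " + jfrEventName);
        }
    }

    private final Map<State, Double> totalSeconds = new EnumMap<>(State.class);

    /** Adds one measurement, in seconds, under the state derived from the event name. */
    void record(String jfrEventName, double seconds) {
        totalSeconds.merge(stateFor(jfrEventName), seconds, Double::sum);
    }

    double total(State state) {
        return totalSeconds.getOrDefault(state, 0.0);
    }

    /** Summing across states is the aggregation a single metric makes easy. */
    double totalAllStates() {
        return total(State.BLOCKED) + total(State.WAIT);
    }
}
```

With one metric, a backend can either break the time down by `state` or drop the attribute to get the combined total, which is the question raised at the start of the thread.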

@trask
Member

trask commented Apr 21, 2023

> Hi @trask , do you mean in the note section of attributes? I'm having trouble figuring out how to get the note's to show for the metrics themselves.

oh, you're right, let's just add a manual "Note" at the end of each metric section in the markdown for now, and I'll open an issue in build-tools about adding "note" to metrics in yaml


- id: metric.process.runtime.jvm.cpu.context_switch
type: metric
metric_name: process.runtime.jvm.cpu.context_switch
Member

@trask trask Apr 28, 2023

can you check if there's a difference between this and process.context_switches metric?

Suggested change
- metric_name: process.runtime.jvm.cpu.context_switch
+ metric_name: process.runtime.jvm.context_switches

Contributor Author

@roberttoyonaga roberttoyonaga Apr 28, 2023

Hi @trask, I checked the HotSpot code, and it seems to me that the JFR source of this metric does not account for virtual threads, only platform threads. However, it does look like process.runtime.jvm.context_switches is a little different because it reports a rate in Hz rather than a count like process.context_switches does.

Contributor Author

@roberttoyonaga roberttoyonaga Apr 28, 2023

Also, the description for process.context_switches says: "Number of times the process has been context switched." Does this mean it's referring to process context switches rather than thread context switches? The metrics derived from JFR refer to threads specifically.
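
To illustrate the rate-versus-count difference raised above: a rate in Hz can only be turned into an approximate count by multiplying it by the length of the interval it covers. The sketch below shows that arithmetic; the class and method names are hypothetical, and the result is an estimate that assumes the rate was constant over the interval.

```java
import java.time.Duration;

/** Illustrative sketch only; names are hypothetical. */
final class ContextSwitchEstimate {

    /**
     * Estimates how many context switches occurred during an interval, given an
     * average rate in Hz (switches per second). Assumes a constant rate.
     */
    static long estimateCount(double rateHz, Duration interval) {
        double seconds = interval.toNanos() / 1e9;
        return Math.round(rateHz * seconds);
    }

    public static void main(String[] args) {
        System.out.println(estimateCount(100.0, Duration.ofSeconds(10)));
    }
}
```

Because the conversion is lossy, a rate-based metric and a counter such as process.context_switches are not interchangeable without knowing the sampling interval.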

attributes:
- ref: thread.id
requirement_level: opt_in
- id: class
Member

Check out the code.namespace field as an alternative to defining a new attribute.

attributes:
- ref: thread.id
requirement_level: opt_in
- id: mode
Member

Once #3431 lands, should change this to network.direction.

Member

I think it's OK to change it proactively (that PR could take a while...)

Contributor Author

OK, I changed it to network.direction.

@@ -70,7 +94,7 @@ groups:
metric_name: process.runtime.jvm.gc.duration
brief: "Duration of JVM garbage collection actions."
instrument: histogram
- unit: "ms"
+ unit: "s"
Member

Covered in #3458.

Co-authored-by: Trask Stalnaker <trask.stalnaker@gmail.com>
@github-actions
github-actions bot commented May 6, 2023

This PR was marked stale due to lack of activity. It will be closed in 7 days.

@github-actions github-actions bot added the Stale label May 6, 2023
@trask trask removed the Stale label May 6, 2023
@reyang reyang changed the title from "Add new runtime environment metrics" to "Add new JVM runtime environment metrics" May 8, 2023
@reyang reyang added the area:semantic-conventions label May 9, 2023
@reyang
Member

reyang commented May 9, 2023

@roberttoyonaga heads up - most likely this PR will be closed, and we'll ask you to resubmit the PR in a new repo, please refer to #3474 (comment).

Contributor

@jsuereth jsuereth left a comment

@roberttoyonaga
Contributor Author

I've copied this PR over to the new repo here: open-telemetry/semantic-conventions#44 @trask @jack-berg @mateuszrzeszutek

@jack-berg
Member

Thanks @roberttoyonaga. Closing this PR and picking up the convo over there!

@jack-berg jack-berg closed this May 19, 2023