
[Epic] Capture thread dumps along with thread-level CPU usage #135

Open
lkonno opened this issue Oct 31, 2023 · 11 comments

Labels: feat (New feature or request)

lkonno (Contributor) commented Oct 31, 2023

Describe the feature

When investigating high CPU usage by a Java process, thread-level CPU usage (captured with the top command) is gathered alongside the thread dumps (jstack).

It would be helpful to have an easier way to perform this capture remotely than executing a script inside the pod.
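For context, that manual in-pod capture typically looks something like this (a rough sketch; <pid> stands for the target Java process):

$ top -b -H -n 1 -p <pid> > top-threads.txt   # per-thread CPU usage; thread IDs shown in decimal
$ jstack -l <pid> > thread-dump.txt           # thread dump; nid= values are the same IDs in hex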

Any other information?

No response

@lkonno lkonno added feat New feature or request needs-triage Needs thorough attention from code reviewers labels Oct 31, 2023
@andrewazores andrewazores moved this to Stretch Goals in 3.0.0 release Feb 20, 2024
@andrewazores andrewazores removed the needs-triage Needs thorough attention from code reviewers label Jul 5, 2024
@andrewazores andrewazores moved this to Backlog in 4.0.0 release Jul 5, 2024
roberttoyonaga commented Jul 19, 2024

Instead of using jstack or top, you could use either JFR or JMX to gather this info.

JFR:

  • The jdk.ThreadDump event is enabled by default and periodically dumps all thread stacks. This is available with HotSpot, but I believe not with GraalVM or any other JDKs.
  • The jdk.ThreadCPULoad event is another periodic event, capturing CPU usage on a per-thread basis. It's also enabled by default, and it is supported in both HotSpot and GraalVM.

These events are a good solution since Cryostat already captures this data; nothing extra needs to be done. You would just have to inspect the JFR recordings, as in the sketch below.
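A minimal sketch of reading those per-thread CPU events out of a recording with the jdk.jfr.consumer API (the file name is hypothetical, e.g. a recording downloaded from Cryostat):

import java.nio.file.Path;
import jdk.jfr.consumer.RecordedEvent;
import jdk.jfr.consumer.RecordingFile;

public class ThreadCpuLoads {
    public static void main(String[] args) throws Exception {
        Path recording = Path.of("recording.jfr"); // hypothetical downloaded recording
        for (RecordedEvent event : RecordingFile.readAllEvents(recording)) {
            if ("jdk.ThreadCPULoad".equals(event.getEventType().getName())) {
                // "user" and "system" are per-thread CPU load fractions in [0, 1]
                System.out.printf("%s %s user=%.3f system=%.3f%n",
                        event.getStartTime(),
                        event.getThread() == null ? "<unknown>" : event.getThread().getJavaName(),
                        event.getFloat("user"),
                        event.getFloat("system"));
            }
        }
    }
}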

JMX

For convenience, maybe it would be helpful if Cryostat had a dedicated view for this data that was automatically updated? Perhaps something similar to the other JMX dashboard widgets.

Side note: giving Cryostat the ability to use jstack or jcmd to gather thread dumps or CPU usage is probably not a good approach. Both of those tools rely on the Attach API. On POSIX, this uses Unix domain sockets, so you'd have to somehow set up the sockets in a shared volume that the client and server containers can both access. That may not work out of the box with the standard attach flow, and if the containers are on different hosts it won't work at all.

andrewazores (Member) commented Jul 19, 2024

Both of those tools rely on the Attach API. On POSIX, this uses Unix domain sockets, so you'd have to somehow set up the sockets in a shared volume that the client and server containers can both access. That may not work out of the box with the standard attach flow, and if the containers are on different hosts it won't work at all.

I was thinking that this would be done using the Cryostat Agent, so there would be no need to do it over the network or over any shared volume.

But if @lkonno and her team can get the information they need out of existing JFR events, then that's even better! We do have an overall CPU usage panel on the Grafana dashboard that simply displays high-level CPU timeseries stats:

[screenshot: overall CPU usage panel on the Grafana dashboard]

I don't think we have a good way right now to render the thread dumps (stack traces), but given the JFR file maybe it's reasonable to just use JMC to open the file and inspect it.

For ThreadCPULoad, I guess we can do something for per-thread granularity that is similar to how we already treat object allocation events:

[screenshots: aggregate and breakout Grafana panels for object allocation events]

roberttoyonaga commented Jul 19, 2024

I was thinking that this would be done using the Cryostat Agent, so there would be no need to do it over the network or over any shared volume.

Oh yes, issuing the commands from the Agent is a good idea! Then you wouldn't really have to worry about the limitations of Unix sockets. Maybe that's another option if the existing JFR events (or JMX MXBeans) aren't good enough.

For ThreadCPULoad, I guess we can do something for per-thread granularity that is similar to how we already treat object allocation events:

Yeah, that seems good too. I initially wasn't sure how you'd represent all the threads, because there may be very high cardinality. But if there's already a way of representing many object types, a similar approach would probably make sense here too.

@andrewazores andrewazores changed the title [Request] Capture thread dumps along with thread-level CPU usage [Epic] Capture thread dumps along with thread-level CPU usage Jul 19, 2024
andrewazores (Member) commented:
The visualization for many object types certainly leaves something to be desired, since the set of allocation types is pretty high cardinality too. As the screenshots show, the aggregate panel can become quite noisy, although it does help with identifying spikes of particular allocation types at particular times, and perhaps with identifying correlations. The breakout dashboard gives each allocation type an independent chart (with linked x-axis domains for zoom control) and an independent y-axis, so it helps with identifying trends within particular allocation types rather than comparing types to each other.

I imagine something similar might make sense for the thread visualization. Since it's percentage-load based, we know we can fix the y-axis range to [0, 1], which is nice.

roberttoyonaga commented:
That's a good point. Having the aggregate panel to help identify correlations probably makes sense for threads as well as allocations. And having the CPU loads already normalized is convenient too.

I'm not sure whether Cryostat already has this feature, but the ability to filter out allocations below a specified size threshold might help with noise in the aggregated chart. I imagine there are probably a lot of smaller allocations, but in most cases you care more about trends among object types with large or sustained allocations. This could allow some graph re-scaling as well, spacing out the "important" lines more and making them easier to differentiate.

andrewazores (Member) commented:
I think the jfr-datasource can already support a query like that, but we don't currently have any controls on the panel to add that kind of condition to the query. It's a good idea.

andrewazores (Member) commented:
The specific thread dump format that the support team receives from other tooling is important to support here, at least as the foundation of this feature. This is the format output by tools like jcmd or jstack, e.g.:

$ /usr/lib/jvm/java/bin/jcmd 3740249 Thread.print
3740249:
2024-09-10 11:40:58
Full thread dump OpenJDK 64-Bit Server VM (17.0.12+7 mixed mode, sharing):

Threads class SMR info:
_java_thread_list=0x00007fb9400018a0, length=15, elements={
0x00007fba080278e0, 0x00007fba082158c0, 0x00007fba08216cb0, 0x00007fba0821c1d0,
0x00007fba0821d600, 0x00007fba0821ea20, 0x00007fba082203c0, 0x00007fba08221900,
0x00007fba08222d80, 0x00007fba0822a4d0, 0x00007fba0822e3d0, 0x00007fba0832b470,
0x00007fba08409550, 0x00007fba08630d50, 0x00007fb940000eb0
}

"main" #1 prio=5 os_prio=0 cpu=395.16ms elapsed=1576.37s tid=0x00007fba080278e0 nid=0x391297 in Object.wait()  [0x00007fba0e5fe000]
   java.lang.Thread.State: WAITING (on object monitor)
	at java.lang.Object.wait(java.base@17.0.12/Native Method)
	- waiting on <0x000000045d468440> (a java.lang.ProcessImpl)
	at java.lang.Object.wait(java.base@17.0.12/Object.java:338)
	at java.lang.ProcessImpl.waitFor(java.base@17.0.12/ProcessImpl.java:434)
	- locked <0x000000045d468440> (a java.lang.ProcessImpl)
	at io.quarkus.cli.build.ExecuteUtil.executeProcess(ExecuteUtil.java:73)
	at io.quarkus.cli.build.BuildSystemRunner.run(BuildSystemRunner.java:50)
	at io.quarkus.cli.Dev.call(Dev.java:46)
	at io.quarkus.cli.Dev.call(Dev.java:18)
	at picocli.CommandLine.executeUserObject(CommandLine.java:2045)
	at picocli.CommandLine.access$1500(CommandLine.java:148)
	at picocli.CommandLine$RunLast.executeUserObjectOfLastSubcommandWithSameParent(CommandLine.java:2465)
	at picocli.CommandLine$RunLast.handle(CommandLine.java:2457)
	at picocli.CommandLine$RunLast.handle(CommandLine.java:2419)
	at picocli.CommandLine$AbstractParseResultHandler.execute(CommandLine.java:2277)
	at picocli.CommandLine$RunLast.execute(CommandLine.java:2421)
	at picocli.CommandLine.execute(CommandLine.java:2174)
	at io.quarkus.cli.QuarkusCli.run(QuarkusCli.java:116)
	at io.quarkus.runtime.ApplicationLifecycleManager.run(ApplicationLifecycleManager.java:140)
	at io.quarkus.runtime.Quarkus.run(Quarkus.java:71)
	at io.quarkus.runtime.Quarkus.run(Quarkus.java:44)
	at io.quarkus.cli.Main.main(Main.java:9)

I think this probably matches the file output format of the HotSpotDiagnosticMXBean:
https://docs.oracle.com/en/java/javase/21/docs/api/jdk.management/com/sun/management/HotSpotDiagnosticMXBean.html#dumpThreads(java.lang.String,com.sun.management.HotSpotDiagnosticMXBean.ThreadDumpFormat)

The support team uses other scripts and tooling which ingest this format, such as https://github.com/aogburn/yatda.


The ThreadMXBean#dumpAllThreads format is somewhat different by default. Here is the output you get from simply calling ThreadInfo#toString on each element and concatenating the results:

"Reference Handler" daemon prio=10 Id=2 RUNNABLE
	at java.base@17.0.12/java.lang.ref.Reference.waitForReferencePendingList(Native Method)
	at java.base@17.0.12/java.lang.ref.Reference.processPendingReferences(Reference.java:253)
	at java.base@17.0.12/java.lang.ref.Reference$ReferenceHandler.run(Reference.java:215)

"Finalizer" daemon prio=8 Id=3 WAITING on java.lang.ref.ReferenceQueue$Lock@5a8d58c2
	at java.base@17.0.12/java.lang.Object.wait(Native Method)
	-  waiting on java.lang.ref.ReferenceQueue$Lock@5a8d58c2
	at java.base@17.0.12/java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:155)
	at java.base@17.0.12/java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:176)
	at java.base@17.0.12/java.lang.ref.Finalizer$FinalizerThread.run(Finalizer.java:172)

"Signal Dispatcher" daemon prio=9 Id=4 RUNNABLE

"Common-Cleaner" daemon prio=8 Id=23 TIMED_WAITING on java.lang.ref.ReferenceQueue$Lock@1be5569b
	at java.base@17.0.12/java.lang.Object.wait(Native Method)
	-  waiting on java.lang.ref.ReferenceQueue$Lock@1be5569b
	at java.base@17.0.12/java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:155)
	at java.base@17.0.12/jdk.internal.ref.CleanerImpl.run(CleanerImpl.java:140)
	at java.base@17.0.12/java.lang.Thread.run(Thread.java:840)
	at java.base@17.0.12/jdk.internal.misc.InnocuousThread.run(InnocuousThread.java:162)

"JDWP Transport Listener: dt_socket" daemon prio=10 Id=24 RUNNABLE (in native)

"JDWP Event Helper Thread" daemon prio=10 Id=25 RUNNABLE

"JFR Recorder Thread" daemon prio=5 Id=26 RUNNABLE

"JFR Periodic Tasks" daemon prio=5 Id=27 TIMED_WAITING on jdk.jfr.internal.JVM$ChunkRotationMonitor@44baf018
	at java.base@17.0.12/java.lang.Object.wait(Native Method)
	-  waiting on jdk.jfr.internal.JVM$ChunkRotationMonitor@44baf018
	at jdk.jfr@17.0.12/jdk.jfr.internal.PlatformRecorder.takeNap(PlatformRecorder.java:544)
	at jdk.jfr@17.0.12/jdk.jfr.internal.PlatformRecorder.periodicTask(PlatformRecorder.java:524)
	at jdk.jfr@17.0.12/jdk.jfr.internal.PlatformRecorder.lambda$startDiskMonitor$1(PlatformRecorder.java:449)
	at jdk.jfr@17.0.12/jdk.jfr.internal.PlatformRecorder$$Lambda$76/0x00007f5b500457f8.run(Unknown Source)
	at java.base@17.0.12/java.lang.Thread.run(Thread.java:840)

"Notification Thread" daemon prio=9 Id=29 RUNNABLE

"RMI TCP Accept-9091" daemon prio=5 Id=31 RUNNABLE (in native)
	at java.base@17.0.12/sun.nio.ch.Net.accept(Native Method)
	at java.base@17.0.12/sun.nio.ch.NioSocketImpl.accept(NioSocketImpl.java:760)
	at java.base@17.0.12/java.net.ServerSocket.implAccept(ServerSocket.java:675)
	at java.base@17.0.12/java.net.ServerSocket.platformImplAccept(ServerSocket.java:641)
	at java.base@17.0.12/java.net.ServerSocket.implAccept(ServerSocket.java:617)
	at java.base@17.0.12/java.net.ServerSocket.implAccept(ServerSocket.java:574)
	at java.base@17.0.12/java.net.ServerSocket.accept(ServerSocket.java:532)
	at java.rmi@17.0.12/sun.rmi.transport.tcp.TCPTransport$AcceptLoop.executeAcceptLoop(TCPTransport.java:413)
	...

	Number of locked synchronizers = 1
	- java.util.concurrent.locks.ReentrantLock$NonfairSync@6861d70

...

Notably, the format of each thread entry is somewhat different, and details like the thread CPU usage are missing. The "header" and "footer" sections of the full thread dump file are also missing.

header:

2024-09-10 12:36:07
Full thread dump OpenJDK 64-Bit Server VM (17.0.12+7 mixed mode, sharing):

Threads class SMR info:
_java_thread_list=0x00007fb9400018a0, length=15, elements={
0x00007fba080278e0, 0x00007fba082158c0, 0x00007fba08216cb0, 0x00007fba0821c1d0,
0x00007fba0821d600, 0x00007fba0821ea20, 0x00007fba082203c0, 0x00007fba08221900,
0x00007fba08222d80, 0x00007fba0822a4d0, 0x00007fba0822e3d0, 0x00007fba0832b470,
0x00007fba08409550, 0x00007fba08630d50, 0x00007fb940000eb0
}

footer:

...
"G1 Service" os_prio=0 cpu=256.32ms elapsed=4884.90s tid=0x00007fba081ddbf0 nid=0x39129c runnable  

"VM Periodic Task Thread" os_prio=0 cpu=743.47ms elapsed=4884.89s tid=0x00007fba0822be20 nid=0x3912a7 waiting on condition  

JNI global refs: 24, weak refs: 0

So, being able to capture the ThreadInfo[] in a serialized form is probably still helpful, but this is not sufficient to solve the original feature request.
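For reference, a minimal sketch of producing that concatenated ThreadInfo[] output in-process (note that ThreadInfo#toString caps each stack trace at 8 frames, which is where the "..." above comes from):

import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

public class ThreadInfoDump {
    public static void main(String[] args) {
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        // true, true: include locked monitors and locked ownable synchronizers
        ThreadInfo[] infos = threads.dumpAllThreads(true, true);
        StringBuilder sb = new StringBuilder();
        for (ThreadInfo info : infos) {
            sb.append(info); // ThreadInfo#toString truncates deep stacks
        }
        System.out.print(sb);
    }
}

Per-thread CPU time is separately available via ThreadMXBean#getThreadCpuTime(long), but stitching it into this format would be custom work on our side.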

I think from here we have several different features that could be implemented:

  1. Retrieval of jcmd/jstack-style thread dump contents (original support team request). This can certainly be done using the Agent and dumping to a local file, then sending that back to Cryostat. Without the Agent it is a little less clear how to grab this exact thread dump format since the MXBean wants to just dump to file (same as heap dumps), but https://github.com/openjdk/jdk/blob/30645f3309c040deb5bef71b1bd349942b4aa076/src/jdk.management/share/classes/com/sun/management/internal/HotSpotDiagnostic.java#L165 and https://github.com/openjdk/jdk/blob/master/src/java.base/share/classes/jdk/internal/vm/ThreadDumper.java#L64 may be useful in figuring out or hooking into HotSpot's actual format here.
  2. Retrieval of plain ThreadMXBean ThreadInfo[] dump contents. This is much more straightforward, but does not exactly meet the original requirements and the format may not be recognized by other tooling which expects to ingest thread dump data.
  3. Exposure of ThreadMXBean ThreadInfo[] and/or JFR ThreadDump/ThreadCPULoad events as metrics data, Grafana chart panels, etc.

andrewazores (Member) commented Sep 10, 2024

It looks like jstack has the exact same format as jcmd above; jcmd is likely just acting as a shell and delegating to the same implementation as jstack:

$ jstack -l -e 3822827
2024-09-10 15:28:36
Full thread dump OpenJDK 64-Bit Server VM (17.0.12+7 mixed mode, sharing):

Threads class SMR info:
_java_thread_list=0x00007f12f40018a0, length=15, elements={
0x00007f13bc0278e0, 0x00007f13bc2158c0, 0x00007f13bc216cb0, 0x00007f13bc21c1d0,
0x00007f13bc21d600, 0x00007f13bc21ea20, 0x00007f13bc2203c0, 0x00007f13bc221900,
0x00007f13bc222d80, 0x00007f13bc22a4d0, 0x00007f13bc22dfe0, 0x00007f13bc32b1a0,
0x00007f13bc40ff30, 0x00007f13bc5ff7a0, 0x00007f12f4000eb0
}

"main" #1 prio=5 os_prio=0 cpu=362.74ms elapsed=757.67s allocated=50568K defined_classes=2034 tid=0x00007f13bc0278e0 nid=0x3a5528 in Object.wait()  [0x00007f13c131d000]
   java.lang.Thread.State: WAITING (on object monitor)
	at java.lang.Object.wait(java.base@17.0.12/Native Method)
	- waiting on <0x000000045d5c5b10> (a java.lang.ProcessImpl)
	at java.lang.Object.wait(java.base@17.0.12/Object.java:338)
	at java.lang.ProcessImpl.waitFor(java.base@17.0.12/ProcessImpl.java:434)
	- locked <0x000000045d5c5b10> (a java.lang.ProcessImpl)
	at io.quarkus.cli.build.ExecuteUtil.executeProcess(ExecuteUtil.java:73)
	at io.quarkus.cli.build.BuildSystemRunner.run(BuildSystemRunner.java:50)
	at io.quarkus.cli.Dev.call(Dev.java:46)
	at io.quarkus.cli.Dev.call(Dev.java:18)
	at picocli.CommandLine.executeUserObject(CommandLine.java:2045)
	at picocli.CommandLine.access$1500(CommandLine.java:148)
	at picocli.CommandLine$RunLast.executeUserObjectOfLastSubcommandWithSameParent(CommandLine.java:2465)
	at picocli.CommandLine$RunLast.handle(CommandLine.java:2457)
	at picocli.CommandLine$RunLast.handle(CommandLine.java:2419)
	at picocli.CommandLine$AbstractParseResultHandler.execute(CommandLine.java:2277)
	at picocli.CommandLine$RunLast.execute(CommandLine.java:2421)
	at picocli.CommandLine.execute(CommandLine.java:2174)
	at io.quarkus.cli.QuarkusCli.run(QuarkusCli.java:116)
	at io.quarkus.runtime.ApplicationLifecycleManager.run(ApplicationLifecycleManager.java:140)
	at io.quarkus.runtime.Quarkus.run(Quarkus.java:71)
	at io.quarkus.runtime.Quarkus.run(Quarkus.java:44)
	at io.quarkus.cli.Main.main(Main.java:9)

   Locked ownable synchronizers:
	- None

"Reference Handler" #2 daemon prio=10 os_prio=0 cpu=0.30ms elapsed=757.66s allocated=0B defined_classes=0 tid=0x00007f13bc2158c0 nid=0x3a552f waiting on condition  [0x00007f13778f7000]
   java.lang.Thread.State: RUNNABLE
	at java.lang.ref.Reference.waitForReferencePendingList(java.base@17.0.12/Native Method)
	at java.lang.ref.Reference.processPendingReferences(java.base@17.0.12/Reference.java:253)
	at java.lang.ref.Reference$ReferenceHandler.run(java.base@17.0.12/Reference.java:215)

   Locked ownable synchronizers:
	- None

"Finalizer" #3 daemon prio=8 os_prio=0 cpu=0.22ms elapsed=757.66s allocated=32B defined_classes=0 tid=0x00007f13bc216cb0 nid=0x3a5530 in Object.wait()  [0x00007f13777f7000]
   java.lang.Thread.State: WAITING (on object monitor)
	at java.lang.Object.wait(java.base@17.0.12/Native Method)
	- waiting on <0x000000045a860000> (a java.lang.ref.ReferenceQueue$Lock)
	at java.lang.ref.ReferenceQueue.remove(java.base@17.0.12/ReferenceQueue.java:155)
	- locked <0x000000045a860000> (a java.lang.ref.ReferenceQueue$Lock)
	at java.lang.ref.ReferenceQueue.remove(java.base@17.0.12/ReferenceQueue.java:176)
	at java.lang.ref.Finalizer$FinalizerThread.run(java.base@17.0.12/Finalizer.java:172)

   Locked ownable synchronizers:
	- None

"Signal Dispatcher" #4 daemon prio=9 os_prio=0 cpu=0.18ms elapsed=757.66s allocated=480B defined_classes=0 tid=0x00007f13bc21c1d0 nid=0x3a5531 waiting on condition  [0x0000000000000000]
   java.lang.Thread.State: RUNNABLE

   Locked ownable synchronizers:
	- None

"Service Thread" #5 daemon prio=9 os_prio=0 cpu=0.10ms elapsed=757.66s allocated=0B defined_classes=0 tid=0x00007f13bc21d600 nid=0x3a5532 runnable  [0x0000000000000000]
   java.lang.Thread.State: RUNNABLE

   Locked ownable synchronizers:
	- None

"Monitor Deflation Thread" #6 daemon prio=9 os_prio=0 cpu=21.10ms elapsed=757.66s allocated=0B defined_classes=0 tid=0x00007f13bc21ea20 nid=0x3a5533 runnable  [0x0000000000000000]
   java.lang.Thread.State: RUNNABLE

   Locked ownable synchronizers:
	- None

"C2 CompilerThread0" #7 daemon prio=9 os_prio=0 cpu=114.96ms elapsed=757.66s allocated=0B defined_classes=0 tid=0x00007f13bc2203c0 nid=0x3a5534 waiting on condition  [0x0000000000000000]
   java.lang.Thread.State: RUNNABLE
   No compile task

   Locked ownable synchronizers:
	- None

"C1 CompilerThread0" #17 daemon prio=9 os_prio=0 cpu=98.63ms elapsed=757.66s allocated=1752B defined_classes=0 tid=0x00007f13bc221900 nid=0x3a5535 waiting on condition  [0x0000000000000000]
   java.lang.Thread.State: RUNNABLE
   No compile task

   Locked ownable synchronizers:
	- None

"Sweeper thread" #22 daemon prio=9 os_prio=0 cpu=0.07ms elapsed=757.66s allocated=0B defined_classes=0 tid=0x00007f13bc222d80 nid=0x3a5536 runnable  [0x0000000000000000]
   java.lang.Thread.State: RUNNABLE

   Locked ownable synchronizers:
	- None

"Notification Thread" #23 daemon prio=9 os_prio=0 cpu=0.07ms elapsed=757.66s allocated=0B defined_classes=0 tid=0x00007f13bc22a4d0 nid=0x3a5537 runnable  [0x0000000000000000]
   java.lang.Thread.State: RUNNABLE

   Locked ownable synchronizers:
	- None

"Common-Cleaner" #24 daemon prio=8 os_prio=0 cpu=0.64ms elapsed=757.66s allocated=0B defined_classes=0 tid=0x00007f13bc22dfe0 nid=0x3a5539 in Object.wait()  [0x00007f1376ef6000]
   java.lang.Thread.State: TIMED_WAITING (on object monitor)
	at java.lang.Object.wait(java.base@17.0.12/Native Method)
	- waiting on <0x000000045a810178> (a java.lang.ref.ReferenceQueue$Lock)
	at java.lang.ref.ReferenceQueue.remove(java.base@17.0.12/ReferenceQueue.java:155)
	- locked <0x000000045a810178> (a java.lang.ref.ReferenceQueue$Lock)
	at jdk.internal.ref.CleanerImpl.run(java.base@17.0.12/CleanerImpl.java:140)
	at java.lang.Thread.run(java.base@17.0.12/Thread.java:840)
	at jdk.internal.misc.InnocuousThread.run(java.base@17.0.12/InnocuousThread.java:162)

   Locked ownable synchronizers:
	- None

"Reference Reaper #1" #26 daemon prio=5 os_prio=0 cpu=0.33ms elapsed=757.58s allocated=9656B defined_classes=1 tid=0x00007f13bc32b1a0 nid=0x3a553d in Object.wait()  [0x00007f13769b9000]
   java.lang.Thread.State: WAITING (on object monitor)
	at java.lang.Object.wait(java.base@17.0.12/Native Method)
	- waiting on <0x000000045a828170> (a java.lang.ref.ReferenceQueue$Lock)
	at java.lang.ref.ReferenceQueue.remove(java.base@17.0.12/ReferenceQueue.java:155)
	- locked <0x000000045a828170> (a java.lang.ref.ReferenceQueue$Lock)
	at java.lang.ref.ReferenceQueue.remove(java.base@17.0.12/ReferenceQueue.java:176)
	at io.smallrye.common.ref.References$ReaperThread.run(References.java:58)

   Locked ownable synchronizers:
	- None

"executor-thread-2" #30 daemon prio=5 os_prio=0 cpu=0.34ms elapsed=757.50s allocated=1976B defined_classes=0 tid=0x00007f13bc40ff30 nid=0x3a5541 waiting on condition  [0x00007f13765b9000]
   java.lang.Thread.State: WAITING (parking)
	at jdk.internal.misc.Unsafe.park(java.base@17.0.12/Native Method)
	- parking to wait for  <0x000000045a840180> (a org.jboss.threads.EnhancedQueueExecutor)
	at java.util.concurrent.locks.LockSupport.park(java.base@17.0.12/LockSupport.java:211)
	at org.jboss.threads.EnhancedQueueExecutor$PoolThreadNode.park(EnhancedQueueExecutor.java:2259)
	at org.jboss.threads.EnhancedQueueExecutor$ThreadBody.run(EnhancedQueueExecutor.java:1553)
	at org.jboss.threads.DelegatingRunnable.run(DelegatingRunnable.java:11)
	at org.jboss.threads.ThreadLocalResettingRunnable.run(ThreadLocalResettingRunnable.java:11)
	at java.lang.Thread.run(java.base@17.0.12/Thread.java:840)
	at org.jboss.threads.JBossThread.run(JBossThread.java:483)

   Locked ownable synchronizers:
	- None

"process reaper (pid 3822941)" #31 daemon prio=10 os_prio=0 cpu=0.11ms elapsed=757.30s allocated=248B defined_classes=0 tid=0x00007f13bc5ff7a0 nid=0x3a5560 runnable  [0x00007f13a072d000]
   java.lang.Thread.State: RUNNABLE
	at java.lang.ProcessHandleImpl.waitForProcessExit0(java.base@17.0.12/Native Method)
	at java.lang.ProcessHandleImpl$1.run(java.base@17.0.12/ProcessHandleImpl.java:150)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(java.base@17.0.12/ThreadPoolExecutor.java:1136)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(java.base@17.0.12/ThreadPoolExecutor.java:635)
	at java.lang.Thread.run(java.base@17.0.12/Thread.java:840)

   Locked ownable synchronizers:
	- <0x000000045d5d9f88> (a java.util.concurrent.ThreadPoolExecutor$Worker)

"Attach Listener" #32 daemon prio=9 os_prio=0 cpu=0.16ms elapsed=0.10s allocated=0B defined_classes=0 tid=0x00007f12f4000eb0 nid=0x3a6f85 waiting on condition  [0x0000000000000000]
   java.lang.Thread.State: RUNNABLE

   Locked ownable synchronizers:
	- None

"VM Thread" os_prio=0 cpu=10.47ms elapsed=757.66s tid=0x00007f13bc211930 nid=0x3a552e runnable  

"GC Thread#0" os_prio=0 cpu=0.29ms elapsed=757.67s tid=0x00007f13bc05af80 nid=0x3a5529 runnable  

"GC Thread#1" os_prio=0 cpu=1.68ms elapsed=757.35s tid=0x00007f1314008e60 nid=0x3a5547 runnable  

"GC Thread#2" os_prio=0 cpu=1.05ms elapsed=757.35s tid=0x00007f1314009870 nid=0x3a5548 runnable  

"GC Thread#3" os_prio=0 cpu=0.93ms elapsed=757.35s tid=0x00007f131400a280 nid=0x3a5549 runnable  

"GC Thread#4" os_prio=0 cpu=0.87ms elapsed=757.35s tid=0x00007f131400ac90 nid=0x3a554a runnable  

"GC Thread#5" os_prio=0 cpu=0.44ms elapsed=757.35s tid=0x00007f131400b6a0 nid=0x3a554b runnable  

"GC Thread#6" os_prio=0 cpu=0.90ms elapsed=757.35s tid=0x00007f131400c0e0 nid=0x3a554c runnable  

"GC Thread#7" os_prio=0 cpu=2.04ms elapsed=757.35s tid=0x00007f131400cb20 nid=0x3a554d runnable  

"GC Thread#8" os_prio=0 cpu=0.92ms elapsed=757.35s tid=0x00007f131400d560 nid=0x3a554e runnable  

"GC Thread#9" os_prio=0 cpu=0.89ms elapsed=757.35s tid=0x00007f131400dfa0 nid=0x3a554f runnable  

"GC Thread#10" os_prio=0 cpu=0.25ms elapsed=757.35s tid=0x00007f131400ea00 nid=0x3a5550 runnable  

"GC Thread#11" os_prio=0 cpu=0.81ms elapsed=757.34s tid=0x00007f131400f480 nid=0x3a5551 runnable  

"GC Thread#12" os_prio=0 cpu=0.89ms elapsed=757.34s tid=0x00007f131400ff00 nid=0x3a5552 runnable  

"GC Thread#13" os_prio=0 cpu=0.83ms elapsed=757.34s tid=0x00007f1314010980 nid=0x3a5553 runnable  

"GC Thread#14" os_prio=0 cpu=0.87ms elapsed=757.34s tid=0x00007f1314011400 nid=0x3a5554 runnable  

"GC Thread#15" os_prio=0 cpu=1.00ms elapsed=757.34s tid=0x00007f1314011e80 nid=0x3a5555 runnable  

"GC Thread#16" os_prio=0 cpu=0.85ms elapsed=757.34s tid=0x00007f1314012900 nid=0x3a5556 runnable  

"GC Thread#17" os_prio=0 cpu=0.17ms elapsed=757.34s tid=0x00007f1314013380 nid=0x3a5557 runnable  

"GC Thread#18" os_prio=0 cpu=0.17ms elapsed=757.34s tid=0x00007f1314013e00 nid=0x3a5558 runnable  

"GC Thread#19" os_prio=0 cpu=0.93ms elapsed=757.34s tid=0x00007f1314014880 nid=0x3a5559 runnable  

"GC Thread#20" os_prio=0 cpu=0.16ms elapsed=757.34s tid=0x00007f1314015300 nid=0x3a555a runnable  

"GC Thread#21" os_prio=0 cpu=0.82ms elapsed=757.34s tid=0x00007f1314015d80 nid=0x3a555b runnable  

"GC Thread#22" os_prio=0 cpu=0.13ms elapsed=757.34s tid=0x00007f1314016800 nid=0x3a555c runnable  

"G1 Main Marker" os_prio=0 cpu=0.03ms elapsed=757.67s tid=0x00007f13bc06be00 nid=0x3a552a runnable  

"G1 Conc#0" os_prio=0 cpu=0.02ms elapsed=757.67s tid=0x00007f13bc06cd90 nid=0x3a552b runnable  

"G1 Refine#0" os_prio=0 cpu=0.04ms elapsed=757.67s tid=0x00007f13bc1dccf0 nid=0x3a552c runnable  

"G1 Service" os_prio=0 cpu=36.06ms elapsed=757.67s tid=0x00007f13bc1ddbf0 nid=0x3a552d runnable  

"VM Periodic Task Thread" os_prio=0 cpu=103.34ms elapsed=757.66s tid=0x00007f13bc22be20 nid=0x3a5538 waiting on condition  

JNI global refs: 24, weak refs: 0

https://github.com/openjdk/jdk/blob/30645f3309c040deb5bef71b1bd349942b4aa076/src/jdk.jcmd/share/classes/sun/tools/jstack/JStack.java#L130

https://github.com/openjdk/jdk/blob/30645f3309c040deb5bef71b1bd349942b4aa076/src/jdk.attach/share/classes/sun/tools/attach/HotSpotVirtualMachine.java#L281

https://github.com/openjdk/jdk/blob/30645f3309c040deb5bef71b1bd349942b4aa076/src/jdk.attach/linux/classes/sun/tools/attach/VirtualMachineImpl.java#L142 (Attach API, Unix sockets)

The Attach API command is handled on the other side by the VM here: https://github.com/openjdk/jdk/blob/6fd043f1e4423b61cb5b85af9380f75e6a3846a2/src/hotspot/share/services/attachListener.cpp#L178

https://github.com/openjdk/jdk/blob/6fd043f1e4423b61cb5b85af9380f75e6a3846a2/src/hotspot/share/services/diagnosticCommand.hpp#L425

andrewazores (Member) commented Sep 10, 2024

Another roadblock:

https://docs.oracle.com/en/java/javase/17/docs/api/jdk.management/com/sun/management/HotSpotDiagnosticMXBean.html

https://docs.oracle.com/en/java/javase/21/docs/api/jdk.management/com/sun/management/HotSpotDiagnosticMXBean.html

The dumpThreads operation is also only present on the HotSpotDiagnosticMXBean since JDK 21, which likely covers only a very small proportion of the end users' deployed JVMs that might need this kind of analysis done. So we cannot, and should not, rely on the MXBean route for implementing this retrieval.
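For reference, a hedged sketch of that JDK 21+ MXBean call (e.g. from jshell; the output path here is arbitrary, and the target file must not already exist):

import java.lang.management.ManagementFactory;
import com.sun.management.HotSpotDiagnosticMXBean;

var diag = ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class);
// JDK 21+ only; format options are TEXT_PLAIN or JSON
diag.dumpThreads("/tmp/threads.txt", HotSpotDiagnosticMXBean.ThreadDumpFormat.TEXT_PLAIN);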

Worse, using a simple JDK 21 jshell to get the diagnostic MXBean to perform a thread dump to file, I get this result:

3840154
2024-09-10T19:57:54.261421187Z
21.0.4+7

#1 "main"
      java.base/java.lang.Thread.getStackTrace(Thread.java:2450)
      java.base/jdk.internal.vm.ThreadDumper.dumpThread(ThreadDumper.java:165)
      java.base/jdk.internal.vm.ThreadDumper.lambda$dumpThreads$0(ThreadDumper.java:158)
      java.base/java.util.stream.ReferencePipeline$2$1.accept(ReferencePipeline.java:179)
      java.base/java.util.Spliterators$ArraySpliterator.forEachRemaining(Spliterators.java:1024)
      java.base/java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:509)
      java.base/java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:499)
      java.base/java.util.stream.StreamSpliterators$WrappingSpliterator.forEachRemaining(StreamSpliterators.java:310)
      java.base/java.util.stream.Streams$ConcatSpliterator.forEachRemaining(Streams.java:734)
      java.base/java.util.stream.ReferencePipeline$Head.forEach(ReferencePipeline.java:762)
      java.base/jdk.internal.vm.ThreadDumper.dumpThreads(ThreadDumper.java:158)
      java.base/jdk.internal.vm.ThreadDumper.dumpThreads(ThreadDumper.java:154)
      java.base/jdk.internal.vm.ThreadDumper.dumpThreads(ThreadDumper.java:140)
      jdk.management/com.sun.management.internal.HotSpotDiagnostic.dumpThreads(HotSpotDiagnostic.java:196)
      jdk.management/com.sun.management.internal.HotSpotDiagnostic.lambda$dumpThreads$1(HotSpotDiagnostic.java:178)
      java.base/java.security.AccessController.doPrivileged(AccessController.java:571)
      jdk.management/com.sun.management.internal.HotSpotDiagnostic.dumpThreads(HotSpotDiagnostic.java:182)
      REPL.$JShell$18.do_it$($JShell$18.java:5)
      java.base/jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:103)
      java.base/java.lang.reflect.Method.invoke(Method.java:580)
      jdk.jshell/jdk.jshell.execution.DirectExecutionControl.invoke(DirectExecutionControl.java:227)
      jdk.jshell/jdk.jshell.execution.RemoteExecutionControl.invoke(RemoteExecutionControl.java:121)
      jdk.jshell/jdk.jshell.execution.DirectExecutionControl.invoke(DirectExecutionControl.java:125)
      jdk.jshell/jdk.jshell.execution.ExecutionControlForwarder.processCommand(ExecutionControlForwarder.java:148)
      jdk.jshell/jdk.jshell.execution.ExecutionControlForwarder.commandLoop(ExecutionControlForwarder.java:266)
      jdk.jshell/jdk.jshell.execution.Util.forwardExecutionControl(Util.java:78)
      jdk.jshell/jdk.jshell.execution.Util.forwardExecutionControlAndIO(Util.java:148)
      jdk.jshell/jdk.jshell.execution.RemoteExecutionControl.main(RemoteExecutionControl.java:74)

#9 "Reference Handler"
      java.base/java.lang.ref.Reference.waitForReferencePendingList(Native Method)
      java.base/java.lang.ref.Reference.processPendingReferences(Reference.java:246)
      java.base/java.lang.ref.Reference$ReferenceHandler.run(Reference.java:208)

#10 "Finalizer"
      java.base/java.lang.Object.wait0(Native Method)
      java.base/java.lang.Object.wait(Object.java:366)
      java.base/java.lang.Object.wait(Object.java:339)
      java.base/java.lang.ref.NativeReferenceQueue.await(NativeReferenceQueue.java:48)
      java.base/java.lang.ref.ReferenceQueue.remove0(ReferenceQueue.java:158)
      java.base/java.lang.ref.NativeReferenceQueue.remove(NativeReferenceQueue.java:89)
      java.base/java.lang.ref.Finalizer$FinalizerThread.run(Finalizer.java:173)

#11 "Signal Dispatcher"

#32 "Notification Thread"

#33 "Common-Cleaner"
      java.base/jdk.internal.misc.Unsafe.park(Native Method)
      java.base/java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:269)
      java.base/java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:1847)
      java.base/java.lang.ref.ReferenceQueue.await(ReferenceQueue.java:71)
      java.base/java.lang.ref.ReferenceQueue.remove0(ReferenceQueue.java:143)
      java.base/java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:218)
      java.base/jdk.internal.ref.CleanerImpl.run(CleanerImpl.java:140)
      java.base/java.lang.Thread.run(Thread.java:1583)
      java.base/jdk.internal.misc.InnocuousThread.run(InnocuousThread.java:186)

#34 "output reader"
      java.base/sun.nio.ch.SocketDispatcher.read0(Native Method)
      java.base/sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:47)
      java.base/sun.nio.ch.NioSocketImpl.tryRead(NioSocketImpl.java:256)
      java.base/sun.nio.ch.NioSocketImpl.implRead(NioSocketImpl.java:307)
      java.base/sun.nio.ch.NioSocketImpl.read(NioSocketImpl.java:346)
      java.base/sun.nio.ch.NioSocketImpl$1.read(NioSocketImpl.java:796)
      java.base/java.net.Socket$SocketInputStream.read(Socket.java:1099)
      java.base/java.net.Socket$SocketInputStream.read(Socket.java:1093)
      java.base/java.io.FilterInputStream.read(FilterInputStream.java:71)
      jdk.jshell/jdk.jshell.execution.DemultiplexInput.run(DemultiplexInput.java:60)

This is again different from the jstack format, and the yatda tool doesn't know how to process it.

So in the end, it seems that the only way to fulfill the support team's original request here is to hook into the diagnostic command and ask the VM to perform a thread dump to file, either by sending a Unix signal or by using the Attach API. This means that we can only support this exact format if the user's target JVM is already instrumented with the Cryostat Agent. For all other scenarios, the best we can reliably do across various JVM versions is probably to return that ThreadInfo[] from the ThreadMXBean.
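For reference, the Unix-signal route is SIGQUIT: HotSpot responds by printing this same thread dump format to its own stdout, so it only helps if the container's stdout/logs are accessible.

$ kill -3 <pid>   # SIGQUIT; the dump goes to the target JVM's stdout, not the caller's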

andrewazores (Member) commented:
Removing @Josh-Matsuoka's assignment for now; we will need to have some more discussions about this feature to decide the scope and see what, if anything, we want to include in 4.0.

andrewazores (Member) commented:
So in the end, it seems that the only way to fulfill the support team's original request here is to hook into the diagnostic command and ask the VM to perform a thread dump to file, either by sending a Unix signal or by using the Attach API.

There is one more way, and it's probably the best option:

https://docs.oracle.com/en/java/javase/17/docs/api/jdk.management/com/sun/management/DiagnosticCommandMBean.html

We could therefore invoke dcmds remotely over JMX, including requesting the JVM to perform a thread dump (or a #134 heap dump). However, this still leaves the dump file stuck local to the target JVM's filesystem, with no way for us to retrieve it for the user. The user might be able to kubectl cp or kubectl rsync it out, but this requires the target's container image to contain tar, which it may not.
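A rough sketch of that JMX route (assumptions: the MBean is registered under com.sun.management:type=DiagnosticCommand, each dcmd maps to a camel-cased operation, and Thread.print in particular returns its output directly as a String, while file-writing dcmds like GC.heap_dump leave their output on the target's filesystem; the endpoint URL is hypothetical):

import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

public class RemoteThreadPrint {
    public static void main(String[] args) throws Exception {
        JMXServiceURL url = new JMXServiceURL("service:jmx:rmi:///jndi/rmi://target-host:9091/jmxrmi");
        try (JMXConnector connector = JMXConnectorFactory.connect(url)) {
            MBeanServerConnection server = connector.getMBeanServerConnection();
            ObjectName dcmd = new ObjectName("com.sun.management:type=DiagnosticCommand");
            // "Thread.print" dcmd -> "threadPrint" operation; the String[] holds dcmd flags
            String dump = (String) server.invoke(
                    dcmd,
                    "threadPrint",
                    new Object[] { new String[] { "-l" } },
                    new String[] { String[].class.getName() });
            System.out.println(dump);
        }
    }
}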

So we're still likely only able to implement something reasonably useful if it's designed to rely upon our Agent being present. In that case, the Agent can receive the DiagnosticCommandMBean invocation request (the framework for this is already laid since #133), ask its own VM to dump to a file, and then send that file back to the Cryostat server.

This way, no tar is required in the target container. Since our Agent can dynamically attach, and is also published as an "init container", it could potentially be used as an ephemeral container for ad-hoc injection into target application containers that are missing jcmd, jstack, tar, and other utilities.
