[Epic] Capture thread dumps along with thread-level CPU usage #135
Comments
Instead of using jstack or top, you could use either JFR or JMX to gather this info. JFR:
These events are a good solution since Cryostat already captures this data. Nothing extra needs to be done; you would just have to inspect the JFR recordings. JMX:
For convenience, it might be helpful if Cryostat had a dedicated, automatically updated view for this data, perhaps similar to the other JMX dashboard widgets.
Side note: giving Cryostat the ability to use jstack or jcmd to gather thread dumps or CPU usage is probably not a good approach. Both tools rely on the Attach API. On POSIX systems, this uses Unix domain sockets, so you would have to set up the sockets in a shared volume that both the client and server containers can access. That may not work out of the box with the standard attach flow, and if the containers are on different hosts, it won't work at all.
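As a rough illustration of the JMX route mentioned above, here is a minimal sketch that reads cumulative per-thread CPU time from the standard `ThreadMXBean`. The class name `ThreadCpuSnapshot` and the `name#id` key format are made up for this example, and it samples the local JVM only; it is not how Cryostat itself gathers the data.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of Cryostat.
public class ThreadCpuSnapshot {

    /** Maps "threadName#id" to cumulative CPU time in nanoseconds for live threads. */
    static Map<String, Long> sample() {
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        Map<String, Long> cpuNanos = new LinkedHashMap<>();
        // Per-thread CPU time is optional and may be switched off in the VM.
        if (!threads.isThreadCpuTimeSupported() || !threads.isThreadCpuTimeEnabled()) {
            return cpuNanos;
        }
        for (long id : threads.getAllThreadIds()) {
            long nanos = threads.getThreadCpuTime(id);   // -1 if the thread is no longer alive
            ThreadInfo info = threads.getThreadInfo(id); // null if the thread is no longer alive
            if (nanos >= 0 && info != null) {
                cpuNanos.put(info.getThreadName() + "#" + id, nanos);
            }
        }
        return cpuNanos;
    }

    public static void main(String[] args) {
        sample().forEach((name, nanos) ->
                System.out.printf("%-40s %10.3f ms%n", name, nanos / 1_000_000.0));
    }
}
```

The values are cumulative, so a usage percentage needs two samples and the elapsed wall-clock time between them. For a remote target, the same interface can be obtained over a JMX connection with `ManagementFactory.newPlatformMXBeanProxy`.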
I was thinking that this would be done using the Cryostat Agent, so there would be no need to do it over the network or over a shared volume. But if @lkonno and her team can get the information they need out of existing JFR events, then that's even better!
We do have an overall CPU usage panel on the Grafana dashboard that simply displays high-level CPU time series stats:
I don't think we have a good way right now to render the thread dumps (stack traces), but given the JFR file, it may be reasonable to just use JMC to open the file and inspect it. For
Oh yes, issuing the commands from the agent is a good idea! Then you wouldn't really have to worry about the limitations of Unix sockets. Maybe that's another option, if the existing JFR events (or JMX MXBeans) aren't good enough.
Yeah, that seems good too. I initially wasn't sure how you'd represent all the threads, because the cardinality may be very high. But if there's already a way of representing many object types, a similar approach would probably make sense here too.
The visualization for many object types certainly leaves something to be desired, since the set of allocation types is fairly high-cardinality too. As the screenshots show, the aggregate panel can become quite noisy, although it does help with identifying spikes of particular allocation types at particular times, and may help with identifying correlations. The breakout dashboard puts each allocation type on an independent chart (with linked x-axis domains for zoom control) and independent y-axes, which helps to identify trends in particular allocation types rather than comparing types to each other. I imagine something similar might make sense for the thread visualization. Since it's based on percentage load, we know we can fix the y-axis range to [0, 1], which is nice.
That's a good point. Having the aggregate panel to help identify correlations probably makes sense for threads as well as allocations. Yeah, and having the CPU loads already normalized is convenient too. I'm not sure whether Cryostat already has this feature, but the ability to filter out allocations below a specified size threshold might help with noise in the aggregated chart. I imagine there are probably a lot of smaller allocations, but in most cases you care more about trends between object types that have large or sustained allocations. This could also allow some graph re-scaling, spacing out the "important" lines more and making them easier to differentiate.
I think the jfr-datasource can already support a query like that, but we don't currently have any controls on the panel to add that kind of condition to the query. It's a good idea.
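To make the threshold idea concrete, here is a minimal sketch of the filtering step on its own, independent of jfr-datasource or any panel query syntax (neither of which is shown in this thread). The class name, method name, and sample figures are all invented for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper: drops series whose total falls below a cutoff.
public class SeriesThresholdFilter {

    /** Keeps only entries whose total allocated bytes are at least minBytes. */
    static Map<String, Long> keepAtLeast(Map<String, Long> totalsByType, long minBytes) {
        Map<String, Long> kept = new LinkedHashMap<>();
        totalsByType.forEach((type, bytes) -> {
            if (bytes >= minBytes) {
                kept.put(type, bytes);
            }
        });
        return kept;
    }

    public static void main(String[] args) {
        // Made-up totals, purely to show the effect of the cutoff.
        Map<String, Long> totals = new LinkedHashMap<>();
        totals.put("byte[]", 48_000_000L);
        totals.put("java.lang.String", 12_500_000L);
        totals.put("java.lang.Integer", 40_000L);
        System.out.println(keepAtLeast(totals, 1_000_000L));
    }
}
```

With fewer series left, the y-axis can be re-scaled to the remaining range, which is the re-scaling benefit described above.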
The specific thread dump format that the support team receives from other tooling must be supported here, at least as the foundation of this feature. This is the format output by tools like
I think this probably matches the file output format of the HotSpotDiagnosticMXBean: The support team uses other scripts and tooling which ingest this format, such as https://github.com/aogburn/yatda . The ThreadMXBean#dumpAllThreads format is somewhat different by default - here is the output you get from simply doing
Notably, the format of each thread entry is somewhat different and details like the thread CPU usage are missing. The "header" and "footer" sections of the full thread dump file are also missing. header:
footer:
So, being able to capture the I think from here we have several different features that could be implemented:
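For reference, the `ThreadMXBean#dumpAllThreads` route discussed above can be reproduced with a minimal sketch like the following. The class name `MxBeanThreadDump` is made up, and the original snippet from this comment is not preserved here, so this may differ from what was actually run.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

// Hypothetical helper showing the default ThreadInfo text format.
public class MxBeanThreadDump {

    /** Concatenates ThreadInfo#toString for every live thread in this JVM. */
    static String dump() {
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        // Only request lock details that this VM can actually report.
        boolean monitors = threads.isObjectMonitorUsageSupported();
        boolean synchronizers = threads.isSynchronizerUsageSupported();
        StringBuilder out = new StringBuilder();
        for (ThreadInfo info : threads.dumpAllThreads(monitors, synchronizers)) {
            out.append(info); // per-thread text; no file-level header or footer
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.print(dump());
    }
}
```

The output is a sequence of per-thread entries only, which is consistent with the observation above that the header, footer, and per-thread CPU details are missing. `ThreadInfo#toString` also prints only a limited number of stack frames per thread, so a tool that needs full stacks would have to format `ThreadInfo#getStackTrace` itself.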
It looks like
https://github.com/openjdk/jdk/blob/30645f3309c040deb5bef71b1bd349942b4aa076/src/jdk.attach/linux/classes/sun/tools/attach/VirtualMachineImpl.java#L142 (Attach API, Unix sockets)
The Attach API command is handled on the other side by the VM here: https://github.com/openjdk/jdk/blob/6fd043f1e4423b61cb5b85af9380f75e6a3846a2/src/hotspot/share/services/attachListener.cpp#L178
Another roadblock: The Worse, using a simple JDK 21
This is again different from the
So in the end, it seems that the only way to fulfill the support team's original request here is to hook into the diagnostic command and ask the VM to perform a thread dump to file, either by sending a Unix signal or by using the Attach API. This means that we can only support this exact format if the user's target JVM is already instrumented with the Cryostat Agent. For all other scenarios, the best we can reliably do across various JVM versions is probably to return that
Removing @Josh-Matsuoka's assignment for now. We will need some more discussion about this feature to decide the scope and see what, if anything, we want to include in 4.0.
There is one more way, and it's probably the best option:
We could therefore invoke dcmds remotely over JMX, including requesting the JVM to perform a thread dump (or #134 heap dump); however, this still leaves the dump file stuck on the target JVM's local filesystem with no way for us to retrieve it for the user. The user might be able to
So we're still likely only able to implement something reasonably useful if it's designed to rely on our Agent being present. In that case, the Agent can receive the DiagnosticCommandMBean invocation request (the framework for this has been in place since #133), ask its own VM to dump to a file, and then send that file back to the Cryostat server. This way, no
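A minimal sketch of invoking a diagnostic command through the DiagnosticCommand MBean follows. It assumes a HotSpot JVM, where the MBean is registered as `com.sun.management:type=DiagnosticCommand` and the `Thread.print` dcmd is exposed as the `threadPrint` operation. The class and method names are made up, and the example talks to the local platform MBean server rather than a remote target or the Cryostat Agent.

```java
import java.lang.management.ManagementFactory;
import javax.management.MBeanServerConnection;
import javax.management.ObjectName;

// Hypothetical helper; not the Cryostat or Cryostat Agent implementation.
public class DcmdThreadPrint {

    /** Invokes the Thread.print dcmd over the given connection and returns its text output. */
    static String threadPrint(MBeanServerConnection connection, String... dcmdArgs) {
        try {
            ObjectName dcmd = new ObjectName("com.sun.management:type=DiagnosticCommand");
            // Every dcmd operation takes a single String[] of command-line style arguments.
            Object result = connection.invoke(
                    dcmd,
                    "threadPrint",
                    new Object[] {dcmdArgs},
                    new String[] {String[].class.getName()});
            return (String) result;
        } catch (Exception e) {
            throw new IllegalStateException("DiagnosticCommand invocation failed", e);
        }
    }

    public static void main(String[] args) {
        // Local MBean server for illustration; a remote JMX connection has the same interface.
        System.out.println(threadPrint(ManagementFactory.getPlatformMBeanServer()));
    }
}
```

Here the dump comes back as the return value of the JMX call rather than as a file on the target's filesystem. Whether this text matches the exact format the support team's tooling expects is not established in this thread and would need to be checked against a real dump.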
Describe the feature
When investigating high CPU usage by a Java process, thread-level CPU usage (from the top command) is captured along with thread dumps (jstack).
It would be helpful to have an easier way to capture this remotely than executing a script inside the pod.
Any other information?
No response