Loom: jdk19 OpenJDK Invalid JIT return address ASSERTION FAILED swalk.c:1601 #15939
Comments
With Loom support now in, are these still an issue?
@fengxue-IS Is this test passing now?
@fengxue-IS any update on this?
This test passed with -Xint, failed when JIT is enabled due to
I am working on triaging this and may need help from the JIT team to understand the cause.
I was able to get a core locally by directly compiling the test class. The invalid JIT return address comes from the GC walking JIT stacks on the carrier thread after the continuation has yielded.
The Java stack of thread 29 in DDR:
Update: When testing with
It looks like a similar issue to #16212.
@nbhuiyan: please investigate this from a JIT perspective.
The original timeout no longer occurs. Currently, there are two failures associated with this test:
I discussed the JIT invalid return address issue in the PingPong test with @fengxue-IS. Jack reported that running with
I have obtained a JIT verbose log from a failing run, and it contained some of the methods seen in the stack trace from investigating a core dump. I was trying to narrow down the smallest set of methods that need to be compiled for the invalid return address to occur, so that I could focus on what happens during the compilation of those methods. However, using a limit file to try various subsets of methods hides the problem, so I have yet to find a way to make progress there. I have also tried the GC race condition fix in #16290 and still saw the problem.
@nbhuiyan Do you think this will be resolved within 2 weeks?
It's possible, depending on what the actual cause is. Examining the JIT logs of the methods leading up to the frame where the invalid JIT return address occurs during a failed run does not show anything going wrong there, so I am starting to suspect that the problem may not be JIT-related.
Some updates on this investigation:
Segfault case, where we end up in the invalid JIT code cache region:
JIT invalid return address case, where we fail during stackwalk:
Looking at how we end up with the invalid JIT return address:
The stack walker is triggered through
In the J9VMThread, we see the JIT return address set as:
While mounting a virtual thread, we copy the J9VMContinuation fields into the J9VMThread and store the carrier thread fields in the J9VMContinuation. While unmounting a virtual thread, we restore the carrier thread state by swapping the values back into the J9VMThread and caching the virtual-thread-specific details in the J9VMContinuation. J9VMContinuation contains fields that are thread-specific and cannot be shared between a virtual thread and its carrier thread. We may need to cache jitReturnAddress and other similar JIT fields in J9VMContinuation and swap them during virtual thread mount and unmount; currently, jitReturnAddress is not part of the swap. See the code associated with the function below for these operations. openj9/runtime/vm/ContinuationHelpers.hpp Lines 51 to 84 in f21a361
Ah, I see now: even in the helpers, the value would be stored into a resolve frame before any potential calls into Java could occur. You could try swapping it to see if the assertion goes away, but I do not believe it will have any effect.
@nbhuiyan: from the crash where you extracted the info above, can you prove whether the
I think prototyping adding a field to the
@nbhuiyan do you have an update?
@tajila Yes, I have prototyped adding a jitReturnAddress field to the J9VMContinuation struct and using it to swap the JIT return address during virtual thread mounting and unmounting. I confirmed that the J9VMThread's jitReturnAddress field pointed to a valid address; however, the
This is what I expected. |
Using -verbose:stackwalk, I was able to track down what the invalid address belonged to; it was the PC for method
Here is part of the list of compiled methods obtained from a core dump using jdmpview:
There are similar methods with more than one entry in the list, such as
The warm compilation of
So for the crash, we have
and the JIT return address is immediately after the JIT-ed method address range of the first entry of topLevelExec. In the core dump, that address was invalid, but within the range of the JIT code cache. The stack walk results in an error due to this invalid address in 00007FBA7AA4C6B0, which is the walkState object. Here is the info we get using KCA on the J9Method:
As you can see, the compiled method start is the same value as the compiled method start of the second entry of topLevelExec reported by jdmpview. So the walkState object needed to be updated with the new address for topLevelExec, but that is not happening.
Local testing with #16374 against the Skynet.java test: not sure whether the crash in balanced GC is intermittent; I have not seen it with other GC policies (or I haven't run it enough times to trigger it). Updated commit: dfebbcd
How shall we proceed with this? I have a fix that implements the stack walk in place in
Alternatively, if your universal stack walk feature will be available soon, I can incorporate that instead. However, given where we are in the release cycle, I would prefer the first option because it is less disruptive in terms of code changes. I'm worried about introducing subtle problems that will be hard to track down late in the release.
@fengxue-IS We need a definitive answer on whether all continuations can currently be found via the vthread list. |
For JDK19 this is the case. |
OK, so the current solution will work. In the future, the code really belongs in a VM helper, but that's complicated by realtime (which is basically the antithesis of Loom anyway).
I've renamed the issue to more accurately represent the problem.
https://openj9-jenkins.osuosl.org/job/Test_openjdk19_j9_sanity.openjdk_ppc64_aix_Nightly/63/
Just FYI that my workaround fix also crashed with balanced GC with the same backtrace as Jack's PR. |
https://openj9-jenkins.osuosl.org/job/Test_openjdk19_j9_sanity.openjdk_s390x_linux_Nightly/67
https://openj9-jenkins.osuosl.org/job/Test_openjdk19_j9_sanity.openjdk_x86-64_linux_Nightly/68 No diagnostics, process was terminated.
https://openj9-jenkins.osuosl.org/job/Test_openjdk19_j9_sanity.openjdk_s390x_linux_Nightly/71
Not sure if this is related but I'll add it here for now. https://openj9-jenkins.osuosl.org/job/Test_openjdk19_j9_sanity.openjdk_ppc64_aix_Nightly/72
JDK19 internal build(
https://openj9-jenkins.osuosl.org/job/Test_openjdk19_j9_sanity.openjdk_s390x_linux_Nightly/88
https://openj9-jenkins.osuosl.org/job/Test_openjdk19_j9_sanity.openjdk_s390x_linux_Nightly/89
https://openj9-jenkins.osuosl.org/job/Test_openjdk19_j9_sanity.openjdk_s390x_linux_Nightly/90
JDK19 internal build(
https://openj9-jenkins.osuosl.org/job/Test_openjdk19_j9_sanity.openjdk_ppc64_aix_Nightly/93
https://openj9-jenkins.osuosl.org/job/Test_openjdk19_j9_sanity.openjdk_s390x_linux_Nightly/96
https://openj9-jenkins.osuosl.org/job/Test_openjdk19_j9_sanity.openjdk_ppc64_aix_Nightly/98
Tests excluded due to eclipse-openj9/openj9#15939 are fixed by eclipse-openj9/openj9#16374. Signed-off-by: Jack Lu <Jack.S.Lu@ibm.com>
Closing as the issue is addressed.
Likely due to lack of JIT support.