3.11.0 causes carrier thread pinning regression (compared to 3.10.2) while loading classes #40917
I can only hypothesize, but my hypothesis is that perhaps the JVM has an internal lock (or, more likely, an invisible native call frame) which is being acquired in between the
Maybe @franz1981 has an idea? |
I'll try and think of some way to prove this hypothesis... |
Looking into the JDK source, it appears that the stack walker used to print the pinned stack trace does not include the |
we do have a class loader declared as thread-unsafe (not parallel-capable), so maybe it is the synchronization performed by the JVM itself that is not well-behaved here...? |
/cc @cescoffier (virtual-threads), @ozangunalp (virtual-threads) |
I don't think that does much more than register the classloader's class into a map, which in turn determines which object is being locked. I think that we may want to revisit the idea of blocking inside the class loader (in this case, we are using a shared lock on the resource map). I believe we should change our assumptions: because class loading may happen in a virtual thread, we cannot block (wait) inside of a class loading operation. (Yes, I realize that loading a resource may be a blocking operation, but it is generally not a waiting operation.) |
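For context, here is a minimal sketch of the JDK mechanism being referred to, using the standard java.lang.ClassLoader API (the loadBytesFor helper is hypothetical): registering a loader as parallel-capable is what makes getClassLoadingLock() hand out a per-class-name lock object instead of locking the loader instance itself.

```java
import java.io.IOException;
import java.io.InputStream;

// Minimal sketch of a parallel-capable class loader. The static
// registerAsParallelCapable() call records this class in an internal map;
// getClassLoadingLock(name) then returns a per-class-name lock object
// instead of the loader instance itself, so unrelated classes can be
// loaded concurrently.
public class SketchClassLoader extends ClassLoader {

    static {
        registerAsParallelCapable();
    }

    @Override
    protected Class<?> findClass(String name) throws ClassNotFoundException {
        byte[] bytes = loadBytesFor(name);
        return defineClass(name, bytes, 0, bytes.length);
    }

    // Hypothetical helper: reads the class bytes from this loader's resources.
    private byte[] loadBytesFor(String name) throws ClassNotFoundException {
        String path = name.replace('.', '/') + ".class";
        try (InputStream in = getResourceAsStream(path)) {
            if (in == null) {
                throw new ClassNotFoundException(name);
            }
            return in.readAllBytes();
        } catch (IOException e) {
            throw new ClassNotFoundException(name, e);
        }
    }
}
```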
yeah I meant that my previous statement about not being parallel capable was very wrong :P |
re:
I first need to understand what "pinning" means here, looking at https://github.com/openjdk/jdk/blob/jdk-21%2B35/src/java.base/share/classes/jdk/internal/vm/Continuation.java#L385 to match what the user is doing. |
@dmlloyd I think that what the JDK does to understand whether pinning is happening is to check, on a (failed) preemption, the context of the frame; see https://github.com/openjdk/jdk/blob/jdk-21%2B35/src/java.base/share/classes/jdk/internal/vm/Continuation.java#L69-L71 So it would be "nice" to have the detailed reason included in the pinned output... |
Right, I can only infer the possibility of a hidden native frame based on what's happening at the time, but we don't really know without either the pinned reason or full stack trace. |
Exactly, and I've found this -> https://github.com/openjdk/jdk/blob/5abc02927b480a85fadecf8d03850604510276e4/src/hotspot/share/runtime/continuationFreezeThaw.cpp#L205-L207 I could debug it via gdb, in case (poor me .-. very rusty on this) |
It might be simpler to patch the JDK |
The pinned thread printer could easily be enhanced to also print the pinning reason as well. I wonder if it is worth opening a bug and a PR? |
Is there a JFR event that we could use to find the reason for the pinning? |
@geoand https://sap.github.io/SapMachine/jfrevents/21.html#virtualthreadpinned The interesting bit is that you can configure a threshold to check whether it is a temporary pinning or not; e.g. read/write reentrant locks in the past sometimes used to spin a bit, causing them to actually "pin" the thread. IDK if this is still valid. |
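To make that concrete, here is a minimal sketch of watching that event in-process with the JFR streaming API (the event name and withThreshold are standard JFR; the 20 ms threshold is just an illustrative value):

```java
import java.time.Duration;
import jdk.jfr.consumer.RecordingStream;

public class PinnedThreadMonitor {
    public static void main(String[] args) throws InterruptedException {
        try (RecordingStream rs = new RecordingStream()) {
            // The threshold filters out short, "temporary" pinnings (e.g. brief
            // lock spinning) and only reports pins longer than 20 ms.
            rs.enable("jdk.VirtualThreadPinned").withThreshold(Duration.ofMillis(20));
            rs.onEvent("jdk.VirtualThreadPinned", event -> {
                System.out.println("virtual thread pinned for " + event.getDuration());
                System.out.println(event.getStackTrace());
            });
            rs.startAsync();
            Thread.sleep(60_000); // observe the application for a minute
        }
    }
}
```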
@imperatorx what is printed with jdk.tracePinnedThreads=full? |
full pin trace:
JFR stack frames printed:
|
When I run it with the Java 23 Loom early-access JDK, the JFR event has an additional message:
|
It looks to me like a reentrancy case (because I expect no other threads are holding the write lock), likely due to a write lock already being held and a read lock unable to be acquired (hence leading to a park). |
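A minimal sketch of the suspected shape of the problem (all names made up): a virtual thread parking on a read lock while the write lock is held elsewhere is harmless on its own, but becomes a carrier-level problem if the park happens under a frame that prevents unmounting.

```java
import java.util.concurrent.locks.ReentrantReadWriteLock;

public class ReadLockParkDemo {
    public static void main(String[] args) throws InterruptedException {
        ReentrantReadWriteLock rwl = new ReentrantReadWriteLock();

        rwl.writeLock().lock(); // main thread holds the write lock

        Thread vt = Thread.ofVirtual().start(() -> {
            // Parks here until the write lock is released. If this frame
            // were pinned (e.g. below a hidden native class-loading frame),
            // the carrier thread would stay blocked instead of being freed.
            rwl.readLock().lock();
            try {
                System.out.println("read lock acquired");
            } finally {
                rwl.readLock().unlock();
            }
        });

        Thread.sleep(100);      // let the virtual thread reach the park
        rwl.writeLock().unlock();
        vt.join();
    }
}
```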
I've tried to reproduce this on 3.11.0 with a hello world endpoint:

```java
@Path("/hello")
public class GreetingResource {

    private static class Fun {
    }

    @GET
    @Produces(MediaType.TEXT_PLAIN)
    public String hello() {
        var cl = Thread.currentThread()
                .getContextClassLoader();
        System.out.println(cl.getClass());
        try (var exec = Executors.newVirtualThreadPerTaskExecutor()) {
            for (int i = 0; i < 2; i++) {
                exec.submit(() -> {
                    try {
                        cl.loadClass("profiling.workshop.greeting.GreetingResource$Fun");
                        System.out.println("Loaded");
                    } catch (Exception e) {
                        e.printStackTrace();
                    }
                });
            }
        }
        return "hello";
    }
}
```

on

Java(TM) SE Runtime Environment (build 21+35-LTS-2513)
Java HotSpot(TM) 64-Bit Server VM (build 21+35-LTS-2513, mixed mode, sharing)

without much luck, hitting curl http://localhost:8080/hello a few times, both running the Quarkus application with
@imperatorx Can you share the full reproducer? To check whether it is any different from what I've done...? |
aha, found why: class loading with inner classes is not the same as with top-level application classes .-. dumb me :D That said, even switching to

```java
@Path("/hello")
public class GreetingResource {

    @GET
    @Produces(MediaType.TEXT_PLAIN)
    public String hello() {
        var cl = Thread.currentThread()
                .getContextClassLoader();
        System.out.println(cl.getClass());
        try (var exec = Executors.newVirtualThreadPerTaskExecutor()) {
            for (int i = 0; i < 2; i++) {
                exec.submit(() -> {
                    try {
                        cl.loadClass("profiling.workshop.greeting.Fun");
                    } catch (Exception e) {
                        e.printStackTrace();
                    }
                });
            }
        }
        return "hello";
    }
}
```

or increasing the iteration count to 10_000, I still couldn't cause the deadlock. |
The deadlock happens in the actual application with more complex stuff going on; the reproducer is just to demonstrate that the |
Got it: so @dmlloyd @Sanne @geoand from what I can see here:
I'm not quite sure about the deadlock here, but that's why I would like a reproducer that at least makes the deadlock happen: under heavy contention, lock/unlock can certainly cause some form of monopolization due to spinning. @imperatorx Is the reproducer you posted able to produce the pinning stack trace you reported? Because that's what I cannot reproduce... |
Yes, we need to further validate this idea and check the implications, but it is only the (optional) point 4 of the actions that I suggested, and I think that points 1 to 3 could be taken regardless of it.
@Sanne In reality this is for sure the main point of this whole discussion. The first thing that we need to address is figuring out whether we need a parallel classloader or not; see premise 1 of my former comment. I had a quick chat with @franz1981 this morning about it and he thinks that, regardless of the SmallRye fix for which the parallel classloader was introduced, there could be other relevant use cases where it could be necessary, or at least a nice to have. Can you please help us double-check this? If @franz1981 is right, and since @geoand's suggestion is a more long-term solution, the only alternative would be finalizing my pull request, making it as good as possible, and merging it. |
@dmlloyd this concerns me, and I believe you mentioned this in the past as well - I'm not aware of us getting into potential deadlocks with the existing code, but that's more likely due to me not having understood the scenario you have in mind. Could you describe a potential scenario in such a way that someone here might try to create some tests for it? FWIW, when I last reviewed the design (and it's been a while - I have not been able to look carefully at the current state), we only had reentrant locks protecting resource state changes; locks would not be held for longer than such state transitions and (I believe) not held during operations that could trigger loading of other classes - at least not from a different thread, which would have been necessary to get into trouble with the reentrant locks. |
the app gets blocked after we see the
or
or
we have to wait for the liveness probe to kill the pod so that processing can restart. that is harsh. |
This seems to be definitively a bug of the parallel
About the reproducer: it would be great if you could provide one, or at least give us some more hints on how to create one on our own, for instance by providing the full stack traces of the Quarkus application when the deadlock is detected. I will try to write a proper unit test for the parallel
As a final note, these days I'm working on an alternative lock-free implementation of the parallel |
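To illustrate the lock-free direction mentioned here, a rough sketch of the idea only, not the actual Quarkus RunnerClassLoader code (loadBytesFor is a hypothetical helper): no lock is held while the class bytes are loaded, and a thread that loses the define race simply discards its work and reuses the winner's class.

```java
import java.io.IOException;
import java.io.InputStream;

// Hypothetical sketch of a lock-free define: never hold a lock while the
// class bytes are being loaded; if two threads race to define the same
// class, the loser catches the LinkageError and reuses the winner's class.
public class LockFreeSketchLoader extends ClassLoader {

    static {
        registerAsParallelCapable();
    }

    @Override
    protected Class<?> findClass(String name) throws ClassNotFoundException {
        byte[] bytes = loadBytesFor(name); // may block on I/O, but holds no lock
        try {
            return defineClass(name, bytes, 0, bytes.length);
        } catch (LinkageError duplicate) {
            // Another thread defined this class first: discard our bytes
            // and return the already-defined class instead of failing.
            Class<?> winner = findLoadedClass(name);
            if (winner != null) {
                return winner;
            }
            throw duplicate;
        }
    }

    // Hypothetical helper: reads the class bytes from this loader's resources.
    private byte[] loadBytesFor(String name) throws ClassNotFoundException {
        try (InputStream in = getResourceAsStream(name.replace('.', '/') + ".class")) {
            if (in == null) {
                throw new ClassNotFoundException(name);
            }
            return in.readAllBytes();
        } catch (IOException e) {
            throw new ClassNotFoundException(name, e);
        }
    }
}
```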
not likely. when it happens it always is close to startup (10 seconds after the app has started). the app consumes messages at a decent rate (25 to 50 messages per second) with virtual thread max-concurrency: 5.
a thread dump you mean?
sure. no problem. |
Yes sorry, I meant a complete thread dump of the application when the deadlock happens. |
Thanks a lot for that thread dump. I must admit that it looks quite weird though, and I'm still trying to make sense of it. What I mean is that I see 53 threads saying
but at the same time I don't see any trace of that virtual thread
@vsevel for now I will continue my investigations, mostly trying to reproduce the same problem with a more isolated test case. Please let me know when you have a chance to give my pull request a try and how it works for you. |
same for me. I could not find the VT holding the lock. we created the dump by connecting the remote debugger from IntelliJ to our OCP cluster using port forwarding, and executing Run > Debugging Actions > Get Thread Dump. I do not know if this dump has limited visibility.
where is it? FYI I also created #41313 |
Sadly the thread dump analysis could be unfruitful due to how RW locks work; see https://github.com/openjdk/jdk/blob/master/src/java.base/share/classes/java/util/concurrent/locks/ReentrantReadWriteLock.java#L336 It stores the information of the first uncontended read-locking thread, but just for performance reasons, and there is no concept of "read ownership": a read lock has no exclusive ownership, while the write lock does.
|
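A small demonstration of that limitation using the standard ReentrantReadWriteLock API: the lock can report how many read holds exist, but there is no API for which threads hold them, so neither a thread dump nor the lock itself can name the reader.

```java
import java.util.concurrent.locks.ReentrantReadWriteLock;

public class ReadOwnershipDemo {
    public static void main(String[] args) throws InterruptedException {
        ReentrantReadWriteLock rwl = new ReentrantReadWriteLock();

        Thread reader = new Thread(() -> {
            rwl.readLock().lock();   // held, but attributed to no owner
            try {
                Thread.sleep(1_000);
            } catch (InterruptedException ignored) {
            } finally {
                rwl.readLock().unlock();
            }
        });
        reader.start();
        Thread.sleep(100); // let the reader acquire the read lock

        // We can count read holds, but no API tells us which threads hold
        // them, so a thread dump can't point at the culprit either.
        System.out.println("read holds: " + rwl.getReadLockCount()); // 1
        System.out.println("write locked: " + rwl.isWriteLocked());  // false
        reader.join();
    }
}
```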
I will spend some time these next 2 days |
Thanks for that link!
I assume Alan means from the JDK folks? |
Yep, but that also means that using reentrant locks while in a native call (i.e. |
After a call with @mariofusco we found how to trigger it reliably: there's no need (i.e. it is not seen as pinning by the JVM) to call |
That's a very interesting find... |
Any late linkage will cause it; for example, calling a method that returns a Foo will cause Foo to be loaded if it isn't already loaded, plus the corresponding hidden native frame. |
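A tiny illustration of late linkage (the classes are hypothetical): no loadClass call appears anywhere, yet the first execution of a method whose body references Foo makes the JVM resolve and load Foo from its own internal, native resolution code.

```java
public class LateLinkageDemo {

    static class Foo { } // loaded lazily, on first resolution

    // No explicit class loading here, but the first execution of this method
    // resolves the symbolic reference to Foo, and the JVM loads it from a
    // native resolution frame that is invisible in the pinned stack trace.
    static Foo makeFoo() {
        return new Foo();
    }

    public static void main(String[] args) {
        System.out.println("before first call");
        makeFoo(); // Foo is loaded here, inside the JVM's linkage machinery
        System.out.println("after first call");
    }
}
```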
Now the funny part is the deadlock: if waiting on a lock pins the carrier, we just need some recursive class definition that enters the locks in the right order while exhausting the number of carriers, and it's done :P |
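A made-up sketch of the exhaustion mechanics, with a known pinning case on JDK 21 (parking while holding a monitor) standing in for the pinned class-loading frame: once as many parks as there are carriers are pinned, the one task that would release them can never be scheduled.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.Executors;

// Made-up sketch: pin every carrier by parking inside a synchronized block
// (a known pinning case on JDK 21, standing in for a pinned class-loading
// frame). The task that would release the latch can then never run.
public class CarrierExhaustionSketch {
    public static void main(String[] args) {
        int carriers = Runtime.getRuntime().availableProcessors(); // default scheduler size
        CountDownLatch latch = new CountDownLatch(1);

        try (var exec = Executors.newVirtualThreadPerTaskExecutor()) {
            for (int i = 0; i < carriers; i++) {
                Object lock = new Object(); // one monitor per virtual thread
                exec.submit(() -> {
                    synchronized (lock) { // monitor held across the park...
                        try {
                            latch.await(); // ...so this park pins its carrier
                        } catch (InterruptedException ignored) {
                        }
                    }
                });
            }
            // Would unblock everyone, but all carriers are pinned: deadlock.
            exec.submit(latch::countDown);
        } // close() waits for the tasks and therefore hangs forever
    }
}
```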
we were able to spend some time on this issue. |
This is great news. Thanks for giving it a try.
Yes, this is not completely related even though still unexpected. Can you please paste the stack trace of the pinning using my classloader?
This is also unexpected, to be honest, or probably I'm missing something. If you reverted to |
I believe this is this issue #41314
let me check with the dev team. |
Ok, this makes sense now: it is a totally different pinning problem, completely unrelated to the classloader. |
short answer (before we re-test) is: I am definitely sure, because they always stayed on the LTS, and the behavior we were seeing is the stack trace with the runner that we searched for across the Quarkus issues. and that is how we found this current issue. |
I finally recreated in isolation the virtual thread pinning condition originally reported here. Given all that has been discussed so far and the feedback provided by @vsevel, I don't see any reason to further hold back this pull request removing the read/write lock from the RunnerClassLoader. |
I can confirm it is not useless, it fixed the original issue. |
we tested it again and were able to reproduce it:
thread dump shows multiple stacks (22 threads blocked) such as:
but I can't see |
we did some additional tests with your branch, hoping to capture a full thread dump when detecting the pinning. unfortunately (for this test) we did not reproduce the pinning with your fix. |
Describe the bug
reproducer below
After upgrading my app to 3.11.0 from 3.10.2 and changing nothing else, suddenly I get deadlocks. I turned on virtual thread pinning logs and indeed my carrier threads got pinned. Interestingly enough, there is no
<- monitor
in the stack, so I think this is not caused by synchronization but maybe by a native call? This might be related to the class loader changes in 3.11:
Expected behavior
Should not pin carrier threads, as was the case before (in 3.10.2)
Actual behavior
Pins carrier threads. This happens in parallel in the actual application, so it ends up pinning all threads and deadlocking the application.
How to Reproduce?
Minimal reproducer:
Output of uname -a or ver
win64
Output of java -version
build 21+35-2513
Quarkus version or git rev
3.11.0
Build tool (ie. output of mvnw --version or gradlew --version)
mvn
Additional information
No response