Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

3.11.0 causes carrier thread pinning regression (compared to 3.10.2) while loading classes #40917

Open
imperatorx opened this issue May 31, 2024 · 75 comments
Labels
area/virtual-threads Issue related to Java's Virtual Threads kind/bug Something isn't working

Comments

@imperatorx
Copy link
Contributor

imperatorx commented May 31, 2024

Describe the bug

reproducer below

After upgrading my app to 3.11.0 from 3.10.2 and changing nothing else, suddenly I get deadlocks. I turned on virtual thread pinning logs and indeed my carrier threads got pinned. Interestingly enough, there is no <- monitor in the stack, so I think this is not caused by synchronization but maybe a native call? This might be related to the class loader changes in 3.11:

Pinned thread trace log:

Thread[#67,ForkJoinPool-1-worker-1,5,CarrierThreads]
    java.base/java.lang.VirtualThread$VThreadContinuation.onPinned(VirtualThread.java:185)
    java.base/jdk.internal.vm.Continuation.onPinned0(Continuation.java:393)
    java.base/java.lang.VirtualThread.park(VirtualThread.java:592)
    java.base/java.lang.System$2.parkVirtualThread(System.java:2639)
    java.base/jdk.internal.misc.VirtualThreads.park(VirtualThreads.java:54)
    java.base/java.util.concurrent.locks.LockSupport.park(LockSupport.java:219)
    java.base/java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:754)
    java.base/java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireShared(AbstractQueuedSynchronizer.java:1079)
    java.base/java.util.concurrent.locks.ReentrantReadWriteLock$ReadLock.lock(ReentrantReadWriteLock.java:738)
    io.quarkus.bootstrap.runner.JarResource.readLockAcquireAndGetJarReference(JarResource.java:156)
    io.quarkus.bootstrap.runner.JarResource.getResourceData(JarResource.java:72)
    io.quarkus.bootstrap.runner.RunnerClassLoader.loadClass(RunnerClassLoader.java:105)
    io.quarkus.bootstrap.runner.RunnerClassLoader.loadClass(RunnerClassLoader.java:71)
    org.acme.GreetingResource.lambda$init$0(GreetingResource.java:32)
    java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:572)
    java.base/java.util.concurrent.FutureTask.run(FutureTask.java:317)
    java.base/java.lang.VirtualThread.run(VirtualThread.java:311)

Expected behavior

Should not pin carrier threads as before

Actual behavior

Pins carrier threads. This happens in parallel in the actual application, so it ends up pinning all threads and deadlocking the application.

How to Reproduce?

Minimal reproducer:

var cl = Thread.currentThread()
                .getContextClassLoader();
        System.out.println(cl.getClass());

        try(var exec = Executors.newVirtualThreadPerTaskExecutor()) {
            for(int i = 0; i < 2; i++) {
                exec.submit(()->{
                    try {
                       cl.loadClass("an application class name here");
                    } catch (Exception e) {
                        e.printStackTrace();
                    }
                });
            }
        }
  • Build with maven, then run from target/quarkus-app with -Djdk.tracePinnedThreads=short
  • Observe thread pin warnings in the log
  • Another weirdness is that even though prinning trace is set to "short", it prints full stacks?

Output of uname -a or ver

win64

Output of java -version

build 21+35-2513

Quarkus version or git rev

3.11.0

Build tool (ie. output of mvnw --version or gradlew --version)

mvn

Additional information

No response

@imperatorx imperatorx added the kind/bug Something isn't working label May 31, 2024
@imperatorx imperatorx changed the title 3.11.0 causes thread pinning regression (compared to 3.10.2) while using Kryo - possible class loader issue 3.11.0 causes thread pinning regression (compared to 3.10.2) while loading classes May 31, 2024
@imperatorx imperatorx changed the title 3.11.0 causes thread pinning regression (compared to 3.10.2) while loading classes 3.11.0 causes carrier thread pinning regression (compared to 3.10.2) while loading classes May 31, 2024
@dmlloyd
Copy link
Member

dmlloyd commented May 31, 2024

I can only hypothesize, but my hypothesis is that perhaps the JVM has an internal lock (or more likely, an invisible native call frame) which is being acquired in between the org.acme.GreetingResource.lambda$init$0(GreetingResource.java:32) frame and the io.quarkus.bootstrap.runner.RunnerClassLoader.loadClass(RunnerClassLoader.java:71) frame.

Maybe @franz1981 has an idea?

@dmlloyd
Copy link
Member

dmlloyd commented May 31, 2024

I'll try and think of some way to prove this hypothesis...

@dmlloyd
Copy link
Member

dmlloyd commented May 31, 2024

Looking into the JDK source, it appears that the stack walker used to print the pinned stack trace does not include the SHOW_HIDDEN_FRAMES flag, which I think strengthens my hypothesis considerably.

@franz1981
Copy link
Contributor

we do have a thread unsafe declared class loader so maybe is the synchronization performed by the JVM itself that is not well-behaving here...?

@geoand geoand added area/virtual-threads Issue related to Java's Virtual Threads and removed triage/needs-triage labels May 31, 2024
@quarkus-bot
Copy link

quarkus-bot bot commented May 31, 2024

/cc @cescoffier (virtual-threads), @ozangunalp (virtual-threads)

@franz1981
Copy link
Contributor

franz1981 commented May 31, 2024

image

@dmlloyd @Sanne @geoand these seems the only 2 commits which separate the 2 versions mentioned

I was wrong about being thread-unsafe because of registerAsParallelCapable(); added in such (with the relevant related changes to make it possible - which I have to learn/look at)

@dmlloyd
Copy link
Member

dmlloyd commented May 31, 2024

I don't think that does much more than register the classloader's class into a map, which in turn determines which object is being synchronized on (in this case, none).

I think that we may want to revisit the idea of blocking inside the class loader (in this case, we are using a shared lock on the resource map). I believe we should change our assumptions and assume that because class loading may happen in a virtual thread, we cannot block (wait) inside of a class loading operation. (Yes I realize that loading a resource may be a blocking operation, but it is generally not a waiting operation.)

@franz1981
Copy link
Contributor

franz1981 commented May 31, 2024

I don't think that does much more than register the classloader's class into a map, which in turn determines which object is being synchronized on (in this case, none).

yeah I meant that my previous statement about not being parallel capable was very wrong :P

@franz1981
Copy link
Contributor

re

I think that we may want to revisit the idea of blocking inside the class loader

I need first to understand what means "pinning" here, looking at https://github.com/openjdk/jdk/blob/jdk-21%2B35/src/java.base/share/classes/jdk/internal/vm/Continuation.java#L385 to match what the user is doing.

@franz1981
Copy link
Contributor

franz1981 commented May 31, 2024

@dmlloyd I think that what the JDK does to understand if pinning is happening is to check on (a failed) preemption the context of the frame, see https://github.com/openjdk/jdk/blob/jdk-21%2B35/src/java.base/share/classes/jdk/internal/vm/Continuation.java#L69-L71

So would be "nice" to have in the output of the pinned reason, which detailed reason is...

@dmlloyd
Copy link
Member

dmlloyd commented May 31, 2024

Right, I can only infer the possibility of a hidden native frame based on what's happening at the time, but we don't really know without either the pinned reason or full stack trace.

@franz1981
Copy link
Contributor

Exactly, and I've found this -> https://github.com/openjdk/jdk/blob/5abc02927b480a85fadecf8d03850604510276e4/src/hotspot/share/runtime/continuationFreezeThaw.cpp#L205-L207

I could debug it via gdb, in case (poor me .-. very rusty on this)

@dmlloyd
Copy link
Member

dmlloyd commented May 31, 2024

It might be simpler to patch the JDK PinnedThreadPrinter with more stack walker options as a first try using --patch-module.

@dmlloyd
Copy link
Member

dmlloyd commented May 31, 2024

The pinned thread printer could easily be enhanced to also print the pinning reason as well. I wonder if it is worth opening a bug and a PR?

@geoand
Copy link
Contributor

geoand commented May 31, 2024

Is there a JFR event that we could use to find the reason for the pinning?

@franz1981
Copy link
Contributor

franz1981 commented May 31, 2024

@geoand https://sap.github.io/SapMachine/jfrevents/21.html#virtualthreadpinned

The interesting bit is that you can configure threshold to check if is a temporary pinning or not

eg read/write reentrant locks in the past used to spin a bit, sometime, causing to actually "pin" the thread, IDK if this is still valid.

@franz1981
Copy link
Contributor

franz1981 commented May 31, 2024

@imperatorx what is printed with jdk.tracePinnedThreads=full?

@imperatorx
Copy link
Contributor Author

@imperatorx what is printed with jdk.tracePinnedThreads=full?

full pin trace:

Thread[#74,ForkJoinPool-1-worker-1,5,CarrierThreads]
    java.base/java.lang.VirtualThread$VThreadContinuation.onPinned(VirtualThread.java:185)
    java.base/jdk.internal.vm.Continuation.onPinned0(Continuation.java:393)
    java.base/java.lang.VirtualThread.park(VirtualThread.java:592)
    java.base/java.lang.System$2.parkVirtualThread(System.java:2639)
    java.base/jdk.internal.misc.VirtualThreads.park(VirtualThreads.java:54)
    java.base/java.util.concurrent.locks.LockSupport.park(LockSupport.java:219)
    java.base/java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:754)
    java.base/java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireShared(AbstractQueuedSynchronizer.java:1079)
    java.base/java.util.concurrent.locks.ReentrantReadWriteLock$ReadLock.lock(ReentrantReadWriteLock.java:738)
    io.quarkus.bootstrap.runner.JarResource.readLockAcquireAndGetJarReference(JarResource.java:156)
    io.quarkus.bootstrap.runner.JarResource.getResourceData(JarResource.java:72)
    io.quarkus.bootstrap.runner.RunnerClassLoader.loadClass(RunnerClassLoader.java:105)
    io.quarkus.bootstrap.runner.RunnerClassLoader.loadClass(RunnerClassLoader.java:71)
    org.acme.GreetingResource.lambda$init$3(GreetingResource.java:50)
    java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:572)
    java.base/java.util.concurrent.FutureTask.run(FutureTask.java:317)
    java.base/java.lang.VirtualThread.run(VirtualThread.java:311)

JFR stack frames printed:

{
  method = java.lang.VirtualThread.parkOnCarrierThread(boolean, long)
  lineNumber = 687
  bytecodeIndex = 116
  type = "Interpreted"
}

{
  method = java.lang.VirtualThread.park()
  lineNumber = 603
  bytecodeIndex = 200
  type = "Interpreted"
}

{
  method = java.lang.System$2.parkVirtualThread()
  lineNumber = 2639
  bytecodeIndex = 17
  type = "Interpreted"
}

{
  method = jdk.internal.misc.VirtualThreads.park()
  lineNumber = 54
  bytecodeIndex = 3
  type = "Interpreted"
}

{
  method = java.util.concurrent.locks.LockSupport.park(Object)
  lineNumber = 219
  bytecodeIndex = 16
  type = "Interpreted"
}

{
  method = java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer$Node, int, boolean, boolean, boolean, long)
  lineNumber = 754
  bytecodeIndex = 361
  type = "Interpreted"
}

{
  method = java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireShared(int)
  lineNumber = 1079
  bytecodeIndex = 15
  type = "Interpreted"
}

{
  method = java.util.concurrent.locks.ReentrantReadWriteLock$ReadLock.lock()
  lineNumber = 738
  bytecodeIndex = 5
  type = "Interpreted"
}

{
  method = io.quarkus.bootstrap.runner.JarResource.readLockAcquireAndGetJarReference()
  lineNumber = 156
  bytecodeIndex = 4
  type = "Interpreted"
}

{
  method = io.quarkus.bootstrap.runner.JarResource.getResourceData(String)
  lineNumber = 72
  bytecodeIndex = 1
  type = "Interpreted"
}

{
  method = io.quarkus.bootstrap.runner.RunnerClassLoader.loadClass(String, boolean)
  lineNumber = 105
  bytecodeIndex = 159
  type = "Interpreted"
}

{
  method = io.quarkus.bootstrap.runner.RunnerClassLoader.loadClass(String)
  lineNumber = 71
  bytecodeIndex = 3
  type = "Interpreted"
}

{
  method = org.acme.GreetingResource.lambda$init$3(ClassLoader)
  lineNumber = 50
  bytecodeIndex = 1
  type = "Interpreted"
}

{
  method = org.acme.GreetingResource$$Lambda+0x0000024c211e1248.904209509.run()
  lineNumber = -1
  bytecodeIndex = 4
  type = "Interpreted"
}

{
  method = java.util.concurrent.Executors$RunnableAdapter.call()
  lineNumber = 572
  bytecodeIndex = 4
  type = "Interpreted"
}

{
  method = java.util.concurrent.FutureTask.run()
  lineNumber = 317
  bytecodeIndex = 39
  type = "Interpreted"
}

{
  method = java.lang.VirtualThread.runWith(Object, Runnable)
  lineNumber = 341
  bytecodeIndex = 5
  type = "Interpreted"
}

{
  method = java.lang.VirtualThread.run(Runnable)
  lineNumber = 311
  bytecodeIndex = 63
  type = "Interpreted"
}

{
  method = java.lang.VirtualThread$VThreadContinuation$1.run()
  lineNumber = 192
  bytecodeIndex = 8
  type = "Interpreted"
}

{
  method = jdk.internal.vm.Continuation.enter0()
  lineNumber = 320
  bytecodeIndex = 4
  type = "Interpreted"
}

{
  method = jdk.internal.vm.Continuation.enter(Continuation, boolean)
  lineNumber = 312
  bytecodeIndex = 1
  type = "Interpreted"
}

{
  method = jdk.internal.vm.Continuation.enterSpecial(Continuation, boolean, boolean)
  lineNumber = -1
  bytecodeIndex = 0
  type = "Native"
}

@imperatorx
Copy link
Contributor Author

When I run it with the java 23 loom early acces JDK, the JFR event has an additional message:

pinnedReason = "Native frame or <clinit> on stack"

@franz1981
Copy link
Contributor

It looks to me as a reentrancy case (because I expect no other threads are there holding the write lock) likely due to a write lock already held and a read lock unable to be acquired (hence leading to park).

@franz1981
Copy link
Contributor

franz1981 commented Jun 3, 2024

I've tried to reproduce with this on 3.11.0 on an hello world endpoint:

@Path("/hello")
public class GreetingResource {

    private static class Fun {

    }


    @GET
    @Produces(MediaType.TEXT_PLAIN)
    public String hello() {
        var cl = Thread.currentThread()
                .getContextClassLoader();
        System.out.println(cl.getClass());
        try(var exec = Executors.newVirtualThreadPerTaskExecutor()) {
            for (int i = 0; i < 2; i++) {
                exec.submit(() -> {
                    try {
                        cl.loadClass("profiling.workshop.greeting.GreetingResource$Fun");
                        System.out.println("Loaded");
                    } catch (Exception e) {
                        e.printStackTrace();
                    }
                });
            }
        }
        return "hello";
    }
}

on

Java(TM) SE Runtime Environment (build 21+35-LTS-2513)
Java HotSpot(TM) 64-Bit Server VM (build 21+35-LTS-2513, mixed mode, sharing)

without much luck.
I've used

curl http://localhost:8080/hello

few times, both running the quarkus application with -Djdk.tracePinnedThreads=short and without it, not getting the same outcome of this issue (deadlock or any log re pinned threads).

@imperatorx Can you share the full reproducer? To check if is any different from what I've done...?
Did you produce an high load of requests instead?

@franz1981
Copy link
Contributor

franz1981 commented Jun 3, 2024

aha, found why: the classloading with inner classes is not the same as any top level application classes .-. dumb me :D

Said that, by switching to

@Path("/hello")
public class GreetingResource {

    @GET
    @Produces(MediaType.TEXT_PLAIN)
    public String hello() {
        var cl = Thread.currentThread()
                .getContextClassLoader();
        System.out.println(cl.getClass());
        try(var exec = Executors.newVirtualThreadPerTaskExecutor()) {
            for (int i = 0; i < 2; i++) {
                exec.submit(() -> {
                    try {
                        cl.loadClass("profiling.workshop.greeting.Fun");
                    } catch (Exception e) {
                        e.printStackTrace();
                    }
                });
            }
        }
        return "hello";
    }

}

or increasing the iteration count to 10_000 still didn't worked to cause the deadlock.

@imperatorx
Copy link
Contributor Author

The deadlock happens in the actual application with more complex stuff going on, the reproducer is just to reproduce that the io.quarkus.bootstrap.runner.RunnerClassLoader pins the carrier thread in 3.11 where it did not in 3.10. The thread.park is not a problem on itself, but there is a native frame or a static class initializer on the stack somewhere (not visible in pin trace though) that is causing the pinning of the carrier thread.

@franz1981
Copy link
Contributor

franz1981 commented Jun 3, 2024

the reproducer is just to reproduce that the io.quarkus.bootstrap.runner.RunnerClassLoader pins the carrier thread

Got it: so @dmlloyd @Sanne @geoand from what I can see here:

  • if multiple (virtual threads) call RunnerClassLoader::loadClass and findLoadedClass return null, they all compete against the same JarResource::readLock (which internally can spin due to the contention)
  • Right after releasing the read lock at JarResource::readLock they all acquire the write lock at JarResource::ensureJarFileIsOpen - which again means parking any other threads AND making them to spin

I'm not quite sure about the deadlock, here, but that's why I would like a reproducer that at least make the deadlock to happen, because, under heavy contention lock/unlock can cause some form of monopolization, for sure, due to spinning.

@imperatorx The reproducer you placed there is able to produce the stack trace due to pinning you posted? Because that's what I cannot reproduce...

@mariofusco
Copy link
Contributor

We agree that there is a more straightforward solution consisting of getting rid of the jars and exploding them into directories.

I would still like to people to try and poke holes in this idea... I personally don't see why it would not work better, but others might :).

Also, this solution would definitely be more involved (in terms of how many things need to be changed) than fixes such as this one.

Yes, we need to further validate this idea and check the implications, but it is only the (optional) point 4. of the actions that I suggested and I think that 1. to 3. could be taken regardless of it.

I'm also not convinced that we need parallel classloading capabilities as these are "just" the application classes, and mostly loaded by a single bootstrap thread. Did that change actually manifest in tangible benefits?

@Sanne In reality this is for sure the main point of this whole discussion. The first thing that we need to address is figuring out if we need a parallel classloader or not. See premise 1. of my former comment, I had a quick chat with @franz1981 this morning about it and he thinks that, regardless the smallrye fix for which the parallel classloader has been introduced, there could be other relevant use cases where it could be necessary or at least a nice to have. Can you please help us to double check this? If @franz1981 is right and since @geoand's suggestion is a more long-term solution, the only alternative would be finalizing my pull request, making it as good as possible, and merging it.

@Sanne
Copy link
Member

Sanne commented Jun 10, 2024

Also there's the ever-present risk of ABA deadlocks in the event that there is any kind of circularity happening when locking is in use.

@dmlloyd this concerns me and I believe you mentioned this in the past as well - I'm not aware of us getting in potential deadlocks with the existing code but that's more likely due to not having understood the scenario you have in mind.

You think you could describe a potential scenario in such a way that someone here might try to create some tests for it?

FWIW when I last reviewed the design (and it's been a while - I have not been able to look carefully at the current state) we only had reentrant locks protecting resource state changes, locks would not be held for longer than such state transitions and (I believe) not held during operations that could trigger loading of other classes - at least not from a different thread, which would have been necessary to get in trouble with the reentrant locks.

@vsevel
Copy link
Contributor

vsevel commented Jun 19, 2024

the app gets blocked after we see the Thread[#66,ForkJoinPool-1-worker-2,5,CarrierThreads] java.base/java.lang.VirtualThread$VThreadContinuation.onPinned(VirtualThread.java:183)
we have 53 threads saying: waiting for quarkus-virtual-thread-17@16977 to release lock on <0x4265> (a io.quarkus.bootstrap.runner.RunnerClassLoader)
the liveness probe does not answer.
we see timeouts on kafka consumer poll.
some examples of thread being blocked:

"micrometer-kafka-metrics@16921" tid=0x64 nid=NA waiting for monitor entry
  java.lang.Thread.State: BLOCKED
              blocks micrometer-kafka-metrics@16901
              blocks micrometer-kafka-metrics@16912
              blocks micrometer-kafka-metrics@16918
              blocks micrometer-kafka-metrics@16924
              blocks micrometer-kafka-metrics@16927
              blocks micrometer-kafka-metrics@16930
              waiting for quarkus-virtual-thread-17@16977 to release lock on <0x4265> (a io.quarkus.bootstrap.runner.RunnerClassLoader)
                at io.micrometer.core.instrument.Tags$ArrayIterator.<init>(Tags.java:131)
                at io.micrometer.core.instrument.Tags.iterator(Tags.java:128)
                at io.micrometer.core.instrument.Tags.of(Tags.java:224)
...

or

"smallrye-kafka-consumer-thread-2@16897" tid=0x4b nid=NA waiting for monitor entry
  java.lang.Thread.State: BLOCKED
              waiting for quarkus-virtual-thread-17@16977 to release lock on <0x4265> (a io.quarkus.bootstrap.runner.RunnerClassLoader)
                at io.smallrye.mutiny.groups.UniOnItem.delayIt(UniOnItem.java:188)
                at io.smallrye.mutiny.helpers.ExponentialBackoff.lambda$randomExponentialBackoffFunction$0(ExponentialBackoff.java:48)
                at io.smallrye.mutiny.helpers.ExponentialBackoff$$Lambda/0x00007fd03090fa88.apply(Unknown Source:-1)
                at io.smallrye.context.impl.wrappers.SlowContextualFunction.apply(SlowContextualFunction.java:21)
                at io.smallrye.mutiny.groups.MultiOnItem.lambda$transformToUni$6(MultiOnItem.java:268)

or

"executor-thread-1@16867" tid=0x1f nid=NA waiting for monitor entry
  java.lang.Thread.State: BLOCKED
              waiting for quarkus-virtual-thread-17@16977 to release lock on <0x4265> (a io.quarkus.bootstrap.runner.RunnerClassLoader)
                at com.x.HealthCheckService_Bean.proxy(Unknown Source:-1)
                at com.x.HealthCheckService_Bean.get(Unknown Source:-1)
                at com.x.HealthCheckService_Bean.get(Unknown Source:-1)
                at io.quarkus.arc.impl.ArcContainerImpl.beanInstanceHandle(ArcContainerImpl.java:554)
                at io.quarkus.arc.impl.ArcContainerImpl.beanInstanceHandle(ArcContainerImpl.java:534)
                at io.quarkus.arc.impl.ArcContainerImpl.beanInstanceHandle(ArcContainerImpl.java:567)
                at io.quarkus.arc.impl.ArcContainerImpl.instance(ArcContainerImpl.java:339)
                at x.HealthCheckService_ScheduledInvoker_autoCheckAll_01fef9f6dc57a56c94b1934cb946de439403b427.invokeBean(Unknown Source:-1)

we have to wait for the liveness to kill the pod. and processing can restart. that is harsh.

@mariofusco
Copy link
Contributor

the app gets blocked after we see the Thread[#66,ForkJoinPool-1-worker-2,5,CarrierThreads] java.base/java.lang.VirtualThread$VThreadContinuation.onPinned(VirtualThread.java:183) we have 53 threads saying: waiting for quarkus-virtual-thread-17@16977 to release lock on <0x4265> (a io.quarkus.bootstrap.runner.RunnerClassLoader) the liveness probe does not answer. we see timeouts on kafka consumer poll.

This seems to be definitively a bug of the parallel RunnerClassLoader implementation and indeed we (temporaraly?) reverted it to the original sequential version. What at the moment isn't clear to me is why this is happening only with virtual threads, while if there is a potential deadlock condition this should probably happen regardless of the kind of threads. The only (admittedly partial and not strong enough) explanation that I can see at the moment is that with virtual threads we have a higher degree of parallelism thus making the problem more evident and easier to reproduce.

About the reproducer it would be great if you could provide one, or at least give us some more hints on how to create one on our own, for instance providing the full stack traces of the Quarkus application when the deadlock is detected. I will try to write a proper unit test for the parallel RunnerClassLoader anyway, otherwise any potential fix would be a blind one.

As a final note, these days I'm working on an alternative lock-free implementation of the parallel RunnerClassLoader. @vsevel it would be great if you could have a chance to try that pull request in your environment and tell us if this solves the problem, making the deadlocks to disappear or not.

@vsevel
Copy link
Contributor

vsevel commented Jun 19, 2024

it would be great if you could provide one

not likely. when it happens it always is close to startup (10 seconds after the app has started). the app consumes messages at a decent rate (25 to 50 messages per second) with virtual thread max-concurrency: 5.

full stack traces of the Quarkus application when the deadlock is detected.

a thread dump you mean?

it would be great if you could have a chance to try that pull request in your environment and tell us if this solves the problem, making the deadlocks to disappear or not.

sure. no problem.

@mariofusco
Copy link
Contributor

full stack traces of the Quarkus application when the deadlock is detected.

a thread dump you mean?

Yes sorry, I meant a complete thread dump of the application when the deadlock happens.

@vsevel
Copy link
Contributor

vsevel commented Jun 19, 2024

threads.txt

@mariofusco
Copy link
Contributor

Thanks a lot for that thread dump. I must admit that it looks quite weird though and I'm still trying to make sense of it. What I mean is that I see 53 threads saying

  java.lang.Thread.State: BLOCKED
              waiting for quarkus-virtual-thread-17@16977 to release lock on <0x4265> (a io.quarkus.bootstrap.runner.RunnerClassLoader)

but at the same time neither I see any track of that virtual thread quarkus-virtual-thread-17@16977 in the whole dump, nor there is any other thread, virtual or not, with a stack saying something like locked <0x4265>, so I still have no clue of what is holding that lock and why it is not releasing it.

@vsevel for now I will continue my investigations, mostly trying to reproduce the same problem with a more isolated test case. Please let me know when you'll have a chance to give a try to my pull request and how it works for you.

@vsevel
Copy link
Contributor

vsevel commented Jun 19, 2024

but at the same time neither I see any track of that virtual thread quarkus-virtual-thread-17@16977 in the whole dump, nor there is any other thread, virtual or not, with a stack saying something like locked <0x4265>, so I still have no clue of what is holding that lock and why it is not releasing it.

same for me. I could not find the VT holding the lock. we created the dump by connecting with the remote debugger from intellij into our ocp cluster using port forward, and executing Run > Debubgging Actions > Get Thread Dump. I do not know if this dump has limited visibility.

Please let me know when you'll have a chance to give a try to my pull request and how it works for you.

where is it?

FYI I also created #41313
this is not related to class loading, but I need help investigating this

@mariofusco
Copy link
Contributor

Please let me know when you'll have a chance to give a try to my pull request and how it works for you.

where is it?

#40942

FYI I also created #41313 this is not related to class loading, but I need help investigating this

I will give a look asap, but I would like to make sense of this issue first.

@franz1981
Copy link
Contributor

franz1981 commented Jun 19, 2024

Sadly the thread dump analysis could be unfruitful due to how RW locks work, see https://github.com/openjdk/jdk/blob/master/src/java.base/share/classes/java/util/concurrent/locks/ReentrantReadWriteLock.java#L336

It stores the information of first uncontended read locker Thread, but just for performance reasons, and there is no concept of "read ownership" because a read lock has no exclusive ownership, while the write one, yep.
This means that a thread dump cannot print read lock owners which prevent writers or others to be unblocked.
(Let's remember that no write lock can be acquired while a read one is already held).
I believe that's the deadlock happening here:

  • a read lock acquired
  • a reentrant class loading on recursive define class
  • a subsequent write lock, which cause a deadlock with itself, but invisible to thread dumps

@mariofusco
Copy link
Contributor

Please let me know when you'll have a chance to give a try to my pull request and how it works for you.

where is it?

#40942

@vsevel Did you have a chance to give a try to my pull request? I'm still trying to reproduce the problem in isolation, but still no luck so far :(

@vsevel
Copy link
Contributor

vsevel commented Jun 24, 2024

Did you have a chance to give a try to my pull request? I'm still trying to reproduce the problem in isolation, but still no luck so far :(

I will spend some time these next 2 days

@franz1981
Copy link
Contributor

@geoand
Copy link
Contributor

geoand commented Jun 25, 2024

Thanks for that link!

but class loading and initialization is something that needs to be re-examined next

I assume Alan means from the JDK folks?

@franz1981
Copy link
Contributor

franz1981 commented Jun 25, 2024

Yep, but that means too that using reentrant locks while on a native call (i.e. defineClass) "maybe" won't be a problem in subsequent releases - as per synchronized blocks.
That means that I won't be worried for the pinning event but only to fix the deadlocking issue.

@franz1981
Copy link
Contributor

franz1981 commented Jun 25, 2024

After a call with @mariofusco we found how to trigger it reliably: there's no need (i.e. it is not seen as pinning by the JVM) to call <ClassLoader>::loadClass but the real reason why pinning happen is if we have native stack frames in the carrier i.e. Class::forName or just NameOfTheClass.class.getName()

@geoand
Copy link
Contributor

geoand commented Jun 25, 2024

That's a very interesting find...

@dmlloyd
Copy link
Member

dmlloyd commented Jun 25, 2024

Any late linkage will cause it, for example calling a method that returns a Foo will cause Foo to be loaded if it already isn't loaded, plus the corresponding hidden native frame.

@franz1981
Copy link
Contributor

Now the funny part is the deadlock, but actually If waiting on a lock pinn the carrier - be just need to make some recursive class definition to enter in the locks in the right other and exhausting the number of carriers and it's done :P

@vsevel
Copy link
Contributor

vsevel commented Jun 25, 2024

we were able to spend some time on this issue.
we confirm @mariofusco that with your branch we have not been able to reproduce the deadlock across 8 runs. but we still see the pinning issue when going through RunnerClassLoader.
I understand it is more of an issue inside the open jdk. correct?
just to make sure, we reverted the image to the one built with 3.8, and on the first run we got some pinning + deadlock.

@mariofusco
Copy link
Contributor

we were able to spend some time on this issue. we confirm @mariofusco that with your branch we have not been able to reproduce the deadlock across 8 runs.

This is a great news. Thanks for giving it a try.

but we still see the pinning issue when going through RunnerClassLoader. I understand it is more of an issue inside the open jdk. correct?

Yes, this is not completely related even though still unexpected. Can you please paste the stack trace of the pinning using my classloader?

just to make sure, we reverted the image to the one built with 3.8, and on the first run we got some pinning + deadlock.

This is also unexpected to be honest, or probably I'm missing something. If you reverted to 3.8 it means that you're using the RunnerClassLoader when it still was sequential, or in other words far before it was registered as parallel capable in version 3.11.0. Do you mean that also the sequential version of the old classloader can suffer of deadlock? If so the revert that we did recently to that sequential version is practically useless, can you please double check?

@vsevel
Copy link
Contributor

vsevel commented Jun 25, 2024

Can you please paste the stack trace of the pinning using my classloader?

I believe this is this issue #41314

can you please double check?

let me check with the dev team.

@mariofusco
Copy link
Contributor

Can you please paste the stack trace of the pinning using my classloader?

I believe this is this issue #41314

Ok, this makes sense now: it is a totally different pinning problem and completely unrelated with the classloader.

@vsevel
Copy link
Contributor

vsevel commented Jun 25, 2024

can you please double check?

short answer (before we re-test) is: I am definitely sure. because they always stayed with the LTS, and the behavior they we were saying is the stack with the runner that we searched across the quarkus issues. and that is how we found this current issue.

@mariofusco
Copy link
Contributor

I finally recreated in isolation the virtual thread pinning condition originally reported here.

Given all that have been discussed so far and the feedback provided by @vsevel I don't see any reason to further hold this pull request removing the read/write lock from the RunnerClassLoader.

@imperatorx
Copy link
Contributor Author

If so the revert that we did recently to that sequential version is practically useless

I can confirm it is not useless, it fixed the original issue.

@vsevel
Copy link
Contributor

vsevel commented Jun 26, 2024

we tested it again and were able to reproduce it:

2024-06-26 09:28:18,231 INFO [io.quarkus] (main) {} x 1.1.6-SNAPSHOT on JVM (powered by Quarkus 3.8.4) started in 6.142s. Listening on: http://0.0.0.0:8080 ...
...
Thread[#80,ForkJoinPool-1-worker-2,5,CarrierThreads]
    java.base/java.lang.VirtualThread$VThreadContinuation.onPinned(VirtualThread.java:183)
    java.base/jdk.internal.vm.Continuation.onPinned0(Continuation.java:393)
    java.base/java.lang.VirtualThread.park(VirtualThread.java:582)
    java.base/java.lang.System$2.parkVirtualThread(System.java:2643)
    java.base/jdk.internal.misc.VirtualThreads.park(VirtualThreads.java:54)
    java.base/java.util.concurrent.locks.LockSupport.park(LockSupport.java:219)
    java.base/java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:754)
    java.base/java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireShared(AbstractQueuedSynchronizer.java:1079)
    java.base/java.util.concurrent.locks.ReentrantReadWriteLock$ReadLock.lock(ReentrantReadWriteLock.java:738)
    io.quarkus.bootstrap.runner.JarResource.readLockAcquireAndGetJarReference(JarResource.java:143)
    io.quarkus.bootstrap.runner.JarResource.getResourceData(JarResource.java:72)
    io.quarkus.bootstrap.runner.RunnerClassLoader.loadClass(RunnerClassLoader.java:99)
    io.quarkus.bootstrap.runner.RunnerClassLoader.loadClass(RunnerClassLoader.java:65)
    io.quarkus.cache.runtime.CacheInterceptor.getArcCacheInterceptionContext(CacheInterceptor.java:82)
    io.quarkus.cache.runtime.CacheInterceptor.getInterceptionContext(CacheInterceptor.java:61)
    io.quarkus.cache.runtime.CacheResultInterceptor.intercept(CacheResultInterceptor.java:43)
    io.quarkus.cache.runtime.CacheResultInterceptor_Bean.intercept(Unknown Source)
    io.quarkus.arc.impl.InterceptorInvocation.invoke(InterceptorInvocation.java:42)
    io.quarkus.arc.impl.AroundInvokeInvocationContext.perform(AroundInvokeInvocationContext.java:30)
...

thread dump shows multiple stacks (22 threads blocked) such as:

"vert.x-eventloop-thread-0@16949" tid=0x1a nid=NA waiting for monitor entry
  java.lang.Thread.State: BLOCKED
	 waiting for quarkus-virtual-thread-16@17044 to release lock on <0x4293> (a io.quarkus.bootstrap.runner.RunnerClassLoader)
	  at io.vertx.core.net.impl.pool.SimpleConnectionPool$Evict.execute(SimpleConnectionPool.java:537)
	  at io.vertx.core.net.impl.pool.SimpleConnectionPool$Evict.execute(SimpleConnectionPool.java:500)
	  at io.vertx.core.net.impl.pool.CombinerExecutor.submit(CombinerExecutor.java:67)
	  at io.vertx.core.net.impl.pool.SimpleConnectionPool.execute(SimpleConnectionPool.java:244)
	  at io.vertx.core.net.impl.pool.SimpleConnectionPool.evict(SimpleConnectionPool.java:549)
	  at io.vertx.core.http.impl.SharedClientHttpStreamEndpoint.checkExpired(SharedClientHttpStreamEndpoint.java:115)
	  at io.vertx.core.http.impl.HttpClientImpl.lambda$static$1(HttpClientImpl.java:106)
	  at io.vertx.core.http.impl.HttpClientImpl$$Lambda/0x00007fd358434700.accept(Unknown Source:-1)
	  at java.util.concurrent.ConcurrentHashMap$ValuesView.forEach(ConcurrentHashMap.java:4783)
	  at io.vertx.core.net.impl.pool.ConnectionManager.forEach(ConnectionManager.java:33)
	  at io.vertx.core.http.impl.HttpClientImpl.checkExpired(HttpClientImpl.java:150)
	  at io.vertx.core.http.impl.HttpClientImpl.access$100(HttpClientImpl.java:48)
	  at io.vertx.core.http.impl.HttpClientImpl$PoolChecker.handle(HttpClientImpl.java:144)
	  at io.vertx.core.http.impl.HttpClientImpl$PoolChecker.handle(HttpClientImpl.java:132)
	  at io.vertx.core.impl.VertxImpl$InternalTimerHandler.handle(VertxImpl.java:1056)
	  at io.vertx.core.impl.VertxImpl$InternalTimerHandler.handle(VertxImpl.java:1027)
	  at io.vertx.core.impl.ContextImpl.emit(ContextImpl.java:328)
	  at io.vertx.core.impl.ContextImpl.emit(ContextImpl.java:321)
	  at io.vertx.core.impl.ContextInternal.emit(ContextInternal.java:209)
	  at io.vertx.core.impl.VertxImpl$InternalTimerHandler.run(VertxImpl.java:1045)
	  at io.netty.util.concurrent.PromiseTask.runTask(PromiseTask.java:98)
	  at io.netty.util.concurrent.ScheduledFutureTask.run(ScheduledFutureTask.java:153)
	  at io.netty.util.concurrent.AbstractEventExecutor.runTask(AbstractEventExecutor.java:173)
	  at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:166)
	  at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:470)
	  at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:569)
	  at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:997)
	  at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
	  at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
	  at java.lang.Thread.runWith(Thread.java:1596)
	  at java.lang.Thread.run(Thread.java:1583)

but I can't see quarkus-virtual-thread-16@17044 anywhere.

@vsevel
Copy link
Contributor

vsevel commented Jun 26, 2024

we did some additional tests with your branch, hoping to capture a full thread dump when detecting the pinning. unfortunately (for this test) we did not reproduce the pinning with your fix.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/virtual-threads Issue related to Java's Virtual Threads kind/bug Something isn't working
Projects
Status: Planned
Development

No branches or pull requests

7 participants