VirtualThread pinning with RealSpan and zipkin2.reporter #1426
Comments
So, you are suggesting that the synchronized keyword in libraries is basically a problem in general now? I would be surprised if most code worked, if that's the case. Do you have any documentation that can help discuss this, as we also need to consider old versions of Java (Brave supports Java 6)?
@codefromthecrypt just to chime in, the official recommendation from Oracle [1] is to avoid synchronized blocks that guard frequent or long-running operations, replacing them with java.util.concurrent locks such as ReentrantLock.

PS: ready to help with the change

[1] https://spring.io/blog/2022/10/11/embracing-virtual-threads#mitigating-limitations
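For reference, here is a minimal sketch of that mitigation. The `Handler` interface and class names are illustrative stand-ins, not Brave's actual SpanHandler API; the point is only the lock swap.

```java
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical stand-in for the real handler type; names are illustrative only.
interface Handler {
  void end(Object span);
}

class LockBasedFinisher {
  private final ReentrantLock lock = new ReentrantLock();
  private final Handler handler;

  LockBasedFinisher(Handler handler) {
    this.handler = handler;
  }

  // On JDK 21, blocking inside a synchronized block pins the virtual thread
  // to its carrier. ReentrantLock lets the virtual thread unmount while it
  // waits, freeing the carrier for other virtual threads.
  void finish(Object span) {
    lock.lock();
    try {
      handler.end(span); // the work formerly done inside synchronized
    } finally {
      lock.unlock();
    }
  }
}
```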
@reta ok, thanks for the details and the offer to help (saves me from asking you ;)). There are a few places here and in zipkin-reporter that use synchronized. If you don't mind, check the benchmarks before and after (where they exist and are important).
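As a starting point, here is a JMH sketch of the kind of before/after comparison meant here. The class and benchmark names are hypothetical; a real measurement would exercise the actual reporter path rather than a bare counter.

```java
import java.util.concurrent.locks.ReentrantLock;

import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.State;

// Hypothetical JMH sketch comparing the two locking strategies under
// contention; run with multiple threads (e.g. -t 8) to see the difference.
@State(Scope.Benchmark)
public class LockBenchmark {
  private final Object monitor = new Object();
  private final ReentrantLock lock = new ReentrantLock();
  private long counter;

  @Benchmark
  public long synchronizedFinish() {
    synchronized (monitor) {
      return ++counter; // stand-in for the work inside handler.end
    }
  }

  @Benchmark
  public long reentrantLockFinish() {
    lock.lock();
    try {
      return ++counter;
    } finally {
      lock.unlock();
    }
  }
}
```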
Sure, will do, thanks @codefromthecrypt
Here's a TL;DR from the research I've done (and p.s. I'm done ;) ). This is a nice overview of the situation: https://todd.ginsberg.com/post/java/virtual-thread-pinning/ and its discussion of when pinning actually matters is especially helpful.
So, yes, there can be pinning. Someone can decide this is critical and choose to write another handler. IMHO "suboptimal" is subjective, and this is likely not the most suboptimal code in a typical Spring Boot stack. Given that context, we should be careful about how we proceed and not introduce a cure worse than the disease. An approach that precisely breaks up the critical section is probably best; one that just tries to get rid of synchronized everywhere is probably not. More notes below.

Now, there is a lot of suboptimal code in a Spring Boot stack, especially before Micrometer 1.3, because when things moved from Sleuth to the new system a couple of things were missed. One is that Brave's samplers (like the HTTP sampler) are no longer used automatically (obviously you don't want to report 100% of zero-value things). This hasn't been fixed, and I don't expect it to be. The other is that under the SpanHandler.finish function, the span is copied into another format (Zipkin core) unnecessarily, which adds overhead inside that synchronized block. This could have been addressed for years, but it wasn't until I did it recently. It should be out in 1.3, which you can try in our example now: https://github.com/openzipkin/brave-example/tree/master/webflux6-micrometer

So, what's left? My suggestion isn't to overreact, but to look at the issue holistically. Folks this narrowly interested in performance may be unable to use most of the Spring Boot/Micrometer setup at all. For example, if you measure the frontend examples against Armeria directly, there are millisecond-grade differences. Yes, this is about concurrency, but whatever is done should be measured, and I bet other problems will rank way ahead of this, such as stack depths hundreds of frames deep due to reactive code.

Instead of overreacting, I suggest a very narrow approach. If the goal is to reduce pressure on the lock, there is an option that actually adds overhead but likely gets rid of this message. I say an "option" because for many this is a cure worse than the disease. Right now there is a lock on finish; let's simplify it to this:

```java
synchronized (mutableSpan) {
  // can take time in AsyncReporter due to sizing to avoid OOM on backlog
  handler.end(mutableSpan);
}
```

So, if we assume this lock being held somewhere in the microsecond range is a problem, the problem is inside `handler.end`, which does some work to prevent OOM (sizing the span against the backlog). If we really have measured this and think it's a problem, one thing not yet done is moving that OOM handling onto another thread: basically, re-use the AsyncReporter thread to execute the sizing work.

This will cause problems, of course. One is that, to do this safely, you must now copy/clone the mutable span. Also, the backpressure/blocking that exists today is what prevents OOM; by moving it to another thread, you can accept more spans into that sizer thread than would fit in memory under your rules. In other words, you make an OOM more likely. To adjust for this, you may need to reduce your maxSize to account for the larger number of in-flight (not yet sized) spans. The size heuristic isn't perfect anyway; just know there is a side effect. Still, for someone really concerned about this log message, or who hits performance problems reduced to this critical section, maybe it is a good tradeoff.

Regardless, zipkin-reporter has not had much love, and as with this topic, few people actually working on tracing contribute. Its async design is very old and could be refactored in light of this issue and the advances of the last 5+ years. All that said, some configuration that reduces the lock in zipkin-reporter, and only when the correct code is in use, seems like a reasonable, narrow fix.

I'll leave the final decision to @reta, and someone else should review his change, because I don't work on this project, this programming language, or even tracing. I hope the pointers help, though!
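For concreteness, here is a rough sketch of that hand-off idea: the caller enqueues a copy of the span without blocking, and the single reporter thread does the sizing that today happens inside the synchronized block. All names are hypothetical illustrations, not zipkin-reporter's actual API.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Illustrative sketch only: hand the (copied, encoded) span to the reporter
// thread so the caller never blocks inside a lock.
class HandoffReporter {
  // Bounded queue: without the old blocking/sizing step, unsized spans can
  // pile up here, so this bound (and maxBytes) may need to shrink to keep
  // the same OOM safety margin described above.
  private final BlockingQueue<byte[]> pending = new ArrayBlockingQueue<>(10_000);

  void report(byte[] encodedSpanCopy) {
    // offer() never blocks: dropping under pressure replaces the old
    // synchronized backpressure, trading span loss for no pinning.
    if (!pending.offer(encodedSpanCopy)) {
      // dropped: a real implementation would record a metric here
    }
  }

  // Runs on the single reporter thread, which now also does the sizing
  // work that used to happen inside the caller's critical section.
  void flushLoop() throws InterruptedException {
    while (true) {
      byte[] next = pending.take();
      // size accounting, batching, and sending would happen here
    }
  }
}
```

The bounded `offer()` makes the tradeoff explicit: under pressure this version drops rather than blocks, which is exactly the side effect described above.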
Heading out on holiday; here's a suggestion to move past this that will be easy to configure: openzipkin/zipkin-reporter-java#259
Detailed rationale in RATIONALE.md
Closes #204
Closes openzipkin/brave#1426
Signed-off-by: Adrian Cole <adrian@tetrate.io>
https://github.com/openzipkin/zipkin-reporter-java/releases/tag/3.4.0 removes the blocking during report
Describe the Bug
I am using Spring Boot 3 with tracing through Brave & Zipkin. All was fine until I tried out virtual threads with JDK 21.
Steps to Reproduce
Execute a Spring Boot 3 app with Brave/Zipkin tracing configured and set the JVM flag
-Djdk.tracePinnedThreads=full
to detect problematic pinning. After a while you get the following stack trace:
The finish() method of RealSpan is marked as the culprit.
I am no expert in multithreading and locking, but it seems the synchronized keyword is the issue here; it should be replaced with a virtual-thread-friendlier way of locking, such as ReentrantLock.
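For what it's worth, a minimal standalone program (no Brave involved) reproduces the same diagnostic on JDK 21, which helps confirm the mechanism:

```java
// Minimal repro of the pinning diagnostic (assumes JDK 21).
// Run with: java -Djdk.tracePinnedThreads=full Pinning.java
public class Pinning {
  private static final Object LOCK = new Object();

  public static void main(String[] args) throws InterruptedException {
    Thread vt = Thread.ofVirtual().start(() -> {
      synchronized (LOCK) {
        try {
          Thread.sleep(100); // blocking inside synchronized pins the carrier
        } catch (InterruptedException e) {
          Thread.currentThread().interrupt();
        }
      }
    });
    vt.join(); // with the flag set, the JVM prints the pinned stack trace
  }
}
```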
Expected Behaviour
No thread pinning should occur.