VirtualThread pinning with RealSpan and zipkin2.reporter #1426
Comments
So, you are suggesting that the synchronized keyword in libraries is basically a problem in general now? I would be surprised if most code worked, if that's the case. Do you have any documentation that can help discuss this, as we also need to consider old versions of Java (Brave supports Java 6)?
@codefromthecrypt just to chime in, the official recommendation from Oracle [1] is to avoid synchronized blocks that guard frequent or long-running operations, replacing them with java.util.concurrent locks such as ReentrantLock.

PS: ready to help with the change

[1] https://spring.io/blog/2022/10/11/embracing-virtual-threads#mitigating-limitations
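For reference, here is a minimal sketch of that mitigation. The `Handler` interface and class names are illustrative stand-ins, not Brave's actual SpanHandler API; the point is only the lock swap.

```java
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical stand-in for the real handler type; names are illustrative only.
interface Handler {
  void end(Object span);
}

class LockBasedFinisher {
  private final ReentrantLock lock = new ReentrantLock();
  private final Handler handler;

  LockBasedFinisher(Handler handler) {
    this.handler = handler;
  }

  // On JDK 21, blocking inside a synchronized block pins the virtual thread
  // to its carrier. ReentrantLock lets the virtual thread unmount while it
  // waits, freeing the carrier for other virtual threads.
  void finish(Object span) {
    lock.lock();
    try {
      handler.end(span); // the work formerly done inside synchronized
    } finally {
      lock.unlock();
    }
  }
}
```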
@reta ok, thanks for the details and the offer to help (saves me from asking you ;)). There are a few places here and in zipkin-reporter that use synchronized. If you don't mind, check the benchmarks before and after (where they exist and are important).
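As a starting point, here is a JMH sketch of the kind of before/after comparison meant here. The class and benchmark names are hypothetical; a real measurement would exercise the actual reporter path rather than a bare counter.

```java
import java.util.concurrent.locks.ReentrantLock;

import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.State;

// Hypothetical JMH sketch comparing the two locking strategies under
// contention; run with multiple threads (e.g. -t 8) to see the difference.
@State(Scope.Benchmark)
public class LockBenchmark {
  private final Object monitor = new Object();
  private final ReentrantLock lock = new ReentrantLock();
  private long counter;

  @Benchmark
  public long synchronizedFinish() {
    synchronized (monitor) {
      return ++counter; // stand-in for the work inside handler.end
    }
  }

  @Benchmark
  public long reentrantLockFinish() {
    lock.lock();
    try {
      return ++counter;
    } finally {
      lock.unlock();
    }
  }
}
```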
Sure, will do, thanks @codefromthecrypt
Here's a TL;DR from the research I've done (and p.s. I'm done ;) ). This is a nice overview of the situation: https://todd.ginsberg.com/post/java/virtual-thread-pinning/ and its discussion of when pinning actually matters is especially helpful.
So, yes, there can be pinning. Someone can decide this is critical and choose to write another handler. IMHO "suboptimal" is subjective, and this is likely not the most suboptimal code in a typical Spring Boot stack. Given that context, we should be careful about how we proceed and not introduce a cure worse than the disease. An approach that precisely breaks up the critical section is probably best; one that just tries to get rid of synchronized everywhere is probably not. More notes below.

Now, there is a lot of suboptimal code in a Spring Boot stack, especially before Micrometer 1.3, because when things moved from Sleuth to the new system a couple of things were missed. One is that Brave's samplers (like the HTTP sampler) are no longer used automatically (obviously you don't want to report 100% of zero-value things). This hasn't been fixed, and I don't expect it to be. The other is that under the SpanHandler.finish function, the span is copied into another format (Zipkin core) unnecessarily, which adds overhead inside that synchronized block. This could have been addressed for years, but it wasn't until I did it recently. It should be out in 1.3, which you can try in our example now: https://github.com/openzipkin/brave-example/tree/master/webflux6-micrometer

So, what's left? My suggestion isn't to overreact, but to look at the issue holistically. Folks this narrowly interested in performance may be unable to use most of the Spring Boot/Micrometer setup at all. For example, if you measure the frontend examples against Armeria directly, there are millisecond-grade differences. Yes, this is about concurrency, but whatever is done should be measured, and I bet other problems will rank way ahead of this, such as stack depths hundreds of frames deep due to reactive code.

Instead of overreacting, I suggest a very narrow approach. If the goal is to reduce pressure on the lock, there is an option that actually adds overhead but likely gets rid of this message. I say an "option" because for many this is a cure worse than the disease. Right now there is a lock on finish; let's simplify it to this:

```java
synchronized (mutableSpan) {
  // can take time in AsyncReporter due to sizing to avoid OOM on backlog
  handler.end(mutableSpan);
}
```

So, if we assume this lock being held somewhere in the microsecond range is a problem, the problem is inside `handler.end`, which does some work to prevent OOM (sizing the span against the backlog). If we really have measured this and think it's a problem, one thing not yet done is moving that OOM handling onto another thread: basically, re-use the AsyncReporter thread to execute the sizing work.

This will cause problems, of course. One is that, to do this safely, you must now copy/clone the mutable span. Also, the backpressure/blocking that exists today is what prevents OOM; by moving it to another thread, you can accept more spans into that sizer thread than would fit in memory under your rules. In other words, you make an OOM more likely. To adjust for this, you may need to reduce your maxSize to account for the larger number of in-flight (not yet sized) spans. The size heuristic isn't perfect anyway; just know there is a side effect. Still, for someone really concerned about this log message, or who hits performance problems reduced to this critical section, maybe it is a good tradeoff.

Regardless, zipkin-reporter has not had much love, and as with this topic, few people actually working on tracing contribute. Its async design is very old and could be refactored in light of this issue and the advances of the last 5+ years. All that said, some configuration that reduces the lock in zipkin-reporter, and only when the correct code is in use, seems like a reasonable, narrow fix.

I'll leave the final decision to @reta, and someone else should review his change, because I don't work on this project, this programming language, or even tracing. I hope the pointers help, though!
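For concreteness, here is a rough sketch of that hand-off idea: the caller enqueues a copy of the span without blocking, and the single reporter thread does the sizing that today happens inside the synchronized block. All names are hypothetical illustrations, not zipkin-reporter's actual API.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Illustrative sketch only: hand the (copied, encoded) span to the reporter
// thread so the caller never blocks inside a lock.
class HandoffReporter {
  // Bounded queue: without the old blocking/sizing step, unsized spans can
  // pile up here, so this bound (and maxBytes) may need to shrink to keep
  // the same OOM safety margin described above.
  private final BlockingQueue<byte[]> pending = new ArrayBlockingQueue<>(10_000);

  void report(byte[] encodedSpanCopy) {
    // offer() never blocks: dropping under pressure replaces the old
    // synchronized backpressure, trading span loss for no pinning.
    if (!pending.offer(encodedSpanCopy)) {
      // dropped: a real implementation would record a metric here
    }
  }

  // Runs on the single reporter thread, which now also does the sizing
  // work that used to happen inside the caller's critical section.
  void flushLoop() throws InterruptedException {
    while (true) {
      byte[] next = pending.take();
      // size accounting, batching, and sending would happen here
    }
  }
}
```

The bounded `offer()` makes the tradeoff explicit: under pressure this version drops rather than blocks, which is exactly the side effect described above.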
Heading out on holiday; here's a suggestion to move past this that will be easy to configure: openzipkin/zipkin-reporter-java#259
Detailed rationale in RATIONALE.md
Closes #204
Closes openzipkin/brave#1426
Signed-off-by: Adrian Cole <adrian@tetrate.io>
https://github.com/openzipkin/zipkin-reporter-java/releases/tag/3.4.0 removes the blocking during report
Describe the Bug
I am using Spring Boot 3 with tracing through Brave & Zipkin. All was fine until I tried out virtual threads with JDK 21.
Steps to Reproduce
Execute a Spring Boot 3 app with Brave/Zipkin tracing configured and set the JVM flag
-Djdk.tracePinnedThreads=full
to detect problematic pinning. After a while you get the following stack trace:
The finish() method of RealSpan is marked as the culprit.
I am no expert in multithreading and locking, but it seems the synchronized keyword is the issue here; it should be replaced with a virtual-thread-friendlier way of locking, such as ReentrantLock.
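For what it's worth, a minimal standalone program (no Brave involved) reproduces the same diagnostic on JDK 21, which helps confirm the mechanism:

```java
// Minimal repro of the pinning diagnostic (assumes JDK 21).
// Run with: java -Djdk.tracePinnedThreads=full Pinning.java
public class Pinning {
  private static final Object LOCK = new Object();

  public static void main(String[] args) throws InterruptedException {
    Thread vt = Thread.ofVirtual().start(() -> {
      synchronized (LOCK) {
        try {
          Thread.sleep(100); // blocking inside synchronized pins the carrier
        } catch (InterruptedException e) {
          Thread.currentThread().interrupt();
        }
      }
    });
    vt.join(); // with the flag set, the JVM prints the pinned stack trace
  }
}
```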
Expected Behaviour
No thread pinning should occur.