Fix pthread_cond_wait race on macOS #82709
The native runtime event implementations for nativeaot and GC use pthread_cond_wait to wait for the event and pthread_cond_broadcast to signal that the event was set. Although this usage of pthread_cond_broadcast conforms to the documentation, glibc before 2.25 had a race in its implementation that could cause a pthread_cond_broadcast to go unnoticed, leaving the waiter blocked forever. It turns out that the macOS implementation has the same issue.

The fix is to call pthread_cond_broadcast while the related mutex is held.

This change fixes the intermittent crossgen2 hangs with the nativeaot build of crossgen2 reported in dotnet#81570. I was able to repro the hang locally within tens of thousands of iterations of running crossgen2 without any arguments (the hang occurs when server GC creates threads). With this fix, it ran without problems over the weekend, passing 5.5 million iterations.
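To illustrate the ordering described above, here is a minimal sketch of a manual-reset event that broadcasts while the mutex is still held. It is not the runtime's actual code: `demo_event`, `demo_waiter`, `demo_run_once` and `DEMO_MAX_WAITERS` are hypothetical names made up for this example, and only the standard pthread calls are real.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical manual-reset event; names are illustrative only. */
typedef struct {
    pthread_mutex_t mutex;
    pthread_cond_t  cond;
    bool            is_set;
} demo_event;

enum { DEMO_MAX_WAITERS = 16 };

static void demo_event_init(demo_event *e) {
    int rc = pthread_mutex_init(&e->mutex, NULL);
    assert(rc == 0);
    rc = pthread_cond_init(&e->cond, NULL);
    assert(rc == 0);
    (void)rc; /* keeps NDEBUG builds free of unused-variable warnings */
    e->is_set = false;
}

static void demo_event_destroy(demo_event *e) {
    pthread_cond_destroy(&e->cond);
    pthread_mutex_destroy(&e->mutex);
}

static void demo_event_wait(demo_event *e) {
    pthread_mutex_lock(&e->mutex);
    /* Loop on the predicate so spurious wakeups are harmless. */
    while (!e->is_set) {
        pthread_cond_wait(&e->cond, &e->mutex);
    }
    pthread_mutex_unlock(&e->mutex);
}

static void demo_event_set(demo_event *e) {
    pthread_mutex_lock(&e->mutex);
    e->is_set = true;
    /* Broadcast while the mutex is still held, as the fix describes. */
    pthread_cond_broadcast(&e->cond);
    pthread_mutex_unlock(&e->mutex);
}

typedef struct {
    demo_event *event;
    int         woke;
} demo_waiter_arg;

static void *demo_waiter(void *p) {
    demo_waiter_arg *arg = p;
    demo_event_wait(arg->event);
    arg->woke = 1;
    return NULL;
}

/* Starts `waiters` threads, sets the event once, and returns how many
   threads observed it. Returns -1 if `waiters` is out of range. */
int demo_run_once(int waiters) {
    if (waiters < 0 || waiters > DEMO_MAX_WAITERS) {
        return -1;
    }
    demo_event event;
    pthread_t threads[DEMO_MAX_WAITERS];
    demo_waiter_arg args[DEMO_MAX_WAITERS];
    int started = 0;

    demo_event_init(&event);
    for (int i = 0; i < waiters; i++) {
        args[i].event = &event;
        args[i].woke = 0;
        if (pthread_create(&threads[i], NULL, demo_waiter, &args[i]) != 0) {
            break;
        }
        started++;
    }

    demo_event_set(&event);

    int woke = 0;
    for (int i = 0; i < started; i++) {
        pthread_join(threads[i], NULL);
        woke += args[i].woke;
    }
    demo_event_destroy(&event);
    return woke;
}
```

Calling `demo_run_once(4)` from a test program should return once all four waiters have been released. POSIX also permits broadcasting after the mutex has been unlocked; the sketch only shows the locked ordering that this PR adopts and does not attempt to reproduce the race itself.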
Tagging subscribers to this area: @agocke, @MichalStrehovsky, @jkotas
cc: @BrennanConroy
There is a guide recommending calling the I've also found a detailed analysis on why the hang happens in older glibc: The
Thank you!
@janvorli is this only going to be available on net8 or is it planned to be backported to net7?
@theolivenbaum it only affects NativeAOT and standalone GC (when you explicitly use the clrgc.dll / libclrgc.so instead of the GC built into the runtime for some reason). I was actually thinking about backporting this to .NET 7 today.
/backport to release/7.0
Started backporting to release/7.0: https://github.com/dotnet/runtime/actions/runs/4314789320
Ok thanks! I asked just because it sounds similar to an issue we hit constantly with our app on macOS, but we're not using AOT.