Freeze in GC for multithreaded code #15620
Comments
the segfault is expected (it's a safepoint trigger that forces that thread into gc). it seems at least one thread may not have reached a call to jl_gc_collect or a safepoint trigger (there aren't enough of them currently)? |
I don't think we ever did the "every backedge gets a safepoint" thing? It's gonna be a pain to keep that from preventing vectorization (and inserting them after opts is unsafe since you might be in the middle of a critical section) |
So I can't look at state in a debugger then? Is there a workaround so I can figure out where the laggard thread is stuck? |
to avoid the segfault you can ask gdb to ignore it and pass it back to our signal handler with something like |
The signal handling is mentioned in the debugging doc. The FIXME should be irrelevant. The freeze can happen if you have an infinite wait loop in C or Julia without any allocation. |
Also, is there any code to reproduce this? |
I haven't been able to isolate it enough to find a code snippet; Celeste is pretty big. If I understand this correctly, every thread must reach a safepoint before GC can run. So if thread 1 is busy in some tight loop, perhaps waiting for thread 2 to do something, but thread 2 is at a safepoint in the GC, the application will freeze like this? |
yep. a workaround would be to insert a call to the runtime from time to time in the tight loop. The proper solution is to have codegen generate safepoints in every loop. |
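The workaround above can be sketched in C. This is only an illustration of cooperative stop-the-world synchronization, assuming a polling flag; the real runtime triggers its safepoints differently (the expected segfault mentioned earlier), and every name below (`safepoint`, `stop_the_world_collect`, `run_safepoint_demo`) is hypothetical rather than part of Julia's API.

```c
#include <assert.h>
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>
#include <stdbool.h>

/* All names here are hypothetical; this is not the Julia runtime's API. */
static atomic_bool gc_requested;   /* set while a collection is pending */
static atomic_int  threads_parked; /* threads currently stopped at a safepoint */
static atomic_bool work_ready;     /* the condition the tight loop waits on */

/* Cooperative safepoint: park here for as long as a collection is pending. */
static void safepoint(void) {
    if (!atomic_load(&gc_requested))
        return;
    atomic_fetch_add(&threads_parked, 1);
    while (atomic_load(&gc_requested))
        sched_yield();
    atomic_fetch_sub(&threads_parked, 1);
}

/* Tight wait loop with no allocation. Without the safepoint() call,
   stop_the_world_collect() below would spin forever, which is the
   freeze described above. */
static void *spin_until_ready(void *arg) {
    (void)arg;
    while (!atomic_load(&work_ready))
        safepoint();
    return NULL;
}

/* Wait until `others` threads have parked, then "collect" and release them. */
static void stop_the_world_collect(int others) {
    atomic_store(&gc_requested, true);
    while (atomic_load(&threads_parked) < others)
        sched_yield();
    /* a real collector would do its work here */
    atomic_store(&gc_requested, false);
}

/* Returns 1 once the collection and the spinning thread have both finished. */
int run_safepoint_demo(void) {
    pthread_t spinner;
    atomic_store(&work_ready, false);
    if (pthread_create(&spinner, NULL, spin_until_ready, NULL) != 0)
        return 0;
    stop_the_world_collect(1);   /* completes only because the spinner polls */
    atomic_store(&work_ready, true);
    pthread_join(spinner, NULL);
    assert(atomic_load(&threads_parked) == 0);
    return 1;
}
```

Calling `run_safepoint_demo()` returns once the collector has seen the spinning thread park. The poll inside the loop plays the role of the "call to the runtime from time to time" suggested above.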
Bigger ones are fine too. Assuming you are allowed to post it of course...
Correct. In order to fix this, we need GC safepoint (and transition) support in codegen. The runtime part of this is almost done (with a missing sync at the beginning of the GC to force a write barrier on other threads). The codegen part is not there. I haven't got a chance to go through the current codegen and figure out where to add the necessary pieces yet. |
See Line 34 in 6b9023b
|
I think I see. So codegen will insert safepoints in generated code? But this won't help if a thread is blocked in a C library or in a system call, right? |
C calls to random libraries and system calls will be safe in that gc can run concurrently with them, but I don't think this has been implemented yet either (safe regions) |
Not with safepoint only but it will with GC transitions. See the system mutex impl for an idea of what the code would look like before optimization when we have GC transition support. See my summary in the original PR for the plan forward. |
Thanks for the explanations guys. Maybe this is a dumb question, but have you considered the opposite approach -- entering and leaving unsafe regions explicitly? Then the default thread state would be safe, and when safe, the thread could be signaled for GC synchronization. This would eliminate the need to wait for threads to reach safe points, but would require waiting for them to leave unsafe regions. Would there be too many unsafe regions? |
That is exactly the plan |
Okay then! I'm trying to isolate this further and will update or close this when I understand the freeze better. |
Back to my desk...
So the plan is to do codegen in exactly this way (GC-safe by default, with critical unsafe regions marked). Since each transition needs a store (unless we have good unwinding, stack maps, etc.), we would like to minimize the transitions we actually emit, in a post-codegen optimization, by running more code in unsafe regions. We just need to make sure that the additional code in an unsafe region doesn't include anything that has to run in a safe region (loops, julia-unaware I'll need to write some notes about the plan in more detail, although currently I don't feel like advertising it too much before I actually sit down and implement it.... |
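A minimal sketch of the "safe by default, mark unsafe regions" idea, assuming a single worker thread and sequentially consistent C11 atomics. The names (`enter_unsafe`, `leave_unsafe`, `collect`, `run_region_demo`) are invented for illustration and do not come from the runtime. Each transition is a single store, and the collector waits only for threads that are inside an unsafe region.

```c
#include <assert.h>
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical names; a sketch of the idea, not Julia's runtime. */
enum { GC_SAFE = 0, GC_UNSAFE = 1 };

static atomic_int  gc_state;    /* state of the single worker; safe by default */
static atomic_bool gc_running;  /* set while the collector owns the heap */
static atomic_bool mutating;    /* worker is touching the "heap" right now */
static atomic_int  violations;  /* times the collector saw a mutation in progress */
static long heap_counter;       /* stand-in for GC-managed data */

static void enter_unsafe(void) {
    for (;;) {
        atomic_store(&gc_state, GC_UNSAFE);  /* the store each transition costs */
        if (!atomic_load(&gc_running))
            return;                          /* no collection in progress */
        atomic_store(&gc_state, GC_SAFE);    /* back off so the collector can proceed */
        while (atomic_load(&gc_running))
            sched_yield();
    }
}

static void leave_unsafe(void) {
    atomic_store(&gc_state, GC_SAFE);
}

static void *worker(void *arg) {
    long n = *(long *)arg;
    for (long i = 0; i < n; i++) {
        enter_unsafe();
        atomic_store(&mutating, true);
        heap_counter++;                      /* only touched inside the unsafe region */
        atomic_store(&mutating, false);
        leave_unsafe();
    }
    return NULL;
}

static void collect(void) {
    atomic_store(&gc_running, true);
    while (atomic_load(&gc_state) == GC_UNSAFE)  /* wait for the worker to leave */
        sched_yield();
    if (atomic_load(&mutating))                  /* must never be observed */
        atomic_fetch_add(&violations, 1);
    atomic_store(&gc_running, false);
}

/* Runs the worker against `collections` concurrent collections and
   returns the final counter value, or -1 if the thread cannot start. */
long run_region_demo(long iterations, int collections) {
    pthread_t t;
    heap_counter = 0;
    atomic_store(&violations, 0);
    if (pthread_create(&t, NULL, worker, &iterations) != 0)
        return -1;
    for (int i = 0; i < collections; i++)
        collect();
    pthread_join(t, NULL);
    assert(atomic_load(&violations) == 0);
    return heap_counter;
}
```

For example, `run_region_demo(1000, 10)` returns 1000. A thread blocked for a long time while in the safe state never delays `collect()`, which is the property the proposal is after.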
Please confirm: if I'm calling out to a C library from Julia, I should insert |
Correct. The codegen support part is basically to insert this automatically. (and merge them).
You can read or write Please also note that the |
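To illustrate what "insert this automatically (and merge them)" could mean, here is a hedged single-threaded sketch that counts transition stores. The helpers `gc_safe_enter` and `gc_safe_leave`, the state values, and `external_call` are stand-ins written for this example, not the runtime's actual functions.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical helpers; they only model the bookkeeping of a transition. */
enum { GC_UNSAFE = 0, GC_SAFE = 1 };

static _Atomic int8_t gc_state = GC_UNSAFE;  /* managed code runs GC-unsafe here */
static int transition_stores;                /* counts stores spent on transitions */

/* Mark the thread GC-safe and return the previous state so it can be restored. */
static int8_t gc_safe_enter(void) {
    int8_t old = atomic_load(&gc_state);
    atomic_store(&gc_state, GC_SAFE);
    transition_stores++;
    return old;
}

/* Restore the state saved by the matching gc_safe_enter(). */
static void gc_safe_leave(int8_t old) {
    assert(atomic_load(&gc_state) == GC_SAFE);
    atomic_store(&gc_state, old);
    transition_stores++;
}

/* Stand-in for a function in an external C library. */
static int external_call(int x) {
    return x * 2;
}

/* Each call wrapped on its own: four transition stores. */
int call_separately(int x) {
    int8_t s = gc_safe_enter();
    int a = external_call(x);
    gc_safe_leave(s);
    s = gc_safe_enter();
    int b = external_call(a);
    gc_safe_leave(s);
    return b;
}

/* Adjacent regions merged into one: two transition stores. */
int call_merged(int x) {
    int8_t s = gc_safe_enter();
    int b = external_call(external_call(x));
    gc_safe_leave(s);
    return b;
}
```

With `transition_stores` reset to zero before each call, `call_separately(3)` and `call_merged(3)` both return 12, using four and two stores respectively. Merging is only valid when nothing between the two calls has to run in the other state.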
@yuyichao: can you expand on this:
We seem to be seeing this freeze in some other code that is calling out to FFTW from multiple threads. FFTW writes into Julia managed memory. So can you explain this write barrier, or point me at any explanation please? |
It's a little hard to say without actually seeing the code. A few comments I can make now,
|
Haven't seen this in Celeste in a long while. Closing. |
This is on Linux using commit d72842a, which is 13 days old.
Here's the backtrace:
I see a FIXME in __pool_alloc(), which is at gc.c:1185 on master; not sure if this is the issue.
Running inside gdb, I consistently get a segfault:
@JeffBezanson, @vtjnash, @yuyichao.