make python GC always run on thread 1 #520

Closed
wants to merge 14 commits

Conversation

@ericphanson

closes #201

Here we make the GC queue thread-safe (by using a Channel), so we can safely enqueue objects from any thread. We then process the queue from a dedicated task that runs on thread 1, the thread on which we have initialized the correct state in the Python runtime (https://docs.python.org/3/c-api/init.html#non-python-created-threads). We continue to respect GC.enable() and GC.disable(), but they should be much less useful now, since disabling GC in threaded regions is no longer required. (However, users must still only interact with the Python runtime from thread 1, unless they set up the right state themselves.)
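
For readers unfamiliar with the pattern, here is a minimal sketch of the design described above, with illustrative names (PYGC_QUEUE, decref) rather than PythonCall's actual internals:

    # thread-safe queue of Python object pointers awaiting a refcount decrement
    const PYGC_QUEUE = Channel{Ptr{Cvoid}}(Inf)

    # consumer loop: must run on thread 1, where the Python runtime state lives
    function gc_loop()
        for ptr in PYGC_QUEUE
            decref(ptr)  # hypothetical stand-in for the Py_DecRef call
        end
    end

    task = Task(gc_loop)
    task.sticky = true  # forbid migration to other threads
    ccall(:jl_set_task_tid, Cvoid, (Any, Cint), task, 0)  # 0-based tid, i.e. thread 1
    schedule(task)

Finalizers running on any thread can then safely put!(PYGC_QUEUE, ptr).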

This was unfortunately a bit complicated to test, but if I have reasoned correctly (if!), there should be no races in the tests.

Notes:

  • I'm not sure whether we should skip the task creation in __init__ during precompilation (will it hang precompilation? is it fine not to GC then?)
  • I'm not sure whether this will be less performant in the single-threaded case, since we are no longer eagerly freeing but instead deferring to a separate task
  • I haven't used Threads.Condition or Threads.Event before, hopefully I got it right

"""
function enable()
ENABLED[] = true
notify(ENABLED_EVENT)

@ericphanson

We could wait for the GC_finished event here, to preserve the semantics that this function call only returns after all GC events are processed.
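
A minimal sketch of that suggestion, assuming a hypothetical GC_FINISHED_EVENT that the GC loop notifies whenever its queue drains:

    function enable()
        ENABLED[] = true
        notify(ENABLED_EVENT)     # wake the GC loop
        wait(GC_FINISHED_EVENT)   # hypothetical: return only once queued objects are processed
        nothing
    end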

@cjdoris commented Jul 11, 2024

Very cool, thanks. Will take me a while to absorb - I've not used Julia's multithreading much.

I'd be very interested to see some benchmarking to see how this affects performance in the single threaded case.

Another worry is what happens when we're running from JuliaCall. An issue with JuliaCall is that Julia's event loop doesn't run while in Python-land (unless you call yield() explicitly from Python), so if you do a bunch of stuff at the Python REPL, the PythonCall GC task might not run. I'd like to check.

@ericphanson

Do you have any good examples for a single-threaded python-object-heavy workload to benchmark? I think it's something that would benefit from a slightly realistic workload over just calling py_finalizer a bunch myself.

I also haven't used juliacall before, so I'll have to figure out how to get that set up.

Comment on lines +106 to +110
    task = Task(gc_loop)
    task.sticky = VERSION >= v"1.7" # disallow task migration (task migration was introduced in 1.7)
    # ensure the task runs on thread 1 (jl_set_task_tid takes a 0-based thread id)
    ccall(:jl_set_task_tid, Cvoid, (Any, Cint), task, 0)
    schedule(task)

@ericphanson

With the gcbench.jl script committed in this PR, I get:

    Runtime: 15.873462 seconds (164.15 M allocations: 3.690 GiB, 24.07% gc time)
    Python GC time: 2.328511 seconds
    Python GC time (waiting): 0.000013 seconds
    Next full GC: 2.217151 seconds (9.03% gc time)
    Total: 18.104279 seconds (164.15 M allocations: 3.690 GiB, 22.21% gc time)

with Julia 1.10, with -t1 to force single-threaded. I had to special-case the situation where we are already on thread 1, eagerly finalizing the pointer there (as the existing implementation does) instead of deferring to the queue. This uses some distasteful manipulation of the current task's sticky bit to ensure it does not migrate while we are busy finalizing.
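
For concreteness, here is a sketch of that fast path (illustrative names, not the PR's exact code); the sticky bit is set before the thread check so the task cannot migrate between the check and the decref:

    function py_finalizer(ptr::Ptr{Cvoid})
        task = current_task()
        old_sticky = task.sticky
        task.sticky = true  # pin the task so the threadid check stays valid
        try
            if Threads.threadid() == 1
                decref(ptr)            # hypothetical eager decref on thread 1
            else
                put!(PYGC_QUEUE, ptr)  # defer to the dedicated thread-1 task
            end
        finally
            task.sticky = old_sticky
        end
        nothing
    end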

This matches the performance I get on master:

    Runtime: 16.901158 seconds (164.15 M allocations: 3.690 GiB, 24.29% gc time)
    Python GC time: 2.633090 seconds
    Next full GC: 2.115611 seconds (9.28% gc time)
    Total: 19.023249 seconds (164.15 M allocations: 3.690 GiB, 22.62% gc time)

(obtained with these changes + script).

The segfault in CI is concerning; it is on the pyjulia side, which I have not looked into yet.

@ericphanson commented Jul 13, 2024

I think the segfault was just because I forgot to acquire the GIL after refactoring to add the fast path; it should be fixed now.

> An issue with JuliaCall is that Julia's event loop doesn't run while in Python-land (unless you call yield() explicitly from Python), so if you do a bunch of stuff at the Python REPL, the PythonCall GC task might not run. I'd like to check.

I’ve added some off-by-default logging so we can see when this background task kicks in. I’ll try running from pyjulia and see if it seems to be behaving as expected.

Btw do you mean the libuv event loop? I think the task runtime is separate from that but also I’m not really sure how it all works under the hood.

> Will take me a while to absorb - I've not used Julia's multithreading much.

I’m happy to hop on a call if that would be helpful btw.

@ericphanson commented Jul 14, 2024

I added a test for pyjulia, and addressed the first point of #219 (comment) by turning on signal handling when Julia is multithreaded, since running threads without signal handling is unsupported in Julia and leads to crashes. I also added threads to the CI matrix to prevent regressions.

@cjdoris commented Jul 16, 2024

Thanks for your work so far.

Would be cool to implement and test the method from JuliaPy/PyCall.jl#883 too - the simplicity (and avoidance of anything async) is appealing.

> Btw do you mean the libuv event loop?

No, I meant the task scheduler, or whatever Julia calls the mechanism to run tasks asynchronously. My point was that in JuliaCall you might return to Python before the PythonCall GC task can do an iteration, and depending on what you're calling from Python it might never iterate.

@MilesCranmer commented Jul 21, 2024

Looks awesome!

Regarding testing, it could be nice to try something like rr's chaos mode to assault the garbage collector and reveal any potential thread-safety issues. (Edit: but this would take a while, whereas this PR already fixes some issues, so I don't think such testing is essential for the moment.)

However, it is tricky to start Julia with rr from within Python – see JuliaLang/julia#52689. So the test would have to be run from the Julia side of PythonCall.jl, opening Python from there, rather than the other direction. But maybe that would be good enough for testing this.

I believe it can be done with

    julia --bug-report=rr-local,chaos

@MilesCranmer

Btw, @lassepe reports that this PR fixes the issues we ran into here: avik-pal/Wandb.jl#27

@ericphanson commented Jul 22, 2024

> No, I meant the task scheduler, or whatever Julia calls the mechanism to run tasks asynchronously. My point was that in JuliaCall you might return to Python before the PythonCall GC task can do an iteration, and depending on what you're calling from Python it might never iterate.

I see what you mean: if I stick

    t = Threads.@spawn begin
        while true
            println("Background task running $(time())")
            sleep(1)
        end
    end
    Base.errormonitor(t)

into the __init__ method, then open Python and import juliacall, I see the printout while juliacall is loading, but nothing when just sitting at the Python REPL. Calling methods like m.rand(100) etc. via Julia does not produce the printing, but if I call m.sleep(10) then I see it (for a Julia module m). So I think you're right that the queue can go unprocessed when we aren't calling into Julia code.

@ericphanson

BTW I wonder if this is an issue with the status quo solution as well? If a finalizer doesn't run (e.g. because GC hasn't run yet), the refcount won't be decremented. Do GC/finalizers run "in the background" properly in a way the tasks don't?

@ericphanson

I think another solution here would be to acquire the GIL from non-thread-1 threads (with PyGILState_Ensure() and a matching release) and do the decrement eagerly within the finalizer. I'm not sure how performant that would be, but it might be worth a try. The other thing is we want to make sure we can't deadlock somehow. Do we currently hold the GIL the whole time from thread 1?
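
A sketch of that alternative, calling the CPython C API directly (assuming libpython is a constant holding the Python library; this is not PythonCall's actual code):

    function eager_decref(ptr::Ptr{Cvoid})
        # PyGILState_Ensure works from threads the Python runtime has not seen
        # before; it returns a token that must be passed back on release.
        gstate = ccall((:PyGILState_Ensure, libpython), Cint, ())
        try
            ccall((:Py_DecRef, libpython), Cvoid, (Ptr{Cvoid},), ptr)
        finally
            ccall((:PyGILState_Release, libpython), Cvoid, (Cint,), gstate)
        end
    end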

@cjdoris commented Jul 24, 2024

> I think another solution here would be to acquire the GIL from non-thread-1 threads (with PyGILState_Ensure() and a matching release) and do the decrement eagerly within the finalizer. I'm not sure how performant that would be, but it might be worth a try. The other thing is we want to make sure we can't deadlock somehow. Do we currently hold the GIL the whole time from thread 1?

Basically yes - thread 1 holds the GIL all the time.

Using PythonCall from Julia, the GIL is acquired implicitly when the Python runtime is initialised and released when it is exited. It is never released anywhere else by PythonCall.

Using JuliaCall from Python, the GIL is managed by Python as usual. Most of the time it is already acquired by the Python thread calling into PythonCall. Again, PythonCall never explicitly handles the GIL. The only exception is while PythonCall is being loaded, because this happens in a Julia C function called from ctypes, and ctypes releases the GIL, so we need to re-acquire it. Having read the docs just now, I could actually remove all of this and just use PyDLL on the Python side to keep the GIL. Edit: #530 removes this exception.

(Of course, you could call some Python code - such as ctypes or multithreading - which itself releases the GIL. But then it's currently documented that you should only call back into PythonCall from thread 1. Hence the above statement about the GIL always being held on thread 1 is still true while in any PythonCall code. If we're careful about it, we may be able to relax this.)

@cjdoris commented Jul 24, 2024

> BTW I wonder if this is an issue with the status quo solution as well? If a finalizer doesn't run (e.g. because GC hasn't run yet), the refcount won't be decremented. Do GC/finalizers run "in the background" properly in a way the tasks don't?

Yep, that's true. However with the status quo, the Python objects do get freed when Julia's GC is run. With the new solution, they only get freed after GC is run AND the async task you added gets yielded to. GC gets called frequently (it's hard to completely avoid allocating when using PythonCall), but most things you might do in JuliaCall probably don't yield.

@cjdoris commented Jul 24, 2024

> No, I meant the task scheduler, or whatever Julia calls the mechanism to run tasks asynchronously. My point was that in JuliaCall you might return to Python before the PythonCall GC task can do an iteration, and depending on what you're calling from Python it might never iterate.

> I see what you mean: if I stick
>
>     t = Threads.@spawn begin
>         while true
>             println("Background task running $(time())")
>             sleep(1)
>         end
>     end
>     Base.errormonitor(t)
>
> into the __init__ method, then open Python and import juliacall, I see the printout while juliacall is loading, but nothing when just sitting at the Python REPL. Calling methods like m.rand(100) etc. via Julia does not produce the printing, but if I call m.sleep(10) then I see it (for a Julia module m). So I think you're right that the queue can go unprocessed when we aren't calling into Julia code.

If you call m.Base.Libc.systemsleep(10) instead, I assume you do not see the printing either? Just checking my understanding, because this should not yield.

@cjdoris commented Jul 24, 2024

Some alternative approaches:

  • In the finalizer: if on thread 1, decref immediately and also decref anything in the queue; if on another thread, add to the queue. No async tasks here. This frees things quickly so long as Python objects are being GC'd on thread 1 frequently.
  • In the finalizer: if on thread 1, decref immediately; if on another thread, add the finalizer again. This frees things quickly so long as the GC runs on thread 1 at all frequently (less restrictive than the previous option, because the Python object remains un-finalized until GC runs on thread 1). Possibly slower, just because the finalizer might be called multiple times, not sure.
  • This idea is kinda perverse, but: instead of using an async task, we could create a mutable struct with a finalizer which just decrefs anything in the queue if on thread 1, and then adds the finalizer again to itself. This finalizer will be run periodically by the GC. Then Python objects being finalized just add their pointer to the queue (or maybe decref immediately if on thread 1). See the sketch after this list.
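
A sketch of that third idea (illustrative names only, with a trylock so the finalizer cannot deadlock on a lock held by the task it interrupted):

    # a plain vector guarded by a lock, since no task consumes it
    const PYGC_QUEUE = Ptr{Cvoid}[]
    const PYGC_LOCK = ReentrantLock()

    mutable struct GCHook end

    function gchook_finalizer(hook::GCHook)
        if Threads.threadid() == 1 && trylock(PYGC_LOCK)
            try
                while !isempty(PYGC_QUEUE)
                    decref(pop!(PYGC_QUEUE))  # hypothetical decref
                end
            finally
                unlock(PYGC_LOCK)
            end
        end
        finalizer(gchook_finalizer, hook)  # re-arm: run again at the next GC
        nothing
    end

    const GCHOOK = finalizer(gchook_finalizer, GCHook())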

cjdoris mentioned this pull request Jul 24, 2024
@cjdoris commented Jul 24, 2024

See #529 for a draft of an alternative approach without tasks.

@cjdoris commented Jul 29, 2024

See #534 for a third approach to this. I think it's my favourite approach.

@cjdoris commented Aug 2, 2024

Closing in favour of #529

cjdoris closed this Aug 2, 2024
Successfully merging this pull request may close these issues:

Segfaults with frequent python calls from thread 1 on multithreaded Julia