make python GC always run on thread 1 #520

Closed
wants to merge 14 commits

Conversation

@ericphanson

closes #201

Here we make the GC queue thread-safe (by using a Channel), so we can safely enqueue objects from any thread. We then process the queue from a dedicated task that runs on thread 1, the thread on which we have initialized the correct state in the Python runtime (https://docs.python.org/3/c-api/init.html#non-python-created-threads). We continue to respect GC.enable() and GC.disable(), but they should be much less useful now, since disabling GC in threaded regions is no longer required. (However, users must still only interact with the Python runtime from thread 1, unless they set up the right state themselves.)
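
For readers unfamiliar with the pattern, here is a minimal sketch of the design described above, with illustrative names (PYGC_QUEUE, decref) rather than PythonCall's actual internals:

    # thread-safe queue of Python object pointers awaiting a refcount decrement
    const PYGC_QUEUE = Channel{Ptr{Cvoid}}(Inf)

    # consumer loop: must run on thread 1, where the Python runtime state lives
    function gc_loop()
        for ptr in PYGC_QUEUE
            decref(ptr)  # hypothetical stand-in for the Py_DecRef call
        end
    end

    task = Task(gc_loop)
    task.sticky = true  # forbid migration to other threads
    ccall(:jl_set_task_tid, Cvoid, (Any, Cint), task, 0)  # 0-based tid, i.e. thread 1
    schedule(task)

Finalizers running on any thread can then safely put!(PYGC_QUEUE, ptr).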

This was unfortunately a bit complicated to test, but if I have reasoned correctly (if!), there should be no races in the tests.

Notes:

  • I'm not sure whether we should skip the task creation in __init__ during precompilation (will it hang precompilation? is it fine not to GC then?)
  • I'm not sure whether this will be less performant in the single-threaded case, since we are no longer eagerly freeing but instead deferring to a separate task
  • I haven't used Threads.Condition or Threads.Event before, hopefully I got it right

"""
function enable()
ENABLED[] = true
notify(ENABLED_EVENT)

@ericphanson

We could wait for the GC_finished event here, to preserve the semantics that this function call only returns after all GC events are processed.
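
A minimal sketch of that suggestion, assuming a hypothetical GC_FINISHED_EVENT that the GC loop notifies whenever its queue drains:

    function enable()
        ENABLED[] = true
        notify(ENABLED_EVENT)     # wake the GC loop
        wait(GC_FINISHED_EVENT)   # hypothetical: return only once queued objects are processed
        nothing
    end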

@cjdoris commented Jul 11, 2024

Very cool, thanks. Will take me a while to absorb - I've not used Julia's multithreading much.

I'd be very interested to see some benchmarking to see how this affects performance in the single threaded case.

Another worry is what happens when we're running from JuliaCall. An issue with JuliaCall is that Julia's event loop doesn't run while in Python-land (unless you call yield() explicitly from Python), so if you do a bunch of stuff at the Python REPL, the PythonCall GC task might not run. I'd like to check.

@ericphanson

Do you have any good examples for a single-threaded python-object-heavy workload to benchmark? I think it's something that would benefit from a slightly realistic workload over just calling py_finalizer a bunch myself.

I also haven't used juliacall before, so I'll have to figure out how to get that set up.

Comment on lines +106 to +110
    task = Task(gc_loop)
    task.sticky = VERSION >= v"1.7" # disallow task migration (task migration was introduced in 1.7)
    # ensure the task runs on thread 1 (jl_set_task_tid takes a 0-based thread id)
    ccall(:jl_set_task_tid, Cvoid, (Any, Cint), task, 0)
    schedule(task)

@ericphanson

With the gcbench.jl script committed in this PR, I get:

    Runtime: 15.873462 seconds (164.15 M allocations: 3.690 GiB, 24.07% gc time)
    Python GC time: 2.328511 seconds
    Python GC time (waiting): 0.000013 seconds
    Next full GC: 2.217151 seconds (9.03% gc time)
    Total: 18.104279 seconds (164.15 M allocations: 3.690 GiB, 22.21% gc time)

with Julia 1.10, with -t1 to force single-threaded. I had to special-case the situation where we are already on thread 1, eagerly finalizing the pointer there (as the existing implementation does) instead of deferring to the queue. This uses some distasteful manipulation of the current task's sticky bit to ensure it does not migrate while we are busy finalizing.
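
For concreteness, here is a sketch of that fast path (illustrative names, not the PR's exact code); the sticky bit is set before the thread check so the task cannot migrate between the check and the decref:

    function py_finalizer(ptr::Ptr{Cvoid})
        task = current_task()
        old_sticky = task.sticky
        task.sticky = true  # pin the task so the threadid check stays valid
        try
            if Threads.threadid() == 1
                decref(ptr)            # hypothetical eager decref on thread 1
            else
                put!(PYGC_QUEUE, ptr)  # defer to the dedicated thread-1 task
            end
        finally
            task.sticky = old_sticky
        end
        nothing
    end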

This matches the performance I get on master:

    Runtime: 16.901158 seconds (164.15 M allocations: 3.690 GiB, 24.29% gc time)
    Python GC time: 2.633090 seconds
    Next full GC: 2.115611 seconds (9.28% gc time)
    Total: 19.023249 seconds (164.15 M allocations: 3.690 GiB, 22.62% gc time)

(obtained with these changes + script).

The segfault in CI is concerning; it is on the pyjulia side, which I have not looked into yet.

@ericphanson commented Jul 13, 2024

I think the segfault was just because I forgot to acquire the GIL after refactoring to add the fast path; it should be fixed now.

> An issue with JuliaCall is that Julia's event loop doesn't run while in Python-land (unless you call yield() explicitly from Python), so if you do a bunch of stuff at the Python REPL, the PythonCall GC task might not run. I'd like to check.

I’ve added some off-by-default logging so we can see when this background task kicks in. I’ll try running from pyjulia and see if it seems to be behaving as expected.

Btw do you mean the libuv event loop? I think the task runtime is separate from that but also I’m not really sure how it all works under the hood.

> Will take me a while to absorb - I've not used Julia's multithreading much.

I’m happy to hop on a call if that would be helpful btw.

@ericphanson commented Jul 14, 2024

I added a test for pyjulia, and addressed the first point of #219 (comment) by turning on signal handling when Julia is multithreaded, since running threads without signal handling is unsupported in Julia and leads to crashes. I also added threads to the CI matrix to prevent regressions.

@cjdoris commented Jul 16, 2024

Thanks for your work so far.

Would be cool to implement and test the method from JuliaPy/PyCall.jl#883 too - the simplicity (and avoidance of anything async) is appealing.

> Btw do you mean the libuv event loop?

No, I meant the task scheduler, or whatever Julia calls the mechanism to run tasks asynchronously. My point was that in JuliaCall you might return to Python before the PythonCall GC task can do an iteration, and depending on what you're calling from Python it might never iterate.

@MilesCranmer commented Jul 21, 2024

Looks awesome!

Regarding testing, it could be nice to try something like rr's chaos mode to assault the garbage collector and reveal any potential thread-safety issues. (Edit: but this would take a while, whereas this PR already fixes some issues, so I don't think such testing is essential for the moment.)

However, it is tricky to start Julia with rr from within Python – see JuliaLang/julia#52689. So the test would have to be run from the Julia side of PythonCall.jl, opening Python from there, rather than the other direction. But maybe that would be good enough for testing this.

I believe it can be done with

    julia --bug-report=rr-local,chaos

@MilesCranmer

Btw, @lassepe reports that this PR fixes the issues we ran into here: avik-pal/Wandb.jl#27

@ericphanson commented Jul 22, 2024

> No, I meant the task scheduler, or whatever Julia calls the mechanism to run tasks asynchronously. My point was that in JuliaCall you might return to Python before the PythonCall GC task can do an iteration, and depending on what you're calling from Python it might never iterate.

I see what you mean: if I stick

    t = Threads.@spawn begin
        while true
            println("Background task running $(time())")
            sleep(1)
        end
    end
    Base.errormonitor(t)

into the __init__ method, then open Python and import juliacall, I see the printout while juliacall is loading, but nothing when just sitting at the Python REPL. Calling methods like m.rand(100) etc. via Julia does not produce the printing, but if I call m.sleep(10) then I see it (for a Julia module m). So I think you're right that the queue can go unprocessed when we aren't calling into Julia code.

@ericphanson

BTW I wonder if this is an issue with the status quo solution as well? If a finalizer doesn't run (e.g. because GC hasn't run yet), the refcount won't be decremented. Do GC/finalizers run "in the background" properly in a way the tasks don't?

@ericphanson

I think another solution here would be to acquire the GIL from non-thread-1 threads (with PyGILState_Ensure() and a matching release) and do the decrement eagerly within the finalizer. I'm not sure how performant that would be, but it might be worth a try. The other thing is we want to make sure we can't deadlock somehow. Do we currently hold the GIL the whole time from thread 1?
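
A sketch of that alternative, calling the CPython C API directly (assuming libpython is a constant holding the Python library; this is not PythonCall's actual code):

    function eager_decref(ptr::Ptr{Cvoid})
        # PyGILState_Ensure works from threads the Python runtime has not seen
        # before; it returns a token that must be passed back on release.
        gstate = ccall((:PyGILState_Ensure, libpython), Cint, ())
        try
            ccall((:Py_DecRef, libpython), Cvoid, (Ptr{Cvoid},), ptr)
        finally
            ccall((:PyGILState_Release, libpython), Cvoid, (Cint,), gstate)
        end
    end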

@cjdoris commented Jul 24, 2024

> I think another solution here would be to acquire the GIL from non-thread-1 threads (with PyGILState_Ensure() and a matching release) and do the decrement eagerly within the finalizer. I'm not sure how performant that would be, but it might be worth a try. The other thing is we want to make sure we can't deadlock somehow. Do we currently hold the GIL the whole time from thread 1?

Basically yes - thread 1 holds the GIL all the time.

Using PythonCall from Julia, the GIL is acquired implicitly when the Python runtime is initialised and released when it is exited. It is never released anywhere else by PythonCall.

Using JuliaCall from Python, the GIL is managed by Python as usual. Most of the time it is already acquired by the Python thread calling into PythonCall. Again, PythonCall never explicitly handles the GIL. The only exception is while PythonCall is being loaded, because this happens in a Julia C function called from ctypes, and ctypes releases the GIL, so we need to re-acquire it. Having read the docs just now, I could actually remove all of this and just use PyDLL on the Python side to keep the GIL. Edit: #530 removes this exception.

(Of course, you could call some Python code - such as ctypes or multithreading - which itself releases the GIL. But then it's currently documented that you should only call back into PythonCall from thread 1. Hence the above statement about the GIL always being held on thread 1 is still true while in any PythonCall code. If we're careful about it, we may be able to relax this.)

@cjdoris commented Jul 24, 2024

> BTW I wonder if this is an issue with the status quo solution as well? If a finalizer doesn't run (e.g. because GC hasn't run yet), the refcount won't be decremented. Do GC/finalizers run "in the background" properly in a way the tasks don't?

Yep, that's true. However with the status quo, the Python objects do get freed when Julia's GC is run. With the new solution, they only get freed after GC is run AND the async task you added gets yielded to. GC gets called frequently (it's hard to completely avoid allocating when using PythonCall), but most things you might do in JuliaCall probably don't yield.

@cjdoris commented Jul 24, 2024

> No, I meant the task scheduler, or whatever Julia calls the mechanism to run tasks asynchronously. My point was that in JuliaCall you might return to Python before the PythonCall GC task can do an iteration, and depending on what you're calling from Python it might never iterate.

> I see what you mean: if I stick
>
>     t = Threads.@spawn begin
>         while true
>             println("Background task running $(time())")
>             sleep(1)
>         end
>     end
>     Base.errormonitor(t)
>
> into the __init__ method, then open Python and import juliacall, I see the printout while juliacall is loading, but nothing when just sitting at the Python REPL. Calling methods like m.rand(100) etc. via Julia does not produce the printing, but if I call m.sleep(10) then I see it (for a Julia module m). So I think you're right that the queue can go unprocessed when we aren't calling into Julia code.

If you call m.Base.Libc.systemsleep(10) instead, I assume you do not see the printing either? Just checking my understanding, because this should not yield.

@cjdoris commented Jul 24, 2024

Some alternative approaches:

  • In the finalizer: if on thread 1, decref immediately and also decref anything in the queue; if on another thread, add to the queue. No async tasks here. This frees things quickly so long as Python objects are being GC'd on thread 1 frequently.
  • In the finalizer: if on thread 1, decref immediately; if on another thread, add the finalizer again. This frees things quickly so long as the GC runs on thread 1 at all frequently (less restrictive than the previous option, because the Python object remains un-finalized until GC runs on thread 1). Possibly slower, just because the finalizer might be called multiple times, not sure.
  • This idea is kinda perverse, but: instead of using an async task, we could create a mutable struct with a finalizer which just decrefs anything in the queue if on thread 1, and then adds the finalizer again to itself. This finalizer will be run periodically by the GC. Then Python objects being finalized just add their pointer to the queue (or maybe decref immediately if on thread 1). See the sketch after this list.
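
A sketch of that third idea (illustrative names only, with a trylock so the finalizer cannot deadlock on a lock held by the task it interrupted):

    # a plain vector guarded by a lock, since no task consumes it
    const PYGC_QUEUE = Ptr{Cvoid}[]
    const PYGC_LOCK = ReentrantLock()

    mutable struct GCHook end

    function gchook_finalizer(hook::GCHook)
        if Threads.threadid() == 1 && trylock(PYGC_LOCK)
            try
                while !isempty(PYGC_QUEUE)
                    decref(pop!(PYGC_QUEUE))  # hypothetical decref
                end
            finally
                unlock(PYGC_LOCK)
            end
        end
        finalizer(gchook_finalizer, hook)  # re-arm: run again at the next GC
        nothing
    end

    const GCHOOK = finalizer(gchook_finalizer, GCHook())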

cjdoris mentioned this pull request Jul 24, 2024
@cjdoris commented Jul 24, 2024

See #529 for a draft of an alternative approach without tasks.

@cjdoris commented Jul 29, 2024

See #534 for a third approach to this. I think it's my favourite approach.

@cjdoris commented Aug 2, 2024

Closing in favour of #529

cjdoris closed this Aug 2, 2024
Successfully merging this pull request may close these issues:

Segfaults with frequent python calls from thread 1 on multithreaded Julia