Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

juliacall 0.9.22 can result in hanging precompilation #537

Closed
MilesCranmer opened this issue Aug 12, 2024 · 9 comments · Fixed by MilesCranmer/PySR#703
Closed

juliacall 0.9.22 can result in hanging precompilation #537

MilesCranmer opened this issue Aug 12, 2024 · 9 comments · Fixed by MilesCranmer/PySR#703
Labels
bug Something isn't working

Comments

@MilesCranmer
Copy link
Contributor

MilesCranmer commented Aug 12, 2024

Affects: JuliaCall

Describe the bug

Upgrading to 0.9.22 causes PySR's CI to hang upon precompilation of Zygote.jl on macOS. I haven't been able to reproduce it locally but can confirm that downgrading to 0.9.21 is enough to fix the issue for PySR. So perhaps the new GC is causing some infinite loop somewhere, for some kind of PythonCall object? The error is randomly encountered so it has been hard for me to track down.

Your system
Please provide detailed information about your system:

  • The operating system: macos-latest
  • The version of Julia: 1.10
  • Python 3.12

Here's an example of the test failing, with 0.9.22:

And succeeding, on 0.9.21:

It is random but it has so far this CI seems to always work on 0.9.21 while it fails 75% of the time on 0.9.22. It's also only on macOS for whatever reason.

@MilesCranmer
Copy link
Contributor Author

MilesCranmer commented Aug 12, 2024

Is there a way I can manually trigger the PythonCall GC to run? I wonder if I can debug it (or work around the issue) by manually triggering the GC at different stages?

@cjdoris
Copy link
Collaborator

cjdoris commented Aug 13, 2024

Hmm not sure what's going on here. I wonder if it's due to the presence of the immortal GCHook() object? You could try commenting out this line. If that helps, we could avoid constructing GCHook() while precompiling.

Is there a way I can manually trigger the PythonCall GC to run? I wonder if I can debug it (or work around the issue) by manually triggering the GC at different stages?

Yep you can do PythonCall.GC.gc().

@MilesCranmer
Copy link
Contributor Author

That would mean there would be no garbage collection during precompilation, right? Is that desired?

I wonder if there’s some infinite loop possible somewhere, where Julia GC calls the finalizer on the GC hook, but the finalizer is called immediately, and this process repeats indefinitely.

@MilesCranmer
Copy link
Contributor Author

MilesCranmer commented Aug 14, 2024

Do any other packages use a similar pattern with a self-regenerating self-terminating GC object? If not, are we sure it’s safe?

I’m also curious what happens if one were to call

Base.GC.gc(#=full=#true)

Would this trigger any sort of infinite loop of finalizers?

cc @ericphanson

@MilesCranmer
Copy link
Contributor Author

MilesCranmer commented Aug 14, 2024

One other thing to note is that the infinite precompilation only seems to happen if the program is already running.

  1. Load package using PythonCall.
  2. Do something.
  3. Add another package to the environment.
  4. Re-run precompile.

This is where the infinite precompilation seems to occur (and only on this specific system and environment). Maybe something about running precompilation a second time triggers it?

@cjdoris
Copy link
Collaborator

cjdoris commented Aug 15, 2024

That would mean there would be no garbage collection during precompilation, right? Is that desired?

It was just a suggestion to narrow down what's causing the problem. My suggestion shouldn't prevent GC - the GCHook is just there to free any Python objects that were finalized while the GIL was released (e.g. on a different thread). Anyway, it will all be collected when the precompile process ends, so unless you are allocating a lot during precompilation I wouldn't worry.

Do any other packages use a similar pattern with a self-regenerating self-terminating GC object? If not, are we sure it’s safe?

I don't know and I don't know. I had thought someone commented at some point that this behaviour is documented to work, but I can't find that now so maybe I dreamt it.

I’m also curious what happens if one were to call

Base.GC.gc(#=full=#true)

Would this trigger any sort of infinite loop of finalizers?

I don't think so - I called that function a lot when testing the new GC and didn't observe any hangs.

Maybe something about running precompilation a second time triggers it?

Maybe, I don't know enough about the mechanics of precompilation to say.

@ppalmes
Copy link

ppalmes commented Aug 21, 2024

version 0.9.22 is creating endless loop with linux and macos triggered by:
error in running finalizer: UndefVarError(var=:PYNULL)
it produces 9k lines of errors:

error in running finalizer: UndefVarError(var=:PYNULL)
ijl_undefined_var_error at /cache/build/builder-amdci4-0/julialang/julia-release-1-dot-10/src/rtutils.c:134
ijl_get_binding_or_error at /cache/build/builder-amdci4-0/julialang/julia-release-1-dot-10/src/module.c:471
enqueue_all at /home/runner/.julia/packages/PythonCall/flx5V/src/GC/GC.jl:106
pyobjectarray_finalizer at /home/runner/.julia/packages/PythonCall/flx5V/src/JlWrap/objectarray.jl:30
unknown function (ip: 0x7ffb6a98cd45)
_jl_invoke at /cache/build/builder-amdci4-0/julialang/julia-release-1-dot-10/src/gf.c:2895 [inlined]
ijl_apply_generic at /cache/build/builder-amdci4-0/julialang/julia-release-1-dot-10/src/gf.c:3077
run_finalizer at /cache/build/builder-amdci4-0/julialang/julia-release-1-dot-10/src/gc.c:318
jl_gc_run_finalizers_in_list at /cache/build/builder-amdci4-0/julialang/julia-release-1-dot-10/src/gc.c:408
run_finalizers at /cache/build/builder-amdci4-0/julialang/julia-release-1-dot-10/src/gc.c:454
jl_mutex_unlock at /cache/build/builder-amdci4-0/julialang/julia-release-1-dot-10/src/julia_locks.h:80 [inlined]
jl_generate_fptr_impl at /cache/build/builder-amdci4-0/julialang/julia-release-1-dot-10/src/jitlayers.cpp:529
jl_compile_method_internal at /cache/build/builder-amdci4-0/julialang/julia-release-1-dot-10/src/gf.c:2481 [inlined]
jl_compile_method_internal at /cache/build/builder-amdci4-0/julialang/julia-release-1-dot-10/src/gf.c:2368
_jl_invoke at /cache/build/builder-amdci4-0/julialang/julia-release-1-dot-10/src/gf.c:2887 [inlined]
ijl_apply_generic at /cache/build/builder-amdci4-0/julialang/julia-release-1-dot-10/src/gf.c:3077
jl_apply at /cache/build/builder-amdci4-0/julialang/julia-release-1-dot-10/src/julia.h:1982 [inlined]
jl_f__call_latest at /cache/build/builder-amdci4-0/julialang/julia-release-1-dot-10/src/builtins.c:812
#invokelatest#2 at ./essentials.jl:892 [inlined]

@cjdoris
Copy link
Collaborator

cjdoris commented Aug 22, 2024

@ppalmes that specific issue has just been fixed by #544

@MilesCranmer
Copy link
Contributor Author

MilesCranmer commented Aug 23, 2024

#544 fixes this as well.

I think this is probably a good reason to add some variant of #461 and other integration tests to help prevent more complex issues which are undetected by unit tests alone.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants