Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Reproducible Memory Corruption in atexit_hooks (on MacOS) leading to GC corruption and/or segfault #49746

Closed
NHDaly opened this issue May 10, 2023 · 5 comments · Fixed by #49774
Labels
bug Indicates an unexpected problem or unintended behavior

Comments

@NHDaly
Copy link
Member

NHDaly commented May 10, 2023

Quite possibly related to #43567.

Here is a small reproducer https://github.com/NHDaly/ThreadingUtils.jl
Uploaded here:
ThreadingUtils.zip

When running the tests for this small repro, on macOS, I can reliably get a crash around 1/20 times. (this repro was distilled down from our real codebase, where the crash happens more like 1/5 times.)

We would love to get an rr trace, but we sadly haven't been able to repro on linux yet 😢

We've seen crashes on both our internal build of 1.8.2, as well as on the latest master, built as of yesterday, e204e20.

To repro:

~/src/julia/julia --project=. -E 'using Pkg; for _ in 1:100; Pkg.test("ThreadingUtils", julia_args=["-t4,1", "--gcthreads=1"]); end'

I eventually see crashes, such as:

13:29:46 | maxrss  0.9% | mem 98.9% | DONE  ( 3/11) test item "spawn_sticky_periodic_task on correct threadpool" 2.1 secs (60.2% compile, 0.3% GC), 2.07 M allocs (149.759 MB)
GC error (probable corruption) :
Allocations: 8942391 (Pool: 8933067; Big: 9324); GC: 13

!!! ERROR in jl_ -- ABORTING !!!

[98691] signal (6): Abort trap: 6
in expression starting at /Users/nathandaly/work/Delve/packages/ThreadingUtils/test/Future_tests.jl:1
Allocations: 8942391 (Pool: 8933067; Big: 9324); GC: 13
ERROR: Package ThreadingUtils errored during testing (received signal: 6)

or

signal (11): Segmentation fault: 11
in expression starting at /Users/nathandaly/work/raicode2/packages/ThreadingUtils/test/runtests.jl:2
gc_mark_loop at /nix/store/w6hgdihnsbgl5sk14rybgm7ffyh0mgq9-julia-1.8.2-patched/lib/julia/libjulia-internal.1.8.dylib (unknown line)
_jl_gc_collect at /nix/store/w6hgdihnsbgl5sk14rybgm7ffyh0mgq9-julia-1.8.2-patched/lib/julia/libjulia-internal.1.8.dylib (unknown line)
ijl_gc_collect at /nix/store/w6hgdihnsbgl5sk14rybgm7ffyh0mgq9-julia-1.8.2-patched/lib/julia/libjulia-internal.1.8.dylib (unknown line)
gc at ./gcutils.jl:93 [inlined]
#runtestitem#75 at /Users/nathandaly/.julia/packages/ReTestItems/Go9Yr/src/ReTestItems.jl:711
runtestitem##kw at /Users/nathandaly/.julia/packages/ReTestItems/Go9Yr/src/ReTestItems.jl:630 [inlined]

or

[61257] signal (11.2): Segmentation fault: 11
in expression starting at /Users/nathandaly/work/Delve/packages/ThreadingUtils/test/runtests.jl:5
gc_mark_outrefs at /Users/nathandaly/src/julia/src/gc.c:2573 [inlined]
gc_mark_loop_serial_ at /Users/nathandaly/src/julia/src/gc.c:2759
gc_mark_loop_serial at /Users/nathandaly/src/julia/src/gc.c:2782
gc_mark_loop at /Users/nathandaly/src/julia/src/gc.c:2908 [inlined]
_jl_gc_collect at /Users/nathandaly/src/julia/src/gc.c:3227
ijl_gc_collect at /Users/nathandaly/src/julia/src/gc.c:3540
gc at ./gcutils.jl:129 [inlined]
#runtestitem#79 at /Users/nathandaly/.julia/packages/ReTestItems/NIjKJ/src/ReTestItems.jl:727

(Sorry, the line numbers will be wrong, because i've been adding printlns for debugging. This is the line on that gc_mark_outrefs):

julia/src/gc.c

Line 2569 in 5521212

size_t dtsz = jl_datatype_size(vt);


We added some println debugging (in #49741), and we got some more information, which is what led us to the fact that the corruption is happening in the atexit_hooks object:

# added this to the top of runtests.jl
@info "Pointer of atexit_hooks: $(Base.pointer_from_objref(Base.atexit_hooks))"

and it matches up here:

[ Info: Pointer of atexit_hooks: Ptr{Nothing} @0x000000011f632d40

...

GC error (probable corruption) while enqueing:
Allocations: 8893120 (Pool: 8883835; Big: 9285); GC: 13
While marking parent 0x11f632d40, of type:
Array{Union{Function, Type}, 1}
invalid object encountered when enqueuing object at 0x1633ac358, of type:
Base.var"#939#940"()
Attempting to dump parent object:
Array{Union{Function, Type}, (4,)}[
...

Note: 0x000000011f632d40 and parent 0x11f632d40.

So, somehow, one of the functions in the atexit_hooks vector is getting trashed. Here are some of the times it's managed to dump the vector via jl_:

GC error (probable corruption) while enqueing:
Allocations: 9075187 (Pool: 9065758; Big: 9429); GC: 16
While marking parent 0x11eae2d40, of type:
Array{Union{Function, Type}, 1}
invalid object encountered when enqueuing object at 0x10e4eafd8, of type:
Union{}
Attempting to dump parent object:
Array{Union{Function, Type}, (6,)}[
  ##spawn named periodic task#229.var"#2#4"(),
  ##periodic task with exceptions#228.var"#2#4"(),
  ##spawn_sticky_periodic_task on correct threadpool#227.var"#3#10"{WeakRef}(#189#wkr=WeakRef(value=nothing)),
  ##spawn_sticky_periodic_task on correct threadpool#227.var"#3#10"{WeakRef}(#189#wkr=WeakRef(value=nothing)),
  ##spawn_sticky_periodic_task on correct threadpool#227.var"#3#10"{WeakRef}(#189#wkr=WeakRef(value=nothing)),

!!! ERROR in jl_ -- ABORTING !!!

[93789] signal (6): Abort trap: 6

or also:

Attempting to dump parent object:
Array{Union{Function, Type}, (3,)}[
  ##spawn_sticky_periodic_task on correct threadpool#227.var"#3#10"{WeakRef}(#166#wkr=WeakRef(value=ThreadingUtils.PeriodicTask([...]))  # elided
  ##spawn_sticky_periodic_task on correct threadpool#227.var"#3#10"{WeakRef}(#166#wkr=WeakRef(value=ThreadingUtils.PeriodicTask([...]))  # elided
  <?#0x281ef42b8::
!!! ERROR in jl_ -- ABORTING !!!

In the original code, the atexit() do hooks were using a WeakRef to ensure that the hook itself isn't unnecessarily keeping the object alive. We've removed that from the version of the reproducer provided above, since it reproduces without the WeakRefs, so the WeakRefs themselves are probably a red herring.

That said, it seemed to reproduce slightly more frequently when WeakRefs were involved as well. The diff to put them back is here:

diff --git a/src/ThreadingUtils.jl b/src/ThreadingUtils.jl
index 26be26d..4a3aa66 100644
--- a/src/ThreadingUtils.jl
+++ b/src/ThreadingUtils.jl
@@ -54,8 +54,11 @@ macro spawn_interactive_periodic_task(name, period, expr)
             end
         end
         _pt = PeriodicTask(n, p, timer, should_terminate, task)
+        wkr = WeakRef(_pt)
         atexit() do
-            stop_periodic_task!(_pt)
+            if wkr.value !== nothing
+                stop_periodic_task!(wkr.value)
+            end
         end
         _pt
     end

Phew! I'm not sure how actionable this is, but it's a pretty surprisingly small reproducer and it can surface a crash, so this should be looked into! I'm not sure who would be best positioned to investigate further. I'm happy to help if anyone wants to pair.

Thanks!

@NHDaly NHDaly added the bug Indicates an unexpected problem or unintended behavior label May 10, 2023
@kpamnany
Copy link
Contributor

kpamnany commented May 11, 2023

Not sure it's helpful or even related, but I've also seen the following on rare occasions, at termination:

master:

atexit hook threw an error: UndefRefError()
getindex at ./essentials.jl:13 [inlined]
popfirst! at ./array.jl:1520 [inlined]
_atexit at ./initdefs.jl:381
jfptr__atexit_55921 at /home/kpamnany/julia.master/usr/lib/julia/sys.so (unknown line)
jl_apply at /home/kpamnany/julia.master/src/julia.h:1961 [inlined]
ijl_atexit_hook at /home/kpamnany/julia.master/src/init.c:280
jl_repl_entrypoint at /home/kpamnany/julia.master/src/jlapi.c:735
main at /home/kpamnany/julia.master/cli/loader_exe.c:58
unknown function (ip: 0x7f0affe26d8f)
__libc_start_main at /lib/x86_64-linux-gnu/libc.so.6 (unknown line)
_start at /home/kpamnany/julia.master/julia (unknown line)

1.8.2:

atexit hook threw an error: UndefRefError()
getindex at ./array.jl:924 [inlined]
popfirst! at ./array.jl:1431 [inlined]
_atexit at ./initdefs.jl:370
jfptr__atexit_26934.clone_1 at /nix/store/b6jlsvbdkbfsl9d11pnvrgvvhnfvdxx0-julia-1.8.2-patched/lib/julia/sys.so (unknown line)
jl_apply at /build/source/src/julia.h:1842 [inlined]
ijl_atexit_hook at /build/source/src/init.c:219
jl_repl_entrypoint at /build/source/src/jlapi.c:720
main at julia (unknown line)
__libc_start_main at /nix/store/0xxjx37fcy2nl3yz6igmv4mag2a7giq6-glibc-2.33-123/lib/libc.so.6 (unknown line)
_start at julia (unknown line)

@NHDaly
Copy link
Member Author

NHDaly commented May 11, 2023

Okay here's something else weird... and it matches up with what you saw, @kpamnany!:

I ran the tests for like 100 iterations, with GC disabled, and then printed the atexit_hooks:

julia> Base.atexit_hooks
374-element Vector{Union{Function, Type}}:
    #3 (generic function with 1 method)
    #3 (generic function with 1 method)
    #3 (generic function with 1 method)
    #3 (generic function with 1 method)
    #3 (generic function with 1 method)
    #3 (generic function with 1 method)
   
 #undef
 #undef
 #undef
 #undef
 #undef

Somehow the last 5 elements are #undef!?

julia> Base.atexit_hooks[end-6:end]
7-element Vector{Union{Function, Type}}:
    #3 (generic function with 1 method)
    #3 (generic function with 1 method)
 #undef
 #undef
 #undef
 #undef
 #undef

Which shouldn't be possible because the only code in the whole julia codebase that modifies atexit_hooks is this line:

atexit(f::Function) = (pushfirst!(atexit_hooks, f); nothing)

So it should only be calling pushfirst!() over and over..... so there shouldn't be any way to accumulate #undefs at the end.......

And also, somehow now when I run GC, it is no longer marking the original #939 hook (which i was watching with a watchpoint in LLDB), because it's no longer referenced by the vector.. So it's been somehow lost!


And indeed, when I finally exited the process, I see the same thing you saw:

atexit hook threw an error: UndefRefError()
getindex at ./essentials.jl:13 [inlined]
popfirst! at ./array.jl:1520 [inlined]
_atexit at ./initdefs.jl:381
jfptr__atexit_55947 at /Users/nathandaly/src/julia/usr/lib/julia/sys.dylib (unknown line)
_jl_invoke at /Users/nathandaly/src/julia/src/gf.c:0 [inlined]
ijl_apply_generic at /Users/nathandaly/src/julia/src/gf.c:3069
jl_apply at /Users/nathandaly/src/julia/src/./julia.h:1958 [inlined]
ijl_atexit_hook at /Users/nathandaly/src/julia/src/init.c:280
jl_repl_entrypoint at /Users/nathandaly/src/julia/src/jlapi.c:735

@NHDaly
Copy link
Member Author

NHDaly commented May 11, 2023

OH DUHHHHHHHH!!!!!!!!!!!!!!!!!!!

It's not threadsafe, my friends!!! We're calling atexit from multiple threads, which is pushing into the vector without a lock 💡 duh!

Phew. I can't believe how long that took us to figure out. PR incoming

@DilumAluthge
Copy link
Member

Should we have a global lock around pushing to the vector?

@NHDaly
Copy link
Member Author

NHDaly commented May 11, 2023

Yeah, exactly Dilum. 👍 That's what I did in #49774. 👍

NHDaly added a commit that referenced this issue May 18, 2023
- atexit(f) mutates global shared state.
- atexit(f) can be called anytime by any thread.
- Accesses & mutations to global shared state must be locked if they can be accessed from multiple threads.

Fixes #49746
NHDaly added a commit to RelationalAI/julia that referenced this issue May 19, 2023
- atexit(f) mutates global shared state.
- atexit(f) can be called anytime by any thread.
- Accesses & mutations to global shared state must be locked if they can be accessed from multiple threads.

Fixes JuliaLang#49746
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Indicates an unexpected problem or unintended behavior
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants