Lock finalizers' lists at exit #49931

Merged
merged 1 commit into from
May 23, 2023


kpamnany
Contributor

We have occasionally seen memory corruption errors at the end of runs, such as:

Stack trace (from 2 threads, mixed together).
double free or corruption (!prev)

signal (6): Aborted
in expression starting at none:0
error in running finalizer: MethodError(f=Base.AsyncCondition(handle=0x00000000008feb10, cond=Base.GenericCondition{Base.Threads.SpinLock}(waitq=Base.IntrusiveLinkedList{Task}(head=Task(next=nothing, queue=<circular reference @-2>, storage=nothing, donenotify=Base.GenericCondition{Base.Threads.SpinLock}(waitq=Base.IntrusiveLinkedList{Task}(head=Task(next=nothing, queue=<circular reference @-2>, storage=nothing, donenotify=Base.GenericCondition{Base.Threads.SpinLock}(waitq=Base.IntrusiveLinkedList{Task}(head=nothing, tail=nothing), lock=Base.Threads.SpinLock(owned=0)), result=nothing, logstate=nothing, code=Base.var"#611", rngState0=0xdf5f0d8bd23d416b, rngState1=0x0a255ccbafb1a1fa, rngState2=0x3fd62529a488eea1, rngState3=0x05c091e535deb2c4, _state=0x00, sticky=false, _isexception=false, priority=0x0000), tail=Task(next=nothing, queue=<circular reference @-2>, storage=nothing, donenotify=Base.GenericCondition{Base.Threads.SpinLock}(waitq=Base.IntrusiveLinkedList{Task}(head=nothing, tail=nothing), lock=Base.Threads.SpinLock(owned=0)), result=nothing, logstate=nothing, code=Base.var"#611", rngState0=0xdf5f0d8bd23d416b, rngState1=0x0a255ccbafb1a1fa, rngState2=0x3fd62529a488eea1, rngState3=0x05c091e535deb2c4, _state=0x00, sticky=false, _isexception=false, priority=0x0000)), lock=Base.Threads.SpinLock(owned=0)), result=nothing, logstate=nothing, code=Profile.var"#3", rngState0=0xd837ea2798675862, rngState1=0x6bd6b44513577585, rngState2=0xc8b0102cce0c51ce, rngState3=0x99df76eb80250e05, _state=0x00, sticky=false, _isexception=false, priority=0x0000), tail=Task(next=nothing, queue=<circular reference @-2>, storage=nothing, donenotify=Base.GenericCondition{Base.Threads.SpinLock}(waitq=Base.IntrusiveLinkedList{Task}(head=Task(next=nothing, queue=<circular reference @-2>, storage=nothing, donenotify=Base.GenericCondition{Base.Threads.SpinLock}(waitq=Base.IntrusiveLinkedList{Task}(head=nothing, tail=nothing), lock=Base.Threads.SpinLock(owned=0)), result=nothing, 
logstate=nothing, code=Base.var"#611", rngState0=0xdf5f0d8bd23d416b, rngState1=0x0a255ccbafb1a1fa, rngState2=0x3fd62529a488eea1, rngState3=0x05c091e535deb2c4, _state=0x00, sticky=false, _isexception=false, priority=0x0000), tail=Task(next=nothing, queue=<circular reference @-2>, storage=nothing, donenotify=Base.GenericCondition{Base.Threads.SpinLock}(waitq=Base.IntrusiveLinkedList{Task}(head=nothing, tail=nothing), lock=Base.Threads.SpinLock(owned=0)), result=nothing, logstate=nothing, code=Base.var"#611", rngState0=0xdf5f0d8bd23d416b, rngState1=0x0a255ccbafb1a1fa, rngState2=0x3fd62529a488eea1, rngState3=0x05c091e535deb2c4, _state=0x00, sticky=false, _isexception=false, priority=0x0000)), lock=Base.Threads.SpinLock(owned=0)), result=nothing, logstate=nothing, code=Profile.var"#3", rngState0=0xd837ea2798675862, rngState1=0x6bd6b44513577585, rngState2=0xc8b0102cce0c51ce, rngState3=0x99df76eb80250e05, _state=0x00, sticky=false, _isexception=false, priority=0x0000)), lock=Base.Threads.SpinLock(owned=0)), isopen=true, set=false), args=(Base.uvfinalize,), world=0x0000000000007dfc)
gsignal at /nix/store/0xxjx37fcy2nl3yz6igmv4mag2a7giq6-glibc-2.33-123/lib/libc.so.6 (unknown line)
abort at /nix/store/0xxjx37fcy2nl3yz6igmv4mag2a7giq6-glibc-2.33-123/lib/libc.so.6 (unknown line)
error in running finalizer: MethodError(f=Base.Timer(handle=0x0000000000000000, cond=Base.GenericCondition{Base.Threads.SpinLock}(waitq=Base.IntrusiveLinkedList{Task}(head=nothing, tail=nothing), lock=Base.Threads.SpinLock(owned=0)), isopen=false, set=false), args=(Base.uvfinalize,), world=0x0000000000007dfc)
jl_method_error_bare at /build/source/src/gf.c:1879
jl_method_error at /build/source/src/gf.c:1897
jl_lookup_generic_ at /build/source/src/gf.c:2530 [inlined]
ijl_apply_generic at /build/source/src/gf.c:2545
jl_method_error_bare at /build/source/src/gf.c:1879
jl_method_error at /build/source/src/gf.c:1897
jl_lookup_generic_ at /build/source/src/gf.c:2530 [inlined]
ijl_apply_generic at /build/source/src/gf.c:2545
jl_apply at /build/source/src/julia.h:1842 [inlined]
run_finalizer at /build/source/src/gc.c:283
jl_gc_run_finalizers_in_list at /build/source/src/gc.c:370
run_finalizers at /build/source/src/gc.c:413
__libc_message at /nix/store/0xxjx37fcy2nl3yz6igmv4mag2a7giq6-glibc-2.33-123/lib/libc.so.6 (unknown line)
malloc_printerr at /nix/store/0xxjx37fcy2nl3yz6igmv4mag2a7giq6-glibc-2.33-123/lib/libc.so.6 (unknown line)
jl_gc_run_pending_finalizers at /build/source/src/gc.c:426
_int_free at /nix/store/0xxjx37fcy2nl3yz6igmv4mag2a7giq6-glibc-2.33-123/lib/libc.so.6 (unknown line)
_int_realloc at /nix/store/0xxjx37fcy2nl3yz6igmv4mag2a7giq6-glibc-2.33-123/lib/libc.so.6 (unknown line)
realloc at /nix/store/0xxjx37fcy2nl3yz6igmv4mag2a7giq6-glibc-2.33-123/lib/libc.so.6 (unknown line)
jl_apply at /build/source/src/julia.h:1842 [inlined]
run_finalizer at /build/source/src/gc.c:283
arraylist_grow at /build/source/src/support/arraylist.c:58 [inlined]
jl_gc_run_finalizers_in_list at /build/source/src/gc.c:370
arraylist_push at /build/source/src/support/arraylist.c:69
jl_gc_run_finalizers_in_list at /build/source/src/gc.c:362
run_finalizers at /build/source/src/gc.c:413
run_finalizers at /build/source/src/gc.c:413
jl_gc_run_pending_finalizers at /build/source/src/gc.c:426
jl_gc_run_pending_finalizers at /build/source/src/gc.c:426
enable_finalizers at ./gcutils.jl:121 [inlined]
unlock at ./locks-mt.jl:66 [inlined]
unlock at ./locks-mt.jl:66 [inlined]
trylock at ./locks-mt.jl:57 [inlin
]
trylock at ./locks-mt.jl:57 [inlin
multiq_deletemin at ./partr.jl:153

We found that jl_atexit_hook calls jl_gc_run_all_finalizers, which calls schedule_all_finalizers, which calls schedule_finalization, which pushes to the global to_finalize list without locking. That list is also mutated by run_finalizers, which can run whenever a lock is released, as the stack trace above shows, so the push at exit can race with run_finalizers on another thread.
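
To make the race described above concrete, here is a minimal sketch in Python rather than the runtime's C. It is not the actual patch: the class `FinalizerQueue` is hypothetical, the method names are borrowed from the functions named above only for readability, and the real runtime uses its own list and lock types. The sketch assumes the fix amounts to holding the finalizers lock while the exit path pushes onto the shared list.

```python
import threading


class FinalizerQueue:
    """Toy stand-in for the shared to_finalize list (hypothetical class)."""

    def __init__(self):
        self.finalizers_lock = threading.Lock()
        self.to_finalize = []  # shared list of (finalizer, object) pairs

    def schedule_finalization(self, obj, fn):
        # Pushes without locking, like the function described above.
        # It is only safe if the caller already holds finalizers_lock.
        self.to_finalize.append((fn, obj))

    def schedule_all_finalizers(self, pending):
        # Exit path: hold the lock around the whole batch of pushes so a
        # concurrent run_finalizers cannot observe or swap the list midway.
        with self.finalizers_lock:
            for obj, fn in pending:
                self.schedule_finalization(obj, fn)

    def run_finalizers(self):
        # Detach the current batch under the lock, then run the finalizers
        # without holding it.
        with self.finalizers_lock:
            batch, self.to_finalize = self.to_finalize, []
        for fn, obj in batch:
            fn(obj)
        return len(batch)
```

With both the pushing side and the draining side going through `finalizers_lock`, every scheduled finalizer ends up in exactly one batch, whichever thread gets there first.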

The use of finalizers_lock is pretty messy: on one path it is locked in one function and unlocked in another, while on another path it is locked and unlocked in the same function. All of this could use a cleanup, but for now I've implemented a minimal fix in this PR.

This fix seems to have eliminated the crash we've been seeing, but we're still running more tests.

@kpamnany kpamnany requested review from vtjnash and d-netto May 22, 2023 20:28
Member

@NHDaly NHDaly left a comment


👍 awesome, thanks. 👍

+1 that it'd be even better to clean this up by making the lock much tighter, only directly around the accesses, but that can come later.
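
As a rough illustration of what "only directly around the accesses" could look like, here is a small Python sketch. It is an assumption about the shape of such a cleanup, not anything from the Julia source: `LockedList`, `push`, and `take_all` are hypothetical names, and the real code is C with its own lock macros.

```python
import threading


class LockedList:
    """Hypothetical list whose lock is held only around each access."""

    def __init__(self):
        self._lock = threading.Lock()
        self._items = []

    def push(self, item):
        # Lock and unlock in the same function, around the mutation only.
        with self._lock:
            self._items.append(item)

    def take_all(self):
        # Swap the contents out under the lock and hand back the old batch.
        with self._lock:
            items, self._items = self._items, []
        return items


def run_finalizers(to_finalize):
    """Run one batch of (finalizer, object) pairs; return how many ran."""
    ran = 0
    for fn, obj in to_finalize.take_all():
        fn(obj)  # no lock is held here, so fn may push more work
        ran += 1
    return ran
```

Because no lock is held while a finalizer runs, a finalizer that schedules further work through `push` does not deadlock on the non-reentrant lock; the new entry is simply picked up by the next call to `run_finalizers`.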

@vtjnash vtjnash merged commit c470dc3 into master May 23, 2023
@vtjnash vtjnash deleted the kp/finalizers-lock-atexit branch May 23, 2023 13:26
kpamnany added a commit to RelationalAI/julia that referenced this pull request May 23, 2023
@kpamnany
Contributor Author

@KristofferC: backport to 1.9.1?

kpamnany added a commit to RelationalAI/julia that referenced this pull request May 24, 2023
kpamnany added a commit to RelationalAI/julia that referenced this pull request Jun 7, 2023
kpamnany added a commit to RelationalAI/julia that referenced this pull request Jun 22, 2023
@NHDaly NHDaly added the backport 1.9 Change should be backported to release-1.9 label Jun 28, 2023
@NHDaly
Member

NHDaly commented Jun 28, 2023

We backported #49868, and so we should do this one too, which is the other half of the same bug. I added the label.

kpamnany added a commit to RelationalAI/julia that referenced this pull request Jun 29, 2023
kpamnany added a commit to RelationalAI/julia that referenced this pull request Jul 6, 2023
KristofferC pushed a commit that referenced this pull request Jul 11, 2023
@KristofferC KristofferC mentioned this pull request Jul 11, 2023
35 tasks
kpamnany added a commit to RelationalAI/julia that referenced this pull request Jul 27, 2023
KristofferC added a commit that referenced this pull request Aug 18, 2023
Backported PRs:
- [x] #47782 <!-- Generalize Bool parse method to AbstractString -->
- [x] #48634 <!-- Remove unused "deps" mechanism in internal sorting keywords [NFC] -->
- [x] #49931 <!-- Lock finalizers' lists at exit -->
- [x] #50064 <!-- Fix numbered prompt with input only with comment -->
- [x] #50474 <!-- docs: Fix a `!!! note` which was miscapitalized -->
- [x] #50516 <!-- Fix visibility of assert on GCC12/13 -->
- [x] #50635 <!-- `versioninfo()`: include build info and unofficial warning -->
- [x] #49915 <!-- Revert "Remove number / vector (#44358)" -->
- [x] #50781 <!-- fix `bit_map!` with aliasing -->
- [x] #50845 <!-- fix #50438, use default pool for at-threads -->
- [x] #49031 <!-- Update inference.md -->
- [x] #50289 <!-- Initialize prev_nold and nold in gc_reset_page -->
- [x] #50559 <!-- Expand kwcall lowering positional default check to vararg -->
- [x] #49582 <!-- Update HISTORY.md for `DelimitedFiles` -->
- [x] #50341 <!-- invokelatest docs should say not exported before 1.9 -->
- [x] #50525 <!-- only check that values are finite in `generic_lufact` when `check=true` -->
- [x] #50444 <!-- Optimize getfield lowering to avoid boxing in some cases -->
- [x] #50523 <!-- Avoid generic call in most cases for getproperty -->
- [x] #50860 <!-- Add `Base.get_extension` to docs/API -->
- [x] #50164 <!-- codegen: handle dead code with unsafe_store of FCA pointers -->
- [x] #50568 <!-- `Array(::AbstractRange)` should return an `Array` -->
- [x] #50871 <!-- macOS: Don't inspect dead threadtls during exception handling. -->

Need manual backport:
- [ ] #48542 <!-- Add docs on task-specific buffering using multithreading -->
- [ ] #50591 <!-- build: fix various makefile bugs -->


Non-merged PRs with backport label:
- [ ] #50842 <!-- Avoid race conditions with recursive rm -->
- [ ] #50823 <!-- Make ranges more robust with unsigned indexes. -->
- [ ] #50663 <!-- Fix Expr(:loopinfo) codegen -->
- [ ] #49716 <!-- Update varinfo() docstring signature -->
- [ ] #49713 <!-- prevent REPL from erroring in numbered mode in some situations -->
- [ ] #49573 <!-- Implement jl_cpu_pause on PPC64 -->
- [ ] #48726 <!-- fix macro expansion of property destructuring -->
- [ ] #48642 <!-- Use gc alloc instead of alloc typed in lowering -->
- [ ] #48183 <!-- Don't use pkgimage for package if any includes fall in tracked path for coverage or alloc tracking -->
- [ ] #48050 <!-- improve `--heap-size-hint` arg handling -->
- [ ] #47615 <!-- Allow threadsafe access to buffer of type inference profiling trees -->
@KristofferC KristofferC removed the backport 1.9 Change should be backported to release-1.9 label Aug 18, 2023