Occasional segfaults when running with @threads #44460

Closed
chipbuster opened this issue Mar 4, 2022 · 11 comments · Fixed by #44476
Labels
multithreading Base.Threads and related functionality

Comments

@chipbuster
Contributor

chipbuster commented Mar 4, 2022

This is mostly a condensed repost of #44019. I was told that julia 1.7.1 was a "very old release to be relying on threading," but the issue still occurs on the latest nightly.

Running the following code will sometimes segfault. It is more common with 32 threads, but will sometimes occur with 16 or fewer threads.

Current Code
using Base.Iterators
using Base.Threads
using Serialization
using Distributions

function bin_data(data, lo, hi, nbins)
    dx = (hi - lo) / nbins
    bins = ((data .- lo) ./ dx) .|> floor
    bins = UInt8.(bins)
    clamp.(bins, UInt8(0), UInt8(nbins))
end

l = SpinLock()
function compress_data(data)
    lock(l)
    tmpfn = tempname()
    unlock(l)
    write(tmpfn, data)
    run(
        pipeline(
            `xz -9e --keep --format=raw --suffix=.xz $(tmpfn)`,
            stdout = devnull,
            stderr = devnull,
        ),
    )
    nbytes = filesize(tmpfn * ".xz")
    rm(tmpfn * ".xz")
    rm(tmpfn)
    return nbytes
end

compressed_size_bytes(data) = compress_data(data)
compressed_size_bits(data) = compress_data(data) * 8

function emission_times_exp(n, k, Γ)
    η = (k + Γ) / (k * Γ)
    dist = Exponential(η)
    rand(dist, n)
end

function lose_data(lagtimes, γ)
    @assert(all(lagtimes .>= 0.0))
    ind = Int[]
    fixed_times = cumsum(lagtimes)
    for i = 1:length(lagtimes)
        x = rand()
        if x < γ
            push!(ind, i)
        end
    end
    detected_times = fixed_times[ind]
    detected_times |> diff
end

ns = [100_000, 1_000_000, 10_000_000]
# ns = [1_000]  # testing only
ks = [0.1, 0.5, 1.0, 5.0, 10.0]
Γs = [0.1, 0.5, 1.0, 5.0, 10.0]
γs = range(0.1, 1.0, step = 0.1)
ntrials = 1000

smrates = Iterators.product(ks, Γs) |> collect |> vec

l = SpinLock()
@threads for trialnum = 1:ntrials
    data = Dict()
    for p in smrates
        (k, Γ) = p
        for n in ns
            # nm_times = get_emission_dt(n, k, Γ)
            # mar_times = emission_times_exp(n, k, Γ)
            nm_times = 10.0 .* rand(n)
            mar_times = 10.0 .* rand(n)

            for γ in γs
                nm_lost = lose_data(nm_times, γ)
                mar_lost = lose_data(mar_times, γ)
                hi = max(maximum(nm_lost),maximum(mar_lost))

                @assert(all(nm_lost .>= 0.0))
                @assert(all(mar_lost .>= 0.0))

                nm_binned = bin_data(nm_lost, 0.0, hi, 100)
                mar_binned = bin_data(mar_lost, 0.0, hi, 100)

                nm_size = compressed_size_bytes(nm_binned)
                mar_size = compressed_size_bytes(mar_binned)

                experiment_index = (n = n, k = k, Γ = Γ, γ = γ, trial = trialnum)

                try
                    lock(l)
                    data[experiment_index] = (1.0, 1.0)
                finally
                    unlock(l)
                end
            end
        end
    end
end
Output of `versioninfo()`
Julia Version 1.9.0-DEV.109
Commit 3a47c1c4e1 (2022-03-01 20:58 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: 32 × AMD Ryzen 9 5950X 16-Core Processor
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-13.0.1 (ORCJIT, znver3)
  Threads: 1 on 32 virtual cores
Stack trace from one of the rare runs where the program managed to print one before dying

The only line in the trace that originates in my code (line 65 of the script, at the top of the trace) is just the @threads loop.

signal (11): Segmentation fault
in expression starting at /mnt/ssd-data/Experiments/02-2022/simple-photon-model/code/buggy/gen_paramsweep_segfault.jl:65
jl_uv_call_close_callback at /buildworker/worker/package_linux64/build/src/jl_uv.c:83 [inlined]
jl_uv_closeHandle at /buildworker/worker/package_linux64/build/src/jl_uv.c:106
uv__finish_close at /workspace/srcdir/libuv/src/unix/core.c:301
uv__run_closing_handles at /workspace/srcdir/libuv/src/unix/core.c:315
uv_run at /workspace/srcdir/libuv/src/unix/core.c:393
ijl_process_events at /buildworker/worker/package_linux64/build/src/jl_uv.c:210
ijl_task_get_next at /buildworker/worker/package_linux64/build/src/partr.c:598
poptask at ./task.jl:884
wait at ./task.jl:893
wait at ./condition.jl:124
wait at ./process.jl:647
success at ./process.jl:509
Exception: /tmp/julia-3a47c1c4e1/bin/julia killed by signal segmentation fault (core dumped)
A valgrind report from the previous issue

Taken from this comment

snellius paulm@tcn116 17:23 ~$ valgrind --smc-check=all-non-file --suppressions=$HOME/c/julia-git/contrib/valgrind-julia.supp ~/software/julia-1.7.1/bin/julia -t 128 segfault.jl
==1988794== Memcheck, a memory error detector
==1988794== Copyright (C) 2002-2017, and GNU GPL'd, by Julian Seward et al.
==1988794== Using Valgrind-3.18.1 and LibVEX; rerun with -h for copyright info
==1988794== Command: /home/paulm/software/julia-1.7.1/bin/julia -t 128 segfault.jl
==1988794== 
--1988794-- WARNING: unhandled amd64-linux syscall: 1008
--1988794-- You may be able to write your own handler.
--1988794-- Read the file README_MISSING_SYSCALL_OR_IOCTL.
--1988794-- Nevertheless we consider this a bug.  Please report
--1988794-- it at http://valgrind.org/support/bug_reports.html.
==1988794== Warning: client switching stacks?  SP change: 0x311d17d8 --> 0xe2a5fff8
==1988794==          to suppress, use: --max-stackframe=2978539552 or greater
==1988794== Warning: invalid file descriptor -1 in syscall close()
==1988794== Warning: invalid file descriptor -1 in syscall close()
==1988794== Warning: client switching stacks?  SP change: 0x30bce7d8 --> 0xef554ff8
==1988794==          to suppress, use: --max-stackframe=3197659168 or greater
==1988794== Warning: client switching stacks?  SP change: 0x34dd57d8 --> 0xf3958ff8
==1988794==          to suppress, use: --max-stackframe=3199744032 or greater
==1988794==          further instances of this message will not be shown.
==1988794== Thread 3:
==1988794== Syscall param write(buf) points to uninitialised byte(s)
==1988794==    at 0x4F4A52D: syscall (in /usr/lib64/libc-2.28.so)
==1988794==  Address 0xef54e000 is in a rw- anonymous segment
==1988794== 
==1988794== Syscall param write(buf) points to unaddressable byte(s)
==1988794==    at 0x4F4A52D: syscall (in /usr/lib64/libc-2.28.so)
==1988794==  Address 0xef54e000 is in a rw- anonymous segment
==1988794== 
==1988794== Thread 80:
==1988794== Invalid read of size 8
==1988794==    at 0x5B87E84: maybe_collect (julia_threads.h:325)
==1988794==    by 0x5B87E84: jl_gc_big_alloc (gc.c:947)
==1988794==  Address 0xfffffffffe49ff10 is not stack'd, malloc'd or (recently) free'd
==1988794== 
==1988794== Thread 88:
==1988794== Invalid read of size 8
==1988794==    at 0x5B7CC12: jl_gc_state_set (julia_threads.h:325)
==1988794==    by 0x5B7CC12: jl_task_get_next (partr.c:523)
==1988794==  Address 0x0 is not stack'd, malloc'd or (recently) free'd
==1988794== 
==1988794== 
==1988794== Process terminating with default action of signal 11 (SIGSEGV): dumping core
==1988794==  Access not within mapped region at address 0x0
==1988794==    at 0x5B7CC12: jl_gc_state_set (julia_threads.h:325)
==1988794==    by 0x5B7CC12: jl_task_get_next (partr.c:523)
==1988794==  If you believe this happened as a result of a stack
==1988794==  overflow in your program's main thread (unlikely but
==1988794==  possible), you can try to increase the size of the
==1988794==  main thread stack using the --main-stacksize= flag.
==1988794==  The main thread stack size used in this run was 16777216.
==1988794== 
==1988794== HEAP SUMMARY:
==1988794==     in use at exit: 603,293,029 bytes in 49,164 blocks
==1988794==   total heap usage: 1,022,326 allocs, 973,162 frees, 3,449,011,340 bytes allocated
==1988794== 
==1988794== LEAK SUMMARY:
==1988794==    definitely lost: 163 bytes in 12 blocks
==1988794==    indirectly lost: 0 bytes in 0 blocks
==1988794==      possibly lost: 1,273,212 bytes in 12,672 blocks
==1988794==    still reachable: 602,018,922 bytes in 36,477 blocks
==1988794==                       of which reachable via heuristic:
==1988794==                         newarray           : 56,448 bytes in 10 blocks
==1988794==                         multipleinheritance: 7,992 bytes in 15 blocks
==1988794==         suppressed: 732 bytes in 3 blocks
==1988794== Rerun with --leak-check=full to see details of leaked memory
==1988794== 
==1988794== Use --track-origins=yes to see where uninitialised values come from
==1988794== For lists of detected and suppressed errors, rerun with: -s
==1988794== ERROR SUMMARY: 15 errors from 4 contexts (suppressed: 46 from 6)
Segmentation fault

An rr trace of the bug under Julia 1.7.1 can be found here. However, I have not been able to catch the program crashing under rr under the nightly build: trying to use BugReporting.jl will simply complete the program successfully, while running julia itself under rr has not managed to produce any useful results yet (it has been running for over 24 hours at this point with no apparent progress).

@chipbuster
Contributor Author

Running under an external rr has crashed with an internal rr error.

It looks like this is an rr issue and not the actual Julia crash (the error message was about the wrong signal being received for a stop) but I have uploaded this trace in case it winds up being useful.

https://www.dropbox.com/s/ln805f9tfq3tsau/julia-rr-crash-nightly.tar.zst?dl=0

vtjnash added a commit that referenced this issue Mar 5, 2022
There appeared to be a possibility they could race and the data field
might already be destroyed before we reached the close callback,
from looking at the state of the program when reproducing #44460.

This is because the uv_return_spawn set the handle to NULL, which later
can cause the uvfinalize to exit early (if the finalizer gets run on
another thread, since we have disabled finalizers on our thread). Then
the GC can reap the julia Process object prior to uv_close cleaning up
the object. We solve this by calling disassociate_julia_struct before
dropping the reference to the handle. But then we also fully address any
remaining race condition by having uvfinalize acquire a lock also.

Fixes #44460
@JeffBezanson JeffBezanson added the multithreading Base.Threads and related functionality label Mar 7, 2022
vtjnash added a commit that referenced this issue Mar 8, 2022
There appeared to be a possibility they could race and the data field
might already be destroyed before we reached the close callback,
from looking at the state of the program when reproducing #44460.

This is because the uv_return_spawn set the handle to NULL, which later
can cause the uvfinalize to exit early (if the finalizer gets run on
another thread, since we have disabled finalizers on our thread). Then
the GC can reap the julia Process object prior to uv_close cleaning up
the object. We solve this by calling disassociate_julia_struct before
dropping the reference to the handle. But then we also fully address any
remaining race condition by having uvfinalize acquire a lock also.

The uv_return_spawn callback also needs to be synchronized with the
constructor, since we might have arrived there before we finished
allocating the Process struct here, leading to missed exit events.

Fixes #44460
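
To make the ordering concrete, here is a self-contained toy sketch of the pattern the commit message describes (an analogy only, not the actual patch in #44476; `ToyProc`, `handle_lock`, and `teardown!` are invented names): clear the shared handle field before releasing the native resource, and take the same lock on every teardown path, so a finalizer running on another thread cannot race the close callback.

using Base.Threads: SpinLock

mutable struct ToyProc
    handle::Ptr{Cvoid}            # stands in for the underlying uv_process_t handle
end

const handle_lock = SpinLock()    # one lock shared by every teardown path

function teardown!(p::ToyProc)
    lock(handle_lock)
    try
        if p.handle != C_NULL
            h = p.handle
            p.handle = C_NULL     # "disassociate" the Julia object from the handle first
            Libc.free(h)          # only then release the native resource
        end
    finally
        unlock(handle_lock)
    end
    return nothing
end

p = ToyProc(Libc.malloc(64))

# Path 1: the exit/close callback (the role uv_return_spawn plays for a real Process).
teardown!(p)

# Path 2: the finalizer, which may run on another thread. Because it takes the same
# lock and re-checks the handle, it is a harmless no-op instead of a double free.
teardown!(p)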
KristofferC pushed a commit that referenced this issue Mar 11, 2022

(Same commit message as above; cherry picked from commit c591bf2)
KristofferC pushed a commit that referenced this issue Apr 19, 2022

(Same commit message as above; cherry picked from commit c591bf2)
@dpinol

dpinol commented Apr 22, 2022

I also get a crash in my application when pushing into an array local variable inside Threads.@threads.
I then tested the code from this issue's description, and it still crashes on my Ubuntu machine with both 1.8.0-beta3 and Julia built from master yesterday. @chipbuster can you reproduce the crash with current Julia versions?

julia> versioninfo()
Julia Version 1.9.0-DEV.425
Commit 4340d23d28* (2022-04-21 08:06 UTC)
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 12 × Intel(R) Core(TM) i7-9750H CPU @ 2.60GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-13.0.1 (ORCJIT, skylake)
  Threads: 32 on 12 virtual cores

This is what I get with valgrind on my application:


==915992== Thread 17:
==915992== Invalid read of size 8
==915992==    at 0x323D1C9A: jl_gc_state_set (julia_threads.h:340)
==915992==    by 0x323D1C9A: ijl_task_get_next (partr.c:593)
==915992==  Address 0x0 is not stack'd, malloc'd or (recently) free'd

This is what I get without valgrind on my application with julia 1.9-dev:

signal (11): Segmentation fault
in expression starting at none:0
in expression starting at none:0
in expression starting at none:0
in expression starting at none:0
in expression starting at none:0
in expression starting at none:0
jl_gc_alloc at /home/dani/dev/julia/julia/usr/bin/../lib/libjulia-internal.so.1 (unknown line)
unknown function (ip: 0x7fab2a7537d4)
ijl_process_events at /home/dani/dev/julia/julia/src/jl_uv.c:210
ijl_task_get_next at /home/dani/dev/julia/julia/src/partr.c:420
jl_array_grow_at_end at /home/dani/dev/julia/julia/src/array.c:893
_growend! at ./array.jl:1006 [inlined]
push! at ./array.jl:1053 [inlined]
.....
#1261#threadsfor_fun#248 at ./threadingconstructs.jl:84
#1261#threadsfor_fun at ./threadingconstructs.jl:52 [inlined]
#1 at ./threadingconstructs.jl:30
unknown function (ip: 0x7faa6da9bacf)
jl_mutex_unlock at /home/dani/dev/julia/julia/src/julia_locks.h:129 [inlined]
ijl_process_events at /home/dani/dev/julia/julia/src/jl_uv.c:217
ijl_task_get_next at /home/dani/dev/julia/julia/src/partr.c:420
ijl_task_get_next at /home/dani/dev/julia/julia/src/partr.c:420
jl_apply at /home/dani/dev/julia/julia/src/julia.h:1838 [inlined]
start_task at /home/dani/dev/julia/julia/src/task.c:931
Allocations: 46370570 (Pool: 46340026; Big: 30544); GC: 64

With Julia 1.8-beta3

signal (11): Segmentation fault
in expression starting at none:0
maybe_collect at /buildworker/worker/package_linux64/build/src/julia_threads.h:340 [inlined]
jl_gc_pool_alloc_inner at /buildworker/worker/package_linux64/build/src/gc.c:1228 [inlined]
jl_gc_pool_alloc_noinline at /buildworker/worker/package_linux64/build/src/gc.c:1287 [inlined]
jl_gc_alloc_ at /buildworker/worker/package_linux64/build/src/julia_internal.h:368 [inlined]
jl_gc_alloc at /buildworker/worker/package_linux64/build/src/gc.c:3352
jl_gc_alloc_buf at /buildworker/worker/package_linux64/build/src/julia_internal.h:403 [inlined]
jl_reserve_excstack at /buildworker/worker/package_linux64/build/src/rtutils.c:335 [inlined]
jl_push_excstack at /buildworker/worker/package_linux64/build/src/rtutils.c:347
throw_internal at /buildworker/worker/package_linux64/build/src/task.c:621
ijl_sig_throw at /buildworker/worker/package_linux64/build/src/task.c:681

@KristofferC
Sponsor Member

KristofferC commented Apr 22, 2022

pushing into an array local variable inside Threads.@threads.

Could you be a bit more explicit (perhaps with a code snippet)? As you describe it, that sounds like a data race.

@dpinol

dpinol commented Apr 22, 2022

pushing into an array local variable inside Threads.@threads.

Could you be a bit more explicit (perhaps with a code snippet)? As you describe it, that sounds like a data race.

The crash involves a local variable that is only accessed from a single thread. The snippet below illustrates what my application does, but Julia never crashes with it (nor does valgrind report any issues).

mutable struct S
    @atomic sum::Float32
end
const s = S(0)

for i in 1:1_000_000
    Threads.@threads for t in 1:300
        a = Float32[]
        for i in 1:rand(1:2000)
            push!(a, 0.3f0)
        end
        @atomic s.sum += a[end]
    end
end
@info "sum" s.sum

@dpinol

dpinol commented Apr 22, 2022

I uploaded the result of running #44460 (comment) with Julia 1.9-dev using `--bug-report=rr-local --project=. --threads=32` (5 GB 😿).
(Link edited to point to the packed rr data.)
https://www.dropbox.com/s/ia8b159h59x9uv4/issue-44460-rr.tgz?dl=0

@vtjnash
Sponsor Member

vtjnash commented Apr 22, 2022

It looks like you need to run `rr pack` before uploading that trace (it should only make it slightly bigger).

@dpinol

dpinol commented Apr 22, 2022

It looks like you need to run `rr pack` before uploading that trace (it should only make it slightly bigger).

The packed data: https://www.dropbox.com/s/ia8b159h59x9uv4/issue-44460-rr.tgz?dl=0 Thanks!

@chipbuster
Contributor Author

Yeah, the example snippet is not really the best because it generates a lot of data in rr and doesn't always crash quickly.

Something I found that sometimes persuades it to crash more quickly is clearing all caches and buffers on Linux before starting a recording, but it's not foolproof.
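
(For reference, "clearing all caches and buffers" on Linux presumably means dropping the kernel page cache; assuming that is what was meant, it can be done as root with something like the two lines below before starting the recording.)

# Must be run as root: flush dirty pages, then ask the kernel to drop the
# page cache, dentries and inodes before starting the rr recording.
run(`sync`)
write("/proc/sys/vm/drop_caches", "3")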

@vtjnash
Sponsor Member

vtjnash commented Apr 26, 2022

@dpinol there doesn't seem to be anything remarkable in your trace that I can see. The process runs for a while (about 4138 iterations of the loop happen), then something external terminates it with SIGKILL (9).

@chipbuster
Contributor Author

@dpinol I'm not able to reproduce the crash running with the latest Julia.

Based on the description provided by vtjnash, it's possible that my example is crashing for you simply because it is running out of memory.

@dpinol

dpinol commented May 5, 2022

Yes, I agree that in this case it runs out of memory. I created a different MWE at #45196. Thanks!
