Skip to content

Commit

Permalink
Merge pull request #36 from RelationalAI/dcn-backport-parallel-marking
Browse files Browse the repository at this point in the history
Backport parallel marking
  • Loading branch information
d-netto authored Sep 11, 2023
2 parents 7369d20 + 6085d74 commit f41d92f
Show file tree
Hide file tree
Showing 33 changed files with 1,540 additions and 1,564 deletions.
3 changes: 3 additions & 0 deletions NEWS.md
Original file line number Diff line number Diff line change
Expand Up @@ -38,6 +38,7 @@ Compiler/Runtime improvements
* All uses of the `@pure` macro in `Base` have been replaced with the now-preferred `Base.@assume_effects` ([#44776]).
* `invoke(f, invokesig, args...)` calls to a less-specific method than would normally be chosen
for `f(args...)` are no longer spuriously invalidated when loading package precompile files ([#46010]).
* The mark phase of the Garbage Collector is now multi-threaded ([#48600]).

Command-line option changes
---------------------------
Expand All @@ -49,6 +50,8 @@ Command-line option changes
number of interactive threads to create (`auto` currently means 1) ([#42302]).
* New option `--heap-size-hint=<size>` suggests a size limit to invoke garbage collection more eagerly.
The size may be specified in bytes, kilobytes (1000k), megabytes (300M), or gigabytes (1.5G) ([#45369]).
* New option `--gcthreads` to set how many threads will be used by the Garbage Collector ([#48600]).
The default is set to `N/2` where `N` is the amount of worker threads (`--threads`) used by Julia.

Multi-threading changes
-----------------------
Expand Down
1 change: 1 addition & 0 deletions base/options.jl
Original file line number Diff line number Diff line change
Expand Up @@ -11,6 +11,7 @@ struct JLOptions
cpu_target::Ptr{UInt8}
nthreadpools::Int16
nthreads::Int16
ngcthreads::Int16
nthreads_per_pool::Ptr{Int16}
nprocs::Int32
machine_file::Ptr{UInt8}
Expand Down
7 changes: 7 additions & 0 deletions base/threadingconstructs.jl
Original file line number Diff line number Diff line change
Expand Up @@ -130,6 +130,13 @@ function threadpooltids(pool::Symbol)
end
end

"""
Threads.ngcthreads() -> Int
Returns the number of GC threads currently configured.
"""
ngcthreads() = Int(unsafe_load(cglobal(:jl_n_gcthreads, Cint))) + 1

function threading_run(fun, static)
ccall(:jl_enter_threaded_region, Cvoid, ())
n = threadpoolsize()
Expand Down
5 changes: 5 additions & 0 deletions doc/man/julia.1
Original file line number Diff line number Diff line change
Expand Up @@ -118,6 +118,11 @@ supported (Linux and Windows). If this is not supported (macOS) or
process affinity is not configured, it uses the number of CPU
threads.

.TP
--gcthreads <n>
Enable n GC threads; If unspecified is set to half of the
compute worker threads.

.TP
-p, --procs {N|auto}
Integer value N launches N additional local worker processes `auto` launches as many workers
Expand Down
1 change: 1 addition & 0 deletions doc/src/base/multi-threading.md
Original file line number Diff line number Diff line change
Expand Up @@ -10,6 +10,7 @@ Base.Threads.nthreads
Base.Threads.threadpool
Base.Threads.nthreadpools
Base.Threads.threadpoolsize
Base.Threads.ngcthreads
```

See also [Multi-Threading](@ref man-multithreading).
Expand Down
5 changes: 3 additions & 2 deletions doc/src/manual/command-line-interface.md
Original file line number Diff line number Diff line change
Expand Up @@ -106,8 +106,9 @@ The following is a complete list of command-line switches available when launchi
|`-e`, `--eval <expr>` |Evaluate `<expr>`|
|`-E`, `--print <expr>` |Evaluate `<expr>` and display the result|
|`-L`, `--load <file>` |Load `<file>` immediately on all processors|
|`-t`, `--threads {N\|auto`} |Enable N threads; `auto` tries to infer a useful default number of threads to use but the exact behavior might change in the future. Currently, `auto` uses the number of CPUs assigned to this julia process based on the OS-specific affinity assignment interface, if supported (Linux and Windows). If this is not supported (macOS) or process affinity is not configured, it uses the number of CPU threads.|
|`-p`, `--procs {N\|auto`} |Integer value N launches N additional local worker processes; `auto` launches as many workers as the number of local CPU threads (logical cores)|
|`-t`, `--threads {N\|auto}` |Enable N threads; `auto` tries to infer a useful default number of threads to use but the exact behavior might change in the future. Currently, `auto` uses the number of CPUs assigned to this julia process based on the OS-specific affinity assignment interface, if supported (Linux and Windows). If this is not supported (macOS) or process affinity is not configured, it uses the number of CPU threads.|
| `--gcthreads {N}` |Enable N GC threads; If unspecified is set to half of the compute worker threads.|
|`-p`, `--procs {N\|auto}` |Integer value N launches N additional local worker processes; `auto` launches as many workers as the number of local CPU threads (logical cores)|
|`--machine-file <file>` |Run processes on hosts listed in `<file>`|
|`-i` |Interactive mode; REPL runs and `isinteractive()` is true|
|`-q`, `--quiet` |Quiet startup: no banner, suppress REPL warnings|
Expand Down
8 changes: 8 additions & 0 deletions doc/src/manual/environment-variables.md
Original file line number Diff line number Diff line change
Expand Up @@ -315,6 +315,14 @@ then spinning threads never sleep. Otherwise, `$JULIA_THREAD_SLEEP_THRESHOLD` is
interpreted as an unsigned 64-bit integer (`uint64_t`) and gives, in
nanoseconds, the amount of time after which spinning threads should sleep.

### [`JULIA_NUM_GC_THREADS`](@id env-gc-threads)

Sets the number of threads used by Garbage Collection. If unspecified is set to
half of the number of worker threads.

!!! compat "Julia 1.10"
The environment variable was added in 1.10

### `JULIA_EXCLUSIVE`

If set to anything besides `0`, then Julia's thread policy is consistent with
Expand Down
9 changes: 9 additions & 0 deletions doc/src/manual/multi-threading.md
Original file line number Diff line number Diff line change
Expand Up @@ -72,6 +72,15 @@ julia> Threads.threadid()
three processes have 2 threads enabled. For more fine grained control over worker
threads use [`addprocs`](@ref) and pass `-t`/`--threads` as `exeflags`.

### Multiple GC Threads

The Garbage Collector (GC) can use multiple threads. The amount used is either half the number
of compute worker threads or configured by either the `--gcthreads` command line argument or by using the
[`JULIA_NUM_GC_THREADS`](@ref env-gc-threads) environment variable.

!!! compat "Julia 1.10"
The `--gcthreads` command line argument requires at least Julia 1.10.

## [Threadpools](@id man-threadpools)

When a program's threads are busy with many tasks to run, tasks may experience
Expand Down
2 changes: 1 addition & 1 deletion src/Makefile
Original file line number Diff line number Diff line change
Expand Up @@ -99,7 +99,7 @@ ifeq ($(USE_SYSTEM_LIBUV),0)
UV_HEADERS += uv.h
UV_HEADERS += uv/*.h
endif
PUBLIC_HEADERS := $(BUILDDIR)/julia_version.h $(wildcard $(SRCDIR)/support/*.h) $(addprefix $(SRCDIR)/,julia.h julia_assert.h julia_threads.h julia_fasttls.h julia_locks.h julia_atomics.h jloptions.h)
PUBLIC_HEADERS := $(BUILDDIR)/julia_version.h $(wildcard $(SRCDIR)/support/*.h) $(addprefix $(SRCDIR)/,work-stealing-queue.h julia.h julia_assert.h julia_threads.h julia_fasttls.h julia_locks.h julia_atomics.h jloptions.h)
ifeq ($(OS),WINNT)
PUBLIC_HEADERS += $(addprefix $(SRCDIR)/,win32_ucontext.h)
endif
Expand Down
197 changes: 44 additions & 153 deletions src/gc-debug.c
Original file line number Diff line number Diff line change
Expand Up @@ -198,21 +198,32 @@ static void restore(void)

static void gc_verify_track(jl_ptls_t ptls)
{
jl_gc_mark_cache_t *gc_cache = &ptls->gc_cache;
// `gc_verify_track` is limited to single-threaded GC
if (jl_n_gcthreads != 0)
return;
do {
jl_gc_mark_sp_t sp;
gc_mark_sp_init(gc_cache, &sp);
jl_gc_markqueue_t mq;
jl_gc_markqueue_t *mq2 = &ptls->mark_queue;
ws_queue_t *cq = &mq.chunk_queue;
ws_queue_t *q = &mq.ptr_queue;
jl_atomic_store_relaxed(&cq->top, 0);
jl_atomic_store_relaxed(&cq->bottom, 0);
jl_atomic_store_relaxed(&cq->array, jl_atomic_load_relaxed(&mq2->chunk_queue.array));
jl_atomic_store_relaxed(&q->top, 0);
jl_atomic_store_relaxed(&q->bottom, 0);
jl_atomic_store_relaxed(&q->array, jl_atomic_load_relaxed(&mq2->ptr_queue.array));
arraylist_new(&mq.reclaim_set, 32);
arraylist_push(&lostval_parents_done, lostval);
jl_safe_printf("Now looking for %p =======\n", lostval);
clear_mark(GC_CLEAN);
gc_mark_queue_all_roots(ptls, &sp);
gc_mark_queue_finlist(gc_cache, &sp, &to_finalize, 0);
for (int i = 0; i < gc_n_threads; i++) {
gc_mark_queue_all_roots(ptls, &mq);
gc_mark_finlist(&mq, &to_finalize, 0);
for (int i = 0; i < gc_n_threads;i++) {
jl_ptls_t ptls2 = gc_all_tls_states[i];
gc_mark_queue_finlist(gc_cache, &sp, &ptls2->finalizers, 0);
gc_mark_finlist(&mq, &ptls2->finalizers, 0);
}
gc_mark_queue_finlist(gc_cache, &sp, &finalizer_list_marked, 0);
gc_mark_loop(ptls, sp);
gc_mark_finlist(&mq, &finalizer_list_marked, 0);
gc_mark_loop_serial_(ptls, &mq);
if (lostval_parents.len == 0) {
jl_safe_printf("Could not find the missing link. We missed a toplevel root. This is odd.\n");
break;
Expand Down Expand Up @@ -246,22 +257,35 @@ static void gc_verify_track(jl_ptls_t ptls)

void gc_verify(jl_ptls_t ptls)
{
jl_gc_mark_cache_t *gc_cache = &ptls->gc_cache;
jl_gc_mark_sp_t sp;
gc_mark_sp_init(gc_cache, &sp);
// `gc_verify` is limited to single-threaded GC
if (jl_n_gcthreads != 0) {
jl_safe_printf("Warn. GC verify disabled in multi-threaded GC\n");
return;
}
jl_gc_markqueue_t mq;
jl_gc_markqueue_t *mq2 = &ptls->mark_queue;
ws_queue_t *cq = &mq.chunk_queue;
ws_queue_t *q = &mq.ptr_queue;
jl_atomic_store_relaxed(&cq->top, 0);
jl_atomic_store_relaxed(&cq->bottom, 0);
jl_atomic_store_relaxed(&cq->array, jl_atomic_load_relaxed(&mq2->chunk_queue.array));
jl_atomic_store_relaxed(&q->top, 0);
jl_atomic_store_relaxed(&q->bottom, 0);
jl_atomic_store_relaxed(&q->array, jl_atomic_load_relaxed(&mq2->ptr_queue.array));
arraylist_new(&mq.reclaim_set, 32);
lostval = NULL;
lostval_parents.len = 0;
lostval_parents_done.len = 0;
clear_mark(GC_CLEAN);
gc_verifying = 1;
gc_mark_queue_all_roots(ptls, &sp);
gc_mark_queue_finlist(gc_cache, &sp, &to_finalize, 0);
for (int i = 0; i < gc_n_threads; i++) {
gc_mark_queue_all_roots(ptls, &mq);
gc_mark_finlist(&mq, &to_finalize, 0);
for (int i = 0; i < gc_n_threads;i++) {
jl_ptls_t ptls2 = gc_all_tls_states[i];
gc_mark_queue_finlist(gc_cache, &sp, &ptls2->finalizers, 0);
gc_mark_finlist(&mq, &ptls2->finalizers, 0);
}
gc_mark_queue_finlist(gc_cache, &sp, &finalizer_list_marked, 0);
gc_mark_loop(ptls, sp);
gc_mark_finlist(&mq, &finalizer_list_marked, 0);
gc_mark_loop_serial_(ptls, &mq);
int clean_len = bits_save[GC_CLEAN].len;
for(int i = 0; i < clean_len + bits_save[GC_OLD].len; i++) {
jl_taggedvalue_t *v = (jl_taggedvalue_t*)bits_save[i >= clean_len ? GC_OLD : GC_CLEAN].items[i >= clean_len ? i - clean_len : i];
Expand Down Expand Up @@ -500,7 +524,7 @@ int jl_gc_debug_check_other(void)
return gc_debug_alloc_check(&jl_gc_debug_env.other);
}

void jl_gc_debug_print_status(void)
void jl_gc_debug_print_status(void) JL_NOTSAFEPOINT
{
uint64_t pool_count = jl_gc_debug_env.pool.num;
uint64_t other_count = jl_gc_debug_env.other.num;
Expand All @@ -509,7 +533,7 @@ void jl_gc_debug_print_status(void)
pool_count + other_count, pool_count, other_count, gc_num.pause);
}

void jl_gc_debug_critical_error(void)
void jl_gc_debug_critical_error(void) JL_NOTSAFEPOINT
{
jl_gc_debug_print_status();
if (!jl_gc_debug_env.wait_for_debugger)
Expand Down Expand Up @@ -1264,139 +1288,6 @@ int gc_slot_to_arrayidx(void *obj, void *_slot) JL_NOTSAFEPOINT
return (slot - start) / elsize;
}

// Print a backtrace from the bottom (start) of the mark stack up to `sp`
// `pc_offset` will be added to `sp` for convenience in the debugger.
NOINLINE void gc_mark_loop_unwind(jl_ptls_t ptls, jl_gc_mark_sp_t sp, int pc_offset)
{
jl_jmp_buf *old_buf = jl_get_safe_restore();
jl_jmp_buf buf;
jl_set_safe_restore(&buf);
if (jl_setjmp(buf, 0) != 0) {
jl_safe_printf("\n!!! ERROR when unwinding gc mark loop -- ABORTING !!!\n");
jl_set_safe_restore(old_buf);
return;
}
void **top = sp.pc + pc_offset;
jl_gc_mark_data_t *data_top = sp.data;
sp.data = ptls->gc_cache.data_stack;
sp.pc = ptls->gc_cache.pc_stack;
int isroot = 1;
while (sp.pc < top) {
void *pc = *sp.pc;
const char *prefix = isroot ? "r--" : " `-";
isroot = 0;
if (pc == gc_mark_label_addrs[GC_MARK_L_marked_obj]) {
gc_mark_marked_obj_t *data = gc_repush_markdata(&sp, gc_mark_marked_obj_t);
if ((jl_gc_mark_data_t *)data > data_top) {
jl_safe_printf("Mark stack unwind overflow -- ABORTING !!!\n");
break;
}
jl_safe_printf("%p: Root object: %p :: %p (bits: %d)\n of type ",
(void*)data, (void*)data->obj, (void*)data->tag, (int)data->bits);
jl_((void*)data->tag);
isroot = 1;
}
else if (pc == gc_mark_label_addrs[GC_MARK_L_scan_only]) {
gc_mark_marked_obj_t *data = gc_repush_markdata(&sp, gc_mark_marked_obj_t);
if ((jl_gc_mark_data_t *)data > data_top) {
jl_safe_printf("Mark stack unwind overflow -- ABORTING !!!\n");
break;
}
jl_safe_printf("%p: Queued root: %p :: %p (bits: %d)\n of type ",
(void*)data, (void*)data->obj, (void*)data->tag, (int)data->bits);
jl_((void*)data->tag);
isroot = 1;
}
else if (pc == gc_mark_label_addrs[GC_MARK_L_finlist]) {
gc_mark_finlist_t *data = gc_repush_markdata(&sp, gc_mark_finlist_t);
if ((jl_gc_mark_data_t *)data > data_top) {
jl_safe_printf("Mark stack unwind overflow -- ABORTING !!!\n");
break;
}
jl_safe_printf("%p: Finalizer list from %p to %p\n",
(void*)data, (void*)data->begin, (void*)data->end);
isroot = 1;
}
else if (pc == gc_mark_label_addrs[GC_MARK_L_objarray]) {
gc_mark_objarray_t *data = gc_repush_markdata(&sp, gc_mark_objarray_t);
if ((jl_gc_mark_data_t *)data > data_top) {
jl_safe_printf("Mark stack unwind overflow -- ABORTING !!!\n");
break;
}
jl_safe_printf("%p: %s Array in object %p :: %p -- [%p, %p)\n of type ",
(void*)data, prefix, (void*)data->parent, ((void**)data->parent)[-1],
(void*)data->begin, (void*)data->end);
jl_(jl_typeof(data->parent));
}
else if (pc == gc_mark_label_addrs[GC_MARK_L_obj8]) {
gc_mark_obj8_t *data = gc_repush_markdata(&sp, gc_mark_obj8_t);
if ((jl_gc_mark_data_t *)data > data_top) {
jl_safe_printf("Mark stack unwind overflow -- ABORTING !!!\n");
break;
}
jl_datatype_t *vt = (jl_datatype_t*)jl_typeof(data->parent);
uint8_t *desc = (uint8_t*)jl_dt_layout_ptrs(vt->layout);
jl_safe_printf("%p: %s Object (8bit) %p :: %p -- [%d, %d)\n of type ",
(void*)data, prefix, (void*)data->parent, ((void**)data->parent)[-1],
(int)(data->begin - desc), (int)(data->end - desc));
jl_(jl_typeof(data->parent));
}
else if (pc == gc_mark_label_addrs[GC_MARK_L_obj16]) {
gc_mark_obj16_t *data = gc_repush_markdata(&sp, gc_mark_obj16_t);
if ((jl_gc_mark_data_t *)data > data_top) {
jl_safe_printf("Mark stack unwind overflow -- ABORTING !!!\n");
break;
}
jl_datatype_t *vt = (jl_datatype_t*)jl_typeof(data->parent);
uint16_t *desc = (uint16_t*)jl_dt_layout_ptrs(vt->layout);
jl_safe_printf("%p: %s Object (16bit) %p :: %p -- [%d, %d)\n of type ",
(void*)data, prefix, (void*)data->parent, ((void**)data->parent)[-1],
(int)(data->begin - desc), (int)(data->end - desc));
jl_(jl_typeof(data->parent));
}
else if (pc == gc_mark_label_addrs[GC_MARK_L_obj32]) {
gc_mark_obj32_t *data = gc_repush_markdata(&sp, gc_mark_obj32_t);
if ((jl_gc_mark_data_t *)data > data_top) {
jl_safe_printf("Mark stack unwind overflow -- ABORTING !!!\n");
break;
}
jl_datatype_t *vt = (jl_datatype_t*)jl_typeof(data->parent);
uint32_t *desc = (uint32_t*)jl_dt_layout_ptrs(vt->layout);
jl_safe_printf("%p: %s Object (32bit) %p :: %p -- [%d, %d)\n of type ",
(void*)data, prefix, (void*)data->parent, ((void**)data->parent)[-1],
(int)(data->begin - desc), (int)(data->end - desc));
jl_(jl_typeof(data->parent));
}
else if (pc == gc_mark_label_addrs[GC_MARK_L_stack]) {
gc_mark_stackframe_t *data = gc_repush_markdata(&sp, gc_mark_stackframe_t);
if ((jl_gc_mark_data_t *)data > data_top) {
jl_safe_printf("Mark stack unwind overflow -- ABORTING !!!\n");
break;
}
jl_safe_printf("%p: %s Stack frame %p -- %d of %d (%s)\n",
(void*)data, prefix, (void*)data->s, (int)data->i,
(int)data->nroots >> 1,
(data->nroots & 1) ? "indirect" : "direct");
}
else if (pc == gc_mark_label_addrs[GC_MARK_L_module_binding]) {
// module_binding
gc_mark_binding_t *data = gc_repush_markdata(&sp, gc_mark_binding_t);
if ((jl_gc_mark_data_t *)data > data_top) {
jl_safe_printf("Mark stack unwind overflow -- ABORTING !!!\n");
break;
}
jl_safe_printf("%p: %s Module (bindings) %p (bits %d) -- [%p, %p)\n",
(void*)data, prefix, (void*)data->parent, (int)data->bits,
(void*)data->begin, (void*)data->end);
}
else {
jl_safe_printf("Unknown pc %p --- ABORTING !!!\n", pc);
break;
}
}
jl_set_safe_restore(old_buf);
}

static int gc_logging_enabled = 0;

JL_DLLEXPORT void jl_enable_gc_logging(int enable) {
Expand Down
Loading

0 comments on commit f41d92f

Please sign in to comment.