Fix invoke for vararg methods. #11151
When the invoke private method table is created, we insert into the method list manually instead of going through method_table_insert. This code path forgot to update the max_arg field, which is used to correctly compute the specialized signature for variable-argument methods. Should fix #11149.
Apart from the current failure on 32-bit Linux, maybe it's better to add a test for this? Something like:

```julia
@noinline f(a, b, args...) = (a, b, args...)
@test f(1, 2, 3) == invoke(f, Tuple{Int, Int, Int}, 1, 2, 3)
```
Yes, it probably does need a test. The failure could be my fault; I'll have a look (why does it always have to be on a 32-bit arch...)
Found the cause of the failure. I'll merge this as soon as the CI is happy.
Any insights into why things have been less stable there? Obviously most developers are on 64-bit systems day to day, but something about the runtime has seemed much more susceptible to odd intermittent bugs on 32-bit. You probably just squashed one, but I suspect there may still be a few more hiding around.
@tkelman Is the GC triggered much more frequently on 32-bit?
Yes, that would also be my guess for one of the root causes. The default collect interval on x86 is about half of that on x64. Setting it to a tenth of this is one of the ways to ease debugging those, at the cost of terrible runtime, of course.
IIRC, there was a thread on the mailing list about this debugging technique (and other GC-related ones). Is that a parameter that can be changed at runtime, so that (one of?) the CI builds can be run that way?
Yeah, we have a few flags to debug different things in the GC. The problem being that 1) it makes it very slow, and our CI is already kinda clogged, and 2) the failure mode is usually to crash earlier, but since we don't have an easy way to get the dump in gdb it doesn't help that much. We could definitely make it a lot better though. I'm hopeful that something good will come out of the rr stuff, and we may want to run this particular test build under a heavy debugging configuration.
Fix invoke for vararg methods.
I just did the following change to `src/gc.c`. Is there anything that is clearly wrong about the change I've made to the GC (apart from making it well over 100x slower), e.g. breaking some subtle assumption in the GC itself, that could make this a false positive?

```diff
diff --git a/src/gc.c b/src/gc.c
index 3e853b2..47b6ff1 100644
--- a/src/gc.c
+++ b/src/gc.c
@@ -753,7 +753,7 @@ no_decommit:
 static inline int maybe_collect(void)
 {
-    if (should_collect()) {
+    if (should_collect() || getenv("JULIA_DEBUG_GC")) {
         jl_gc_collect(0);
         return 1;
     }
@@ -2140,9 +2140,18 @@ void prepare_sweep(void)
 static void clear_mark(int);
 #endif
+static unsigned _gc_count = 0;
 void jl_gc_collect(int full)
 {
+    if (full < 0) {
+        full = 0;
+    } else if (getenv("JULIA_DEBUG_GC")) {
+        _gc_count++;
+        if (_gc_count % 10000 == 0)
+            jl_printf(JL_STDOUT, "GC triggered: %u\n", _gc_count);
+        full = 1;
+    }
     if (!is_gc_enabled) return;
     if (jl_in_gc) return;
     jl_in_gc = 1;
@@ -2359,7 +2368,7 @@ void jl_gc_collect(int full)
     }
 #endif
     if (recollect)
-        jl_gc_collect(0);
+        jl_gc_collect(-1);
 }
 // allocator entry points
```
And by "reliable" I mean it appeared two out of the two times I ran it.
You are on the right path. A few comments: In this particular case it looks like it's happening in Julia code, so it may be an issue with the generated rooting in codegen (e.g. something introduced by the recent ccall change). You never know, though, because the corruption could have happened before we got here; but the smaller the "collect interval", the less likely it is to be non-local. I'll try to reproduce this specific error in a bit. In the meantime feel free to ask anything, and thanks for looking into this kind of thing.
I'll add that this kind of hack is usually what it ends up looking like to find these errors. You can even run the beginning of bootstrap while collecting after every allocation! If the error is further down the code execution you can also use variations of
@carnaval I've actually just realized that I haven't included your fix yet, and I'm rerunning the test.
Cool, I'll look into what you find. I don't think this fixes it since I see no use of staged functions around here, but who knows, those bugs are hard to predict.
Hmmm. I thought collecting more aggressively would always expose more problems. Is that not true?
Ah. Found it. I thought I did a search for "collect" in
Any suggestions on how to narrow this down?
No, it's not random at all. And you are right that using the latest master (which AFAIK includes your fix) does not make the problem disappear. I'm using the system
Well, I'm doing this mainly because I feel like this is a better way to use my spare processor power than running, e.g., SETI@Home (which is not to say SETI@Home is a bad choice), and I don't know enough to do fancier debugging.
Do you mean basically generating the
As a general principle yes, but in practice it changes the allocation pattern a lot; also, you could be hitting a bug in the generational stuff in the GC, of course :) I don't think your mpfr is the problem; it would probably crash every time if it was, but who knows.
@carnaval So I've left 4 instances running the version I had above (so pool allocation will not always trigger a collection) in gdb and have gotten a very repeatable segfault. According to the counter I added, all of them happen at exactly the same collection. Backtrace here; not sure how much it can help... =(
The line in the
and just for the record,
Compressed core dump file here
@carnaval Added the hack in the pool allocator as you suggested (see below for the diff) and got a segfault much more easily... a segfault happens as early as base initialization... (Full backtrace later...)
Current diff:

```diff
diff --git a/src/gc.c b/src/gc.c
index 3e853b2..c082132 100644
--- a/src/gc.c
+++ b/src/gc.c
@@ -753,7 +753,7 @@ no_decommit:
 static inline int maybe_collect(void)
 {
-    if (should_collect()) {
+    if (should_collect() || getenv("JULIA_DEBUG_GC")) {
         jl_gc_collect(0);
         return 1;
     }
@@ -1027,7 +1027,7 @@ static NOINLINE void add_page(pool_t *p)
 static inline void *__pool_alloc(pool_t* p, int osize, int end_offset)
 {
     gcval_t *v, *end;
-    if (__unlikely((allocd_bytes += osize) >= 0)) {
+    if (__unlikely((allocd_bytes += osize) >= 0) || getenv("JULIA_DEBUG_GC")) {
         //allocd_bytes -= osize;
         jl_gc_collect(0);
         //allocd_bytes += osize;
@@ -2140,9 +2140,18 @@ void prepare_sweep(void)
 static void clear_mark(int);
 #endif
+static unsigned _gc_count = 0;
 void jl_gc_collect(int full)
 {
+    if (full < 0) {
+        full = 0;
+    } else if (getenv("JULIA_DEBUG_GC")) {
+        _gc_count++;
+        if (_gc_count % 10000 == 0)
+            jl_printf(JL_STDOUT, "GC triggered: %u\n", _gc_count);
+        full = 1;
+    }
     if (!is_gc_enabled) return;
     if (jl_in_gc) return;
     jl_in_gc = 1;
@@ -2359,7 +2368,7 @@ void jl_gc_collect(int full)
     }
 #endif
     if (recollect)
-        jl_gc_collect(0);
+        jl_gc_collect(-1);
 }
 // allocator entry points
```
My
Disassembly of  I've also added some code that uses execinfo to print out the backtrace before the crash happens, and it seems that the call at
Hey, glad to see you're diving into this kind of issue. I've tried running a few of those in "always-gc" mode but it doesn't segfault here. It may be a platform-specific issue. What are you running on? Unfortunately the core dumps are not very useful, since I believe I would need the exact same binary and shared libs as you to get meaningful results. Looking at your investigation, it smells a lot like a ccall problem. I'm going to spend a few moments today looking at the generated code to see if there is an obvious issue, but it's hard without reproducing. If this doesn't go anywhere I'll try to help you investigate further, maybe tomorrow? If you get bored of this we can also arrange for me to get ssh access to the machine this is faulting on, if that's ok with you. Thanks!
@carnaval I've also just tried running generation of  here is the gdb session... too lazy to paste all the pointer values, but in short the segfault happens in the
Is there a place that has the dump of the LLVM IR of the
P.S. @carnaval I've sent you a private email about debugging on my machine to your git address.
@staticfloat Hey, we found the cause (thanks to @yuyichao, who actually came in person!). Would it be reasonable to at least run some of the buildbots with GC_VERIFY on?
Nice work guys.
My vote there is yes; we already have buildbot jobs for LLVM svn and coverage and a boatload of different platforms, so I don't see any harm in adding one more to help with GC debugging. How would the build process for that job differ from a normal build?
Yeah, this should be possible. Can you give me an idea of how much extra runtime it adds?
As much as we want :) I'll add a flag to the Makefile to run under real_slow_gc_debug and we'll see how long it takes, and tune it down if the csail cloud is sad.
Also, since I didn't say it in my original post, this is an incredible amount of work on both of your parts. Nice job!
Agreed with @staticfloat. Hunting down segfaults is difficult work, but talk about impact...
IMO stability of the runtime is as big a deal as breaking feature changes when it comes to declaring master release-worthy, so this kind of work is huge. |