
Fix invoke for vararg methods. #11151

Merged: 2 commits into master from ob/fixvainv, May 7, 2015

Conversation

@carnaval (Contributor) commented May 6, 2015

When the invoke private method table is created we insert into the
method list manually instead of going through method_table_insert.
This codepath was forgetting to update the max_arg field which is
used to correctly compute the specialized signature for variable
argument methods. Should fix #11149.

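For illustration, here is a minimal sketch of the shape of the fix, assuming a method table struct with a max_args counter as described in the commit message; every name below is a hypothetical stand-in, not the actual julia internals:

// Hypothetical sketch (all names invented): a method table that caches
// the maximum argument count seen across its methods.
typedef struct methlist { struct methlist *next; } methlist_t;
typedef struct { methlist_t *defs; size_t max_args; } methtable_t;

static void manual_list_insert(methlist_t **head, methlist_t *entry)
{
    entry->next = *head;  // prepend, standing in for the real list logic
    *head = entry;
}

static void invoke_table_insert(methtable_t *mt, methlist_t *entry, size_t nargs)
{
    manual_list_insert(&mt->defs, entry);
    // The fix: this update was missing on the invoke codepath, so the
    // specialized signature for vararg methods was computed wrongly.
    if (nargs > mt->max_args)
        mt->max_args = nargs;
}
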
@yuyichao (Contributor) commented May 6, 2015

Apart from the current failure on 32-bit Linux, maybe it's better to add a test for this?

Something like:

@noinline f(a, b, args...) = (a, b, args...)
@test f(1, 2, 3) == invoke(f, Tuple{Int, Int, Int}, 1, 2, 3)

@carnaval (Contributor, Author) commented May 6, 2015

Yes, it probably needs a test. The failure could be my fault, I'll have a look (why does it always have to be on 32-bit arch...)

@carnaval (Contributor, Author) commented May 6, 2015

Found the cause of the failure. I'll merge this as soon as the CI is happy.

@tkelman (Contributor) commented May 6, 2015

why does it always have to be on 32-bit arch...

Any insights into why things have been less stable there? Obviously most developers are on 64-bit systems day to day, but something about the runtime has seemed much more susceptible to odd intermittent bugs on 32-bit. You probably just squashed one, but I suspect there may still be a few more hiding around.

@yuyichao (Contributor) commented May 6, 2015

@tkelman Is the GC triggered much more frequently on 32-bit?

@carnaval (Contributor, Author) commented May 6, 2015

Yes, that would also be my guess for one of the root causes. The default collect interval on x86 is about half of that on x64. Setting it to a tenth of this is one way to ease debugging those, at the cost of a terrible runtime of course.
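
For reference, a sketch of what that knob looks like; the constant name and value here are assumptions about gc.c of this vintage, not verified:

// Sketch, assuming gc.c sizes the GC budget with a constant like this:
#define DEFAULT_COLLECT_INTERVAL (3200 * 1024 * sizeof(void*))
// Debugging variant: a tenth of the budget means roughly 10x more
// frequent collections, trading runtime for bug-exposure.
static size_t collect_interval = DEFAULT_COLLECT_INTERVAL / 10;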

@yuyichao (Contributor) commented May 6, 2015

IIRC, there was a thread on the mailing list about this debugging technique (and other GC-related ones). Is that a parameter that can be changed at runtime, so that (one of?) the CI builds can be run that way?

@carnaval (Contributor, Author) commented May 6, 2015

Yeah, we have a few flags to debug different things in the GC. The problems being that 1) it makes everything very slow and our CI is already kinda clogged, and 2) the failure mode is usually to crash earlier, but since we don't have an easy way to get the dump into gdb it doesn't help that much.

We could definitely make it a lot better though. I'm hopeful that something good will come out of the rr stuff, and we may want to run this particular test build under a heavy debugging configuration.

carnaval added a commit that referenced this pull request May 7, 2015
Fix invoke for vararg methods.
carnaval merged commit cd2c363 into master May 7, 2015
carnaval deleted the ob/fixvainv branch May 7, 2015 03:55
@garrison (Member) commented May 7, 2015

@tkelman I suspect there are still other bugs hiding on 32-bit too, as valgrind has been unhappy (and still is; I just checked) since the tuplocalypse (see #11003).

@yuyichao (Contributor) commented May 7, 2015

I just made the following change to gc.c and left it running the tests overnight. I seem to get a very reliable segfault on my 64-bit laptop (see below for the stack trace; the crash takes about an hour to trigger, so it's not too bad).

Is there anything clearly wrong about the change I've made to the GC (apart from making it >> 100x slower), e.g. does it break some subtle assumption in the GC itself, that could make this a false positive?

diff --git a/src/gc.c b/src/gc.c
index 3e853b2..47b6ff1 100644
--- a/src/gc.c
+++ b/src/gc.c
@@ -753,7 +753,7 @@ no_decommit:

 static inline int maybe_collect(void)
 {
-    if (should_collect()) {
+    if (should_collect() || getenv("JULIA_DEBUG_GC")) {
         jl_gc_collect(0);
         return 1;
     }
@@ -2140,9 +2140,18 @@ void prepare_sweep(void)
 static void clear_mark(int);
 #endif

+static unsigned _gc_count = 0;

 void jl_gc_collect(int full)
 {
+    if (full < 0) {
+        full = 0;
+    } else if (getenv("JULIA_DEBUG_GC")) {
+        _gc_count++;
+        if (_gc_count % 10000 == 0)
+            jl_printf(JL_STDOUT, "GC triggered: %u\n", _gc_count);
+        full = 1;
+    }
     if (!is_gc_enabled) return;
     if (jl_in_gc) return;
     jl_in_gc = 1;
@@ -2359,7 +2368,7 @@ void jl_gc_collect(int full)
     }
 #endif
     if (recollect)
-        jl_gc_collect(0);
+        jl_gc_collect(-1);
 }

 // allocator entry points
yuyichao% time JULIA_DEBUG_GC=1 ./julia -f test/runtests.jl linalg1
     * linalg1             GC triggered: 10000
[... GC triggered every 10000 collections ...]
GC triggered: 120000

signal (11): Segmentation fault
gc_push_root at /home/yuyichao/projects/contrib/julia6/src/gc.c:1464
gc_push_root at /home/yuyichao/projects/contrib/julia6/src/gc.c:1466
gc_push_root at /home/yuyichao/projects/contrib/julia6/src/gc.c:1466
gc_mark_stack at /home/yuyichao/projects/contrib/julia6/src/gc.c:1500
push_root at /home/yuyichao/projects/contrib/julia6/src/gc.c:1699
gc_mark_stack at /home/yuyichao/projects/contrib/julia6/src/gc.c:1500
push_root at /home/yuyichao/projects/contrib/julia6/src/gc.c:1699
gc_push_root at /home/yuyichao/projects/contrib/julia6/src/gc.c:1466
push_root at /home/yuyichao/projects/contrib/julia6/src/gc.c:1695
gc_push_root at /home/yuyichao/projects/contrib/julia6/src/gc.c:1466 (this frame repeated 31 times)
push_root at /home/yuyichao/projects/contrib/julia6/src/gc.c:1695
gc_push_root at /home/yuyichao/projects/contrib/julia6/src/gc.c:1466 (this frame repeated 6 times)
push_root at /home/yuyichao/projects/contrib/julia6/src/gc.c:1695
gc_push_root at /home/yuyichao/projects/contrib/julia6/src/gc.c:1466
maybe_collect at /home/yuyichao/projects/contrib/julia6/src/gc.c:757
mpfr_init2 at /usr/lib/libmpfr.so (unknown line)
call at ./mpfr.jl:52
call at ./mpfr.jl:70
triu! at ./linalg/dense.jl:78
getindex at ./linalg/factorization.jl:95
A_ldiv_B! at ./linalg/factorization.jl:354
A_ldiv_B! at ./linalg/factorization.jl:400
jl_apply_generic at /home/yuyichao/projects/contrib/julia6/src/gf.c:1750
\ at ./linalg/factorization.jl:417
jl_apply_generic at /home/yuyichao/projects/contrib/julia6/src/gf.c:1750
anonymous at ./no file:88
jl_apply at /home/yuyichao/projects/contrib/julia6/src/julia.h:1280
jl_toplevel_eval_flex at /home/yuyichao/projects/contrib/julia6/src/toplevel.c:551
jl_load at /home/yuyichao/projects/contrib/julia6/src/toplevel.c:597
include at ./boot.jl:252
jl_apply_generic at /home/yuyichao/projects/contrib/julia6/src/gf.c:1752
runtests at /home/yuyichao/projects/contrib/julia6/test/testdefs.jl:80
unknown function (ip: 509527299)
jl_apply_generic at /home/yuyichao/projects/contrib/julia6/src/gf.c:1750
jl_f_apply at /home/yuyichao/projects/contrib/julia6/src/builtins.c:472
anonymous at ./multi.jl:628
run_work_thunk at ./multi.jl:589
remotecall_fetch at ./multi.jl:662
jl_apply_generic at /home/yuyichao/projects/contrib/julia6/src/gf.c:1750
jl_f_apply at /home/yuyichao/projects/contrib/julia6/src/builtins.c:472
remotecall_fetch at ./multi.jl:677
jl_apply_generic at /home/yuyichao/projects/contrib/julia6/src/gf.c:1750
jl_f_apply at /home/yuyichao/projects/contrib/julia6/src/builtins.c:439
anonymous at ./task.jl:1392
start_task at /home/yuyichao/projects/contrib/julia6/src/task.c:234
unknown function (ip: 0)
[2]    21509 segmentation fault (core dumped)  JULIA_DEBUG_GC=1 ./julia -f test/runtests.jl linalg1
JULIA_DEBUG_GC=1 ./julia -f test/runtests.jl linalg1  3564.27s user 1.37s system 99% cpu 59:30.33 total

@yuyichao (Contributor) commented May 7, 2015

And by reliable I mean it appeared two out of the two times I ran it:

  1. with the same trace up to the garbage collector,
  2. after roughly the same number of collections (which might be implied by 1 already), and
  3. at the same line in the GC (line 1464 is int bits = gc_bits(o); after my change, BTW).
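
Since line 1464 is the first dereference of the object header, one cheap way to fail earlier with a clearer message is to sanity-check the pointer before the load; a hedged sketch (gc_push_root is the real entry point from the trace above, the wrapper and its checks are hypothetical):

#include <assert.h>
#include <stdint.h>

// Hypothetical guard: crash with an assertion instead of a raw segfault
// when a marked object's pointer is obviously bogus (NULL or misaligned),
// so the corrupt reference is caught closer to its source.
static void gc_push_root_checked(void *o, int d)
{
    assert(o != NULL && "marking a NULL root");
    assert(((uintptr_t)o % sizeof(void*)) == 0 && "misaligned object pointer");
    gc_push_root(o, d);  // the real marking entry point (src/gc.c)
}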

@carnaval (Contributor, Author) commented May 7, 2015

You're on the right track. A few comments:

  • Having full = 1 means you are debugging a non-generational GC (that's good! Just be aware, if you are chasing some error, that it can make it disappear).
  • maybe_collect is not used in the hot path where it should be, namely pool allocation (now that I think about it, the main reason for that is not true anymore, oh well). If you add the same kind of test there you will collect way more often.
  • This stack trace is typical of corrupted memory. It means either a tag is corrupt and is not a valid address, or a field is, since this load is the first one we do for every object we mark.
  • Consider disabling ASLR if your error is more random than this.

In this particular case it looks like it's happening in Julia code, so it may be an issue with the rooting generated in codegen (e.g. something introduced by the recent ccall change). You never know, though, because the corruption could have happened before we got here; but the smaller the "collect interval", the less likely it is to be non-local.

I'll try to reproduce this specific error in a bit. In the meantime feel free to ask anything, and thanks for looking into this kind of thing.
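
For readers following along, "rooting" means keeping intermediate values registered with the GC; hand-written runtime code does it with the JL_GC_PUSH/JL_GC_POP macros from julia.h, and codegen has to emit the equivalent bookkeeping. A minimal sketch (the function itself is hypothetical, the macros and calls are the real julia C API):

#include "julia.h"

// Sketch: without the PUSH/POP rooting, `a` could be collected when the
// second allocation triggers a GC, which is exactly the class of bug a
// codegen rooting mistake produces.
jl_value_t *boxed_pair(void)
{
    jl_value_t *a = NULL, *b = NULL;
    JL_GC_PUSH2(&a, &b);            // register both slots as GC roots
    a = jl_box_int64(1);            // may trigger a collection
    b = jl_box_int64(2);            // `a` survives because it is rooted
    jl_value_t *pair = (jl_value_t*)jl_svec2(a, b);
    JL_GC_POP();                    // unregister before returning
    return pair;
}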

@carnaval (Contributor, Author) commented May 7, 2015

I'll add that this kind of hack is usually what it ends up looking like when hunting those errors. You can even run the beginning of bootstrap while collecting after every allocation! If the error is further down the code's execution you can also use variations of n_pause > A && !(n_pause % B), etc.

@yuyichao (Contributor) commented May 7, 2015

@carnaval I've actually just realized that I haven't included your fix yet, and I'm rerunning the test.
As long as the change I've made is not likely to generate false positives, I will just let it run during the day while I'm doing something else.
I'll also look into your suggestions later (thanks). Not sure I understand all of them, but I'll ask if I need some help...

@carnaval (Contributor, Author) commented May 7, 2015

Cool, I'll look into what you find. I don't think this fixes it, since I see no use of staged functions around here, but who knows; those bugs are hard to predict.

@yuyichao (Contributor) commented May 7, 2015

Having full = 1 means you are debugging a non-generational GC (that's good! Just be aware, if you are chasing some error, that it can make it disappear).

Hmmm. I thought collecting more aggressively would always expose more problems. Is that not true?

maybe_collect is not used in the hot path where it should be, namely pool allocation (now that I think about it, the main reason for that is not true anymore, oh well). If you add the same kind of test there you will collect way more often.

Ah, found it. I thought I did a search for "collect" in gc.c but somehow missed it.

This stack trace is typical of corrupted memory. It means either a tag is corrupt and is not a valid address, or a field is, since this load is the first one we do for every object we mark.

Any suggestions on how to narrow this down?

Consider disabling ASLR if your error is more random than this.

No, it's not random at all. And you are right that using the latest master (which AFAIK includes your fix) does not make the problem disappear.

I'm using the system mpfr from the Arch Linux official repo. Not sure if that should make a difference.

I'll add that this kind of hack is usually what it ends up looking like when hunting those errors.

Well, I'm doing this mainly because I feel it's a better way to use my spare processor power than running, e.g., SETI@Home (which is not to say SETI@Home is wrong either), and I don't know enough to do fancier debugging.

You can even run the beginning of bootstrap while collecting after every allocation! If the error is further down the code's execution you can also use variations of n_pause > A && !(n_pause % B), etc.

Do you mean basically generating the sys.so with this on? Also, could you explain the n_pause stuff a little more?

@carnaval (Contributor, Author) commented May 7, 2015

Hmmm. I thought collecting more aggressively would always expose more problems. Is that not true?

As a general principle, yes, but in practice it changes the allocation pattern a lot; also, you could of course be hitting a bug in the generational machinery of the GC :)

I don't think your mpfr is the problem; it would probably crash every time if it was, but who knows.

n_pause is just the number of collections since startup; you can use it to "target" your error. Say it always happens between collections 11 and 12, but only once in every 100 runs. Then you can make maybe_collect & friends collect when normal_condition || (n_pause >= 11 && !(n_pause % 1000)) to make it more deterministic.
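
Spliced into maybe_collect from the diff above, that targeting condition would look roughly like this; a sketch only, assuming n_pause is the GC's existing pause counter and reusing the example thresholds from the previous paragraph:

// Sketch: keep the normal heuristic, but once past the suspect region
// (collection 11 onward) also force periodic collections, so the crash
// reproduces deterministically without collecting on every allocation.
static inline int maybe_collect(void)
{
    if (should_collect() || (n_pause >= 11 && !(n_pause % 1000))) {
        jl_gc_collect(0);
        return 1;
    }
    return 0;
}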

@yuyichao (Contributor) commented May 7, 2015

@carnaval So I've had 4 instances running the version I had above (so pool allocation will not always trigger a collection) in gdb, and have gotten a very repeatable segfault. According to the counter I added, all of them happen at exactly collection 123484. I still have all the instances open and am trying to save the core files (which generates a few warnings). Please tell me if there's anything I can do with those to help.

Backtrace here; not sure how much it can help... =(

@yuyichao (Contributor) commented May 7, 2015

The line in init2.c from mpfr is tmp = (mpfr_limb_ptr) (*__gmp_allocate_func)(MPFR_MALLOC_SIZE(xsize)); so I guess it makes some sense that it calls back into Julia...

@yuyichao (Contributor) commented May 7, 2015

And just for the record, mpfr's allocate function is jl_gc_counted_malloc.
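
Which explains the trace: MPFR's allocations are routed through Julia's counted malloc so the GC can account for foreign memory, and that wrapper may itself collect. A simplified sketch of the idea (the real implementation lives in src/gc.c and differs in details such as error handling):

#include <stdlib.h>

// Simplified sketch of a counted malloc: give the GC a chance to run,
// charge the allocation to its budget, then allocate.  This is why an
// mpfr_init2 call can appear directly above maybe_collect in the backtrace.
void *jl_gc_counted_malloc(size_t sz)
{
    maybe_collect();         // may trigger a full collection
    allocd_bytes += sz;      // account the foreign allocation
    return malloc(sz);
}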

@yuyichao (Contributor) commented May 7, 2015

Compressed core dump file here

@yuyichao (Contributor) commented May 7, 2015

@carnaval Added the hack in pool allocation as you suggested (see below for the diff) and got a segfault much more easily... the segfault happens as early as base initialization. (Full backtrace later...)

yuyichao% JULIA_DEBUG_GC=1 ./julia -f -e '0'                  
GC triggered: 10000
[... GC triggered every 10000 collections ...]
GC triggered: 290000

signal (11): Segmentation fault
gc_push_root at /home/yuyichao/projects/contrib/julia7/src/gc.c:1464
gc_push_root at /home/yuyichao/projects/contrib/julia7/src/gc.c:1466
gc_push_root at /home/yuyichao/projects/contrib/julia7/src/gc.c:1466
gc_mark_stack at /home/yuyichao/projects/contrib/julia7/src/gc.c:1500
push_root at /home/yuyichao/projects/contrib/julia7/src/gc.c:1699
gc_push_root at /home/yuyichao/projects/contrib/julia7/src/gc.c:1466
push_root at /home/yuyichao/projects/contrib/julia7/src/gc.c:1695
gc_push_root at /home/yuyichao/projects/contrib/julia7/src/gc.c:1466 (this frame repeated 42 times)
push_root at /home/yuyichao/projects/contrib/julia7/src/gc.c:1695
gc_push_root at /home/yuyichao/projects/contrib/julia7/src/gc.c:1466
push_root at /home/yuyichao/projects/contrib/julia7/src/gc.c:1695
gc_push_root at /home/yuyichao/projects/contrib/julia7/src/gc.c:1466
__pool_alloc at /home/yuyichao/projects/contrib/julia7/src/gc.c:1036
newobj at /home/yuyichao/projects/contrib/julia7/src/julia_internal.h:22
jl_f_get_field at /home/yuyichao/projects/contrib/julia7/src/builtins.c:675
init_stdio at ./stream.jl:266
reinit_stdio at ./stream.jl:289
__init__ at /home/yuyichao/projects/contrib/julia7/base/sysimg.jl:309
unknown function (ip: 316317945)
jl_apply_generic at /home/yuyichao/projects/contrib/julia7/src/gf.c:1759
jl_eh_restore_state at /home/yuyichao/projects/contrib/julia7/src/julia.h:1368
jl_init_restored_modules at /home/yuyichao/projects/contrib/julia7/src/dump.c:1537
_julia_init at /home/yuyichao/projects/contrib/julia7/src/init.c:1188
julia_init at /home/yuyichao/projects/contrib/julia7/src/task.c:256
unknown function (ip: 4199219)
__libc_start_main at /usr/lib/libc.so.6 (unknown line)
unknown function (ip: 4199321)
unknown function (ip: 0)
[1]    28408 segmentation fault (core dumped)  JULIA_DEBUG_GC=1 ./julia -f -e '0'

Current diff

diff --git a/src/gc.c b/src/gc.c
index 3e853b2..c082132 100644
--- a/src/gc.c
+++ b/src/gc.c
@@ -753,7 +753,7 @@ no_decommit:

 static inline int maybe_collect(void)
 {
-    if (should_collect()) {
+    if (should_collect() || getenv("JULIA_DEBUG_GC")) {
         jl_gc_collect(0);
         return 1;
     }
@@ -1027,7 +1027,7 @@ static NOINLINE void add_page(pool_t *p)
 static inline void *__pool_alloc(pool_t* p, int osize, int end_offset)
 {
     gcval_t *v, *end;
-    if (__unlikely((allocd_bytes += osize) >= 0)) {
+    if (__unlikely((allocd_bytes += osize) >= 0) || getenv("JULIA_DEBUG_GC")) {
         //allocd_bytes -= osize;
         jl_gc_collect(0);
         //allocd_bytes += osize;
@@ -2140,9 +2140,18 @@ void prepare_sweep(void)
 static void clear_mark(int);
 #endif

+static unsigned _gc_count = 0;

 void jl_gc_collect(int full)
 {
+    if (full < 0) {
+        full = 0;
+    } else if (getenv("JULIA_DEBUG_GC")) {
+        _gc_count++;
+        if (_gc_count % 10000 == 0)
+            jl_printf(JL_STDOUT, "GC triggered: %u\n", _gc_count);
+        full = 1;
+    }
     if (!is_gc_enabled) return;
     if (jl_in_gc) return;
     jl_in_gc = 1;
@@ -2359,7 +2368,7 @@ void jl_gc_collect(int full)
     }
 #endif
     if (recollect)
-        jl_gc_collect(0);
+        jl_gc_collect(-1);
 }

 // allocator entry points

@yuyichao (Contributor) commented May 7, 2015

gdb backtrace

@yuyichao (Contributor) commented May 7, 2015

My _gc_count value is also extremely repeatable.... 291932

@pao (Member) commented May 7, 2015

@yuyichao That's some long output you have there. When there's this much, it's probably better to put the long debugging output in a gist and link to it here, so the conversation is easier to follow.

@yuyichao (Contributor) commented May 7, 2015

Disassembly of julia_call_64802, which AFAICT is the constructor of TTY

I've also added some code that uses execinfo to print out the backtrace before the crash happens, and it seems that the call at <julia_call_64802+777> does not trigger a segfault while the one at <julia_call_64802+814> very repeatably does. Maybe this can narrow down which variable (object?) is causing the issue?
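
The execinfo part is roughly the standard glibc recipe (a sketch, not yuyichao's exact code):

#include <execinfo.h>
#include <unistd.h>

// Dump the current native backtrace to stderr.  Calling this just before
// the suspect instruction shows, run after run, exactly how far each
// attempt gets before the fault.
static void dump_backtrace(void)
{
    void *frames[64];
    int n = backtrace(frames, 64);
    backtrace_symbols_fd(frames, n, STDERR_FILENO);
}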

@carnaval (Contributor, Author) commented May 7, 2015

Hey, glad to see you're diving into this kind of issue. I've tried running a few of those in "always-gc" mode but it doesn't segfault here. It may be a platform-specific issue. What are you running on? Unfortunately the core dumps are not very useful, since I believe I would need the exact same binary & shared libs as you to get meaningful results.

Looking at your investigation, it smells a lot like a ccall problem. I'm going to spend a few moments today looking at the generated code to see if there is an obvious issue, but it's hard without reproducing it. If this doesn't go anywhere I'll try to help you investigate further, maybe tomorrow? If you get bored of this we can also arrange for me to get ssh access to the machine this is faulting on, if that's OK with you.

Thanks!

@yuyichao (Contributor) commented May 7, 2015

@carnaval I'm running on Arch Linux.

I've also just tried running the generation of sys0.so and got another fault before inference is on; hopefully this one is simpler?

Here is the gdb session... too lazy to paste all the pointer values, but in short the segfault happens in jl_inst_concrete_tupletype_v in arg_type_tuple, but not in the jl_wrap_Type before it...

@yuyichao (Contributor) commented May 7, 2015

Is there a place that has a dump of the LLVM IR for sys.so? Would that be easier to diagnose? Would sys.ji do?

@yuyichao (Contributor) commented May 7, 2015

P.S. @carnaval I've sent you a private email, to your git address, about debugging on my machine.

tkelman mentioned this pull request May 8, 2015
@carnaval (Contributor, Author) commented May 8, 2015

@staticfloat Hey, we found the cause (thanks to @yuyichao, who actually came in person!). Would it be reasonable to at least run some of the buildbots with GC_VERIFY on?

@tkelman (Contributor) commented May 8, 2015

Nice work guys.

Would it be reasonable to at least run some of the buildbots with GC_VERIFY on ?

My vote there is yes; we already have buildbot jobs for LLVM svn and coverage and a boatload of different platforms, so I don't see any harm in adding one more to help with GC debugging. How would the build process for that job differ from a normal build?

@staticfloat (Member) commented

yeah, this should be possible. Can you give me an idea of how much extra runtime it adds?

@carnaval (Contributor, Author) commented May 8, 2015

As much as we want :)

I'll add a flag to the Makefile to run under real_slow_gc_debug, and we'll see how long it takes and tune it down if the csail cloud is sad.

@staticfloat (Member) commented

Also, since I didn't say it in my original post, this is an incredible amount of work on both of your parts. Nice job!

@timholy (Member) commented May 9, 2015

Agreed with @staticfloat. Hunting down segfaults is difficult work, but talk about impact...

@tkelman (Contributor) commented May 9, 2015

IMO, stability of the runtime is as big a deal as breaking feature changes when it comes to declaring master release-worthy, so this kind of work is huge.

Successfully merging this pull request may close these issues:

Wrong result for kwsorter-for-call-like function