Fix invoke for vararg methods. #11151
When the invoke private method table is created, we insert into the method list manually instead of going through method_table_insert. This code path forgot to update the max_arg field, which is used to correctly compute the specialized signature for variable-argument methods. Should fix #11149.
Apart from the current failure on 32-bit Linux, maybe it's better to add a test for this? Something like:

```julia
@noinline f(a, b, args...) = (a, b, args...)
@test f(1, 2, 3) == invoke(f, Tuple{Int, Int, Int}, 1, 2, 3)
```
Yes, it probably does need a test. The failure could be my fault; I'll have a look (why does it always have to be on a 32-bit arch...)
Found the cause of the failure. I'll merge this as soon as the CI is happy.
Any insights into why things have been less stable there? Obviously most developers are on 64-bit systems day to day, but something about the runtime has seemed much more susceptible to odd intermittent bugs on 32-bit. You probably just squashed one, but I suspect there may still be a few more hiding around.
@tkelman Is the GC triggered much more frequently on 32-bit?
Yes, that would also be my guess for one of the root causes. The default collect interval on x86 is about half of that on x64. Setting it to a tenth of this is one of the ways to ease debugging those, at the cost of terrible runtime, of course.
IIRC, there was a thread on the mailing list about this debugging technique (and other GC-related ones). Is that a parameter that can be changed at runtime, so that (one of?) the CI builds can be run that way?
Yeah, we have a few flags to debug different things in the GC. The problem being that 1) it makes it very slow, and our CI is already kinda clogged, and 2) the failure mode is usually to crash earlier, but since we don't have an easy way to get the dump in gdb it doesn't help that much. We could definitely make it a lot better though. I'm hopeful that something good will come out of the rr stuff, and we may want to run this particular test build under a heavy debugging configuration.
Fix invoke for vararg methods.
I just did the following change to `src/gc.c`. Is there anything that is clearly wrong about the change I've made to the GC (apart from making it well over 100x slower), e.g. breaking some subtle assumption in the GC itself, that could make this a false positive?

```diff
diff --git a/src/gc.c b/src/gc.c
index 3e853b2..47b6ff1 100644
--- a/src/gc.c
+++ b/src/gc.c
@@ -753,7 +753,7 @@ no_decommit:
 static inline int maybe_collect(void)
 {
-    if (should_collect()) {
+    if (should_collect() || getenv("JULIA_DEBUG_GC")) {
         jl_gc_collect(0);
         return 1;
     }
@@ -2140,9 +2140,18 @@ void prepare_sweep(void)
 static void clear_mark(int);
 #endif
+static unsigned _gc_count = 0;
 void jl_gc_collect(int full)
 {
+    if (full < 0) {
+        full = 0;
+    } else if (getenv("JULIA_DEBUG_GC")) {
+        _gc_count++;
+        if (_gc_count % 10000 == 0)
+            jl_printf(JL_STDOUT, "GC triggered: %u\n", _gc_count);
+        full = 1;
+    }
     if (!is_gc_enabled) return;
     if (jl_in_gc) return;
     jl_in_gc = 1;
@@ -2359,7 +2368,7 @@ void jl_gc_collect(int full)
     }
 #endif
     if (recollect)
-        jl_gc_collect(0);
+        jl_gc_collect(-1);
 }
 // allocator entry points
```
And by "reliable" I mean it appeared two out of the two times I ran it.
You are on the right path. A few comments: In this particular case it looks like it's happening in Julia code, so it may be an issue with the generated rooting in codegen (e.g. something introduced by the recent ccall change). You never know, though, because the corruption could have happened before we got here; but the smaller the "collect interval", the less likely it is to be non-local. I'll try to reproduce this specific error in a bit. In the meantime feel free to ask anything, and thanks for looking into this kind of thing.
I'll add that this kind of hack is usually what it ends up looking like to find these errors. You can even run the beginning of bootstrap while collecting after every allocation! If the error is further down the code execution you can also use variations of
@carnaval I've actually just realized that I haven't included your fix yet, and I'm rerunning the test.
Cool, I'll look into what you find. I don't think this fixes it since I see no use of staged functions around here, but who knows, those bugs are hard to predict.
Hmmm. I thought collecting more aggressively would always expose more problems. Is that not true?
Ah. Found it. I thought I did a search for "collect" in
Any suggestions on how to narrow this down?
No, it's not random at all. And you are right that using the latest master (which AFAIK includes your fix) does not make the problem disappear. I'm using the system
Well, I'm doing this mainly because I feel like this is a better way to use my spare processor power than running, e.g., SETI@Home (which is not to say SETI@Home is a bad choice), and I don't know enough to do fancier debugging.
Do you mean basically generating the
As a general principle yes, but in practice it changes the allocation pattern a lot; also, you could be hitting a bug in the generational stuff in the GC, of course :) I don't think your mpfr is the problem; it would probably crash every time if it was, but who knows.
@carnaval So I've left 4 instances running the version I had above (so pool allocation will not always trigger a collection) in gdb and have gotten a very repeatable segfault. According to the counter I added, all of them happen at exactly the same collection. Backtrace here; not sure how much it can help... =(
The line in the
and just for the record,
Compressed core dump file here
@carnaval Added the hack in the pool allocator as you suggested (see below for the diff) and got a segfault much more easily... a segfault happens as early as base initialization... (Full backtrace later...)
Current diff:

```diff
diff --git a/src/gc.c b/src/gc.c
index 3e853b2..c082132 100644
--- a/src/gc.c
+++ b/src/gc.c
@@ -753,7 +753,7 @@ no_decommit:
 static inline int maybe_collect(void)
 {
-    if (should_collect()) {
+    if (should_collect() || getenv("JULIA_DEBUG_GC")) {
         jl_gc_collect(0);
         return 1;
     }
@@ -1027,7 +1027,7 @@ static NOINLINE void add_page(pool_t *p)
 static inline void *__pool_alloc(pool_t* p, int osize, int end_offset)
 {
     gcval_t *v, *end;
-    if (__unlikely((allocd_bytes += osize) >= 0)) {
+    if (__unlikely((allocd_bytes += osize) >= 0) || getenv("JULIA_DEBUG_GC")) {
         //allocd_bytes -= osize;
         jl_gc_collect(0);
         //allocd_bytes += osize;
@@ -2140,9 +2140,18 @@ void prepare_sweep(void)
 static void clear_mark(int);
 #endif
+static unsigned _gc_count = 0;
 void jl_gc_collect(int full)
 {
+    if (full < 0) {
+        full = 0;
+    } else if (getenv("JULIA_DEBUG_GC")) {
+        _gc_count++;
+        if (_gc_count % 10000 == 0)
+            jl_printf(JL_STDOUT, "GC triggered: %u\n", _gc_count);
+        full = 1;
+    }
     if (!is_gc_enabled) return;
     if (jl_in_gc) return;
     jl_in_gc = 1;
@@ -2359,7 +2368,7 @@ void jl_gc_collect(int full)
     }
 #endif
     if (recollect)
-        jl_gc_collect(0);
+        jl_gc_collect(-1);
 }
 // allocator entry points
```
My
Disassembly of  I've also added some code that uses execinfo to print out the backtrace before the crash happens, and it seems that the call at
Hey, glad to see you're diving into this kind of issue. I've tried running a few of those in "always-gc" mode but it doesn't segfault here. It may be a platform-specific issue. What are you running on? Unfortunately the core dumps are not very useful, since I believe I would need the exact same binary and shared libs as you to get meaningful results. Looking at your investigation, it smells a lot like a ccall problem. I'm going to spend a few moments today looking at the generated code to see if there is an obvious issue, but it's hard without reproducing. If this doesn't go anywhere I'll try to help you investigate further, maybe tomorrow? If you get bored of this we can also arrange for me to get ssh access to the machine this is faulting on, if that's ok with you. Thanks!
@carnaval I've also just tried running generation of  here is the gdb session... too lazy to paste all the pointer values, but in short the segfault happens in the
Is there a place that has the dump of the LLVM IR of the
P.S. @carnaval I've sent you a private email about debugging on my machine to your git address.
@staticfloat Hey, we found the cause (thanks to @yuyichao, who actually came in person!). Would it be reasonable to at least run some of the buildbots with GC_VERIFY on?
Nice work guys.
My vote there is yes; we already have buildbot jobs for LLVM svn and coverage and a boatload of different platforms, so I don't see any harm in adding one more to help with GC debugging. How would the build process for that job differ from a normal build?
Yeah, this should be possible. Can you give me an idea of how much extra runtime it adds?
As much as we want :) I'll add a flag to the Makefile to run under real_slow_gc_debug and we'll see how long it takes, and tune it down if the csail cloud is sad.
Also, since I didn't say it in my original post, this is an incredible amount of work on both of your parts. Nice job!
Agreed with @staticfloat. Hunting down segfaults is difficult work, but talk about impact...
IMO stability of the runtime is as big a deal as breaking feature changes when it comes to declaring master release-worthy, so this kind of work is huge. |