-
-
Notifications
You must be signed in to change notification settings - Fork 5.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Make it easier to find/test/reproduce GC bugs #11358
Conversation
Seems good. Would you be willing to also add another flag to enable crazy-gc mode (but keep it disabled for CI) (~ collect every N pool alloc || M big alloc), as well as a flag to disable generational stuff ? (just have to change the only place with |
@carnaval Yeah, I was actually porting my Do you want the flag to disable generational stuff a runtime flag or a compile time flag? (or both) |
This slows down CI by about 25%, and more significantly it doesn't seem quite right to run every CI build in a configuration that very few people are testing with locally. |
@tkelman I guess it is important to make sure the default configuration works although the effect of I added the option to the CI build because I think it would have caught some GC issues much earlier. As for really-slow-gc tests (i.e. more aggressive GC), this PR should be flexible enough to cover a lot of conditions and for debugging, however, given how strange the condition is to trigger a few recently discovered GC issues, I'm not sure what kinds of test would be the best. Maybe sth like Fuzz.jl ? @staticfloat |
The buildbot results are already public, check out this page. An extra 25% runtime isn't something to cry over. Do I need to wait for this branch to be merged before adding Every time a commit comes in, it kicks off the quickbuild builders. These builders (Ubuntu, CentOS and OSX) build Julia and run the testsuite. If everything works, the build passes on from the quickbuild builders to the packaging builders which make the nightly Here's what I propose: either add |
@staticfloat If the quickbuild is what generates the nighly binaries I guess it should probably not have How hard it would be to setup some additional tests to stress test the GC on the buildbots? (And as I said, I need advice on what to do for those tests). These tests doesn't have to be run everyday. Even a weekly/monthly test would make identifying the issue much easier. As for time estimation, the whole julia test-suite (which will probably be included in additional to some randomized tests) can take ~ 1 week to finish on a high end mobile processor. (so running it nightly can be pointless....) |
I see, so you're looking for something to run, say, once a week at 1am on a Sunday or something. :P Well, once this is merged, ping me, and I'll setup something similar to what we do with the LLVM SVN nightlies. |
Yes, and it will probably run for a week until the next one starts or an issue is discovered.
Sounds good. Thanks. |
@carnaval is this what you want? I've also removed the |
Now with the GC debug option, the SegFault handler should print the GC counter to guide setting environment variables. |
@carnaval Is this looking good for you or is there anything that I should tweak/add. P.S. I haven't been running these tests for a month now and I'm wondering if sth silently breaks again.... |
So a real world example to show what this PR looks like. A few things to note here. The three There's another environment variable which controls whether the generational stuff is on. And when the segfault is triggered, the last line will print the current GC count so that you can set the collection around that count to trigger the issue without waiting for too long. yuyichao% JL_GC_ALLOC_PRINT=0:10 JL_GC_ALLOC_POOL=0 JL_GC_ALLOC_OTHER=0 ../julia -C native --build "${PWD}/"../inference0 -f coreimg.jl
Allocations: 13861 (Pool: 13861; Other: 0); GC: 9
Allocations: 13871 (Pool: 13871; Other: 0); GC: 19
Allocations: 13881 (Pool: 13881; Other: 0); GC: 29
Allocations: 13891 (Pool: 13891; Other: 0); GC: 39
Allocations: 13901 (Pool: 13901; Other: 0); GC: 49
Allocations: 13911 (Pool: 13911; Other: 0); GC: 59
Allocations: 13921 (Pool: 13921; Other: 0); GC: 69
Allocations: 13931 (Pool: 13931; Other: 0); GC: 79
Allocations: 13941 (Pool: 13941; Other: 0); GC: 89
Allocations: 13951 (Pool: 13951; Other: 0); GC: 99
Allocations: 13961 (Pool: 13961; Other: 0); GC: 109
Allocations: 13971 (Pool: 13971; Other: 0); GC: 119
Allocations: 13981 (Pool: 13981; Other: 0); GC: 129
Allocations: 13991 (Pool: 13991; Other: 0); GC: 139
Allocations: 14001 (Pool: 14001; Other: 0); GC: 149
Allocations: 14011 (Pool: 14011; Other: 0); GC: 159
Allocations: 14021 (Pool: 14021; Other: 0); GC: 169
Allocations: 14031 (Pool: 14031; Other: 0); GC: 179
Allocations: 14041 (Pool: 14041; Other: 0); GC: 189
Allocations: 14051 (Pool: 14051; Other: 0); GC: 199
Allocations: 14061 (Pool: 14061; Other: 0); GC: 209
Allocations: 14071 (Pool: 14071; Other: 0); GC: 219
Allocations: 14081 (Pool: 14081; Other: 0); GC: 229
Allocations: 14091 (Pool: 14091; Other: 0); GC: 239
Allocations: 14101 (Pool: 14101; Other: 0); GC: 249
Allocations: 14111 (Pool: 14111; Other: 0); GC: 259
Allocations: 14121 (Pool: 14121; Other: 0); GC: 269
Allocations: 14131 (Pool: 14131; Other: 0); GC: 279
Allocations: 14141 (Pool: 14141; Other: 0); GC: 289
Allocations: 14151 (Pool: 14151; Other: 0); GC: 299
Allocations: 14161 (Pool: 14161; Other: 0); GC: 309
signal (11): 段错误
gc_push_root at /home/yuyichao/projects/julia/gc-debug/src/gc.c:1615
push_root at /home/yuyichao/projects/julia/gc-debug/src/gc.c:1849
gc_push_root at /home/yuyichao/projects/julia/gc-debug/src/gc.c:1617
push_root at /home/yuyichao/projects/julia/gc-debug/src/gc.c:1856
jl_gc_collect at /home/yuyichao/projects/julia/gc-debug/src/gc.c:2337
__pool_alloc at /home/yuyichao/projects/julia/gc-debug/src/gc.c:1187
new_binding at /home/yuyichao/projects/julia/gc-debug/src/module.c:73
jl_new_module at /home/yuyichao/projects/julia/gc-debug/src/module.c:39
jl_eval_module_expr at /home/yuyichao/projects/julia/gc-debug/src/toplevel.c:118
jl_toplevel_eval_flex at /home/yuyichao/projects/julia/gc-debug/src/toplevel.c:415
jl_eh_restore_state at /home/yuyichao/projects/julia/gc-debug/src/julia.h:1389
do_call at /home/yuyichao/projects/julia/gc-debug/src/interpreter.c:66
eval at /home/yuyichao/projects/julia/gc-debug/src/interpreter.c:212
jl_toplevel_eval_flex at /home/yuyichao/projects/julia/gc-debug/src/toplevel.c:540
jl_toplevel_eval_flex at /home/yuyichao/projects/julia/gc-debug/src/toplevel.c:569
jl_load at /home/yuyichao/projects/julia/gc-debug/src/toplevel.c:616
unknown function (ip: 4200467)
unknown function (ip: 4201392)
unknown function (ip: 4199951)
__libc_start_main at /usr/lib/libc.so.6 (unknown line)
unknown function (ip: 4200041)
unknown function (ip: 0)
Allocations: 14162 (Pool: 14162; Other: 0); GC: 310
[1] 11915 segmentation fault (core dumped) JL_GC_ALLOC_PRINT=0:10 JL_GC_ALLOC_POOL=0 JL_GC_ALLOC_OTHER=0 ../julia -C |
Not sure how, but this appears to cause a win64 compilation failure? https://ci.appveyor.com/project/StefanKarpinski/julia/build/1.0.5664/job/jmss85q7y71473f7 |
@tkelman The win64 CI has been failing for the same error. I've started a build to check the current master. The command line of gcc does seems different. Not sure why though... |
I noticed, I pushed a debugging build on a branch, let's start a new issue if I can't figure it out right away. |
@yuyichao I think I have it figured out but what I just pushed is a temporary band-aid that might break IJulia on win64 nightly. See #11682 (comment) for more discussion, I'll use that issue for this. edit: FYI the root cause of the compilation failure is that the cached |
@carnaval Finally move the debugging bits out of |
6922b5b
to
115b937
Compare
@carnaval Ping. |
This branch was useful a few times recently for reproducing/debugging some crashes. I'd like to merge this later this week if there's no further comments. I'll rebase 2-3 times before that to make sure the CI is happy about it... |
thanks for the remainder, I'm a distracted person sorry about that :-) Why do you need to check for init in allocation ? That's not very expensive compared to what we are doing in this PR but is there a reason this would not work if done in jl_gc_init ? Also, minor thing but AllocNum should probably not be camel case for consistency. Other than that, let's go, who knows, we might even get other people to use this :-) |
Will change that (to
Wasn't aware of it ( |
Done. (and github finally got the commit order right...)
This could probably get as many user as |
Haha. Well anyway this lgtm, you can merge when CI is happy. |
Make it easier to find/test/reproduce GC bugs
@staticfloat We don't have a generic framework for randomized tests yet but since this is merged, it would be nice if you can setup a buildbot with |
I would like to see first how does the CI feel about
GC_VERIFY
.Not including the GC runtime controls yet.