
Make it easier to find/test/reproduce GC bugs #11358

Merged
merged 5 commits into JuliaLang:master on Jul 14, 2015

Conversation

yuyichao
Contributor

I would like to first see how the CI feels about GC_VERIFY.
This does not include the GC runtime controls yet.

@yuyichao
Contributor Author

@carnaval

@carnaval
Contributor

Seems good. Would you be willing to also add another flag to enable a crazy-GC mode (but keep it disabled for CI), i.e. collect every N pool allocs or every M big allocs, as well as a flag to disable the generational stuff? (You just have to change the only place with sweep_mask = GC_MARKED_NOESC into sweep_mask = GC_MARKED.)
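
For illustration, a minimal sketch of what such a compile-time switch could look like, assuming (per the comment above) that GC_MARKED forces a full sweep while GC_MARKED_NOESC is the generational quick sweep; the GC_DISABLE_GEN macro name is made up here, only sweep_mask, GC_MARKED, and GC_MARKED_NOESC come from gc.c:

#ifdef GC_DISABLE_GEN
    /* no generational behavior: every collection does a full sweep */
    sweep_mask = GC_MARKED;
#else
    /* default: quick sweep of the young generation only */
    sweep_mask = GC_MARKED_NOESC;
#endif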

@yuyichao yuyichao changed the title WIP: Make it easier to find/test/reproduce GC bugs Make it easier to find/test/reproduce GC bugs May 24, 2015
@yuyichao
Contributor Author

@carnaval Yeah, I was actually porting my gc-test branch just now (and have just pushed it out).

Do you want the flag that disables the generational stuff to be a runtime flag or a compile-time flag? (or both?)

@tkelman
Contributor

tkelman commented May 25, 2015

This slows down CI by about 25%, and more significantly it doesn't seem quite right to run every CI build in a configuration that very few people are testing with locally.

@yuyichao
Contributor Author

@tkelman I guess it is important to make sure the default configuration works, although the effect of GC_VERIFY seems to be pretty small apart from the extra GC time.

I added the option to the CI build because I think it would have caught some GC issues much earlier.
I can remove GC_VERIFY from the PR's CI build (when I add the flag @carnaval wants for disabling the generational stuff), but I think we do need this and other GC tests running automatically somewhere. It doesn't have to be the GitHub CI, though.

As for really-slow GC tests (i.e. a more aggressive GC), this PR should be flexible enough to cover a lot of conditions for debugging; however, given how strange the conditions were that triggered a few recently discovered GC issues, I'm not sure what kind of tests would be best. Maybe something like Fuzz.jl?

@staticfloat
I guess at least the GC_VERIFY configuration should be something reasonable to run on a buildbot? @tkelman's number should answer your question here #11151 (comment).
Is it also possible to make the buildbot results publicly accessible?

@staticfloat
Member

The buildbot results are already public, check out this page. An extra 25% runtime isn't something to cry over. Do I need to wait for this branch to be merged before adding WITH_GC_VERIFY=1 to the buildbot make arguments? Here's how the buildbots work right now:

Every time a commit comes in, it kicks off the quickbuild builders. These builders (Ubuntu, CentOS, and OS X) build Julia and run the test suite. If everything works, the build passes on from the quickbuild builders to the packaging builders, which make the nightly .tar.gz, .dmg, .exe, Ubuntu PPA, and Fedora RPM packages. If those work, then we run coverage statistics using the newly created .tar.gz binary.

Here's what I propose: either add WITH_GC_VERIFY=1 to the quickbuild jobs, or create a new, independent job that runs after quickbuild, in parallel with the packaging jobs.

@yuyichao
Contributor Author

@staticfloat If quickbuild is what generates the nightly binaries, I guess it should probably not have WITH_GC_VERIFY on (and yes, you need this PR before doing that), since it WILL make the GC much slower.

How hard would it be to set up some additional tests to stress-test the GC on the buildbots? (And as I said, I need advice on what to do for those tests.) These tests don't have to run every day; even a weekly/monthly run would make identifying issues much easier. As for a time estimate, the whole Julia test suite (which will probably be included in addition to some randomized tests) can take ~1 week to finish on a high-end mobile processor, so running it nightly would be pointless....

@staticfloat
Member

I see, so you're looking for something to run, say, once a week at 1am on a Sunday or something. :P

Well, once this is merged, ping me, and I'll set up something similar to what we do with the LLVM SVN nightlies.

@yuyichao
Contributor Author

I see, so you're looking for something to run, say, once a week at 1am on a Sunday or something. :P

Yes, and it will probably run for a week until the next one starts or an issue is discovered.

Well, once this is merged, ping me, and I'll set up something similar to what we do with the LLVM SVN nightlies.

Sounds good. Thanks.

@yuyichao
Contributor Author

yuyichao commented Jun 3, 2015

@carnaval Is this what you want?
Most of the diff is moving the debugging stuff below the definitions of the GC constants, but the important part (setting sweep_mask) should be clearly separated from that.

I've also removed the WITH_GC_VERIFY flag from the CI build.

@yuyichao
Contributor Author

yuyichao commented Jun 4, 2015

Now, with the GC debug option, the segfault handler should print the GC counter to guide setting the environment variables.
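
A rough sketch of that idea (not the actual gc.c code; the counter names are illustrative, and fprintf is not strictly async-signal-safe, which is tolerable for a debugging aid):

#include <signal.h>
#include <stdio.h>

/* Illustrative counters; in the real runtime these live in the GC. */
static volatile long pool_allocs, big_allocs, gc_num;

static void segv_handler(int sig)
{
    /* Dump the counters so the user knows roughly which JL_GC_ALLOC_*
     * values will reproduce the crash quickly. */
    fprintf(stderr, "Allocations: %ld (Pool: %ld; Other: %ld); GC: %ld\n",
            pool_allocs + big_allocs, pool_allocs, big_allocs, gc_num);
    signal(sig, SIG_DFL); /* restore the default handler... */
    raise(sig);           /* ...and re-raise to still get the core dump */
}

int main(void)
{
    signal(SIGSEGV, segv_handler);
    /* ... run the program; any later SIGSEGV first prints the counters ... */
    return 0;
}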

@yuyichao
Contributor Author

yuyichao commented Jun 8, 2015

@carnaval Does this look good to you, or is there anything I should tweak/add?

P.S. I haven't been running these tests for a month now and I'm wondering if something has silently broken again....

@simonster simonster added the GC Garbage collector label Jun 8, 2015
@yuyichao
Contributor Author

yuyichao commented Jun 9, 2015

So here is a real-world example of what this PR looks like in use.

A few things to note here. The three JL_GC_ALLOC_* environment variables control the interval and frequency at which the GC (or the printing) is triggered. The format is start[:step[:end]], so 0:1000 means start from the first allocation and trigger every 1000 after that (sorry, it doesn't mean the same thing as 0:1000 in Julia, mainly because this form is much easier to parse with scanf in C...).

There's another environment variable that controls whether the generational collection is on.

And when the segfault is triggered, the last line prints the current GC count, so that you can set the collection to happen around that count and trigger the issue without waiting too long.
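
To make the format concrete, here is a sketch of how a start[:step[:end]] value could be parsed and used in C. Only the JL_GC_ALLOC_* names and the format itself come from the description above; the struct, the helper names, and the treatment of missing step/end fields are assumptions:

#include <stdio.h>
#include <stdlib.h>

/* One parsed range; the PR later names a similar type jl_alloc_num_t. */
typedef struct {
    unsigned long long start;
    unsigned long long step;  /* 0 assumed to mean "trigger only at start" */
    unsigned long long end;   /* 0 assumed to mean "no upper bound" */
} alloc_num_t;

/* Accepts "start", "start:step" or "start:step:end"; sscanf simply stops at
 * the first missing field, which is why this format is easy to parse in C. */
static int parse_alloc_num(const char *envname, alloc_num_t *num)
{
    const char *value = getenv(envname);
    num->start = num->step = num->end = 0;
    if (!value)
        return 0;
    return sscanf(value, "%llu:%llu:%llu",
                  &num->start, &num->step, &num->end) >= 1;
}

/* Does allocation number n fall on the range? E.g. with "0:1000" this is
 * true for n = 0, 1000, 2000, ... */
static int alloc_num_hit(const alloc_num_t *num, unsigned long long n)
{
    if (n < num->start || (num->end && n > num->end))
        return 0;
    return num->step ? (n - num->start) % num->step == 0 : n == num->start;
}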

yuyichao% JL_GC_ALLOC_PRINT=0:10 JL_GC_ALLOC_POOL=0 JL_GC_ALLOC_OTHER=0 ../julia -C native --build "${PWD}/"../inference0 -f coreimg.jl
Allocations: 13861 (Pool: 13861; Other: 0); GC: 9
Allocations: 13871 (Pool: 13871; Other: 0); GC: 19
Allocations: 13881 (Pool: 13881; Other: 0); GC: 29
Allocations: 13891 (Pool: 13891; Other: 0); GC: 39
Allocations: 13901 (Pool: 13901; Other: 0); GC: 49
Allocations: 13911 (Pool: 13911; Other: 0); GC: 59
Allocations: 13921 (Pool: 13921; Other: 0); GC: 69
Allocations: 13931 (Pool: 13931; Other: 0); GC: 79
Allocations: 13941 (Pool: 13941; Other: 0); GC: 89
Allocations: 13951 (Pool: 13951; Other: 0); GC: 99
Allocations: 13961 (Pool: 13961; Other: 0); GC: 109
Allocations: 13971 (Pool: 13971; Other: 0); GC: 119
Allocations: 13981 (Pool: 13981; Other: 0); GC: 129
Allocations: 13991 (Pool: 13991; Other: 0); GC: 139
Allocations: 14001 (Pool: 14001; Other: 0); GC: 149
Allocations: 14011 (Pool: 14011; Other: 0); GC: 159
Allocations: 14021 (Pool: 14021; Other: 0); GC: 169
Allocations: 14031 (Pool: 14031; Other: 0); GC: 179
Allocations: 14041 (Pool: 14041; Other: 0); GC: 189
Allocations: 14051 (Pool: 14051; Other: 0); GC: 199
Allocations: 14061 (Pool: 14061; Other: 0); GC: 209
Allocations: 14071 (Pool: 14071; Other: 0); GC: 219
Allocations: 14081 (Pool: 14081; Other: 0); GC: 229
Allocations: 14091 (Pool: 14091; Other: 0); GC: 239
Allocations: 14101 (Pool: 14101; Other: 0); GC: 249
Allocations: 14111 (Pool: 14111; Other: 0); GC: 259
Allocations: 14121 (Pool: 14121; Other: 0); GC: 269
Allocations: 14131 (Pool: 14131; Other: 0); GC: 279
Allocations: 14141 (Pool: 14141; Other: 0); GC: 289
Allocations: 14151 (Pool: 14151; Other: 0); GC: 299
Allocations: 14161 (Pool: 14161; Other: 0); GC: 309

signal (11): Segmentation fault
gc_push_root at /home/yuyichao/projects/julia/gc-debug/src/gc.c:1615
push_root at /home/yuyichao/projects/julia/gc-debug/src/gc.c:1849
gc_push_root at /home/yuyichao/projects/julia/gc-debug/src/gc.c:1617
push_root at /home/yuyichao/projects/julia/gc-debug/src/gc.c:1856
jl_gc_collect at /home/yuyichao/projects/julia/gc-debug/src/gc.c:2337
__pool_alloc at /home/yuyichao/projects/julia/gc-debug/src/gc.c:1187
new_binding at /home/yuyichao/projects/julia/gc-debug/src/module.c:73
jl_new_module at /home/yuyichao/projects/julia/gc-debug/src/module.c:39
jl_eval_module_expr at /home/yuyichao/projects/julia/gc-debug/src/toplevel.c:118
jl_toplevel_eval_flex at /home/yuyichao/projects/julia/gc-debug/src/toplevel.c:415
jl_eh_restore_state at /home/yuyichao/projects/julia/gc-debug/src/julia.h:1389
do_call at /home/yuyichao/projects/julia/gc-debug/src/interpreter.c:66
eval at /home/yuyichao/projects/julia/gc-debug/src/interpreter.c:212
jl_toplevel_eval_flex at /home/yuyichao/projects/julia/gc-debug/src/toplevel.c:540
jl_toplevel_eval_flex at /home/yuyichao/projects/julia/gc-debug/src/toplevel.c:569
jl_load at /home/yuyichao/projects/julia/gc-debug/src/toplevel.c:616
unknown function (ip: 4200467)
unknown function (ip: 4201392)
unknown function (ip: 4199951)
__libc_start_main at /usr/lib/libc.so.6 (unknown line)
unknown function (ip: 4200041)
unknown function (ip: 0)
Allocations: 14162 (Pool: 14162; Other: 0); GC: 310
[1]    11915 segmentation fault (core dumped)  JL_GC_ALLOC_PRINT=0:10 JL_GC_ALLOC_POOL=0 JL_GC_ALLOC_OTHER=0 ../julia -C    

@tkelman
Contributor

tkelman commented Jun 13, 2015

Not sure how, but this appears to cause a win64 compilation failure? https://ci.appveyor.com/project/StefanKarpinski/julia/build/1.0.5664/job/jmss85q7y71473f7

@yuyichao
Contributor Author

@tkelman The error happens in disasm.cpp, which I didn't even touch. It also happens on the AppVeyor build right after this one, for #11461, which doesn't even touch C/C++ files.

Has anything changed in the build environment?

@yuyichao
Contributor Author

@tkelman The win64 CI has been failing with the same error. I've started a build to check the current master.

The gcc command line does seem different. Not sure why, though...

@tkelman
Contributor

tkelman commented Jun 13, 2015

I noticed. I pushed a debugging build on a branch; let's start a new issue if I can't figure it out right away.

@tkelman
Contributor

tkelman commented Jun 13, 2015

@yuyichao I think I have it figured out, but what I just pushed is a temporary band-aid that might break IJulia on the win64 nightly. See #11682 (comment) for more discussion; I'll use that issue for this.

edit: FYI, the root cause of the compilation failure is that the cached llvm-config.exe does not run correctly when using the libstdc++-6.dll from the latest nightlies, which comes from openSUSE. The nightlies themselves are also broken here. Constant C++ ABI breakage sucks.

@yuyichao
Contributor Author

yuyichao commented Jul 2, 2015

@carnaval Finally moved the debugging bits out of gc.c. I had to move things around a little so that the parts I moved out fit into a single file.

@yuyichao yuyichao force-pushed the gc-debug branch 3 times, most recently from 6922b5b to 115b937 on July 10, 2015 11:55
@yuyichao
Contributor Author

@carnaval Ping.

@yuyichao
Contributor Author

This branch was useful a few times recently for reproducing/debugging some crashes. I'd like to merge this later this week if there are no further comments. I'll rebase 2-3 times before that to make sure the CI is happy with it...

@carnaval
Contributor

Thanks for the reminder, I'm a distracted person, sorry about that :-)

Why do you need to check for init in the allocation path? That's not very expensive compared to what we are doing in this PR, but is there a reason this would not work if done in jl_gc_init?

Also, a minor thing, but AllocNum should probably not be CamelCase, for consistency.

Other than that, let's go; who knows, we might even get other people to use this :-)

@yuyichao
Contributor Author

Also, a minor thing, but AllocNum should probably not be CamelCase, for consistency.

Will change that (to jl_alloc_num_t probably...).

Why do you need to check for init in the allocation path? That's not very expensive compared to what we are doing in this PR, but is there a reason this would not work if done in jl_gc_init?

I wasn't aware of jl_gc_init when I wrote the first version and didn't think about it again when I saw it later.... Will try that...
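
A sketch of the change being discussed: parse the debugging environment variables once at GC startup instead of testing an initialization flag on every allocation. jl_gc_init is real; everything else here is illustrative:

/* Before (sketch): the hot allocation path had to test a flag. */
static int debug_env_initialized;
static void gc_debug_env_check(void)
{
    if (!debug_env_initialized) {
        debug_env_initialized = 1;
        /* getenv("JL_GC_ALLOC_POOL") etc. parsed here */
    }
}

/* After (sketch): do the parsing once, from jl_gc_init, so the allocation
 * fast path never needs the check. */
void jl_gc_init(void)
{
    /* ... existing GC setup ... */
    /* parse JL_GC_ALLOC_POOL / JL_GC_ALLOC_OTHER / JL_GC_ALLOC_PRINT here */
}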

@yuyichao
Contributor Author

Done. (And GitHub finally got the commit order right...)

Other than that, let's go, who knows, we might even get other people to use this :-)

This could probably get as many users as GC_VERIFY.

@carnaval
Contributor

This could probably get as many users as GC_VERIFY.

Haha. Well, anyway, this LGTM; you can merge when CI is happy.

yuyichao added a commit that referenced this pull request Jul 14, 2015
Make it easier to find/test/reproduce GC bugs
@yuyichao yuyichao merged commit 3c8d0fb into JuliaLang:master Jul 14, 2015
@yuyichao yuyichao deleted the gc-debug branch July 14, 2015 16:29
@yuyichao
Contributor Author

Well, once this is merged, ping me, and I'll set up something similar to what we do with the LLVM SVN nightlies.

@staticfloat We don't have a generic framework for randomized tests yet, but since this is merged, it would be nice if you could set up a buildbot with the make option WITH_GC_DEBUG_ENV=1. This should also automatically enable the GC verifier.
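
One plausible way the "automatically enable the verifier" part could be wired up at the preprocessor level (the exact macro spellings used in the tree are an assumption):

/* If the build sets WITH_GC_DEBUG_ENV=1 (e.g. via -DGC_DEBUG_ENV), make sure
 * the GC verifier is compiled in as well. */
#if defined(GC_DEBUG_ENV) && !defined(GC_VERIFY)
#  define GC_VERIFY
#endif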

@tkelman tkelman mentioned this pull request Jul 19, 2015