-
Notifications
You must be signed in to change notification settings - Fork 29.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
report: catch internal segfaults #25915
Conversation
`--diagnostic-report-on-fatalerror` does not capture internal segfaults at present, instead limited to only faults that are detected by the Node / v8. libuv's signal handler deliveres result asynchronously which does not help for SIGSEGV as: i) the synchronous sequence restarts the failing code, ii) due to i) the callback is never delivered to the consumer. Install a SIGSEGV handler in Node and follow the sequence of fatalerror handling, as of now at least under report. Fixes: nodejs#25762
The report code generally isn’t signal-safe, and if I remember correctly, we had to deal with actual crashes because of that in #22712? |
@addaleax - yes, the issue we dealt there was recursion in report (in case if the signal is delivered again while the report is underway). But then that was asynchronous signals (externally injected) We could address that by uninstalling the handler when we enter the handler? (because we are anyways going to terminate and don't want to process any more signals) |
@gireeshpunathil I don’t think whether a signal is externally injected would have an effect on signal safety. The issue was not just recursion; I also remember that we ran into deadlocks when calling malloc() (because the main code was already inside malloc(), for example), and our report code definitely does lots of allocations. In the end, there’s a very limited set of things one can do from a signal handler, and our report code doesn’t really fit that profile… |
can we qualify the limitations? `man signal(7) shows an long list of calls that a signal handler can safely call. If ~50 system calls can be made, it pretty much can do many things? I see It is not the desire to have this in report that drives me; but we should have something better than raw termination on segfaults? many deployments inhibit core dumps on crashes, means we end up only having an exit code! Any little first failure data would improve the situation IMO. |
Getting a report on a SEGV would be great, but using signal unsafe functions is prone to failure. Node is in the process of terminating anyhow, maybe it doesn't matter if it just terminates some other way, but deadlocking and not terminating because its blocked on malloc in the stdlib is worse than terminating with SEGV and no report. It should be possible to rework node-report to only use signal safe functions. It would have to avoid iostreams/malloc/new, work inside a pre-allocated memory space, just interact directly with fds, etc. A lot of work, but nice to have. |
@addaleax / @sam-github : ok, just to clarify: I did not mean to argue on signal safety and report's ability to stay within the saftey limits. Keeping my intent clear (handle SIGSEGV in a reasonable manner and with some useful info) I am happy to amend this in the best ways as you would suggest. One of:
Please suggest! |
@richardlau What do you think of reworking node-report to not use signal unsafe functions? How much work do you think it is? It depends pretty heavily on iostreams, IIRC. |
Correct, at the moment both the stand alone module and the code in core is heavily dependent on iostreams. I think there was a stalled issue/pr in https://github.com/nodejs/node-report re. signal safety (I'll find references when I'm back on Wednesday after the Lunar New Year). |
My gut feeling would be to start small -- start with a minimal/empty signal safe report and then gradually rework common bits from the existing code. |
I guess this is the one nodejs/node-report#92 What are your thoughts on code range check? i.e., if faulty instruction found to belong to anything other than node.exe (c/c++ runtime or os) skip; else proceed. Still we cover a wide class of issues, while being safe? |
I'm not sure I understand the purpose of the code range check. Are you suggesting that if it's node code, that we can make calls to functions that are not signal safe? If so, I think you'd still be in "undefined behaviour" territory. It's definitely not safe for stdio, the macros would show up as "node code" but be manipulating the FILE object internals. I think @richardlau 's suggestion of outputting a small subset of the report from the signal handler, converting the code for that small section to signal-safe functions. See how that goes, and then work on extending the report later. |
@sam-github - IIUC, the whole premise of signal un-saftey is around arbitrary state of the program w.r.t locks that can be held when the signal arrives on an arbitrary thread; that can cause deadlock if attempted to acquire the same lock. Some random checks on APIs are in full agreement with this premise. safe methods, no locks. $ gdb -batch -ex 'file /lib64/libc.so.6' -ex "disassemble accept" | grep -E "lock|cmpxchg" | wc -l
0
$ gdb -batch -ex 'file /lib64/libc.so.6' -ex "disassemble fcntl" | grep -E "lock|cmpxchg" | wc -l
0
$ gdb -batch -ex 'file /lib64/libc.so.6' -ex "disassemble dup" | grep -E "lock|cmpxchg" | wc -l
0
$ gdb -batch -ex 'file /lib64/libc.so.6' -ex "disassemble chdir" | grep -E "lock|cmpxchg" | wc -l
0
$ gdb -batch -ex 'file /lib64/libc.so.6' -ex "disassemble kill" | grep -E "lock|cmpxchg" | wc -l
0 unsafe methods, lot of those. $ gdb -batch -ex 'file /lib64/libc.so.6' -ex "disassemble malloc" | grep -E "lock|cmpxchg" | wc -l
7
$ gdb -batch -ex 'file /lib64/libc.so.6' -ex "disassemble fflush" | grep -E "lock|cmpxchg" | wc -l
10
$ gdb -batch -ex 'file /lib64/libc.so.6' -ex "disassemble random" | grep -E "lock|cmpxchg" | wc -l
8
$ gdb -batch -ex 'file /lib64/libc.so.6' -ex "disassemble ftell" | grep -E "lock|cmpxchg" | wc -l
11
$ gdb -batch -ex 'file /lib64/libc.so.6' -ex "disassemble getc" | grep -E "lock|cmpxchg" | wc -l
10 I don't know if there are other usafety parameters. Unfortunately the documentation is not clear on how the unsafety materializes. Regarding the code range check:
|
It’s not just about locking (I wouldn’t even be sure if multithreading predates the concept of signal safety – it probably doesn’t?). It’s more about the fact that you might not be able to be sure which state a given part of the process is in, if it isn’t coded intentionally to be signal-safe; for example, |
Sure, but then our range check will tell us the process is in the red zone, right? Do we have such unsafe program state when the |
@gireeshpunathil I don’t know, but this sounds very brittle. I think it would be a good idea to stick with signal-safe-functions-only signal handlers. |
@gireeshpunathil we must have the same problem in Java and generating javacores. Can you find out what the solution was there as an additional piece of information? |
@mhdawson: we do seem to be using lock rich functions from the signal handler. I could locate many, here is one each from from IBM Java and open JDK ibm jdk: #0 0x000000393584eed0 in fprintf () from /lib64/libc.so.6
#1 0x00000039358e4d08 in __vsyslog_chk () from /lib64/libc.so.6
#2 0x00000039358e5350 in syslog () from /lib64/libc.so.6
#3 0x00007ffff6247334 in j9syslog_write () from /opt/ibm/java-x86_64-80/jre/lib/amd64/compressedrefs/libj9prt28.so
#4 0x00007ffff6228457 in j9nls_vprintf () from /opt/ibm/java-x86_64-80/jre/lib/amd64/compressedrefs/libj9prt28.so
#5 0x00007ffff62284e6 in j9nls_printf () from /opt/ibm/java-x86_64-80/jre/lib/amd64/compressedrefs/libj9prt28.so
#6 0x00007ffff5a9bafb in triggerDumpAgents () from /opt/ibm/java-x86_64-80/jre/lib/amd64/compressedrefs/libj9dmp28.so
#7 0x00007ffff5a8924d in abortHandler () from /opt/ibm/java-x86_64-80/jre/lib/amd64/compressedrefs/libj9dmp28.so
#8 <signal handler called>
#9 0x0000003935832495 in raise () from /lib64/libc.so.6
#10 0x0000003935833c75 in abort () from /lib64/libc.so.6
#11 0x00007ffff6239557 in masterSynchSignalHandler () from /opt/ibm/java-x86_64-80/jre/lib/amd64/compressedrefs/libj9prt28.so
#12 <signal handler called>
#13 0x0000003935c082fb in pthread_join () from /lib64/libpthread.so.0
#14 0x00007ffff7dd8b5d in ContinueInNewThread0 () from /opt/ibm/java-x86_64-80/jre/bin/../lib/amd64/jli/libjli.so
#15 0x00007ffff7dcc874 in ContinueInNewThread () from /opt/ibm/java-x86_64-80/jre/bin/../lib/amd64/jli/libjli.so
#16 0x00007ffff7dcfd4a in JLI_Launch () from /opt/ibm/java-x86_64-80/jre/bin/../lib/amd64/jli/libjli.so
#17 0x000000000040068a in main () open jdk: #0 0x0000003935867590 in ftell () from /lib64/libc.so.6
#1 0x00007ffff77449f0 in ?? () from /usr/lib/jvm/java-1.6.0-openjdk-1.6.0.41.x86_64/jre/lib/amd64/server/libjvm.so
#2 0x00007ffff7744207 in ?? () from /usr/lib/jvm/java-1.6.0-openjdk-1.6.0.41.x86_64/jre/lib/amd64/server/libjvm.so
#3 0x00007ffff76eccb2 in ?? () from /usr/lib/jvm/java-1.6.0-openjdk-1.6.0.41.x86_64/jre/lib/amd64/server/libjvm.so
#4 0x00007ffff7a127be in ?? () from /usr/lib/jvm/java-1.6.0-openjdk-1.6.0.41.x86_64/jre/lib/amd64/server/libjvm.so
#5 0x00007ffff775ca84 in ?? () from /usr/lib/jvm/java-1.6.0-openjdk-1.6.0.41.x86_64/jre/lib/amd64/server/libjvm.so
#6 0x00007ffff7b734a7 in ?? () from /usr/lib/jvm/java-1.6.0-openjdk-1.6.0.41.x86_64/jre/lib/amd64/server/libjvm.so
#7 0x00007ffff7b73f13 in ?? () from /usr/lib/jvm/java-1.6.0-openjdk-1.6.0.41.x86_64/jre/lib/amd64/server/libjvm.so
#8 0x00007ffff7a1a1c7 in JVM_handle_linux_signal () from /usr/lib/jvm/java-1.6.0-openjdk-1.6.0.41.x86_64/jre/lib/amd64/server/libjvm.so
#9 <signal handler called>
#10 0x0000003935c082fb in pthread_join () from /lib64/libpthread.so.0
#11 0x00000000004051e1 in ?? ()
#12 0x00000000004042a0 in ?? ()
#13 0x000000393581ed1d in __libc_start_main () from /lib64/libc.so.6
#14 0x0000000000401fe9 in ?? ()
#15 0x00007fffffffe308 in ?? ()
#16 0x000000000000001c in ?? ()
#17 0x0000000000000002 in ?? ()
#18 0x00007fffffffe5a6 in ?? ()
#19 0x00007fffffffe5b4 in ?? ()
#20 0x0000000000000000 in ?? ()
IMO Java is banking on the realistic probability of very low to 0 |
there is also a popular segfault-handler module in the ecosystem that catches Based on the feedback so far, let me go back and work on a revision, aiming to capture the most important data with safe code. |
Ping @gireeshpunathil |
@BridgeAR - this PR is stalled: the current approach is deemed to be signal-unsafe, and require rework in the |
I’m going to close this PR, because it requires a significant amount of preparation work, and conflicts heavily with changes that happened on master and would basically need to be re-created anyway. |
--diagnostic-report-on-fatalerror
does not capture internalsegfaults at present, instead limited to only faults that are
detected by the Node / v8. libuv's signal handler deliveres
result asynchronously which does not help for SIGSEGV as:
i) the synchronous sequence restarts the failing code,
ii) due to i) the callback is never delivered to the consumer.
Install a SIGSEGV handler in Node and follow the sequence of
fatalerror handling, as of now at least under report.
Fixes: #25762
cc @addaleax @sam-github @bnoordhuis
Checklist
make -j4 test
(UNIX), orvcbuild test
(Windows) passes