Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

report: catch internal segfaults #25915

Closed
wants to merge 1 commit into from

Conversation

gireeshpunathil
Copy link
Member

--diagnostic-report-on-fatalerror does not capture internal
segfaults at present, instead limited to only faults that are
detected by the Node / v8. libuv's signal handler deliveres
result asynchronously which does not help for SIGSEGV as:
i) the synchronous sequence restarts the failing code,
ii) due to i) the callback is never delivered to the consumer.

Install a SIGSEGV handler in Node and follow the sequence of
fatalerror handling, as of now at least under report.

Fixes: #25762

cc @addaleax @sam-github @bnoordhuis

Checklist
  • make -j4 test (UNIX), or vcbuild test (Windows) passes
  • tests and/or benchmarks are included
  • documentation is changed or added
  • commit message follows commit guidelines

`--diagnostic-report-on-fatalerror` does not capture internal
segfaults at present, instead limited to only faults that are
detected by the Node / v8. libuv's signal handler deliveres
result asynchronously which does not help for SIGSEGV as:
i) the synchronous sequence restarts the failing code,
ii) due to i) the callback is never delivered to the consumer.

Install a SIGSEGV handler in Node and follow the sequence of
fatalerror handling, as of now at least under report.

Fixes: nodejs#25762
@nodejs-github-bot nodejs-github-bot added the c++ Issues and PRs that require attention from people who are familiar with C++. label Feb 4, 2019
@addaleax
Copy link
Member

addaleax commented Feb 4, 2019

The report code generally isn’t signal-safe, and if I remember correctly, we had to deal with actual crashes because of that in #22712?

@gireeshpunathil
Copy link
Member Author

@addaleax - yes, the issue we dealt there was recursion in report (in case if the signal is delivered again while the report is underway). But then that was asynchronous signals (externally injected)

We could address that by uninstalling the handler when we enter the handler? (because we are anyways going to terminate and don't want to process any more signals)

@addaleax
Copy link
Member

addaleax commented Feb 4, 2019

@gireeshpunathil I don’t think whether a signal is externally injected would have an effect on signal safety.

The issue was not just recursion; I also remember that we ran into deadlocks when calling malloc() (because the main code was already inside malloc(), for example), and our report code definitely does lots of allocations.

In the end, there’s a very limited set of things one can do from a signal handler, and our report code doesn’t really fit that profile…

@gireeshpunathil
Copy link
Member Author

@addaleax -

In the end, there’s a very limited set of things one can do from a signal handler

can we qualify the limitations? `man signal(7) shows an long list of calls that a signal handler can safely call. If ~50 system calls can be made, it pretty much can do many things?

I see malloc is not in that list, and I believe the reason is the said deadlocks. How about proceeding to collect report only if the failing instruction falls within the memory ranges of node.exe?

It is not the desire to have this in report that drives me; but we should have something better than raw termination on segfaults? many deployments inhibit core dumps on crashes, means we end up only having an exit code! Any little first failure data would improve the situation IMO.

@sam-github
Copy link
Contributor

Getting a report on a SEGV would be great, but using signal unsafe functions is prone to failure.

Node is in the process of terminating anyhow, maybe it doesn't matter if it just terminates some other way, but deadlocking and not terminating because its blocked on malloc in the stdlib is worse than terminating with SEGV and no report.

It should be possible to rework node-report to only use signal safe functions. It would have to avoid iostreams/malloc/new, work inside a pre-allocated memory space, just interact directly with fds, etc. A lot of work, but nice to have.

@gireeshpunathil
Copy link
Member Author

@addaleax / @sam-github : ok, just to clarify: I did not mean to argue on signal safety and report's ability to stay within the saftey limits. Keeping my intent clear (handle SIGSEGV in a reasonable manner and with some useful info) I am happy to amend this in the best ways as you would suggest.

One of:

  • slowly refactoring report to make parts of it signal-safe, and include segfault in it
  • Write a custom routine that is fully signal safe, and generate a different report: that has register content and other GPF context
  • simulate OnFatalError - collect very few but vital data

Please suggest!

@sam-github
Copy link
Contributor

@richardlau What do you think of reworking node-report to not use signal unsafe functions? How much work do you think it is? It depends pretty heavily on iostreams, IIRC.

@richardlau
Copy link
Member

Correct, at the moment both the stand alone module and the code in core is heavily dependent on iostreams. I think there was a stalled issue/pr in https://github.com/nodejs/node-report re. signal safety (I'll find references when I'm back on Wednesday after the Lunar New Year).

@richardlau
Copy link
Member

My gut feeling would be to start small -- start with a minimal/empty signal safe report and then gradually rework common bits from the existing code.

@addaleax addaleax added the report Issues and PRs related to process.report. label Feb 4, 2019
@gireeshpunathil
Copy link
Member Author

I think there was a stalled issue/pr in node-report

I guess this is the one nodejs/node-report#92

What are your thoughts on code range check? i.e., if faulty instruction found to belong to anything other than node.exe (c/c++ runtime or os) skip; else proceed. Still we cover a wide class of issues, while being safe?

@sam-github
Copy link
Contributor

What are your thoughts on code range check? i.e., if faulty instruction found to belong to anything other than node.exe (c/c++ runtime or os) skip; else proceed. Still we cover a wide class of issues, while being safe?

I'm not sure I understand the purpose of the code range check.

Are you suggesting that if it's node code, that we can make calls to functions that are not signal safe? If so, I think you'd still be in "undefined behaviour" territory. It's definitely not safe for stdio, the macros would show up as "node code" but be manipulating the FILE object internals.

I think @richardlau 's suggestion of outputting a small subset of the report from the signal handler, converting the code for that small section to signal-safe functions. See how that goes, and then work on extending the report later.

@gireeshpunathil
Copy link
Member Author

@sam-github - IIUC, the whole premise of signal un-saftey is around arbitrary state of the program w.r.t locks that can be held when the signal arrives on an arbitrary thread; that can cause deadlock if attempted to acquire the same lock. Some random checks on APIs are in full agreement with this premise.

safe methods, no locks.

$  gdb -batch -ex 'file /lib64/libc.so.6' -ex "disassemble accept" | grep -E "lock|cmpxchg" | wc -l
0
$  gdb -batch -ex 'file /lib64/libc.so.6' -ex "disassemble fcntl" | grep -E "lock|cmpxchg" | wc -l
0
$  gdb -batch -ex 'file /lib64/libc.so.6' -ex "disassemble dup" | grep -E "lock|cmpxchg" | wc -l
0
$  gdb -batch -ex 'file /lib64/libc.so.6' -ex "disassemble chdir" | grep -E "lock|cmpxchg" | wc -l
0
$  gdb -batch -ex 'file /lib64/libc.so.6' -ex "disassemble kill" | grep -E "lock|cmpxchg" | wc -l
0

unsafe methods, lot of those.

$ gdb -batch -ex 'file /lib64/libc.so.6' -ex "disassemble malloc" | grep -E "lock|cmpxchg" | wc -l
7
$ gdb -batch -ex 'file /lib64/libc.so.6' -ex "disassemble fflush" | grep -E "lock|cmpxchg" | wc -l
10
$  gdb -batch -ex 'file /lib64/libc.so.6' -ex "disassemble random" | grep -E "lock|cmpxchg" | wc -l
8
$  gdb -batch -ex 'file /lib64/libc.so.6' -ex "disassemble ftell" | grep -E "lock|cmpxchg" | wc -l
11
$  gdb -batch -ex 'file /lib64/libc.so.6' -ex "disassemble getc" | grep -E "lock|cmpxchg" | wc -l
10

I don't know if there are other usafety parameters. Unfortunately the documentation is not clear on how the unsafety materializes.

Regarding the code range check:

  • the signal handler callback provides the failing context, that includes the program counter as well.

  • any code (including that belongs to c or c++ runtime) will show up correctly in this context.

  • the only case it will be inaccurate is when we issue a system call: the pc will show the callsite, but the work will be happening in the kernel. But then crashes happen inside the kernel do not issue a signal to the process, instead causes irrecoverable failure to the system. so that case is not applicable in this scenario.

  • if the pc is anywhere outside the range of node executable, there is a chance that it is interrupted while holding a lock. So we bail out.

  • if the pc is anywhere inside the range of node executable, the only locks that the threads would have acquired are program defined locks. If we don't attempt on that, we should be safe?

@addaleax
Copy link
Member

addaleax commented Feb 6, 2019

IIUC, the whole premise of signal un-saftey is around arbitrary state of the program w.r.t locks that can be held when the signal arrives on an arbitrary thread

It’s not just about locking (I wouldn’t even be sure if multithreading predates the concept of signal safety – it probably doesn’t?). It’s more about the fact that you might not be able to be sure which state a given part of the process is in, if it isn’t coded intentionally to be signal-safe; for example, malloc() is signal-unsafe not just because of locks, but also because it manages global data. If it was just in the middle of modifying such data when the process crashed, it’s not a good idea to call it again.

@gireeshpunathil
Copy link
Member Author

@addaleax -

If it was just in the middle of modifying such data when the process crashed, it’s not a good idea to call it again.

Sure, but then our range check will tell us the process is in the red zone, right?

Do we have such unsafe program state when the $pc is in our bounds? I am unable to think of any.

@addaleax
Copy link
Member

addaleax commented Feb 6, 2019

@gireeshpunathil I don’t know, but this sounds very brittle.

I think it would be a good idea to stick with signal-safe-functions-only signal handlers.

@mhdawson
Copy link
Member

mhdawson commented Feb 6, 2019

@gireeshpunathil we must have the same problem in Java and generating javacores. Can you find out what the solution was there as an additional piece of information?

@gireeshpunathil
Copy link
Member Author

@mhdawson: we do seem to be using lock rich functions from the signal handler. I could locate many, here is one each from from IBM Java and open JDK

ibm jdk:

#0  0x000000393584eed0 in fprintf () from /lib64/libc.so.6
#1  0x00000039358e4d08 in __vsyslog_chk () from /lib64/libc.so.6
#2  0x00000039358e5350 in syslog () from /lib64/libc.so.6
#3  0x00007ffff6247334 in j9syslog_write () from /opt/ibm/java-x86_64-80/jre/lib/amd64/compressedrefs/libj9prt28.so
#4  0x00007ffff6228457 in j9nls_vprintf () from /opt/ibm/java-x86_64-80/jre/lib/amd64/compressedrefs/libj9prt28.so
#5  0x00007ffff62284e6 in j9nls_printf () from /opt/ibm/java-x86_64-80/jre/lib/amd64/compressedrefs/libj9prt28.so
#6  0x00007ffff5a9bafb in triggerDumpAgents () from /opt/ibm/java-x86_64-80/jre/lib/amd64/compressedrefs/libj9dmp28.so
#7  0x00007ffff5a8924d in abortHandler () from /opt/ibm/java-x86_64-80/jre/lib/amd64/compressedrefs/libj9dmp28.so
#8  <signal handler called>
#9  0x0000003935832495 in raise () from /lib64/libc.so.6
#10 0x0000003935833c75 in abort () from /lib64/libc.so.6
#11 0x00007ffff6239557 in masterSynchSignalHandler () from /opt/ibm/java-x86_64-80/jre/lib/amd64/compressedrefs/libj9prt28.so
#12 <signal handler called>
#13 0x0000003935c082fb in pthread_join () from /lib64/libpthread.so.0
#14 0x00007ffff7dd8b5d in ContinueInNewThread0 () from /opt/ibm/java-x86_64-80/jre/bin/../lib/amd64/jli/libjli.so
#15 0x00007ffff7dcc874 in ContinueInNewThread () from /opt/ibm/java-x86_64-80/jre/bin/../lib/amd64/jli/libjli.so
#16 0x00007ffff7dcfd4a in JLI_Launch () from /opt/ibm/java-x86_64-80/jre/bin/../lib/amd64/jli/libjli.so
#17 0x000000000040068a in main ()

open jdk:

#0  0x0000003935867590 in ftell () from /lib64/libc.so.6
#1  0x00007ffff77449f0 in ?? () from /usr/lib/jvm/java-1.6.0-openjdk-1.6.0.41.x86_64/jre/lib/amd64/server/libjvm.so
#2  0x00007ffff7744207 in ?? () from /usr/lib/jvm/java-1.6.0-openjdk-1.6.0.41.x86_64/jre/lib/amd64/server/libjvm.so
#3  0x00007ffff76eccb2 in ?? () from /usr/lib/jvm/java-1.6.0-openjdk-1.6.0.41.x86_64/jre/lib/amd64/server/libjvm.so
#4  0x00007ffff7a127be in ?? () from /usr/lib/jvm/java-1.6.0-openjdk-1.6.0.41.x86_64/jre/lib/amd64/server/libjvm.so
#5  0x00007ffff775ca84 in ?? () from /usr/lib/jvm/java-1.6.0-openjdk-1.6.0.41.x86_64/jre/lib/amd64/server/libjvm.so
#6  0x00007ffff7b734a7 in ?? () from /usr/lib/jvm/java-1.6.0-openjdk-1.6.0.41.x86_64/jre/lib/amd64/server/libjvm.so
#7  0x00007ffff7b73f13 in ?? () from /usr/lib/jvm/java-1.6.0-openjdk-1.6.0.41.x86_64/jre/lib/amd64/server/libjvm.so
#8  0x00007ffff7a1a1c7 in JVM_handle_linux_signal () from /usr/lib/jvm/java-1.6.0-openjdk-1.6.0.41.x86_64/jre/lib/amd64/server/libjvm.so
#9  <signal handler called>
#10 0x0000003935c082fb in pthread_join () from /lib64/libpthread.so.0
#11 0x00000000004051e1 in ?? ()
#12 0x00000000004042a0 in ?? ()
#13 0x000000393581ed1d in __libc_start_main () from /lib64/libc.so.6
#14 0x0000000000401fe9 in ?? ()
#15 0x00007fffffffe308 in ?? ()
#16 0x000000000000001c in ?? ()
#17 0x0000000000000002 in ?? ()
#18 0x00007fffffffe5a6 in ?? ()
#19 0x00007fffffffe5b4 in ?? ()
#20 0x0000000000000000 in ?? ()

openj9 should not be too different I guess, here is their handler entry point for synchronous signals:
https://github.com/eclipse/omr/blob/2e2934e92284f113cfff87d93bbdd41175739d44/port/unix/omrsignal.c#L901

IMO Java is banking on the realistic probability of very low to 0 gpf rate from the stdlib functions. Double free is a common case in glibc but that leads to safe SIGABRT.

@gireeshpunathil
Copy link
Member Author

there is also a popular segfault-handler module in the ecosystem that catches SIGSEGV from node apps and prints some useful information.

Based on the feedback so far, let me go back and work on a revision, aiming to capture the most important data with safe code.

@gireeshpunathil gireeshpunathil added the wip Issues and PRs that are still a work in progress. label Feb 7, 2019
@BridgeAR
Copy link
Member

BridgeAR commented Mar 5, 2019

Ping @gireeshpunathil

@gireeshpunathil gireeshpunathil added stalled Issues and PRs that are stalled. and removed wip Issues and PRs that are still a work in progress. labels Mar 6, 2019
@gireeshpunathil
Copy link
Member Author

@BridgeAR - this PR is stalled: the current approach is deemed to be signal-unsafe, and require rework in the report module to use only signal-safe code for this approach to work reliably. I have updated the labels accordingly. thanks!

@addaleax
Copy link
Member

I’m going to close this PR, because it requires a significant amount of preparation work, and conflicts heavily with changes that happened on master and would basically need to be re-created anyway.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
c++ Issues and PRs that require attention from people who are familiar with C++. report Issues and PRs related to process.report. stalled Issues and PRs that are stalled.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

SIGSEGV handler causes infinite loop if signal raised from within
7 participants