[C++][CI] Enable libSegFault for C++ tests #32399

Closed
asfimport opened this issue Jul 15, 2022 · 19 comments

@asfimport
Collaborator

Adding libSegFault.so could make it easier to diagnose CI failures. It will print a backtrace on segfault.


  env SEGFAULT_SIGNALS=all \
      LD_PRELOAD=/lib/x86_64-linux-gnu/libSegFault.so

This will give a backtrace like this on segfault:


Backtrace:
/lib/x86_64-linux-gnu/libc.so.6(gsignal+0xcb)[0x7f8f4a0b900b]
/lib/x86_64-linux-gnu/libc.so.6(abort+0x12b)[0x7f8f4a098859]
/lib/x86_64-linux-gnu/libc.so.6(+0x8d26e)[0x7f8f4a10326e]
/lib/x86_64-linux-gnu/libc.so.6(+0x952fc)[0x7f8f4a10b2fc]
/lib/x86_64-linux-gnu/libc.so.6(+0x96f6d)[0x7f8f4a10cf6d]
/tmp/arrow-HEAD.y8UwB/cpp-build/release/flight-test-integration-client(_ZNSt8_Rb_treeISt10shared_ptrIN5arrow8DataTypeEES3_St9_IdentityIS3_ESt4lessIS3_ESaIS3_EE8_M_eraseEPSt13_Rb_tree_nodeIS3_E+0x39)[0x5557a9a83b19]
/tmp/arrow-HEAD.y8UwB/cpp-build/release/flight-test-integration-client(_ZNSt8_Rb_treeISt10shared_ptrIN5arrow8DataTypeEES3_St9_IdentityIS3_ESt4lessIS3_ESaIS3_EE8_M_eraseEPSt13_Rb_tree_nodeIS3_E+0x1f)[0x5557a9a83aff]
/tmp/arrow-HEAD.y8UwB/cpp-build/release/flight-test-integration-client(_ZNSt3setISt10shared_ptrIN5arrow8DataTypeEESt4lessIS3_ESaIS3_EED1Ev+0x33)[0x5557a9a83b83]
/lib/x86_64-linux-gnu/libc.so.6(__cxa_finalize+0xce)[0x7f8f4a0bcfde]
/tmp/arrow-HEAD.y8UwB/cpp-build/release/libarrow.so.900(+0x440b67)[0x7f8f47d56b67]

Caveats:

  • The path is OS-specific
  • We could integrate it into the build tooling instead of doing it via env var
  • Are there easily accessible equivalents for macOS and Windows we could use?

Reporter: David Li / @lidavidm

Note: This issue was originally created as ARROW-17093. Please see the migration documentation for further details.

@asfimport
Collaborator Author

Antoine Pitrou / @pitrou:
cc @assignUser

@asfimport
Collaborator Author

Antoine Pitrou / @pitrou:
This could be useful to better diagnose tests which occasionally time out on CI, for example by adding a trap-on-timeout facility:

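// Triggers a debug trap if the guarded scope does not finish within the
// given number of seconds.
class TrapOnTimeoutGuard {
 public:
  explicit TrapOnTimeoutGuard(double seconds) {
    // Copy the future so the background thread never touches `this`.
    auto fut = finished_;
    bg_thread_ = std::thread([fut, seconds]() {
      // Wait() returns false if the timeout elapsed before the future finished.
      if (!fut.Wait(seconds)) {
        psnip_trap();
      }
    });
  }

  ~TrapOnTimeoutGuard() {
    // Wake up the background thread, then wait for it to exit.
    finished_.MarkFinished();
    bg_thread_.join();
  }

 private:
  Future<> finished_ = Future<>::Make();
  std::thread bg_thread_;
};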

cc @westonpace

@asfimport
Collaborator Author

Antoine Pitrou / @pitrou:
Interestingly, we have a little-known ARROW_WITH_BACKTRACE option that seems to link with https://github.com/ianlancetaylor/libbacktrace . I'm not sure it works, though?

@asfimport
Collaborator Author

David Li / @lidavidm:
I think what that does is use libbacktrace to get backtraces for assertions, but AIUI that library doesn't (automatically) install a fault handler the way libSegFault does.
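For illustration, here is a minimal sketch of what "installing a fault handler" could look like on top of the glibc backtrace functions. It is not what libSegFault or Arrow actually does; `FaultHandler` and `InstallFaultHandler` are hypothetical names, and the list of signals is an assumption. Like the glibc facility itself, it only prints the stack of the thread that received the signal.

```cpp
#include <execinfo.h>  // backtrace(), backtrace_symbols_fd(): glibc-specific
#include <signal.h>
#include <unistd.h>

#include <cstring>
#include <initializer_list>

// Hypothetical handler: prints the faulting thread's stack to stderr.
// backtrace() is not formally async-signal-safe (its first call may load
// libgcc dynamically), so this is best-effort only.
static void FaultHandler(int signum) {
  static const char kHeader[] = "Backtrace:\n";
  if (write(STDERR_FILENO, kHeader, sizeof(kHeader) - 1) < 0) {
    // Nothing sensible to do if stderr is unwritable.
  }
  void* frames[64];
  const int depth = backtrace(frames, 64);
  // backtrace_symbols_fd() writes straight to the fd without calling malloc.
  backtrace_symbols_fd(frames, depth, STDERR_FILENO);
  // SA_RESETHAND already restored the default action; the re-raised signal
  // is delivered when the handler returns and terminates the process.
  raise(signum);
}

// Hypothetical helper: returns true if every handler was installed.
bool InstallFaultHandler() {
  struct sigaction sa;
  std::memset(&sa, 0, sizeof(sa));
  sa.sa_handler = FaultHandler;
  sigemptyset(&sa.sa_mask);
  sa.sa_flags = SA_RESETHAND;
  for (int sig : {SIGSEGV, SIGBUS, SIGABRT, SIGFPE, SIGILL}) {
    if (sigaction(sig, &sa, nullptr) != 0) return false;
  }
  return true;
}
```

A test binary would call `InstallFaultHandler()` once, early in `main()`. Because the default action is restored before the process dies, a core dump can still be produced afterwards.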

@asfimport
Collaborator Author

Antoine Pitrou / @pitrou:
Edit: ARROW_WITH_BACKTRACE links with the glibc-specific backtrace support: https://www.gnu.org/software/libc/manual/html_node/Backtraces.html

@asfimport
Collaborator Author

Antoine Pitrou / @pitrou:
Ah... Unfortunately, the glibc backtrace support only prints a backtrace for the current thread, which makes it useless for the interesting cases :-(

@asfimport
Collaborator Author

Antoine Pitrou / @pitrou:
From the looks of it, https://github.com/ianlancetaylor/libbacktrace doesn't handle multi-threaded backtraces either.
And it seems libSegFault was removed from glibc: https://lists.gnu.org/archive/html/info-gnu/2022-02/msg00002.html

@asfimport
Collaborator Author

Antoine Pitrou / @pitrou:
Even the more sophisticated (and portable) https://github.com/bombela/backward-cpp seems limited to a single-thread backtrace.

@asfimport
Collaborator Author

Antoine Pitrou / @pitrou:
There is an ugly workaround suggested here: https://stackoverflow.com/questions/44900256/print-all-threads-stack-trace-of-a-process-in-c-c-on-linux-platform

I use pthread_kill in one thread to send SIGUSR2 to the other threads. When those threads receive the signal, it is delivered to a user-defined signal handler function. In that function, use backtrace() to print the thread's stack.
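A rough sketch of the quoted workaround, assuming Linux with glibc. The names `DumpStackHandler`, `InstallStackDumpHandler`, `RequestStackDump` and the counter `g_stack_dumps` are made up for illustration; the counter exists only so a caller can tell when the target thread has printed its stack.

```cpp
#include <execinfo.h>  // backtrace(), backtrace_symbols_fd(): glibc-specific
#include <pthread.h>
#include <signal.h>
#include <unistd.h>

#include <atomic>
#include <cstring>

// Number of stack dumps completed so far (hypothetical bookkeeping).
std::atomic<int> g_stack_dumps{0};

// Runs on whichever thread receives SIGUSR2 and prints that thread's stack.
// backtrace() is not formally async-signal-safe, so this is best-effort.
static void DumpStackHandler(int /*signum*/) {
  void* frames[64];
  const int depth = backtrace(frames, 64);
  backtrace_symbols_fd(frames, depth, STDERR_FILENO);
  g_stack_dumps.fetch_add(1);
}

// Installs the process-wide SIGUSR2 handler; returns true on success.
bool InstallStackDumpHandler() {
  struct sigaction sa;
  std::memset(&sa, 0, sizeof(sa));
  sa.sa_handler = DumpStackHandler;
  sigemptyset(&sa.sa_mask);
  sa.sa_flags = SA_RESTART;  // restart system calls interrupted by the signal
  return sigaction(SIGUSR2, &sa, nullptr) == 0;
}

// Asks one specific thread to print its stack; returns true if the signal
// was sent.
bool RequestStackDump(pthread_t thread) {
  return pthread_kill(thread, SIGUSR2) == 0;
}
```

The caller still needs to know which `pthread_t` values to signal, and the output of several threads may interleave on stderr.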

@asfimport
Collaborator Author

Antoine Pitrou / @pitrou:
Really, I'm afraid the least unreasonable solution here is to script our CI to automatically find core dumps and script the debugger.

@asfimport
Collaborator Author

David Li / @lidavidm:
That frankly sounds fairly reasonable. With gdb if we get it set up right we can get Python backtraces too where applicable.

@asfimport
Collaborator Author

Antoine Pitrou / @pitrou:
That said, https://github.com/bombela/backward-cpp could be better than nothing on Windows, or in the cases where a working debugger isn't available.

@asfimport
Collaborator Author

Ben Kietzman / @bkietz:

an ugly workaround

Even if it's ugly, this is more or less what GDB itself does to suspend all threads when pausing execution: a single thread in the tracee receives SIGTRAP or SIGINT, to which GDB then responds by sending SIGSTOP to all the other threads. I'm not sure it's possible to do better.

@asfimport
Collaborator Author

Antoine Pitrou / @pitrou:
The problem is how to enumerate all running threads. I suspect GDB uses kernel APIs to do that, which we probably cannot reasonably do. Moreover, we're running from a signal handler: ideally, only async-signal-safe functions should be called...

I'll add that pthread_kill() expects a POSIX thread id but /proc/self/task/ lists process ids...
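To make the mismatch concrete, the following Linux-only sketch obtains the kernel thread id that appears as a directory name under /proc/self/task/. `CurrentKernelTid` and `CurrentTaskPath` are hypothetical helper names. The value is a different kind of identifier from the opaque `pthread_t` returned by `pthread_self()`, which is what `pthread_kill()` takes.

```cpp
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

#include <string>

// Kernel thread id of the calling thread (Linux-specific system call).
pid_t CurrentKernelTid() {
  return static_cast<pid_t>(syscall(SYS_gettid));
}

// Path of the calling thread's entry under /proc/self/task/.
std::string CurrentTaskPath() {
  return "/proc/self/task/" + std::to_string(CurrentKernelTid());
}
```

Listing /proc/self/task/ therefore yields kernel ids, with no portable way to map them back to `pthread_t` handles.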

@asfimport
Collaborator Author

Ben Kietzman / @bkietz:
pthread_kill is async-signal-safe per https://man7.org/linux/man-pages/man7/signal-safety.7.html . As long as we're working with pthreads, each thread can obtain its own pthread_t using pthread_self, and we can maintain a vector of traced threads. It should be as simple as implementing void AddThisThreadToAllThreadTraceVector() and void RemoveThisThreadFromAllThreadTraceVector(). Win32 supports explicit enumeration of threads in a process, so I'm sure we could make it work there also. I'll try it out and make a gist/repo/...
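A minimal sketch of such a registry could look as follows. The function names are adapted from the comment above (with "From" in the remove function); `TracedThreadCount` and the global variables are invented for illustration. One assumption worth flagging: the registry is guarded by a `std::mutex`, which a signal handler cannot safely lock, so a real implementation would need a lock-free or fixed-size structure on the reading side.

```cpp
#include <pthread.h>

#include <algorithm>
#include <cstddef>
#include <mutex>
#include <vector>

namespace {
std::mutex g_threads_mutex;               // guards g_traced_threads
std::vector<pthread_t> g_traced_threads;  // threads that opted in to tracing
}  // namespace

// Registers the calling thread so that it can later be signalled.
void AddThisThreadToAllThreadTraceVector() {
  std::lock_guard<std::mutex> lock(g_threads_mutex);
  g_traced_threads.push_back(pthread_self());
}

// Unregisters the calling thread; must run before the thread exits.
void RemoveThisThreadFromAllThreadTraceVector() {
  const pthread_t self = pthread_self();
  std::lock_guard<std::mutex> lock(g_threads_mutex);
  g_traced_threads.erase(
      std::remove_if(g_traced_threads.begin(), g_traced_threads.end(),
                     [self](pthread_t t) { return pthread_equal(t, self) != 0; }),
      g_traced_threads.end());
}

// Number of currently registered threads (hypothetical helper).
std::size_t TracedThreadCount() {
  std::lock_guard<std::mutex> lock(g_threads_mutex);
  return g_traced_threads.size();
}
```

Each worker thread would call the add function on entry and the remove function on exit, for example from a small RAII guard.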

@asfimport
Collaborator Author

Antoine Pitrou / @pitrou:
Well, Windows doesn't have pthreads, so no need to worry about that anyway :-)

@asfimport
Collaborator Author

Ben Kietzman / @bkietz:
... however, having written that I think the correct solution to the all-threads-trace problem is allowing the process to core dump then reading stacks out of that. This has two advantages over in-process tracing:

  • When a signal handler exists, the non-signaled threads continue execution until they receive signals of their own. However, if a signal is known to be fatal, the OS can shut threads down more aggressively. This means the traces from the threads which didn't segfault are less out of date than what we could get with inter-thread signals
  • We'd probably be reading the core dump with gdb or another debugger and we'd have access to the process' full memory, so we could print not just snippets of the source files but values of local variables as well
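For the core-dump route to work, the process must be allowed to write a core file in the first place. As a small sketch (not something the thread prescribes), a test binary could raise its own soft core-size limit to the hard limit at startup. `EnableCoreDumps` is a hypothetical name. Where the core file ends up depends on system configuration, for example `kernel.core_pattern` on Linux.

```cpp
#include <sys/resource.h>

// Raises the soft RLIMIT_CORE limit to the hard limit so that a fatal signal
// can produce a core file. Returns true on success. The hard limit itself can
// only be raised by a privileged process.
bool EnableCoreDumps() {
  struct rlimit limit;
  if (getrlimit(RLIMIT_CORE, &limit) != 0) return false;
  limit.rlim_cur = limit.rlim_max;
  return setrlimit(RLIMIT_CORE, &limit) == 0;
}
```

If the hard limit is zero, the call succeeds but no core file will be written, so a CI setup would typically also need to adjust the limit outside the process.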

@pitrou
Member

pitrou commented Nov 13, 2024

Automatic traceback generation using gdb on CI was done in PR #43937, closing.

pitrou closed this as completed Nov 13, 2024