Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Register individual FDEs for musl libc. #1914

Merged
merged 1 commit into from
Jun 25, 2020

Conversation

peterhuene
Copy link
Member

When targeting musl, libunwind is used for the __register_frame
implementation.

Unlike when targeting libgcc which expects an entire frame table, the libunwind
implementation expects a single FDE.

This change ensures Wasmtime registers each individual FDE when targeting musl.

Fixes #1904.

Copy link
Contributor

@yurydelendik yurydelendik left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This shall work. (FWIW I know little about cfg(target_env))

@alexcrichton
Copy link
Member

FWIW I did a mild amount of poking with this, but I'm not sure if this works for the purposes of backtrace because cargo test --target x86_64-unknown-linux-musl was still failing.

Also I think the logic may be a bit clearer inverted for glibc-specific bits perhaps? I'm not sure how prevalent libgcc's logic is, but the "write a 0 length at the end of the table" on line 94 above these changes I think is only relevant for glibc.

@peterhuene
Copy link
Member Author

peterhuene commented Jun 23, 2020

It is only relevant for libgcc's implementation, but doesn't really hurt to put it in the entire frame table and just skip it for the libunwind iteration.

You're right, it appears stack tracing isn't working even with __register_frame appearing to properly register the individual FDEs with these changes.

Let me dig into that before we merge this.

@peterhuene
Copy link
Member Author

peterhuene commented Jun 24, 2020

It looks like the unwind cursor can't get past the signal handler trampoline frame and hence doesn't even get to the Wasm frames.

From lldb trace:

(lldb) bt
...
    frame #17: 0x0000000001af8ca0 wasmtime`__restore_rt
    frame #18: 0x00007ffff7de0075  <--- faulting wasm frame with a simple unreachable trap
...
(lldb) di -a 0x0000000001af8ca0
wasmtime`__restore_rt:
    0x1af8ca0 <+0>: movq   $0xf, %rax
    0x1af8ca7 <+7>: syscall 

lldb appears to special case the signal handler function names and hence can walk past this frame.

However, libunwind::UnwindCursor<A, R>::setInfoBasedOnIPRegister with pc=0x0000000001af8ca0 results in _unwindInfoMissing being set to true for the frame. Note that it did search Wasmtime's dynamically registered JIT frames for this one, except it isn't one of ours so that search was fruitless.

There's a nop instruction prior to this trampoline which is important as the PC is subtracted by 1 for the FDE search (it assumes there's been a call to create a frame, after all), so the FDE start for this function would actually be 0x0000000001af8c9f. Without the nop, the adjusted PC would actually be in a different function.

I think this may work from glib rather than musl because it appears an FDE is emitted for __restore_rt.

I'm leaning towards this being a musl bug where no FDE is emitted for __restore_rt.

@peterhuene
Copy link
Member Author

I'd thus expect libunwind::_Unwind_Backtrace to be busted on musl for any walk from a signal handler. I'll see if I can create a small repro in C.

@peterhuene
Copy link
Member Author

peterhuene commented Jun 24, 2020

A simple test program compiled with clang test.c -static -lunwind -llzma (clang 10):

#include <stdio.h>
#include <signal.h>
#include <stdlib.h>
#include <unistd.h>
#define UNW_LOCAL_ONLY
#include <libunwind.h>

void backtrace() {
  unw_cursor_t cursor; unw_context_t uc;
  unw_word_t ip, sp;
  unw_word_t offset;
  char name[256];
  unw_getcontext(&uc);
  unw_init_local(&cursor, &uc);
  while (unw_step(&cursor) > 0) {
    unw_get_reg(&cursor, UNW_REG_IP, &ip);
    unw_get_reg(&cursor, UNW_REG_SP, &sp);
    unw_get_proc_name(&cursor, name, sizeof(name), &offset);
    printf ("ip = %lx, sp = %lx: %s + %lx\n", (long) ip, (long) sp, name, (long) offset);
  }
}

void handler(int sig) {
  backtrace();
  exit(1);
}

void crash() {
  int* p = (int*) 0;
  *p = 1;
}

void bar() {
  crash();
}

void foo() {
  bar();
}

int main() {
  signal(SIGSEGV, handler);
  foo();
}

Interestingly, this does create an FDE for __restore_rt, but the starting PC is not the nop instruction:

$ ldd a.out
/lib/ld-musl-x86_64.so.1: a.out: Not a valid dynamic program

$ readelf --debug-dump=frames -a a.out | grep 000000000041359a
   471: 000000000041359a     9 FUNC    GLOBAL HIDDEN     2 __restore_rt
00001540 000000000000001c 00001528 FDE cie=00001528 pc=000000000041359a..00000000004135a3

As a result, this program is able to walk the stack, albeit it showing sigaction+27 for __restore_rt as sigaction is the nearest non-hidden symbol.

$ ./a.out
ip = 401270, sp = 7ffc962fbba0: handler + 10
ip = 41359a, sp = 7ffc962fbbc0: sigaction + 27
ip = 4012a9, sp = 7ffc962fbbd0: bar + 9
ip = 4012b9, sp = 7ffc962fc180: foo + 9
ip = 4012e2, sp = 7ffc962fc190: main + 22
ip = 40f8eb, sp = 7ffc962fc1b0: libc_start_main_stage2 + 27

A Rust port of the program:

use backtrace::Backtrace;
use nix::sys::signal;

extern "C" fn handler(_: nix::libc::c_int) {
    let bt = Backtrace::new();
    println!("{:?}", bt);
    std::process::exit(1);
}

fn crash() {
    unsafe {
        let p: *mut i32 = std::ptr::null_mut();
        *p = 1;
    }
}

fn bar() {
    crash();
}

fn foo() {
    bar();
}

fn main() {
    unsafe {
        signal::signal(
            signal::Signal::SIGSEGV,
            signal::SigHandler::Handler(handler),
        );
    }

    foo();
}

However, no FDE was emitted for __restore_rt:

$ ldd test
/lib/ld-musl-x86_64.so.1: test: Not a valid dynamic program

$ readelf --debug-dump=frames -a test | grep 0000000000507047
  6473: 0000000000507047     9 FUNC    GLOBAL HIDDEN     2 __restore_rt

As a result, it only prints the top frame:

$ ./test
   0: test::handler
             at /test/src/main.rs:5

@alexcrichton any insights into why the Rust toolchain might be missing this FDE? This repros with stable (llvm 9) and nightly (llvm 10).

@alexcrichton
Copy link
Member

Oh wow awesome investigation!

In your C example you're compiling with clang but are you sure that's linking to musl? By default I'd imagine that links to glibc and I think -static works for simple-ish programs with glibc.

I tried your program in an alpine container and I can't seem to get good results, but I'm not the best at investigating this too. There's two libunwind implementations (libunwind-dev and llvm-libunwind-dev), and I'm pretty sure the one we ship with Rust is the latter (the LLVM one). AFAICT the Rust toolchain ships with a musl built with musl-cross-make and it may be pulling in a too-old version? In any case in Alpine the results I'm getting are:

  • libunwind - backtrace works
  • libunwind plus -static - backtrace works
  • llvm-libunwind - backtrace doesn't get past the first frame
  • llvm-libunwind plus -static - doesn't get past the first frame

The odd part is that with llvm-libunwind I can see a CIE for the backtrace function, so I don't know why it can't get past the first frame. I wonder if this is maybe just a bug with llvm-libunwind?

In any case I think this is pretty far afield from what changes we should make to wasmtime itself. It seems clear here that we should register individual FDEs rather than the whole set at once like we do with glibc. In terms of changes to this PR, it seems like we now have 2/3 platforms (macos, musl) which require individual FDEs so it seems like glibc is the outlier taking the whole set at once. In that case could the code be restructured to have the individual FDE be the general case and the glibc path is just an optimization for glibc? For example the zero-length push at the end is only needed for glibc, and I think the cfg_if! can probably be removed in favor of just a simple if cfg!(linux_gnu) { ... }

@alexcrichton
Copy link
Member

Oh and to make matters worse I don't know how the source of __restore_rt uses eh_frame anywhere...

@peterhuene
Copy link
Member Author

peterhuene commented Jun 24, 2020

I can confirm that my test cases are linking against musl libc (it's an alpine image using a musl-libc targeted clang).

I've been comparing the implementations between llvm-libunwind and libunwind for stack walking. Both support having an S augmentation in the CIE to mark related FDE entries as "signal handler frames". Unfortunately, the CIE for __restore_rt has no augmentations in these test executables.

For llvm-libunwind, that appears to be the only way to determine signal handler frames. Even so, the walking algorithm doesn't seem to do anything for special for these frames (such as skipping over them). This appears to be the crux of Rust's (or the C program linked against llvm-libunwind's) inability to walk the stack from a signal handler.

On the other hand, libunwind, about 10 years ago, used to detect __restore_rt by checking the bytes of the trampoline directly. It would explicitly know how to unwind that frame and skip over it, hence giving more reliable walks. Today a comment claims it uses "kernel provided information for the trampoline", but I haven't yet figured out where it does that.

Regarding if cfg!(linux_gnu) { ... } (are there docs on this variable somewhere? my googlefu is weak), my concern is that there might be non-linux glibc platforms (e.g. kfreebsd, etc.) we'd be doing the wrong thing for in the future. But if you're comfortable with libunwind's __register_frame implementation being the default, I'm good with it too.

@bjorn3
Copy link
Contributor

bjorn3 commented Jun 24, 2020

That comment was introduced in libunwind/libunwind@dac2d00

@alexcrichton
Copy link
Member

Oh geez that's quite a lot of history... Anyway definitely sounds like fuel for the fire of "not something we can fix in wasmtime". That being said this is also more fuel to the fire of "we should not rely on the native system unwinder to generate a backtrace on every single platform" since this mean wasm traps on musl will always be wrong.

@peterhuene were you able to reproduce how libunwind was working better than llvm-libunwind?

Oh also for cfg!(linux_gnu) that's not actually a thing, it's bad shorthand for cfg!(all(target_os = "linux", target_env = "gnu")). In general it just seems like frame registration is pretty likely to be different per platform and we just have it where 2/3 platforms (musl/macos out of those plus linux) require iteration rather than whole-chunk registration. Ideally we'd have a clause here which says "you're on an unknown platform, please test this and see what's right"

@peterhuene
Copy link
Member Author

I'll update the PR to use cfg!(all(target_os = "linux", target_env = "gnu")) and limit the push of the ending record marker in that case too.

I agree we can declare the inability to walk the stack for musl targets using llvm-libunwind an external issue to Wasmtime. I think we should be somehow limiting our walks to sections of the stack that contain Wasm frames (as discussed in #1832) as it would both address this issue and #1845.

@peterhuene
Copy link
Member Author

peterhuene commented Jun 24, 2020

@alexcrichton PR updated.

I added a check for target_env = "" based on the docs. I can remove it if that's not correct.

Copy link
Contributor

@yurydelendik yurydelendik left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Assuming if cfg!(all(target_os = "linux", any(target_env = "gnu", target_env = ""))) is the right way to detect libgcc's unwind library, looks good to me.

When targeting musl, libunwind is used for the `__register_frame`
implementation.

Unlike when targeting libgcc which expects an entire frame table, the libunwind
implementation expects a single FDE.

This change ensures Wasmtime registers each individual FDE when targeting musl.

Fixes bytecodealliance#1904.
@alexcrichton alexcrichton changed the base branch from master to main June 25, 2020 18:48
alexcrichton added a commit to alexcrichton/wasmtime that referenced this pull request Jun 25, 2020
* This PR is against a branch called `main`
* Internally all docs/CI/etc is updated
* The default branch of the repo is now `main`
* All active PRs have been updated to retarget `main`

Closes bytecodealliance#1914
alexcrichton added a commit that referenced this pull request Jun 25, 2020
* This PR is against a branch called `main`
* Internally all docs/CI/etc is updated
* The default branch of the repo is now `main`
* All active PRs have been updated to retarget `main`

Closes #1914
@peterhuene peterhuene reopened this Jun 25, 2020
@peterhuene peterhuene merged commit 9ce67d8 into bytecodealliance:main Jun 25, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

"bad fde" warning when compiled for x86_64-unknown-linux-musl target
4 participants