Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

stacks broken due to clobbered frame pointer (fix: support DWARF/LBR) #1006

Open
aktau opened this issue Dec 10, 2019 · 14 comments
Open

stacks broken due to clobbered frame pointer (fix: support DWARF/LBR) #1006

aktau opened this issue Dec 10, 2019 · 14 comments
Labels

Comments

@aktau
Copy link

aktau commented Dec 10, 2019

bpftrace version: bpftrace v0.9.2-139-g873d
kernel: Linux 4.19

I modified offwake.bt to print userspace stacks, in order to get more information about what's causing the program in question (neovim) to go to sleep.

NOTE: I recompiled it with -fno-omit-frame-pointer to be sure frame-pointer based stack walking would work.

I've made a simplified bpftrace program to debug this issue:

profile:hz:99 /comm == "nvim"/ {
	@samplestacks[ustack(perf, 10)] = count();
	printf("99hz: %s\n", ustack(perf, 10));
}

kprobe:finish_task_switch /comm == "nvim"/ {
	@switchstacks[ustack(perf, 10)] = count();
	printf("finish_task_switch: %s\n", ustack(perf, 10));
}

Usually I get stacks like:

finish_task_switch:                                                                                                                                            
        7fa2516d1f4e epoll_pwait+110 (/lib/x86_64-linux-gnu/libc-2.28.so)   
finish_task_switch:
        7fa2516cc869 syscall+25 (/lib/x86_64-linux-gnu/libc-2.28.so)
finish_task_switch:
        7fa252347f0c __lll_lock_wait+28 (/lib/x86_64-linux-gnu/libpthread-2.28.so)
        558c66768c78 uv__async_io.part.0+280 (/home/aktau/neovim/bin/nvim)
        4 0x4 ([unknown])
        4 0x4 ([unknown])
        4 0x4 ([unknown])
        4 0x4 ([unknown])
        4 0x4 ([unknown])
        4 0x4 ([unknown])
        4 0x4 ([unknown])
        4 0x4 ([unknown])

The syscall is the xstat system call, verified by checking the kstack counterpart of the wake-up). I'm not sure what those 0x4 instructions are for, or how it's even stack walking this, it may be a a hint as to what's going wrong.

Sometimes I do get complete stacks in finish_task_switch, but it is rare:

finish_task_switch:                                                                                                                                            
        7fa252347f0c __lll_lock_wait+28 (/lib/x86_64-linux-gnu/libpthread-2.28.so)                                                                             
        558c66534b7f tui_scheduler+54 (/home/aktau/neovim/bin/nvim)                                                                           
        558c66528583 ui_bridge_flush+109 (/home/aktau/neovim/bin/nvim)                                                                        
        558c6652a80e ui_call_flush+51 (/home/aktau/neovim/bin/nvim)                                                                           
        558c6653473c ui_flush+80 (/home/aktau/neovim/bin/nvim)                                                                                
        558c665fba8e normal_execute.lto_priv.220+1311 (/home/aktau/neovim/bin/nvim)                                                           
        558c6655264d state_enter+71 (/home/aktau/neovim/bin/nvim)                                                                             
        558c665f87c5 normal_enter+136 (/home/aktau/neovim/bin/nvim)                                                                           
        558c6664b605 main+2252 (/home/aktau/neovim/bin/nvim)                                                                                  
        7fa2515fc52b __libc_start_main+235 (/lib/x86_64-linux-gnu/libc-2.28.so)  

This is likely because these wakeup events are rarer, epoll_pwait is by far the most frequent (neovim uses it to wait for user input). But why is the stack truncated? How can I debug this?

By contrast, @samplestacks are usually full stacks. There are a couple of truncated ones though:

99hz:
        7fa25165cc89 __GI___libc_malloc+345 (/lib/x86_64-linux-gnu/libc-2.28.so)                                                                               

99hz:
        558c667740e0 uv_write2+464 (/home/aktau/neovim/bin/nvim)                                                                                                                                                                                                                                              
        3 0x3 ([unknown])  

99hz:
        558c66777f67 uv__hrtime+71 (/home/aktau/neovim/bin/nvim)  

99hz:
        7fa25165cb89 __GI___libc_malloc+89 (/lib/x86_64-linux-gnu/libc-2.28.so)

I don't know why these are broken either. I've once even observed a stack inside the 99hz sampler which looked the same as the most frequent one in finish_task_switch: 7fa2516d1f4e epoll_pwait+110 (/lib/x86_64-linux-gnu/libc-2.28.so) . Though this is hard to reproduce.

I thought maybe this was related to some incomplete process state inside of finish_task_switch. But changing to a kretprobe didn't fix it and then I noticed that every so often I get similar broken stacks from the 99hz profiler.

It seems like a wakeup from a select number of syscalls provokes these broken stacks. I've

Also very curious, in the aggregated stacks, I get two keys in the same map which look identical to me, but apparently aren't to bpftrace.

@switchstacks[
        7fa2516d1f4e epoll_pwait+110 (/lib/x86_64-linux-gnu/libc-2.28.so)
]: 565
@switchstacks[
        7fa2516d1f4e epoll_pwait+110 (/lib/x86_64-linux-gnu/libc-2.28.so)
]: 3446

What gives?

@aktau aktau changed the title Sometimes broken userspace, only shwoing (offwake.bt) Sporadic broken userspace stacks, easy to reproduce in finish_task_switch kprobe (offwake.bt) Dec 10, 2019
@aktau
Copy link
Author

aktau commented Dec 10, 2019

I just had a thought that this may be due to frame pointers being omitted in libraries, like glibc (see https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=767756, I'm on Debian). But that can't be the whole story, I've seen full stacks that end in glibc functions:

finish_task_switch: 
        7fa2516c3617 __GI_chdir+7 (/lib/x86_64-linux-gnu/libc-2.28.so)                                                                                         
        558c665bf339 path_full_dir_name+173 (/home/aktau/neovim/bin/nvim)                                                                     
        558c665bf409 path_to_absolute+140 (/home/aktau/neovim/bin/nvim)                                                                       
        558c665bf4da vim_FullName+98 (/home/aktau/neovim/bin/nvim)                                                                            
        558c665c3054 FullName_save+55 (/home/aktau/neovim/bin/nvim)                                                                           
        558c665c308e fix_fname+14 (/home/aktau/neovim/bin/nvim)                                                                               
        558c666a0125 do_source+84 (/home/aktau/neovim/bin/nvim)                                                                               
        558c666a0efb source_callback+19 (/home/aktau/neovim/bin/nvim)                                                                         
        558c6669989c do_in_path+743 (/home/aktau/neovim/bin/nvim)                                                                             
        558c66699b9e do_in_path_and_pp+86 (/home/aktau/neovim/bin/nvim)  

@aktau
Copy link
Author

aktau commented Dec 10, 2019

Dissasembly of epoll_pwait, uv__hrtime and uv_write2 (mentioned broken stack in the original post):

0000000000000000 <uv_hrtime>:
       0:       31 ff                   xor    %edi,%edi
       2:       e9 00 00 00 00          jmpq   7 <uv_hrtime+0x7>
       7:       66 0f 1f 84 00 00 00    nopw   0x0(%rax,%rax,1)
       e:       00 00 

0000000000001780 <uv_write2>:
    1780:       41 57                   push   %r15
    1782:       41 56                   push   %r14
    1784:       41 55                   push   %r13
    1786:       41 54                   push   %r12
    1788:       55                      push   %rbp
    1789:       53                      push   %rbx
    178a:       48 83 ec 18             sub    $0x18,%rsp
    178e:       85 c9                   test   %ecx,%ecx
    1790:       0f 84 1c 02 00 00       je     19b2 <uv_write2+0x232>
     ...

00000000000f8ee0 <epoll_pwait@@GLIBC_2.6>:
   f8ee0:       48 8d 05 09 58 2c 00    lea    0x2c5809(%rip),%rax        # 3be6f0 <argp_program_version_hook@@GLIBC_2.2.5+0x1d0>
   f8ee7:       41 89 ca                mov    %ecx,%r10d
   f8eea:       8b 00                   mov    (%rax),%eax
   f8eec:       85 c0                   test   %eax,%eax
   f8eee:       75 20                   jne    f8f10 <epoll_pwait@@GLIBC_2.6+0x30>
   f8ef0:       41 b9 08 00 00 00       mov    $0x8,%r9d
   f8ef6:       b8 19 01 00 00          mov    $0x119,%eax
   f8efb:       0f 05                   syscall 
   f8efd:       48 3d 00 f0 ff ff       cmp    $0xfffffffffffff000,%rax
   f8f03:       77 73                   ja     f8f78 <epoll_pwait@@GLIBC_2.6+0x98>
   f8f05:       c3                      retq   
   f8f06:       66 2e 0f 1f 84 00 00    nopw   %cs:0x0(%rax,%rax,1)
   f8f0d:       00 00 00 
   f8f10:       41 57                   push   %r15
   f8f12:       49 89 f7                mov    %rsi,%r15
   f8f15:       41 56                   push   %r14
   f8f17:       4d 89 c6                mov    %r8,%r14
   f8f1a:       41 55                   push   %r13
   f8f1c:       41 89 cd                mov    %ecx,%r13d
   f8f1f:       41 54                   push   %r12
   f8f21:       41 89 d4                mov    %edx,%r12d
   f8f24:       55                      push   %rbp
   f8f25:       53                      push   %rbx
   f8f26:       89 fb                   mov    %edi,%ebx
   f8f28:       48 83 ec 18             sub    $0x18,%rsp
   f8f2c:       e8 9f d1 00 00          callq  1060d0 <pthread_setcanceltype@@GLIBC_2.2.5+0x80>
   f8f31:       41 b9 08 00 00 00       mov    $0x8,%r9d
   f8f37:       4d 89 f0                mov    %r14,%r8
   f8f3a:       89 df                   mov    %ebx,%edi
   f8f3c:       89 c5                   mov    %eax,%ebp      <---- frame pointer clobbering
   f8f3e:       45 89 ea                mov    %r13d,%r10d
   f8f41:       44 89 e2                mov    %r12d,%edx
   f8f44:       4c 89 fe                mov    %r15,%rsi
   f8f47:       b8 19 01 00 00          mov    $0x119,%eax
   f8f4c:       0f 05                   syscall                                     
   f8f4e:       48 3d 00 f0 ff ff       cmp    $0xfffffffffffff000,%rax   <----- instruction pointer

Which indeed looks like none of them save the stack pointer.

What about the glibc function that was walkable (__GI_chdir)?

00000000000ea610 <chdir@@GLIBC_2.2.5>:
   ea610:       b8 50 00 00 00          mov    $0x50,%eax
   ea615:       0f 05                   syscall 
   ea617:       48 3d 01 f0 ff ff       cmp    $0xfffffffffffff001,%rax  <----- instruction pointer
   ea61d:       73 01                   jae    ea620 <chdir@@GLIBC_2.2.5+0x10>
   ...

So it seems like the essential difference is that chdir is simple enough that it didn't clobber the frame pointer.

Since recompiling all libraries with frame pointers would be an arduous task, it's probably best to wait for the resolution of iovisor/bcc#1234 (which seems like an old issue). Also ref iovisor/bcc#1803, which looks related.

@aktau aktau changed the title Sporadic broken userspace stacks, easy to reproduce in finish_task_switch kprobe (offwake.bt) stacks broken due to clobbered frame pointer (fix: support DWARF/LBR) Dec 10, 2019
@danobi
Copy link
Member

danobi commented Dec 10, 2019

Will take a closer look at your writeup soon, but I can comment about LBR stuff.

I've spent some time hacking on lbr+bpf in the kernel (have a patch on lkml pending) and I think it would be quite tricky to get lbr based stacks into bpf progs. You'd have to save + restore lbr data on every context switch for complete stacks. That'd involve quite a bit of arch specific hackery not to mention the constant overhead of saving + restoring lbr data because you can never know when someone will read the data. My guess is even if someone managed to upstream this code, few distros would enable this option b/c of the perf hit.

DWARF based unwinding won't happen for the kernel -- they tried that and went with something custom instead (ORC). Not sure we have a good answer for stacks without frame pointers at the moment.

@fbs
Copy link
Member

fbs commented Dec 10, 2019

Also very curious, in the aggregated stacks, I get two keys in the same map which look identical to me, but apparently aren't to bpftrace.

ustack stores both the pid and the stackid as the pid is needed to translate addresses to symbols, so they're unique entries which just get printed the same way :)

@aktau
Copy link
Author

aktau commented Dec 11, 2019

ustack stores both the pid and the stackid as the pid is needed to translate addresses to symbols, so they're unique entries which just get printed the same way :)

I had no idea. Thanks! For now, I think this should be documented.

Isn't the only thing required the mappings (/proc/$PID/maps), which (if saved) would also allow resolving symbols after the process as exited. It's reasonably annoying that quiting bpftrace after the target process has exited yields unsymbolized stacks.

This makes me think about what it would take to normalize stacks so that one could decide whether two stacks are essentially identical (imagine the same binary with different mapping offsets which when subtracted yield the same stack). I would prefer using this ustacknorm.

In my case, it even seems like different pids have the same address layout (mapping), so one wouldn't even have to normalize. Just save a single /proc/$PID/maps and save the stack as <mapsref, stack> instead of <pid, stack>. But I have to admit I have no idea how common it is that multiple processes launched in the same way have the same mapping.

@aktau
Copy link
Author

aktau commented Dec 11, 2019

I've spent some time hacking on lbr+bpf in the kernel (have a patch on lkml pending) and I think it would be quite tricky to get lbr based stacks into bpf progs. You'd have to save + restore lbr data on every context switch for complete stacks.

Ouch. That sounds expensive.

DWARF based unwinding won't happen for the kernel -- they tried that and went with something custom instead (ORC). Not sure we have a good answer for stacks without frame pointers at the moment.

Does it need to happen in the kernel though? If we have the stack and the mapping it would be possible to symbolize user-space stacks in user-space. Of course, you would need to actually know which process to get mappings for. Since aggregation happens in the kernel, bpftrace may be notified long after the process is dead. Some thoughts:

  • Register a hidden tracepoint/probe that saves maps whenever a new process is created. (I realize this is extremely hacky and expensive, as I suspect the maps aren't even loaded when exec(2) is done.) Likely won't work.
  • Some stub code whenever ustack is called that saves the important part of the process address space to some hidden map if it's not present yet. (I don't know how to deal with PID re-use here, but that seems like a minor problem given all the others.)

@danobi
Copy link
Member

danobi commented Dec 11, 2019

Does it need to happen in the kernel though? If we have the stack and the mapping it would be possible to symbolize user-space stacks in user-space

Right, so symbolization always happens in userspace. And we could definitely save /proc/pid/maps and figure out base addrs and stuff to enabled symbolization after process exit. I'm planning on adding that to bcc soon.

But getting the raw addrs has to be in kernel b/c that's where the bpf prog runs. And without frame pointers enabled, the kernel can't walk the stack to give us the raw addrs. Theoretically, the kernel could parse DWARF info to figure out where the top of the frame is given current PC, but Linus really hates the DWARF state machines.

@fbs
Copy link
Member

fbs commented Dec 11, 2019

Does it need to happen in the kernel though? If we have the stack and the mapping it would be possible to symbolize user-space stacks in user-space.

For dwarf we would have to copy the stack so we can walk through it. bpf won't know which stacks are the same either, so none of the aggregation will work. Instead every single stack will have to be copied to us. I think you're better of using perf in that case.

Isn't the only thing required the mappings (/proc/$PID/maps), which (if saved) would also allow resolving symbols after the process as exited

Yes this pid is used to find the relevant maps in prod/pid/maps.

Just save a single /proc/$PID/maps and save the stack as <mapsref, stack> instead of <pid, stack>

What would mapsref be in this case? Afaik there is no mapsref in bpf, which is why we use the pid.

@aktau
Copy link
Author

aktau commented Dec 15, 2019

For dwarf we would have to copy the stack so we can walk through it. bpf won't know which stacks are the same either, so none of the aggregation will work. Instead every single stack will have to be copied to us. I think you're better of using perf in that case.

You're right. Throughout this entire thread I've made a mess of symbolization and stack walking. The former can indeed be improved in bpftrace, but as you state, stack-walking must be done by the kernel to be able to reap the advantages of in-kernel aggregation. Alas.

Thinking on it, it would have helped me during my investigation to get full stack traces whenever I chose to do per-event output (without aggregation). I'm not sure how feasible such a thing is in the bpftrace model.

@fbs
Copy link
Member

fbs commented Dec 16, 2019

Thinking on it, it would have helped me during my investigation to get full stack traces whenever I chose to do per-event output (without aggregation). I'm not sure how feasible such a thing is in the bpftrace model.

You can printf individual stacks, like: bpftrace -e 'p:hz:1 { printf("%s\n", ustack()); }'

@aktau
Copy link
Author

aktau commented Dec 16, 2019

You can printf individual stacks, like: bpftrace -e 'p:hz:1 { printf("%s\n", ustack()); }'

Yes, I can. But what I was trying to say is that this uses the same kernel stack walking which is limited to "only when the frame pointer is used as a frame pointer". It can't/doesn't use DWARF, which is more likely to be available in user space. What caused me to write about this is the realization that I was missing important information because the application I'm tracing is using glibc which is not compiled with -fno-omit-frame-pointer on my workstation, making many syscall-terminated traces' ustacks useless.

Something like perf(1), which also does per-event tracing, is able to use DWARF. I'm not sure how this process works, as I'd assume the target thread needs to be stopped and its stack walked which could be slow.

@NotAFile
Copy link

Just ran into this.

It's very frustrating, as it effectively makes ustack, by far my favorite bpftrace feature, useless.

Since it looks like this will be harder to solve, could a note be added to the ustack docs to save future souls a few hours of wondering what is wrong with their stack traces?

@NotAFile
Copy link

I went around and had a look at some of the prior discussions I could find to understand the problem:

I think for most of the things I use ustack for, 20% overhead would actually be perfectly acceptable. The things I usually want them for is trying to find the exact source of rare events. 20% overhead is very acceptable in that case.

@dvc94ch
Copy link

dvc94ch commented Jan 31, 2021

I am trying to get user space stack traces without frame pointers. My initial approach was to compile the .eh_frame section to bpf based on [0] and the corresponding c++ code in [1]. For this I ported it to rust, as I couldn't get the code to compile and I happen to be a rust guy. Anyway, the issue that came up was that for some reason clang thinks that the stack size of 512 bytes is exceeded. I'm not sure if that is really the case or if it just makes sure that pointer offsets are smaller than 512, even if those offsets are into code and not the stack.

However I think that this tool could easily be extended to support generating orc sections. If someone kernel savy would like to collaborate, that would be cool.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

6 participants