stacks broken due to clobbered frame pointer (fix: support DWARF/LBR) #1006
I just had a thought that this may be due to frame pointers being omitted in libraries, like glibc (see https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=767756, I'm on Debian). But that can't be the whole story, I've seen full stacks that end in glibc functions:
Disassembly of
Which indeed looks like none of them save the stack pointer. What about the glibc function that was walkable (
So the essential difference seems to be that chdir is simple enough that it never clobbers the frame pointer. Since recompiling all libraries with frame pointers would be an arduous task, it's probably best to wait for the resolution of iovisor/bcc#1234 (which seems to be an old issue). See also iovisor/bcc#1803, which looks related.
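To make the "clobbered frame pointer" explanation concrete, here is a small Python simulation of frame-pointer unwinding. It is only a sketch, not the kernel's code: the `memory` dict, all addresses, and the function name `walk_frame_pointers` are made up for illustration, and it assumes the usual x86-64 layout where `[rbp]` holds the caller's saved `rbp` and `[rbp + 8]` holds the return address.

```python
def walk_frame_pointers(memory, rip, rbp, max_depth=64):
    """Follow the saved-rbp chain and return the code addresses found."""
    stack = [rip]
    for _ in range(max_depth):
        saved_rbp = memory.get(rbp)      # caller's frame pointer
        ret_addr = memory.get(rbp + 8)   # return address of this frame
        if saved_rbp is None or ret_addr is None:
            break                        # unreadable frame: the walk ends here
        stack.append(ret_addr)
        if saved_rbp <= rbp:
            break                        # chain must move up the stack
        rbp = saved_rbp
    return stack


# Hypothetical stack memory: two frames that maintain rbp properly.
memory = {
    0x6000: 0x7000, 0x6008: 0x2000,  # inner app frame, returns to 0x2000
    0x7000: 0x0,    0x7008: 0x1000,  # outermost frame, returns to 0x1000
}

# Sampled inside a library function (rip=0x3000) that left rbp alone.
intact = walk_frame_pointers(memory, rip=0x3000, rbp=0x6000)

# Same sample point, but the library used rbp as a scratch register.
clobbered = walk_frame_pointers(memory, rip=0x3000, rbp=0x4)

print([hex(a) for a in intact])     # full chain
print([hex(a) for a in clobbered])  # only the sampled instruction
```

In the intact case the walk still finds the outer frames even though the library function has no frame of its own, which matches the chdir observation. Note that the return address into the function that called the library sits at `[rsp]`, not in the rbp chain, so that one caller is skipped even in the "good" case.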
Will take a closer look at your writeup soon, but I can comment on the LBR stuff. I've spent some time hacking on LBR + BPF in the kernel (I have a patch pending on LKML) and I think it would be quite tricky to get LBR-based stacks into BPF programs. You'd have to save and restore LBR data on every context switch to get complete stacks. That would involve quite a bit of arch-specific hackery, not to mention the constant overhead of saving and restoring LBR data, because you can never know when someone will read it. My guess is that even if someone managed to upstream this code, few distros would enable the option because of the performance hit. DWARF-based unwinding won't happen for the kernel: they tried that and went with something custom instead (ORC). Not sure we have a good answer for stacks without frame pointers at the moment.
I had no idea. Thanks! For now, I think this should be documented. Isn't the only thing required the mappings ( This makes me think about what it would take to normalize stacks so that one could decide whether two stacks are essentially identical (imagine the same binary with different mapping offsets which when subtracted yield the same stack). I would prefer using this In my case, it even seems like different pids have the same address layout (mapping), so one wouldn't even have to normalize. Just save a single |
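A rough sketch of the normalization idea, in Python: subtract each process's load base from its raw addresses and compare what is left. Everything here is hypothetical (the addresses, the bases, the `normalize` helper), and it assumes every frame lies in the same binary so that one base per process is enough.

```python
def normalize(stack, load_base):
    """Turn absolute addresses into offsets relative to the binary's load base."""
    return tuple(addr - load_base for addr in stack)


# Two hypothetical processes running the same binary at different bases.
base_a = 0x55D0C0000000
base_b = 0x5631A0000000
stack_a = [0x55D0C0001234, 0x55D0C0000F00]
stack_b = [0x5631A0001234, 0x5631A0000F00]

# Raw stacks differ, normalized stacks are identical.
print(stack_a == stack_b)
print(normalize(stack_a, base_a) == normalize(stack_b, base_b))
```

A real version would need one base per mapping rather than per process, since a stack usually crosses several shared objects.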
Ouch. That sounds expensive.
Does it need to happen in the kernel though? If we have the stack and the mapping it would be possible to symbolize user-space stacks in user-space. Of course, you would need to actually know which process to get mappings for. Since aggregation happens in the kernel, bpftrace may be notified long after the process is dead. Some thoughts:
Right, so symbolization always happens in userspace. And we could definitely save /proc/pid/maps and figure out base addrs and so on to enable symbolization after process exit. I'm planning on adding that to bcc soon. But getting the raw addrs has to happen in the kernel, because that's where the bpf prog runs. And without frame pointers enabled, the kernel can't walk the stack to give us the raw addrs. Theoretically, the kernel could parse DWARF info to figure out where the top of the frame is given the current PC, but Linus really hates the DWARF state machines.
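For illustration, saving /proc/pid/maps and using it later could look roughly like the Python sketch below. This is not how bcc or bpftrace implement it; `parse_maps`, `resolve`, and the `Mapping` type are names invented here, and the mapping line in the sample snapshot is made up (only the sampled address and library path come from this issue). The one thing taken as given is the documented maps line layout: address range, perms, offset, dev, inode, pathname.

```python
import bisect
from typing import NamedTuple


class Mapping(NamedTuple):
    start: int
    end: int
    file_offset: int
    path: str


def parse_maps(text):
    """Parse a saved /proc/<pid>/maps snapshot, keeping executable file mappings."""
    mappings = []
    for line in text.splitlines():
        fields = line.split(None, 5)
        if len(fields) < 6 or "x" not in fields[1]:
            continue  # anonymous or non-executable mapping
        start, end = (int(part, 16) for part in fields[0].split("-"))
        mappings.append(Mapping(start, end, int(fields[2], 16), fields[5].strip()))
    return sorted(mappings)


def resolve(mappings, addr):
    """Return (path, file offset) for addr, or None if no mapping covers it."""
    starts = [m.start for m in mappings]
    i = bisect.bisect_right(starts, addr) - 1
    if i < 0 or addr >= mappings[i].end:
        return None
    m = mappings[i]
    return m.path, addr - m.start + m.file_offset


# Hypothetical snapshot, taken while the process was still alive.
snapshot = """\
7fa2515cd000-7fa2515ef000 r--p 00000000 08:01 1234 /lib/x86_64-linux-gnu/libc-2.28.so
7fa2515ef000-7fa251737000 r-xp 00022000 08:01 1234 /lib/x86_64-linux-gnu/libc-2.28.so
7ffd1c000000-7ffd1c021000 rw-p 00000000 00:00 0 [stack]
"""

maps = parse_maps(snapshot)
hit = resolve(maps, 0x7FA2516D1F4E)
print(hit[0], hex(hit[1]))
print(resolve(maps, 0x4))
```

The (path, file offset) pair is what a symbolizer needs to look up the function in the ELF file, and it stays valid after the process has exited as long as the file on disk is unchanged.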
For DWARF we would have to copy the stack so we can walk through it. bpf won't know which stacks are the same either, so none of the aggregation will work. Instead every single stack will have to be copied to us. I think you're better off using
Yes, this pid is used to find the relevant maps in /proc/pid/maps.
What would mapsref be in this case? AFAIK there is no mapsref in bpf, which is why we use the pid.
You're right. Throughout this entire thread I've made a mess of symbolization and stack walking. The former can indeed be improved in bpftrace, but as you state, stack-walking must be done by the kernel to be able to reap the advantages of in-kernel aggregation. Alas. Thinking on it, it would have helped me during my investigation to get full stack traces whenever I chose to do per-event output (without aggregation). I'm not sure how feasible such a thing is in the bpftrace model. |
You can printf individual stacks, like:
Yes, I can. But what I was trying to say is that this uses the same kernel stack walking which is limited to "only when the frame pointer is used as a frame pointer". It can't/doesn't use DWARF, which is more likely to be available in user space. What caused me to write about this is the realization that I was missing important information because the application I'm tracing is using glibc which is not compiled with Something like perf(1), which also does per-event tracing, is able to use DWARF. I'm not sure how this process works, as I'd assume the target thread needs to be stopped and its stack walked which could be slow. |
Just ran into this. It's very frustrating, as it effectively makes Since it looks like this will be harder to solve, could a note be added to the ustack docs to save future souls a few hours of wondering what is wrong with their stack traces? |
I went around and had a look at some of the prior discussions I could find to understand the problem:
I think for most of the things I use |
I am trying to get user space stack traces without frame pointers. My initial approach was to compile the However I think that this tool could easily be extended to support generating ORC sections. If someone kernel-savvy would like to collaborate, that would be cool.
bpftrace version: bpftrace v0.9.2-139-g873d
kernel: Linux 4.19
I modified `offwake.bt` to print userspace stacks, in order to get more information about what's causing the program in question (neovim) to go to sleep.
NOTE: I recompiled it with `-fno-omit-frame-pointer` to be sure frame-pointer based stack walking would work.
I've made a simplified bpftrace program to debug this issue:
Usually I get stacks like:
The `syscall` is the `xstat` system call (verified by checking the `kstack` counterpart of the wake-up). I'm not sure what those 0x4 instructions are for, or how it's even stack walking this; it may be a hint as to what's going wrong.
Sometimes I do get complete stacks in `finish_task_switch`, but it is rare:
This is likely because these wakeup events are rarer; `epoll_pwait` is by far the most frequent (neovim uses it to wait for user input). But why is the stack truncated? How can I debug this?
By contrast, `@samplestacks` are usually full stacks. There are a couple of truncated ones though:
I don't know why these are broken either. I've once even observed a stack inside the 99 Hz sampler which looked the same as the most frequent one in `finish_task_switch`: `7fa2516d1f4e epoll_pwait+110 (/lib/x86_64-linux-gnu/libc-2.28.so)`. This is hard to reproduce, though.
I thought maybe this was related to some incomplete process state inside of `finish_task_switch`. But changing to a kretprobe didn't fix it, and then I noticed that every so often I get similar broken stacks from the 99 Hz profiler.
It seems like a wakeup from a select number of syscalls provokes these broken stacks. I've
Also very curious, in the aggregated stacks, I get two keys in the same map which look identical to me, but apparently aren't to bpftrace.
What gives?