The weird stack/page-table/general memory(?) corruption issue #56
Comments
I've been meaning to fuzz a few things to make sure they don't have bugs that could mess with stuff elsewhere:
This seems to be because the return address popped by the […]
I've built some basic stacktracing abilities to help debug this issue, which has created some interesting output:
Somehow, between the start of the system call and the end, the user stack is being corrupted, even though we've switched to the kernel stack and done all our work on that. This stack corruption can then cause either a […]

We now execute all the way through […]

From the output of […]

Kernel stack area, showing one valid kernel stack, and then a corrupted area:

(adding some logging to the syscall stuff seems to replace this kernel stack corruption with the same bug as before (i.e. the user stack being messed up, so we end up returning to […]

Just braindumping for now because I picked this up while trying to solve another issue, but after updating to […]

Serial output from when we drop to usermode, up to the second syscall of the second task:
What with med school, I unfortunately don't really have the mental bandwidth to sit down and work out what I'm sure is just some dumb bug somewhere in the paging code or something. This is a bit of a shame since the actual idea is coming together nicely when it bloody works. The current incarnation of the bug is page-table corruption again, this time localised specifically to the TLS area created for the task (only two are loaded, […]

Serial output:
Output of `info tlb` after #PF
Edit: I think this is an issue with the page tables constructed for each task, but I'm not entirely sure. Subtle changes (probably changing code size of the kernel or whatever) create corrupted mappings for varying memory regions (usually single regions (e.g. one stack, one image segment) but not exclusively), often ending up pointing to completely wrong addresses or with wrong flags. This is not exclusive to the user mappings either - we've also seen user kernel stacks become user-accessible, but we have yet to see part of the kernel's actual page tables become corrupted.
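To make "wrong addresses or wrong flags" easier to spot when reading a raw entry, here is a minimal host-side sketch in Rust that decodes an x86_64 page-table entry and compares it against what a mapping was supposed to be. The bit positions are the architectural ones for 4-level paging; the names `DecodedEntry`, `decode_entry` and `mapping_matches` are made up for illustration and are not taken from the kernel.

```rust
// Architectural bit positions in an x86_64 page-table entry (4-level paging).
const PRESENT: u64 = 1 << 0;
const WRITABLE: u64 = 1 << 1;
const USER_ACCESSIBLE: u64 = 1 << 2;
const HUGE_PAGE: u64 = 1 << 7;
const NO_EXECUTE: u64 = 1 << 63;
// Bits 12..=51 hold the physical address of the frame or next-level table.
const ADDRESS_MASK: u64 = 0x000f_ffff_ffff_f000;

/// Hypothetical decoded view of a raw entry, for logging and comparison.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct DecodedEntry {
    pub physical_address: u64,
    pub present: bool,
    pub writable: bool,
    pub user_accessible: bool,
    pub huge_page: bool,
    pub no_execute: bool,
}

/// Split a raw 64-bit entry into its address and the flags we care about.
pub fn decode_entry(raw: u64) -> DecodedEntry {
    DecodedEntry {
        physical_address: raw & ADDRESS_MASK,
        present: raw & PRESENT != 0,
        writable: raw & WRITABLE != 0,
        user_accessible: raw & USER_ACCESSIBLE != 0,
        huge_page: raw & HUGE_PAGE != 0,
        no_execute: raw & NO_EXECUTE != 0,
    }
}

/// Hypothetical check: is the entry present, pointing at the frame we
/// expected, and carrying the user-accessible bit only when it should?
pub fn mapping_matches(raw: u64, expected_physical: u64, expect_user: bool) -> bool {
    let entry = decode_entry(raw);
    entry.present
        && entry.physical_address == expected_physical
        && entry.user_accessible == expect_user
}
```

A check like `mapping_matches(raw, frame, false)` on each kernel-stack entry would flag the "kernel stack became user-accessible" case directly, assuming the expected frame is recorded when the mapping is created.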
Re-enabled all the stuff we'd turned off to try and isolate the issue today and it uh, got worse. However, it's more obvious what has actually happened here: the entire set of page tables has been overwritten (all real mappings are gone, unlike most of the time), and the entire thing is now filled with the exact same value (same address and flags) in a bunch of P3 entries (each address offset by 1GiB). In fact, this looks like almost all (but not quite??) of one whole P4 entry (one P3 table). I wonder if this explains the corruption of a single "thing" (e.g. a single kernel stack) - since we spread everything out quite a lot, a single page being overwritten could lead to this?

Output of `info tlb`:
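The 1GiB spacing is what you'd expect from consecutive P3 entries: with 4-level paging each table index is 9 bits, so one P3 entry covers 2^30 bytes and one P4 entry (a whole P3 table) covers 2^39 bytes. The sketch below, in Rust, splits a virtual address into its table indices and looks for a table that is mostly filled with one repeated value. It is a host-side illustration only; `table_indices` and `dominant_entry` are hypothetical names, not functions from the kernel.

```rust
/// Address range covered by a single P3 entry (1GiB).
pub const P3_ENTRY_SPAN: u64 = 1 << 30;
/// Address range covered by a single P4 entry, i.e. one whole P3 table (512GiB).
pub const P4_ENTRY_SPAN: u64 = 1 << 39;

/// Split a virtual address into its (P4, P3, P2, P1) table indices.
/// Each index is 9 bits wide; the low 12 bits are the offset within the page.
pub fn table_indices(virtual_address: u64) -> (usize, usize, usize, usize) {
    let p4 = ((virtual_address >> 39) & 0x1ff) as usize;
    let p3 = ((virtual_address >> 30) & 0x1ff) as usize;
    let p2 = ((virtual_address >> 21) & 0x1ff) as usize;
    let p1 = ((virtual_address >> 12) & 0x1ff) as usize;
    (p4, p3, p2, p1)
}

/// Find the most common non-zero value in a 512-entry table. Returns the
/// value and its count if it appears at least `threshold` times, which is
/// a cheap hint that the table was overwritten with a fill pattern.
pub fn dominant_entry(table: &[u64; 512], threshold: usize) -> Option<(u64, usize)> {
    // Sort a copy so equal values become adjacent runs.
    let mut sorted = *table;
    sorted.sort_unstable();

    let mut best: (u64, usize) = (0, 0);
    let mut start = 0;
    while start < sorted.len() {
        let value = sorted[start];
        let mut end = start;
        while end < sorted.len() && sorted[end] == value {
            end += 1;
        }
        let count = end - start;
        if value != 0 && count > best.1 {
            best = (value, count);
        }
        start = end;
    }

    if best.1 >= threshold {
        Some(best)
    } else {
        None
    }
}
```

Running something like `dominant_entry` over each table from a debug hook after a fault would distinguish "one frame holding a P3 table got overwritten" from more scattered damage, assuming the tables can be read at that point.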
So I've fixed the problem where we needed to keep the UEFI boot services memory mapped in the kernel page tables, which does not fix this problem, but gets rid of something I was uncomfortable about and had (clearly unfounded) worries about. The fact that it was […]

It turns out it was actually our loader's stack, created in a weird region we weren't keeping mapped in the kernel page tables. Before […]

This makes it seem like UEFI is behaving itself after all (I guess we have to take back some of the things we said about it), putting the ball squarely in our court on the corruption thing.
Since starting to flesh out the syscall layer and userspace functionality, we've been seeing an on-and-off issue that presents in a couple of different ways:

- […] `0x0` upon a `sysret`, even though the correct RIP is saved to the stack and (seems to be) restored to RCX (after a bunch of successful system calls)
- `#GP` in userspace on a `ret` instruction after a `sysret` instead of a `#PF` from returning to address `0x0`. Again, this is after a bunch of successful system calls.
- […] `efiloader` and switched to by the scheduler. The presence of a second task can even change the behaviour of the first task, which suggests a deeper issue.

I am running off the assumption that all these issues are caused by an elusive root issue that is causing UB that presents in strange ways, but this is not known for certain and there could well be multiple distinct issues. The most peculiar thing about this problem is that it has been 'fixed' a few times (notably by `cbed8cd`, which fixed it until `2b81d5d`), but always ends up showing back up with a (seemingly) unrelated change.

I'm using this issue to track progress on fixing it, which I'm imagining will also involve expanding our kernel test coverage to try and confirm that things are working as intended.
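As one small example of the kind of check that could go into that test coverage: in 64-bit mode `sysret` loads RIP from RCX and RFLAGS from R11, so the value about to go into RCX can be sanity-checked just before the return. The Rust sketch below is an assumption about what such a check might look like, not code from the kernel; `is_canonical` and `is_plausible_user_rip` are hypothetical names, and it assumes 4-level paging with userspace in the lower half.

```rust
/// With 4-level paging, an address is canonical when bits 47..=63 are all
/// equal (all zeros for the lower half, all ones for the higher half).
pub fn is_canonical(address: u64) -> bool {
    let upper = address >> 47; // the top 17 bits
    upper == 0 || upper == 0x1ffff
}

/// Hypothetical sanity check for the saved user RIP before a `sysret`:
/// it must be non-zero and lie in the lower (user) half of the address space.
pub fn is_plausible_user_rip(address: u64) -> bool {
    address != 0 && address < 0x0000_8000_0000_0000
}
```

Logging or asserting on `is_plausible_user_rip` at syscall exit would catch the return-to-`0x0` case while still in kernel mode, where the saved state is easier to inspect than after the fault in userspace.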