-
Notifications
You must be signed in to change notification settings - Fork 1.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Rework/simplify unwind infrastructure, implement Windows unwind, and add Windows/new-backend to CI. #2710
Conversation
You are missing unwindinfo for the Another thing: How hard would implementing asynchronous exceptions with landingpads for cleanup be after this PR compared to before? For example Theseus OS depends on unwinding after exceptions for fault recovery. LLVM's lack of asynchronous exception support is responsible for a significant part of the failed recovery attempts after a fault is injected. (See table 1 of https://kevinaboos.web.rice.edu/docs/theseus_boos_osdi2020.pdf) |
Can you be more specific about what you believe is missing? The unwind directives in the Windows case are: In the latter case (SysV), I looked at the output of But if there's something non-idiomatic about the output in some case, then I'm happy to fix it!
That's interesting. I expect that we'll need to tackle landingpads at some point, driven at least by Wasm exceptions (when that proposal advances) as well as by other use-cases like this one. The complexity around them, though (correctly modeling control-flow, including during optimizations) is something we would need to carefully study, and fully asynchronous exceptions would also require CFI info for epilogues (we don't generate this today because we never unwind in the middle of an epilogue -- there's simply no way for the CLIF to insert any user action there, and there are no asynchronous interrupts). So I wouldn't call it easy, exactly, and I can't personally justify spending time on it in the short term, but it should be addressed at some point. |
I only saw the
I understand that it is not a priority right now. I just wanted you to consider if landing this PR would make it harder to implement that to avoid possibly having to revert back to the previous approach. |
Hmm. Possibly this has to do with the placement of the unwindinfo opcodes; the Push and SetFP are both after the (I did check with clang on Msys/MinGW just now and the unwind opcodes we're generating are exactly the same as what it generates for its prologue, modulo the above placement issue, so I hope this does it...)
Ah, I see; if anything it will be easier with a more direct approach, I think. Most of the work for landingpads will be dealing with the control-flow implications in the rest of the compiler; adding the exception handler info to the tables will be relatively straightforward, I think. |
@bjorn3, I've updated and verified now that stepping through instruction-by-instruction with I also ripped out the old SysV unwind infra in the new-x64 and aarch64 backends; this was dead code, replaced by this unified new implementation, but I had not removed it previously. Happy to see diff-stats show net-negative lines of code after doing that even with Fastcall and epilogues added! |
79bded4
to
cb401b2
Compare
cb401b2
to
e0a3793
Compare
Updated again; I realized that we need a few more ops around mid-function epilogues ( |
07c6ca1
to
efc0a31
Compare
So it turns out that the epilogue CFA info was causing segfaults on CI but not locally when switching to the new backend in #2718, and I wasn't able to find the root cause after much head-scratching. Since epilogue info is not necessary for our unwind needs (it only becomes necessary if one has a trap somewhere in the middle of the |
@alexcrichton would you be willing to review this given that you took a look above already? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure would be happy to!
FWIW while I'm pretty confident in the dwarf unwind information I'm much less confident in the Windows bits. This was the first time I looked at the documentation for windows unwinding codes and to really get a good understanding I'd need to set aside a bigger chunk of time to dig in and see how what we do maps to the specification and such. I suspect it's all fine since tests are all green, but there were definitely lots of places I couldn't connect the dots to reason about why things were the way they were.
a35b6f8
to
317c072
Compare
Updated based on comments; thanks! |
e038b2b
to
62a21ad
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks so much for all the new comments! Definitely makes sense to me now :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks fantastic!
I glanced over the aarch64-related changes for the most part, although I did pay attention to the common unwind representation and read over your detailed comments explaining things.
Therefore I paid the most attention to the x64 backend changes and the changes to the unwind generation for both SystemV and Windows and have just three minor comments.
Our previous implementation of unwind infrastructure was somewhat complex and brittle: it parsed generated instructions in order to reverse-engineer unwind info from prologues. It also relied on some fragile linkage to communicate instruction-layout information that VCode was not designed to provide. A much simpler, more reliable, and easier-to-reason-about approach is to embed unwind directives as pseudo-instructions in the prologue as we generate it. That way, we can say what we mean and just emit it directly. The usual reasoning that leads to the reverse-engineering approach is that metadata is hard to keep in sync across optimization passes; but here, (i) prologues are generated at the very end of the pipeline, and (ii) if we ever do a post-prologue-gen optimization, we can treat unwind directives as black boxes with unknown side-effects, just as we do for some other pseudo-instructions today. It turns out that it was easier to just build this for both x64 and aarch64 (since they share a factored-out ABI implementation), and wire up the platform-specific unwind-info generation for Windows and SystemV. Now we have simpler unwind on all platforms and we can delete the old unwind infra as soon as we remove the old backend. There were a few consequences to supporting Fastcall unwind in particular that led to a refactor of the common ABI. Windows only supports naming clobbered-register save locations within 240 bytes of the frame-pointer register, whatever one chooses that to be (RSP or RBP). We had previously saved clobbers below the fixed frame (and below nominal-SP). The 240-byte range has to include the old RBP too, so we're forced to place clobbers at the top of the frame, just below saved RBP/RIP. This is fine; we always keep a frame pointer anyway because we use it to refer to stack args. It does mean that offsets of fixed-frame slots (spillslots, stackslots) from RBP are no longer known before we do regalloc, so if we ever want to index these off of RBP rather than nominal-SP because we add support for `alloca` (dynamic frame growth), then we'll need a "nominal-BP" mode that is resolved after regalloc and clobber-save code is generated. I added a comment to this effect in `abi_impl.rs`. The above refactor touched both x64 and aarch64 because of shared code. This had a further effect in that the old aarch64 prologue generation subtracted from `sp` once to allocate space, then used stores to `[sp, offset]` to save clobbers. Unfortunately the offset only has 7-bit range, so if there are enough clobbered registers (and there can be -- aarch64 has 384 bytes of registers; at least one unit test hits this) the stores/loads will be out-of-range. I really don't want to synthesize large-offset sequences here; better to go back to the simpler pre-index/post-index `stp r1, r2, [sp, #-16]` form that works just like a "push". It's likely not much worse microarchitecturally (dependence chain on SP, but oh well) and it actually saves an instruction if there's no other frame to allocate. As a further advantage, it's much simpler to understand; simpler is usually better. This PR adds the new backend on Windows to CI as well.
dd9be5f
to
2d5db92
Compare
@sunfishcode it looks like there might be an intermittent test with |
Hi @cfallin, I finally got around to updating my s390x patch to this new unwind method. This ran into a couple of problems, so I'm wondering whether I'm on the right track here. I'd appreciate your input ...
|
Hi @uweigand -- thanks for the efforts to update and sorry about the churn! I'll note first as an aside that I'm quite open to merging the s390x backend sooner rather than later, if you think it is at a place for that, as it would allow me and others to consider its constraints and update it along with the others when refactors occur. I am of course also happy to work out followup changes to resolve issues after-the-fact, as we're doing here, but e.g. if I had known that the "accumulate outgoing args" mechanism was used by s390x as well, it might have affected my cleanup/simplification calculus a bit (sorry about that!). In particular, I'm currently working on an update to the register allocator, which involves some churn in the way it is used in the lowering backends. I'm happy to do the update in all backends in-tree, including s390x if it is merged; keeping it out-of-tree creates more rebasing pain for you as we refactor and evolve, which I do regret :-/ To the specific questions:
Yes, I think so, though without seeing your ABI implementation in more detail it's hard to say for sure. Are you using
Sorry about this! I worked to simplify the common ABI implementation as I refactored it and this seemed to me to be implementing something we could do in a simpler way, at least on the x64 and aarch64 backends. I hadn't realized it was used on s390x. I'm happy to take a patch to add it back, or, if it is simpler, perhaps the s390x backend could directly implement this in its ABI code (without indirecting through the traits)?
We do definitely want to make this possible eventually. Without thinking through the implications a bit more, I don't have a strong opinion which of the two approaches you name makes more sense, but my initial thought is that perhaps we add
To unify the view between the different platforms, I think it does make sense to continue to view the clobber locations in terms of a "clobber frame" that starts below them; I think that if we make (This is useful pathfinding for our eventual omit-FP optimization on other architectures too, by the way, so thank you for probing these issues now!) |
No problem, thanks for the reply!
I agree, in particular because there now also seems to be renewed interest in s390x support in wasmtime due to additional use cases. I'll try to get things ready for an initial merge soon, hopefully within the next few weeks.
I'm using On my platform, I have another use case in target-specific code. This again has to do with the fact that I am not using a frame pointer register, and therefore the code generator has to rewrite accesses to the incoming argument slots (which are expressed by common code via offsets relative to the FP) in terms of offsets relative to the (nominal) SP. This is done by simply adding the nominal_sp_to_fp offset, which used to work fine, but has now stopped working. So we can either fix the frame_size implementation so that nominal_sp_to_fp again returns the actual offset between those two points in the stack, or else I'll have to find some other way to pass the correct offset including clobbers to the back-end codegen (e.g. via a pseudo instruction). In case you agree with changing frame_size, here's a PR to do so: #2836
Here's a PR to add this support back: #2837 I'd be happy to implement this solely in the back-end, but the main problem I was having with this is that there does not appear to be any back-end specific (mutable) state associated with an ABICallee. I need this state to track the required outgoing argument space as we go through the whole function and lower it; each call instruction that is found will have to update the required size accordingly. Finally, in the prologue and epilogue code generator I need to access that accumulated value so I can alloc/dealloc the correct amount of extra stack space. Without the common code implementation, do you have a suggestion where I should hold this extra state?
Here's a PR to add support for operation without frame pointer along the lines suggested above: #2838 In particular, this extends DefineNewFrame as you suggest for FP-less operation, and adds a new StackAlloc operation. With those two, I can define the unwind information in my backend in a pretty straightforward manner. |
The PRs all look good -- will land all three. Thanks! |
Thank you! |
Our previous implementation of unwind infrastructure was somewhat
complex and brittle: it parsed generated instructions in order to
reverse-engineer unwind info from prologues. It also relied on some
fragile linkage to communicate instruction-layout information that VCode
was not designed to provide.
A much simpler, more reliable, and easier-to-reason-about approach is to
embed unwind directives as pseudo-instructions in the prologue as we
generate it. That way, we can say what we mean and just emit it
directly.
The usual reasoning that leads to the reverse-engineering approach is
that metadata is hard to keep in sync across optimization passes; but
here, (i) prologues are generated at the very end of the pipeline, and
(ii) if we ever do a post-prologue-gen optimization, we can treat unwind
directives as black boxes with unknown side-effects, just as we do for
some other pseudo-instructions today.
It turns out that it was easier to just build this for both x64 and
aarch64 (since they share a factored-out ABI implementation), and wire
up the platform-specific unwind-info generation for Windows and SystemV.
Now we have simpler unwind on all platforms and we can delete the old
unwind infra as soon as we remove the old backend.
There were a few consequences to supporting Fastcall unwind in
particular that led to a refactor of the common ABI. Windows only
supports naming clobbered-register save locations within 240 bytes of
the frame-pointer register, whatever one chooses that to be (RSP or
RBP). We had previously saved clobbers below the fixed frame (and below
nominal-SP). The 240-byte range has to include the old RBP too, so we're
forced to place clobbers at the top of the frame, just below saved
RBP/RIP. This is fine; we always keep a frame pointer anyway because we
use it to refer to stack args. It does mean that offsets of fixed-frame
slots (spillslots, stackslots) from RBP are no longer known before we do
regalloc, so if we ever want to index these off of RBP rather than
nominal-SP because we add support for
alloca
(dynamic frame growth),then we'll need a "nominal-BP" mode that is resolved after regalloc and
clobber-save code is generated. I added a comment to this effect in
abi_impl.rs
.The above refactor touched both x64 and aarch64 because of shared code.
This had a further effect in that the old aarch64 prologue generation
subtracted from
sp
once to allocate space, then used stores to[sp, offset]
to save clobbers. Unfortunately the offset only has 7-bit
range, so if there are enough clobbered registers (and there can be --
aarch64 has 768 bytes of registers; at least one unit test hits this)
the stores/loads will be out-of-range. I really don't want to synthesize
large-offset sequences here; better to go back to the simpler
pre-index/post-index
stp r1, r2, [sp, #-16]
form that works just likea "push". It's likely not much worse microarchitecturally (dependence
chain on SP, but oh well) and it actually saves an instruction if
there's no other frame to allocate. As a further advantage, it's much
simpler to understand; simpler is usually better.
This PR adds the new backend on Windows to CI as well.