Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Rework/simplify unwind infrastructure, implement Windows unwind, and add Windows/new-backend to CI. #2710

Merged
merged 2 commits into from
Mar 12, 2021

Conversation

cfallin
Copy link
Member

@cfallin cfallin commented Mar 6, 2021

Our previous implementation of unwind infrastructure was somewhat
complex and brittle: it parsed generated instructions in order to
reverse-engineer unwind info from prologues. It also relied on some
fragile linkage to communicate instruction-layout information that VCode
was not designed to provide.

A much simpler, more reliable, and easier-to-reason-about approach is to
embed unwind directives as pseudo-instructions in the prologue as we
generate it. That way, we can say what we mean and just emit it
directly.

The usual reasoning that leads to the reverse-engineering approach is
that metadata is hard to keep in sync across optimization passes; but
here, (i) prologues are generated at the very end of the pipeline, and
(ii) if we ever do a post-prologue-gen optimization, we can treat unwind
directives as black boxes with unknown side-effects, just as we do for
some other pseudo-instructions today.

It turns out that it was easier to just build this for both x64 and
aarch64 (since they share a factored-out ABI implementation), and wire
up the platform-specific unwind-info generation for Windows and SystemV.
Now we have simpler unwind on all platforms and we can delete the old
unwind infra as soon as we remove the old backend.

There were a few consequences to supporting Fastcall unwind in
particular that led to a refactor of the common ABI. Windows only
supports naming clobbered-register save locations within 240 bytes of
the frame-pointer register, whatever one chooses that to be (RSP or
RBP). We had previously saved clobbers below the fixed frame (and below
nominal-SP). The 240-byte range has to include the old RBP too, so we're
forced to place clobbers at the top of the frame, just below saved
RBP/RIP. This is fine; we always keep a frame pointer anyway because we
use it to refer to stack args. It does mean that offsets of fixed-frame
slots (spillslots, stackslots) from RBP are no longer known before we do
regalloc, so if we ever want to index these off of RBP rather than
nominal-SP because we add support for alloca (dynamic frame growth),
then we'll need a "nominal-BP" mode that is resolved after regalloc and
clobber-save code is generated. I added a comment to this effect in
abi_impl.rs.

The above refactor touched both x64 and aarch64 because of shared code.
This had a further effect in that the old aarch64 prologue generation
subtracted from sp once to allocate space, then used stores to [sp, offset]
to save clobbers. Unfortunately the offset only has 7-bit
range, so if there are enough clobbered registers (and there can be --
aarch64 has 768 bytes of registers; at least one unit test hits this)
the stores/loads will be out-of-range. I really don't want to synthesize
large-offset sequences here; better to go back to the simpler
pre-index/post-index stp r1, r2, [sp, #-16] form that works just like
a "push". It's likely not much worse microarchitecturally (dependence
chain on SP, but oh well) and it actually saves an instruction if
there's no other frame to allocate. As a further advantage, it's much
simpler to understand; simpler is usually better.

This PR adds the new backend on Windows to CI as well.

@cfallin cfallin requested review from peterhuene and abrown March 6, 2021 23:49
@github-actions github-actions bot added cranelift Issues related to the Cranelift code generator cranelift:area:machinst Issues related to instruction selection and the new MachInst backend. cranelift:area:aarch64 Issues related to AArch64 backend. cranelift:area:x64 Issues related to x64 codegen cranelift:meta Everything related to the meta-language. labels Mar 7, 2021
@bjorn3
Copy link
Contributor

bjorn3 commented Mar 7, 2021

You are missing unwindinfo for the mov %rsp,%rbp instruction in the prologue.

Another thing: How hard would implementing asynchronous exceptions with landingpads for cleanup be after this PR compared to before? For example Theseus OS depends on unwinding after exceptions for fault recovery. LLVM's lack of asynchronous exception support is responsible for a significant part of the failed recovery attempts after a fault is injected. (See table 1 of https://kevinaboos.web.rice.edu/docs/theseus_boos_osdi2020.pdf)

@cfallin
Copy link
Member Author

cfallin commented Mar 8, 2021

You are missing unwindinfo for the mov %rsp,%rbp instruction in the prologue.

Can you be more specific about what you believe is missing?

The unwind directives in the Windows case are: PushRegister with rbp (this corresponds with the push rbp); SetFPReg with rbp (this corresponds with the mov rbp, rsp); StackAlloc if a frame is allocated; and a SaveRegister for each clobber-save. (Code here.) The CFI directives in the SysV case are: CfaOffset and Offset for the push rbp, indicating that the CFA is at RSP+16 and that RBP is saved at CFA-16; Offset for LR on aarch64 with CFA-8; CfaRegister with RBP on the mov rbp, rsp indicating that the CFA is now defined in terms of rbp; and Offset for each saved clobbered register. (Code here.)

In the latter case (SysV), I looked at the output of gcc -g -S and clang -g -S to match the CFI directives they were generating. On all platforms tests are passing (and a number of unit tests actually exercise unwind, as I can attest from debugging!), so it would seem that whatever directives that are needed are present.

But if there's something non-idiomatic about the output in some case, then I'm happy to fix it!

Another thing: How hard would implementing asynchronous exceptions with landingpads for cleanup be after this PR compared to before? For example Theseus OS depends on unwinding after exceptions for fault recovery. LLVM's lack of asynchronous exception support is responsible for a significant part of the failed recovery attempts after a fault is injected. (See table 1 of https://kevinaboos.web.rice.edu/docs/theseus_boos_osdi2020.pdf)

That's interesting. I expect that we'll need to tackle landingpads at some point, driven at least by Wasm exceptions (when that proposal advances) as well as by other use-cases like this one. The complexity around them, though (correctly modeling control-flow, including during optimizations) is something we would need to carefully study, and fully asynchronous exceptions would also require CFI info for epilogues (we don't generate this today because we never unwind in the middle of an epilogue -- there's simply no way for the CLIF to insert any user action there, and there are no asynchronous interrupts). So I wouldn't call it easy, exactly, and I can't personally justify spending time on it in the short term, but it should be addressed at some point.

@bjorn3
Copy link
Contributor

bjorn3 commented Mar 8, 2021

Can you be more specific about what you believe is missing?

I only saw the push rbp and sub rsp, .... instructions have a corresponding unwindinfo entry. In addition if I stepped to the mov rbp, rsp there was an extra ??? frame in the backtrace produced by gdb just below the top frame.

So I wouldn't call it easy, exactly, and I can't personally justify spending time on it in the short term, but it should be addressed at some point.

I understand that it is not a priority right now. I just wanted you to consider if landing this PR would make it harder to implement that to avoid possibly having to revert back to the previous approach.

@cfallin
Copy link
Member Author

cfallin commented Mar 8, 2021

Can you be more specific about what you believe is missing?

I only saw the push rbp and sub rsp, .... instructions have a corresponding unwindinfo entry. In addition if I stepped to the mov rbp, rsp there was an extra ??? frame in the backtrace produced by gdb just below the top frame.

Hmm. Possibly this has to do with the placement of the unwindinfo opcodes; the Push and SetFP are both after the push rbp / mov rbp, rsp sequence, whereas if a debugger is stopped in between the two, it should see the Push but not the SetFP. I'll fix that tomorrow and try stepping through it. Thanks for the report!

(I did check with clang on Msys/MinGW just now and the unwind opcodes we're generating are exactly the same as what it generates for its prologue, modulo the above placement issue, so I hope this does it...)

So I wouldn't call it easy, exactly, and I can't personally justify spending time on it in the short term, but it should be addressed at some point.

I understand that it is not a priority right now. I just wanted you to consider if landing this PR would make it harder to implement that to avoid possibly having to revert back to the previous approach.

Ah, I see; if anything it will be easier with a more direct approach, I think. Most of the work for landingpads will be dealing with the control-flow implications in the rest of the compiler; adding the exception handler info to the tables will be relatively straightforward, I think.

@cfallin
Copy link
Member Author

cfallin commented Mar 8, 2021

@bjorn3, I've updated and verified now that stepping through instruction-by-instruction with gdb, the backtrace/unwind is precise at every program point. I went ahead and added epilogue info too since this was not much more work. (Note also that the old impl was missing epilogue info because it was tricky to work out exactly where the epilogues were and plumb this through to the right place.)

I also ripped out the old SysV unwind infra in the new-x64 and aarch64 backends; this was dead code, replaced by this unified new implementation, but I had not removed it previously. Happy to see diff-stats show net-negative lines of code after doing that even with Fastcall and epilogues added!

@cfallin cfallin force-pushed the x64-fastcall-unwind branch 4 times, most recently from 79bded4 to cb401b2 Compare March 9, 2021 07:18
@cfallin cfallin force-pushed the x64-fastcall-unwind branch from cb401b2 to e0a3793 Compare March 9, 2021 19:40
@cfallin
Copy link
Member Author

cfallin commented Mar 9, 2021

Updated again; I realized that we need a few more ops around mid-function epilogues (RememberState / RestoreState) because the CFA parser doesn't trace control flow, but rather just scans the function linearly.

@cfallin cfallin force-pushed the x64-fastcall-unwind branch 2 times, most recently from 07c6ca1 to efc0a31 Compare March 11, 2021 02:49
@cfallin
Copy link
Member Author

cfallin commented Mar 11, 2021

So it turns out that the epilogue CFA info was causing segfaults on CI but not locally when switching to the new backend in #2718, and I wasn't able to find the root cause after much head-scratching. Since epilogue info is not necessary for our unwind needs (it only becomes necessary if one has a trap somewhere in the middle of the mov rsp, rbp / pop rbp / ret sequence, or if one wants to instruction-step through that sequence and have gdb print a nice backtrace while in the middle), and since the current state of both the new x64 backend and the aarch64 backend is that no epilogue CFA info is provided, I've gone ahead and re-simplified to provide only prologue info.

@cfallin cfallin requested a review from alexcrichton March 11, 2021 17:22
@cfallin
Copy link
Member Author

cfallin commented Mar 11, 2021

@alexcrichton would you be willing to review this given that you took a look above already?

Copy link
Member

@alexcrichton alexcrichton left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure would be happy to!

FWIW while I'm pretty confident in the dwarf unwind information I'm much less confident in the Windows bits. This was the first time I looked at the documentation for windows unwinding codes and to really get a good understanding I'd need to set aside a bigger chunk of time to dig in and see how what we do maps to the specification and such. I suspect it's all fine since tests are all green, but there were definitely lots of places I couldn't connect the dots to reason about why things were the way they were.

cranelift/codegen/src/isa/unwind/winx64.rs Outdated Show resolved Hide resolved
cranelift/codegen/src/isa/unwind/winx64.rs Outdated Show resolved Hide resolved
cranelift/codegen/src/isa/unwind/systemv.rs Outdated Show resolved Hide resolved
cranelift/codegen/src/isa/unwind.rs Outdated Show resolved Hide resolved
cranelift/codegen/src/isa/x64/abi.rs Outdated Show resolved Hide resolved
cranelift/codegen/src/isa/x64/abi.rs Outdated Show resolved Hide resolved
cranelift/codegen/src/isa/aarch64/abi.rs Outdated Show resolved Hide resolved
cranelift/codegen/src/isa/aarch64/abi.rs Outdated Show resolved Hide resolved
cranelift/codegen/src/isa/unwind.rs Outdated Show resolved Hide resolved
@cfallin cfallin force-pushed the x64-fastcall-unwind branch from a35b6f8 to 317c072 Compare March 12, 2021 00:01
@cfallin
Copy link
Member Author

cfallin commented Mar 12, 2021

Updated based on comments; thanks!

@cfallin cfallin force-pushed the x64-fastcall-unwind branch from e038b2b to 62a21ad Compare March 12, 2021 00:11
Copy link
Member

@alexcrichton alexcrichton left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks so much for all the new comments! Definitely makes sense to me now :)

Copy link
Member

@peterhuene peterhuene left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks fantastic!

I glanced over the aarch64-related changes for the most part, although I did pay attention to the common unwind representation and read over your detailed comments explaining things.

Therefore I paid the most attention to the x64 backend changes and the changes to the unwind generation for both SystemV and Windows and have just three minor comments.

cranelift/codegen/src/isa/unwind.rs Show resolved Hide resolved
cranelift/filetests/filetests/isa/x64/fastcall.clif Outdated Show resolved Hide resolved
cranelift/codegen/src/machinst/abi_impl.rs Outdated Show resolved Hide resolved
Our previous implementation of unwind infrastructure was somewhat
complex and brittle: it parsed generated instructions in order to
reverse-engineer unwind info from prologues. It also relied on some
fragile linkage to communicate instruction-layout information that VCode
was not designed to provide.

A much simpler, more reliable, and easier-to-reason-about approach is to
embed unwind directives as pseudo-instructions in the prologue as we
generate it. That way, we can say what we mean and just emit it
directly.

The usual reasoning that leads to the reverse-engineering approach is
that metadata is hard to keep in sync across optimization passes; but
here, (i) prologues are generated at the very end of the pipeline, and
(ii) if we ever do a post-prologue-gen optimization, we can treat unwind
directives as black boxes with unknown side-effects, just as we do for
some other pseudo-instructions today.

It turns out that it was easier to just build this for both x64 and
aarch64 (since they share a factored-out ABI implementation), and wire
up the platform-specific unwind-info generation for Windows and SystemV.
Now we have simpler unwind on all platforms and we can delete the old
unwind infra as soon as we remove the old backend.

There were a few consequences to supporting Fastcall unwind in
particular that led to a refactor of the common ABI. Windows only
supports naming clobbered-register save locations within 240 bytes of
the frame-pointer register, whatever one chooses that to be (RSP or
RBP). We had previously saved clobbers below the fixed frame (and below
nominal-SP). The 240-byte range has to include the old RBP too, so we're
forced to place clobbers at the top of the frame, just below saved
RBP/RIP. This is fine; we always keep a frame pointer anyway because we
use it to refer to stack args. It does mean that offsets of fixed-frame
slots (spillslots, stackslots) from RBP are no longer known before we do
regalloc, so if we ever want to index these off of RBP rather than
nominal-SP because we add support for `alloca` (dynamic frame growth),
then we'll need a "nominal-BP" mode that is resolved after regalloc and
clobber-save code is generated. I added a comment to this effect in
`abi_impl.rs`.

The above refactor touched both x64 and aarch64 because of shared code.
This had a further effect in that the old aarch64 prologue generation
subtracted from `sp` once to allocate space, then used stores to `[sp,
offset]` to save clobbers. Unfortunately the offset only has 7-bit
range, so if there are enough clobbered registers (and there can be --
aarch64 has 384 bytes of registers; at least one unit test hits this)
the stores/loads will be out-of-range. I really don't want to synthesize
large-offset sequences here; better to go back to the simpler
pre-index/post-index `stp r1, r2, [sp, #-16]` form that works just like
a "push". It's likely not much worse microarchitecturally (dependence
chain on SP, but oh well) and it actually saves an instruction if
there's no other frame to allocate. As a further advantage, it's much
simpler to understand; simpler is usually better.

This PR adds the new backend on Windows to CI as well.
@cfallin cfallin force-pushed the x64-fastcall-unwind branch from dd9be5f to 2d5db92 Compare March 12, 2021 04:04
@cfallin
Copy link
Member Author

cfallin commented Mar 12, 2021

@sunfishcode it looks like there might be an intermittent test with wasi_cap_std_sync on Windows, namely poll_oneoff_stdio -- see logs. Watching the latest run to see if it fares better this time...

@uweigand
Copy link
Member

Hi @cfallin, I finally got around to updating my s390x patch to this new unwind method. This ran into a couple of problems, so I'm wondering whether I'm on the right track here. I'd appreciate your input ...

  1. Frame size definition. The total_frame_size value returned form frame_size is documented to equal the "distance between FP and nominal SP", and users (even outside of my new back-end) appear to rely on that fact, e.g. it is ultimately used to implement the get_nominal_sp_to_fp routine used in spillslots_to_stack_map. However, with the new approach, the stack space used for clobbers (clobber_size) now lies between the nominal SP and the FP, so I believe this value now needs to be added to total_frame_size. (At least in the non-baldrdash case. I'm not really familiar with how baldrdash works, in particular in the new approach ...)
  2. Accumulate outgoing args. We've added this feature a while back since I'm using it in my back-end. This patch completely removed the feature again, breaking my back-end. Would you accept a patch adding it back in the new approach?
  3. Operation without frame pointer. I don't really need or want a frame pointer, but in the new approach I'm basically forced to define a FP register. Would it be OK to add support for non-FP operation? One option would be to make the FP register optional (like the LR currently is) and update the handling of UnwindInst::PushFrameRegs and UnwindInst::DefineNewFrame if no FP is present -- this is what I've implemented so far. Another option would be to define a new UnwindInst::AllocStack that I could use instead of those two if there is no FP.
  4. Negative offsets to UnwindInst::SaveReg. In particular in the absence of a FP (where I'm not using DefineNewFrame), the offsets to SaveReg are more naturally expressed relative to the incoming SP (or CFA). But then they will be negative, which makes it a bit awkward to use the current interface which defines the offsets as u32.

@cfallin
Copy link
Member Author

cfallin commented Apr 13, 2021

Hi @uweigand -- thanks for the efforts to update and sorry about the churn!

I'll note first as an aside that I'm quite open to merging the s390x backend sooner rather than later, if you think it is at a place for that, as it would allow me and others to consider its constraints and update it along with the others when refactors occur. I am of course also happy to work out followup changes to resolve issues after-the-fact, as we're doing here, but e.g. if I had known that the "accumulate outgoing args" mechanism was used by s390x as well, it might have affected my cleanup/simplification calculus a bit (sorry about that!).

In particular, I'm currently working on an update to the register allocator, which involves some churn in the way it is used in the lowering backends. I'm happy to do the update in all backends in-tree, including s390x if it is merged; keeping it out-of-tree creates more rebasing pain for you as we refactor and evolve, which I do regret :-/

To the specific questions:

Frame size definition ... [add clobber_size to nominal-SP-to-FP distance]

Yes, I think so, though without seeing your ABI implementation in more detail it's hard to say for sure. Are you using abi_impl.rs or implementing the ABICaller / ABICallee traits directly?

Accumulate outgoing args

Sorry about this! I worked to simplify the common ABI implementation as I refactored it and this seemed to me to be implementing something we could do in a simpler way, at least on the x64 and aarch64 backends. I hadn't realized it was used on s390x. I'm happy to take a patch to add it back, or, if it is simpler, perhaps the s390x backend could directly implement this in its ABI code (without indirecting through the traits)?

Operation without frame pointer

We do definitely want to make this possible eventually. Without thinking through the implications a bit more, I don't have a strong opinion which of the two approaches you name makes more sense, but my initial thought is that perhaps we add AllocStack as you suggest and then omit PushFrameRegs. The SysV unwind info generator then just needs to track that it is in FP-less mode and continue to define the CFA in terms of SP instead.

Negative offsets to UnwindInst::SaveReg

To unify the view between the different platforms, I think it does make sense to continue to view the clobber locations in terms of a "clobber frame" that starts below them; I think that if we make DefineNewFrame work in an FP-less environment (but still have the same notion of an offset from some anchor to the start of clobbers) this should be possible? Or is there a fundamental conflict here?

(This is useful pathfinding for our eventual omit-FP optimization on other architectures too, by the way, so thank you for probing these issues now!)

@uweigand
Copy link
Member

Hi @uweigand -- thanks for the efforts to update and sorry about the churn!

No problem, thanks for the reply!

I'll note first as an aside that I'm quite open to merging the s390x backend sooner rather than later, if you think it is at a place for that, as it would allow me and others to consider its constraints and update it along with the others when refactors occur. I am of course also happy to work out followup changes to resolve issues after-the-fact, as we're doing here, but e.g. if I had known that the "accumulate outgoing args" mechanism was used by s390x as well, it might have affected my cleanup/simplification calculus a bit (sorry about that!).

In particular, I'm currently working on an update to the register allocator, which involves some churn in the way it is used in the lowering backends. I'm happy to do the update in all backends in-tree, including s390x if it is merged; keeping it out-of-tree creates more rebasing pain for you as we refactor and evolve, which I do regret :-/

I agree, in particular because there now also seems to be renewed interest in s390x support in wasmtime due to additional use cases. I'll try to get things ready for an initial merge soon, hopefully within the next few weeks.

To the specific questions:

Frame size definition ... [add clobber_size to nominal-SP-to-FP distance]

Yes, I think so, though without seeing your ABI implementation in more detail it's hard to say for sure. Are you using abi_impl.rs or implementing the ABICaller / ABICallee traits directly?

I'm using abi_impl.rs just like x64 and aarch64 do. I had thought that the problem with the get_nominal_sp_to_fp routine should also occur on those platforms, but at a second glance, it probably does not: the only user of this routine in common code is
spillslots_to_stack_map, but this uses the value only as an upper bound of the stack map size. Given that there are no stack slots in the clobber area, for this particular use case the difference probably does not matter.

On my platform, I have another use case in target-specific code. This again has to do with the fact that I am not using a frame pointer register, and therefore the code generator has to rewrite accesses to the incoming argument slots (which are expressed by common code via offsets relative to the FP) in terms of offsets relative to the (nominal) SP. This is done by simply adding the nominal_sp_to_fp offset, which used to work fine, but has now stopped working.

So we can either fix the frame_size implementation so that nominal_sp_to_fp again returns the actual offset between those two points in the stack, or else I'll have to find some other way to pass the correct offset including clobbers to the back-end codegen (e.g. via a pseudo instruction).

In case you agree with changing frame_size, here's a PR to do so: #2836

Accumulate outgoing args

Sorry about this! I worked to simplify the common ABI implementation as I refactored it and this seemed to me to be implementing something we could do in a simpler way, at least on the x64 and aarch64 backends. I hadn't realized it was used on s390x. I'm happy to take a patch to add it back, or, if it is simpler, perhaps the s390x backend could directly implement this in its ABI code (without indirecting through the traits)?

Here's a PR to add this support back: #2837

I'd be happy to implement this solely in the back-end, but the main problem I was having with this is that there does not appear to be any back-end specific (mutable) state associated with an ABICallee. I need this state to track the required outgoing argument space as we go through the whole function and lower it; each call instruction that is found will have to update the required size accordingly. Finally, in the prologue and epilogue code generator I need to access that accumulated value so I can alloc/dealloc the correct amount of extra stack space.

Without the common code implementation, do you have a suggestion where I should hold this extra state?

Operation without frame pointer

We do definitely want to make this possible eventually. Without thinking through the implications a bit more, I don't have a strong opinion which of the two approaches you name makes more sense, but my initial thought is that perhaps we add AllocStack as you suggest and then omit PushFrameRegs. The SysV unwind info generator then just needs to track that it is in FP-less mode and continue to define the CFA in terms of SP instead.

Negative offsets to UnwindInst::SaveReg

To unify the view between the different platforms, I think it does make sense to continue to view the clobber locations in terms of a "clobber frame" that starts below them; I think that if we make DefineNewFrame work in an FP-less environment (but still have the same notion of an offset from some anchor to the start of clobbers) this should be possible? Or is there a fundamental conflict here?

Here's a PR to add support for operation without frame pointer along the lines suggested above: #2838

In particular, this extends DefineNewFrame as you suggest for FP-less operation, and adds a new StackAlloc operation. With those two, I can define the unwind information in my backend in a pretty straightforward manner.

@cfallin
Copy link
Member Author

cfallin commented Apr 14, 2021

The PRs all look good -- will land all three. Thanks!

@uweigand
Copy link
Member

Thank you!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cranelift:area:aarch64 Issues related to AArch64 backend. cranelift:area:machinst Issues related to instruction selection and the new MachInst backend. cranelift:area:x64 Issues related to x64 codegen cranelift:meta Everything related to the meta-language. cranelift Issues related to the Cranelift code generator
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants