Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix paltest_pal_sxs_test1 on illumos #105207

Merged
merged 2 commits into from
Aug 21, 2024

Conversation

AustinWise
Copy link
Contributor

@AustinWise AustinWise commented Jul 21, 2024

These changes taken together make paltest_pal_sxs_test1 pass. Along the way it also fixes building the rootfs for illumos. Each commit description contains details about the changes; here is a summary:

  • Disable alternate stacks for signal handlers. illumos has limitations on the use of alternate stacks in signal handlers that .NET is incompatible with.
  • Define the RPATH for the paltest_pal_sxs_test1 executable so that it can find the dynamic libraries used in the test.

Tested on:

I did the testing using the updated cross rootfs from dotnet/arcade#14978.

  • OpenIndiana Hipster 2024.04 - Global Zone
  • OpenIndiana Hipster 2024.04 - ipkg zone
  • SmartOS 20240701T205528Z - base-64-lts 23.4.0 zone

Fixes #35362

@AustinWise
Copy link
Contributor Author

I am not sure what is the best way to disable alternate stacks for signal handlers. Possibilities include:

  • ifdefing out on illumos (this PR currently)
  • adding a new define like FEATURE_SIGALTSTACK, enable it for all operating systems except macOS and illumos, and using that to ifdefing out the feature.
  • adding a config switch for enabling using sigaltstack. Make it default disabled for illumos and enabled for everyone else.
  • detect whether sigaltstack works as expected on startup.

The last two might be handy for cases where other platforms or emulators have the same problem with alternate stacks as illumos. The specific case I'm thinking of is the Linux emulation support included in some distributions of illumos. It has the same problem last I checked.

@gwr
Copy link
Contributor

gwr commented Jul 24, 2024

I am not sure what is the best way to disable alternate stacks for signal handlers. Possibilities include:
[...]

  • detect whether sigaltstack works as expected on startup.

The last two might be handy for cases where other platforms or emulators have the same problem with alternate stacks as illumos. The specific case I'm thinking of is the Linux emulation support included in some distributions of illumos. It has the same problem last I checked.

Could you please open a bug at illumos.org/issues for the sigaltstack problem?
You're probably in a better position to describe what's wrong, what should happen,
and (hopefully) provide a simple test that demonstrates the problem. Thanks!

@am11 am11 added the os-SunOS SunOS, currently not officially supported label Jul 24, 2024
@AustinWise
Copy link
Contributor Author

I updated how RPATH is set to be unconditionally turned on. This is consistent with how RPATH is set in a couple other places in this repo:

https://github.com/dotnet/runtime/blob/main/src/coreclr/dlls/mscordbi/CMakeLists.txt#L4

https://github.com/dotnet/runtime/blob/main/src/coreclr/debug/createdump/CMakeLists.txt#L52

This fixes this test for FreeBSD as well.

@am11 am11 requested a review from janvorli July 27, 2024 15:46
@am11
Copy link
Member

am11 commented Jul 27, 2024

This fixes this test for FreeBSD as well.

Awesome! @Thefrank, another happy news for you. ;)

Copy link
Contributor

@gwr gwr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Thanks.

@Thefrank
Copy link
Contributor

Confirming this resolves the test failure on FreeBSD as well! Thank you!

Finished running PAL tests.

PAL Test Results:
  Passed: 337
  Failed: 0

Copied PAL test output files to /root/runtime/artifacts/bin/coreclr/freebsd.x64.Debug/paltestsoutputs.

@AustinWise
Copy link
Contributor Author

I am not sure what is the best way to disable alternate stacks for signal handlers. Possibilities include:
[...]

  • detect whether sigaltstack works as expected on startup.

The last two might be handy for cases where other platforms or emulators have the same problem with alternate stacks as illumos. The specific case I'm thinking of is the Linux emulation support included in some distributions of illumos. It has the same problem last I checked.

Could you please open a bug at illumos.org/issues for the sigaltstack problem? You're probably in a better position to describe what's wrong, what should happen, and (hopefully) provide a simple test that demonstrates the problem. Thanks!

I filed a bug asking for advice:

https://www.illumos.org/issues/16682

I took the relevant pieces of the PAL and made a C program that illustrates the difference in behaviors on different systems. Linux and FreeBSD always use the alternate stack, illumos and macOS won't use the alternate stack again if the handler does not return.

https://github.com/AustinWise/CrashRepro/tree/master/c

I also an open bug for LX zones on SmartOS, where they are not emulating the Linux behavior correctly:

TritonDataCenter/illumos-joyent#163

@gwr
Copy link
Contributor

gwr commented Jul 28, 2024

Got a helpful response on the illumos issue. If I understand correctly, what dotnet is doing today with contexts and sigaltstack is highly system-dependent (non-portable). I wonder if it might be possible to let dotnet have what it needs using only getcontext/setcontext etc? (basically adhere to the interface descriptions pointed to in the comments on that illumos bug.). I assume that would be a fair bit of work.

@AustinWise
Copy link
Contributor Author

Got a helpful response on the illumos issue. If I understand correctly, what dotnet is doing today with contexts and sigaltstack is highly system-dependent (non-portable). I wonder if it might be possible to let dotnet have what it needs using only getcontext/setcontext etc? (basically adhere to the interface descriptions pointed to in the comments on that illumos bug.). I assume that would be a fair bit of work.

One of the options Joshua mentioned:

I would look at the more nuanced path of modifying the stable parts of the existing ucontext_t in place, and then return from the signal handler.

As an existence proof that this is possible for .NET, NativeAOT takes this approach:

https://github.com/dotnet/runtime/blob/main/src/coreclr/nativeaot/Runtime/unix/HardwareExceptions.cpp#L508

CoreCLR currently is in a transition phase where it supports its original exception dispatch code and a new model based on NativeAOT's system (see references to g_isNewExceptionHandlingEnabled). I think if we want to change the CoreCLR exception dispatcher to always return from the signal handler, it would be easier to make this change once the old exception handling system has been removed.

@AustinWise
Copy link
Contributor Author

I hacked together a proof of concept to show that it probably possible to change the CoreCLR hardware exception handling to return from the signal handler before dispatching the exception:

3dcc2cb

There is probably a lot more work to make this production quality. For now I'd prefer to go with the approach in this PR (disable alternate stacks) and revisiting changing how signal handlers work after the old exception handling mode is removed.

@AustinWise
Copy link
Contributor Author

Now that .NET 9 has branched, I've rebased this PR and resolved the merge conflicts.

@janvorli
Copy link
Member

@AustinWise thank you for looking into the alternate stack vs return from signal handler issue. I've hit the same thing in the past when I was considering moving macOS to handle hardware exceptions via signals to unify it with other Unix platforms. macOS also keeps track of the alt stack and when we don't return from the handler and redirect execution instead, next time we hit a hardware exception, it calls the handler on the original stack.
I think it would make sense to try to move to the way you've tried in your proof of context. I've actually tried that with macOS in the past too, but I ended up not making the change. It was about 2-3 years ago, so I don't remember all the details. I have a feeling that I've hit some subtle problem that made me stop trying to finalize it at that time, but I don't remember the details anymore. So it seems to be worth revisiting to make it cleaner,
Based on what I've just said, I would prefer making this PR smaller - just by not passing the SA_ONSTACK to the handle_signal and returning false from IsRunningOnAlternateStack for illumos and keeping everything else unchanged (besides the PAL test, of course).

@janvorli
Copy link
Member

Just after posting the message, I've remembered what the problem I've hit was. The issue was that in the stack overflow case (which is the actual reason why we use the alternate stack at all), the stack unwinding was broken. We use yet another stack for stack overflow handling so that we can use as small alternate stack as possible. For the function we would return to from the signal handler, I was unable to come up with a CFI annotation that would allow the unwinder to properly unwind to the original failure point location (due to no fixed relation between the original SP and the SP in the helper stack). It is not a problem for .NET stack walking, as we already have a mechanism to unwind from the EH code to the actual hardware exception location, a debugger and other similar tools would not be able to walk the stack past the EH code in the stack overflow case.
This doesn't sound that bad, but there might have been some other issue that I have forgotten.

When .NET translates SIGSEV to NullReferenceException, it does not return
from the signal handler. Instead it resumes execution at the catch handler
for the exception. This is not recommend by the manpage for sigaction(2):

> It is not recommended that [the ucontext] arg be used by the handler to
> restore the context from before the signal delivery.

The practical effect of resuming execution without returning from a handler
is that the alternate stack will not be used for subsequent signal delivery.
This is in contrast to the behavior on linux, which will always use the
alternate stack if the stack pointer at the time of fault does not fall on
the alternate stack.

Since the alternate stack is only usable for a single exception, don't
bother using it for any exceptions.
Copy link
Member

@janvorli janvorli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM thank you!

@gwr
Copy link
Contributor

gwr commented Aug 19, 2024

Got a helpful response on the illumos issue. If I understand correctly, what dotnet is doing today with contexts and sigaltstack is highly system-dependent (non-portable). I wonder if it might be possible to let dotnet have what it needs using only getcontext/setcontext etc? (basically adhere to the interface descriptions pointed to in the comments on that illumos bug.). I assume that would be a fair bit of work.

One of the options Joshua mentioned:
[ modifying ... the existing ucontext_t in place, and then return from the signal handler. ]

As an existence proof that this is possible for .NET, NativeAOT takes this approach:

https://github.com/dotnet/runtime/blob/main/src/coreclr/nativeaot/Runtime/unix/HardwareExceptions.cpp#L508

CoreCLR currently is in a transition phase where it supports its original exception dispatch code and a new model based on NativeAOT's system (see references to g_isNewExceptionHandlingEnabled). I think if we want to change the CoreCLR exception dispatcher to always return from the signal handler, it would be easier to make this change once the old exception handling system has been removed.

If the runtime is "just trying to unwind" to some deeper frame on the stack, then it should be possible to use getcontext (at the deeper frame to which we might unwind) and then in the exception handler, call setcontext to restore that context. That too does the OS-specific stuff about altstacks that returning from a signal handler does. Is "just trying to unwind" an accurate summary of what the runtime is doing? That was not clear to me after a quick look at the code.

Should we have a separate issue for possibly improving the runtime exception handling?
I'd rather that not block this work-around, even if we plan to improve it later.

@janvorli
Copy link
Member

Should we have a separate issue for possibly improving the runtime exception handling?

Yes, we should. I've created it here: #106645

If the runtime is "just trying to unwind" to some deeper frame on the stack, then it should be possible to use getcontext (at the deeper frame to which we might unwind) and then in the exception handler, call setcontext to restore that context.

The SIGSEGV handler allows .NET runtime to handle NullReferenceException that can occur at any arbitrary location in the code that our JIT produces (and also in a few handwritten assembler helpers). The exception handling code walks the stack until it finds a frame that handles the exception (it has a catch handler for it). Then it calls that catch handler and after it returns, it restores context to the code location after the catch handler to continue normal execution. It is also possible that the catch handler never returns, because it ends up throwing some exception again (could be any software / hardware exception).

So we cannot reasonably use the getcontext as we would essentially have to call it before every call (as we don't know which method can end up throwing an exception). And even if we could, it would add unnecessary and non-trivial overhead to the non-exceptional code execution. I believe that returning from the signal handler with instruction pointer in the context modified to point to a helper function that then invokes the exception handling code is the right thing to do here. It will actually be a bit more involved so that the failing frame is still visible on the stack when running the exception handling code.

@AustinWise
Copy link
Contributor Author

Based on what I've just said, I would prefer making this PR smaller - just by not passing the SA_ONSTACK to the handle_signal and returning false from IsRunningOnAlternateStack for illumos and keeping everything else unchanged (besides the PAL test, of course).

Done

For the function we would return to from the signal handler, I was unable to come up with a CFI annotation that would allow the unwinder to properly unwind to the original failure point location (due to no fixed relation between the original SP and the SP in the helper stack).

I bet RhpThrowHwEx in NativeAOT has this same problem. I see that the Windows version constructs an IRET compatible frame to enable unwinding. I'll investigate if something similar could be done in the Unix version.

@janvorli janvorli merged commit 0a962fc into dotnet:main Aug 21, 2024
88 of 90 checks passed
@AustinWise AustinWise deleted the austin/illumos-exceptions branch August 21, 2024 19:32
@github-actions github-actions bot locked and limited conversation to collaborators Sep 21, 2024
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
area-PAL-coreclr community-contribution Indicates that the PR has been added by a community member os-SunOS SunOS, currently not officially supported
Projects
None yet
Development

Successfully merging this pull request may close these issues.

paltest_pal_sxs_test1 failed on SmartOS x64
5 participants