Fix paltest_pal_sxs_test1 on illumos #105207

AustinWise · 2024-07-21T22:35:25Z

These changes taken together make paltest_pal_sxs_test1 pass. Along the way it also fixes building the rootfs for illumos. Each commit description contains details about the changes; here is a summary:

Disable alternate stacks for signal handlers. illumos has limitations on the use of alternate stacks in signal handlers that .NET is incompatible with.
Define the RPATH for the paltest_pal_sxs_test1 executable so that it can find the dynamic libraries used in the test.

Tested on:

I did the testing using the updated cross rootfs from dotnet/arcade#14978.

OpenIndiana Hipster 2024.04 - Global Zone
OpenIndiana Hipster 2024.04 - ipkg zone
SmartOS 20240701T205528Z - base-64-lts 23.4.0 zone

Fixes #35362

AustinWise · 2024-07-22T13:56:13Z

I am not sure what is the best way to disable alternate stacks for signal handlers. Possibilities include:

ifdefing out on illumos (this PR currently)
adding a new define like FEATURE_SIGALTSTACK, enable it for all operating systems except macOS and illumos, and using that to ifdefing out the feature.
adding a config switch for enabling using sigaltstack. Make it default disabled for illumos and enabled for everyone else.
detect whether sigaltstack works as expected on startup.

The last two might be handy for cases where other platforms or emulators have the same problem with alternate stacks as illumos. The specific case I'm thinking of is the Linux emulation support included in some distributions of illumos. It has the same problem last I checked.

gwr · 2024-07-24T14:59:20Z

I am not sure what is the best way to disable alternate stacks for signal handlers. Possibilities include:
[...]

detect whether sigaltstack works as expected on startup.

The last two might be handy for cases where other platforms or emulators have the same problem with alternate stacks as illumos. The specific case I'm thinking of is the Linux emulation support included in some distributions of illumos. It has the same problem last I checked.

Could you please open a bug at illumos.org/issues for the sigaltstack problem?
You're probably in a better position to describe what's wrong, what should happen,
and (hopefully) provide a simple test that demonstrates the problem. Thanks!

eng/common/cross/build-rootfs.sh

src/coreclr/pal/src/exception/signal.cpp

eng/native/configurecompiler.cmake

AustinWise · 2024-07-27T15:23:01Z

I updated how RPATH is set to be unconditionally turned on. This is consistent with how RPATH is set in a couple other places in this repo:

https://github.com/dotnet/runtime/blob/main/src/coreclr/dlls/mscordbi/CMakeLists.txt#L4

https://github.com/dotnet/runtime/blob/main/src/coreclr/debug/createdump/CMakeLists.txt#L52

This fixes this test for FreeBSD as well.

am11 · 2024-07-27T15:47:31Z

This fixes this test for FreeBSD as well.

Awesome! @Thefrank, another happy news for you. ;)

gwr

LGTM. Thanks.

Thefrank · 2024-07-27T17:51:39Z

Confirming this resolves the test failure on FreeBSD as well! Thank you!

Finished running PAL tests.

PAL Test Results:
  Passed: 337
  Failed: 0

Copied PAL test output files to /root/runtime/artifacts/bin/coreclr/freebsd.x64.Debug/paltestsoutputs.

AustinWise · 2024-07-27T20:24:09Z

I am not sure what is the best way to disable alternate stacks for signal handlers. Possibilities include:
[...]

detect whether sigaltstack works as expected on startup.

The last two might be handy for cases where other platforms or emulators have the same problem with alternate stacks as illumos. The specific case I'm thinking of is the Linux emulation support included in some distributions of illumos. It has the same problem last I checked.

Could you please open a bug at illumos.org/issues for the sigaltstack problem? You're probably in a better position to describe what's wrong, what should happen, and (hopefully) provide a simple test that demonstrates the problem. Thanks!

I filed a bug asking for advice:

https://www.illumos.org/issues/16682

I took the relevant pieces of the PAL and made a C program that illustrates the difference in behaviors on different systems. Linux and FreeBSD always use the alternate stack, illumos and macOS won't use the alternate stack again if the handler does not return.

https://github.com/AustinWise/CrashRepro/tree/master/c

I also an open bug for LX zones on SmartOS, where they are not emulating the Linux behavior correctly:

TritonDataCenter/illumos-joyent#163

gwr · 2024-07-28T00:19:01Z

Got a helpful response on the illumos issue. If I understand correctly, what dotnet is doing today with contexts and sigaltstack is highly system-dependent (non-portable). I wonder if it might be possible to let dotnet have what it needs using only getcontext/setcontext etc? (basically adhere to the interface descriptions pointed to in the comments on that illumos bug.). I assume that would be a fair bit of work.

AustinWise · 2024-07-28T15:03:41Z

Got a helpful response on the illumos issue. If I understand correctly, what dotnet is doing today with contexts and sigaltstack is highly system-dependent (non-portable). I wonder if it might be possible to let dotnet have what it needs using only getcontext/setcontext etc? (basically adhere to the interface descriptions pointed to in the comments on that illumos bug.). I assume that would be a fair bit of work.

One of the options Joshua mentioned:

I would look at the more nuanced path of modifying the stable parts of the existing ucontext_t in place, and then return from the signal handler.

As an existence proof that this is possible for .NET, NativeAOT takes this approach:

https://github.com/dotnet/runtime/blob/main/src/coreclr/nativeaot/Runtime/unix/HardwareExceptions.cpp#L508

CoreCLR currently is in a transition phase where it supports its original exception dispatch code and a new model based on NativeAOT's system (see references to g_isNewExceptionHandlingEnabled). I think if we want to change the CoreCLR exception dispatcher to always return from the signal handler, it would be easier to make this change once the old exception handling system has been removed.

AustinWise · 2024-08-03T20:09:21Z

I hacked together a proof of concept to show that it probably possible to change the CoreCLR hardware exception handling to return from the signal handler before dispatching the exception:

3dcc2cb

There is probably a lot more work to make this production quality. For now I'd prefer to go with the approach in this PR (disable alternate stacks) and revisiting changing how signal handlers work after the old exception handling mode is removed.

AustinWise · 2024-08-17T20:52:35Z

Now that .NET 9 has branched, I've rebased this PR and resolved the merge conflicts.

janvorli · 2024-08-19T12:53:04Z

@AustinWise thank you for looking into the alternate stack vs return from signal handler issue. I've hit the same thing in the past when I was considering moving macOS to handle hardware exceptions via signals to unify it with other Unix platforms. macOS also keeps track of the alt stack and when we don't return from the handler and redirect execution instead, next time we hit a hardware exception, it calls the handler on the original stack.
I think it would make sense to try to move to the way you've tried in your proof of context. I've actually tried that with macOS in the past too, but I ended up not making the change. It was about 2-3 years ago, so I don't remember all the details. I have a feeling that I've hit some subtle problem that made me stop trying to finalize it at that time, but I don't remember the details anymore. So it seems to be worth revisiting to make it cleaner,
Based on what I've just said, I would prefer making this PR smaller - just by not passing the SA_ONSTACK to the handle_signal and returning false from IsRunningOnAlternateStack for illumos and keeping everything else unchanged (besides the PAL test, of course).

janvorli · 2024-08-19T13:18:12Z

Just after posting the message, I've remembered what the problem I've hit was. The issue was that in the stack overflow case (which is the actual reason why we use the alternate stack at all), the stack unwinding was broken. We use yet another stack for stack overflow handling so that we can use as small alternate stack as possible. For the function we would return to from the signal handler, I was unable to come up with a CFI annotation that would allow the unwinder to properly unwind to the original failure point location (due to no fixed relation between the original SP and the SP in the helper stack). It is not a problem for .NET stack walking, as we already have a mechanism to unwind from the EH code to the actual hardware exception location, a debugger and other similar tools would not be able to walk the stack past the EH code in the stack overflow case.
This doesn't sound that bad, but there might have been some other issue that I have forgotten.

When .NET translates SIGSEV to NullReferenceException, it does not return from the signal handler. Instead it resumes execution at the catch handler for the exception. This is not recommend by the manpage for sigaction(2): > It is not recommended that [the ucontext] arg be used by the handler to > restore the context from before the signal delivery. The practical effect of resuming execution without returning from a handler is that the alternate stack will not be used for subsequent signal delivery. This is in contrast to the behavior on linux, which will always use the alternate stack if the stack pointer at the time of fault does not fall on the alternate stack. Since the alternate stack is only usable for a single exception, don't bother using it for any exceptions.

janvorli

LGTM thank you!

gwr · 2024-08-19T16:09:27Z

Got a helpful response on the illumos issue. If I understand correctly, what dotnet is doing today with contexts and sigaltstack is highly system-dependent (non-portable). I wonder if it might be possible to let dotnet have what it needs using only getcontext/setcontext etc? (basically adhere to the interface descriptions pointed to in the comments on that illumos bug.). I assume that would be a fair bit of work.

One of the options Joshua mentioned:
[ modifying ... the existing ucontext_t in place, and then return from the signal handler. ]

As an existence proof that this is possible for .NET, NativeAOT takes this approach:

https://github.com/dotnet/runtime/blob/main/src/coreclr/nativeaot/Runtime/unix/HardwareExceptions.cpp#L508

CoreCLR currently is in a transition phase where it supports its original exception dispatch code and a new model based on NativeAOT's system (see references to g_isNewExceptionHandlingEnabled). I think if we want to change the CoreCLR exception dispatcher to always return from the signal handler, it would be easier to make this change once the old exception handling system has been removed.

If the runtime is "just trying to unwind" to some deeper frame on the stack, then it should be possible to use getcontext (at the deeper frame to which we might unwind) and then in the exception handler, call setcontext to restore that context. That too does the OS-specific stuff about altstacks that returning from a signal handler does. Is "just trying to unwind" an accurate summary of what the runtime is doing? That was not clear to me after a quick look at the code.

Should we have a separate issue for possibly improving the runtime exception handling?
I'd rather that not block this work-around, even if we plan to improve it later.

janvorli · 2024-08-19T16:59:31Z

Should we have a separate issue for possibly improving the runtime exception handling?

Yes, we should. I've created it here: #106645

If the runtime is "just trying to unwind" to some deeper frame on the stack, then it should be possible to use getcontext (at the deeper frame to which we might unwind) and then in the exception handler, call setcontext to restore that context.

The SIGSEGV handler allows .NET runtime to handle NullReferenceException that can occur at any arbitrary location in the code that our JIT produces (and also in a few handwritten assembler helpers). The exception handling code walks the stack until it finds a frame that handles the exception (it has a catch handler for it). Then it calls that catch handler and after it returns, it restores context to the code location after the catch handler to continue normal execution. It is also possible that the catch handler never returns, because it ends up throwing some exception again (could be any software / hardware exception).

So we cannot reasonably use the getcontext as we would essentially have to call it before every call (as we don't know which method can end up throwing an exception). And even if we could, it would add unnecessary and non-trivial overhead to the non-exceptional code execution. I believe that returning from the signal handler with instruction pointer in the context modified to point to a helper function that then invokes the exception handling code is the right thing to do here. It will actually be a bit more involved so that the failing frame is still visible on the stack when running the exception handling code.

AustinWise · 2024-08-19T19:11:24Z

Based on what I've just said, I would prefer making this PR smaller - just by not passing the SA_ONSTACK to the handle_signal and returning false from IsRunningOnAlternateStack for illumos and keeping everything else unchanged (besides the PAL test, of course).

Done

For the function we would return to from the signal handler, I was unable to come up with a CFI annotation that would allow the unwinder to properly unwind to the original failure point location (due to no fixed relation between the original SP and the SP in the helper stack).

I bet RhpThrowHwEx in NativeAOT has this same problem. I see that the Windows version constructs an IRET compatible frame to enable unwinding. I'll investigate if something similar could be done in the Unix version.

dotnet-issue-labeler bot added the area-PAL-coreclr label Jul 21, 2024

dotnet-policy-service bot added the community-contribution Indicates that the PR has been added by a community member label Jul 21, 2024

am11 added the os-SunOS SunOS, currently not officially supported label Jul 24, 2024

am11 reviewed Jul 24, 2024

View reviewed changes

eng/common/cross/build-rootfs.sh Outdated Show resolved Hide resolved

am11 reviewed Jul 24, 2024

View reviewed changes

src/coreclr/pal/src/exception/signal.cpp Outdated Show resolved Hide resolved

gwr mentioned this pull request Jul 24, 2024

SunOS process and thread support #105403

Open

AustinWise mentioned this pull request Jul 25, 2024

Update illumos cross rootfs to fix CoreCLR PAL test dotnet/arcade#14978

Merged

AustinWise force-pushed the austin/illumos-exceptions branch from 10d7551 to e7a547a Compare July 25, 2024 05:11

am11 reviewed Jul 25, 2024

View reviewed changes

eng/native/configurecompiler.cmake Outdated Show resolved Hide resolved

am11 mentioned this pull request Jul 26, 2024

Assertion failed 'compCyclesAtEnd >= m_compCyclesAtEndOfInlining fails intermittently on freebsd-x64 Debug build #105472

Open

am11 requested a review from janvorli July 27, 2024 15:46

gwr approved these changes Jul 27, 2024

View reviewed changes

This was referenced Jul 27, 2024

ExplicitConversion_FromSingle failing due to NaN != NaN #103347

Open

TimeProviderTests.TestProviderTimer failed in CI #103459

Closed

WasmTestOnChrome timeouts in CI #105363

Open

gwr mentioned this pull request Aug 16, 2024

Tracking issue for illumos and Solaris x86-64 port work #34944

Open

19 tasks

AustinWise force-pushed the austin/illumos-exceptions branch from f692901 to 2622c0b Compare August 17, 2024 20:31

AustinWise added 2 commits August 19, 2024 08:17

Fix RPATH for exception handling PAL test

6ed5db5

AustinWise force-pushed the austin/illumos-exceptions branch from 2622c0b to 85b5c51 Compare August 19, 2024 15:30

janvorli approved these changes Aug 19, 2024

View reviewed changes

build-analysis bot mentioned this pull request Aug 19, 2024

'chrome-GetPropertiesTests' timing out #106625

Closed

janvorli merged commit 0a962fc into dotnet:main Aug 21, 2024
88 of 90 checks passed

AustinWise deleted the austin/illumos-exceptions branch August 21, 2024 19:32

github-actions bot locked and limited conversation to collaborators Sep 21, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix paltest_pal_sxs_test1 on illumos #105207

Fix paltest_pal_sxs_test1 on illumos #105207

AustinWise commented Jul 21, 2024 •

edited

Loading

AustinWise commented Jul 22, 2024

gwr commented Jul 24, 2024

AustinWise commented Jul 27, 2024

am11 commented Jul 27, 2024

gwr left a comment

Thefrank commented Jul 27, 2024

AustinWise commented Jul 27, 2024

gwr commented Jul 28, 2024

AustinWise commented Jul 28, 2024

AustinWise commented Aug 3, 2024

AustinWise commented Aug 17, 2024

janvorli commented Aug 19, 2024

janvorli commented Aug 19, 2024

janvorli left a comment

gwr commented Aug 19, 2024

janvorli commented Aug 19, 2024

AustinWise commented Aug 19, 2024

Fix paltest_pal_sxs_test1 on illumos #105207

Fix paltest_pal_sxs_test1 on illumos #105207

Conversation

AustinWise commented Jul 21, 2024 • edited Loading

Tested on:

AustinWise commented Jul 22, 2024

gwr commented Jul 24, 2024

AustinWise commented Jul 27, 2024

am11 commented Jul 27, 2024

gwr left a comment

Choose a reason for hiding this comment

Thefrank commented Jul 27, 2024

AustinWise commented Jul 27, 2024

gwr commented Jul 28, 2024

AustinWise commented Jul 28, 2024

AustinWise commented Aug 3, 2024

AustinWise commented Aug 17, 2024

janvorli commented Aug 19, 2024

janvorli commented Aug 19, 2024

janvorli left a comment

Choose a reason for hiding this comment

gwr commented Aug 19, 2024

janvorli commented Aug 19, 2024

AustinWise commented Aug 19, 2024

AustinWise commented Jul 21, 2024 •

edited

Loading