Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[NativeAOT] xcode15+ linker issue #97745

Open
VSadov opened this issue Jan 31, 2024 · 26 comments
Open

[NativeAOT] xcode15+ linker issue #97745

VSadov opened this issue Jan 31, 2024 · 26 comments
Labels
area-NativeAOT-coreclr os-mac-os-x macOS aka OSX tracking-external-issue The issue is caused by external problem (e.g. OS) - nothing we can do to fix it directly
Milestone

Comments

@VSadov
Copy link
Member

VSadov commented Jan 31, 2024

There is a chance that this is something with my machine, but it looks like I cannot build run tests with 8.0
It could be related to xcode 15+

I have 15.2 and tried downgrade to 15.0 with the same result. It does not seem possible to downgrade further as older versions claim to be incompatible with the new OS.

I am using the following command:

/build.sh clr.alljits+clr.tools+clr.nativeaotlibs+clr.nativeaotruntime+libs -rc Release -lc Release ; src/tests/build.sh nativeaot Release 'tree nativeaot' ; src/tests/run.sh --runnativeaottests Release

That works fine on main branch (product and tests build and tests pass).
But on release/8.0 branch I get a bunch of errors like:

/Users/vs/Hosting01/runtime/artifacts/bin/coreclr/osx.arm64.Release/build/Microsoft.NETCore.Native.targets(308,5): error MSB3073: The command ""/Users/vs/Hosting01/runtime/artifacts/bin/coreclr/osx.arm64.Release/ilc-published/ilc" @"/Users/vs/Hosting01/runtime/artifacts/tests/coreclr/obj/osx.arm64.Release/Managed/nativeaot/SmokeTests/StackTraceMetadata/StackTraceMetadata_Stripped/native/StackTraceMetadata_Stripped.ilc.rsp"" exited with code 139.

@ghost ghost added the untriaged New issue has not been triaged by the area owner label Jan 31, 2024
@ghost
Copy link

ghost commented Jan 31, 2024

Tagging subscribers to this area: @agocke, @MichalStrehovsky, @jkotas
See info in area-owners.md if you want to be subscribed.

Issue Details

There is a chance that this is something with my machine, but it looks like I cannot build run tests with 8.0
It could be related to xcode 15+

I have 15.2 and tried downgrade to 15.0 with the same result. It does not seem possible to downgrade further as older versions claim to be incompatible with the new OS.

I am using the following command:

/build.sh clr.alljits+clr.tools+clr.nativeaotlibs+clr.nativeaotruntime+libs -rc Release -lc Release ; src/tests/build.sh nativeaot Release 'tree nativeaot' ; src/tests/run.sh --runnativeaottests Release

That works fine on main branch (product and tests build and tests pass).
But on release/8.0 branch I get a bunch of errors like:

/Users/vs/Hosting01/runtime/artifacts/bin/coreclr/osx.arm64.Release/build/Microsoft.NETCore.Native.targets(308,5): error MSB3073: The command ""/Users/vs/Hosting01/runtime/artifacts/bin/coreclr/osx.arm64.Release/ilc-published/ilc" @"/Users/vs/Hosting01/runtime/artifacts/tests/coreclr/obj/osx.arm64.Release/Managed/nativeaot/SmokeTests/StackTraceMetadata/StackTraceMetadata_Stripped/native/StackTraceMetadata_Stripped.ilc.rsp"" exited with code 139.

Author: VSadov
Assignees: -
Labels:

untriaged, area-NativeAOT-coreclr

Milestone: -

@VSadov
Copy link
Member Author

VSadov commented Jan 31, 2024

@filipnavara - could this be another linker issue similar to #92520?
Or maybe even the same issue, just showing up in a different way?

@VSadov
Copy link
Member Author

VSadov commented Jan 31, 2024

I think the lab is still on xcode 14 and that could be the reason this does not happen in CI

@MichalStrehovsky
Copy link
Member

The command "".../ilc" @"..,/StackTraceMetadata_Stripped.ilc.rsp"" exited with code 139.

This is ILC crashing. One option to troubleshoot would be just to re-run this line under a debugger.

@VSadov
Copy link
Member Author

VSadov commented Feb 1, 2024

Passing -ld64 to ILC build and to the Microsoft.NETCore.Native.Unix.targets makes the problem go away - the ILC is functional can build tests and tests pass*

This does seem to be some kind of incompatibility with the new linker.

  • all tests pass - except Reflection_FromUsage. Possibly something unrelated, or just something was left from previous tries to build. I will try a clean rebuild.
  • on a clean rebuild all smoke tests pass, but we do need -ld64 for both ILC and the tests.

@filipnavara
Copy link
Member

could this be another linker issue similar to #92520?

Generally speaking, yes. I won’t have time to look into it until next week, unfortunately.

@VSadov
Copy link
Member Author

VSadov commented Feb 1, 2024

I think nothing is blocked on this yet, so this is mostly a heads up.
Since the lab uses xcode14, it is not affected. Also I am unsure if xcode15+ is a supported end user configuration for 8.0. Since the lab is still on 14, probably not yet.

@jkotas
Copy link
Member

jkotas commented Feb 1, 2024

I am unsure if xcode15+ is a supported end user configuration for 8.0.

Yes, it is supported. Prerequisites in our documentation do not ask for a specific old XCode version.

@VSadov
Copy link
Member Author

VSadov commented Feb 1, 2024

Yes, it is supported. Prerequisites in our documentation do not ask for a specific old XCode version.

That is desirable, but practically it is hard to guarantee support for something newer than what the lab runs.
That said, moving to new xcode has been mostly uneventful in the past. It is just v15 brings a new linker with a bunch of incompatibilities. I would not rule out that some of those incompatibilities are unintentional and basically bugs that may eventually be fixed.

Maybe the right course of action for 8.0 is to use -ld_classic until ld_prime becomes more stable?

@agocke agocke added this to the 9.0.0 milestone Feb 1, 2024
@ghost ghost removed the untriaged New issue has not been triaged by the area owner label Feb 1, 2024
@jkotas
Copy link
Member

jkotas commented Feb 1, 2024

it is hard to guarantee support for something newer than what the lab runs

Yes, we have same problem with support of new OS versions. We take fixes for these types of issues in servicing.

@ghost ghost added the in-pr There is an active PR which will close this issue when it is merged label Feb 2, 2024
@filipnavara
Copy link
Member

Shouldn't the milestone be set to 8.0.x instead of 9.0.0 since this apparently doesn't happen on main?

@filipnavara
Copy link
Member

filipnavara commented Feb 3, 2024

Nevermind, there is definitely something broken even on main with the new linker. I updated to Xcode 15.2 and I get crash in nativeaot/SmokeTests/FrameworkStrings/Baseline test at the end. Relinking with -ld_classic and no other changes makes it go away. So does linking against Debug build of libRuntime.WorkstationGC.a. I'll investigate next week.

I did some debug build earlier and I started seeing this, which may be related:

        --------------------------------------------------
        Debug Assertion Violation
        
        Expression: 'm_pInstance->IsManaged(m_ControlPC) && "unwind from throw site stub failed"'
        
        File: /Users/filipnavara/Projects/runtime/src/coreclr/nativeaot/Runtime/StackFrameIterator.cpp, Line: 1289
        --------------------------------------------------

@filipnavara
Copy link
Member

The issue I get in the Baseline test is trashed SP after a caught exception (before: 0x000000016fdff1a0; after: 0x00000001814790e0). The unwinding information may be corrupted.

@filipnavara
Copy link
Member

So far I traced it to UnwindHelpers::GetUnwindProcInfo returning bogus info belonging to a different method than asked for. The unwinding data itself in the executable is likely fine. I did a spot check on it, and lldb unwinds the same stack trace just fine.

@filipnavara
Copy link
Member

filipnavara commented Feb 4, 2024

So, the linker indeed produces garbage unwind tables (as verified by objdump -u):

    Second level index[6]: offset in section=0x0000a3f8, base function offset=0x000f55f0
      [0]: function offset=0x000f55f0, encoding=0x44000000
      [1]: function offset=0x000f5640, encoding=0x03011500
      [2]: function offset=0x000f56f0, encoding=0x03011528
      [3]: function offset=0x000f5740, encoding=0x0301154c
...
      [106]: function offset=0x000f92d0, encoding=0x03011e00
      [107]: function offset=0x000f9320, encoding=0x03011e24
      [108]: function offset=0x0018a5c8, encoding=0x00000000
      [109]: function offset=0x000f93e0, encoding=0x03011e4c
      [110]: function offset=0x0018a608, encoding=0x00000000
      [111]: function offset=0x000f9440, encoding=0x03011e70
      [112]: function offset=0x0018a668, encoding=0x00000000
      [113]: function offset=0x000f94c0, encoding=0x03011e94
      [114]: function offset=0x0018a6e8, encoding=0x00000000
      [115]: function offset=0x000f9580, encoding=0x03011eb8
      [116]: function offset=0x0018a7a8, encoding=0x00000000
      [117]: function offset=0x000f9620, encoding=0x03011ee0
      [118]: function offset=0x0018a848, encoding=0x00000000
      [119]: function offset=0x000f96b0, encoding=0x44000000
      [120]: function offset=0x0018a8d8, encoding=0x00000000
      [121]: function offset=0x000f96e0, encoding=0x44000000
      [122]: function offset=0x0018a908, encoding=0x00000000
      [123]: function offset=0x000f9710, encoding=0x44000000
      [124]: function offset=0x0018a938, encoding=0x00000000
      [125]: function offset=0x000f9740, encoding=0x44000000
      [126]: function offset=0x0018a968, encoding=0x00000000
      [127]: function offset=0x000f9790, encoding=0x44000000
      [128]: function offset=0x0018a9b8, encoding=0x00000000
      [129]: function offset=0x000f97b0, encoding=0x03011f04
      [130]: function offset=0x0018a9d8, encoding=0x00000000
      [131]: function offset=0x000f9960, encoding=0x03011f30
      [132]: function offset=0x0018ab88, encoding=0x00000000
…

The function offsets are supposed to be sorted. I'll collect the build artifacts and submit a report to Apple.

@filipnavara
Copy link
Member

filipnavara commented Feb 4, 2024

Feedback sent to Apple (FB13584275). For reference, here's the repro with files and a build.sh script:
ld64-repro.zip

Broken linker will produce unsorted/corrupted unwind tables, which can be verified with objdump -u Baseline.

@filipnavara
Copy link
Member

filipnavara commented Feb 5, 2024

The bogus unwinding information comes from functions in the __unbox section. We currently don't generate any unwind information for that so presumably the linker generates its own "dummy" one and fails to sort it properly. I'll investigate whether generating it ourselves makes a difference.

Update: It does not. :/

@ghost ghost removed the in-pr There is an active PR which will close this issue when it is merged label Feb 12, 2024
@filipnavara
Copy link
Member

Response from Apple:

Thank you for filing the feedback report. This is a bug in our linker.

You can use the following workaround until a fix become available:
Please use the -Wl,-Id_classic linker option

@MichalStrehovsky MichalStrehovsky added the tracking-external-issue The issue is caused by external problem (e.g. OS) - nothing we can do to fix it directly label Feb 20, 2024
@jkotas jkotas changed the title [NativeAOT][8.0] Possible problems with building tests [NativeAOT] xcode15+ linker issue Feb 20, 2024
@ivanpovazan
Copy link
Member

Regarding:

Maybe the right course of action for 8.0 is to use -ld_classic until ld_prime becomes more stable?

Apple announced that they are removing this linker option in https://developer.apple.com/documentation/xcode-release-notes/xcode-16-release-notes so this problem becomes more pressing.

@filipnavara
Copy link
Member

Apple updated my Feedback tickets with request to try Xcode 16 and report back.

@filipnavara
Copy link
Member

filipnavara commented Jun 12, 2024

The original problem with corrupted unwind tables seems to be fixed [with Xcode 16 Beta 1] on the few small samples that I tried. We need to do a full run to make sure that there are no other issues.

@am11
Copy link
Member

am11 commented Sep 23, 2024

FWIW, with Xcode 16 (released last Monday), we get: ld: warning: -ld_classic is deprecated and will be removed in a future release with PublishAot=true.

@filipnavara
Copy link
Member

filipnavara commented Sep 23, 2024

FWIW, with Xcode 16 (released last Monday), we get: ld: warning: -ld_classic is deprecated and will be removed in a future release with PublishAot=true.

I am aware of this. I run the smoke tests with ld-prime from Xcode 16 release and it seems to be ok. Likewise, the macOS/iOS/tvOS workloads were updated last week to use ld-prime on Xcode 16+: xamarin/xamarin-macios#21231.

Technically we are still relying on some unwritten guarantees about code/data layout. I have two issues open with Apple to further investigate the usage of .alt_entry to mitigate that. The gist is that ld-prime has a bug with multiple symbols defined at the same address which could lead to incorrect dead-stripping (workaroundable with the N_NO_DEAD_STRIP flag on ILC output). Unfortunately, ld64 has a different bug which limits the number of consecutive .alt_entry symbols due to recursive algorithm running out of stack space.

@agocke
Copy link
Member

agocke commented Oct 21, 2024

I think this is fixed right?

@jkotas
Copy link
Member

jkotas commented Oct 21, 2024

Should we stop passing -ld_classic to Xcode 16+ before closing this issue?

<CustomLinkerArg Condition="'$(_XcodeVersion)' &gt;= '15'" Include="-ld_classic" />

Note that Xamarin iOS made that change in their build targets https://github.com/xamarin/xamarin-macios/pull/21231/files#diff-2e24f1b6f44df7580244d00f106fa1a9e2257536fbca9d7bbdc74571549324a9R1631

@filipnavara
Copy link
Member

filipnavara commented Oct 21, 2024

Agreed with the comment above, we should align the behavior with Xamarin workload and drop -ld_classic on Xcode 16+.

As far as the the reliance on unwritten guarantees about code/data layout goes, I had some feedback from Apple. Basically, they suggest that .alt_entry is the right mechanism to use to produce labels inside data with fixed layout (or code, for that matter). However, classic ld has a limit on number of consecutive .alt_entry entries before it runs out of stack space due to recursive marking algorithm, and it cannot handle .alt_entry symbols at the start of section. The new ld doesn't have any algorithmic limit on number of consecutive .alt_entry symbols but it incorrectly handles .alt_entry along with other symbols at the same location.

We may opt to not do anything about the .alt_entry situation for now since there are no known issues with either classic ld (ld64) or new ld (ld-prime) reordering subsections. It would likely be possible to create a working solution within the limits of the new ld by doing some extra work to not produce overlapping symbols. I'm not sure if we should keep a tracking issue for that work or just wait if it actually becomes a problem.


For the posterity, here are the excerpts from Apple feedback.

Re: Stack overflow in classic ld with many consecutive .alt_entry symbols (test case: big-alt-entry.zip):

Thank you for sharing all the details! .alt_entry is indeed the preferred way to model this. The problem is that in ld64 all the alt entries produce a chain of references, ld-prime’s model is different and having many alt entries isn’t a problem. With the release of Xcode 16 ld64 will be deprecated, so we’d recommend focusing on ld-prime. To use ld64 I think you’d have to opt out of dead code stripping.
It’s just a bit unfortunate that you hit that other edge case with overlapping non-alt-entry symbols.

As for the section start symbols, linker can synthesize these for you. You can for example refer to section$start$__DATA$__data name and linker will resolve it for you. That way there’s no collision with the real symbol names in your object file.

Re: Dead stripping incorrectly stripping code when .alt_entry is used for code without the no-dead-strip flag (test case: alt_entry_repro.zip):

The case new linker isn’t handling correctly is a combination of symbols at overlapping addresses and alt-entry symbols, like here:
00000000000602c0 (__TEXT,__managedcode) external _GetRuntimeException
00000000000602c0 (__TEXT,__managedcode) private external _S_P_CoreLib_System_RuntimeExceptionHelpers__GetRuntimeException
0000000000060684 (__TEXT,__managedcode) private external [alt entry] _fram1_S_P_CoreLib_System_RuntimeExceptionHelpers__GetRuntimeException

In this specific case marking _S_P_CoreLib_System_RuntimeExceptionHelpers__GetRuntimeException as an alt entry symbol would resolve the problem. That may not always work when using -ld_classic though (e.g., in the test_data.S sample) because old linker doesn’t handle alt-entry symbols at a start of a section. So if that can be a problem, we recommend using -ld_classic until the fix is available.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area-NativeAOT-coreclr os-mac-os-x macOS aka OSX tracking-external-issue The issue is caused by external problem (e.g. OS) - nothing we can do to fix it directly
Projects
Status: No status
Development

No branches or pull requests

7 participants