Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[CI ONLY] [NativeAOT] ObjWriter in C# #92705

Closed
wants to merge 120 commits into from

Conversation

filipnavara
Copy link
Member

@filipnavara filipnavara commented Sep 27, 2023

Ref: #77178

This is reimplementation of NativeAOT ObjWriter in pure C# instead of depending on LLVM. It implements Mach-O, ELF, COFF object file emitter with DWARF and CodeView debugging information. Only x64 and arm64 targets are implemented to cover officially supported platforms. Certain features are not implemented yet, e.g. COMDAT in ELF. Other features like DWARF debugging info generation are currently slower than the previous implementation. A limited testing was done on osx-arm64, win-x64, and linux-x64. Previous version of the branch was also tested on osx-x64, win-arm64, and linux-arm64.

Caveat: This is NOT for review, the draft PR is opened specifically to run smoke tests only. The code was rebased over current main branch and updated to reflect most ObjWriter changes from the past year (both on the runtime repo side and the LLVM fork repo). The performance and structure of the code is not the final shape and I expect to rewrite certain parts before submitting this for actual review.

cc @TIHan

Tracking list of issues found by the CI:

  • linux-x64: System.NotSupportedException: Unsupported relocation: IMAGE_REL_TLSGD
  • linux-arm64: relocation R_AARCH64_TLSDESC_ADR_PAGE21 cannot be used against symbol 'tls_InlinedThreadStatics'
  • linux-arm64: libunwind: malformed DW_CFA_register DWARF unwind, reg too big
  • linux-arm64: Different pointer encoding is used in EH frames which seems to cause issues
  • iOS/tvOS: warning: ignoring file /tmp/helix/working/B5450932/w/C6420AD0/e/publish/native/iOS.Device.Aot.Test.o, building for iOS-arm64 but attempting to link with file built for unknown-unsupported file format ( 0x7F 0x45 0x4C 0x46 0x02 0x01 0x01 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 )
  • win-x64: Framework.lib(System.Private.CoreLib.obj) : fatal error LNK1243: invalid or corrupt file: COMDAT section 0x20001 associated with following section 0x0 (MultiModule test)
  • win-arm64: CoreFXTestLibrary.AssertTestException: Assert.AreEqual: Expected: [System.Func2[CommonType10[],CommonType10[]]]. Actual: [System.Func2[CommonType10[],CommonType10[]]].

@ghost ghost added the community-contribution Indicates that the PR has been added by a community member label Sep 27, 2023
@ghost
Copy link

ghost commented Sep 27, 2023

Tagging subscribers to this area: @dotnet/area-system-reflection-metadata
See info in area-owners.md if you want to be subscribed.

Issue Details

Ref: #77178

This is reimplementation of NativeAOT ObjWriter in pure C# instead of depending on LLVM. It implements Mach-O, ELF, COFF object file emitter with DWARF and CodeView debugging information. Only x64 and arm64 targets are implemented to cover officially supported platforms. Certain features are not implemented yet, e.g. COMDAT in ELF. Other features like DWARF debugging info generation are currently slower than the previous implementation. A limited testing was done on osx-arm64, win-x64, and linux-x64. Previous version of the branch was also tested on osx-x64, win-arm64, and linux-arm64.

Caveat: This is NOT for review, the draft PR is opened specifically to run smoke tests only. The code was rebased over current main branch and updated to reflect most ObjWriter changes from the past year (both on the runtime repo side and the LLVM fork repo). The performance and structure of the code is not the final shape and I expect to rewrite certain parts before submitting this for actual review.

cc @TIHan

Author: filipnavara
Assignees: -
Labels:

area-System.Reflection.Metadata, community-contribution

Milestone: -

@ghost
Copy link

ghost commented Sep 27, 2023

Tagging subscribers to this area: @agocke, @MichalStrehovsky, @jkotas
See info in area-owners.md if you want to be subscribed.

Issue Details

Ref: #77178

This is reimplementation of NativeAOT ObjWriter in pure C# instead of depending on LLVM. It implements Mach-O, ELF, COFF object file emitter with DWARF and CodeView debugging information. Only x64 and arm64 targets are implemented to cover officially supported platforms. Certain features are not implemented yet, e.g. COMDAT in ELF. Other features like DWARF debugging info generation are currently slower than the previous implementation. A limited testing was done on osx-arm64, win-x64, and linux-x64. Previous version of the branch was also tested on osx-x64, win-arm64, and linux-arm64.

Caveat: This is NOT for review, the draft PR is opened specifically to run smoke tests only. The code was rebased over current main branch and updated to reflect most ObjWriter changes from the past year (both on the runtime repo side and the LLVM fork repo). The performance and structure of the code is not the final shape and I expect to rewrite certain parts before submitting this for actual review.

cc @TIHan

Author: filipnavara
Assignees: -
Labels:

community-contribution, area-NativeAOT-coreclr

Milestone: -

@filipnavara filipnavara added NO-MERGE The PR is not ready for merge yet (see discussion for detailed reasons) NO-REVIEW Experimental/testing PR, do NOT review it labels Sep 27, 2023
@filipnavara
Copy link
Member Author

Still not sure what is going on with the win-arm64 tests. They pass locally on my Windows Dev Kit 2023.

@am11
Copy link
Member

am11 commented Sep 28, 2023

This one looks a bit less mysterious:

Running Test: ExistingInstantiations.Test.TestWithExistingInst
Caught Unexpected exception:System.NotSupportedException: Cannot retrieve a MethodInfo for this delegate because the method it targeted was not enabled for metadata.
   at Internal.Reflection.Extensions.NonPortable.DelegateMethodInfoRetriever.GetDelegateMethodInfo(Delegate) + 0x29c
   at ExistingInstantiations.Test.TestWithExistingInst() + 0x518
   at CoreFXTestLibrary.Internal.Runner.RunTestMethod(TestInfo) + 0x364
   at CoreFXTestLibrary.Internal.Runner.RunTest(TestInfo) + 0x20
---- Test FAILED ---------------

perhaps TestReverseLookupsWithArrayArg failure has the same underlying reason?

BTW, CI is using precisely this toolchain version:

  -- The C compiler identification is MSVC 19.36.32537.0
  -- The CXX compiler identification is MSVC 19.36.32537.0

to cross-compile for arm64 on x64: build.cmd -ci -arch arm64 -os windows -s clr.aot+host.native+libs+tools.illink -c Release -rc Release -lc Release -hc Release

@filipnavara
Copy link
Member Author

This one looks a bit less mysterious:

Thanks, @am11. I will check that one. I will likely wait till tomorrow to have direct access to the Win/ARM machine instead of just RDP-over-RDP-over-Tailscale. I have an uncommitted code where I match more closely the COFF output from the old ObjWriter so it's diffable to certain extent, and easier to spot differences. The DynamicGenerics test produces high number of sections and relocations and triggers the "big obj" code paths which didn't get much testing, so that's my primary suspicion.

@TIHan
Copy link
Contributor

TIHan commented Sep 28, 2023

Thank you for making the PR @filipnavara .

I'm out for a bit and hopefully will look at this next week. So far it looks promising.

@filipnavara
Copy link
Member Author

Still not sure what is going on with the win-arm64 tests. They pass locally on my Windows Dev Kit 2023.

Turns out the error was somewhere between the chair and the keyboard. I had a wrong branch checked out.

@filipnavara
Copy link
Member Author

I found the issue in the COFF/ARM64 code. I was incorrectly ignoring the addend for IMAGE_REL_BASED_ARM64_PAGEOFFSET_12A (and it was actually an Assert on debug build). This broke fat pointers that use the +2 address offset to be distinguished from real pointers. The same issue didn't happen for Mach-O since it rewrites this particular relocation into a more complex one. On ELF we emit the RELA section with addend inside the relocations, not inside the code, so it happened to work. COFF is the only platform that embeds this kind of relocation in the data.

While searching for the root cause I found couple more issues that are mostly harmless but should be fixed nevertheless. I'll clean up the code and commit it soon.

@filipnavara
Copy link
Member Author

filipnavara commented Sep 30, 2023

I think the code now reached a point where it passes the smoke tests for the supported platforms, which was the primary purpose of this PR. I'll keep it open for the moment, but it served its purpose.

Aside from some general structural improvements these are the areas I intend to explore next (in no particular order):

  • Emitting DWARF debugging info eagerly and directly into corresponding sections without the intermediate representation. (partially done, on par with LLVM ObjWriter)
  • Resolve relative relocations within same section before writing them down to the object files. This should have no effect on output after linking. It does, however, produce slightly smaller object files, and it may avoid hitting limits on the number of relocations in certain formats.
  • Optimize the string table building to take advantage of suffix matches. All the formats use some variation of string table with null-terminated string referenced by offset. It's possible to generate smaller object files by producing _unwind0_XYZ at offset N, and then reusing the suffix XYZ at offset N+9 as a different string.
  • Explore generating XDATA/PDATA unwinding info in COFF to be closer to the structure that MSVC produces. (done)
  • Implement COMDAT handling for ELF file format. It's the one big missing thing for feature parity. (done)

@am11
Copy link
Member

am11 commented Oct 1, 2023

According to dotnet-trace on linux-arm64. ilc.dll spends > 11 seconds in

libobjectfile!LibObjectFile.Elf.ElfWriter`1[LibObjectFile.Elf.ElfEncoderDirect].WriteSections()

during System.Runtime.Tests publishing (and output object size is 273M vs. main's 244M). With published-ilc, this test OOMs (code 137, the current CI failure).

speedscope-trace.zip

@filipnavara
Copy link
Member Author

@am11 Thanks for looking into it, really appreciated!

I didn't focus on the performance and memory usage outside of isolated scenarios. I profiled some code paths for COFF and CodeView but there's very little overlap with what ELF and DWARF does. I'll check the trace you provided and do some profiling on my side as well.

The output size difference is expected. It's mostly caused by extra relocations in the output and non-optimized string table (as mentioned in the "future work" list above). There's also some difference in debugging info size but not nearly as big.

Notably, the memory usage of the DWARF debugging info emitter is pretty high. It's not easy to refactor without significant changes to LibObjectFile. It operates on a "document" model (akin to JsonDocument but binary and multiple cross-linked documents), while we really need something closer to the "writer" model (akin to Utf8JsonWriter). There's already an internal model present in the compiler and it is memory intensive to convert from one model to the other. Additionally, the DWARF model in LibObjectFile has additional abstraction for relocations which are then converted to ELF relocations. This all adds up both in memory and time profiles.

I did some comparisons with the debugging info turned off, and the results were largely favorable to the C# implementation. That said, I did it primarily on Mach-O, not ELF.

@filipnavara
Copy link
Member Author

speedscope-trace.zip

Wow, I totally didn't expect ElfObjectFile.Verify to take 5 times as much as ElfObjectFile.WriteSections. I knew ElfStringTable.GetOrCreateIndex is extremely inefficient but this exceeded all my expectations. It tries to do suffix matching but in the process creates a ton of strings. Given all the time spent in the method I suspect the memory usage culprit could be there as well.

@filipnavara
Copy link
Member Author

filipnavara commented Oct 2, 2023

Since some people are apparently following the PR and looking at some of the performance issues, I made an isolated sample showing the problem with ElfStringTable: https://github.com/filipnavara/StringTableBenchmark

I used three different algorithms to build the string table:

  • Naive produces string table with no suffix deduplication (similar to the code used for COFF)
  • MultiKeySort sorts the input strings and then produces optimized string table with deduplicated suffixes (similar to the LLVM string table builder)
  • ElfStringTable uses LibObjectFile's ElfStringTable, which incrementally builds the table and uses dictionary to deduplicate suffixes

Note that ElfStringTable technically provides an incremental API where the full set of string is not known beforehand. However, in reality it's pre-populated with the full string set anyway, and this property can be used for optimizing the common case without changing the external API.

Without further ado, here are the results on my MacBook Air M1 for a sample string set taken from UseSystemResourceKeys.o in the smoke tests. The set contains slightly less than 25k strings where many can be suffix compressed.

Method Mean Error StdDev Gen0 Gen1 Gen2 Allocated
ElfStringTable 350.838 ms 6.8811 ms 8.1914 ms 38000.0000 21000.0000 7000.0000 316.81 MB
MultiKeySort 11.456 ms 0.0469 ms 0.0439 ms 1062.5000 1046.8750 968.7500 4.87 MB
Naive 2.374 ms 0.0060 ms 0.0050 ms 1042.9688 1015.6250 996.0938 5.82 MB

You can clearly see that the approach used by ElfStringTable doesn't scale well.

--

The input is pre-sorted, so the 3-way radix quick sort in MultiKeySort happens to choose unoptimal pivot, which skews the result a bit. I didn't spend much time micro-optimizing it, so it serves more as a ballpark figure.

--

I updated the MultiKeySort version a bit to get 30% speed improvement. The size of the output is 2,457,077 bytes for the Naive implementation, and 1,896,752 for the MultiKeySort one.

@filipnavara filipnavara force-pushed the objwriter3 branch 2 times, most recently from 00bfd74 to 1c0d928 Compare October 2, 2023 16:06
@am11
Copy link
Member

am11 commented Oct 2, 2023

Great optimizations! linux-arm64 object size is 241M vs. 244M on main, and CI leg isn't jamming. :)

@agocke
Copy link
Member

agocke commented Oct 2, 2023

@filipnavara really good info. Agreed that ElfStringTable isn't looking great

@filipnavara
Copy link
Member Author

Great optimizations! linux-arm64 object size is 241M vs. 244M on main, and CI leg isn't jamming. :)

I really appreciate that you checked and helped diagnose this issue. 👍 I am really happy to have some working baseline version of the changes before I proceed to do further optimizations and experiments.

@filipnavara
Copy link
Member Author

@filipnavara really good info. Agreed that ElfStringTable isn't looking great

I committed an experimental fix and it passed the CI. I'll submit it upstream and then focus on the further work I mentioned above.

@TIHan
Copy link
Contributor

TIHan commented Oct 2, 2023

@filipnavara , this is looking really good and glad others were able to look at it. The performance improvements do look great.

How far do you think your solution is from matching the existing functionality? It looks like the smoke tests are passing on all the platforms. @agocke , are there any scenarios that we need to cover that are not covered by the tests?

- Section names need to come first in the string table because of limited space for their reference by offset. This caused the "managedcode$I" and "modules$I" section names to be garbage when there were many symbols.
- Fix missing array pool return.
Remove LibObjectFile dependency
@@ -238,7 +238,7 @@ private CodeViewRegister GetCVRegNum(uint regNum)

foreach (var sequencePoint in sequencePoints)
{
if (lastFileName == null || lastFileName != sequencePoint.FileName)
if (lastFileName is null || lastFileName != sequencePoint.FileName)
Copy link
Member

@am11 am11 Dec 11, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you prefer pattern matching terse syntax, we can reduce the verbosity a bit in those long RelocType.XX conditions:

using static ILCompiler.DependencyAnalysis.RelocType;
...
if (relocType is IMAGE_REL_BASED_ARM64_BRANCH26 or IMAGE_REL_BASED_ARM64_PAGEBASE_REL21 or
     IMAGE_REL_BASED_ARM64_PAGEOFFSET_12A or IMAGE_REL_AARCH64_TLSLE_ADD_TPREL_HI12 or ..)
...

(the duplication due to IMAGE_REL_ prefix is enough as-is 😅)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I used several different code styles as I progressed, sometimes intentionally, sometimes as a consequence of reusing existing code... I am generally open to ideas how to keep the code as terse and readable as possible ;-)

@github-actions github-actions bot locked and limited conversation to collaborators Jan 11, 2024
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
area-NativeAOT-coreclr community-contribution Indicates that the PR has been added by a community member NO-MERGE The PR is not ready for merge yet (see discussion for detailed reasons) NO-REVIEW Experimental/testing PR, do NOT review it
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants