Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[NativeAOT] Inline TLS access for windows/x64 #89472

Merged
merged 64 commits into from
Jan 17, 2024

Conversation

kunalspathak
Copy link
Member

@kunalspathak kunalspathak commented Jul 25, 2023

Inline the TLS access for windows/x64 so we avoid calls to get ThreadStaticBase.

Current code gen:

 0000000000001AC4: E8 00 00 00 00     call        __GetThreadStaticBase_repro_Program
 0000000000001AC9: C7 40 08 0D F0 0D  mov         dword ptr [rax+8],900DF00Dh
...
...
__GetThreadStaticBase_repro_Program:
  000000000000104D: 8B 0D 00 00 00 00  mov         ecx,dword ptr [_tls_index]
  0000000000001053: 65 48 8B 04 25 58  mov         rax,qword ptr gs:[58h]
                    00 00 00
  000000000000105C: 48 8B 04 C8        mov         rax,qword ptr [rax+rcx*8]
  0000000000001060: B9 00 00 00 00     mov         ecx,offset tls_InlinedThreadStatics
  0000000000001065: 48 01 C1           add         rcx,rax
  0000000000001068: 48 8B 01           mov         rax,qword ptr [rcx]
  000000000000106B: 48 85 C0           test        rax,rax
  000000000000106E: 0F 84 00 00 00 00  je          S_P_CoreLib_Internal_Runtime_ThreadStatics__GetInlinedThreadStaticBaseSlow
  0000000000001074: C3                 ret

New code gen:

  0000000000001AE5: 65 48 8B 04 25 58  mov         rax,qword ptr gs:[58h]
                    00 00 00
  0000000000001AEE: 8B 0D 00 00 00 00  mov         ecx,dword ptr [_tls_index]
  0000000000001AF4: BA 00 00 00 00     mov         edx,offset tls_InlinedThreadStatics
  0000000000001AF9: 48 03 14 C8        add         rdx,qword ptr [rax+rcx*8]
  0000000000001AFD: 48 8B CA           mov         rcx,rdx
  0000000000001B00: 48 8B 19           mov         rbx,qword ptr [rcx]
  0000000000001B03: 48 85 DB           test        rbx,rbx
  0000000000001B06: 74 1D              je          0000000000001B25
  0000000000001B08: C7 43 08 0D F0 0D  mov         dword ptr [rbx+8],900DF00Dh

Contributes to #79521

@dotnet-issue-labeler dotnet-issue-labeler bot added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Jul 25, 2023
@ghost ghost assigned kunalspathak Jul 25, 2023
@ghost
Copy link

ghost commented Jul 25, 2023

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
See info in area-owners.md if you want to be subscribed.

Issue Details

Inline the TLS access for windows/x64 so we avoid calls to get ThreadStaticBase.

Current code gen:

 0000000000001AC4: E8 00 00 00 00     call        __GetThreadStaticBase_repro_Program
 0000000000001AC9: C7 40 08 0D F0 0D  mov         dword ptr [rax+8],900DF00Dh
...
...
__GetThreadStaticBase_repro_Program:
  000000000000104D: 8B 0D 00 00 00 00  mov         ecx,dword ptr [_tls_index]
  0000000000001053: 65 48 8B 04 25 58  mov         rax,qword ptr gs:[58h]
                    00 00 00
  000000000000105C: 48 8B 04 C8        mov         rax,qword ptr [rax+rcx*8]
  0000000000001060: B9 00 00 00 00     mov         ecx,offset tls_InlinedThreadStatics
  0000000000001065: 48 01 C1           add         rcx,rax
  0000000000001068: 48 8B 01           mov         rax,qword ptr [rcx]
  000000000000106B: 48 85 C0           test        rax,rax
  000000000000106E: 0F 84 00 00 00 00  je          S_P_CoreLib_Internal_Runtime_ThreadStatics__GetInlinedThreadStaticBaseSlow
  0000000000001074: C3                 ret

New code gen:

  0000000000001AE5: 65 48 8B 04 25 58  mov         rax,qword ptr gs:[58h]
                    00 00 00
  0000000000001AEE: 8B 0D 00 00 00 00  mov         ecx,dword ptr [_tls_index]
  0000000000001AF4: BA 00 00 00 00     mov         edx,offset tls_InlinedThreadStatics
  0000000000001AF9: 48 03 14 C8        add         rdx,qword ptr [rax+rcx*8]
  0000000000001AFD: 48 8B CA           mov         rcx,rdx
  0000000000001B00: 48 8B 19           mov         rbx,qword ptr [rcx]
  0000000000001B03: 48 85 DB           test        rbx,rbx
  0000000000001B06: 74 1D              je          0000000000001B25
  0000000000001B08: C7 43 08 0D F0 0D  mov         dword ptr [rbx+8],900DF00Dh
Author: kunalspathak
Assignees: kunalspathak
Labels:

area-CodeGen-coreclr

Milestone: -

@am11
Copy link
Member

am11 commented Jul 26, 2023

@kunalspathak, some platforms seem to have tls_get_addr in RO section. We've started to get this on illumos-x64 after #87082:

$ docker run --rm -v$(pwd):/runtime -e ROOTFS_DIR=/crossrootfs/x64 \
    mcr.microsoft.com/dotnet-buildtools/prereqs:ubuntu-18.04-cross-illumos-20220531132048-f13d79e  \
    /runtime/build.sh -c Release -os illumos -cross -gcc

...
  [ 95%] Built target clrjit_win_x64_x64
  [ 95%] Built target coreclr_static
  /crossrootfs/x64/bin/../lib/gcc/x86_64-sun-solaris2.10/8.4.0/../../../../x86_64-sun-solaris2.10/bin/ld: ../../../vm/wks/CMakeFiles/cee_wks_core.dir/__/amd64/asmhelpers.S.o: warning: relocation against `__tls_get_addr@@SUNWprivate_1.1' in read-only section `.text'
  /crossrootfs/x64/bin/../lib/gcc/x86_64-sun-solaris2.10/8.4.0/../../../../x86_64-sun-solaris2.10/bin/ld: ../../../vm/wks/CMakeFiles/cee_wks_core.dir/__/amd64/asmhelpers.S.o: relocation R_X86_64_PC32 against protected symbol `__tls_get_addr@@SUNWprivate_1.1' can not be used when making a shared object; recompile with -fPIC
  /crossrootfs/x64/bin/../lib/gcc/x86_64-sun-solaris2.10/8.4.0/../../../../x86_64-sun-solaris2.10/bin/ld: final link failed: bad value
  collect2: error: ld returned 1 exit status
  dlls/mscoree/coreclr/CMakeFiles/coreclr.dir/build.make:974: recipe for target 'dlls/mscoree/coreclr/libcoreclr.so' failed
  make[2]: *** [dlls/mscoree/coreclr/libcoreclr.so] Error 1
  CMakeFiles/Makefile2:4678: recipe for target 'dlls/mscoree/coreclr/CMakeFiles/coreclr.dir/all' failed
  make[1]: *** [dlls/mscoree/coreclr/CMakeFiles/coreclr.dir/all] Error 2
  Makefile:135: recipe for target 'all' failed
  make: *** [all] Error 2
  /runtime/src/coreclr
  Failed to build "CoreCLR component".

I haven't had time to go deeper on it, but thought you might wana know about this variation.

@kunalspathak
Copy link
Member Author

@kunalspathak, some platforms seem to have tls_get_addr in RO section.

tls_get_addr is needed for linux based OS. This PR just adds support for windows x64 currently.

recompile with -fPIC

That seems familiar. Might have to spend time trying to understand.

@kunalspathak kunalspathak marked this pull request as ready for review July 29, 2023 10:06
@kunalspathak
Copy link
Member Author

@dotnet/jit-contrib @VSadov @MichalStrehovsky

@kunalspathak
Copy link
Member Author

@MichalStrehovsky - let me know if there are more pipelines that i should trigger to test this out.

// block (...): [weight: 1.0]
// use(tlsRoot);

if (hasLazyStaticCtor)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it be better to deal with the static constructor independently from the field access? Something like;

  • If the type has static constructor, emit regular static constructor trigger. This can be then optimized/deduplicated with static constructor triggers for the same type.
  • Emit thread static access. This has no side-effects, it can be moved around, CSEd, etc.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it be better to deal with the static constructor independently from the field access?

The expansion happens after all the optimizations, CSE, etc. is done. We tried earlier to do in earlier phase, but it was making it hard to hoist certain checks as mentioned in #82973 (comment). Ideally, we need to expand just subset of appropriate portion of tree in various phases to achieve idle code, but might be tricky to get all the cases. Even today, without these changes, we do not hoist the call to helper out of loop because of

else if (s_helperCallProperties.MayRunCctor(helpFunc) &&
((call->gtFlags & GTF_CALL_HOISTABLE) == 0))
{
INDEBUG(failReason = "non-hoistable helper call";)
treeIsHoistable = false;
}

    static void Test(int n)
    {
        for (int i = 0; i < n; i++)
        {
            x1 = 5; x2 = 5; x3 = 5;
            y1 = 5; y2 = 5; y3 = 5;
        }
    }
G_M000_IG01:                ;; offset=0000H
       push     rsi
       push     rbx
       sub      rsp, 40
       mov      ebx, ecx

G_M000_IG02:                ;; offset=0008H
       lea      rax, [(reloc 0x42aa78)]
       cmp      qword ptr [rax-08H], 0
       jne      SHORT G_M000_IG06

G_M000_IG03:                ;; offset=0016H
       xor      esi, esi
       test     ebx, ebx
       jle      SHORT G_M000_IG05

G_M000_IG04:                ;; offset=001CH
       call     CORINFO_HELP_READYTORUN_THREADSTATIC_BASE
       mov      dword ptr [rax+08H], 5
       mov      dword ptr [rax+0CH], 5
       mov      dword ptr [rax+10H], 5
       mov      dword ptr [rax+14H], 5
       mov      dword ptr [rax+18H], 5
       mov      dword ptr [rax+1CH], 5
       inc      esi
       cmp      esi, ebx
       jl       SHORT G_M000_IG04

G_M000_IG05:                ;; offset=0051H
       add      rsp, 40
       pop      rbx
       pop      rsi
       ret

G_M000_IG06:                ;; offset=0058H
       call     CORINFO_HELP_READYTORUN_NONGCSTATIC_BASE
       jmp      SHORT G_M000_IG03

I propose to take these changes for .NET 8 because of time constraints and revisit missing gaps in .NET 9.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@dotnet/jit-contrib Do you believe that this duplication is worth it for the time being?

.NET 8 because of time constraints and revisit missing gaps in .NET 9.

I think it may be too late to take this change for .NET 8. We are actively trying to shut things down for .NET 8.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it may be too late to take this change for .NET 8.

In that case, I can start working on resolving the duplication problem that you pointed and get this in for .NET 9.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried exploring on the idea of expanding lazy static ctor separately and it could be beneficial in cases where we have both a static init and a TLS, defined and used on same type (see example below). But it still won't eliminate the extra ctor check we introduce by expanding static initializer. If we have the static ctor trigger early, the VN will not
give it same number because under the hood, they call different helper. We could theortically avoid emitting the 2nd invocation of ctor check (0x40000000004203c8 related check below) it during expansion, if we have seen similar ctor invocation earlier, but for that we will have to make sure the previous ctor invocation dominates the block having the 2nd ctor invocation.

static string i = 5.ToString();
[ThreadStatic]
static string j = 5.ToString();
void Foo()
{
    Consume(i);
    Consume(j);
}
G_M27646_IG02:  ;; offset=0x0005
       lea      rcx, [(reloc 0x40000000004203c8)]
       cmp      qword ptr [rcx-0x08], 0
       jne      SHORT G_M27646_IG06
       mov      rcx, qword ptr [(reloc 0x40000000004203d0)]
       mov      rcx, gword ptr [rcx+0x08]
       call     Program:Consume(System.String)
       lea      r8, [(reloc 0x40000000004203c8)]
       add      r8, -8
       cmp      qword ptr [r8], 0
       jne      SHORT G_M27646_IG07
						;; size=47 bbWeight=1 PerfScore 14.25
G_M27646_IG03:  ;; offset=0x0034
       mov      rcx, qword ptr GS:[0x0058]
       mov      eax, dword ptr [(reloc 0x40000000004203d8)]
       mov      edx, (reloc 0x40000000004203e0)
       add      rdx, qword ptr [rcx+8*rax]
       mov      rbx, qword ptr [rdx]
       test     rbx, rbx
       je       SHORT G_M27646_IG08
						;; size=32 bbWeight=1 PerfScore 9.50
G_M27646_IG04:  ;; offset=0x0054
       mov      rcx, gword ptr [rbx+0x80]
       call     Program:Consume(System.String)
       nop      
						;; size=13 bbWeight=1 PerfScore 3.25
G_M27646_IG05:  ;; offset=0x0061
       add      rsp, 32
       pop      rbx
       ret      
						;; size=6 bbWeight=1 PerfScore 1.75
G_M27646_IG06:  ;; offset=0x0067
       call     CORINFO_HELP_READYTORUN_GCSTATIC_BASE
       mov      rcx, qword ptr [(reloc 0x40000000004203d0)]
       mov      rcx, gword ptr [rcx+0x08]
       call     Program:Consume(System.String)
       lea      r8, [(reloc 0x40000000004203c8)]
       add      r8, -8
       cmp      qword ptr [r8], 0
       je       SHORT G_M27646_IG03
						;; size=38 bbWeight=0 PerfScore 0.00
G_M27646_IG07:  ;; offset=0x008D
       xor      ecx, ecx
       mov      edx, -1
       lea      rax, [(reloc 0x40000000004203f0)]
       call     rax
       mov      rbx, rax
       jmp      SHORT G_M27646_IG04
						;; size=21 bbWeight=0 PerfScore 0.00
G_M27646_IG08:  ;; offset=0x00A2
       mov      rcx, rdx
       lea      rax, [(reloc 0x40000000004203e8)]
       call     rax
       mov      rbx, rax
       jmp      SHORT G_M27646_IG04
						;; size=17 bbWeig

I am not sure how common it is to see such patterns, @VSadov - any thoughts?

In cases such as below, we generate expected code:

    [ThreadStatic]
    static string i = 5.ToString();
    [ThreadStatic]
    static string j = 5.ToString();

        for (int k = 0; k < 10; k++)
        {            
            Consume(i);
            Consume(j);
        }
G_M27646_IG02:  ;; offset=0x0006
       xor      ebx, ebx
       lea      r8, [(reloc 0x40000000004203e8)]
       add      r8, -8
       cmp      qword ptr [r8], 0
       jne      SHORT G_M27646_IG05
       mov      rcx, qword ptr GS:[0x0058]
       mov      eax, dword ptr [(reloc 0x40000000004203c8)]
       mov      edx, (reloc 0x40000000004203d0)
       add      rdx, qword ptr [rcx+8*rax]
       mov      rsi, qword ptr [rdx]
       test     rsi, rsi
       je       SHORT G_M27646_IG06
                                                ;; size=51 bbWeight=1 PerfScore 14.50
G_M27646_IG03:  ;; offset=0x0039
       mov      rcx, gword ptr [rsi+0x68]
       call     Program:Consume(System.String)
       mov      rcx, gword ptr [rsi+0x78]
       call     Program:Consume(System.String)
       inc      ebx
       cmp      ebx, 10
       jl       SHORT G_M27646_IG03

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Of general patterns, I think we can only assume that threadstatics are relatively less common than other kinds of fields. Also it is not common to have a lot of them in the same type. It is valid and certainly possible to see numerous threadstatics in a type, but it is more common for a type to have just one such field that holds all the threadstatic state for the type.
I'd say - 2 threadstatic fields in a type is likely common enough, 20 - is much less likely.

Mixing threadstatic and regular static fields in the same type is fairly common.

As for the order of access within the same method - static or threadstatic first, I think it would be too subtle if one pattern performs better than another.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can only assume that threadstatics are relatively less common than other kinds of fields

On the other hand, I'd guess that code that uses them is often performance sensitive so handling it efficiently for AOT is important.

@VSadov
Copy link
Member

VSadov commented Jan 10, 2024

I tried running a few libraires testcases in a loop with the change, for stress purposes, and once in a while (about every 5-15 iterations) I see occasional crashes that look like GC holes.
That is with -rc Release -lc Release.

Same tests with f2a9ef8 commit from main (I assume that is the baseline for this change) run without failures, at least for 300 iterations, that I waited for.

Sometimes it is not a crash, but a test failure - typically like:

 System.Threading.SynchronizationLockException : Object synchronization method was called from an unsynchronized block of code.
     at System.Threading.Monitor.Exit(Object) + 0x15b
     at System.Collections.Concurrent.ConcurrentDictionary`2.ReleaseLocks(Int32) + 0x2d
     at System.Collections.Concurrent.ConcurrentDictionary`2.get_Count() + 0x34
     at System.Collections.Tests.IDictionary_Generic_Tests`2.AddToCollection(ICollection`1 collection, Int32 numberOfItemsToAdd) + 0x151
     at System.Collections.Tests.IDictionary_Generic_Tests`2.GenericIDictionaryFactory(Int32 count) + 0x2d
     at System.Collections.Tests.IDictionary_Generic_Tests`2.IDictionary_Generic_Values_ModifyingTheDictionaryUpdatesTheCollection(Int32 c

Possibly the same reason, just a different way to fail.

Maybe worth checking if all is ok with GC info?

@kunalspathak
Copy link
Member Author

Maybe worth checking if all is ok with GC info?

It was not. I was not considering that the value t_inlinedThreadStaticBase returned needs tracking. Here is the before vs. after fix diff:

image

@VSadov
Copy link
Member

VSadov commented Jan 13, 2024

Maybe worth checking if all is ok with GC info?

It was not. I was not considering that the value t_inlinedThreadStaticBase returned needs tracking. Here is the before vs. after fix diff:

I am running the same subset of the tests that was hitting the GC holes and after 100+ iterations I do not see any failures.
It looks like it was the culprit.

@VSadov
Copy link
Member

VSadov commented Jan 13, 2024

I cannot comment on the shape of JIT changes, but we do get the desired codegen and the run time behavior. Very nice!

Copy link
Member

@VSadov VSadov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Thanks!!

@VSadov
Copy link
Member

VSadov commented Jan 13, 2024

The failures in System.Text.Json.Tests on CoreCLR appear to be #96876 and not related to this change since this is NativeAOT-specific.

Process terminated. Assertion failed.
Implementation depends on List<T> always using a T[] and not U[] where U : T.
   at System.Text.Json.Serialization.Tests.MetadataTests.VerifyWeatherForecastWithPOCOs(WeatherForecastWithPOCOs expected, WeatherForecastWithPOCOs obj) in /_/src/libraries/System.Text.Json/tests/System.Text.Json.Tests/Serialization/MetadataTests/MetadataTests.JsonSerializer.cs:line 84
   at System.Text.Json.Serialization.Tests.MetadataTests.RoundTripSerializerOverloads() in /_/src/libraries/System.Text.Json/tests/System.Text.Json.Tests/Serialization/MetadataTests/MetadataTests.JsonSerializer.cs:line 21
   at System.Runtime.CompilerServices.AsyncMethodBuilderCore.Start[TStateMachine](TStateMachine& stateMachine) in /_/src/libraries/System.Private.CoreLib/src/System/Runtime/CompilerServices/AsyncMethodBuilderCore.cs:line 38
   at System.Text.Json.Serialization.Tests.MetadataTests.RoundTripSerializerOverloads()
   at System.RuntimeMethodHandle.InvokeMethod(Object target, Void** arguments, Signature sig, Boolean isConstructor)
   at System.Reflection.MethodBaseInvoker.InvokeWithNoArgs(Object obj, BindingFlags invokeAttr) in /_/src/libraries/System.Private.CoreLib/src/System/Reflection/MethodBaseInvoker.cs:line 57
   at Xunit.Sdk.TestInvoker`1.<>c__DisplayClass48_0.<<InvokeTestMethodAsync>b__1>d.MoveNext()

@kunalspathak
Copy link
Member Author

@BruceForstall - can you please take another look?

@kunalspathak kunalspathak merged commit 7989f18 into dotnet:main Jan 17, 2024
122 of 125 checks passed
@kunalspathak kunalspathak deleted the tls_nativeaot_winx64 branch January 17, 2024 20:31
tmds pushed a commit to tmds/runtime that referenced this pull request Jan 23, 2024
* wip

* working model

* wip

* wip

* working

* Add helper for tlsIndex

* add methods in superpmi

* revert some local changes

* misc fixes

* Stop emitting TLS access code for windows/x64

* fix linux build errors

* Do not throw not implemented for windows/x64

* fix the problem where ThreadStaticBase helper was still getting invoked

* Revert certain changes from JIT method

* Introduce getThreadLocalStaticInfo_ReadyToRun()

* Consume getThreadLocalStaticInfo_ReadyToRun()

* Remove getTlsRootInfo() and other methods

* Revert unneeded changes

* missing gtInitCldHnd initialization

* save target address

* jit format

* run thunkgenerator

* resolve merge conflicts

* fix issues so the TLS is inlined

* Rename data structures from *_ReadyToRun to *_NativeAOT

* jit format

* fix some unit test

* fix a bug

* fix the weird jump problem

* use enclosing type cls handle for VN of static gc/non-gc helper

* fix a bug of resetting the flag

* useEnclosingTypeOnly from runtime to determine if VN should optimize it

* do not use vnf, but only use useEnclosingTypeAsArg0

* Use GT_COMMA to add GCStaticBase call next to TLS call

* optimize the cctor call

* Remove lazy ctor generation from tls

* Update jitinterface to not fetch data for lazy ctor

* fix errors after merge

* fix test build errors

* fix bug in CSE

* Use CORINFO_FLG_FIELD_INITCLASS instead of separate flag

* Use the INITCLASS flag

* Remove useEnclosingTypeOnly

* Add NoCtor

* Use CORINFO_HELP_READYTORUN_THREADSTATIC_BASE_NOCTOR

* Minor cleanup

* Renegenrate thunk

* Add the SetFalseTarget

* fix merge conflict resolution

* better handling of GTF_ICON_SECREL_OFFSET

better handling of GTF_ICON_SECREL_OFFSET

* review feedback

* Disable optimization for minopts

* Add comments around iiaSecRel

* jit format

* create emitNewInstrCns()

* Expand TLS even if optimization is disabled

* Track t_inlinedThreadStaticBase

Better tracking `t_inlinedThreadStaticBase` as TYP_REF

* jit format
@github-actions github-actions bot locked and limited conversation to collaborators Feb 19, 2024
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI
Projects
None yet
Development

Successfully merging this pull request may close these issues.

10 participants