Thread.InitializeCurrentThread should be marked as cold code #49520
I couldn't figure out the best area label to add to this issue. If you have write-permissions please help me learn by adding exactly one area label. |
My understanding is that static prediction is out of favor these days. But reordering like you propose has other benefits:
|
Yes, normally the branch predictor will be the dominant factor; however, the peculiarity here is that it's included in every async method's kick off, as they all have different generic types for Being more generic if there are more examples could add another enum to |
PGO could obviously also address this; however it seems problematic to always start in the "wrong" state for all these methods? |
We should be able to start in a good state with statically pre-generated PGO data. |
Maybe it's time to introduce Likely/Unlikely APIs? |
Yes, these APIs would allow these micro-optimizations to be done manually. I do not think we would welcome PRs to add likely/unlikely annotations in the core libraries. I think we would rather depend on the static PGO data instead, since it scales a lot better and does not require humans to guess what is more likely. |
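For readers unfamiliar with the idea, here is a minimal sketch of what a manual likely/unlikely annotation could look like at a call site. The `BranchHint` class and its methods are hypothetical and written only for illustration. As defined here they are plain identity functions, so the JIT gives them no special meaning. The `EnsureInitialized` shape is loosely modeled on the pattern discussed in this thread.

```csharp
using System.Runtime.CompilerServices;

// Hypothetical hint API, for illustration only. These are identity functions;
// nothing here changes how the JIT lays out code.
public static class BranchHint
{
    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    public static bool Likely(bool condition) => condition;

    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    public static bool Unlikely(bool condition) => condition;
}

public static class HintDemo
{
    private static int s_initialized;

    // Returns 1 when the (rare) initialization path ran, 0 otherwise.
    public static int EnsureInitialized()
    {
        // A human guesses that this branch is rarely taken.
        if (BranchHint.Unlikely(s_initialized == 0))
        {
            s_initialized = 1;
            return 1;
        }
        return 0;
    }
}
```

The sketch also shows the drawback raised above: every such annotation is a hand-written guess that has to be kept correct, whereas profile data measures the bias directly.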
Here you can see PGO doing this sort of thing -- I'll see if I can dig up an example with
;;; NO PGO
; Assembly listing for method Winsock:EnsureInitialized()
; Emitting BLENDED_CODE for X64 CPU with AVX - Windows
; Tier-1 compilation
; optimized code
; rsp based frame
; fully interruptible
; PGO data available, but JitDisablePGO == 1
; Final local variable assignments
;
;# V00 OutArgs [V00 ] ( 1, 1 ) lclBlk ( 0) [rsp+00H] "OutgoingArgSpace"
;
; Lcl frame size = 0
G_M27667_IG01: ; gcVars=0000000000000000 {}, gcrefRegs=00000000 {}, byrefRegs=00000000 {}, gcvars, byref, nogc <-- Prolog IG
;; bbWeight=1 PerfScore 0.00
G_M27667_IG02: ; gcrefRegs=00000000 {}, byrefRegs=00000000 {}, byref, isz
mov rax, 0xD1FFAB1E
cmp dword ptr [rax], 0
jne SHORT G_M27667_IG04
;; bbWeight=1 PerfScore 3.25
G_M27667_IG03: ; gcrefRegs=00000000 {}, byrefRegs=00000000 {}, byref, epilog, nogc
jmp Winsock:<EnsureInitialized>g__Initialize|55_0()
; gcr arg pop 0
;; bbWeight=0.50 PerfScore 1.00
G_M27667_IG04: ; gcrefRegs=00000000 {}, byrefRegs=00000000 {}, byref, epilog, nogc
ret
;; bbWeight=0.50 PerfScore 0.50
;;; PGO
; Assembly listing for method Winsock:EnsureInitialized()
; Emitting BLENDED_CODE for X64 CPU with AVX - Windows
; Tier-1 compilation
; optimized code
; optimized using profile data
; rsp based frame
; fully interruptible
; with IBC profile data, edge weights are valid, and fgCalledCount is 669
; Final local variable assignments
;
;# V00 OutArgs [V00 ] ( 1, 1 ) lclBlk ( 0) [rsp+00H] "OutgoingArgSpace"
;
; Lcl frame size = 0
G_M27667_IG01: ; gcVars=0000000000000000 {}, gcrefRegs=00000000 {}, byrefRegs=00000000 {}, gcvars, byref, nogc <-- Prolog IG
;; bbWeight=1 PerfScore 0.00
G_M27667_IG02: ; gcrefRegs=00000000 {}, byrefRegs=00000000 {}, byref, isz
mov rax, 0xD1FFAB1E
cmp dword ptr [rax], 0
je SHORT G_M27667_IG04
;; bbWeight=1 PerfScore 3.25
G_M27667_IG03: ; gcrefRegs=00000000 {}, byrefRegs=00000000 {}, byref, epilog, nogc
ret
;; bbWeight=1.00 PerfScore 1.00
G_M27667_IG04: ; gcrefRegs=00000000 {}, byrefRegs=00000000 {}, byref, epilog, nogc
jmp Winsock:<EnsureInitialized>g__Initialize|55_0()
; gcr arg pop 0
;; bbWeight=0.00 PerfScore 0.00
|
Here it is -- Note SPMI hides the callee name. This is with dynamic PGO, but static PGO will do similar things.
;;; NO PGO
; Assembly listing for method System.Threading.Thread:get_CurrentThread():System.Threading.Thread
; Emitting BLENDED_CODE for X64 CPU with AVX - Windows
; Tier-1 compilation
; optimized code
; rsp based frame
; fully interruptible
; PGO data available, but JitDisablePGO == 1
; Final local variable assignments
;
; V00 OutArgs [V00 ] ( 1, 1 ) lclBlk (32) [rsp+00H] "OutgoingArgSpace"
; V01 tmp1 [V01,T00] ( 2, 4 ) ref -> rax class-hnd "dup spill"
; V02 tmp2 [V02,T01] ( 3, 2.50) ref -> rax
;
; Lcl frame size = 40
G_M9749_IG01: ; gcVars=0000000000000000 {}, gcrefRegs=00000000 {}, byrefRegs=00000000 {}, gcvars, byref, nogc <-- Prolog IG
sub rsp, 40
;; bbWeight=1 PerfScore 0.25
G_M9749_IG02: ; gcrefRegs=00000000 {}, byrefRegs=00000000 {}, byref, isz
mov rcx, 0xD1FFAB1E
mov edx, 639
call CORINFO_HELP_GETSHARED_GCTHREADSTATIC_BASE
; byrRegs +[rax]
; gcr arg pop 0
mov rax, gword ptr [rax+24]
; gcrRegs +[rax]
; byrRegs -[rax]
test rax, rax
jne SHORT G_M9749_IG04
;; bbWeight=1 PerfScore 4.75
G_M9749_IG03: ; gcrefRegs=00000000 {}, byrefRegs=00000000 {}, byref, epilog, nogc
; gcrRegs -[rax]
add rsp, 40
jmp hackishModuleName:hackishMethodName()
; gcr arg pop 0
;; bbWeight=0.50 PerfScore 1.12
G_M9749_IG04: ; gcrefRegs=00000001 {rax}, byrefRegs=00000000 {}, byref, epilog, nogc
; gcrRegs +[rax]
add rsp, 40
ret
;; bbWeight=0.50 PerfScore 0.62
;;; PGO
; Assembly listing for method System.Threading.Thread:get_CurrentThread():System.Threading.Thread
; Emitting BLENDED_CODE for X64 CPU with AVX - Windows
; Tier-1 compilation
; optimized code
; optimized using profile data
; rsp based frame
; fully interruptible
; with IBC profile data, edge weights are valid, and fgCalledCount is 240550
; Final local variable assignments
;
; V00 OutArgs [V00 ] ( 1, 1 ) lclBlk (32) [rsp+00H] "OutgoingArgSpace"
; V01 tmp1 [V01,T00] ( 2, 4 ) ref -> rax class-hnd "dup spill"
; V02 tmp2 [V02,T01] ( 3, 3.00) ref -> rax
;
; Lcl frame size = 40
G_M9749_IG01: ; gcVars=0000000000000000 {}, gcrefRegs=00000000 {}, byrefRegs=00000000 {}, gcvars, byref, nogc <-- Prolog IG
sub rsp, 40
;; bbWeight=1 PerfScore 0.25
G_M9749_IG02: ; gcrefRegs=00000000 {}, byrefRegs=00000000 {}, byref, isz
mov rcx, 0xD1FFAB1E
mov edx, 639
call CORINFO_HELP_GETSHARED_GCTHREADSTATIC_BASE
; byrRegs +[rax]
; gcr arg pop 0
mov rax, gword ptr [rax+24]
; gcrRegs +[rax]
; byrRegs -[rax]
test rax, rax
je SHORT G_M9749_IG04
;; bbWeight=1 PerfScore 4.75
G_M9749_IG03: ; gcrefRegs=00000001 {rax}, byrefRegs=00000000 {}, byref, epilog, nogc
add rsp, 40
ret
;; bbWeight=1.00 PerfScore 1.25
G_M9749_IG04: ; gcrefRegs=00000000 {}, byrefRegs=00000000 {}, byref, epilog, nogc
; gcrRegs -[rax]
add rsp, 40
jmp hackishModuleName:hackishMethodName()
; gcr arg pop 0
;; bbWeight=0.00 PerfScore 0.00
Raw profile view showing branch bias: |
That's beautiful! 🤩 Should this issue be closed? |
We should be seeing profile data flowing into both prejitting and jitting soon, so it would be good to confirm once that happens that we get the above without any explicit enabling of PGO. So let's keep this open as a reminder. |
Here's a listing from today's build. We now have PGO data flowing into SPC and on through to jitting. This is happening without any COMPlus settings.
; Assembly listing for method Thread:get_CurrentThread():Thread
; Emitting BLENDED_CODE for X64 CPU with AVX - Windows
; Tier-1 compilation
; optimized code
; optimized using profile data
; rsp based frame
; fully interruptible
; with IBC profile data, edge weights are valid, and fgCalledCount is 157048
; Final local variable assignments
;
; V00 OutArgs [V00 ] ( 1, 1 ) lclBlk (32) [rsp+00H] "OutgoingArgSpace"
; V01 tmp1 [V01,T00] ( 2, 4 ) ref -> rax class-hnd "dup spill"
; V02 tmp2 [V02,T01] ( 3, 3.00) ref -> rax
;
; Lcl frame size = 40
G_M63029_IG01: ;; offset=0000H
4883EC28 sub rsp, 40
;; bbWeight=1 PerfScore 0.25
G_M63029_IG02: ;; offset=0004H
48B92000239AFD7F0000 mov rcx, 0x7FFD9A230020
BA7F020000 mov edx, 639
E8886AAB5F call CORINFO_HELP_GETSHARED_GCTHREADSTATIC_BASE
488B4018 mov rax, gword ptr [rax+24]
4885C0 test rax, rax
7405 je SHORT G_M63029_IG04
;; bbWeight=1 PerfScore 4.75
G_M63029_IG03: ;; offset=0021H
4883C428 add rsp, 40
C3 ret
;; bbWeight=1.00 PerfScore 1.25
G_M63029_IG04: ;; offset=0026H
4883C428 add rsp, 40
E9C955FFFF jmp Thread:InitializeCurrentThread():Thread
;; bbWeight=0.00 PerfScore 0.00 |
Does it pick that up for async Method kick offs (as they will be new inlines); though presumably that will kick in at Tier1 anyway and at Tier0 the ordering of that |
We still have some work to do here, in cases where the caller doesn't have PGO data. For instance
The inlinee
but the caller does not, and so we don't bother to scale the inlinee counts, and after the resultant block fusion and unity weight descaling we end up seriously confused:
Looks like we should always compute a scale factor if the callee has PGO. In general the jit is not very smart yet about mixing PGO and non-PGO. |
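The scaling idea can be illustrated with a small sketch. It assumes the scale factor is the call-site weight divided by the inlinee's entry count, which is a simplification and not necessarily what the JIT does internally. The helper name and the numbers are made up for illustration.

```csharp
using System;

public static class InlineeProfileScaling
{
    // Hypothetical helper: rescales an inlinee's raw block counts so they are
    // expressed relative to the weight of the call site in the caller.
    public static double[] Scale(double[] inlineeBlockCounts, double inlineeEntryCount, double callSiteWeight)
    {
        // Without a usable entry count there is nothing to scale by.
        if (inlineeEntryCount <= 0)
        {
            return (double[])inlineeBlockCounts.Clone();
        }

        double scale = callSiteWeight / inlineeEntryCount;
        var scaled = new double[inlineeBlockCounts.Length];
        for (int i = 0; i < scaled.Length; i++)
        {
            scaled[i] = inlineeBlockCounts[i] * scale;
        }
        return scaled;
    }
}
```

With an inlinee entered 1000 times whose blocks ran 1000, 999 and 1 times, and a caller without PGO whose call site has unit weight, `Scale(new double[] { 1000, 999, 1 }, 1000, 1.0)` yields weights of roughly 1, 0.999 and 0.001. The rarely executed block keeps its near-zero weight instead of being mixed unscaled into the caller's unit weights.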
Would this currently be an issue if the caller has a loop, as it wouldn't do a retiering Jit? |
With the advent of #49793 many framework methods will come equipped with profile data. And with the advent of #47558 the runtime will always tell the jit to look for profile data for optimized methods. Thus for any method that inlines |
Thread.InitializeCurrentThread is only called once for the Thread lifetime, however the method it is used in, CurrentThread, is called lots and every async method has its own call to it (to get ExecutionContext etc).
runtime/src/libraries/System.Private.CoreLib/src/System/Threading/Thread.cs
Lines 328 to 335 in bdd6d35
Currently the asm generated is a conditional jmp forward for it already being set, e.g. from AsyncMethodBuilderCore:Start(byref), which is bad for static branch prediction.
Reversing the condition in CurrentThread
While it generates a conditional jmp forward for InitializeCurrentThread it also then adds a jmp for the regular path:
Marking it as a cold method should resolve this by moving the call to InitializeCurrentThread to the end of the method so the regular path didn't have to jump over it?
/cc @EgorBo
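The pattern under discussion can be sketched as follows. This is a hypothetical stand-in, not the runtime's actual source: the `Worker` type, its members, and the `InitializeCount` counter are invented for illustration. It shows a thread-static value read on a hot path, with a rarely executed initializer kept in a separate, non-inlined method.

```csharp
using System;
using System.Runtime.CompilerServices;
using System.Threading;

// Hypothetical stand-in for the CurrentThread / InitializeCurrentThread shape.
public sealed class Worker
{
    [ThreadStatic]
    private static Worker t_current;    // null until first use on each thread

    public static int InitializeCount;  // counts slow-path runs, for illustration only

    // Hot path: one null check. The slow path is a call to a separate method.
    public static Worker Current => t_current ?? InitializeCurrent();

    // Slow path: runs once per thread. NoInlining keeps its body out of callers;
    // it does not by itself tell the JIT that the call is cold.
    [MethodImpl(MethodImplOptions.NoInlining)]
    private static Worker InitializeCurrent()
    {
        Interlocked.Increment(ref InitializeCount);
        return t_current = new Worker();
    }
}
```

Whether the call to the initializer ends up at the end of the generated method, so that the common case falls through, depends on the JIT version and on whether profile data is available.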
category:performance
theme:block-layout