Skip to content
This repository has been archived by the owner on Jan 23, 2023. It is now read-only.

Fast-path GetStateMachineBox #21875

Closed

Conversation

benaadams
Copy link
Member

@benaadams benaadams commented Jan 8, 2019

Split CreateStateMachineBox out from GetStateMachineBox for a fast-path in common scenario.

At Jit time GetStateMachineBox for a pre-"boxed" statemachine is generally considered a profitable inline (without forcing with AggressiveInlining), adding only 79 bytes of asm to XxOnCompleted; whereas with the creation in it is not (417 bytes of asm).

At crossgen it it blocked due to [FAILED: runtime dictionary lookup] but that should be cleaned up by Tier1 Jit.

The GetStateMachineBox portion is the always called section for AwaitUnsafeOnCompleted

As per the comments the Get not Create path is the more common one (here's with Get inlined):

image

/cc @stephentoub

@benaadams
Copy link
Member Author

Pre

; Assembly listing for method AsyncTaskMethodBuilder`1:AwaitUnsafeOnCompleted(byref,byref):this
; Lcl frame size = 48

G_M62408_IG01:
       56                   push     rsi
       4883EC30             sub      rsp, 48
       4889542428           mov      qword ptr [rsp+28H], rdx
       498BF0               mov      rsi, r8

G_M62408_IG02:
       498BD1               mov      rdx, r9
       E8C3F2FFFF           call     AsyncTaskMethodBuilder`1:GetStateMachineBox(byref):ref:this
       488BD0               mov      rdx, rax
       480FBE0E             movsx    rcx, byte  ptr [rsi]
       488B0E               mov      rcx, gword ptr [rsi]
       41B801000000         mov      r8d, 1
       E8968B9D5D           call     TaskAwaiter:UnsafeOnCompletedInternal(ref,ref,bool)
       90                   nop      

G_M62408_IG03:
       4883C430             add      rsp, 48
       5E                   pop      rsi
       C3                   ret      

; Total bytes of code 49, prolog size 10 for method AsyncTaskMethodBuilder`1:AwaitUnsafeOnCompleted(byref,byref):this

Post

; Assembly listing for method AsyncTaskMethodBuilder`1:AwaitUnsafeOnCompleted(byref,byref):this
; Lcl frame size = 48

G_M62419_IG01:
       4156                 push     r14
       57                   push     rdi
       56                   push     rsi
       55                   push     rbp
       53                   push     rbx
       4883EC30             sub      rsp, 48
       4889542428           mov      qword ptr [rsp+28H], rdx
       488BF9               mov      rdi, rcx
       498BF0               mov      rsi, r8
       498BD9               mov      rbx, r9

G_M62419_IG02:
       E8C37E8A5B           call     ExecutionContext:Capture():ref
       488BE8               mov      rbp, rax
       488B17               mov      rdx, gword ptr [rdi]
       48B9906AF140FD7F0000 mov      rcx, 0x7FFD40F16A90
       E88E1FB75F           call     CORINFO_HELP_ISINSTANCEOFCLASS
       4C8BF0               mov      r14, rax
       4D85F6               test     r14, r14
       7414                 je       SHORT G_M62419_IG04
       49396E48             cmp      gword ptr [r14+72], rbp
       740C                 je       SHORT G_M62419_IG03
       498D4E48             lea      rcx, bword ptr [r14+72]
       488BD5               mov      rdx, rbp
       E82422B75F           call     CORINFO_HELP_ASSIGN_REF

G_M62419_IG03:
       EB11                 jmp      SHORT G_M62419_IG05

G_M62419_IG04:
       488BCF               mov      rcx, rdi
       488BD3               mov      rdx, rbx
       4C8BC5               mov      r8, rbp
       E8ACF4FFFF           call     AsyncTaskMethodBuilder`1:CreateStateMachineBox(byref,ref):ref:this
       4C8BF0               mov      r14, rax

G_M62419_IG05:
       498BD6               mov      rdx, r14
       480FBE0E             movsx    rcx, byte  ptr [rsi]
       488B0E               mov      rcx, gword ptr [rsi]
       41B801000000         mov      r8d, 1
       E8BCAE915B           call     TaskAwaiter:UnsafeOnCompletedInternal(ref,ref,bool)
       90                   nop      

G_M62419_IG06:
       4883C430             add      rsp, 48
       5B                   pop      rbx
       5D                   pop      rbp
       5E                   pop      rsi
       5F                   pop      rdi
       415E                 pop      r14
       C3                   ret      

; Total bytes of code 128, prolog size 15 for method AsyncTaskMethodBuilder`1:AwaitUnsafeOnCompleted(byref,byref):this

@benaadams
Copy link
Member Author

benaadams commented Jan 8, 2019

As per the comments the Get not Create path is the more common one (here's with Get inlined):

image

@benaadams
Copy link
Member Author

@dotnet-bot test OSX10.12 x64 Checked Innerloop Build and Test
@dotnet-bot test Ubuntu arm Cross Checked Innerloop Build and Test

@stephentoub
Copy link
Member

As per the comments the Get not Create path is the more common one (here's with Get inlined)

It depends on the workload. Create needs to be called for every async method that yields at least once, so if one yields exactly once (which is quite common in some workloads), this won't help and could actually make things worse, especially if the get method doesn't get inlined.

There's also been some concern expressed about the size overheads added in AOT builds by the use of generics in these builders, and it would seem the introduction of another generic method will only make that worse. @jkotas?

You're focused on call counts. Does this have any measurable throughput impact at all?

@benaadams
Copy link
Member Author

Create needs to be called for every async method that yields at least once, so if one yields exactly once (which is quite common in some workloads), this won't help and could actually make things worse,

The call costs would still be there regardless (when the method is combined as it doesn't inline); so the additional cost is the preamble in CreateStateMachineBox as that's also included in AwaitUnsafeOnCompleted with the inline:

G_M11442_IG01:
       4156                 push     r14
       57                   push     rdi
       56                   push     rsi
       55                   push     rbp
       53                   push     rbx
       4883EC40             sub      rsp, 64
       488BD9               mov      rbx, rcx
       488BF2               mov      rsi, rdx
       498BE8               mov      rbp, r8
;
G_M11445_IG12:
       4883C440             add      rsp, 64
       5B                   pop      rbx
       5D                   pop      rbp
       5E                   pop      rsi
       5F                   pop      rdi
       415E                 pop      r14
       C3                   ret      

especially if the get method doesn't get inlined.

With Tiered Jitting that should solve the issue for crossgen, but as you point out it may be an issue for AoT (depending how good it is at resolving runtime handles at they would be fixed?)

Jit diffs aren't huge

Total bytes of diff: 1557 (0.04% of base)
    diff is a regression.

Total byte diff includes 9887 bytes from reconciling methods
        Base had    0 unique methods,        0 unique bytes
        Diff had    1 unique methods,     9887 unique bytes

Top file regressions by size (bytes):
        1557 : System.Private.CoreLib.dasm (0.04% of base)

1 total files with size differences (0 improved, 1 regressed), 0 unchanged.

Top method regessions by size (bytes):
        9887 : System.Private.CoreLib.dasm - AsyncTaskMethodBuilder`1:CreateStateMachineBox(byref,ref):ref:this (0/26 methods)
        1948 : System.Private.CoreLib.dasm - AsyncTaskMethodBuilder`1:AwaitUnsafeOnCompleted(byref,byref):this (32 methods)

Top method improvements by size (bytes):
      -10278 : System.Private.CoreLib.dasm - AsyncTaskMethodBuilder`1:GetStateMachineBox(byref):ref:this (26/8 methods)

3 total methods with size differences (1 improved, 2 regressed), 17553 unchanged.

@benaadams
Copy link
Member Author

benaadams commented Jan 8, 2019

Interestingly adding the AggressiveInlining does cause it to get inlined in crossgen; so its just the additional push/pops in the CreateStateMachineBox path (and GetStateMachineBox doesn't get output)

@benaadams
Copy link
Member Author

@dotnet-bot test Ubuntu arm Cross Checked Innerloop Build and Test

@benaadams
Copy link
Member Author

Does this have any measurable throughput impact at all?

Having issues with dev machine atm, will get back to you

@jkotas
Copy link
Member

jkotas commented Jan 8, 2019

At crossgen it it blocked due to [FAILED: runtime dictionary lookup]

I think you are looking at different instantiation in crossgen than what you are looking with JIT. This same inlining failure is going to happen with JIT when TResult is reference type.

the size overheads added in AOT builds by the use of generics in these builders, and it would seem the introduction of another generic method will only make that worse. @jkotas?

Correct. Note that the code size bloat matters for JIT too, it is just harder to measure the impact. The code expanding optimizations are great for micro-benchmarks, but they can easily hurt e.g. non-trivial real-world workloads that run mix of hundred different async methods.

@benaadams
Copy link
Member Author

Its a modest ~55 byte increase per method; to use a 128 byte method vs 466 byte method combination (49+417) when the async path resumes 2+ times; which should give it more chance to be instruction cache resident (inlining aside).

Looped async resumption is also a pattern that is likely to be more common with the introduction of async enumerators/streams.

Saying that... I don't feel too strongly about this change, CreateStateMachineBox is called at least once; its a very mild improvement on async resumption so that's going to dominate more and something like dotnet/csharplang#1407 would be more powerful

@benaadams
Copy link
Member Author

Ubuntu arm https://github.com/dotnet/coreclr/issues/17129

GC_Scenarios._Dynamo_dynamo_dynamo_._Dynamo_dynamo_dynamo_sh [FAIL] - Check Test Results for specific test information and call stack.

@benaadams
Copy link
Member Author

Might be able to do something for bloat

@stephentoub
Copy link
Member

Its a modest ~55 byte increase per method

That's not including all of the metadata/runtime data structures/etc. for the additional generic method, though, right?

@benaadams
Copy link
Member Author

That's not including all of the metadata/runtime data structures/etc. for the additional generic method, though, right?

No, so this probably wouldn't help? #21908

@benaadams
Copy link
Member Author

Again, this is probably more interesting for now #21909 can revisit this later

@benaadams benaadams closed this Jan 10, 2019
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants