Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Speed up interface checking and casting #49257

Merged
merged 5 commits into from
Mar 7, 2021

Conversation

benaadams
Copy link
Member

@benaadams benaadams commented Mar 6, 2021

16% - 38% faster* for IsInstanceOfInterface & ChkCastInterface

  • Less arguments in addressing modes [rdx + 8] vs [rdx + r8*8 + 8]; for lower latency and greater CPU port availability for more instruction parallelism (*see "Intel Optimization Reference Manual" sections "3.5.1.2 Using LEA" and "3.5.1.6 Address Calculations")
    • 2 operand addressing has latency of 1 clock and 2 ports available so a reciprocal throughput of 0.5 clocks
    • 3 operand addressing has latency of 3 clocks and 1 port available so a reciprocal throughput of 1 clock
  • Reduce branches, conditional jumps reduced from 8 to 5
  • 4 decs exchanged for 1 add with less data depenency

*Results #49257 (comment)

/cc @VSadov

-       xor      r8d, r8d
+       cmp      rcx, 4
+       jge      SHORT G_M5541_IG06                     >---\
        align    [0 bytes]                                  |
-                       ;; bbWeight=0.50 PerfScore 3.50     |
+                       ;; bbWeight=0.50 PerfScore 4.00     |
                                                            |
-G_M57878_IG04:                             <---\           |
-       cmp      qword ptr [rdx+8*r8], rsi      |           |
-       je       SHORT G_M57878_IG06            |           |
-       dec      rcx                            |           |
-       je       SHORT G_M57878_IG05            |           |
-       cmp      qword ptr [rdx+8*r8+8], rsi    |           |
-       je       SHORT G_M57878_IG06            |           |
-       dec      rcx                            |           |
-       je       SHORT G_M57878_IG05            |           |
-       cmp      qword ptr [rdx+8*r8+16], rsi   |           |
-       je       SHORT G_M57878_IG06            |           |
-       dec      rcx                            |           |
-       je       SHORT G_M57878_IG05            |           |
-       cmp      qword ptr [rdx+8*r8+24], rsi   |           |
-       je       SHORT G_M57878_IG06            |           |
-       je       SHORT G_M57878_IG05            |           |
-       add      r8, 4                          |           |
-       jmp      SHORT G_M57878_IG04        >---/           |
-                       ;; bbWeight=4    PerfScore 77.00    |
+G_M5541_IG04:                              <---\           |
+       lea      r8, [rdx+32]                   |           |
+       cmp      qword ptr [rdx], rsi           |           |
+       je       SHORT G_M5541_IG08             |           |
+       cmp      qword ptr [rdx+8], rsi         |           |
+       je       SHORT G_M5541_IG08             |           |
+       cmp      qword ptr [rdx+16], rsi        |           |
+       je       SHORT G_M5541_IG08             |           |
+       cmp      qword ptr [rdx+24], rsi        |           |
+       je       SHORT G_M5541_IG08             |           |
+       mov      rdx, r8                        |           |
+       add      rcx, -4                        |           |
+       cmp      rcx, 4                         |           |
+       jge      SHORT G_M5541_IG04         >---/           |
+                       ;; bbWeight=4    PerfScore 57.00    |
+                                                           |
+G_M5541_IG05:                                              |
+       test     rcx, rcx                                   |
+       je       SHORT G_M5541_IG07                         |
+                       ;; bbWeight=0.50 PerfScore 0.62     |
+                                                           |
+G_M5541_IG06:                              <---\       <---/
+       lea      r8, [rdx+8]                    |
+       cmp      qword ptr [rdx], rsi           |
+       je       SHORT G_M5541_IG08             |
+       mov      rdx, r8                        |
        dec      rcx                            |
+       test     rcx, rcx                       |
+       jg       SHORT G_M5541_IG06         >---/
+                       ;; bbWeight=4    PerfScore 21.00

@EgorBo
Copy link
Member

EgorBo commented Mar 6, 2021

@AndyAyersMS btw, the CastHelpers (also SpanHelpers) are heavily optimized by hands for general cases - I wonder if we should disable PGO for them or even mark as AggressiveOptimization - otherwise we can re-order them because of some loop during start up and mess up other cases.

@benaadams
Copy link
Member Author

Updated summary

@EgorBo
Copy link
Member

EgorBo commented Mar 6, 2021

Can you run a benchmark, e.g. something like this (for IsInstanceOfClass ): https://gist.github.com/EgorBo/8766a778fa2522fad4e916c567a2b5d7

@benaadams
Copy link
Member Author

Can you run a benchmark

Have one but its been running for hours (too many variants); will have to try something shorter

@benaadams
Copy link
Member Author

benaadams commented Mar 7, 2021

Can you run a benchmark

Looks good after fixing the condition @jkotas pointed out #49257 (comment)

Interface cast benchmark gist

Method Implementation Mean branch Ratio
IsInterface1 Interfaces1 2.166 ns current 1.000
IsInterface1 Interfaces1 1.629 ns PR 0.752
IsNotImplemented Interfaces1 2.166 ns current 1.000
IsNotImplemented Interfaces1 1.950 ns PR 0.900
IsInterface1 Interfaces6 2.155 ns current 1.000
IsInterface1 Interfaces6 1.936 ns PR 0.898
IsInterface2 Interfaces6 2.161 ns current 1.000
IsInterface2 Interfaces6 1.908 ns PR 0.883
IsInterface3 Interfaces6 2.188 ns current 1.000
IsInterface3 Interfaces6 1.672 ns PR 0.764
IsInterface4 Interfaces6 2.346 ns current 1.000
IsInterface4 Interfaces6 1.965 ns PR 0.838
IsInterface5 Interfaces6 3.122 ns current 1.000
IsInterface5 Interfaces6 2.173 ns PR 0.696
IsInterface6 Interfaces6 3.472 ns current 1.000
IsInterface6 Interfaces6 2.687 ns PR 0.774
IsNotImplemented Interfaces6 3.644 ns current 1.000
IsNotImplemented Interfaces6 3.071 ns PR 0.843
IsInterface1 Interfaces16 2.154 ns current 1.000
IsInterface1 Interfaces16 1.958 ns PR 0.909
IsInterface2 Interfaces16 1.938 ns current 1.000
IsInterface2 Interfaces16 1.950 ns PR 1.006
IsInterface3 Interfaces16 2.185 ns current 1.000
IsInterface3 Interfaces16 1.661 ns PR 0.760
IsInterface4 Interfaces16 2.326 ns current 1.000
IsInterface4 Interfaces16 1.655 ns PR 0.712
IsInterface5 Interfaces16 3.201 ns current 1.000
IsInterface5 Interfaces16 2.349 ns PR 0.734
IsInterface6 Interfaces16 3.186 ns current 1.000
IsInterface6 Interfaces16 2.376 ns PR 0.746
IsInterface7 Interfaces16 3.360 ns current 1.000
IsInterface7 Interfaces16 2.578 ns PR 0.767
IsInterface8 Interfaces16 3.764 ns current 1.000
IsInterface8 Interfaces16 2.621 ns PR 0.696
IsInterface9 Interfaces16 4.291 ns current 1.000
IsInterface9 Interfaces16 3.091 ns PR 0.720
IsInterface10 Interfaces16 4.866 ns current 1.000
IsInterface10 Interfaces16 3.021 ns PR 0.621
IsInterface11 Interfaces16 4.600 ns current 1.000
IsInterface11 Interfaces16 3.439 ns PR 0.748
IsInterface12 Interfaces16 4.908 ns current 1.000
IsInterface12 Interfaces16 3.376 ns PR 0.688
IsInterface13 Interfaces16 5.862 ns current 1.000
IsInterface13 Interfaces16 3.720 ns PR 0.635
IsInterface14 Interfaces16 6.111 ns current 1.000
IsInterface14 Interfaces16 3.682 ns PR 0.603
IsInterface15 Interfaces16 5.902 ns current 1.000
IsInterface15 Interfaces16 3.955 ns PR 0.670
IsInterface16 Interfaces16 6.406 ns current 1.000
IsInterface16 Interfaces16 3.947 ns PR 0.616
IsNotImplemented Interfaces16 6.689 ns current 1.000
IsNotImplemented Interfaces16 5.412 ns PR 0.809

@benaadams benaadams marked this pull request as ready for review March 7, 2021 05:26
@benaadams benaadams changed the title Reduce branches in IsInstanceOfInterface/ChkCastInterface Speed up interface checking and casting Mar 7, 2021
@benaadams benaadams requested review from jkotas and EgorBo March 7, 2021 06:28
@jkotas jkotas requested a review from VSadov March 7, 2021 15:50
Copy link
Member

@jkotas jkotas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great to me. @VSadov Could you please take a look as well?

@MaherJendoubi
Copy link
Contributor

MaherJendoubi commented Mar 7, 2021

You can delete unnecessary usings in the System.Runtime.CompilerServices.CastHelpers class :
using System.Runtime.Intrinsics; using System.Runtime.Intrinsics.X86;

@benaadams
Copy link
Member Author

You can delete unnecessary usings ...

Done

if (interfaceMap[i + 3] == toTypeHnd)
// If not enough for unrolled, jmp straight to small loop
// as we already know there is one or more interfaces so don't need to check again.
goto few;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If interface table is allocated as 4 elements aligned and zero-filled, we could remove the "few" case entirely. At least in Chk case which is expected to succeed.
That would come with size increase though, so does not seem worth it.

Copy link
Member

@VSadov VSadov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, Nice!!!

@VSadov VSadov merged commit bea0abb into dotnet:main Mar 7, 2021
@benaadams benaadams deleted the IsInstanceOfInterface branch March 7, 2021 19:54
@AndyAyersMS
Copy link
Member

the CastHelpers (also SpanHelpers) are heavily optimized by hands for general cases - I wonder if we should disable PGO for them or even mark as AggressiveOptimization

These methods get prejitted, right? Once we have PGO data for prejitted methods we should look and see if we're happy with the codegen we get.

@VSadov
Copy link
Member

VSadov commented Mar 8, 2021

@AndyAyersMS - Yes, these methods are R2R (so no jit at startup) and may rejit later.

They are not AggressiveOptimization. We know that will make them faster by avoiding tiering indirection, but we will have to jit at startup and we want to avoid that.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants