Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Jit] Loop condition doesn't elide bounds check #9422

Closed
benaadams opened this issue Dec 11, 2017 · 12 comments · Fixed by #100777
Closed

[Jit] Loop condition doesn't elide bounds check #9422

benaadams opened this issue Dec 11, 2017 · 12 comments · Fixed by #100777
Assignees
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI enhancement Product code improvement that does NOT require public API changes/additions in-pr There is an active PR which will close this issue when it is merged JitUntriaged CLR JIT issues needing additional triage optimization tenet-performance Performance related issue
Milestone

Comments

@benaadams
Copy link
Member

As a condition of a while it doesn't elide the check

while ((uint)i >= (uint)entries.Length)
{    
    // not elided
    if (entries[i].hashCode == hashCode && comparer.Equals(key, entries[i].key))
    {

Changing it to an if + break does elide the check

do
{
    if ((uint)i >= (uint)entries.Length)
    {
        break;
    }
    
    // elided
    if (entries[i].hashCode == hashCode && comparer.Equals(key, entries[i].key))
    {

category:cq
theme:bounds-checks
skill-level:expert
cost:small

@mikedn
Copy link
Contributor

mikedn commented Dec 12, 2017

I looked at this in the past. The reason why this doesn't work in the while loop is that the loop is turned into a do/while loop with the loop condition being duplicated. Something like:

for (uint i = head; i < (uint)entries.Length; i = entries[i].next)
    s += entries[i].value;

becomes

if (i < (uint)entries.Length) {
    do {
        s += entries[i].value;
        i = entries[i].next;
    } while (i < (uint)entries.Length));
}

Each if creates the expected OAK_NO_THROW assertion:

N009 (  8,  8) [000073] -A-X--------              *  JTRUE     void  
In BB01 New Global ArrBnds  Assertion: (512, -1) ($200,$ffffffff) [idx: {InitVal($c1)};len: {ARR_LENGTH($100)}] in range  index=#02, mask=0000000000000002
N004 (  5,  5) [000015] ------------              *  JTRUE     void  
In BB02 New Global ArrBnds  Assertion: (523, -1) ($20b,$ffffffff) [idx: {402};len: {ARR_LENGTH($100)}] in range  index=#04, mask=0000000000000008

But then the i in entries[i] is not the i from assertions, it is PhiDef($3, $4, $202). And assertion propagation doesn't know how to propagate OAK_NO_THROW through PHIs. Maybe it should but doing that seems expensive.

It's probably RangeCheck that should pay attention to these assertions and generate the appropriate ranges. This is something I hope I'll look at in the future.

@glenn-slayden
Copy link

@mikedn wrote:
Maybe it should, but doing that seems expensive.

Could you clarify: time-expensive for JITting, or time- (and/or) space- (and/or) cache-miss (and/or) predictivity-expensive of the emitted ASM itself? [note 1]

If it's not the former, and considering the lack of elision is specifically related to loop code, wouldn't the gain (or loss, see next paragraph) in runtime performance here almost always easily swamp the one-time JIT hit, making this a higher priority fix?

But also more generally, isn't the elision--or redundant presence--at runtime, of any (public) memory fetch or store in fact a correctness bug, considering the strong concurrency guarantees of the .NET memory model? Of course I'm referring to where the checked condition depends on volatile value(s) visible elsewhere. Perhaps this issue here somehow inherently confined to checks of local variables only... which would still need to be true even now given ref-locals?

Pardon my lack of understanding on these matters.


Notes:
1. I assume you don't mean time expensive for csc.

@benaadams
Copy link
Member Author

I'm referring to where the checked condition depends on volatile value(s)

Explicit volatiles or Volatile.Read+Volatile.Write are treated differently; always forcing a memory access.

@mikedn
Copy link
Contributor

mikedn commented Oct 18, 2018

Could you clarify: time-expensive for JITting, or time- (and/or) space- (and/or) cache-miss (and/or) predictivity-expensive of the emitted ASM itself? [note 1]

Hmm, I don't remember the details. I think it seemed to me at the time that doing this might require multiple scans of the entire assertion table, possible leading to quadratic complexity.

The bigger problem is that the existing assertion propagation design doesn't quite scale. It does a ton of upfront work to generate these assertions but it doesn't know if they are actually needed. Worse, there's a limit to the number of assertions it tracks (for perf reasons) and generating useless assertions can mean not having room for useful ones. And on top of it, assertion propagation is already one the expensive JIT phases.

If it's not the former, and considering the lack of elision is specifically related to loop code, wouldn't the gain (or loss, see next paragraph) in runtime performance here almost always easily swamp the one-time JIT hit, making this a higher priority fix?

Could be. And that's typically true about a lot of optimizations that the JIT could do. But there are other potentially better approaches to this problem and AFAIR the perf impact of this specific range check wasn't very big. So it doesn't seem worthwhile to jump at fixing this issue using the first idea that crosses one's mind.

And as far as I'm concerned, I slowly arrived at the conclusion that it would better to put effort into improving certain other parts of the JIT (e.g. some IR design issues inherited from the legacy JIT32 that impact SSA graph traversal performance) before attempting to improve array range check elimination. That, of course, it solely my personal opinion. The JIT team may very well think otherwise.

But also more generally, isn't the elision--or redundant presence--at runtime, of any (public) memory fetch or store in fact a correctness bug, considering the strong concurrency guarantees of the .NET memory model? Of course I'm referring to where the checked condition depends on volatile value(s) visible elsewhere. Perhaps this issue here somehow inherently confined to checks of local variables only... which would still need to be true even now given ref-locals?

I'm not quite sure I understand what specific concerns you have around the memory model and range check elimination. I'll try to point out some thing that perhaps will clarify this:

  • The memory model does not prevent load elimination or load reordering (with respect to other non volatile loads). So there's no problem eliminating or moving the array length load that the range check normally performs.
  • Volatile loads generally block optimizations (unless there are bugs in the JIT)
  • .NET arrays are fixed size, their Length can't change. The JIT assumes that. Of course, you could change the Length by using unsafe code. That would be completely bogus and the JIT doesn't care about such scenarios.
  • Range check elimination requires the JIT to be able to reason about both the index and the array reference. That generally means that they must be local variables so other threads cannot change them. It is possible for other JIT optimizations (e.g. CSE) to transform class field accesses into local variable accesses by caching field values into local variables but that doesn't work very well currently.

@AndyAyersMS
Copy link
Member

There is some data on the relative usefulness of assertions over in dotnet/coreclr#18678. Roughly speaking only about 10% of the generated assertions are used.

@glenn-slayden
Copy link

@mikedn @AndyAyersMS @benaadams Thanks for the helpful insights.

@msftgits msftgits transferred this issue from dotnet/coreclr Jan 31, 2020
@msftgits msftgits added this to the Future milestone Jan 31, 2020
@BruceForstall BruceForstall added the JitUntriaged CLR JIT issues needing additional triage label Oct 28, 2020
@gfoidl
Copy link
Member

gfoidl commented Mar 31, 2022

While working on #67044 this seems to be half-way resolved.
The while doesn't need to be workarounded by do + if + break now.

Doesn't elide the bounds-check

private int While(int[] entries, int key)
{
    int i = 0;

    while ((uint)i < entries.Length)    // it's proven that i is in [0, entries.Length)
    {
        if (entries[i] == key)          // bounds-check introduced
        {
            break;
        }

        i++;
    }

    return i;
}
private int Do(int[] entries, int key)
{
    int i = 0;

    do
    {
        if ((uint)i >= entries.Length)  // it's proven that i is not in [0, entries.Length)
        {
            break;
        }

        if (entries[i] == key)          // bounds-check introduced
        {
            break;
        }

        i++;
    } while (true);

    return i;
}

Does elide the bounds-check

...when the length is casted to uint ((uint)entries.Length), the bounds-checks are elided.


FYI @EgorBo

@jakobbotsch
Copy link
Member

This looks fixed on main, this is the codegen I get for:

private struct S
{
    public int hashCode;
    public int key;
}

static int Foo(S[] entries, int key, int hashCode, EqualityComparer<int> comparer)
{
    int i = 0;
    while ((uint)i < (uint)entries.Length)
    {
        if (entries[i].hashCode == hashCode && comparer.Equals(key, entries[i].key))
        {
            return i;
        }
    }

    return -1;
}
; Assembly listing for method MustHaveValueBenchmark.C:Foo(MustHaveValueBenchmark.C+S[],int,int,System.Collections.Generic.EqualityComparer`1[int]):int
; Emitting BLENDED_CODE for X64 CPU with AVX - Windows
; optimized code
; rsp based frame
; fully interruptible
; No PGO data
; Final local variable assignments
;
;  V00 arg0         [V00,T00] (  5,  9   )     ref  ->  rsi         class-hnd single-def
;  V01 arg1         [V01,T03] (  3,  4   )     int  ->  rbp         single-def
;  V02 arg2         [V02,T02] (  3,  6   )     int  ->  rbx         single-def
;  V03 arg3         [V03,T01] (  4,  6   )     ref  ->  rdi         class-hnd single-def
;* V04 loc0         [V04,T05] (  0,  0   )     int  ->  zero-ref    single-def
;  V05 OutArgs      [V05    ] (  1,  1   )  lclBlk (32) [rsp+00H]   "OutgoingArgSpace"
;  V06 cse0         [V06,T04] (  2,  2   )     int  ->  rax         "CSE - aggressive"
;
; Lcl frame size = 40

G_M37434_IG01:
       push     rdi
       push     rsi
       push     rbp
       push     rbx
       sub      rsp, 40
       mov      rsi, rcx
       mov      ebp, edx
       mov      ebx, r8d
       mov      rdi, r9
						;; size=19 bbWeight=1    PerfScore 5.25
G_M37434_IG02:
       mov      eax, dword ptr [rsi+08H]
       test     eax, eax
       je       SHORT G_M37434_IG07
						;; size=7 bbWeight=1    PerfScore 3.25
G_M37434_IG03:
       cmp      dword ptr [rsi+10H], ebx
       jne      SHORT G_M37434_IG03
						;; size=5 bbWeight=4    PerfScore 16.00
G_M37434_IG04:
       mov      r8d, dword ptr [rsi+14H]
       mov      rcx, rdi
       mov      edx, ebp
       mov      rax, qword ptr [rdi]
       mov      rax, qword ptr [rax+48H]
       call     [rax+20H]System.Collections.Generic.EqualityComparer`1[int]:Equals(int,int):bool:this
       test     eax, eax
       je       SHORT G_M37434_IG03
						;; size=23 bbWeight=2    PerfScore 21.50
G_M37434_IG05:
       xor      eax, eax
						;; size=2 bbWeight=0.50 PerfScore 0.12
G_M37434_IG06:
       add      rsp, 40
       pop      rbx
       pop      rbp
       pop      rsi
       pop      rdi
       ret      
						;; size=9 bbWeight=0.50 PerfScore 1.62
G_M37434_IG07:
       mov      eax, -1
						;; size=5 bbWeight=0.50 PerfScore 0.12
G_M37434_IG08:
       add      rsp, 40
       pop      rbx
       pop      rbp
       pop      rsi
       pop      rdi
       ret      
						;; size=9 bbWeight=0.50 PerfScore 1.62

; Total bytes of code 79, prolog size 19, PerfScore 57.40, instruction count 36, allocated bytes for code 79 (MethodHash=a11e6dc5) for method MustHaveValueBenchmark.C:Foo(MustHaveValueBenchmark.C+S[],int,int,System.Collections.Generic.EqualityComparer`1[int]):int
; ============================================================

And for the cases @gfoidl posted above:

; Assembly listing for method MustHaveValueBenchmark.C:While(int[],int):int:this
; Emitting BLENDED_CODE for X64 CPU with AVX - Windows
; optimized code
; rsp based frame
; fully interruptible
; No PGO data
; Final local variable assignments
;
;* V00 this         [V00    ] (  0,  0   )     ref  ->  zero-ref    this class-hnd single-def
;  V01 arg1         [V01,T01] (  4,  7   )     ref  ->  rdx         class-hnd single-def
;  V02 arg2         [V02,T02] (  3,  6   )     int  ->   r8         single-def
;  V03 loc0         [V03,T00] (  6, 18   )     int  ->  rax        
;# V04 OutArgs      [V04    ] (  1,  1   )  lclBlk ( 0) [rsp+00H]   "OutgoingArgSpace"
;  V05 cse0         [V05,T03] (  3,  6   )     int  ->  rcx         "CSE - aggressive"
;
; Lcl frame size = 0

G_M53493_IG01:
						;; size=0 bbWeight=1    PerfScore 0.00
G_M53493_IG02:
       xor      eax, eax
       mov      ecx, dword ptr [rdx+08H]
       test     ecx, ecx
       je       SHORT G_M53493_IG04
       align    [0 bytes for IG03]
						;; size=9 bbWeight=1    PerfScore 3.50
G_M53493_IG03:
       mov      r9d, eax
       cmp      dword ptr [rdx+4*r9+10H], r8d
       je       SHORT G_M53493_IG04
       inc      eax
       cmp      ecx, eax
       ja       SHORT G_M53493_IG03
						;; size=16 bbWeight=4    PerfScore 23.00
G_M53493_IG04:
       ret      
						;; size=1 bbWeight=1    PerfScore 1.00

; Total bytes of code 26, prolog size 0, PerfScore 30.10, instruction count 12, allocated bytes for code 26 (MethodHash=71402f0a) for method MustHaveValueBenchmark.C:While(int[],int):int:this
; ============================================================
; Assembly listing for method MustHaveValueBenchmark.C:Do(int[],int):int:this
; Emitting BLENDED_CODE for X64 CPU with AVX - Windows
; optimized code
; rsp based frame
; fully interruptible
; No PGO data
; Final local variable assignments
;
;* V00 this         [V00    ] (  0,  0   )     ref  ->  zero-ref    this class-hnd single-def
;  V01 arg1         [V01,T01] (  4,  7   )     ref  ->  rdx         class-hnd single-def
;  V02 arg2         [V02,T03] (  3,  6   )     int  ->   r8         single-def
;  V03 loc0         [V03,T00] (  6, 22   )     int  ->  rax        
;# V04 OutArgs      [V04    ] (  1,  1   )  lclBlk ( 0) [rsp+00H]   "OutgoingArgSpace"
;  V05 cse0         [V05,T02] (  2,  9   )     int  ->  rcx         "CSE - aggressive"
;
; Lcl frame size = 0

G_M50657_IG01:
						;; size=0 bbWeight=1    PerfScore 0.00
G_M50657_IG02:
       xor      eax, eax
       mov      ecx, dword ptr [rdx+08H]
       align    [0 bytes for IG03]
						;; size=5 bbWeight=1    PerfScore 2.25
G_M50657_IG03:
       cmp      ecx, eax
       jbe      SHORT G_M50657_IG05
						;; size=4 bbWeight=8    PerfScore 10.00
G_M50657_IG04:
       mov      r9d, eax
       cmp      dword ptr [rdx+4*r9+10H], r8d
       je       SHORT G_M50657_IG05
       inc      eax
       jmp      SHORT G_M50657_IG03
						;; size=14 bbWeight=4    PerfScore 26.00
G_M50657_IG05:
       ret      
						;; size=1 bbWeight=1    PerfScore 1.00

; Total bytes of code 24, prolog size 0, PerfScore 41.65, instruction count 11, allocated bytes for code 24 (MethodHash=5cc93a1e) for method MustHaveValueBenchmark.C:Do(int[],int):int:this
; ============================================================

@ghost ghost locked as resolved and limited conversation to collaborators Jan 14, 2023
@stephentoub
Copy link
Member

This looks fixed on main, this is the codegen I get for

We have several places in Dictionary with TODOs for this still.

// Should be a while loop https://github.com/dotnet/runtime/issues/9422

// Should be a while loop https://github.com/dotnet/runtime/issues/9422

// Should be a while loop https://github.com/dotnet/runtime/issues/9422

// Should be a while loop https://github.com/dotnet/runtime/issues/9422

// Should be a while loop https://github.com/dotnet/runtime/issues/9422

// Should be a while loop https://github.com/dotnet/runtime/issues/9422

Can we revisit those and clean them up if this is fixed there or re-open this?

@jakobbotsch
Copy link
Member

Let's reopen this to track verifying that those examples are fixed.

@jakobbotsch jakobbotsch reopened this Jan 22, 2024
@jakobbotsch jakobbotsch modified the milestones: Future, 9.0.0 Jan 22, 2024
@jakobbotsch jakobbotsch self-assigned this Jan 22, 2024
@jakobbotsch
Copy link
Member

I do see the duplicated bounds checks in the shared generic versions of GetValueRefOrAddDefault. It seems to be related to loop inversion, the duplicated bounds check disappears if it is disabled.

@jakobbotsch
Copy link
Member

jakobbotsch commented Apr 8, 2024

Simplified repro for the Dictionary case:

[MethodImpl(MethodImplOptions.NoInlining)]
public static int Foo(int i, int[] indices)
{
    while ((uint)i < (uint)indices.Length)
    {
        i = indices[i];
    }

    return i;
}

generates

; Method Program:Foo(int,int[]):int (FullOpts)
G_M9328_IG01:  ;; offset=0x0000
       sub      rsp, 40
						;; size=4 bbWeight=1 PerfScore 0.25

G_M9328_IG02:  ;; offset=0x0004
       mov      eax, dword ptr [rdx+0x08]
       cmp      eax, ecx
       jbe      SHORT G_M9328_IG04
       align    [0 bytes for IG03]
						;; size=7 bbWeight=1 PerfScore 3.25

G_M9328_IG03:  ;; offset=0x000B
       cmp      ecx, eax
       jae      SHORT G_M9328_IG06
       mov      ecx, ecx
       mov      ecx, dword ptr [rdx+4*rcx+0x10]
       cmp      eax, ecx
       ja       SHORT G_M9328_IG03
						;; size=14 bbWeight=4 PerfScore 19.00

G_M9328_IG04:  ;; offset=0x0019
       mov      eax, ecx
						;; size=2 bbWeight=1 PerfScore 0.25

G_M9328_IG05:  ;; offset=0x001B
       add      rsp, 40
       ret      
						;; size=5 bbWeight=1 PerfScore 1.25

G_M9328_IG06:  ;; offset=0x0020
       call     CORINFO_HELP_RNGCHKFAIL
       int3     
						;; size=6 bbWeight=0 PerfScore 0.00
; Total bytes of code: 38    

jakobbotsch added a commit to jakobbotsch/runtime that referenced this issue Apr 8, 2024
@dotnet-policy-service dotnet-policy-service bot added the in-pr There is an active PR which will close this issue when it is merged label Apr 8, 2024
jakobbotsch added a commit that referenced this issue Apr 17, 2024
matouskozak pushed a commit to matouskozak/runtime that referenced this issue Apr 30, 2024
Ruihan-Yin pushed a commit to Ruihan-Yin/runtime that referenced this issue May 30, 2024
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI enhancement Product code improvement that does NOT require public API changes/additions in-pr There is an active PR which will close this issue when it is merged JitUntriaged CLR JIT issues needing additional triage optimization tenet-performance Performance related issue
Projects
None yet
9 participants