[Jit] Loop condition doesn't elide bounds check #9422

benaadams · 2017-12-11T22:53:13Z

As a condition of a while it doesn't elide the check

while ((uint)i >= (uint)entries.Length)
{    
    // not elided
    if (entries[i].hashCode == hashCode && comparer.Equals(key, entries[i].key))
    {

Changing it to an if + break does elide the check

do
{
    if ((uint)i >= (uint)entries.Length)
    {
        break;
    }
    
    // elided
    if (entries[i].hashCode == hashCode && comparer.Equals(key, entries[i].key))
    {

category:cq
theme:bounds-checks
skill-level:expert
cost:small

The text was updated successfully, but these errors were encountered:

mikedn · 2017-12-12T07:06:02Z

I looked at this in the past. The reason why this doesn't work in the while loop is that the loop is turned into a do/while loop with the loop condition being duplicated. Something like:

for (uint i = head; i < (uint)entries.Length; i = entries[i].next)
    s += entries[i].value;

becomes

if (i < (uint)entries.Length) {
    do {
        s += entries[i].value;
        i = entries[i].next;
    } while (i < (uint)entries.Length));
}

Each if creates the expected OAK_NO_THROW assertion:

N009 (  8,  8) [000073] -A-X--------              *  JTRUE     void  
In BB01 New Global ArrBnds  Assertion: (512, -1) ($200,$ffffffff) [idx: {InitVal($c1)};len: {ARR_LENGTH($100)}] in range  index=#02, mask=0000000000000002
N004 (  5,  5) [000015] ------------              *  JTRUE     void  
In BB02 New Global ArrBnds  Assertion: (523, -1) ($20b,$ffffffff) [idx: {402};len: {ARR_LENGTH($100)}] in range  index=#04, mask=0000000000000008

But then the i in entries[i] is not the i from assertions, it is PhiDef($3, $4, $202). And assertion propagation doesn't know how to propagate OAK_NO_THROW through PHIs. Maybe it should but doing that seems expensive.

It's probably RangeCheck that should pay attention to these assertions and generate the appropriate ranges. This is something I hope I'll look at in the future.

glenn-slayden · 2018-10-18T01:27:01Z

@mikedn wrote:
Maybe it should, but doing that seems expensive.

Could you clarify: time-expensive for JITting, or time- (and/or) space- (and/or) cache-miss (and/or) predictivity-expensive of the emitted ASM itself? ^{[note 1]}

If it's not the former, and considering the lack of elision is specifically related to loop code, wouldn't the gain (or loss, see next paragraph) in runtime performance here almost always easily swamp the one-time JIT hit, making this a higher priority fix?

But also more generally, isn't the elision--or redundant presence--at runtime, of any (public) memory fetch or store in fact a correctness bug, considering the strong concurrency guarantees of the .NET memory model? Of course I'm referring to where the checked condition depends on volatile value(s) visible elsewhere. Perhaps this issue here somehow inherently confined to checks of local variables only... which would still need to be true even now given ref-locals?

Pardon my lack of understanding on these matters.

_{Notes:
1. I assume you don't mean time expensive for csc.}

benaadams · 2018-10-18T02:04:30Z

I'm referring to where the checked condition depends on volatile value(s)

Explicit volatiles or Volatile.Read+Volatile.Write are treated differently; always forcing a memory access.

mikedn · 2018-10-18T15:44:23Z

Could you clarify: time-expensive for JITting, or time- (and/or) space- (and/or) cache-miss (and/or) predictivity-expensive of the emitted ASM itself? [note 1]

Hmm, I don't remember the details. I think it seemed to me at the time that doing this might require multiple scans of the entire assertion table, possible leading to quadratic complexity.

The bigger problem is that the existing assertion propagation design doesn't quite scale. It does a ton of upfront work to generate these assertions but it doesn't know if they are actually needed. Worse, there's a limit to the number of assertions it tracks (for perf reasons) and generating useless assertions can mean not having room for useful ones. And on top of it, assertion propagation is already one the expensive JIT phases.

If it's not the former, and considering the lack of elision is specifically related to loop code, wouldn't the gain (or loss, see next paragraph) in runtime performance here almost always easily swamp the one-time JIT hit, making this a higher priority fix?

Could be. And that's typically true about a lot of optimizations that the JIT could do. But there are other potentially better approaches to this problem and AFAIR the perf impact of this specific range check wasn't very big. So it doesn't seem worthwhile to jump at fixing this issue using the first idea that crosses one's mind.

And as far as I'm concerned, I slowly arrived at the conclusion that it would better to put effort into improving certain other parts of the JIT (e.g. some IR design issues inherited from the legacy JIT32 that impact SSA graph traversal performance) before attempting to improve array range check elimination. That, of course, it solely my personal opinion. The JIT team may very well think otherwise.

But also more generally, isn't the elision--or redundant presence--at runtime, of any (public) memory fetch or store in fact a correctness bug, considering the strong concurrency guarantees of the .NET memory model? Of course I'm referring to where the checked condition depends on volatile value(s) visible elsewhere. Perhaps this issue here somehow inherently confined to checks of local variables only... which would still need to be true even now given ref-locals?

I'm not quite sure I understand what specific concerns you have around the memory model and range check elimination. I'll try to point out some thing that perhaps will clarify this:

The memory model does not prevent load elimination or load reordering (with respect to other non volatile loads). So there's no problem eliminating or moving the array length load that the range check normally performs.
Volatile loads generally block optimizations (unless there are bugs in the JIT)
.NET arrays are fixed size, their Length can't change. The JIT assumes that. Of course, you could change the Length by using unsafe code. That would be completely bogus and the JIT doesn't care about such scenarios.
Range check elimination requires the JIT to be able to reason about both the index and the array reference. That generally means that they must be local variables so other threads cannot change them. It is possible for other JIT optimizations (e.g. CSE) to transform class field accesses into local variable accesses by caching field values into local variables but that doesn't work very well currently.

AndyAyersMS · 2018-10-18T15:56:29Z

There is some data on the relative usefulness of assertions over in dotnet/coreclr#18678. Roughly speaking only about 10% of the generated assertions are used.

glenn-slayden · 2018-10-23T19:01:56Z

@mikedn @AndyAyersMS @benaadams Thanks for the helpful insights.

gfoidl · 2022-03-31T13:33:14Z

While working on #67044 this seems to be half-way resolved.
The while doesn't need to be workarounded by do + if + break now.

Doesn't elide the bounds-check

private int While(int[] entries, int key)
{
    int i = 0;

    while ((uint)i < entries.Length)    // it's proven that i is in [0, entries.Length)
    {
        if (entries[i] == key)          // bounds-check introduced
        {
            break;
        }

        i++;
    }

    return i;
}

private int Do(int[] entries, int key)
{
    int i = 0;

    do
    {
        if ((uint)i >= entries.Length)  // it's proven that i is not in [0, entries.Length)
        {
            break;
        }

        if (entries[i] == key)          // bounds-check introduced
        {
            break;
        }

        i++;
    } while (true);

    return i;
}

Does elide the bounds-check

...when the length is casted to uint ((uint)entries.Length), the bounds-checks are elided.

FYI @EgorBo

jakobbotsch · 2022-12-15T11:01:08Z

This looks fixed on main, this is the codegen I get for:

private struct S
{
    public int hashCode;
    public int key;
}

static int Foo(S[] entries, int key, int hashCode, EqualityComparer<int> comparer)
{
    int i = 0;
    while ((uint)i < (uint)entries.Length)
    {
        if (entries[i].hashCode == hashCode && comparer.Equals(key, entries[i].key))
        {
            return i;
        }
    }

    return -1;
}

; Assembly listing for method MustHaveValueBenchmark.C:Foo(MustHaveValueBenchmark.C+S[],int,int,System.Collections.Generic.EqualityComparer`1[int]):int
; Emitting BLENDED_CODE for X64 CPU with AVX - Windows
; optimized code
; rsp based frame
; fully interruptible
; No PGO data
; Final local variable assignments
;
;  V00 arg0         [V00,T00] (  5,  9   )     ref  ->  rsi         class-hnd single-def
;  V01 arg1         [V01,T03] (  3,  4   )     int  ->  rbp         single-def
;  V02 arg2         [V02,T02] (  3,  6   )     int  ->  rbx         single-def
;  V03 arg3         [V03,T01] (  4,  6   )     ref  ->  rdi         class-hnd single-def
;* V04 loc0         [V04,T05] (  0,  0   )     int  ->  zero-ref    single-def
;  V05 OutArgs      [V05    ] (  1,  1   )  lclBlk (32) [rsp+00H]   "OutgoingArgSpace"
;  V06 cse0         [V06,T04] (  2,  2   )     int  ->  rax         "CSE - aggressive"
;
; Lcl frame size = 40

G_M37434_IG01:
       push     rdi
       push     rsi
       push     rbp
       push     rbx
       sub      rsp, 40
       mov      rsi, rcx
       mov      ebp, edx
       mov      ebx, r8d
       mov      rdi, r9
						;; size=19 bbWeight=1    PerfScore 5.25
G_M37434_IG02:
       mov      eax, dword ptr [rsi+08H]
       test     eax, eax
       je       SHORT G_M37434_IG07
						;; size=7 bbWeight=1    PerfScore 3.25
G_M37434_IG03:
       cmp      dword ptr [rsi+10H], ebx
       jne      SHORT G_M37434_IG03
						;; size=5 bbWeight=4    PerfScore 16.00
G_M37434_IG04:
       mov      r8d, dword ptr [rsi+14H]
       mov      rcx, rdi
       mov      edx, ebp
       mov      rax, qword ptr [rdi]
       mov      rax, qword ptr [rax+48H]
       call     [rax+20H]System.Collections.Generic.EqualityComparer`1[int]:Equals(int,int):bool:this
       test     eax, eax
       je       SHORT G_M37434_IG03
						;; size=23 bbWeight=2    PerfScore 21.50
G_M37434_IG05:
       xor      eax, eax
						;; size=2 bbWeight=0.50 PerfScore 0.12
G_M37434_IG06:
       add      rsp, 40
       pop      rbx
       pop      rbp
       pop      rsi
       pop      rdi
       ret      
						;; size=9 bbWeight=0.50 PerfScore 1.62
G_M37434_IG07:
       mov      eax, -1
						;; size=5 bbWeight=0.50 PerfScore 0.12
G_M37434_IG08:
       add      rsp, 40
       pop      rbx
       pop      rbp
       pop      rsi
       pop      rdi
       ret      
						;; size=9 bbWeight=0.50 PerfScore 1.62

; Total bytes of code 79, prolog size 19, PerfScore 57.40, instruction count 36, allocated bytes for code 79 (MethodHash=a11e6dc5) for method MustHaveValueBenchmark.C:Foo(MustHaveValueBenchmark.C+S[],int,int,System.Collections.Generic.EqualityComparer`1[int]):int
; ============================================================

And for the cases @gfoidl posted above:

; Assembly listing for method MustHaveValueBenchmark.C:While(int[],int):int:this
; Emitting BLENDED_CODE for X64 CPU with AVX - Windows
; optimized code
; rsp based frame
; fully interruptible
; No PGO data
; Final local variable assignments
;
;* V00 this         [V00    ] (  0,  0   )     ref  ->  zero-ref    this class-hnd single-def
;  V01 arg1         [V01,T01] (  4,  7   )     ref  ->  rdx         class-hnd single-def
;  V02 arg2         [V02,T02] (  3,  6   )     int  ->   r8         single-def
;  V03 loc0         [V03,T00] (  6, 18   )     int  ->  rax        
;# V04 OutArgs      [V04    ] (  1,  1   )  lclBlk ( 0) [rsp+00H]   "OutgoingArgSpace"
;  V05 cse0         [V05,T03] (  3,  6   )     int  ->  rcx         "CSE - aggressive"
;
; Lcl frame size = 0

G_M53493_IG01:
						;; size=0 bbWeight=1    PerfScore 0.00
G_M53493_IG02:
       xor      eax, eax
       mov      ecx, dword ptr [rdx+08H]
       test     ecx, ecx
       je       SHORT G_M53493_IG04
       align    [0 bytes for IG03]
						;; size=9 bbWeight=1    PerfScore 3.50
G_M53493_IG03:
       mov      r9d, eax
       cmp      dword ptr [rdx+4*r9+10H], r8d
       je       SHORT G_M53493_IG04
       inc      eax
       cmp      ecx, eax
       ja       SHORT G_M53493_IG03
						;; size=16 bbWeight=4    PerfScore 23.00
G_M53493_IG04:
       ret      
						;; size=1 bbWeight=1    PerfScore 1.00

; Total bytes of code 26, prolog size 0, PerfScore 30.10, instruction count 12, allocated bytes for code 26 (MethodHash=71402f0a) for method MustHaveValueBenchmark.C:While(int[],int):int:this
; ============================================================

; Assembly listing for method MustHaveValueBenchmark.C:Do(int[],int):int:this
; Emitting BLENDED_CODE for X64 CPU with AVX - Windows
; optimized code
; rsp based frame
; fully interruptible
; No PGO data
; Final local variable assignments
;
;* V00 this         [V00    ] (  0,  0   )     ref  ->  zero-ref    this class-hnd single-def
;  V01 arg1         [V01,T01] (  4,  7   )     ref  ->  rdx         class-hnd single-def
;  V02 arg2         [V02,T03] (  3,  6   )     int  ->   r8         single-def
;  V03 loc0         [V03,T00] (  6, 22   )     int  ->  rax        
;# V04 OutArgs      [V04    ] (  1,  1   )  lclBlk ( 0) [rsp+00H]   "OutgoingArgSpace"
;  V05 cse0         [V05,T02] (  2,  9   )     int  ->  rcx         "CSE - aggressive"
;
; Lcl frame size = 0

G_M50657_IG01:
						;; size=0 bbWeight=1    PerfScore 0.00
G_M50657_IG02:
       xor      eax, eax
       mov      ecx, dword ptr [rdx+08H]
       align    [0 bytes for IG03]
						;; size=5 bbWeight=1    PerfScore 2.25
G_M50657_IG03:
       cmp      ecx, eax
       jbe      SHORT G_M50657_IG05
						;; size=4 bbWeight=8    PerfScore 10.00
G_M50657_IG04:
       mov      r9d, eax
       cmp      dword ptr [rdx+4*r9+10H], r8d
       je       SHORT G_M50657_IG05
       inc      eax
       jmp      SHORT G_M50657_IG03
						;; size=14 bbWeight=4    PerfScore 26.00
G_M50657_IG05:
       ret      
						;; size=1 bbWeight=1    PerfScore 1.00

; Total bytes of code 24, prolog size 0, PerfScore 41.65, instruction count 11, allocated bytes for code 24 (MethodHash=5cc93a1e) for method MustHaveValueBenchmark.C:Do(int[],int):int:this
; ============================================================

stephentoub · 2024-01-22T18:07:07Z

This looks fixed on main, this is the codegen I get for

We have several places in Dictionary with TODOs for this still.

runtime/src/libraries/System.Private.CoreLib/src/System/Collections/Generic/Dictionary.cs

Line 421 in 5aabefa