Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Loop condition vs initial guard check impacts bounds checking #83349

Closed
stephentoub opened this issue Mar 13, 2023 · 5 comments · Fixed by #100777
Closed

Loop condition vs initial guard check impacts bounds checking #83349

stephentoub opened this issue Mar 13, 2023 · 5 comments · Fixed by #100777
Assignees
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI in-pr There is an active PR which will close this issue when it is merged tenet-performance Performance related issue
Milestone

Comments

@stephentoub
Copy link
Member

Consider these two functionally-identical loops:

    public static int M1(int i, ReadOnlySpan<char> src)
    {    
        int sum = 0;
        while ((uint)i < (uint)src.Length)
        {
            sum += src[i++];
        }
        return sum;
    }
    
    public static int M2(int i, ReadOnlySpan<char> src)
    {    
        int sum = 0;
        while (true)
        {
            if ((uint)i >= (uint)src.Length) break;

            sum += src[i++];
        }
        return sum;
    }

The second simply moves the loop condition to be the very first thing in the body. SharpLab even decompiles them to C# identically. However, the former has bounds checks whereas they're appropriately removed for the latter:
SharpLab

C.M1(Int32, System.ReadOnlySpan`1<Char>)
    L0000: sub rsp, 0x28
    L0004: mov rax, [rdx]
    L0007: mov edx, [rdx+8]
    L000a: xor r8d, r8d
    L000d: cmp ecx, edx
    L000f: jae short L0039
    L0011: nop [rax]
    L0018: nop [rax+rax]
    L0020: lea r9d, [rcx+1]
    L0024: cmp ecx, edx
    L0026: jae short L0041
    L0028: mov ecx, ecx
    L002a: movzx ecx, word ptr [rax+rcx*2]
    L002e: add r8d, ecx
    L0031: cmp r9d, edx
    L0034: mov ecx, r9d
    L0037: jb short L0020
    L0039: mov eax, r8d
    L003c: add rsp, 0x28
    L0040: ret
    L0041: call 0x00007fff38258b30
    L0046: int3

C.M2(Int32, System.ReadOnlySpan`1<Char>)
    L0000: mov rax, [rdx]
    L0003: mov edx, [rdx+8]
    L0006: xor r8d, r8d
    L0009: cmp ecx, edx
    L000b: jae short L001f
    L000d: lea r9d, [rcx+1]
    L0011: mov ecx, ecx
    L0013: movzx ecx, word ptr [rax+rcx*2]
    L0017: add r8d, ecx
    L001a: mov ecx, r9d
    L001d: jmp short L0009
    L001f: mov eax, r8d
    L0022: ret
@stephentoub stephentoub added tenet-performance Performance related issue area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI labels Mar 13, 2023
@stephentoub stephentoub added this to the 8.0.0 milestone Mar 13, 2023
@ghost
Copy link

ghost commented Mar 13, 2023

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch, @kunalspathak
See info in area-owners.md if you want to be subscribed.

Issue Details

Consider these two functionally-identical loops:

    public static int M1(int i, ReadOnlySpan<char> src)
    {    
        int sum = 0;
        while ((uint)i < (uint)src.Length)
        {
            sum += src[i++];
        }
        return sum;
    }
    
    public static int M2(int i, ReadOnlySpan<char> src)
    {    
        int sum = 0;
        while (true)
        {
            if ((uint)i >= (uint)src.Length) break;

            sum += src[i++];
        }
        return sum;
    }

The second simply moves the loop condition to be the very first thing in the body. SharpLab even decompiles them to C# identically. However, the former has bounds checks whereas they're appropriately removed for the latter:
SharpLab

C.M1(Int32, System.ReadOnlySpan`1<Char>)
    L0000: sub rsp, 0x28
    L0004: mov rax, [rdx]
    L0007: mov edx, [rdx+8]
    L000a: xor r8d, r8d
    L000d: cmp ecx, edx
    L000f: jae short L0039
    L0011: nop [rax]
    L0018: nop [rax+rax]
    L0020: lea r9d, [rcx+1]
    L0024: cmp ecx, edx
    L0026: jae short L0041
    L0028: mov ecx, ecx
    L002a: movzx ecx, word ptr [rax+rcx*2]
    L002e: add r8d, ecx
    L0031: cmp r9d, edx
    L0034: mov ecx, r9d
    L0037: jb short L0020
    L0039: mov eax, r8d
    L003c: add rsp, 0x28
    L0040: ret
    L0041: call 0x00007fff38258b30
    L0046: int3

C.M2(Int32, System.ReadOnlySpan`1<Char>)
    L0000: mov rax, [rdx]
    L0003: mov edx, [rdx+8]
    L0006: xor r8d, r8d
    L0009: cmp ecx, edx
    L000b: jae short L001f
    L000d: lea r9d, [rcx+1]
    L0011: mov ecx, ecx
    L0013: movzx ecx, word ptr [rax+rcx*2]
    L0017: add r8d, ecx
    L001a: mov ecx, r9d
    L001d: jmp short L0009
    L001f: mov eax, r8d
    L0022: ret
Author: stephentoub
Assignees: -
Labels:

tenet-performance, area-CodeGen-coreclr

Milestone: 8.0.0

@EgorBo
Copy link
Member

EgorBo commented Mar 13, 2023

public static int M1(int i, ReadOnlySpan<char> src)
{
    int sum = 0;
    while (i >= 0 && i < src.Length)
    {
        sum += src[i++];
    }
    return sum;
}

this works 🙂

@stephentoub
Copy link
Member Author

this works 🙂

Sort of... it eliminates the comparison as part of the bounds check, but it still has an extra comparison/branch for the i >= 0.

C.M1(Int32, System.ReadOnlySpan`1<Char>)
    L0000: mov rax, [rdx]
    L0003: mov edx, [rdx+8]
    L0006: xor r8d, r8d
    L0009: jmp short L001b
    L000b: lea r9d, [rcx+1]
    L000f: mov ecx, ecx
    L0011: movzx ecx, word ptr [rax+rcx*2]
    L0015: add r8d, ecx
    L0018: mov ecx, r9d
    L001b: test ecx, ecx
    L001d: jl short L0023
    L001f: cmp ecx, edx
    L0021: jl short L000b
    L0023: mov eax, r8d
    L0026: ret

In contrast, this:

public static int M2(int i, ReadOnlySpan<char> src)
{
    int sum = 0;
    while (true)
    {
        if ((uint)i >= (uint)src.Length) break;
        sum += src[i++];
    }
    return sum;
}

is:

C.M2(Int32, System.ReadOnlySpan`1<Char>)
    L0000: mov rax, [rdx]
    L0003: mov edx, [rdx+8]
    L0006: xor r8d, r8d
    L0009: cmp ecx, edx
    L000b: jae short L001f
    L000d: lea r9d, [rcx+1]
    L0011: mov ecx, ecx
    L0013: movzx ecx, word ptr [rax+rcx*2]
    L0017: add r8d, ecx
    L001a: mov ecx, r9d
    L001d: jmp short L0009
    L001f: mov eax, r8d
    L0022: ret

@EgorBo
Copy link
Member

EgorBo commented Aug 5, 2023

The fix doesn't look to be trivial so moving to 9.0 🙁

@EgorBo EgorBo modified the milestones: 8.0.0, 9.0.0 Aug 5, 2023
@jakobbotsch jakobbotsch assigned jakobbotsch and unassigned EgorBo Apr 11, 2024
@jakobbotsch
Copy link
Member

Going to grab this one since #100777 should fix it.

@dotnet-policy-service dotnet-policy-service bot added the in-pr There is an active PR which will close this issue when it is merged label Apr 11, 2024
matouskozak pushed a commit to matouskozak/runtime that referenced this issue Apr 30, 2024
@github-actions github-actions bot locked and limited conversation to collaborators May 18, 2024
Ruihan-Yin pushed a commit to Ruihan-Yin/runtime that referenced this issue May 30, 2024
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI in-pr There is an active PR which will close this issue when it is merged tenet-performance Performance related issue
Projects
None yet
3 participants