Fix for issue #2117 #2230

brianpopow · 2022-09-15T17:21:59Z

Prerequisites

I have written a descriptive pull-request title
I have verified that there are no overlapping pull-requests open
I have verified that I am following the existing coding patterns and practice as demonstrated in the repository. These follow strict Stylecop rules 👮.
I have provided test coverage for my change (where applicable)

Description

This PR is a workaround for Issue #2117.

# Conflicts: # src/ImageSharp/PixelFormats/PixelImplementations/Bgra32.cs

brianpopow · 2022-09-15T17:52:58Z

I have run the failing tests now many times, everything looks fine now. The workaround seems to solve the issue #2117.

JimBobSquarePants · 2022-09-15T22:16:01Z

CC @tannergooding

tannergooding

LGTM and this should be a net positive for codegen as well since it should make inlining simpler and avoid unnecessary memory accesses since vector can be enregistered.

It's worth noting the bug in question is also on .NET 6, but doesn't seem to be triggered by the ref pattern that was used here. Just more an FYI in case any other oddities creep up.

JimBobSquarePants · 2022-09-16T00:19:48Z

@tannergooding Good to know. Thanks for having a look. Should we be avoiding passing Vector2/3/4 by ref then? I think we do that in a couple of internal places which we can refactor.

tannergooding · 2022-09-16T00:51:00Z

Should we be avoiding passing Vector2/3/4 by ref then? I think we do that in a couple of internal places which we can refactor.

TL;DR;

That is my general recommendation, especially when the function is going to be aggressively inlined.

For this particular case the disassembly from not passing by ref looks like the below and you can see there are far fewer memory accesses happening:

push        rsi  
sub         rsp,30h  
vzeroupper  
vmovaps     xmmword ptr [rsp+20h],xmm6  
mov         rsi,rcx  
vmovupd     xmm6,xmmword ptr [rdx]  
mov         rcx,7FF8CA164688h  
mov         edx,155h  
call        CORINFO_HELP_GETSHARED_NONGCSTATIC_BASE (07FF92929C890h)  
mov         rax,198C100EF80h  
mov         rax,qword ptr [rax]  
add         rax,8  
vmulps      xmm6,xmm6,xmmword ptr [rax]  
mov         rdx,198C100EF88h  
mov         rdx,qword ptr [rdx]  
vaddps      xmm6,xmm6,xmmword ptr [rdx+8]  
vxorps      xmm0,xmm0,xmm0  
vmaxps      xmm0,xmm6,xmm0  
vminps      xmm6,xmm0,xmmword ptr [rax]  
vmovaps     xmm0,xmm6  
vcvttss2si  eax,xmm0  
mov         byte ptr [rsi+2],al  
vmovshdup   xmm0,xmm6  
vcvttss2si  eax,xmm0  
mov         byte ptr [rsi+1],al  
vunpckhps   xmm0,xmm6,xmm6  
vcvttss2si  eax,xmm0  
mov         byte ptr [rsi],al  
vshufps     xmm0,xmm6,xmm6,0FFh  
vcvttss2si  eax,xmm0  
mov         byte ptr [rsi+3],al  
vmovaps     xmm6,xmmword ptr [rsp+20h]  
add         rsp,30h  
pop         rsi  
ret

Longer Explanation

Vector2/3/4 are "user-defined structs" containing 2/3/4x floats. This makes them classified as an HFA (Homogenous Floating-Point Aggregate). Additionally, the JIT has specialized handling for them (which I can hopefully extend to other types in the future) that allows them to be "enregistered" (meaning the entire struct is kept in a single register).

For such structs, the "worst case" scenario is that they are passed by shadow copy. What this means is that the ABI doesn't allow it to be passed in a single or even multiple registers and so they effectively a private copy is made and a reference is passed to that copy instead. -- That is if your signature is void Method(Vector4 value) the ABI level signature is actually more like void Method(in Vector4 value). However, its not quite the same in that the JIT doesn't have to consider this "address taken" and therefore there are no aliasing concerns. Likewise, it can trivially give it "value semantics" on the callee side and avoid repeated memory lookups. For something like Vector4 which can be enregistered, this is really a "good scenario" because the JIT will effectively just "spill" the value to the stack before the call (in the callee), pass the address of the spill location, and then "unspill" it back into a register on the caller side.

Additionally, some platforms, such as x64 Unix (Linux and MacOS), allow such HFAs to be passed in 1-4 registers. When its passed in 1 register, this is "best" because its just a register to register move and is effectively "free". When its passed in multiple registers then its similar to "struct promotion" which allows each field to be individually enregistered. For something like a SIMD type, "promotion" isn't ideal but its also often still better than spilling to memory.

Passing explicitly by reference (using ref T or in T or even a raw pointer T*) can be beneficial for certain scenarios especially for large structs that can't be enregistered (such as a Matrix3x2 or Matrix4x4 containing 6x or 16x float fields). However, its less ideal for types that can be enregistered since it means the type is "address taken" and the JIT can no longer enregister it. Instead, you have to explicitly create a local, read the value into the local, and remember to store it back if the mutation was important to get the "best codegen" for such a case. Additionally, passing explicitly by reference can hinder certain optimizations since the JIT may have to consider the value "address taken" and therefore "potentially aliased" and has to work harder to determine if that addressing can be removed so the value can be consumed directly.

Basic Scenarios to Consider

Primitive types (byte, sbyte, short, ushort, int, uint, long, ulong, nint, nuint, float, double, bool, char), reference types (string, object, List<T>, T[], T*), simple structs containing 1 field of the previously mentioned types should all be passed by value (just pass T)

Most structs greater than 16-bytes should be passed by readonly reference (in T). There are some exceptions related to SIMD code (covered below).

Structs less than or equal to 16-bytes large can have special considerations. If they are exactly 2 fields of the same type, passing by value can often be better (such as struct Int128 { ulong lower, upper; }). If they have a wide mix of field types (like Guid does), passing via in can be better. Structs the same size or smaller than a pointer (8-bytes for most modern considerations) should often be passed by value always as many ABIs treat them specially.

Structs only containing floating-point types get special consideration (HFAs). If they contain 1-4x float/double of the same type, they should typically be passed by value.

The structs Vector64<T> (Arm64 only), Vector128<T> (all platforms), and Vector256<T> (x86/x64 only) likewise get special treatment. The types themselves, or structs containing 1-4x of the same type of these are known as HVAs (Homogenous Vector Aggregates) and should typically be passed by value. -- This means a struct Matrix4x4 containing 16x float is treated differently from one containing 4x Vector128<float>. The former should be passed by reference and the latter by value. Even though they are the same size and "conceptually" the same layout, the HVA gets special treatment while the HFA version has too many individual fields).

tannergooding · 2022-09-16T01:03:06Z

Notably on .NET 6/7, you could make this even more efficient by doing something like:

public void Pack(Vector4 vector)
{
    vector *= MaxBytes;
    vector += Half;
    vector = Vector4.Clamp(vector, Vector4.Zero, MaxBytes);

    Vector128<byte> result = Sse2.ConvertToVector128Int32WithTruncation(vector.AsVector128()).AsByte();
    // In .NET 7+ the above can be `result = Vector128.ConvertToInt32(vector.AsVector128()).AsByte()` so it works on Arm64 too

    R = result.GetElement(0);
    G = result.GetElement(4);
    B = result.GetElement(8);
    A = result.GetElement(12);
}

This converts all 4 elements at once and then extracts the truncated bytes directly:

vzeroupper
vmovupd xmm0, [0x7ffd160105c0]
vmovaps xmm1, xmm0
vmulps xmm1, xmm1, [rdx]
vmovupd [rdx], xmm1
vmovupd xmm1, [0x7ffd160105d0]
vaddps xmm1, xmm1, [rdx]
vmovupd [rdx], xmm1
vmovupd xmm1, [rdx]
vxorps xmm2, xmm2, xmm2
vmaxps xmm1, xmm1, xmm2
vminps xmm0, xmm1, xmm0
vmovupd [rdx], xmm0
vcvttps2dq xmm0, [rdx]
vpextrb eax, xmm0, 0
mov [rcx+2], al
vpextrb eax, xmm0, 4
mov [rcx+1], al
vpextrb eax, xmm0, 8
mov [rcx], al
vpextrb eax, xmm0, 0xc
mov [rcx+3], al
ret

We'll also be improving the codegen around vpextrb more in the future so it can be just vpextrb [rcx+2], xmm0, 0 instead of vpextrb eax, xmm0, 0 followed by mov [rcx+2], al.

You can also optimize in .NET 6+ by directly using Vector128.Create(). This creates a method local constant and avoids the static initializer entirely:

    private static Vector4 MaxBytes => Vector128.Create(255f).AsVector4();
    private static Vector4 Half => Vector128.Create(0.5f).AsVector4();

JimBobSquarePants · 2022-09-16T01:16:24Z

This is an enormous wealth of knowledge. Thanks so much for making the effort to share is with us!

I'll create an issue we can use to track the improvements we can make in the pixel formats from your examples. That pattern is used in several places.

Fix for issue #2117

353f2e2

brianpopow added the unit tests label Sep 15, 2022

Merge remote-tracking branch 'origin/main' into bp/FixIssue2117

61e7457

# Conflicts: # src/ImageSharp/PixelFormats/PixelImplementations/Bgra32.cs

tannergooding approved these changes Sep 15, 2022

View reviewed changes

Enable .NET 7 workflows

0c62aeb

JimBobSquarePants merged commit 8af5d8a into main Sep 16, 2022

JimBobSquarePants deleted the bp/FixIssue2117 branch September 16, 2022 01:18

JimBobSquarePants mentioned this pull request Sep 16, 2022

Performance optimization opportunities in common pixel formats. #2232

Closed

4 tasks

brianpopow mentioned this pull request Sep 22, 2022

Failing tests on .NET7 #2117

Closed

4 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix for issue #2117 #2230

Fix for issue #2117 #2230

brianpopow commented Sep 15, 2022

brianpopow commented Sep 15, 2022

JimBobSquarePants commented Sep 15, 2022

tannergooding left a comment

JimBobSquarePants commented Sep 16, 2022

tannergooding commented Sep 16, 2022

tannergooding commented Sep 16, 2022 •

edited

Loading

JimBobSquarePants commented Sep 16, 2022

Fix for issue #2117 #2230

Fix for issue #2117 #2230

Conversation

brianpopow commented Sep 15, 2022

Prerequisites

Description

brianpopow commented Sep 15, 2022

JimBobSquarePants commented Sep 15, 2022

tannergooding left a comment

Choose a reason for hiding this comment

JimBobSquarePants commented Sep 16, 2022

tannergooding commented Sep 16, 2022

TL;DR;

Longer Explanation

Basic Scenarios to Consider

tannergooding commented Sep 16, 2022 • edited Loading

JimBobSquarePants commented Sep 16, 2022

tannergooding commented Sep 16, 2022 •

edited

Loading