Fix for issue #2117 #2230
# Conflicts: # src/ImageSharp/PixelFormats/PixelImplementations/Bgra32.cs
I have now run the failing tests many times and everything looks fine. The workaround seems to solve issue #2117.
LGTM, and this should be a net positive for codegen as well: it should make inlining simpler and avoid unnecessary memory accesses, since vector can be enregistered.
It's worth noting the bug in question is also on .NET 6, but doesn't seem to be triggered by the ref pattern that was used here. This is more of an FYI in case any other oddities creep up.
@tannergooding Good to know. Thanks for having a look. Should we be avoiding passing |
TL;DR: That is my general recommendation, especially when the function is going to be aggressively inlined. For this particular case, the disassembly from not passing by ref looks like the below, and you can see there are far fewer memory accesses happening:
push rsi
sub rsp,30h
vzeroupper
vmovaps xmmword ptr [rsp+20h],xmm6
mov rsi,rcx
vmovupd xmm6,xmmword ptr [rdx]
mov rcx,7FF8CA164688h
mov edx,155h
call CORINFO_HELP_GETSHARED_NONGCSTATIC_BASE (07FF92929C890h)
mov rax,198C100EF80h
mov rax,qword ptr [rax]
add rax,8
vmulps xmm6,xmm6,xmmword ptr [rax]
mov rdx,198C100EF88h
mov rdx,qword ptr [rdx]
vaddps xmm6,xmm6,xmmword ptr [rdx+8]
vxorps xmm0,xmm0,xmm0
vmaxps xmm0,xmm6,xmm0
vminps xmm6,xmm0,xmmword ptr [rax]
vmovaps xmm0,xmm6
vcvttss2si eax,xmm0
mov byte ptr [rsi+2],al
vmovshdup xmm0,xmm6
vcvttss2si eax,xmm0
mov byte ptr [rsi+1],al
vunpckhps xmm0,xmm6,xmm6
vcvttss2si eax,xmm0
mov byte ptr [rsi],al
vshufps xmm0,xmm6,xmm6,0FFh
vcvttss2si eax,xmm0
mov byte ptr [rsi+3],al
vmovaps xmm6,xmmword ptr [rsp+20h]
add rsp,30h
pop rsi
ret
Longer Explanation
For such structs, the "worst case" scenario is that they are passed by shadow copy. This means the ABI doesn't allow the struct to be passed in a single register, or even in multiple registers, so a private copy is made and a reference to that copy is passed instead. That is, if your signature is

Additionally, some platforms, such as x64 Unix (Linux and macOS), allow such HFAs to be passed in 1-4 registers. When it's passed in 1 register, this is "best" because it's just a register-to-register move and is effectively "free". When it's passed in multiple registers, it's similar to "struct promotion", which allows each field to be individually enregistered. For something like a SIMD type, "promotion" isn't ideal, but it's also often still better than spilling to memory.

Passing explicitly by reference (using

Basic Scenarios to Consider

Primitive types (

Most structs greater than 16 bytes should be passed by readonly reference (

Structs less than or equal to 16 bytes can have special considerations. If they have exactly 2 fields of the same type, passing by value can often be better (such as

Structs containing only floating-point types get special consideration (HFAs). If they contain 1-4 float/double fields of the same type, they should typically be passed by value. The structs
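To make the by-value versus by-reference distinction discussed above concrete, here is a minimal sketch. `DemoPixel`, `PackByValue`, and `PackByRef` are hypothetical names made up for illustration; this is not the ImageSharp implementation, and it only shows the two signature shapes and their observable difference for the caller, not what the JIT emits for either.

```csharp
using System.Numerics;
using System.Runtime.CompilerServices;

// Hypothetical stand-in for a 4-byte pixel type (assumed B, G, R, A byte order).
public struct DemoPixel
{
    public byte B, G, R, A;

    private static readonly Vector4 MaxBytes = new Vector4(255f);
    private static readonly Vector4 Half = new Vector4(0.5f);

    // By value: the method works on its own copy of the vector,
    // so the caller's variable is never written to.
    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    public void PackByValue(Vector4 vector)
    {
        vector *= MaxBytes;
        vector += Half;
        vector = Vector4.Clamp(vector, Vector4.Zero, MaxBytes);

        R = (byte)vector.X;
        G = (byte)vector.Y;
        B = (byte)vector.Z;
        A = (byte)vector.W;
    }

    // By ref: every intermediate result is assigned through the reference,
    // so the caller's variable is modified as a side effect.
    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    public void PackByRef(ref Vector4 vector)
    {
        vector *= MaxBytes;
        vector += Half;
        vector = Vector4.Clamp(vector, Vector4.Zero, MaxBytes);

        R = (byte)vector.X;
        G = (byte)vector.Y;
        B = (byte)vector.Z;
        A = (byte)vector.W;
    }
}
```

Both methods produce the same packed bytes. The visible difference is that after `PackByRef(ref v)` the caller's `v` holds the scaled and clamped values, whereas `PackByValue(v)` leaves it untouched.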
Notably on .NET 6/7, you could make this even more efficient by doing something like:
public void Pack(Vector4 vector)
{
    vector *= MaxBytes;
    vector += Half;
    vector = Vector4.Clamp(vector, Vector4.Zero, MaxBytes);
    Vector128<byte> result = Sse2.ConvertToVector128Int32WithTruncation(vector.AsVector128()).AsByte();
    // In .NET 7+ the above can be `result = Vector128.ConvertToInt32(vector.AsVector128()).AsByte()` so it works on Arm64 too
    R = result.GetElement(0);
    G = result.GetElement(4);
    B = result.GetElement(8);
    A = result.GetElement(12);
}
This converts all 4 elements at once and then extracts the truncated bytes directly:
vzeroupper
vmovupd xmm0, [0x7ffd160105c0]
vmovaps xmm1, xmm0
vmulps xmm1, xmm1, [rdx]
vmovupd [rdx], xmm1
vmovupd xmm1, [0x7ffd160105d0]
vaddps xmm1, xmm1, [rdx]
vmovupd [rdx], xmm1
vmovupd xmm1, [rdx]
vxorps xmm2, xmm2, xmm2
vmaxps xmm1, xmm1, xmm2
vminps xmm0, xmm1, xmm0
vmovupd [rdx], xmm0
vcvttps2dq xmm0, [rdx]
vpextrb eax, xmm0, 0
mov [rcx+2], al
vpextrb eax, xmm0, 4
mov [rcx+1], al
vpextrb eax, xmm0, 8
mov [rcx], al
vpextrb eax, xmm0, 0xc
mov [rcx+3], al
ret
You can also optimize in .NET 6+ by directly using:
private static Vector4 MaxBytes => Vector128.Create(255f).AsVector4();
private static Vector4 Half => Vector128.Create(0.5f).AsVector4();
This is an enormous wealth of knowledge. Thanks so much for making the effort to share it with us! I'll create an issue we can use to track the improvements we can make in the pixel formats from your examples. That pattern is used in several places.
Prerequisites
Description
This PR is a workaround for Issue #2117.