-
-
Notifications
You must be signed in to change notification settings - Fork 852
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Improvements around fixed
#2418
Conversation
Only where it seems profitable. E.g. not when a UnmanagedMemoryStream is constructed of that pointer.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Notes for review.
@@ -167,7 +167,7 @@ public static float Compress(float channel) | |||
[MethodImpl(MethodImplOptions.AggressiveInlining)] | |||
private static unsafe void CompandAvx2(Span<Vector4> vectors, float[] table) | |||
{ | |||
fixed (float* tablePointer = &table[0]) | |||
fixed (float* tablePointer = &MemoryMarshal.GetArrayDataReference(table)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This avoids the bound-checks.
See sharplab for example.
@@ -644,7 +644,7 @@ public static class HwIntrinsics | |||
ReadOnlySpan<byte> source, | |||
Span<float> dest) | |||
{ | |||
fixed (byte* sourceBase = source) | |||
fixed (byte* sourceBase = &MemoryMarshal.GetReference(source)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This avoids the length-check.
fixed (byte* sourceBase = source)
is compiled to (via pattern matching)
fixed (byte* ptr = &span.GetPinnableReference())
and Span<T>.GetPinnableReference
does a length-check which we don't need here, so we can omit it.
See sharplab for example.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do such (out-of-hot-loop) length checks have a measurable impact on perf? The new code feels bloated, I'm not in favor of such changes if there are no tangible gains.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For this method in HwIntrinsics I don't think it's really measurable in (micro-) benchmarks. But it reduces the machine code a bit and the length-check is just not needed at all here.
I'm more on the eager side to squeeze out unnecessary instructions.
But when you want me this change to revert I'm fine with that too.
The new code feels bloated
The C# code is bloated. That's the obvious downside.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We typically call these methods for entire image scanlines (->100's or 1000's of elements in the span). I would assume having an extra bounds check before the loop negligible, unless there are benchmarks proving it otherwise.
The C# code is bloated.
When you turn it into semi-C, yes. But you don't always have to; we need to make informed, individual decisions regarding perf vs simplicity.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK - I'll revert.
@@ -15,25 +15,14 @@ namespace SixLabors.ImageSharp.Formats.Jpeg.Components; | |||
/// 8x8 matrix of <see cref="short"/> coefficients. | |||
/// </summary> | |||
// ReSharper disable once InconsistentNaming | |||
[StructLayout(LayoutKind.Explicit)] | |||
internal unsafe partial struct Block8x8 | |||
[StructLayout(LayoutKind.Explicit, Size = 2 * Size)] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The unused fixed sized buffer to specify the size of the struct isn't neede, as the size can be set directly.
Codegen for creating the struct is also a bit better -- only zeroing the memory, instead of stack-cookie + zeroing.
Though that won't matter much -- the reduction of C# code is the motivation for this change.
Block8x8F result = default; | ||
result.LoadFrom(data); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Going via instance method needs to copy two times, which can be avoided.
In sharplab it's hard to see (no method names, just the address of the call), so here from the JIT directly:
JIT assembly
; Assembly listing for method Foo:Default(System.Span`1[float]):Block8x8F
; Emitting BLENDED_CODE for X64 CPU with AVX - Windows
; Tier-1 compilation
; optimized code
; rsp based frame
; partially interruptible
; No PGO data
; 0 inlinees with PGO data; 4 single block inlinees; 0 inlinees without PGO data
G_M000_IG01: ;; offset=0000H
56 push rsi
4881EC00010000 sub rsp, 256
488BF1 mov rsi, rcx
G_M000_IG02: ;; offset=000BH
488B12 mov rdx, bword ptr [rdx]
488D0C24 lea rcx, bword ptr [rsp]
41B800010000 mov r8d, 256
E80330AF5F call CORINFO_HELP_MEMCPY
488BCE mov rcx, rsi
488D1424 lea rdx, bword ptr [rsp]
41B800010000 mov r8d, 256
E8F12FAF5F call CORINFO_HELP_MEMCPY
488BC6 mov rax, rsi
G_M000_IG03: ;; offset=0032H
4881C400010000 add rsp, 256
5E pop rsi
C3 ret
; Total bytes of code 59
; Assembly listing for method Foo:PR(System.Span`1[float]):Block8x8F_PR
; Emitting BLENDED_CODE for X64 CPU with AVX - Windows
; Tier-1 compilation
; optimized code
; rsp based frame
; partially interruptible
; No PGO data
; 0 inlinees with PGO data; 3 single block inlinees; 0 inlinees without PGO data
G_M000_IG01: ;; offset=0000H
56 push rsi
488BF1 mov rsi, rcx
G_M000_IG02: ;; offset=0004H
488B12 mov rdx, bword ptr [rdx]
488BCE mov rcx, rsi
41B800010000 mov r8d, 256
E80B2FAF5F call CORINFO_HELP_MEMCPY
488BC6 mov rax, rsi
G_M000_IG03: ;; offset=0018H
5E pop rsi
C3 ret
; Total bytes of code 26
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just wondering: is it the same gain when you compare use cases on call sites, ie.
Block8x8F b = Block8x8F.Load(data)`;
after the PR change vs. typical usage on main:
Block8x8F b = default;
b.LoadFrom(data);
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It depends what's done with the instance of Block8x8F
. In this example Default2
shows no difference, as the JIT is smart enough to figure out what's going on. Namely: load the data onto the stack, then operate with the instance method of that type and use data on the stack.
But as soon the Block8x8F
is passed around a copy needs to be made. Thus for me the instance LoadFrom
has the potential for a perf-trap, and should be removed.
Besides that from an API point of view having an instance method to load (or better "fill") the block with data is a bit strange for an value-type*. Here a static method seems the right fit, like int.Parse
, Vector128.Load
, etc.
As it's an internal type this change seems fine from this view.
* for reference types this is pretty much OK
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Besides that from an API point of view having an instance method to load (or better "fill") the block with data is a bit strange
Note that Block8x8F
is really huge. If I remember correctly, in early Jpeg code there were cases when existing on-stack blocks were being reused by repopulating it. If it's no longer the case in the current code, I'm happy with the change.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Checked it -- all usages are safe. Block8x8F
is either passed by reference, and only copied where needed (e.g. constructors).
existing on-stack blocks were being reused by repopulating
When I understand it right, then this is a simple example of it.
Both variants work and produce the same machine-code.
(It's because the compilers hoists the loop-local block variable, and as it's a value-type the stack space needs to be reserved anyway).
Even if there are two blocks read then it's the same machine code.
I'll see if the C# code here can be simplified.
@@ -25,7 +25,7 @@ public static unsafe bool CheckNonOpaque(Span<Bgra32> row) | |||
ReadOnlySpan<byte> rowBytes = MemoryMarshal.AsBytes(row); | |||
int i = 0; | |||
int length = (row.Length * 4) - 3; | |||
fixed (byte* src = rowBytes) | |||
fixed (byte* src = &MemoryMarshal.GetReference(rowBytes)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ups, I've forgotten this one. Couldn't see them in the local diffs.
Is this ready for another review? |
Yes, please. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good. Thanks!
Prerequisites
Description
Block8x8
structBlock8x8F.Load
avoid unnecessary copying of data -- read it directly from the source