
The new SIMD support could use some improvements #13811

Open
Const-me opened this issue Nov 21, 2019 · 12 comments
Assignees
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI optimization
Milestone

Comments

@Const-me

Here's the source code of a function which computes the brightness of RGBA images. It's very typical SIMD code for image processing, the kind I traditionally write in C++. I decided to try the new C# intrinsics for a change.

/// <summary>Load 4 pixels of RGBA</summary>
static unsafe Vector128<int> load4( byte* src )
{
	return Sse2.LoadVector128( (int*)src );
}

/// <summary>Pack red channel of 8 pixels into ushort values in the [ 0 .. 0xFF00 ] interval</summary>
static Vector128<ushort> packRed( Vector128<int> a, Vector128<int> b )
{
	Vector128<int> mask = Vector128.Create( 0xFF );
	a = Sse2.And( a, mask );
	b = Sse2.And( b, mask );
	return Sse2.ShiftLeftLogical128BitLane( Sse41.PackUnsignedSaturate( a, b ), 1 );
}

/// <summary>Pack green channel of 8 pixels into ushort values in the [ 0 .. 0xFF00 ] interval</summary>
static Vector128<ushort> packGreen( Vector128<int> a, Vector128<int> b )
{
	Vector128<int> mask = Vector128.Create( 0xFF00 );
	a = Sse2.And( a, mask );
	b = Sse2.And( b, mask );
	return Sse41.PackUnsignedSaturate( a, b );
}

/// <summary>Pack blue channel of 8 pixels into ushort values in the [ 0 .. 0xFF00 ] interval</summary>
static Vector128<ushort> packBlue( Vector128<int> a, Vector128<int> b )
{
	a = Sse2.ShiftRightLogical128BitLane( a, 1 );
	b = Sse2.ShiftRightLogical128BitLane( b, 1 );
	Vector128<int> mask = Vector128.Create( 0xFF00 );
	a = Sse2.And( a, mask );
	b = Sse2.And( b, mask );
	return Sse41.PackUnsignedSaturate( a, b );
}

/// <summary>Load 8 pixels, split into RGB channels.</summary>
static unsafe void loadRgb( byte* src, out Vector128<ushort> red, out Vector128<ushort> green, out Vector128<ushort> blue )
{
	var a = load4( src );
	var b = load4( src + 16 );
	red = packRed( a, b );
	green = packGreen( a, b );
	blue = packBlue( a, b );
}

const ushort mulRed = (ushort)( 0.29891 * 0x10000 );
const ushort mulGreen = (ushort)( 0.58661 * 0x10000 );
const ushort mulBlue = (ushort)( 0.11448 * 0x10000 );

/// <summary>Compute brightness of 8 pixels</summary>
static Vector128<short> brightness( Vector128<ushort> r, Vector128<ushort> g, Vector128<ushort> b )
{
	r = Sse2.MultiplyHigh( r, Vector128.Create( mulRed ) );
	g = Sse2.MultiplyHigh( g, Vector128.Create( mulGreen ) );
	b = Sse2.MultiplyHigh( b, Vector128.Create( mulBlue ) );
	var result = Sse2.AddSaturate( Sse2.AddSaturate( r, g ), b );
	return Vector128.AsInt16( Sse2.ShiftRightLogical( result, 8 ) );
}

/// <summary>Convert buffer from RGBA to grayscale.</summary>
/// <remarks>
/// <para>If your image has line paddings, you'll want to call this once per line, not for the complete image.</para>
/// <para>If width of the image is not multiple of 16 pixels, you'll need to do more work to handle the last few pixels of every line.</para>
/// </remarks>
static unsafe void convertToGrayscale( byte* src, byte* dst, int count )
{
	byte* srcEnd = src + count * 4;
	while( src < srcEnd )
	{
		loadRgb( src, out var r, out var g, out var b );
		var low = brightness( r, g, b );
		loadRgb( src + 32, out r, out g, out b );
		var hi = brightness( r, g, b );

		var bytes = Sse2.PackUnsignedSaturate( low, hi );
		Sse2.Store( dst, bytes );

		src += 64;
		dst += 16;
	}
}
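For readers unfamiliar with the fixed-point trick in brightness() above, here is a scalar Python sketch (mine, not from the code above) of what one SIMD lane computes: each channel is positioned as c << 8 (i.e. in [0 .. 0xFF00]), MultiplyHigh yields (x * mul) >> 16, and the final logical shift by 8 scales the saturated sum back down to a byte.

```python
# Scalar model of one lane of the SIMD brightness() above.
# 16.16 fixed-point luma coefficients, exactly as in the C# source.
MUL_RED   = int(0.29891 * 0x10000)   # 0x4C85
MUL_GREEN = int(0.58661 * 0x10000)   # 0x962C
MUL_BLUE  = int(0.11448 * 0x10000)   # 0x1D4E

def brightness(r, g, b):
    """Emulate packXxx + MultiplyHigh + AddSaturate + shift for one pixel."""
    # packRed/packGreen/packBlue each leave the channel as (c << 8), a ushort.
    hr = ((r << 8) * MUL_RED)   >> 16   # Sse2.MultiplyHigh semantics
    hg = ((g << 8) * MUL_GREEN) >> 16
    hb = ((b << 8) * MUL_BLUE)  >> 16
    s = min(hr + hg + hb, 0xFFFF)       # Sse2.AddSaturate (unsigned)
    return s >> 8                       # Sse2.ShiftRightLogical(result, 8)

print(brightness(255, 255, 255))  # pure white -> 254 (fixed-point rounding loss)
print(brightness(0, 0, 0))        # pure black -> 0
```

The three coefficients, 0x4C85, 0x962C, and 0x1D4E, are exactly the broadcast constants that show up in the JIT disassembly later in this thread.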

First of all, nice job inlining the code. The complete convertToGrayscale is compiled into a single function without any calls.

But there’s still room for improvements. When looking at the disassembly in VS2019 debugger, I noticed 2 things.

  1. You don’t hoist loop invariants. There are several vector constants involved, but instead of caching the values in registers, the generated code recreates them on demand inside the loop, with instruction sequences like this:
mov         r8d,0FF00h
vmovd       xmm5,r8d
vpbroadcastd xmm5,xmm5

The code only uses the first 6 of the 16 available vector registers, so there are enough unused registers for all the invariant constants involved in the algorithm.

  2. There are redundant register-to-register moves like these:
vmovaps     xmm3,xmm1
vmovaps     xmm4,xmm2

If you manage to apply these 2 optimizations, I would expect up to a 2x speedup, because the loop body would contain roughly half as many instructions.

category:cq
theme:loop-opt
skill-level:expert
cost:large

@tannergooding
Member

The first issue is tracked by https://github.com/dotnet/coreclr/issues/17225. We don't currently support emitting constants for the Vector*.Create functions, so you need to manually create static readonly fields for the time being.

For the second issue, there may be something that could be improved in the register allocator. We have a few issues related to extraneous movaps that are inserted for varying code patterns.

@tannergooding
Member

CC. @CarolEidt, @echesakovMSFT

@Const-me
Author

Const-me commented Nov 21, 2019

@tannergooding Re “you need to manually create static readonly variables for the time being”: it became more than 2 times slower with static readonly fields, from 2.977 ms to 6.491 ms (for 1M pixels), because now the code has a lot of memory references in the body of the loop. Take a look: https://gist.github.com/Const-me/5434ced8c37f91af56f358871d92bef5

@tannergooding
Member

Ah, I see. This is a case where loop hoisting would be beneficial. I don't believe we have a bug tracking that for the hardware intrinsics today.
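For anyone following along, loop hoisting (loop-invariant code motion) just means moving a computation whose inputs don't change across iterations out of the loop body. A hypothetical before/after sketch in Python, standing in for the vector-constant case discussed here:

```python
def before(pixels, bias):
    # The mask depends only on 'bias', yet is rebuilt on every iteration --
    # analogous to re-materializing Vector128.Create(0xFF00) inside the loop.
    total = 0
    for p in pixels:
        mask = (bias << 8) | 0xFF   # loop-invariant work done in the loop
        total += p & mask
    return total

def after(pixels, bias):
    mask = (bias << 8) | 0xFF       # hoisted to the loop pre-header
    total = 0
    for p in pixels:
        total += p & mask
    return total
```

Both versions compute the same result; the second simply does the invariant work once, which is what the JIT would ideally do with the broadcast constants here.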

@Const-me
Author

This is a case where loop hoisting would be beneficial.

Right. For reference, here’s C++ code which does the same thing, processing the same amount of data:
https://github.com/Const-me/IntelIntrinsics/blob/master/CppDemo/brightness.cpp
https://github.com/Const-me/IntelIntrinsics/blob/master/CppDemo/brightness.inl
It completes in 0.354 ms on my PC, i.e. 8 times faster than .NET Core, despite using exactly the same instructions for the computation.

I guess I’ll stay with C++ for a while, for code like this.

@EgorBo
Member

EgorBo commented Nov 22, 2019

👍 for better loop hoisting.

private static readonly Vector128<byte> myConst = Vector128.Create((byte)42);

static int Foo1(Vector128<byte> a, Vector128<byte> ugly_const)
{
	return Sse2.MoveMask(Sse2.Add(a, ugly_const));
}

static int Foo2(Vector128<byte> v, int n)
{
	int result = 0;
	var local = myConst;
	for (int i = 0; i < n; i++)
	{
		result += Foo1(v, local);
	}
	return result;
}

But ideally it could be just this (with better loop hoisting):

private static readonly Vector128<byte> myConst = Vector128.Create((byte)42);

static int Foo1(Vector128<byte> a)
{
	return Sse2.MoveMask(Sse2.Add(a, myConst));
}

static int Foo2(Vector128<byte> v, int n)
{
	int result = 0;
	for (int i = 0; i < n; i++)
	{
		result += Foo1(v);
	}
	return result;
}

@Const-me
Author

OK, so my initial 8x difference seems to be unrelated to any of that; it’s just memory access costs. Here are the results for larger arrays:

128M pixels, best of 3:
C++: 55 ms
C# original: 137 ms
C#, manually propagated constants: 98.4 ms
Loading all 64 bytes at the start of the loop: same as above

511 M pixels, best of 3:
C++ 221 ms
C# 410 ms

.NET source:
https://gist.github.com/Const-me/2f743098cd3341fa8fc14a6cdc2dee98
VC++ assembly:
https://gist.github.com/Const-me/7a74e74264628aea0b4fe2e8ef004448
.NET assembly:
https://gist.github.com/Const-me/1330f061668ba0d3dd584b777c0e9df4

I know 2 possible explanations.

  1. Instruction reordering by the VC2015 compiler is IMO the most likely explanation. If that’s the case, it might be a showstopper for .NET SIMD, as it’s probably too expensive for the JIT. For many practical applications, a 2x performance difference is too large to ignore, especially given how easy and reliable [DllImport] is in .NET.

  2. There’s one more thing. Note how the VC++ compiler aligned the loop entry, 00007FF735AD11C0, by inserting a couple of nop instructions before that address. See section 9.3, “Instruction fetch, decoding and retirement”, of this document. This is a small and inexpensive codegen change, but I doubt it alone would fix a 2x performance difference.

@Const-me
Author

Const-me commented Nov 23, 2019

I had an error in my benchmark: it included the time it took the JIT to generate the x86 code. Here are the corrected numbers.

1M pixels:
C++: 0.35 ms
C#, manually propagated invariants: 0.40 ms
C#, initial: 0.67 ms

128M pixels:
C++: 57 ms
C#, manually propagated invariants: 55.7 ms
C#, initial: 98.6 ms

As you can see, C# performance is now quite comparable to C++, which is awesome.

But this is only true when manually loading all the invariants outside of the loop, and passing them as methods arguments. Here’s the complete code: https://gist.github.com/Const-me/3cb028a44abe2f207d147c167563b297

For SIMD algorithms more complex than this one, that requirement complicates the code a lot, unfortunately. Please consider doing something in the runtime (or maybe in the compiler) to fix the performance of the original, straightforward version.

@msftgits msftgits transferred this issue from dotnet/coreclr Jan 31, 2020
@msftgits msftgits added this to the 5.0 milestone Jan 31, 2020
@AndyAyersMS
Member

We now do a bit of hoisting, but the invariant computation is fragmented across statements, and it appears the hoisting algorithm is not set up for this. For example, we hoist the CreateScalarUnsafe but not the dependent (yet still invariant) BroadcastScalarToVector128 in the next statement.

***** BB02
STMT00032 (IL 0x008...  ???)
N004 (  2,  5) [000183] -A------R---              *  ASG       simd16 (copy) $2c2
N003 (  1,  1) [000181] D------N----              +--*  LCL_VAR   simd16<System.Runtime.Intrinsics.Vector128`1[Int32]> V18 tmp9         d:2 $2c2
N002 (  2,  5) [000180] ------------              \--*  HWINTRINSIC simd16 int CreateScalarUnsafe $2c2
N001 (  1,  4) [000179] ------------                 \--*  CNS_INT   int    255 $48

***** BB02
STMT00033 (IL 0x008...  ???)
N004 (  2,  3) [000188] -A------R---              *  ASG       simd16 (copy) $2c3
N003 (  1,  1) [000186] D------N----              +--*  LCL_VAR   simd16<System.Runtime.Intrinsics.Vector128`1[Int32]> V17 tmp8         d:2 $2c3
N002 (  2,  2) [000185] ------------              \--*  HWINTRINSIC simd16 int BroadcastScalarToVector128 $2c3
N001 (  1,  1) [000184] ------------                 \--*  LCL_VAR   simd16<System.Runtime.Intrinsics.Vector128`1[Int32]> V18 tmp9         u:2 (last use) $2c2


Hoisting a copy of [000180] into PreHeader for loop L00 <BB02..BB02>:
N002 (  2,  5) [000180] ------------              *  HWINTRINSIC simd16 int CreateScalarUnsafe $2c2
N001 (  1,  4) [000179] ------------              \--*  CNS_INT   int    255 $48

So one follow-up here is to examine why the JIT is unable to hoist a statement whose only loop dependence is on an earlier statement that is itself loop invariant.
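That follow-up amounts to making invariance transitive: a statement is hoistable if every input is either defined outside the loop or defined by an already-hoistable statement. A toy fixed-point iteration in Python (hypothetical; the names and data structures are mine, not the real JIT's):

```python
def invariant_stmts(stmts, loop_defs):
    """stmts: list of (target, [inputs]); loop_defs: names defined in the loop.
    Returns the set of targets whose statements are loop-invariant,
    propagating invariance through chains like tmp9 -> tmp8 above."""
    invariant = set()
    changed = True
    while changed:
        changed = False
        for target, inputs in stmts:
            if target in invariant:
                continue
            # Every input is defined outside the loop, or by an
            # already-invariant statement inside it.
            if all(i not in loop_defs or i in invariant for i in inputs):
                invariant.add(target)
                changed = True
    return invariant

# tmp9 = CreateScalarUnsafe(255); tmp8 = BroadcastScalarToVector128(tmp9)
stmts = [("tmp9", []), ("tmp8", ["tmp9"])]
print(invariant_stmts(stmts, loop_defs={"tmp9", "tmp8"}))  # both hoistable
```

With a single non-transitive pass, only tmp9 would be marked, which matches the behavior shown in the dump above.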

We could also look at changing things upstream so the trees are not fragmented. The fragmentation comes from the C# and the IL, where we see st/ld pairs in some of the paths, e.g.

public static unsafe Vector128<byte> Create(byte value)
{
    if (Avx2.IsSupported)
    {
        Vector128<byte> result = CreateScalarUnsafe(value); // < v, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ? >
        return Avx2.BroadcastScalarToVector128(result); // < v, v, v, v, v, v, v, v, v, v, v, v, v, v, v, v >
    }
    // ...
}

produces a store/load through the result local:

Importing BB10 (PC=007) of 'System.Runtime.Intrinsics.Vector128:Create(int):System.Runtime.Intrinsics.Vector128`1[Int32]'
    [ 0]   7 (0x007) ldarg.0
    [ 1]   8 (0x008) call 06002DC8
In Compiler::impImportCall: opcode is call, kind=0, callRetType is struct, structSize is 16
Named Intrinsic System.Runtime.Intrinsics.Vector128.CreateScalarUnsafe: Notify VM instruction set (Vector128) must be supported.
Recognized
  Known type Vector128<int>
  Known type Vector128<int>
Notify VM instruction set (SSE2) must be supported.

    [ 1]  13 (0x00d) stloc.0
lvaGrabTemp returning 18 (V18 tmp9) (a long lifetime temp) called for Inline stloc first use temp.
  Known type Vector128<int>


               [000183] -A----------              *  ASG       simd16 (copy)
               [000181] D------N----              +--*  LCL_VAR   simd16<System.Runtime.Intrinsics.Vector128`1[Int32]> V18 tmp9         
               [000180] ------------              \--*  HWINTRINSIC simd16 int CreateScalarUnsafe
               [000179] ------------                 \--*  CNS_INT   int    255

    [ 0]  14 (0x00e) ldloc.0
    [ 1]  15 (0x00f) call 0600338C
In Compiler::impImportCall: opcode is call, kind=0, callRetType is struct, structSize is 16
Named Intrinsic System.Runtime.Intrinsics.X86.Avx2.BroadcastScalarToVector128: Recognized
  Known type Vector128<int>
  Known type Vector128<int>

    [ 1]  20 (0x014) ret

I'm a bit surprised Roslyn doesn't optimize away this local; perhaps stores to locals are considered observable and so don't get elided. We could also try fixing this in source via nested calls, or see whether the JIT can peephole stloc/ldloc pairs when we know the ldloc is the only use. Since we do an IL prescan, we could potentially know this.

So it's also worth following up on a potential importer optimization to reduce temp usage.
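The stloc/ldloc peephole idea could look roughly like this toy Python pass over symbolic IL (hypothetical; the representation is mine): drop a stloc.N immediately followed by ldloc.N when the prescan says that ldloc is the local's only use, leaving the value on the evaluation stack.

```python
def elide_stloc_ldloc(il, use_counts):
    """il: list of (opcode, operand) tuples; use_counts: local -> ldloc count."""
    out = []
    i = 0
    while i < len(il):
        op, arg = il[i]
        if (op == "stloc" and i + 1 < len(il)
                and il[i + 1] == ("ldloc", arg)
                and use_counts.get(arg, 0) == 1):
            i += 2  # skip both; the value simply stays on the stack
        else:
            out.append(il[i])
            i += 1
    return out

# Models the CreateScalarUnsafe -> stloc.0 -> ldloc.0 -> Broadcast... pattern.
il = [("call", "CreateScalarUnsafe"), ("stloc", 0),
      ("ldloc", 0), ("call", "BroadcastScalarToVector128")]
print(elide_stloc_ldloc(il, {0: 1}))
```

With the pair elided, the importer would see one nested call tree instead of two fragmented statements, which is exactly the shape the hoisting code handles today.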

I am going to keep this in 5.0 as it impacts HW intrinsic CQ. It may even have broader CQ impact if we look at importer temp elimination or at fixing hoisting to handle cases like this.

@BruceForstall
Member

@AndyAyersMS I'm going to move this out of 5.0 since it doesn't look like we're going to be able to get to it at this point.

@BruceForstall BruceForstall modified the milestones: 5.0.0, 6.0.0 Jun 22, 2020
@BruceForstall BruceForstall added the JitUntriaged CLR JIT issues needing additional triage label Oct 28, 2020
@BruceForstall BruceForstall removed the JitUntriaged CLR JIT issues needing additional triage label Jan 15, 2021
@JulieLeeMSFT JulieLeeMSFT added the needs-further-triage Issue has been initially triaged, but needs deeper consideration or reconsideration label Mar 23, 2021
@EgorBo
Member

EgorBo commented Apr 2, 2021

Moving to Future, as it looks like hoisting of Vector.Create is not the only issue here.
However, it's worth noting that the performance of this benchmark is much better in .NET 6.0 than in .NET Core 3.0.

@EgorBo EgorBo removed the needs-further-triage Issue has been initially triaged, but needs deeper consideration or reconsideration label Apr 2, 2021
@EgorBo EgorBo modified the milestones: 6.0.0, Future Apr 2, 2021
@kunalspathak
Member

Latest assembly code produced:

; Assembly listing for method ConsoleApp15._13811:convertToGrayscale(long,long,int)
; Emitting BLENDED_CODE for X64 CPU with AVX - Windows
; optimized code
; rsp based frame
; fully interruptible
; No PGO data
; 0 inlinees with PGO data; 14 single block inlinees; 0 inlinees without PGO data
; invoked as altjit
; Final local variable assignments
;
;  V00 arg0         [V00,T00] ( 10, 28   )    long  ->  rcx        
;  V01 arg1         [V01,T02] (  5, 14   )    long  ->  rdx        
;  V02 arg2         [V02,T04] (  3,  3   )     int  ->   r8         single-def
;  V03 loc0         [V03,T03] (  3,  6   )    long  ->  rax         single-def
;  V04 loc1         [V04,T10] (  4, 16   )  simd16  ->  registers   ld-addr-op
;  V05 loc2         [V05,T11] (  4, 16   )  simd16  ->  registers   ld-addr-op
;  V06 loc3         [V06,T12] (  4, 16   )  simd16  ->  registers   ld-addr-op
;  V07 loc4         [V07,T42] (  2,  8   )  simd16  ->  mm6        
;* V08 loc5         [V08    ] (  0,  0   )  simd16  ->  zero-ref   
;# V09 OutArgs      [V09    ] (  1,  1   )  lclBlk ( 0) [rsp+00H]   "OutgoingArgSpace"
;  V10 tmp1         [V10,T17] (  2, 16   )  simd16  ->  mm5         "impAppendStmt"
;  V11 tmp2         [V11,T13] (  4, 16   )  simd16  ->  mm5         "Inline stloc first use temp"
;  V12 tmp3         [V12,T14] (  4, 16   )  simd16  ->  mm6         "Inline stloc first use temp"
;* V13 tmp4         [V13    ] (  0,  0   )    long  ->  zero-ref    "Inlining Arg"
;  V14 tmp5         [V14,T32] (  3, 12   )  simd16  ->  mm7         "Inline stloc first use temp"
;  V15 tmp6         [V15,T18] (  2, 16   )  simd16  ->  mm8         "Inlining Arg"
;  V16 tmp7         [V16,T19] (  2, 16   )  simd16  ->  mm7         "Inlining Arg"
;  V17 tmp8         [V17,T33] (  3, 12   )  simd16  ->  mm8         "Inline stloc first use temp"
;  V18 tmp9         [V18,T20] (  2, 16   )  simd16  ->  mm9         "Inlining Arg"
;  V19 tmp10        [V19,T21] (  2, 16   )  simd16  ->  mm8         "Inlining Arg"
;  V20 tmp11        [V20,T05] (  4, 32   )  simd16  ->  mm5         "Inlining Arg"
;  V21 tmp12        [V21,T06] (  4, 32   )  simd16  ->  mm6         "Inlining Arg"
;  V22 tmp13        [V22,T34] (  3, 12   )  simd16  ->  mm9         "Inline stloc first use temp"
;  V23 tmp14        [V23,T22] (  2, 16   )  simd16  ->  mm7         "Inlining Arg"
;  V24 tmp15        [V24,T23] (  2, 16   )  simd16  ->  mm8         "Inlining Arg"
;  V25 tmp16        [V25,T24] (  2, 16   )  simd16  ->  mm5         "Inlining Arg"
;  V26 tmp17        [V26,T01] (  3, 24   )    long  ->   r8         "Inlining Arg"
;  V27 tmp18        [V27,T15] (  4, 16   )  simd16  ->  mm6         "Inline stloc first use temp"
;  V28 tmp19        [V28,T16] (  4, 16   )  simd16  ->  mm7         "Inline stloc first use temp"
;* V29 tmp20        [V29    ] (  0,  0   )    long  ->  zero-ref    "Inlining Arg"
;  V30 tmp21        [V30,T35] (  3, 12   )  simd16  ->  mm8         "Inline stloc first use temp"
;  V31 tmp22        [V31,T25] (  2, 16   )  simd16  ->  mm9         "Inlining Arg"
;  V32 tmp23        [V32,T26] (  2, 16   )  simd16  ->  mm8         "Inlining Arg"
;  V33 tmp24        [V33,T36] (  3, 12   )  simd16  ->  mm9         "Inline stloc first use temp"
;  V34 tmp25        [V34,T27] (  2, 16   )  simd16  ->  mm10         "Inlining Arg"
;  V35 tmp26        [V35,T28] (  2, 16   )  simd16  ->  mm9         "Inlining Arg"
;  V36 tmp27        [V36,T07] (  4, 32   )  simd16  ->  mm6         "Inlining Arg"
;  V37 tmp28        [V37,T08] (  4, 32   )  simd16  ->  mm7         "Inlining Arg"
;  V38 tmp29        [V38,T37] (  3, 12   )  simd16  ->  mm10         "Inline stloc first use temp"
;  V39 tmp30        [V39,T29] (  2, 16   )  simd16  ->  mm7         "Inlining Arg"
;  V40 tmp31        [V40,T30] (  2, 16   )  simd16  ->  mm8         "Inlining Arg"
;  V41 tmp32        [V41,T31] (  2, 16   )  simd16  ->  mm6         "Inlining Arg"
;  V42 cse0         [V42,T09] (  5, 17   )  simd16  ->  mm1         "CSE - moderate"
;  V43 cse1         [V43,T38] (  3,  9   )  simd16  ->  mm0         "CSE - moderate"
;  V44 cse2         [V44,T39] (  3,  9   )  simd16  ->  mm2         "CSE - moderate"
;  V45 cse3         [V45,T40] (  3,  9   )  simd16  ->  mm3         "CSE - moderate"
;  V46 cse4         [V46,T41] (  3,  9   )  simd16  ->  mm4         "CSE - moderate"
;
; Lcl frame size = 88

G_M6610_IG01:
       sub      rsp, 88
       vzeroupper 
       vmovaps  qword ptr [rsp+40H], xmm6
       vmovaps  qword ptr [rsp+30H], xmm7
       vmovaps  qword ptr [rsp+20H], xmm8
       vmovaps  qword ptr [rsp+10H], xmm9
       vmovaps  qword ptr [rsp], xmm10
						;; size=41 bbWeight=4    PerfScore 45.00
G_M6610_IG02:
       shl      r8d, 2
       movsxd   rax, r8d
       add      rax, rcx
       cmp      rcx, rax
       jae      G_M6610_IG04
       vmovupd  xmm0, xmmword ptr [reloc @RWD00]
       vmovupd  xmm1, xmmword ptr [reloc @RWD16]
       vmovupd  xmm2, xmmword ptr [reloc @RWD32]
       vmovupd  xmm3, xmmword ptr [reloc @RWD48]
       vmovupd  xmm4, xmmword ptr [reloc @RWD64]
       align    [0 bytes for IG03]
						;; size=64 bbWeight=1    PerfScore 17.25
G_M6610_IG03:
       vmovdqu  xmm5, xmmword ptr [rcx]
       vmovdqu  xmm6, xmmword ptr [rcx+16]
       vmovaps  xmm7, xmm0
       vpand    xmm8, xmm5, xmm7
       vpand    xmm7, xmm6, xmm7
       vpackusdw xmm7, xmm8, xmm7
       vpslldq  xmm7, xmm7, 1
       vmovaps  xmm8, xmm1
       vpand    xmm9, xmm5, xmm8
       vpand    xmm8, xmm6, xmm8
       vpackusdw xmm8, xmm9, xmm8
       vpsrldq  xmm5, xmm5, 1
       vpsrldq  xmm6, xmm6, 1
       vmovaps  xmm9, xmm1
       vpand    xmm5, xmm5, xmm9
       vpand    xmm6, xmm6, xmm9
       vpackusdw xmm5, xmm5, xmm6
       vpmulhuw xmm7, xmm7, xmm2
       vpmulhuw xmm8, xmm8, xmm3
       vpmulhuw xmm5, xmm5, xmm4
       vpaddusw xmm6, xmm7, xmm8
       vpaddusw xmm5, xmm6, xmm5
       vpsrlw   xmm5, xmm5, 8
       lea      r8, [rcx+32]
       vmovdqu  xmm6, xmmword ptr [r8]
       vmovdqu  xmm7, xmmword ptr [r8+16]
       vmovaps  xmm8, xmm0
       vpand    xmm9, xmm6, xmm8
       vpand    xmm8, xmm7, xmm8
       vpackusdw xmm8, xmm9, xmm8
       vpslldq  xmm8, xmm8, 1
       vmovaps  xmm9, xmm1
       vpand    xmm10, xmm6, xmm9
       vpand    xmm9, xmm7, xmm9
       vpackusdw xmm9, xmm10, xmm9
       vpsrldq  xmm6, xmm6, 1
       vpsrldq  xmm7, xmm7, 1
       vmovaps  xmm10, xmm1
       vpand    xmm6, xmm6, xmm10
       vpand    xmm7, xmm7, xmm10
       vpackusdw xmm6, xmm6, xmm7
       vpmulhuw xmm7, xmm8, xmm2
       vpmulhuw xmm8, xmm9, xmm3
       vpmulhuw xmm6, xmm6, xmm4
       vpaddusw xmm7, xmm7, xmm8
       vpaddusw xmm6, xmm7, xmm6
       vpsrlw   xmm6, xmm6, 8
       vpackuswb xmm5, xmm5, xmm6
       vmovdqu  xmmword ptr [rdx], xmm5
       add      rcx, 64
       add      rdx, 16
       cmp      rcx, rax
       jb       G_M6610_IG03
						;; size=271 bbWeight=4    PerfScore 264.33
G_M6610_IG04:
       vmovaps  xmm6, qword ptr [rsp+40H]
       vmovaps  xmm7, qword ptr [rsp+30H]
       vmovaps  xmm8, qword ptr [rsp+20H]
       vmovaps  xmm9, qword ptr [rsp+10H]
       vmovaps  xmm10, qword ptr [rsp]
       add      rsp, 88
       ret      
						;; size=39 bbWeight=1    PerfScore 21.25
RWD00  	dq	000000FF000000FFh, 000000FF000000FFh
RWD16  	dq	0000FF000000FF00h, 0000FF000000FF00h
RWD32  	dq	4C854C854C854C85h, 4C854C854C854C85h
RWD48  	dq	962C962C962C962Ch, 962C962C962C962Ch
RWD64  	dq	1D4E1D4E1D4E1D4Eh, 1D4E1D4E1D4E1D4Eh


; Total bytes of code 373, prolog size 36, PerfScore 389.33, instruction count 78, allocated bytes for code 415 (MethodHash=8dbde62d) for method ConsoleApp15._13811:convertToGrayscale(long,long,int)
; ============================================================

Projects
Status: Optimizations
Development

No branches or pull requests

8 participants