Optimize FMA codegen based on the overwritten #58196

weilinwa · 2021-08-26T17:04:27Z

This is for #12984. @kunalspathak @tannergooding, thanks!

ghost · 2021-08-26T17:04:35Z

Tagging subscribers to this area: @JulieLeeMSFT
See info in area-owners.md if you want to be subscribed.

Author:	weilinwa
Assignees:	-
Labels:	`area-CodeGen-coreclr`
Milestone:	-

SingleAccretion

Some questions and suggestions.

src/coreclr/jit/gentree.cpp

kunalspathak

Added some comments. Did you run the superpmi asmdiff?

src/coreclr/jit/gentree.cpp

src/coreclr/jit/hwintrinsiccodegenxarch.cpp

src/coreclr/jit/lsraxarch.cpp

weilinwa · 2021-08-27T20:36:56Z

> Added some comments. Did you run the superpmi asmdiff?

No, I haven't. What is this for?

SingleAccretion · 2021-08-27T20:43:02Z

> What is this for?

Pretty much all Jit changes are run through diffs (and SPMI is probably the most convenient tool for getting them), so that we can assess the impact on the generated code and how much existing test coverage we have.

weilinwa · 2021-08-30T22:02:26Z

> What is this for?
>
> Pretty much all Jit changes are run through diffs (and SPMI is probably the most convenient tool for getting them), so that we can assess the impact on the generated code and how much existing test coverage we have.

@SingleAccretion, I ran superpmi.py with asmdiffs and saw quite a few errors, most of which are from "JIT.HardwareIntrinsics.Arm.Helpers:FPRSqrtStepFused(float,float):float" or other similar tests. How can I find these tests to debug? And it's very confusing that my change is supposed to only work for xarch; why are Arm tests complaining?

SingleAccretion · 2021-08-31T09:43:59Z

> How can I find these tests to debug?

@weilinwa One of the nicest things with SPMI is that it makes debugging easy. When you encountered errors (I presume asserts), the tool should've printed a "reproduction command", with the path to the native SPMI executable and a list of .mcs. From there it should be straightforward to use any native debugger (I personally use VS's "executable project" feature) to drill into the code (I recommend using the Debug builds of native SPMI and Jit for this; the script uses Checked by default).

> And it's very confusing that my change is supposed to only work for xarch; why are Arm tests complaining?

I am not sure why that is either.

weilinwa · 2021-08-31T18:03:28Z

> How can I find these tests to debug?
>
> @weilinwa One of the nicest things with SPMI is that it makes debugging easy. When you encountered errors (I presume asserts), the tool should've printed a "reproduction command", with the path to the native SPMI executable and a list of .mcs. From there it should be straightforward to use any native debugger (I personally use VS's "executable project" feature) to drill into the code (I recommend using the Debug builds of native SPMI and Jit for this; the script uses Checked by default).
>
> And it's very confusing that my change is supposed to only work for xarch; why are Arm tests complaining?
>
> I am not sure why that is either.

@SingleAccretion, I got the "Error: no baseline JIT found" when I ran asmdiffs with -build_type Release or -build_type Debug. Only the Checked worked for me. Are the options I used correct?

SingleAccretion · 2021-08-31T18:06:17Z

> Only the Checked worked for me. Are the options I used correct?

Yes. I believe we only have prebuilt Jits for the Checked config. That said, you can of course supply your own Jit for the base (or diff) via the -base_jit_path and -diff_jit_path options.

kunalspathak · 2021-08-31T18:07:17Z

> @SingleAccretion, I got the "Error: no baseline JIT found" when I ran asmdiffs with -build_type Release or -build_type Debug. Only the Checked worked for me. Are the options I used correct?

The correct way to use this is:

python superpmi.py asmdiffs -f benchmarks -base_jit_path path\to\before\clrjit_win_x64_x64.dll -diff_jit_path path\to\after\clrjit_win_x64_x64.dll -target_os windows -target_arch x64

python superpmi.py asmdiffs -f benchmarks -base_jit_path path\to\before\clrjit_unix_x64_x64.dll -diff_jit_path path\to\after\clrjit_unix_x64_x64.dll -target_os Linux -target_arch x64

This runs asmdiffs on the benchmarks collection. You might also want to try libraries.pmi (.NET Core libraries methods), coreclr_tests (test cases), and asp (ASP.NET benchmarks).
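To repeat the same comparison over several collections, the command lines above can be generated rather than typed by hand. The following is only a sketch: `asmdiffs_commands` is a hypothetical helper, the JIT paths are the placeholders from the commands above, and it uses only the flags quoted in this thread. It builds the argument lists and prints them without running anything.

```python
# Hypothetical helper (not part of superpmi.py): builds one asmdiffs
# command line per collection, using only the flags quoted in this thread.

def asmdiffs_commands(base_jit, diff_jit, target_os, target_arch, collections):
    """Return one superpmi.py argument list per collection filter."""
    commands = []
    for name in collections:
        commands.append([
            "python", "superpmi.py", "asmdiffs",
            "-f", name,
            "-base_jit_path", base_jit,
            "-diff_jit_path", diff_jit,
            "-target_os", target_os,
            "-target_arch", target_arch,
        ])
    return commands


if __name__ == "__main__":
    # Placeholder paths copied from the example commands; replace with real ones.
    for cmd in asmdiffs_commands(
        r"path\to\before\clrjit_win_x64_x64.dll",
        r"path\to\after\clrjit_win_x64_x64.dll",
        "windows",
        "x64",
        ["benchmarks", "libraries.pmi", "coreclr_tests", "asp"],
    ):
        print(" ".join(cmd))
```

Each printed line can be pasted into a shell, or the lists can be handed to `subprocess.run` if the runs should be automated.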

weilinwa · 2021-09-09T21:55:19Z

@kunalspathak @tannergooding, I've modified the code logic to check the different IsContainableHWIntrinsicOp() possibilities under each case of overwrittenOpNum. Please take a look.
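As a rough picture of that case analysis (a model for illustration, not the JIT's code), the sketch below picks an FMA form for `a*b + c` from two inputs: which operand ends up in the destination register (the "overwritten" one) and which operand is read from memory (the "contained" one). The operand roles follow the instruction definitions for the 132, 213 and 231 forms; `choose_form` and `FORMS` are hypothetical names, and the sketch assumes exactly one contained operand. It ignores the extra upper-bits constraint that applies to the scalar variants.

```python
# Illustrative model only; NOT the JIT implementation.
# x1 = destination register (overwritten), x2 = second register,
# x3 = third operand (register or memory, i.e. the containable slot).
FORMS = {
    "132": lambda x1, x2, x3: x1 * x3 + x2,
    "213": lambda x1, x2, x3: x2 * x1 + x3,
    "231": lambda x1, x2, x3: x2 * x3 + x1,
}


def choose_form(overwritten, contained):
    """Pick a form computing a*b+c.

    overwritten: "a", "b" or "c", the operand placed in x1.
    contained:   the operand placed in x3; must differ from overwritten.
    Returns (form, (x1_name, x2_name, x3_name)).
    """
    if overwritten == contained:
        raise ValueError("an operand cannot be both overwritten and contained")
    remaining = ({"a", "b", "c"} - {overwritten, contained}).pop()
    if overwritten == "c":
        form = "231"  # addend lives in the destination
    elif contained == "c":
        form = "213"  # addend is the containable operand
    else:
        form = "132"  # addend is the plain register operand
    return form, (overwritten, remaining, contained)


if __name__ == "__main__":
    values = {"a": 2.0, "b": 3.0, "c": 5.0}
    for overwritten in "abc":
        for contained in "abc":
            if overwritten == contained:
                continue
            form, slots = choose_form(overwritten, contained)
            result = FORMS[form](*(values[s] for s in slots))
            print(overwritten, contained, form, slots, result)
```

Every combination evaluates to the same `a*b + c`, because multiplication is commutative and the three forms only differ in which slot holds the addend.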

Asm diffs

benchmarks.run.windows.x64.checked.mch:


Summary of Code Size diffs:
(Lower is better)

Total bytes of base: 1071
Total bytes of diff: 1083
Total bytes of delta: 12 (1.12% of base)
Total relative delta: 0.08
    diff is a regression.
    relative diff is a regression.

Detail diffs



Top file regressions (bytes):
          12 : 12262.dasm (7.55% of base)

1 total files with Code Size differences (0 improved, 1 regressed), 2 unchanged.

Top method regressions (bytes):
          12 ( 7.55% of base) : 12262.dasm - System.Numerics.Matrix4x4:Lerp(System.Numerics.Matrix4x4,System.Numerics.Matrix4x4,float):System.Numerics.Matrix4x4

Top method regressions (percentages):
          12 ( 7.55% of base) : 12262.dasm - System.Numerics.Matrix4x4:Lerp(System.Numerics.Matrix4x4,System.Numerics.Matrix4x4,float):System.Numerics.Matrix4x4

1 total methods with Code Size differences (0 improved, 1 regressed), 2 unchanged.

coreclr_tests.pmi.windows.x64.checked.mch:


Summary of Code Size diffs:
(Lower is better)

Total bytes of base: 62236
Total bytes of diff: 62336
Total bytes of delta: 100 (0.16% of base)
Total relative delta: 8.10
    diff is a regression.
    relative diff is a regression.

Detail diffs



Top file regressions (bytes):
           4 : 118516.dasm (23.53% of base)
           4 : 126128.dasm (23.53% of base)
           4 : 170806.dasm (23.53% of base)
           4 : 129338.dasm (23.53% of base)
           4 : 219361.dasm (5.56% of base)
           4 : 137701.dasm (23.53% of base)
           4 : 6028.dasm (23.53% of base)
           4 : 219357.dasm (4.55% of base)
           4 : 219358.dasm (5.56% of base)
           4 : 6042.dasm (23.53% of base)
           4 : 134051.dasm (23.53% of base)
           4 : 131147.dasm (23.53% of base)
           4 : 134065.dasm (23.53% of base)
           4 : 219351.dasm (5.56% of base)
           4 : 219354.dasm (6.25% of base)
           4 : 112774.dasm (23.53% of base)
           4 : 219359.dasm (5.00% of base)
           4 : 219360.dasm (5.00% of base)
           4 : 117293.dasm (23.53% of base)
           4 : 43453.dasm (23.53% of base)

Top file improvements (bytes):
         -19 : 219345.dasm (-6.71% of base)
         -19 : 219328.dasm (-6.86% of base)
         -17 : 219347.dasm (-4.51% of base)
         -11 : 219330.dasm (-2.99% of base)
          -4 : 84103.dasm (-10.81% of base)
          -1 : 141012.dasm (-0.16% of base)
          -1 : 140956.dasm (-0.16% of base)
          -1 : 141076.dasm (-0.16% of base)
          -1 : 141416.dasm (-0.16% of base)
          -1 : 141440.dasm (-0.16% of base)
          -1 : 141060.dasm (-0.16% of base)
          -1 : 141400.dasm (-0.16% of base)
          -1 : 141432.dasm (-0.16% of base)
          -1 : 141464.dasm (-0.16% of base)
          -1 : 141448.dasm (-0.16% of base)
          -1 : 219333.dasm (-0.61% of base)
          -1 : 141424.dasm (-0.16% of base)
          -1 : 141052.dasm (-0.16% of base)
          -1 : 141408.dasm (-0.16% of base)
          -1 : 140980.dasm (-0.16% of base)

85 total files with Code Size differences (35 improved, 50 regressed), 334 unchanged.

Top method regressions (bytes):
           4 (23.53% of base) : 170806.dasm - JIT.HardwareIntrinsics.Arm.Helpers:FPRecipStepFused(double,double):double
           4 (23.53% of base) : 129338.dasm - JIT.HardwareIntrinsics.Arm.Helpers:FPRecipStepFused(double,double):double
           4 (23.53% of base) : 137701.dasm - JIT.HardwareIntrinsics.Arm.Helpers:FPRecipStepFused(double,double):double
           4 (23.53% of base) : 6042.dasm - JIT.HardwareIntrinsics.Arm.Helpers:FPRecipStepFused(double,double):double
           4 (23.53% of base) : 131147.dasm - JIT.HardwareIntrinsics.Arm.Helpers:FPRecipStepFused(double,double):double
           4 (23.53% of base) : 134065.dasm - JIT.HardwareIntrinsics.Arm.Helpers:FPRecipStepFused(double,double):double
           4 (23.53% of base) : 43453.dasm - JIT.HardwareIntrinsics.Arm.Helpers:FPRecipStepFused(double,double):double
           4 (23.53% of base) : 124425.dasm - JIT.HardwareIntrinsics.Arm.Helpers:FPRecipStepFused(double,double):double
           4 (23.53% of base) : 111033.dasm - JIT.HardwareIntrinsics.Arm.Helpers:FPRecipStepFused(double,double):double
           4 (23.53% of base) : 120255.dasm - JIT.HardwareIntrinsics.Arm.Helpers:FPRecipStepFused(double,double):double
           4 (23.53% of base) : 135993.dasm - JIT.HardwareIntrinsics.Arm.Helpers:FPRecipStepFused(double,double):double
           4 (23.53% of base) : 118530.dasm - JIT.HardwareIntrinsics.Arm.Helpers:FPRecipStepFused(double,double):double
           4 (23.53% of base) : 126142.dasm - JIT.HardwareIntrinsics.Arm.Helpers:FPRecipStepFused(double,double):double
           4 (23.53% of base) : 171556.dasm - JIT.HardwareIntrinsics.Arm.Helpers:FPRecipStepFused(double,double):double
           4 (23.53% of base) : 117307.dasm - JIT.HardwareIntrinsics.Arm.Helpers:FPRecipStepFused(double,double):double
           4 (23.53% of base) : 112788.dasm - JIT.HardwareIntrinsics.Arm.Helpers:FPRecipStepFused(double,double):double
           4 (23.53% of base) : 118516.dasm - JIT.HardwareIntrinsics.Arm.Helpers:FPRecipStepFused(float,float):float
           4 (23.53% of base) : 126128.dasm - JIT.HardwareIntrinsics.Arm.Helpers:FPRecipStepFused(float,float):float
           4 (23.53% of base) : 6028.dasm - JIT.HardwareIntrinsics.Arm.Helpers:FPRecipStepFused(float,float):float
           4 (23.53% of base) : 134051.dasm - JIT.HardwareIntrinsics.Arm.Helpers:FPRecipStepFused(float,float):float

Top method improvements (bytes):
         -19 (-6.71% of base) : 219345.dasm - MathFusedMultiplyAddTest.Program:TestExplicitFmaUsage1(byref,double)
         -19 (-6.86% of base) : 219328.dasm - MathFusedMultiplyAddTest.Program:TestExplicitFmaUsage1(byref,float)
         -17 (-4.51% of base) : 219347.dasm - MathFusedMultiplyAddTest.Program:TestExplicitFmaUsage3(byref,double)
         -11 (-2.99% of base) : 219330.dasm - MathFusedMultiplyAddTest.Program:TestExplicitFmaUsage3(byref,float)
          -4 (-10.81% of base) : 84103.dasm - Runtime_39424:TestLclFldAddrIntrinsicsFMA_MulipluAddScalar():double
          -1 (-0.65% of base) : 219313.dasm - MathFusedMultiplyAddTest.Program:TestExplicitFmaUsage4(byref,double)
          -1 (-0.62% of base) : 219331.dasm - MathFusedMultiplyAddTest.Program:TestExplicitFmaUsage4(byref,float)
          -1 (-0.52% of base) : 219314.dasm - MathFusedMultiplyAddTest.Program:TestExplicitFmaUsage5(byref,double)
          -1 (-0.49% of base) : 219332.dasm - MathFusedMultiplyAddTest.Program:TestExplicitFmaUsage5(byref,float)
          -1 (-0.61% of base) : 219315.dasm - MathFusedMultiplyAddTest.Program:TestExplicitFmaUsage6(byref,double)
          -1 (-0.61% of base) : 219333.dasm - MathFusedMultiplyAddTest.Program:TestExplicitFmaUsage6(byref,float)
          -1 (-0.16% of base) : 141408.dasm - TestStruct:RunStructFldScenario_Load(JIT.HardwareIntrinsics.X86.AlternatingTernaryOpTest__MultiplyAddSubtractDouble):this
          -1 (-0.16% of base) : 141004.dasm - TestStruct:RunStructFldScenario_Load(JIT.HardwareIntrinsics.X86.AlternatingTernaryOpTest__MultiplyAddSubtractDouble):this
          -1 (-0.16% of base) : 141012.dasm - TestStruct:RunStructFldScenario_Load(JIT.HardwareIntrinsics.X86.AlternatingTernaryOpTest__MultiplyAddSubtractSingle):this
          -1 (-0.16% of base) : 141416.dasm - TestStruct:RunStructFldScenario_Load(JIT.HardwareIntrinsics.X86.AlternatingTernaryOpTest__MultiplyAddSubtractSingle):this
          -1 (-0.16% of base) : 141440.dasm - TestStruct:RunStructFldScenario_Load(JIT.HardwareIntrinsics.X86.AlternatingTernaryOpTest__MultiplySubtractAddDouble):this
          -1 (-0.16% of base) : 141052.dasm - TestStruct:RunStructFldScenario_Load(JIT.HardwareIntrinsics.X86.AlternatingTernaryOpTest__MultiplySubtractAddDouble):this
          -1 (-0.16% of base) : 141060.dasm - TestStruct:RunStructFldScenario_Load(JIT.HardwareIntrinsics.X86.AlternatingTernaryOpTest__MultiplySubtractAddSingle):this
          -1 (-0.16% of base) : 141448.dasm - TestStruct:RunStructFldScenario_Load(JIT.HardwareIntrinsics.X86.AlternatingTernaryOpTest__MultiplySubtractAddSingle):this
          -1 (-0.16% of base) : 140956.dasm - TestStruct:RunStructFldScenario_Load(JIT.HardwareIntrinsics.X86.SimpleTernaryOpTest__MultiplyAddDouble):this

Top method regressions (percentages):
           4 (23.53% of base) : 170806.dasm - JIT.HardwareIntrinsics.Arm.Helpers:FPRecipStepFused(double,double):double
           4 (23.53% of base) : 129338.dasm - JIT.HardwareIntrinsics.Arm.Helpers:FPRecipStepFused(double,double):double
           4 (23.53% of base) : 137701.dasm - JIT.HardwareIntrinsics.Arm.Helpers:FPRecipStepFused(double,double):double
           4 (23.53% of base) : 6042.dasm - JIT.HardwareIntrinsics.Arm.Helpers:FPRecipStepFused(double,double):double
           4 (23.53% of base) : 131147.dasm - JIT.HardwareIntrinsics.Arm.Helpers:FPRecipStepFused(double,double):double
           4 (23.53% of base) : 134065.dasm - JIT.HardwareIntrinsics.Arm.Helpers:FPRecipStepFused(double,double):double
           4 (23.53% of base) : 43453.dasm - JIT.HardwareIntrinsics.Arm.Helpers:FPRecipStepFused(double,double):double
           4 (23.53% of base) : 124425.dasm - JIT.HardwareIntrinsics.Arm.Helpers:FPRecipStepFused(double,double):double
           4 (23.53% of base) : 111033.dasm - JIT.HardwareIntrinsics.Arm.Helpers:FPRecipStepFused(double,double):double
           4 (23.53% of base) : 120255.dasm - JIT.HardwareIntrinsics.Arm.Helpers:FPRecipStepFused(double,double):double
           4 (23.53% of base) : 135993.dasm - JIT.HardwareIntrinsics.Arm.Helpers:FPRecipStepFused(double,double):double
           4 (23.53% of base) : 118530.dasm - JIT.HardwareIntrinsics.Arm.Helpers:FPRecipStepFused(double,double):double
           4 (23.53% of base) : 126142.dasm - JIT.HardwareIntrinsics.Arm.Helpers:FPRecipStepFused(double,double):double
           4 (23.53% of base) : 171556.dasm - JIT.HardwareIntrinsics.Arm.Helpers:FPRecipStepFused(double,double):double
           4 (23.53% of base) : 117307.dasm - JIT.HardwareIntrinsics.Arm.Helpers:FPRecipStepFused(double,double):double
           4 (23.53% of base) : 112788.dasm - JIT.HardwareIntrinsics.Arm.Helpers:FPRecipStepFused(double,double):double
           4 (23.53% of base) : 118516.dasm - JIT.HardwareIntrinsics.Arm.Helpers:FPRecipStepFused(float,float):float
           4 (23.53% of base) : 126128.dasm - JIT.HardwareIntrinsics.Arm.Helpers:FPRecipStepFused(float,float):float
           4 (23.53% of base) : 6028.dasm - JIT.HardwareIntrinsics.Arm.Helpers:FPRecipStepFused(float,float):float
           4 (23.53% of base) : 134051.dasm - JIT.HardwareIntrinsics.Arm.Helpers:FPRecipStepFused(float,float):float

Top method improvements (percentages):
          -4 (-10.81% of base) : 84103.dasm - Runtime_39424:TestLclFldAddrIntrinsicsFMA_MulipluAddScalar():double
         -19 (-6.86% of base) : 219328.dasm - MathFusedMultiplyAddTest.Program:TestExplicitFmaUsage1(byref,float)
         -19 (-6.71% of base) : 219345.dasm - MathFusedMultiplyAddTest.Program:TestExplicitFmaUsage1(byref,double)
         -17 (-4.51% of base) : 219347.dasm - MathFusedMultiplyAddTest.Program:TestExplicitFmaUsage3(byref,double)
         -11 (-2.99% of base) : 219330.dasm - MathFusedMultiplyAddTest.Program:TestExplicitFmaUsage3(byref,float)
          -1 (-0.65% of base) : 219313.dasm - MathFusedMultiplyAddTest.Program:TestExplicitFmaUsage4(byref,double)
          -1 (-0.62% of base) : 219331.dasm - MathFusedMultiplyAddTest.Program:TestExplicitFmaUsage4(byref,float)
          -1 (-0.61% of base) : 219315.dasm - MathFusedMultiplyAddTest.Program:TestExplicitFmaUsage6(byref,double)
          -1 (-0.61% of base) : 219333.dasm - MathFusedMultiplyAddTest.Program:TestExplicitFmaUsage6(byref,float)
          -1 (-0.52% of base) : 219314.dasm - MathFusedMultiplyAddTest.Program:TestExplicitFmaUsage5(byref,double)
          -1 (-0.49% of base) : 219332.dasm - MathFusedMultiplyAddTest.Program:TestExplicitFmaUsage5(byref,float)
          -1 (-0.16% of base) : 141004.dasm - TestStruct:RunStructFldScenario_Load(JIT.HardwareIntrinsics.X86.AlternatingTernaryOpTest__MultiplyAddSubtractDouble):this
          -1 (-0.16% of base) : 141012.dasm - TestStruct:RunStructFldScenario_Load(JIT.HardwareIntrinsics.X86.AlternatingTernaryOpTest__MultiplyAddSubtractSingle):this
          -1 (-0.16% of base) : 141052.dasm - TestStruct:RunStructFldScenario_Load(JIT.HardwareIntrinsics.X86.AlternatingTernaryOpTest__MultiplySubtractAddDouble):this
          -1 (-0.16% of base) : 141060.dasm - TestStruct:RunStructFldScenario_Load(JIT.HardwareIntrinsics.X86.AlternatingTernaryOpTest__MultiplySubtractAddSingle):this
          -1 (-0.16% of base) : 140956.dasm - TestStruct:RunStructFldScenario_Load(JIT.HardwareIntrinsics.X86.SimpleTernaryOpTest__MultiplyAddDouble):this
          -1 (-0.16% of base) : 140972.dasm - TestStruct:RunStructFldScenario_Load(JIT.HardwareIntrinsics.X86.SimpleTernaryOpTest__MultiplyAddNegatedDouble):this
          -1 (-0.16% of base) : 140980.dasm - TestStruct:RunStructFldScenario_Load(JIT.HardwareIntrinsics.X86.SimpleTernaryOpTest__MultiplyAddNegatedSingle):this
          -1 (-0.16% of base) : 140964.dasm - TestStruct:RunStructFldScenario_Load(JIT.HardwareIntrinsics.X86.SimpleTernaryOpTest__MultiplyAddSingle):this
          -1 (-0.16% of base) : 141036.dasm - TestStruct:RunStructFldScenario_Load(JIT.HardwareIntrinsics.X86.SimpleTernaryOpTest__MultiplySubtractDouble):this

85 total methods with Code Size differences (35 improved, 50 regressed), 334 unchanged.

libraries.pmi.windows.x64.checked.mch:


Summary of Code Size diffs:
(Lower is better)

Total bytes of base: 836
Total bytes of diff: 860
Total bytes of delta: 24 (2.87% of base)
Total relative delta: 0.91
    diff is a regression.
    relative diff is a regression.

Detail diffs



Top file regressions (bytes):
           1 : 18759.dasm (3.57% of base)
           1 : 18779.dasm (3.57% of base)
           1 : 18765.dasm (3.57% of base)
           1 : 18762.dasm (4.00% of base)
           1 : 18756.dasm (4.00% of base)
           1 : 18763.dasm (4.00% of base)
           1 : 18758.dasm (3.57% of base)
           1 : 18782.dasm (4.00% of base)
           1 : 18768.dasm (3.57% of base)
           1 : 18767.dasm (4.00% of base)
           1 : 18772.dasm (4.00% of base)
           1 : 18783.dasm (4.00% of base)
           1 : 18773.dasm (4.00% of base)
           1 : 18785.dasm (3.57% of base)
           1 : 18764.dasm (3.57% of base)
           1 : 18774.dasm (3.57% of base)
           1 : 18775.dasm (3.57% of base)
           1 : 18757.dasm (4.00% of base)
           1 : 18769.dasm (3.57% of base)
           1 : 18776.dasm (4.00% of base)

24 total files with Code Size differences (0 improved, 24 regressed), 8 unchanged.

Top method regressions (bytes):
           1 ( 4.00% of base) : 18757.dasm - System.Runtime.Intrinsics.X86.Fma:MultiplyAdd(System.Runtime.Intrinsics.Vector128`1[Double],System.Runtime.Intrinsics.Vector128`1[Double],System.Runtime.Intrinsics.Vector128`1[Double]):System.Runtime.Intrinsics.Vector128`1[Double]
           1 ( 4.00% of base) : 18756.dasm - System.Runtime.Intrinsics.X86.Fma:MultiplyAdd(System.Runtime.Intrinsics.Vector128`1[Single],System.Runtime.Intrinsics.Vector128`1[Single],System.Runtime.Intrinsics.Vector128`1[Single]):System.Runtime.Intrinsics.Vector128`1[Single]
           1 ( 3.57% of base) : 18759.dasm - System.Runtime.Intrinsics.X86.Fma:MultiplyAdd(System.Runtime.Intrinsics.Vector256`1[Double],System.Runtime.Intrinsics.Vector256`1[Double],System.Runtime.Intrinsics.Vector256`1[Double]):System.Runtime.Intrinsics.Vector256`1[Double]
           1 ( 3.57% of base) : 18758.dasm - System.Runtime.Intrinsics.X86.Fma:MultiplyAdd(System.Runtime.Intrinsics.Vector256`1[Single],System.Runtime.Intrinsics.Vector256`1[Single],System.Runtime.Intrinsics.Vector256`1[Single]):System.Runtime.Intrinsics.Vector256`1[Single]
           1 ( 4.00% of base) : 18777.dasm - System.Runtime.Intrinsics.X86.Fma:MultiplyAddNegated(System.Runtime.Intrinsics.Vector128`1[Double],System.Runtime.Intrinsics.Vector128`1[Double],System.Runtime.Intrinsics.Vector128`1[Double]):System.Runtime.Intrinsics.Vector128`1[Double]
           1 ( 4.00% of base) : 18776.dasm - System.Runtime.Intrinsics.X86.Fma:MultiplyAddNegated(System.Runtime.Intrinsics.Vector128`1[Single],System.Runtime.Intrinsics.Vector128`1[Single],System.Runtime.Intrinsics.Vector128`1[Single]):System.Runtime.Intrinsics.Vector128`1[Single]
           1 ( 3.57% of base) : 18779.dasm - System.Runtime.Intrinsics.X86.Fma:MultiplyAddNegated(System.Runtime.Intrinsics.Vector256`1[Double],System.Runtime.Intrinsics.Vector256`1[Double],System.Runtime.Intrinsics.Vector256`1[Double]):System.Runtime.Intrinsics.Vector256`1[Double]
           1 ( 3.57% of base) : 18778.dasm - System.Runtime.Intrinsics.X86.Fma:MultiplyAddNegated(System.Runtime.Intrinsics.Vector256`1[Single],System.Runtime.Intrinsics.Vector256`1[Single],System.Runtime.Intrinsics.Vector256`1[Single]):System.Runtime.Intrinsics.Vector256`1[Single]
           1 ( 4.00% of base) : 18763.dasm - System.Runtime.Intrinsics.X86.Fma:MultiplyAddSubtract(System.Runtime.Intrinsics.Vector128`1[Double],System.Runtime.Intrinsics.Vector128`1[Double],System.Runtime.Intrinsics.Vector128`1[Double]):System.Runtime.Intrinsics.Vector128`1[Double]
           1 ( 4.00% of base) : 18762.dasm - System.Runtime.Intrinsics.X86.Fma:MultiplyAddSubtract(System.Runtime.Intrinsics.Vector128`1[Single],System.Runtime.Intrinsics.Vector128`1[Single],System.Runtime.Intrinsics.Vector128`1[Single]):System.Runtime.Intrinsics.Vector128`1[Single]
           1 ( 3.57% of base) : 18765.dasm - System.Runtime.Intrinsics.X86.Fma:MultiplyAddSubtract(System.Runtime.Intrinsics.Vector256`1[Double],System.Runtime.Intrinsics.Vector256`1[Double],System.Runtime.Intrinsics.Vector256`1[Double]):System.Runtime.Intrinsics.Vector256`1[Double]
           1 ( 3.57% of base) : 18764.dasm - System.Runtime.Intrinsics.X86.Fma:MultiplyAddSubtract(System.Runtime.Intrinsics.Vector256`1[Single],System.Runtime.Intrinsics.Vector256`1[Single],System.Runtime.Intrinsics.Vector256`1[Single]):System.Runtime.Intrinsics.Vector256`1[Single]
           1 ( 4.00% of base) : 18767.dasm - System.Runtime.Intrinsics.X86.Fma:MultiplySubtract(System.Runtime.Intrinsics.Vector128`1[Double],System.Runtime.Intrinsics.Vector128`1[Double],System.Runtime.Intrinsics.Vector128`1[Double]):System.Runtime.Intrinsics.Vector128`1[Double]
           1 ( 4.00% of base) : 18766.dasm - System.Runtime.Intrinsics.X86.Fma:MultiplySubtract(System.Runtime.Intrinsics.Vector128`1[Single],System.Runtime.Intrinsics.Vector128`1[Single],System.Runtime.Intrinsics.Vector128`1[Single]):System.Runtime.Intrinsics.Vector128`1[Single]
           1 ( 3.57% of base) : 18769.dasm - System.Runtime.Intrinsics.X86.Fma:MultiplySubtract(System.Runtime.Intrinsics.Vector256`1[Double],System.Runtime.Intrinsics.Vector256`1[Double],System.Runtime.Intrinsics.Vector256`1[Double]):System.Runtime.Intrinsics.Vector256`1[Double]
           1 ( 3.57% of base) : 18768.dasm - System.Runtime.Intrinsics.X86.Fma:MultiplySubtract(System.Runtime.Intrinsics.Vector256`1[Single],System.Runtime.Intrinsics.Vector256`1[Single],System.Runtime.Intrinsics.Vector256`1[Single]):System.Runtime.Intrinsics.Vector256`1[Single]
           1 ( 4.00% of base) : 18773.dasm - System.Runtime.Intrinsics.X86.Fma:MultiplySubtractAdd(System.Runtime.Intrinsics.Vector128`1[Double],System.Runtime.Intrinsics.Vector128`1[Double],System.Runtime.Intrinsics.Vector128`1[Double]):System.Runtime.Intrinsics.Vector128`1[Double]
           1 ( 4.00% of base) : 18772.dasm - System.Runtime.Intrinsics.X86.Fma:MultiplySubtractAdd(System.Runtime.Intrinsics.Vector128`1[Single],System.Runtime.Intrinsics.Vector128`1[Single],System.Runtime.Intrinsics.Vector128`1[Single]):System.Runtime.Intrinsics.Vector128`1[Single]
           1 ( 3.57% of base) : 18775.dasm - System.Runtime.Intrinsics.X86.Fma:MultiplySubtractAdd(System.Runtime.Intrinsics.Vector256`1[Double],System.Runtime.Intrinsics.Vector256`1[Double],System.Runtime.Intrinsics.Vector256`1[Double]):System.Runtime.Intrinsics.Vector256`1[Double]
           1 ( 3.57% of base) : 18774.dasm - System.Runtime.Intrinsics.X86.Fma:MultiplySubtractAdd(System.Runtime.Intrinsics.Vector256`1[Single],System.Runtime.Intrinsics.Vector256`1[Single],System.Runtime.Intrinsics.Vector256`1[Single]):System.Runtime.Intrinsics.Vector256`1[Single]

Top method regressions (percentages):
           1 ( 4.00% of base) : 18757.dasm - System.Runtime.Intrinsics.X86.Fma:MultiplyAdd(System.Runtime.Intrinsics.Vector128`1[Double],System.Runtime.Intrinsics.Vector128`1[Double],System.Runtime.Intrinsics.Vector128`1[Double]):System.Runtime.Intrinsics.Vector128`1[Double]
           1 ( 4.00% of base) : 18756.dasm - System.Runtime.Intrinsics.X86.Fma:MultiplyAdd(System.Runtime.Intrinsics.Vector128`1[Single],System.Runtime.Intrinsics.Vector128`1[Single],System.Runtime.Intrinsics.Vector128`1[Single]):System.Runtime.Intrinsics.Vector128`1[Single]
           1 ( 4.00% of base) : 18777.dasm - System.Runtime.Intrinsics.X86.Fma:MultiplyAddNegated(System.Runtime.Intrinsics.Vector128`1[Double],System.Runtime.Intrinsics.Vector128`1[Double],System.Runtime.Intrinsics.Vector128`1[Double]):System.Runtime.Intrinsics.Vector128`1[Double]
           1 ( 4.00% of base) : 18776.dasm - System.Runtime.Intrinsics.X86.Fma:MultiplyAddNegated(System.Runtime.Intrinsics.Vector128`1[Single],System.Runtime.Intrinsics.Vector128`1[Single],System.Runtime.Intrinsics.Vector128`1[Single]):System.Runtime.Intrinsics.Vector128`1[Single]
           1 ( 4.00% of base) : 18763.dasm - System.Runtime.Intrinsics.X86.Fma:MultiplyAddSubtract(System.Runtime.Intrinsics.Vector128`1[Double],System.Runtime.Intrinsics.Vector128`1[Double],System.Runtime.Intrinsics.Vector128`1[Double]):System.Runtime.Intrinsics.Vector128`1[Double]
           1 ( 4.00% of base) : 18762.dasm - System.Runtime.Intrinsics.X86.Fma:MultiplyAddSubtract(System.Runtime.Intrinsics.Vector128`1[Single],System.Runtime.Intrinsics.Vector128`1[Single],System.Runtime.Intrinsics.Vector128`1[Single]):System.Runtime.Intrinsics.Vector128`1[Single]
           1 ( 4.00% of base) : 18767.dasm - System.Runtime.Intrinsics.X86.Fma:MultiplySubtract(System.Runtime.Intrinsics.Vector128`1[Double],System.Runtime.Intrinsics.Vector128`1[Double],System.Runtime.Intrinsics.Vector128`1[Double]):System.Runtime.Intrinsics.Vector128`1[Double]
           1 ( 4.00% of base) : 18766.dasm - System.Runtime.Intrinsics.X86.Fma:MultiplySubtract(System.Runtime.Intrinsics.Vector128`1[Single],System.Runtime.Intrinsics.Vector128`1[Single],System.Runtime.Intrinsics.Vector128`1[Single]):System.Runtime.Intrinsics.Vector128`1[Single]
           1 ( 4.00% of base) : 18773.dasm - System.Runtime.Intrinsics.X86.Fma:MultiplySubtractAdd(System.Runtime.Intrinsics.Vector128`1[Double],System.Runtime.Intrinsics.Vector128`1[Double],System.Runtime.Intrinsics.Vector128`1[Double]):System.Runtime.Intrinsics.Vector128`1[Double]
           1 ( 4.00% of base) : 18772.dasm - System.Runtime.Intrinsics.X86.Fma:MultiplySubtractAdd(System.Runtime.Intrinsics.Vector128`1[Single],System.Runtime.Intrinsics.Vector128`1[Single],System.Runtime.Intrinsics.Vector128`1[Single]):System.Runtime.Intrinsics.Vector128`1[Single]
           1 ( 4.00% of base) : 18783.dasm - System.Runtime.Intrinsics.X86.Fma:MultiplySubtractNegated(System.Runtime.Intrinsics.Vector128`1[Double],System.Runtime.Intrinsics.Vector128`1[Double],System.Runtime.Intrinsics.Vector128`1[Double]):System.Runtime.Intrinsics.Vector128`1[Double]
           1 ( 4.00% of base) : 18782.dasm - System.Runtime.Intrinsics.X86.Fma:MultiplySubtractNegated(System.Runtime.Intrinsics.Vector128`1[Single],System.Runtime.Intrinsics.Vector128`1[Single],System.Runtime.Intrinsics.Vector128`1[Single]):System.Runtime.Intrinsics.Vector128`1[Single]
           1 ( 3.57% of base) : 18759.dasm - System.Runtime.Intrinsics.X86.Fma:MultiplyAdd(System.Runtime.Intrinsics.Vector256`1[Double],System.Runtime.Intrinsics.Vector256`1[Double],System.Runtime.Intrinsics.Vector256`1[Double]):System.Runtime.Intrinsics.Vector256`1[Double]
           1 ( 3.57% of base) : 18758.dasm - System.Runtime.Intrinsics.X86.Fma:MultiplyAdd(System.Runtime.Intrinsics.Vector256`1[Single],System.Runtime.Intrinsics.Vector256`1[Single],System.Runtime.Intrinsics.Vector256`1[Single]):System.Runtime.Intrinsics.Vector256`1[Single]
           1 ( 3.57% of base) : 18779.dasm - System.Runtime.Intrinsics.X86.Fma:MultiplyAddNegated(System.Runtime.Intrinsics.Vector256`1[Double],System.Runtime.Intrinsics.Vector256`1[Double],System.Runtime.Intrinsics.Vector256`1[Double]):System.Runtime.Intrinsics.Vector256`1[Double]
           1 ( 3.57% of base) : 18778.dasm - System.Runtime.Intrinsics.X86.Fma:MultiplyAddNegated(System.Runtime.Intrinsics.Vector256`1[Single],System.Runtime.Intrinsics.Vector256`1[Single],System.Runtime.Intrinsics.Vector256`1[Single]):System.Runtime.Intrinsics.Vector256`1[Single]
           1 ( 3.57% of base) : 18765.dasm - System.Runtime.Intrinsics.X86.Fma:MultiplyAddSubtract(System.Runtime.Intrinsics.Vector256`1[Double],System.Runtime.Intrinsics.Vector256`1[Double],System.Runtime.Intrinsics.Vector256`1[Double]):System.Runtime.Intrinsics.Vector256`1[Double]
           1 ( 3.57% of base) : 18764.dasm - System.Runtime.Intrinsics.X86.Fma:MultiplyAddSubtract(System.Runtime.Intrinsics.Vector256`1[Single],System.Runtime.Intrinsics.Vector256`1[Single],System.Runtime.Intrinsics.Vector256`1[Single]):System.Runtime.Intrinsics.Vector256`1[Single]
           1 ( 3.57% of base) : 18769.dasm - System.Runtime.Intrinsics.X86.Fma:MultiplySubtract(System.Runtime.Intrinsics.Vector256`1[Double],System.Runtime.Intrinsics.Vector256`1[Double],System.Runtime.Intrinsics.Vector256`1[Double]):System.Runtime.Intrinsics.Vector256`1[Double]
           1 ( 3.57% of base) : 18768.dasm - System.Runtime.Intrinsics.X86.Fma:MultiplySubtract(System.Runtime.Intrinsics.Vector256`1[Single],System.Runtime.Intrinsics.Vector256`1[Single],System.Runtime.Intrinsics.Vector256`1[Single]):System.Runtime.Intrinsics.Vector256`1[Single]

24 total methods with Code Size differences (0 improved, 24 regressed), 8 unchanged.

libraries_tests.pmi.windows.x64.checked.mch:


Summary of Code Size diffs:
(Lower is better)

Total bytes of base: 50
Total bytes of diff: 42
Total bytes of delta: -8 (-16.00% of base)
Total relative delta: -0.32
    diff is an improvement.
    relative diff is an improvement.

Detail diffs



Top file improvements (bytes):
          -4 : 210044.dasm (-16.00% of base)
          -4 : 209492.dasm (-16.00% of base)

2 total files with Code Size differences (2 improved, 0 regressed), 0 unchanged.

Top method improvements (bytes):
          -4 (-16.00% of base) : 209492.dasm - System.Tests.MathFTests:FusedMultiplyAdd(float,float,float,float)
          -4 (-16.00% of base) : 210044.dasm - System.Tests.MathTests:FusedMultiplyAdd(double,double,double,double)

Top method improvements (percentages):
          -4 (-16.00% of base) : 209492.dasm - System.Tests.MathFTests:FusedMultiplyAdd(float,float,float,float)
          -4 (-16.00% of base) : 210044.dasm - System.Tests.MathTests:FusedMultiplyAdd(double,double,double,double)

2 total methods with Code Size differences (2 improved, 0 regressed), 0 unchanged.
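The byte and percentage figures in the four summaries above follow from the base and diff totals: delta is diff minus base, and the percentage is delta divided by base. A small sketch that re-derives them from the totals quoted in the reports (`summarize` is a hypothetical helper, and the "Total relative delta" line is not modelled here):

```python
def summarize(base_bytes, diff_bytes):
    """Return (delta, percent of base rounded to 2 places, verdict)."""
    delta = diff_bytes - base_bytes
    percent = round(100.0 * delta / base_bytes, 2)
    if delta > 0:
        verdict = "regression"
    elif delta < 0:
        verdict = "improvement"
    else:
        verdict = "no change"
    return delta, percent, verdict


# (base, diff) totals copied from the four reports above.
reports = {
    "benchmarks.run": (1071, 1083),
    "coreclr_tests.pmi": (62236, 62336),
    "libraries.pmi": (836, 860),
    "libraries_tests.pmi": (50, 42),
}

if __name__ == "__main__":
    for name, (base, diff) in reports.items():
        delta, percent, verdict = summarize(base, diff)
        print(f"{name}: {delta:+d} bytes ({percent:+.2f}% of base), {verdict}")
```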

src/coreclr/jit/hwintrinsiccodegenxarch.cpp

src/coreclr/jit/lowerxarch.cpp

src/coreclr/jit/gentree.cpp

weilinwa · 2021-09-16T18:18:11Z

@tannergooding, I have a question about Fma.MultiplyAddScalar and other scalar type FMA methods.

In scalar FMA instructions such as VFMADD132SS DEST, SRC1, SRC2, DEST holds the scalar result in DEST[31:0], and DEST[127:32] is unchanged. However, because there are three different FMA forms, DEST could be mapped to any one of the three operands in Fma.MultiplyAddScalar(op1, op2, op3).

My question is: do we need to ensure op1[127:32] == result[127:32] (rather than op2[127:32] == result[127:32] or op3[127:32] == result[127:32]) in the definition of Fma.MultiplyAddScalar? If we do, does this mean we cannot choose among the three FMA forms freely? For 132 and 213, we can ensure op1 is mapped to DEST because multiplication is commutative. But for 231, DEST needs to be mapped to op3.

tannergooding · 2021-09-16T19:47:55Z

@weilinwa, that's a great question. The TL;DR is that yes, we do need to ensure op1[127:32] == result[127:32], or more specifically that the upper result bits come from the a operand (this can be done via a pre or post move/merge if appropriate/required).

Normally we provide two versions of the scalar function where this matters, such as:

public static Vector128<float> ReciprocalScalar(Vector128<float> value);
public static Vector128<float> ReciprocalScalar(Vector128<float> upper, Vector128<float> value);

When we do this, the upper bits come from value for the first overload and from upper in the other. We do this to try and ensure determinism first and foremost.

For FMA, we only expose overloads like the first one and so the expectation is that the upper bits come from a. Today, we ensure that a (op1) can't be contained for the scalar variants so that it is always the destination (see the check for CopiesUpperBits): https://github.com/dotnet/runtime/blob/main/src/coreclr/jit/lowerxarch.cpp#L6312-L6347

We'd need to expose MultiplyAddScalarUnsafe APIs, or something similar, to allow the upper bits to be "undefined" (that is, to come from any operand) and so allow the most efficient codegen in all scenarios. That would require an API review and approval for the scenario (but is likely worth it, since that would also benefit Math.FusedMultiplyAdd, where the upper bits aren't exposed and don't matter).
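
To make the upper-bits concern concrete, here is a minimal C++ sketch that models the three scalar forms lane by lane, following the operand roles in Intel's instruction reference (the first operand is DEST and keeps its upper lanes). Vec4 and the Fmadd*ss helpers are hypothetical names for illustration only, not JIT or runtime code, and a plain multiply-add stands in for the fused operation because rounding is not the point here.

```cpp
// Hypothetical stand-in for a 128-bit register viewed as four float lanes.
struct Vec4
{
    float lane[4];
};

// Every form writes only lane 0 of DEST; lanes 1..3 of DEST pass through.

// 132 form: DEST = DEST * SRC3 + SRC2
constexpr Vec4 Fmadd132ss(Vec4 dest, Vec4 src2, Vec4 src3)
{
    dest.lane[0] = dest.lane[0] * src3.lane[0] + src2.lane[0];
    return dest;
}

// 213 form: DEST = SRC2 * DEST + SRC3
constexpr Vec4 Fmadd213ss(Vec4 dest, Vec4 src2, Vec4 src3)
{
    dest.lane[0] = src2.lane[0] * dest.lane[0] + src3.lane[0];
    return dest;
}

// 231 form: DEST = SRC2 * SRC3 + DEST
constexpr Vec4 Fmadd231ss(Vec4 dest, Vec4 src2, Vec4 src3)
{
    dest.lane[0] = src2.lane[0] * src3.lane[0] + dest.lane[0];
    return dest;
}

constexpr Vec4 a{{2.0f, 10.0f, 11.0f, 12.0f}};
constexpr Vec4 b{{3.0f, 20.0f, 21.0f, 22.0f}};
constexpr Vec4 c{{4.0f, 30.0f, 31.0f, 32.0f}};

// MultiplyAddScalar(a, b, c) should yield a*b+c in lane 0 and a's upper lanes.
constexpr Vec4 r132 = Fmadd132ss(a, c, b); // a in DEST
constexpr Vec4 r213 = Fmadd213ss(a, b, c); // a in DEST
constexpr Vec4 r231 = Fmadd231ss(c, a, b); // the addend c has to be DEST

static_assert(r132.lane[0] == 10.0f && r213.lane[0] == 10.0f && r231.lane[0] == 10.0f,
              "all three forms produce the same scalar result");
static_assert(r132.lane[1] == 10.0f && r213.lane[1] == 10.0f, "upper lanes come from a");
static_assert(r231.lane[1] == 30.0f, "upper lanes come from c, not a");
```

Under these assumptions, 132 and 213 can both keep a in DEST, while 231 forces the addend c into DEST, which is why 231 cannot be chosen freely for MultiplyAddScalar(a, b, c) without an extra move or merge.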

tannergooding · 2021-11-17T18:33:52Z

src/coreclr/jit/lsraxarch.cpp

+
+                srcCount += 1;
+                srcCount += BuildDelayFreeUses(emitOp2, emitOp1);
+                srcCount += emitOp3->isContained() ? BuildOperandUses(emitOp3) : BuildDelayFreeUses(emitOp3, emitOp1);


This is a lot smaller and easier to follow now 🎉

tannergooding · 2021-11-17T18:34:33Z

src/coreclr/jit/lsraxarch.cpp

+                if (containedOpNum == 1 && !copiesUpperBits)
+                {


Is the !copiesUpperBits check needed? If copiesUpperBits is true, then containedOpNum shouldn't be 1.

Yes, this should be true. Do we need to add an assert before to ensure that?

I think you already have that assert a few lines up: https://github.com/dotnet/runtime/pull/58196/files/5ca658ebf53beee4d84446236f9391fba9e3e935#diff-9626112837daf480c93d401a587012b3e398dd90a953c870797d472cff36839dR2342

assert(!copiesUpperBits || !op1->isContained());

maybe replace it with assert(containedOpNum != 1 || !copiesUpperBits); to also cover the regOptional case

tannergooding · 2021-11-17T18:38:26Z

src/coreclr/jit/lsraxarch.cpp

+                // Intrinsics with CopyUpperBits semantics must have op1 as target
+                if (containedOpNum == 1 && !copiesUpperBits)
+                {
+                    if (resultOpNum != 3)


It would be nice if this were the positive case and there were an assert that resultOpNum != containedOpNum

Therefore, if containedOpNum == 1 then resultOpNum can only be 0, 2, or 3

If it's 3, then swapping op1/op3 is sufficient
If it's 2, then swapping op2/op3 is needed first
If it's 0, then it doesn't matter what we do, so it's fine to not swap
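
The three cases above can be sketched as a small planner. This is a hypothetical illustration, not the code in the PR: FmaPlan, PlanWithOp1Contained, Pick and Evaluate are made-up names, operands are referred to by number, and the choice for resultOpNum == 0 is arbitrary, as the comment says it can be. The Evaluate helper is only there to check that each plan still computes op1 * op2 + op3.

```cpp
enum class FmaForm
{
    F132, // DEST = DEST * SRC3 + SRC2
    F213, // DEST = SRC2 * DEST + SRC3
    F231  // DEST = SRC2 * SRC3 + DEST
};

// Which original operand number (1, 2 or 3) occupies each instruction slot.
struct FmaPlan
{
    int     dest; // overwritten with the result
    int     src2; // register only
    int     src3; // the slot that may be a memory operand
    FmaForm form;
};

// Hypothetical planner for containedOpNum == 1 without CopyUpperBits semantics.
// resultOpNum is 0 (no operand is the target), 2 or 3.
constexpr FmaPlan PlanWithOp1Contained(int resultOpNum)
{
    return (resultOpNum == 2)
               // Swap op2/op3, then op1/op3: (op1, op2, op3) -> (op2, op3, [op1]).
               ? FmaPlan{2, 3, 1, FmaForm::F132}
               // Swap op1/op3 only: (op1, op2, op3) -> (op3, op2, [op1]).
               // For resultOpNum == 0 any order works; reusing this one is this sketch's choice.
               : FmaPlan{3, 2, 1, FmaForm::F231};
}

constexpr double Pick(int opNum, double op1, double op2, double op3)
{
    return (opNum == 1) ? op1 : (opNum == 2) ? op2 : op3;
}

// Evaluates the planned instruction so the plan can be checked against op1 * op2 + op3.
constexpr double Evaluate(FmaPlan p, double op1, double op2, double op3)
{
    return (p.form == FmaForm::F132)
               ? Pick(p.dest, op1, op2, op3) * Pick(p.src3, op1, op2, op3) + Pick(p.src2, op1, op2, op3)
           : (p.form == FmaForm::F213)
               ? Pick(p.src2, op1, op2, op3) * Pick(p.dest, op1, op2, op3) + Pick(p.src3, op1, op2, op3)
               : Pick(p.src2, op1, op2, op3) * Pick(p.src3, op1, op2, op3) + Pick(p.dest, op1, op2, op3);
}

static_assert(PlanWithOp1Contained(3).dest == 3, "the result operand ends up in DEST");
static_assert(PlanWithOp1Contained(2).dest == 2, "the result operand ends up in DEST");
static_assert(PlanWithOp1Contained(2).src3 == 1, "the contained op1 ends up in the memory slot");
static_assert(Evaluate(PlanWithOp1Contained(2), 2, 3, 5) == 11, "still op1 * op2 + op3");
static_assert(Evaluate(PlanWithOp1Contained(3), 2, 3, 5) == 11, "still op1 * op2 + op3");
```

In both plans the contained operand lands in SRC3 and the target operand in DEST, which is the point of the swaps described above.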

Is it possible that containedOpNum == 0 and resultOpNum == 0?

If none of the operands are overwritten and none are last use, then containedOpNum == 0.

I think we probably also won't get containedOpNum == 0, because VFMADD should support general-purpose loads as well and so RegOptional should probably be true for at least one case. But in general it's better to check and account for possible future changes, scenarios, or nodes that are introduced

Because op lastUse could be updated after lowering, there are cases where resultOpNum == containedOpNum and both are nonzero.

Can we make multiple ops contained in lowering or change that in lsra?

tannergooding · 2021-11-17T18:40:57Z

src/coreclr/jit/lsraxarch.cpp

+                }
+                else
+                {
+                    assert(containedOpNum == 2);


I think it's possible for containedOpNum to be 0 and so we should check for this explicitly.

tannergooding · 2021-11-17T18:43:20Z

src/coreclr/jit/lsraxarch.cpp


-                    srcCount += op3->isContained() ? BuildOperandUses(op3) : BuildDelayFreeUses(op3, op1);
+                    if (resultOpNum == 3 && !copiesUpperBits)


Just capturing a comment, I don't think we need to do anything in this PR.

I think the logic around copiesUpperBits could be simplified a bit so we don't need these extra checks everywhere. That is, if copiesUpperBits is true, then resultOpNum doesn't matter if it's not 1, so maybe we should force resultOpNum to 0 in that case (that is, if copiesUpperBits == true and resultOpNum != 1, treat it as 0, because no matter what we do, op1 cannot be swapped or moved about and op2/op3 will be delay free or contained).
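
A minimal sketch of that normalization, assuming resultOpNum is a plain operand number in the range 0..3. NormalizeResultOpNum is a hypothetical helper name; nothing like it is claimed to exist in the JIT.

```cpp
// Hypothetical helper: when the intrinsic copies upper bits, op1 must stay the
// target, so any other candidate result operand is treated as "none" (0).
constexpr unsigned NormalizeResultOpNum(bool copiesUpperBits, unsigned resultOpNum)
{
    return (copiesUpperBits && (resultOpNum != 1)) ? 0u : resultOpNum;
}

static_assert(NormalizeResultOpNum(true, 3) == 0, "op3 cannot become the target");
static_assert(NormalizeResultOpNum(true, 1) == 1, "op1 as the target is kept");
static_assert(NormalizeResultOpNum(false, 3) == 3, "no restriction without CopyUpperBits");
```

With the value normalized once up front, later checks could compare resultOpNum alone instead of repeating the !copiesUpperBits condition at each site.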

tannergooding · 2021-11-17T18:55:01Z

src/coreclr/jit/hwintrinsiccodegenxarch.cpp

+        // op1 = (op1 * op2) + [op3] or op2 = (op1 * op2) + [op3]
+        // ? = (op1 * op2) + [op3] or ? = (op1 * op2) + op3
+        // 213 form: XMM1 = (XMM2 * XMM1) + [XMM3]
+        isCommutative = copiesUpperBits;


shouldn't isCommutative be !copiesUpperBits? We can't swap anything if copiesUpperBits == true

Yes, I used it inaccurately here, merely to control whether we enter the branch.

tannergooding · 2021-11-17T18:57:40Z

src/coreclr/jit/hwintrinsiccodegenxarch.cpp

    }

+    regNumber op1Reg = emitOp1->GetRegNum();
+    regNumber op2Reg = emitOp2->GetRegNum();
+
    if (isCommutative && (op1Reg != targetReg) && (op2Reg == targetReg))


Is this block still needed given the above handling?

It feels like we should already be covering this under the last block, which is op3 or nothing is contained/spilled so:

if (!copiesUpperBits && (targetReg == op2Reg)) { std::swap(emitOp1, emitOp2); }

Then everything should be in the right place.

Why is it if (!copiesUpperBits && (targetReg == op2Reg)) not if (copiesUpperBits && (targetReg == op2Reg))? I thought we need to ensure targetReg is op1Reg only when copiesUpperBits is true.

Because emitOp1 is already op1, so if copiesUpperBits == true, then we don't want to change anything.

When it's false, we only need to swap if the target reg is op2Reg.
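
The rule in this exchange can be sketched as follows, for the 213 form (DEST = SRC2 * DEST + SRC3) where op3 stays in SRC3. EmitPair and OrderFor213 are hypothetical names and registers are modeled as plain ints; this is an illustration of the condition, not the codegen itself.

```cpp
// Register numbers occupying the first two instruction slots.
struct EmitPair
{
    int dest;
    int src2;
};

// Hypothetical model of the last block: op3 (or nothing) is contained, so only
// op1 and op2 can trade places, and only when upper bits need not come from op1.
constexpr EmitPair OrderFor213(bool copiesUpperBits, int targetReg, int op1Reg, int op2Reg)
{
    return (!copiesUpperBits && (targetReg == op2Reg))
               ? EmitPair{op2Reg, op1Reg}  // swap: op2 already lives in the target register
               : EmitPair{op1Reg, op2Reg}; // keep op1 first, as CopyUpperBits requires
}

static_assert(OrderFor213(false, 5, 4, 5).dest == 5, "swap when the target is op2's register");
static_assert(OrderFor213(true, 5, 4, 5).dest == 4, "never swap with CopyUpperBits");
static_assert(OrderFor213(false, 7, 4, 5).dest == 4, "no swap when the target is unrelated");
```

The swap is safe in the non-copying case because the two multiplicands commute in the 213 form.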

weilinwa · 2021-11-22T17:01:00Z

@tannergooding, could you please take a look at the latest code when you have time? I resolved almost all of your comments except the resultOpNum and containedOpNum assertion. Thanks!

tannergooding · 2021-11-22T18:53:27Z

src/coreclr/jit/hwintrinsiccodegenxarch.cpp

-        op1Reg = op3->GetRegNum();
-        op2Reg = op2->GetRegNum();
-        op3    = op1;
+        if (targetReg == op3NodeReg)


I think this needs to be !copiesUpperBits && (targetReg == op3NodeReg)

Otherwise, copiesUpperBits can be true since op1 is not Contained or UsedFromSpillTemp and therefore swapping emitOp1 isn't correct.

tannergooding · 2021-11-22T18:54:12Z

src/coreclr/jit/hwintrinsiccodegenxarch.cpp

+        // op1 = (op1 * op2) + [op3] or op2 = (op1 * op2) + [op3]
+        // ? = (op1 * op2) + [op3] or ? = (op1 * op2) + op3
+        // 213 form: XMM1 = (XMM2 * XMM1) + [XMM3]
+        if (targetReg == op2NodeReg)


Likewise, I think this needs to be if (!copiesUpperBits && (targetReg == op2NodeReg)) for the same reason.

I think we also don't need the below section doing if (!copiesUpperBits && (emitOp2->GetRegNum() == targetReg)) as it will have already been covered up here.

tannergooding · 2021-11-22T18:56:48Z

Everything looks good except for the two related callouts in codegen.

Looks like there is also a merge conflict, likely due to #59912.

tannergooding

This all LGTM. CC @kunalspathak or @echesakovMSFT: could you give a second review and merge if everything looks good to you as well?

weilinwa · 2021-11-30T21:43:28Z

@kunalspathak @echesakovMSFT, could you take a look when you have some time? Thanks.

kunalspathak

I think you need to uncomment the 2 asserts and run the test to make sure they are not hit.

src/coreclr/jit/lsraxarch.cpp

kunalspathak · 2021-11-30T22:09:50Z

src/coreclr/jit/lsraxarch.cpp

+                if (containedOpNum == 1)
+                {
+                    // resultOpNum might change between lowering and lsra, comment out assertion for now.
+                    // assert(containedOpNum != resultOpNum);


Need to uncomment this assert?

This assertion cannot be uncommented because the last-use value could change after the lowering step. I left the asserts here for follow-up work if necessary.

Could you please create an issue for it and add the link to the issue in the comment here?

kunalspathak · 2021-11-30T22:19:43Z

src/coreclr/jit/lsraxarch.cpp

+                }
+                else if (containedOpNum == 3)
+                {
+                    // assert(containedOpNum != resultOpNum);


Co-authored-by: Kunal Pathak <Kunal.Pathak@microsoft.com>

kunalspathak

Thank you @weilinwa for your patience and commitment. This looks good to me.

kunalspathak · 2021-12-02T04:05:38Z

@weilinwa - I noticed superpmi.py replay failure on linux/x64. Can you double check if it is from your change?

ISSUE: <ASSERT> D:\a\_work\1\s\src\coreclr\jit\emitxarch.cpp (6781) - Assertion failed '(op3Reg != targetReg) || (op1Reg == targetReg)' in 'System.Numerics.Matrix4x4:Lerp(System.Numerics.Matrix4x4,System.Numerics.Matrix4x4,float):System.Numerics.Matrix4x4' during 'Generate code' (IL size 675)

https://helixre8s23ayyeko0k025g8.blob.core.windows.net/dotnet-runtime-refs-pull-62262-merge-d207c81ab3b14b3f92/unix-x64/1/console.122dc30b.log?sv=2019-07-07&se=2021-12-22T02%3A42%3A41Z&sr=c&sp=rl&sig=TWtmGXhWg7AuFc9lSuVCD%2FMqEkj7ZjYwRxf2ZKSSSA0%3D

Optimize FMA codegen base on the overwritten

ee2c0b6

ghost added the community-contribution label Aug 26, 2021

dotnet-issue-labeler bot added the area-CodeGen-coreclr label and removed the community-contribution label Aug 26, 2021

JulieLeeMSFT added this to the 7.0.0 milestone Aug 26, 2021

JulieLeeMSFT requested a review from kunalspathak August 26, 2021 21:18

JulieLeeMSFT assigned weilinwa Aug 26, 2021

SingleAccretion reviewed Aug 26, 2021

kunalspathak reviewed Aug 26, 2021

weilinwa added 2 commits August 27, 2021 10:39

Improve function/var names

46d0011

Add assertions

cce4bda

weilinwa added 2 commits September 7, 2021 11:45

Get use of FMA with TryGetUse

b825291

Decide FMA form with two conditions, OverwrittenOpNum and isContained

f615e39

Fix op reg error in codegen

b698036

tannergooding reviewed Sep 10, 2021

src/coreclr/jit/hwintrinsiccodegenxarch.cpp

tannergooding reviewed Sep 10, 2021

src/coreclr/jit/lowerxarch.cpp

tannergooding reviewed Sep 10, 2021

src/coreclr/jit/gentree.cpp

Decide form using lastUse and isContained in no overwritten case

7d9c0d6

Clean up code

1344d92

tannergooding reviewed Nov 17, 2021

weilinwa added 2 commits November 17, 2021 13:34

Resolve comments

ec4ef66

Comment out assert because of lastUse change

aa93a85

runfoapp bot mentioned this pull request Nov 19, 2021

system.text.regularexpressions.tests.regexmatchtests.match_cachedpattern_newtimeoutapplies #61794

Closed

tannergooding reviewed Nov 22, 2021

weilinwa added 2 commits November 22, 2021 11:35

Fix some copiesUpperBits related errors

c66a018

Merge branch 'main' into fma_opt

ff5a433

tannergooding approved these changes Nov 22, 2021

kunalspathak requested changes Nov 30, 2021

ghost added and removed the needs-author-action label Nov 30, 2021

Update src/coreclr/jit/lsraxarch.cpp

a4657c7

Co-authored-by: Kunal Pathak <Kunal.Pathak@microsoft.com>

weilinwa mentioned this pull request Nov 30, 2021

Improve FMA code generation related to operand last use #62215

Open

Add link to the new issue

75d7a37

kunalspathak approved these changes Dec 1, 2021

kunalspathak merged commit 42777cc into dotnet:main Dec 1, 2021

joshpeterson mentioned this pull request Dec 1, 2021

bot upstream main merge 2021 12 01 Unity-Technologies/runtime#12

Closed

kunalspathak mentioned this pull request Dec 2, 2021

Assertion failed '(op3Reg != targetReg) || (op1Reg == targetReg)' during 'Generate code' #62267

Closed

ghost locked as resolved and limited conversation to collaborators Jan 3, 2022

