Updating genCodeForBinary to be VEX aware #1344

Merged
merged 2 commits into from
Jan 10, 2020

Conversation

tannergooding
Member

This is a rough draft attempt at resolving #1342

This essentially just moves some of the genHWIntrinsic_R_R_RM logic into an inst_RV_RV_TT method and updates genCodeForBinary to call it for floating-point types if FEATURE_HW_INTRINSICS is defined.

As can be seen below, the numbers are overall fairly promising.

Summary of Code Size diffs:
(Lower is better)

Total bytes of diff: -3270 (-0.01% of base)
    diff is an improvement.

Top file regressions (bytes):
          39 : System.Drawing.Primitives.dasm (0.10% of base)

Top file improvements (bytes):
       -2496 : System.Private.CoreLib.dasm (-0.05% of base)
        -266 : System.Runtime.Numerics.dasm (-0.37% of base)
        -178 : System.Linq.Parallel.dasm (-0.01% of base)
        -113 : System.Data.Common.dasm (-0.01% of base)
        -108 : Microsoft.Diagnostics.Tracing.TraceEvent.dasm (-0.00% of base)
         -64 : System.Linq.dasm (-0.01% of base)
         -40 : System.Private.Xml.dasm (-0.00% of base)
         -16 : Microsoft.CodeAnalysis.VisualBasic.dasm (-0.00% of base)
         -16 : Newtonsoft.Json.dasm (-0.00% of base)
          -8 : Microsoft.CodeAnalysis.dasm (-0.00% of base)
          -4 : Microsoft.CSharp.dasm (-0.00% of base)

12 total files with Code Size differences (11 improved, 1 regressed), 97 unchanged.

Top method regressions (bytes):
          25 ( 6.60% of base) : System.Drawing.Primitives.dasm - RectangleF:Intersect(RectangleF,RectangleF):RectangleF
          22 ( 3.72% of base) : Microsoft.Diagnostics.Tracing.TraceEvent.dasm - ThreadTimeStackComputer:AddUnkownAsyncDurationIfNeeded(StartStopActivity,double,TraceEvent):this
          22 ( 7.61% of base) : System.Drawing.Primitives.dasm - RectangleF:Union(RectangleF,RectangleF):RectangleF
          16 ( 2.86% of base) : Microsoft.Diagnostics.Tracing.TraceEvent.dasm - SampleProfilerThreadTimeComputer:AddUnkownAsyncDurationIfNeeded(StartStopActivity,double,TraceEvent):this
          15 ( 1.86% of base) : Microsoft.CodeAnalysis.VisualBasic.dasm - OverloadResolution:FoldFloatingBinaryOperator(int,ConstantValue,ConstantValue,TypeSymbol,TypeSymbol):ConstantValue
          12 ( 0.68% of base) : System.Private.CoreLib.dasm - DateTimeParse:ParseFormatO(ReadOnlySpan`1,byref):bool
          12 ( 5.04% of base) : System.Private.Xml.dasm - XsdDateTime:.ctor(DateTimeOffset,int):this
           4 ( 1.01% of base) : Microsoft.Diagnostics.Tracing.TraceEvent.dasm - StartStopActivityComputer:GetCurrentStartStopActivityStack(MutableTraceEventStackSource,TraceThread,TraceThread,bool):int:this

Top method improvements (bytes):
        -384 (-13.55% of base) : System.Private.CoreLib.dasm - CalendricalCalculationsHelper:SumLongSequenceOfPeriodicTerms(double):double
        -338 (-13.56% of base) : System.Private.CoreLib.dasm - Matrix4x4:Invert(Matrix4x4,byref):bool
        -134 (-15.06% of base) : System.Private.CoreLib.dasm - Matrix4x4:Transform(Matrix4x4,Quaternion):Matrix4x4
         -88 (-9.24% of base) : System.Runtime.Numerics.dasm - Complex:Asin_Internal(double,double,byref,byref,byref)
         -65 (-13.32% of base) : System.Private.CoreLib.dasm - Matrix4x4:CreateFromQuaternion(Quaternion):Matrix4x4
         -60 (-11.21% of base) : System.Private.CoreLib.dasm - Matrix4x4:CreateFromAxisAngle(Vector3,float):Matrix4x4
         -59 (-10.42% of base) : System.Private.CoreLib.dasm - Quaternion:Slerp(Quaternion,Quaternion,float):Quaternion
         -57 (-9.79% of base) : System.Private.CoreLib.dasm - Quaternion:Lerp(Quaternion,Quaternion,float):Quaternion
         -52 (-10.00% of base) : System.Private.CoreLib.dasm - Matrix4x4:GetDeterminant():float:this
         -52 (-11.98% of base) : System.Private.CoreLib.dasm - Quaternion:Divide(Quaternion,Quaternion):Quaternion
         -52 (-11.98% of base) : System.Private.CoreLib.dasm - Quaternion:op_Division(Quaternion,Quaternion):Quaternion
         -50 (-11.63% of base) : System.Private.CoreLib.dasm - Vector4:Transform(Vector3,Quaternion):Vector4
         -48 (-13.75% of base) : System.Private.CoreLib.dasm - Plane:Transform(Plane,Matrix4x4):Plane
         -48 (-16.55% of base) : System.Private.CoreLib.dasm - Vector4:Transform(Vector4,Matrix4x4):Vector4
         -48 (-14.55% of base) : System.Runtime.Numerics.dasm - Complex:Tan(Complex):Complex
         -47 (-10.63% of base) : System.Private.CoreLib.dasm - Plane:Transform(Plane,Quaternion):Plane
         -43 (-10.17% of base) : System.Private.CoreLib.dasm - Vector3:Transform(Vector3,Quaternion):Vector3
         -43 (-9.60% of base) : System.Private.CoreLib.dasm - Vector4:Transform(Vector4,Quaternion):Vector4
         -40 (-7.84% of base) : System.Private.CoreLib.dasm - Matrix4x4:CreateShadow(Vector3,Plane):Matrix4x4
         -40 (-8.62% of base) : System.Private.CoreLib.dasm - Matrix4x4:CreateReflection(Plane):Matrix4x4

Top method regressions (percentages):
          22 ( 7.61% of base) : System.Drawing.Primitives.dasm - RectangleF:Union(RectangleF,RectangleF):RectangleF
          25 ( 6.60% of base) : System.Drawing.Primitives.dasm - RectangleF:Intersect(RectangleF,RectangleF):RectangleF
          12 ( 5.04% of base) : System.Private.Xml.dasm - XsdDateTime:.ctor(DateTimeOffset,int):this
          22 ( 3.72% of base) : Microsoft.Diagnostics.Tracing.TraceEvent.dasm - ThreadTimeStackComputer:AddUnkownAsyncDurationIfNeeded(StartStopActivity,double,TraceEvent):this
          16 ( 2.86% of base) : Microsoft.Diagnostics.Tracing.TraceEvent.dasm - SampleProfilerThreadTimeComputer:AddUnkownAsyncDurationIfNeeded(StartStopActivity,double,TraceEvent):this
          15 ( 1.86% of base) : Microsoft.CodeAnalysis.VisualBasic.dasm - OverloadResolution:FoldFloatingBinaryOperator(int,ConstantValue,ConstantValue,TypeSymbol,TypeSymbol):ConstantValue
           4 ( 1.01% of base) : Microsoft.Diagnostics.Tracing.TraceEvent.dasm - StartStopActivityComputer:GetCurrentStartStopActivityStack(MutableTraceEventStackSource,TraceThread,TraceThread,bool):int:this
          12 ( 0.68% of base) : System.Private.CoreLib.dasm - DateTimeParse:ParseFormatO(ReadOnlySpan`1,byref):bool

Top method improvements (percentages):
         -12 (-21.05% of base) : System.Private.CoreLib.dasm - Quaternion:Multiply(Quaternion,float):Quaternion
         -12 (-21.05% of base) : System.Private.CoreLib.dasm - Quaternion:op_Multiply(Quaternion,float):Quaternion
         -20 (-17.09% of base) : System.Private.CoreLib.dasm - Matrix3x2:Multiply(Matrix3x2,float):Matrix3x2
         -20 (-17.09% of base) : System.Private.CoreLib.dasm - Matrix3x2:op_Multiply(Matrix3x2,float):Matrix3x2
         -48 (-16.55% of base) : System.Private.CoreLib.dasm - Vector4:Transform(Vector4,Matrix4x4):Vector4
         -16 (-15.53% of base) : System.Private.Xml.dasm - NumericExpr:GetValue(int,double,double):double
          -4 (-15.38% of base) : Newtonsoft.Json.dasm - JsonValidatingReader:FloatingPointRemainder(double,double):double
          -4 (-15.38% of base) : System.Private.CoreLib.dasm - CalendricalCalculationsHelper:Reminder(double,double):double
        -134 (-15.06% of base) : System.Private.CoreLib.dasm - Matrix4x4:Transform(Matrix4x4,Quaternion):Matrix4x4
         -36 (-14.69% of base) : System.Private.CoreLib.dasm - Vector4:Transform(Vector3,Matrix4x4):Vector4
         -48 (-14.55% of base) : System.Runtime.Numerics.dasm - Complex:Tan(Complex):Complex
         -24 (-14.46% of base) : System.Private.CoreLib.dasm - Vector3:TransformNormal(Vector3,Matrix4x4):Vector3
          -8 (-14.04% of base) : System.Private.CoreLib.dasm - CalendricalCalculationsHelper:AsLocalTime(double,double):double
          -4 (-13.79% of base) : System.Runtime.Numerics.dasm - Complex:Scale(Complex,double):Complex
         -48 (-13.75% of base) : System.Private.CoreLib.dasm - Plane:Transform(Plane,Matrix4x4):Plane
        -338 (-13.56% of base) : System.Private.CoreLib.dasm - Matrix4x4:Invert(Matrix4x4,byref):bool
        -384 (-13.55% of base) : System.Private.CoreLib.dasm - CalendricalCalculationsHelper:SumLongSequenceOfPeriodicTerms(double):double
         -24 (-13.48% of base) : System.Private.CoreLib.dasm - Vector4:Transform(Vector2,Matrix4x4):Vector4
         -65 (-13.32% of base) : System.Private.CoreLib.dasm - Matrix4x4:CreateFromQuaternion(Quaternion):Matrix4x4
         -16 (-13.22% of base) : System.Private.CoreLib.dasm - Quaternion:Normalize(Quaternion):Quaternion

171 total methods with Code Size differences (163 improved, 8 regressed), 201419 unchanged.
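As a consistency check on the summary above, the per-file numbers can be summed and compared against the reported total. This is a minimal sketch; the function name `totalFileDiffBytes` is hypothetical, and the values are copied from the file regression and improvement lists.

```cpp
#include <cassert>
#include <numeric>
#include <vector>

// Sums the per-file byte diffs listed in the summary:
// one regression (+39) followed by the eleven improvements.
int totalFileDiffBytes()
{
    const std::vector<int> perFile = {39,   -2496, -266, -178, -113, -108,
                                      -64,  -40,   -16,  -16,  -8,   -4};
    return std::accumulate(perFile.begin(), perFile.end(), 0);
}

// Usage: the sum should match "Total bytes of diff: -3270".
void checkTotal()
{
    assert(totalFileDiffBytes() == -3270);
}
```

The twelve entries also match the "12 total files with Code Size differences" line.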

@tannergooding tannergooding added the NO-MERGE label Jan 7, 2020
@tannergooding
Member Author

CC. @CarolEidt

@tannergooding
Member Author

The majority of diffs are positive changes, such as:

- vdivsd   xmm0, qword ptr [reloc @RWD00]
- vmovsd   qword ptr [rbp-48H], xmm0
+ vdivsd   xmm6, xmm0, qword ptr [reloc @RWD00]

However, there are a few cases (such as DateTimeParse:ParseFormatO) where the epilogues (and there are multiple) have an additional instruction such as vmovaps xmm6, qword ptr [rsp+50H].
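The difference between the two forms can be sketched with a simplified model in which instructions are plain strings. Everything here is an assumption for illustration: `emitScalarBinary` and its parameters are hypothetical and are not JIT APIs, and the second operand is assumed to be a memory operand that does not alias the destination.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical model (not JIT code): returns the instructions needed to compute
// dst = op1 <ins> op2 for a scalar floating-point operation.
std::vector<std::string> emitScalarBinary(bool               useVex,
                                          const std::string& ins,
                                          const std::string& dst,
                                          const std::string& op1,
                                          const std::string& op2)
{
    assert(!ins.empty() && !dst.empty() && !op1.empty() && !op2.empty());

    std::vector<std::string> out;
    if (useVex)
    {
        // VEX encoding has a separate destination, so op1 is left untouched
        // and no copy is needed even when dst != op1.
        out.push_back("v" + ins + " " + dst + ", " + op1 + ", " + op2);
    }
    else
    {
        // Legacy SSE encoding is read-modify-write: the first operand is also
        // the destination, so op1 has to be copied into dst first unless the
        // two are already the same register.
        if (dst != op1)
        {
            out.push_back("movaps " + dst + ", " + op1);
        }
        out.push_back(ins + " " + dst + ", " + op2);
    }
    return out;
}
```

With `useVex` set, `emitScalarBinary(true, "divsd", "xmm6", "xmm0", "qword ptr [m]")` yields the single instruction `vdivsd xmm6, xmm0, qword ptr [m]`. On Windows x64, xmm6 is a callee-saved register, so a method that starts using it has to save it in the prolog and restore it in each epilog; that is one plausible explanation for the extra vmovaps in the regressions, though the thread does not confirm it.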

Contributor

@CarolEidt CarolEidt left a comment

This looks OK overall, but I'm not comfortable with continuing to have fundamental support for 3-operand encodings under FEATURE_HW_INTRINSICS. I think that's something that should be cleaned up.

src/coreclr/src/jit/hwintrinsiccodegenxarch.cpp (outdated, resolved)
@tannergooding
Member Author

> This looks OK overall, but I'm not comfortable with continuing to have fundamental support for 3-operand encodings under FEATURE_HW_INTRINSICS.

Right, and I called that out above. I'm working on updating the function to work without FEATURE_HW_INTRINSICS.

@@ -1336,6 +1336,7 @@
#if defined(_TARGET_XARCH_)
void inst_RV_RV_IV(instruction ins, emitAttr size, regNumber reg1, regNumber reg2, unsigned ival);
void inst_RV_TT_IV(instruction ins, emitAttr attr, regNumber reg1, GenTree* rmOp, int ival);
Member Author

This was conditioned under FEATURE_HW_INTRINSICS for the implementation, but not the declaration here. So, I did the minimal fixup to allow it to be available without FEATURE_HW_INTRINSICS.

Contributor

@CarolEidt CarolEidt left a comment

Thanks!

@@ -909,6 +909,18 @@ void CodeGen::genCodeForBinary(GenTreeOp* treeNode)
regNumber op1reg = op1->isUsedFromReg() ? op1->GetRegNum() : REG_NA;
regNumber op2reg = op2->isUsedFromReg() ? op2->GetRegNum() : REG_NA;

if (varTypeIsFloating(treeNode->TypeGet()))
Member Author

This just hijacks all the floating-point types and forwards them to inst_RV_RV_TT; the emit helpers (such as emitIns_SIMD_R_R_R) deal with the difference between VEX and non-VEX encodings for dst and op1Reg.
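The shape of that dispatch can be sketched as follows. This is not the code from the PR: `VarType`, `BinaryNode`, `Decision`, and `selectBinaryPath` are hypothetical stand-ins, and deriving `isRMW` from VEX availability is an assumption based on the `isRMW` parameter of inst_RV_RV_TT.

```cpp
#include <cassert>

// Hypothetical stand-ins for the JIT's type and node descriptions.
enum class VarType
{
    Int,
    Long,
    Float,
    Double
};

struct BinaryNode
{
    VarType type;
};

struct Decision
{
    bool forwardToHelper; // true: route through the three-operand helper
    bool isRMW;           // only meaningful when forwardToHelper is true
};

bool isFloating(VarType t)
{
    return t == VarType::Float || t == VarType::Double;
}

// Floating-point nodes are forwarded to the helper; everything else stays on
// the existing path. Assumption: the op is read-modify-write only when VEX
// encoding is unavailable.
Decision selectBinaryPath(const BinaryNode& node, bool canUseVex)
{
    if (isFloating(node.type))
    {
        return {true, !canUseVex};
    }
    return {false, false};
}

// Usage: a double add with VEX available is forwarded and is not RMW.
void exampleUsage()
{
    Decision d = selectBinaryPath(BinaryNode{VarType::Double}, true);
    assert(d.forwardToHelper && !d.isRMW);
}
```

Keeping the VEX versus non-VEX decision inside the helper means the caller only has to distinguish floating-point nodes from the rest.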

Contributor

Great - I think that's the right approach.

@@ -5810,6 +5807,7 @@ void emitter::emitIns_SIMD_R_R_S(
}
}

#ifdef FEATURE_HW_INTRINSICS
Member Author

I only made the minimum number of emitIns_SIMD_* methods needed available outside FEATURE_HW_INTRINSICS as the others can't be encountered during normal codegen (and likely never will be).

@tannergooding tannergooding changed the title [WIP] Updating genCodeForBinary to be VEX aware Updating genCodeForBinary to be VEX aware Jan 7, 2020
@tannergooding tannergooding removed the NO-MERGE label Jan 7, 2020
@tannergooding
Member Author

CC. @dotnet/jit-contrib

// isRMW -- true if the instruction is RMW; otherwise, false
//
void CodeGen::inst_RV_RV_TT(
instruction ins, emitAttr size, regNumber targetReg, regNumber op1Reg, GenTree* op2, bool isRMW)
Contributor

isRMW should come before targetReg & operands. It's already pretty dubious that such information has to be communicated separately from instructions.

Member Author

I put this at the end to keep the parameter ordering consistent with the other overloads, and because this should likely be looked up a different way eventually (based on either the instruction or the node).

@tannergooding
Member Author

Any other feedback here? If not, I think it should be good to merge once CI passes.

@tannergooding tannergooding merged commit ef27a17 into dotnet:master Jan 10, 2020
Contributor

@CarolEidt CarolEidt left a comment

LGTM - thanks for eliminating the FEATURE_HW_INTRINSICS dependencies.

@tannergooding
Member Author

SIMD.ConsoleMandel.ScalarFloatSinglethreadADT directly benefited from this PR and is ~7.3x faster because of it:
image

@sandreenko
Contributor

> SIMD.ConsoleMandel.ScalarFloatSinglethreadADT directly benefited from this PR and is ~7.3x faster because of it.

Not sure what you mean or why it is important now, but in the left part the main problem was that these ADTs were not put in registers. That problem was fixed by #37745.

@tannergooding
Member Author

Hmm, I'm seeing 4.4s in 3.1 and 0.6s in 5.0; with it being 2.1s if just this PR is reverted; the additional moves inserted from the ops being interpreted as RMW were making a big difference in the pipelining of the code (at least from what I was able to discern from uProf).
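For reference, the ratios implied by the timings quoted above (arithmetic on the reported numbers only, not a new measurement):

```latex
\frac{4.4\,\text{s}}{0.6\,\text{s}} \approx 7.3,
\qquad
\frac{2.1\,\text{s}}{0.6\,\text{s}} = 3.5
```

So the ~7.3x figure compares 3.1 against 5.0, while reverting only this PR corresponds to roughly a 3.5x slowdown relative to 5.0.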

@sandreenko
Contributor

> Hmm, I'm seeing 4.4s in 3.1 and 0.6s in 5.0; with it being 2.1s if just this PR is reverted; the additional moves inserted from the ops being interpreted as RMW were making a big difference in the pipelining of the code (at least from what I was able to discern from uProf).

Yeah, maybe it will be 3x slower without this PR. I am just saying that the diffs you have shown are not all caused by this PR, and that this PR, applied to 3.1 alone, won't produce any diffs on this test.

@ghost ghost locked as resolved and limited conversation to collaborators Dec 11, 2020