Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Convert Math{F}.CoreCLR methods from FCall to QCall #39474

Closed
wants to merge 1 commit into from
Closed

Convert Math{F}.CoreCLR methods from FCall to QCall #39474

wants to merge 1 commit into from

Conversation

am11
Copy link
Member

@am11 am11 commented Jul 16, 2020

Fixes #13820

TODOs:

  • decide which c-runtime functions to call directly
    • CRT functions are directly registered to QCalls, without the wrappers.
  • how to handle /fp:fast vs. /fp:precise when p/invoking crt functions on Windows
    • remove the wrappers and /fp:fast logic. we will use the default settings of the project (/fp:fast is always on).

@jkotas
Copy link
Member

jkotas commented Jul 17, 2020

Convert math intrinsics to named intrinsics

This looks like a signifiant change on its own. Would you mind submitting it in a separate PR to make it easier to review, test and get in?

@am11
Copy link
Member Author

am11 commented Jul 20, 2020

@jkotas, @tannergooding, in CLR checked debug x64 builds (all platforms), Single API MathF.Abs(-3.3f) is returning -3.3 instead of 3.3. Other build flavors do not have this issue. Also the double API Math.Abs(-3.3) does not has this issue in checked debug build. I couldn't figure out the reason. Is there any special handling for Checked Debug build which I should look into?

Edit: locally, I can only reproduce this issue with CLR's Debug build, but CI encounters in CLR Checked build.

# repros `MathF.Abs(-3.3f)` issue
./build.sh -c Debug

# do not produce the issue
./build.sh -rc Checked
./build.sh -c Release

@jkotas
Copy link
Member

jkotas commented Jul 20, 2020

CLR Checked build.

The difference between checked and debug builds are optimizations settings. Debug is asserts on + optimizations off. Checked is assert on + optimizations on.

@am11
Copy link
Member Author

am11 commented Jul 21, 2020

Remaining failures are related to 'Insert GC Polls', which seems to have recent changes (#39111).

ARM64 checked tests are failing this assertion:

// After the DFS reverse postorder is completed, we must have visited all the basic blocks.
noway_assert(postIndex == fgBBcount + 1);

and x86 checked test is failing this one:

// We are allowed to split loops and we need to keep a few other flags...
//
noway_assert((originalFlags & (BBF_SPLIT_NONEXIST & ~(BBF_LOOP_HEAD | BBF_LOOP_CALL0 | BBF_LOOP_CALL1))) == 0);

@jkotas
Copy link
Member

jkotas commented Jul 21, 2020

Thanks! I have opened #39726 on this. This change will be blocked until it is fixed.

@jkotas jkotas added the blocked Issue/PR is blocked on something - see comments label Jul 21, 2020
@jkotas
Copy link
Member

jkotas commented Jul 21, 2020

Convert math intrinsics to named intrinsics

This looks like a signifiant change on its own. Would you mind submitting it in a separate PR to make it easier to review, test and get in?

Doing this in the meantime may be useful.

@am11
Copy link
Member Author

am11 commented Jul 21, 2020

Submitted #39730. First and third commits are for this PR. Will rebase after JIT changes are merged.

@erozenfeld
Copy link
Member

Which tests are failing with the JIT asserts?

@am11
Copy link
Member Author

am11 commented Jul 21, 2020

erozenfeld added a commit to erozenfeld/runtime that referenced this pull request Jul 24, 2020
`fgComputeDoms` has an assumption that the flow graph
has no unreachable blocks. It's checked with this assert:

https://github.com/dotnet/runtime/blob/ad325b014124b1adb9306abf95fdac85e3f7f2f4/src/coreclr/src/jit/flowgraph.cpp#L2342

This assert fired when testing dotnet#39474 (`Convert Math{F}.CoreCLR methods from FCall to QCall`)
when we are updating the flow graph after inserting GC polls.

This change switches `fgUpdateChangedFlowGraph` to call `fgComputeReachability`,
which both removes unreachable blocks and calls `fgComputeDoms`.

pin-icount reported a 0.0043% throughput improvement, which is within noise level.
erozenfeld added a commit to erozenfeld/runtime that referenced this pull request Jul 24, 2020
* Don't inline GC polls in cold basic blocks.
* Allow GC poll inlining in basic blocks with `BBF_LOOP_PREHEADER` or `BBF_RETLESS_CALL` set.

This fixes one of the assert seen in dotnet#39474 (comment)

Contributes to resolving dotnet#39726.
@erozenfeld
Copy link
Member

I opened #39878 and #39881, which will address the asserts in #39474 (comment)

@jkotas
Copy link
Member

jkotas commented Jul 30, 2020

You should be able to build a benchmark using the same pattern as the GC latency benchmark for sorting from https://devblogs.microsoft.com/dotnet/performance-improvements-in-net-5/ that shows that algorithms that call Math operations a lot can stall GC for a long time.

The overhead of the GC poll should be very small once everything else works correctly. Also, this change should remove extra layer of indirection (the QCall wrapper) that will help to pay for this a bit of extra overhead.

I have looked what the JIT generates with the latest iteration of the change. I have used the same test as in #39474 (comment). It still does not seem to be working correctly. The PInvoke call is not getting inlined and there is an extra method frame that adds significant overhead. The problem is that the JIT recognizes the PInvoke as intrinsic and does not bother inlining it. This needs to be fixed to make this perform well.

@jkotas
Copy link
Member

jkotas commented Jul 30, 2020

I will create a separate PR for mono and cleanup #if CORECLR from src/libraries.

You can get this refactoring PR going in parallel if you would like.

@am11
Copy link
Member Author

am11 commented Jul 30, 2020

Thank you for sharing that link. I will try to compare the GC polls before and after these changes.

This non-QCall DllImport version is ready for use: am11@e8c4c31. It was the fastest among regressions per the benchmark. I can update the PR with it. However, it sill incurs the overhead due to P/Invoke+Intrinsic not getting inlined. Is there an issue tracking it? I can also try to investigate over the weekend.

The mono change was made yesterday in a separate branch, based off of this PR branch: https://github.com/am11/runtime/commits/feature/mathf-abs (top commit: Intrinsify MathF.Abs(float) instead of Math.Abs). I was thinking about creating that PR after these changes go in.

@jkotas
Copy link
Member

jkotas commented Jul 30, 2020

It was the fastest among regressions per the benchmark.

Do you understand the reason why was the fastest? I suspect that it was just a measurement noise. I have stepped through the code on Windows and Linux (I do not have an easy access to mac at the moment) and the code was identical to the version with QCall. Also, the change is not right on Windows. msvcrt.dll is a legacy Windows C-runtime library that we do not want to use.

it still incurs the overhead due to P/Invoke+Intrinsic not getting inlined. Is there an issue tracking it?

There is no issue tracking it. This issue is exposed by this PR, there are no intrinsics that are DllImports today. If you can look into it, it would be great.

@tannergooding
Copy link
Member

@AndyAyersMS might also have an idea if there is something that might block this in inlining today.

@am11
Copy link
Member Author

am11 commented Jul 31, 2020

With the current state of this PR vs. master, i have found another difference in intrinsic recognition of Abs() flavors:

Method Branch Named Intrinsic {0}: Recognized
Math.Abs(double) master Yes, {0} -> System.Math.Abs
Math.Abs(double) pr Yes, {0} -> System.Math.Abs
Math.Abs(float) master Yes, {0} -> System.Math.Abs
Math.Abs(float) pr No
MathF.Abs(float) master Yes, {0} -> System.Math.Abs
MathF.Abs(float) pr Yes, {0} -> System.MathF.Abs

Perhaps the agressive-inlining on Math.Abs(float), which now points to [Intrinsic] MathF.Abs(float) is behaving unexpectedly.

@jkotas
Copy link
Member

jkotas commented Jul 31, 2020

It may be best to do the Math.Abs refactoring in separate PR where it will be easier to validate that it is not regressing anything.

@am11
Copy link
Member Author

am11 commented Aug 1, 2020

I took JIT dumps from master branch using COMPlus_JitDump=* COMPlus_JitPrintInlinedMethods=1, in order to set the expectation. With FCalls even, there is no math in the context of inlin* found in the dumps, although there are many other inline/inliner/inlining related infos found:

...
----------- Statements (and blocks) added due to the inlining of call [000293] -----------
Inlinee method body:fgInlineAppendStatements: no gc ref inline locals.
Successfully inlined Internal.Runtime.CompilerServices.Unsafe:As(byref):byref (2 IL bytes) (depth 2) [aggressive inline attribute]
INLINER: during 'fgInline' result 'success' reason 'aggressive inline attribute' for 'System.SpanHelpers:Contains(byref,ushort,int):bool' calling 'Internal.Runtime.CompilerServices.Unsafe:As(byref):byref'
INLINER: during 'fgInline' result 'success' reason 'aggressive inline attribute'
Inlines into 060013C6 System.SpanHelpers:Contains(byref,ushort,int):bool
  [1 IL=0007 TR=000005 06005898] [profitable inline] System.Diagnostics.Debug:Assert(bool)
...
INLINER: during 'fgInline' result 'failed this call site' reason 'unprofitable inline' for 'ILStubClass:IL_STUB_PInvoke(System.Runtime.InteropServices.SafeHandle,long,int):int' calling 'System.StubHelpers.StubHelpers:SafeHandleAddRef(System.Runtime.InteropServices.SafeHandle,byref):long'
INLINER: during 'fgInline' result 'failed this call site' reason 'unprofitable inline'
INLINER: inlineInfo.tokenLookupContextHandle for System.StubHelpers.StubHelpers:SafeHandleRelease(System.Runtime.InteropServices.SafeHandle) set to 0x0000000115FEFCE1:
Invoking compiler for the inlinee method System.StubHelpers.StubHelpers:SafeHandleRelease(System.Runtime.InteropServices.SafeHandle) :
INLINER impTokenLookupContextHandle for System.StubHelpers.StubHelpers:SafeHandleRelease(System.Runtime.InteropServices.SafeHandle) is 0x0000000115FEFCE1.
Inline candidate has an arg that feeds a constant test.  Multiplier increased to 2.
Native estimate for function size exceeds threshold for inlining 30.8 > 28 (multiplier = 3.3)
Inline expansion aborted, inline not profitable
INLINER: during 'fgInline' result 'failed this call site' reason 'unprofitable inline' for 'ILStubClass:IL_STUB_PInvoke(System.Runtime.InteropServices.SafeHandle,long,int):int' calling 'System.StubHelpers.StubHelpers:SafeHandleRelease(System.Runtime.InteropServices.SafeHandle)'
INLINER: during 'fgInline' result 'failed this call site' reason 'unprofitable inline'
Inlines into 06000000 ILStubClass:IL_STUB_PInvoke(System.Runtime.InteropServices.SafeHandle,long,int):int
...

@am11
Copy link
Member Author

am11 commented Aug 7, 2020

Also, this change should remove extra layer of indirection (the QCall wrapper) that will help to pay for this a bit of extra overhead.

@jkotas, I have ran CI job with direct P/Invoke 3957e2d, but found strange issues during the test runs. For example,

this C program:

// ilogb-test.c
#include<math.h>
#include<stdio.h>

int main()
{
    printf("ilogb(NAN): %d\n", ilogb(NAN));
}

when compiled with clang -lm ilogb-test.c on Linux x64, results in ilogb(NAN): -2147483648.
when compiled with cl ilogb-test.c on Windows x64, results in ilogb(NAN): 2147483647.

Whereas, this .NET program in master (FCall) and QCall:

using System;

class Program
{
    static void Main(string[] args) => Console.WriteLine("Math.ILogB(Double.NaN): {0}", Math.ILogB(Double.NaN));
}

produces Math.ILogB(Double.NaN): 2147483647 on both Linux and Windows.
With direct P/Invoke, we get Math.ILogB(Double.NaN): -2147483648.

I am not sure how QCall/FCall mitigate this disparity. Perhaps some compiler flag?

(keeping QCall implementation for now)

@am11
Copy link
Member Author

am11 commented Aug 7, 2020

Changes:

  • Revert Math.Abs refactoring.
  • Ignore SuppressGCTransitionAttribute for ref.
    • Some recent change in master is forcing it (see APICompat failures on Linux #40416 e.g.). After the rebase with master, it started giving errors like:

      .packages/microsoft.dotnet.apicompat/5.0.0-beta.20381.6/build/Microsoft.DotNet.ApiCompat.targets(145,5): error : (NETCORE_ENGINEERING_TELEMETRY=Build) CannotRemoveAttribute : Attribute 'System.Runtime.InteropServices.SuppressGCTransitionAttribute' exists on 'System.Math.Abs(System.Double)' in the implementation but not the reference.


After manually comparing some of the pr vs. master control flows in JIT, I have found that importer needs the following patch to match code reachability of master in Compiler::impIntrinsic function:

--- a/src/coreclr/src/jit/importer.cpp
+++ b/src/coreclr/src/jit/importer.cpp
@@ -3608,7 +3608,7 @@ GenTree* Compiler::impIntrinsic(GenTree*                newobjThis,
     // If that call is an intrinsic and is expanded, codegen for NextCallReturnAddress will fail.
     // To avoid that we conservatively expand only required intrinsics in methods that call
     // the NextCallReturnAddress intrinsic.
-    if (!mustExpand && (opts.OptimizationDisabled() || info.compHasNextCallRetAddr))
+    if (!mustExpand && (opts.OptimizationDisabled() || info.compHasNextCallRetAddr) && (ni < NI_System_Math_FusedMultiplyAdd || ni > NI_System_Math_Floor))

I am not sure about the side-effects (just that it compiles and works on my machine 😅).

However, this still does not solve our lcl frame size regression problem:

-; rsp based frame
+; rbp based frame
 ; partially interruptible
 ; Final local variable assignments
 ;
-;  V00 arg0         [V00    ] (  5,  5   )  double  ->  [rsp+0x00]   do-not-enreg[X] addr-exposed ld-addr-op
+;  V00 arg0         [V00    ] (  5,  5   )  double  ->  [rbp-0x08]   do-not-enreg[X] addr-exposed ld-addr-op
 ;# V01 OutArgs      [V01    ] (  1,  1   )  lclBlk ( 0) [rsp+0x00]   "OutgoingArgSpace"
 ;
-; Lcl frame size = 8
+; Lcl frame size = 16

@jkotas, at this point, I think this will require help from someone more familiar with JIT.

@jkotas
Copy link
Member

jkotas commented Aug 10, 2020

I am not sure how QCall/FCall mitigate this disparity. Perhaps some compiler flag?

It is mitigated here:

PALIMPORT int __cdecl PAL_ilogb(double x)

@jkotas
Copy link
Member

jkotas commented Aug 10, 2020

After manually comparing some of the pr vs. master control flows in JIT, I have found that importer needs the following patch to match code reachability of master

What is the problem that this is attempting to fix?

Jacksondr5 pushed a commit to Jacksondr5/runtime that referenced this pull request Aug 10, 2020
* Don't inline GC polls in cold basic blocks.
* Allow GC poll inlining in basic blocks with `BBF_LOOP_PREHEADER` or `BBF_RETLESS_CALL` set.

This fixes one of the assert seen in dotnet#39474 (comment)

Contributes to resolving dotnet#39726.
Jacksondr5 pushed a commit to Jacksondr5/runtime that referenced this pull request Aug 10, 2020
`fgComputeDoms` has an assumption that the flow graph
has no unreachable blocks. It's checked with this assert:

https://github.com/dotnet/runtime/blob/ad325b014124b1adb9306abf95fdac85e3f7f2f4/src/coreclr/src/jit/flowgraph.cpp#L2342

This assert fired when testing dotnet#39474 (`Convert Math{F}.CoreCLR methods from FCall to QCall`)
when we are updating the flow graph after inserting GC polls.

This change switches `fgUpdateChangedFlowGraph` to call `fgComputeReachability`,
which both removes unreachable blocks and calls `fgComputeDoms`.

pin-icount reported a 0.0043% throughput improvement, which is within noise level.
@am11
Copy link
Member Author

am11 commented Aug 10, 2020

What is the problem that this is attempting to fix?

I was tracing the code path due to math function calls in JIT with FCall vs. QCall. With QCall, it bails out earlier from

if (!mustExpand && (opts.OptimizationDisabled() || info.compHasNextCallRetAddr))
{
*pIntrinsicID = CORINFO_INTRINSIC_Illegal;
return retNode;
with FCall, it goes all the way to
retNode = impMathIntrinsic(method, sig, callType, ni, tailCall);
break;

@jkotas
Copy link
Member

jkotas commented Aug 11, 2020

I was tracing the code path due to math function calls in JIT with FCall vs. QCall. With QCall, it bails out earlier from

This sounds like you have been looking at Tier0 code or optimizations off. It is fine to inline this with optimizations off.

@jkotas
Copy link
Member

jkotas commented Aug 11, 2020

an extra method frame that adds significant overhead.

There are several ways this can be fixed. I think that the best way to fix this is:

  • Add a new flag to GenTreeIntrinsic that says that the call will need gcpoll if the intrinsic is converted to one later in RewriteIntrinsicAsUserCall
  • The importer needs to create GenTreeIntrinsic with the actual unmanaged target for these math intrinsics
  • Compiler::fgInsertGCPolls needs to recognize GenTreeIntrinsic with the gcpoll flag set in addition to the regular call

Would you like to give it a shot?

@jkotas
Copy link
Member

jkotas commented Aug 11, 2020

is there really an extra frame due to P/Invoke not getting inlined

Yes, there is an extra frame not getting inlined. Run this program and set a breakpoint at the cos method in libc / ucrtbase:

using System;

class Program
{
    static void Main() { for (;;) Work(); }

    static double Work()
    {
        double arg = 0;
        double sum = 0;
        for (int i = 0; i < 100000000; i++)
        {
            sum += Math.Cos(arg); arg += 1e-8;
        }
        return sum;
    }
}

The stacktrace when the cos method is called should look like this:

ucrtbase!cos 
System.Math.Cos(Double)+0x1b <- this is the extra frame that we need to get rid off
test!Program.Work()+0x27

@am11
Copy link
Member Author

am11 commented Aug 11, 2020

Thanks. With libSOS, I can see that the call to Math.Cos is not present in case of FCall:

--- master	2020-08-11 17:30:15.000000000 +0300
+++ pr		2020-08-11 17:30:43.000000000 +0300
@@ -1,7 +1,7 @@
-(lldb) dumpstack
-OS Thread Id: 0x121b2fc (1)
+OS Thread Id: 0x11f94c5 (1)
 TEB information is not available so a stack size of 0xFFFF is assumed
 Current frame: libsystem_m.dylib!cos
 Child-SP         RetAddr          Caller, Callee
-00007FFEEFBFE6F0 000000012069595c (MethodDesc 0000000120731f60 + 0x5c Foo.Program.Work()), calling 0000000101936e40 (stub for System.Math.Cos(Double))
-00007FFEEFBFE6F8 0000000101a6e027 libcoreclr.dylib!MetaSig::Init(unsigned char const*, unsigned int, Module*, SigTypeContext const*, MetaSig::MetaSigKind) + 0x297, calling libcoreclr.dylib!SigParser::SkipExactlyOne()
+00007FFEEFBFE6D0 00000001210bd3fb (MethodDesc 00000001217a89c8 + 0x1b System.Math.Cos(Double))
+00007FFEEFBFE6F0 000000012169669c (MethodDesc 0000000121736168 + 0x5c Foo.Program.Work()), calling 0000000121696220 (stub for System.Math.Cos(Double))
+00007FFEEFBFE6F8 0000000102935bf7 libcoreclr.dylib!MetaSig::Init(unsigned char const*, unsigned int, Module*, SigTypeContext const*, MetaSig::MetaSigKind) + 0x297, calling libcoreclr.dylib!SigParser::SkipExactlyOne()

with this test, we never reach Compiler::impMathIntrinsic in importer, where GentTreeIntrinsic nodes are created (with both master and pr; tested with release and checked builds).

@am11 am11 marked this pull request as draft August 20, 2020 18:01
@am11
Copy link
Member Author

am11 commented Aug 20, 2020

@jkotas, I tried implementing this approach #39474 (comment), but could not make it elide that libm call and the regression persists. I will defer to the JIT experts for the inlining part.

@am11
Copy link
Member Author

am11 commented Oct 5, 2020

using System;

class Program
{
    static void Main() { for (;;) Work(); }

    static double Work()
    {
        double arg = 0;
        double sum = 0;
        for (int i = 0; i < 100000000; i++)
        {
            sum += Math.Cos(arg); arg += 1e-8;
        }
        return sum;
    }
}

The stacktrace when the cos method is called should look like this:

ucrtbase!cos 
System.Math.Cos(Double)+0x1b <- this is the extra frame that we need to get rid off
test!Program.Work()+0x27

With today's master branch (56950ad), I built and ran this program on macOS, that also breaks on frame #0: 0x00007fff789e0c1c libsystem_m.dylib`cos, and jit-diff only reports textual PMI diffs with master and rebased of this PR. If ucrtbase's cos is also hitting on Windows with master branch, then looks like perf. has regressed since?

Pref. results: https://gist.github.com/am11/65e492e856d81090530b03f0b8662f34 (vs. old results from July https://gist.github.com/am11/8dcbea6e3ca864585f14f888d10953e3)

@am11 am11 closed this Nov 27, 2020
@am11 am11 deleted the feature/fcalls-to-qcalls branch November 27, 2020 01:02
@ghost ghost locked as resolved and limited conversation to collaborators Dec 27, 2020
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Calling System.Math floating point operations in a loop causes long GC pause times
6 participants