Convert Math{F}.CoreCLR methods from FCall to QCall #39474
src/coreclr/src/System.Private.CoreLib/src/System/Math.CoreCLR.cs
This looks like a significant change on its own. Would you mind submitting it in a separate PR to make it easier to review, test and get in? |
@jkotas, @tannergooding, in CLR Edit: locally, I can only reproduce this issue with CLR's Debug build, but CI encounters it in CLR's Checked build.
# reproduces the `MathF.Abs(-3.3f)` issue
./build.sh -c Debug
# do not reproduce the issue
./build.sh -rc Checked
./build.sh -c Release |
The difference between Checked and Debug builds is the optimization settings. Debug is asserts on + optimizations off. Checked is asserts on + optimizations on. |
The remaining failures are related to 'Insert GC Polls', which seems to have changed recently (#39111). ARM64 checked tests are failing this assertion: runtime/src/coreclr/src/jit/flowgraph.cpp Lines 2341 to 2342 in c5b39f3
and x86 checked test is failing this one: runtime/src/coreclr/src/jit/flowgraph.cpp Lines 4142 to 4144 in c5b39f3
|
Thanks! I have opened #39726 on this. This change will be blocked until it is fixed. |
Doing this in the meantime may be useful. |
Submitted #39730. First and third commits are for this PR. Will rebase after JIT changes are merged. |
Which tests are failing with the JIT asserts? |
WinNT x86: https://helix.dot.net/api/2019-06-17/jobs/b855baea-b8a7-4cf9-b532-0e653190ec7c/workitems/JIT/console It is the checked build per the CI categorization, but locally it might only repro with debug ( |
`fgComputeDoms` has an assumption that the flow graph has no unreachable blocks. It's checked with this assert: https://github.com/dotnet/runtime/blob/ad325b014124b1adb9306abf95fdac85e3f7f2f4/src/coreclr/src/jit/flowgraph.cpp#L2342 This assert fired when testing dotnet#39474 (`Convert Math{F}.CoreCLR methods from FCall to QCall`) when we are updating the flow graph after inserting GC polls. This change switches `fgUpdateChangedFlowGraph` to call `fgComputeReachability`, which both removes unreachable blocks and calls `fgComputeDoms`. pin-icount reported a 0.0043% throughput improvement, which is within noise level.
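The ordering described in this commit message (drop unreachable blocks first, then compute dominators) can be illustrated with a toy graph. This is only a sketch under stated assumptions: `ToyBlock`, `mark_reachable`, and `remove_unreachable` are hypothetical names invented for illustration and are not the JIT's data structures or functions.

```c
#include <stdbool.h>
#include <stddef.h>

#define MAX_BLOCKS 16
#define MAX_SUCCS 2

/* Hypothetical toy basic block: up to two successor indices, -1 means "none". */
typedef struct {
    int succ[MAX_SUCCS];
    bool removed;
} ToyBlock;

/* Mark every block reachable from block 0 (the entry) using an explicit-stack DFS. */
static void mark_reachable(const ToyBlock *blocks, size_t count, bool *reachable)
{
    int stack[MAX_BLOCKS];
    size_t top = 0;

    for (size_t i = 0; i < count; i++) {
        reachable[i] = false;
    }
    if (count == 0) {
        return;
    }

    reachable[0] = true;
    stack[top++] = 0;
    while (top > 0) {
        int b = stack[--top];
        for (int s = 0; s < MAX_SUCCS; s++) {
            int next = blocks[b].succ[s];
            /* Each block is marked before it is pushed, so it is pushed at most once. */
            if (next >= 0 && (size_t)next < count && !reachable[next]) {
                reachable[next] = true;
                stack[top++] = next;
            }
        }
    }
}

/* Flag unreachable blocks as removed and return how many were flagged.
   A dominator computation that assumes "no unreachable blocks" could then
   safely skip every block whose removed flag is set. */
size_t remove_unreachable(ToyBlock *blocks, size_t count)
{
    bool reachable[MAX_BLOCKS];
    size_t removed = 0;

    if (count > MAX_BLOCKS) {
        return 0; /* the sketch only supports small graphs */
    }
    mark_reachable(blocks, count, reachable);
    for (size_t i = 0; i < count; i++) {
        if (!reachable[i] && !blocks[i].removed) {
            blocks[i].removed = true;
            removed++;
        }
    }
    return removed;
}
```

For a four-block graph where block 2 has no predecessor, `remove_unreachable` flags only block 2, so a later dominator pass never sees it.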
* Don't inline GC polls in cold basic blocks. * Allow GC poll inlining in basic blocks with `BBF_LOOP_PREHEADER` or `BBF_RETLESS_CALL` set. This fixes one of the asserts seen in dotnet#39474 (comment) Contributes to resolving dotnet#39726.
I opened #39878 and #39881, which will address the asserts in #39474 (comment) |
You should be able to build a benchmark using the same pattern as the GC latency benchmark for sorting from https://devblogs.microsoft.com/dotnet/performance-improvements-in-net-5/ that shows that algorithms that call Math operations a lot can stall the GC for a long time. The overhead of the GC poll should be very small once everything else works correctly. Also, this change should remove an extra layer of indirection (the QCall wrapper), which will help to pay for this bit of extra overhead. I have looked at what the JIT generates with the latest iteration of the change. I have used the same test as in #39474 (comment). It still does not seem to be working correctly. The PInvoke call is not getting inlined and there is an extra method frame that adds significant overhead. The problem is that the JIT recognizes the PInvoke as an intrinsic and does not bother inlining it. This needs to be fixed to make this perform well. |
You can get this refactoring PR going in parallel if you would like. |
Thank you for sharing that link. I will try to compare the GC polls before and after these changes. This non-QCall DllImport version is ready for use: am11@e8c4c31. It was the fastest among regressions per the benchmark. I can update the PR with it. However, it still incurs the overhead due to P/Invoke+Intrinsic not getting inlined. Is there an issue tracking it? I can also try to investigate over the weekend. The mono change was made yesterday in a separate branch, based off of this PR branch: https://github.com/am11/runtime/commits/feature/mathf-abs (top commit: |
Do you understand the reason why it was the fastest? I suspect that it was just measurement noise. I have stepped through the code on Windows and Linux (I do not have easy access to a Mac at the moment) and the code was identical to the version with QCall. Also, the change is not right on Windows. msvcrt.dll is a legacy Windows C-runtime library that we do not want to use.
There is no issue tracking it. This issue is exposed by this PR, there are no intrinsics that are DllImports today. If you can look into it, it would be great. |
@AndyAyersMS might also have an idea if there is something that might block this in inlining today. |
With the current state of this PR vs. master, I have found another difference in intrinsic recognition of Abs() flavors:
Perhaps the aggressive-inlining on Math.Abs(float), which now points to |
It may be best to do the |
I took JIT dumps from master branch using
|
@jkotas, I have run a CI job with direct P/Invoke (3957e2d), but found strange issues during the test runs. For example, this C program:
// ilogb-test.c
#include <math.h>
#include <stdio.h>
int main()
{
    printf("ilogb(NAN): %d\n", ilogb(NAN));
    return 0;
}
when compiled with
Whereas, this .NET program in master (FCall) and QCall:
using System;
class Program
{
    static void Main(string[] args) => Console.WriteLine("Math.ILogB(Double.NaN): {0}", Math.ILogB(Double.NaN));
}
produces
I am not sure how QCall/FCall mitigate this disparity. Perhaps some compiler flag? (keeping the QCall implementation for now) |
Changes:
After manually comparing some of the PR vs. master control flows in JIT, I have found that the importer needs the following patch to match code reachability of master in
--- a/src/coreclr/src/jit/importer.cpp
+++ b/src/coreclr/src/jit/importer.cpp
@@ -3608,7 +3608,7 @@ GenTree* Compiler::impIntrinsic(GenTree* newobjThis,
// If that call is an intrinsic and is expanded, codegen for NextCallReturnAddress will fail.
// To avoid that we conservatively expand only required intrinsics in methods that call
// the NextCallReturnAddress intrinsic.
- if (!mustExpand && (opts.OptimizationDisabled() || info.compHasNextCallRetAddr))
+ if (!mustExpand && (opts.OptimizationDisabled() || info.compHasNextCallRetAddr) && (ni < NI_System_Math_FusedMultiplyAdd || ni > NI_System_Math_Floor))
I am not sure about the side effects (just that it compiles and works on my machine 😅). However, this still does not solve our lcl frame size regression problem:
-; rsp based frame
+; rbp based frame
; partially interruptible
; Final local variable assignments
;
-; V00 arg0 [V00 ] ( 5, 5 ) double -> [rsp+0x00] do-not-enreg[X] addr-exposed ld-addr-op
+; V00 arg0 [V00 ] ( 5, 5 ) double -> [rbp-0x08] do-not-enreg[X] addr-exposed ld-addr-op
;# V01 OutArgs [V01 ] ( 1, 1 ) lclBlk ( 0) [rsp+0x00] "OutgoingArgSpace"
;
-; Lcl frame size = 8
+; Lcl frame size = 16
@jkotas, at this point, I think this will require help from someone more familiar with the JIT. |
It is mitigated here:
|
What is the problem that this is attempting to fix? |
I was tracing the code path due to math function calls in JIT with FCall vs. QCall. With QCall, it bails out earlier from runtime/src/coreclr/src/jit/importer.cpp Lines 3617 to 3620 in 4ec3a25
runtime/src/coreclr/src/jit/importer.cpp Lines 4285 to 4286 in 4ec3a25
|
This sounds like you have been looking at Tier0 code or optimizations off. It is fine to inline this with optimizations off. |
There are several ways this can be fixed. I think that the best way to fix this is:
Would you like to give it a shot? |
Yes, there is an extra frame not getting inlined. Run this program and set a breakpoint at the
The stacktrace when the
|
Thanks. With libSOS, I can see that the call to Math.Cos is not present in case of FCall:
--- master 2020-08-11 17:30:15.000000000 +0300
+++ pr 2020-08-11 17:30:43.000000000 +0300
@@ -1,7 +1,7 @@
-(lldb) dumpstack
-OS Thread Id: 0x121b2fc (1)
+OS Thread Id: 0x11f94c5 (1)
TEB information is not available so a stack size of 0xFFFF is assumed
Current frame: libsystem_m.dylib!cos
Child-SP RetAddr Caller, Callee
-00007FFEEFBFE6F0 000000012069595c (MethodDesc 0000000120731f60 + 0x5c Foo.Program.Work()), calling 0000000101936e40 (stub for System.Math.Cos(Double))
-00007FFEEFBFE6F8 0000000101a6e027 libcoreclr.dylib!MetaSig::Init(unsigned char const*, unsigned int, Module*, SigTypeContext const*, MetaSig::MetaSigKind) + 0x297, calling libcoreclr.dylib!SigParser::SkipExactlyOne()
+00007FFEEFBFE6D0 00000001210bd3fb (MethodDesc 00000001217a89c8 + 0x1b System.Math.Cos(Double))
+00007FFEEFBFE6F0 000000012169669c (MethodDesc 0000000121736168 + 0x5c Foo.Program.Work()), calling 0000000121696220 (stub for System.Math.Cos(Double))
+00007FFEEFBFE6F8 0000000102935bf7 libcoreclr.dylib!MetaSig::Init(unsigned char const*, unsigned int, Module*, SigTypeContext const*, MetaSig::MetaSigKind) + 0x297, calling libcoreclr.dylib!SigParser::SkipExactlyOne()
with this test, we never reach |
@jkotas, I tried implementing this approach #39474 (comment), but could not make it elide that libm call and the regression persists. I will defer to the JIT experts for the inlining part. |
With today's master branch (56950ad), I built and ran this program on macOS, that also breaks on Perf. results: https://gist.github.com/am11/65e492e856d81090530b03f0b8662f34 (vs. old results from July https://gist.github.com/am11/8dcbea6e3ca864585f14f888d10953e3) |
Fixes #13820
TODOs:
/fp:fast vs. /fp:precise when p/invoking CRT functions on Windows
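As background for this TODO: floating-point addition is not associative, so a mode that lets the compiler reorder operations (as MSVC's `/fp:fast` does) can change results compared with strict evaluation order. The sketch below only shows the arithmetic effect; `sum_left_to_right` and `sum_right_to_left` are hypothetical names, and whether this matters for the CRT calls in this PR is an open question, not something the thread has established.

```c
/* Evaluates (a + b) + c exactly as written. */
double sum_left_to_right(double a, double b, double c)
{
    return (a + b) + c;
}

/* Evaluates a + (b + c), the grouping a reassociating compiler might pick. */
double sum_right_to_left(double a, double b, double c)
{
    return a + (b + c);
}
```

With `a = 1e20`, `b = -1e20`, and `c = 1.0`, the first grouping cancels the large terms before adding `c`, while the second loses `c` when it is absorbed into `-1e20`, so the two functions return different values under strict IEEE 754 double arithmetic.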