This repository has been archived by the owner on Jan 23, 2023. It is now read-only.

Conditionally remove the GC transition from a P/Invoke #26458

Merged

AaronRobinsonMSFT
Member

@AaronRobinsonMSFT AaronRobinsonMSFT commented Aug 31, 2019

API Review at dotnet/runtime#30741

This has been tested with a simple native function BOOL NextUInt(DWORD *t) and works in all scenarios:

public static class NativeLibrary
{
    [DllImport(nameof(NativeLibrary), EntryPoint = "NextUInt")]
    [SuppressGCTransition]
    public static extern unsafe int NextUInt_Fast(int* n);

    [DllImport(nameof(NativeLibrary), EntryPoint = "NextUInt")]
    public static extern unsafe int NextUInt_Slow(int* n);

    [DllImport(nameof(NativeLibrary), EntryPoint = "NextUInt")]
    [SuppressGCTransition]
    public static extern unsafe bool NextUInt_VerySlow(int* n);

    [DllImport(nameof(NativeLibrary), EntryPoint = "NextUInt")]
    public static extern unsafe bool NextUInt_Slowest(int* n);
}

TODO:

Future issues to file:

/cc @jkotas @davidwrighton @jkoritzinsky @dotnet/jit-contrib @jeffschwMSFT

@jkotas
Member

jkotas commented Aug 31, 2019

#22320

@AndyAyersMS
Member

Benchmarking pinvoke can be tricky. Some of the cost is paid in the prolog/epilog and some at the call site.

See #2373 for some notes.

You should try and find the actual bit of code BDN is measuring. I think it is saved off somewhere under artifacts.

@AaronRobinsonMSFT
Member Author

AaronRobinsonMSFT commented Aug 31, 2019

@AndyAyersMS Thanks. I actually removed that comment because I realized I wasn't using BenchmarkDotNet properly with CoreRun. I figured it out and below are the results that I was anticipating.

This is a nice win based on the numbers below. The most obvious takeaway is that removing the GC transition yields a substantial benefit even without inlining.

BenchmarkDotNet=v0.11.5, OS=Windows 10.0.17134.950 (1803/April2018Update/Redstone4)
Intel Core i7-6650U CPU 2.20GHz (Skylake), 1 CPU, 4 logical and 2 physical cores
.NET Core SDK=3.0.100-rc1-014164
  [Host]     : .NET Core 3.0.0-rc1-19430-09 (CoreCLR 4.700.19.43003, CoreFX 4.700.19.42010), 64bit RyuJIT
  Job-SWMKPL : .NET Core ? (CoreCLR 5.0.19.43001, CoreFX 5.0.19.42613), 64bit RyuJIT

Runtime=Core  Toolchain=CoreRun  

| Method                  | Mean      | Error     | StdDev    |
|-------------------------|----------:|----------:|----------:|
| NoInline_GCTransition   | 16.978 ns | 0.3587 ns | 0.3356 ns |
| NoInline_NoGCTransition | 10.960 ns | 0.1100 ns | 0.1029 ns |
| Inline_GCTransition     | 15.269 ns | 0.1089 ns | 0.0909 ns |
| Inline_NoGCTransition   |  8.879 ns | 0.1646 ns | 0.1459 ns |
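
To put the benchmark means in relative terms, the sketch below computes the per-call saving and the percentage reduction for each pair of rows. The `Saving` struct and `ComputeSaving` helper are illustrative names and not part of the PR; the inputs are the Mean values reported above.

```cpp
#include <cassert>
#include <cmath>

// Illustrative helper (not part of the PR): compares two mean timings in nanoseconds.
struct Saving
{
    double absoluteNs; // time saved per call
    double percent;    // saving relative to the baseline
};

Saving ComputeSaving(double baselineNs, double improvedNs)
{
    assert(baselineNs > 0.0);
    double saved = baselineNs - improvedNs;
    return Saving{ saved, 100.0 * saved / baselineNs };
}

// Mean values copied from the benchmark results.
const Saving kNoInline = ComputeSaving(16.978, 10.960); // about 6.0 ns, roughly 35%
const Saving kInline   = ComputeSaving(15.269, 8.879);  // about 6.4 ns, roughly 42%
```

In both pairs the saving is about 6 ns per call, so the relative win is larger for the cheaper inlined case.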

Details of benchmark

Native target

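#include <atomic>
#include <cstdint>

// EXPORT, CALLCONV, BOOL, TRUE and FALSE are assumed to come from the
// test's platform headers; they are not defined in this snippet.
namespace
{
    std::atomic<uint32_t> _n{ 0 };
}

EXPORT
BOOL CALLCONV NextUInt(/* out */ uint32_t *n)
{
    if (n == nullptr)
        return FALSE;

    // Atomically increment the shared counter and hand back the new value.
    *n = (++_n);
    return TRUE;
}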

P/Invoke definitions

static class NativeLib
{
  [DllImport(nameof(NativeLib), EntryPoint = "NextUInt")]
  public static extern unsafe bool NextUInt_NoInline_GCTransition(int* n);

  [DllImport(nameof(NativeLib), EntryPoint = "NextUInt")]
  [SuppressGCTransition]
  public static extern unsafe bool NextUInt_NoInline_NoGCTransition(int* n);

  [DllImport(nameof(NativeLib), EntryPoint = "NextUInt")]
  public static extern unsafe int NextUInt_Inline_GCTransition(int* n);

  [DllImport(nameof(NativeLib), EntryPoint = "NextUInt")]
  [SuppressGCTransition]
  public static extern unsafe int NextUInt_Inline_NoGCTransition(int* n);
}

@jkotas
Member

jkotas commented Sep 1, 2019

We should think about how to enable the same capability for function pointers. It would be unfortunate for this optimization to be limited to DllImport.

@AaronRobinsonMSFT AaronRobinsonMSFT changed the title from "[PROTOTYPE] Conditionally remove the GC transition from a P/Invoke" to "Conditionally remove the GC transition from a P/Invoke" on Sep 1, 2019

@AaronRobinsonMSFT
Member Author

Apply SuppressGCTransition to appropriate P/Invoke calls during WPF and ASP.Net Core start-up.

/cc @sbomer @vitek-karas

@jkotas
Member

jkotas commented Sep 2, 2019

Apply SuppressGCTransition to appropriate P/Invoke calls during WPF and ASP.Net Core start-up.

SuppressGCTransition is not a startup-time optimization; it is not going to help startup time in a measurable way.

SuppressGCTransition is a steady-state throughput optimization. We should be looking for places where cheap P/Invokes are called frequently in steady state.

@jkotas
Member

jkotas commented Sep 2, 2019

appropriate P/Invoke calls in System.Private.CoreLib

It may be useful if this PR marks 5 - 10 P/Invokes in System.Private.CoreLib with this. It would show by example where this is meant to be used.

@jkotas
Member

jkotas commented Sep 2, 2019

removal of overhead. The suggestion for a Roslyn analyzer is intended to provide user guidance.

I do not think you can fix this problem with a Roslyn analyzer.

I think that it is important that this feature does not contribute to the unpredictable GC pauses. It is a good trade-off to take a small performance hit for that.

@jkotas
Member

jkotas commented Sep 2, 2019

Here is an example of the type of issue that we are exposing ourselves to by ignoring the GC poll problem:

change 1239317 edit on 2005/08/27 18:18:35

	Title: Pri 0 bug fix 549067 (Thread.IsAlive)
	
	Bugs fixed:
	  549067 SelfTest failure: Thread.IsAlive needs to poll for GC
	
	Issue description:
	  Thread.IsAlive and Thread.get_ThreadState need to poll for a GC to
	  prevent starving the GC thread.
	
	Change description:
	  Add in FC_GC_POLL_RET() in two places.

We have fixed many bugs like this over the years in FCall implementations, and I am sure that there are a number of them left. We need to make sure that this feature is not a farm for bugs like this.
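
As a rough illustration of why a missing poll starves the GC, the sketch below models cooperative suspension with plain C++ threads. Every name in it (`g_suspendRequested`, `GcPoll`, `Worker`, `SuspendAndResumeWorker`) is a hypothetical stand-in and none of it is CoreCLR code: the "GC" can only proceed once the worker reaches a poll, so a loop that never polls would keep it waiting indefinitely.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <thread>

// Hypothetical stand-ins for runtime state; none of these names exist in CoreCLR.
std::atomic<bool> g_suspendRequested{ false }; // set by the "GC" when it wants threads to stop
std::atomic<bool> g_workerParked{ false };     // set by the worker once it has stopped
std::atomic<bool> g_stop{ false };             // ends the demo

// The poll: a cheap check the worker runs at a safe point.
void GcPoll()
{
    if (!g_suspendRequested.load(std::memory_order_acquire))
        return;

    g_workerParked.store(true, std::memory_order_release);
    while (g_suspendRequested.load(std::memory_order_acquire))
        std::this_thread::yield(); // wait until the "GC" is done
    g_workerParked.store(false, std::memory_order_release);
}

// A tight loop that never blocks. Without the GcPoll() call, the "GC" below
// would wait for as long as the loop keeps running.
uint64_t Worker()
{
    uint64_t iterations = 0;
    while (!g_stop.load(std::memory_order_relaxed))
    {
        ++iterations;
        GcPoll();
    }
    return iterations;
}

// Plays the role of the GC: asks the worker to stop, waits for it, then resumes it.
bool SuspendAndResumeWorker()
{
    g_stop.store(false, std::memory_order_relaxed);
    std::thread worker(Worker);

    g_suspendRequested.store(true, std::memory_order_release);
    while (!g_workerParked.load(std::memory_order_acquire))
        std::this_thread::yield();

    bool parked = g_workerParked.load(std::memory_order_acquire);

    g_stop.store(true, std::memory_order_relaxed);
    g_suspendRequested.store(false, std::memory_order_release);
    worker.join();
    return parked;
}
```

The time the "GC" spends in its wait loop corresponds to the pause; it is bounded by the distance between two polls in the worker.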

@filipnavara
Member

Looking forward to it. This could make it easier to implement things like DateTime.Now without internal calls.

@jkotas
Member

jkotas commented Sep 3, 2019

I think that it is important that this feature does not contribute to the unpredictable GC pauses

Here is a test to demonstrate the problem:

using System;
using System.Threading;
using System.Runtime.InteropServices;

class Test
{
    [DllImport("kernel32.dll")]
    [SuppressGCTransitionAttribute]
    extern static int GetTickCount();

    static void Main()
    {
        new Thread(() => { for(;;) GetTickCount(); }).Start();

        for (int i = 0; ; i++) i.ToString();
    }
}

Run this under PerfView for 1 minute and look at the median pause times in GC stats.

Baseline: 0.1 milliseconds
The current PR: 144 milliseconds.

We consider any GC pause times over 10 milliseconds to be performance bugs that are worth looking into.

@AaronRobinsonMSFT
Member Author

AaronRobinsonMSFT commented Sep 3, 2019

Here is a test to demonstrate the problem:

Yep. I wrote something similar. I'm just not convinced this is something we should try to mitigate. The point of the attribute is to squeeze out as much performance as possible. As pointed out, running PerfView can easily help identify this issue, as can asking the question "Are you using SuppressGCTransitionAttribute?".

Going through the thought experiment of what the worst-case scenario is, or the scenarios where a user would get into trouble, there are really only two:

  1. Violate one of the statements in the API documentation.
  2. Call the function in a tight loop.

The first is handled by documentation and the second by review, which, as we have already discussed above, should happen for any function being considered for the attribute. If a loop is really needed, then the advice I have been giving for years kicks in: do as much work as possible without making a transition. In this case, the user should write the desired loop in unmanaged code and call that function in a normal P/Invoke.
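
A minimal sketch of that advice, applied to the `NextUInt` native target from the benchmark: instead of calling `NextUInt` in a managed loop, a batched export does the loop natively and is called once. `FillNextUInts` is a hypothetical name that does not appear in the PR, and a real export would carry the same `EXPORT` and `CALLCONV` decorations as `NextUInt`.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>

namespace
{
    std::atomic<uint32_t> _n{ 0 };
}

// Hypothetical batched counterpart to NextUInt: fills `count` slots in one call,
// so the managed caller pays for a single transition instead of one per value.
extern "C" int FillNextUInts(/* out */ uint32_t* buffer, size_t count)
{
    if (buffer == nullptr)
        return 0; // FALSE

    for (size_t i = 0; i < count; ++i)
        buffer[i] = ++_n; // same atomic increment as NextUInt

    return 1; // TRUE
}
```

Because the loop can run for an arbitrary amount of time, a function like this would be declared as a normal P/Invoke, without the attribute.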

I agree the starvation issue can be mitigated to some degree, but not fully, due to the complicated checklist for API usage. Detecting GC starvation can be done with the tools, and adding the polling seems to me like we are hiding the issues: code may appear to work during development because of the polling, only for the issue to present itself in production. I believe we should let issues surface as soon as possible so users detect them early instead of being led into a false sense of "it seems to work fine".

@stephentoub
Member

I'm still trying to understand where this could actually be used in .NET Core itself. What are some clear examples where we'd use it? GetTickCount was the one example cited, and then also used to demonstrate a problem.

@AaronRobinsonMSFT
Member Author

@stephentoub Some examples: GetCurrentThreadId(), GetCurrentProcessId(), QueryPerformanceCounter() (this one could be tricky on older systems), any internal calls for runtime data querying.

I would imagine that @tannergooding and @jkotas can give additional examples.

@jkotas
Member

jkotas commented Sep 3, 2019

In this case, the user should write the desired loop in unmanaged code

The problem is composition. This will be used to implement library APIs, and the same restrictions would transitively apply to the library APIs that use it as an implementation detail. For example, if this is used for QueryPerformanceCounter(), the Stopwatch documentation would need to include a warning that calling Stopwatch too often will lead to long GC pause times. That would be super confusing.

Detecting GC starvation can be done with the tools

Connecting long GC pause times to a root cause is extremely expensive. Long GC pause times in production happen intermittently, and one typically cannot afford to run verbose tracing in production. Long GC pause times are much harder to diagnose than crashes.

Also, absolute average throughput is a performance metric where we are doing fine; low, predictable GC pause times not so much. Our GC-related investments these days are centered around having low, predictable GC pause times.

adding the polling seems to me like we are hiding the issues

Adding the polling is not hiding issues. It is a correct-by-construction solution. If we really wanted to, we could teach the JIT to optimize the polling out when it can prove that there are enough other opportunities for the GC to kick in.

and then also used to demonstrate a problem.

The problem is a bug in the feature implementation. GetTickCount is the kind of API that this is intended for.

@jkotas
Member

jkotas commented Sep 3, 2019

The repro in #26458 (comment) also shows a codegen inefficiency. There is unnecessary zeroing of r11 before the call. We should get rid of it:

00007ff8`13bc20fa 4533db          xor     r11d,r11d // This is unnecessary
00007ff8`13bc20fd e83e000000      call    CLRStub[JumpStub]@7ff813bc2140 (00007ff8`13bc2140)
00007ff8`13bc2102 ebf6            jmp     x!Test+<>c.<Main>b__1_0()+0xa (00007ff8`13bc20fa)

@AaronRobinsonMSFT
Member Author

There is unnecessary zeroing of r11 before the call.

That is for the call cookie. I will remove it.

@AaronRobinsonMSFT
Member Author

AaronRobinsonMSFT commented Sep 3, 2019

Actually scratch that. I think this is an existing issue (see IN0009). Below is the generated code for the call to GetTickCount() without this feature. I don't know enough to remove this properly. With SuppressGCTransitionAttribute applied, the xor is inserted due to the creation of the P/Invoke cookie, which I think is for varargs:

coreclr/src/jit/morph.cpp

Lines 2865 to 2879 in b18ea9c

// We are allowed to have a Fixed Return Buffer argument combined
// with any of the remaining non-standard arguments
//
if (call->IsUnmanaged() && !opts.ShouldUsePInvokeHelpers())
{
    assert(!call->gtCallCookie);
    // Add a conservative estimate of the stack size in a special parameter (r11) at the call site.
    // It will be used only on the intercepted-for-host code path to copy the arguments.
    GenTree* cns = new (this, GT_CNS_INT) GenTreeIntCon(TYP_I_IMPL, fgEstimateCallStackSize(call));
    call->gtCallArgs = gtNewListNode(cns, call->gtCallArgs);
    numArgs++;
    nonStandardArgs.Add(cns, REG_PINVOKE_COOKIE_PARAM);
}

*************** After end code gen, before unwindEmit()
G_M23371_IG01:        ; func=00, offs=000000H, size=001CH, gcrefRegs=00000000 {}, byrefRegs=00000000 {}, byref, nogc <-- Prolog IG

IN001f: 000000 push     rbp
IN0020: 000001 push     r15
IN0021: 000003 push     r14
IN0022: 000005 push     r13
IN0023: 000007 push     r12
IN0024: 000009 push     rdi
IN0025: 00000A push     rsi
IN0026: 00000B push     rbx
IN0027: 00000C sub      rsp, 120
IN0028: 000010 lea      rbp, [rsp+B0H]
IN0029: 000018 mov      gword ptr [V00 rbp+10H], rcx

G_M23371_IG02:        ; offs=00001CH, size=004EH, gcrefRegs=00000000 {}, byrefRegs=00000000 {}, byref

IN0001: 00001C lea      rcx, [V03+0x8 rbp-80H]
IN0002: 000020 mov      rdx, r10
IN0003: 000023 call     CORINFO_HELP_INIT_PINVOKE_FRAME
IN0004: 000028 mov      qword ptr [V02 rbp-40H], rax
IN0005: 00002C mov      r11, rsp
IN0006: 00002F mov      qword ptr [V03+0x28 rbp-60H], r11
IN0007: 000033 mov      r11, rbp
IN0008: 000036 mov      qword ptr [V03+0x38 rbp-50H], r11
IN0009: 00003A xor      r11, r11
IN000a: 00003D mov      rax, 0x7FF9F80E90D8
IN000b: 000047 mov      qword ptr [V03+0x18 rbp-70H], rax
IN000c: 00004B lea      rax, G_M23371_IG04
IN000d: 000052 mov      qword ptr [V03+0x30 rbp-58H], rax
IN000e: 000056 mov      rax, qword ptr [V02 rbp-40H]
IN000f: 00005A lea      rdx, bword ptr [V03+0x8 rbp-80H]
IN0010: 00005E mov      qword ptr [rax+16], rdx
IN0011: 000062 mov      rax, qword ptr [V02 rbp-40H]
IN0012: 000066 mov      byte  ptr [rax+12], 0

G_M23371_IG03:        ; offs=00006AH, size=0006H, gcrefRegs=00000000 {}, byrefRegs=00000000 {}, byref

IN0013: 00006A call     [PInvokeTesting.NativeLib:GetTickCount():int]

G_M23371_IG04:        ; offs=000070H, size=0017H, gcrefRegs=00000000 {}, byrefRegs=00000000 {}, byref, isz

IN0014: 000070 mov      rdx, qword ptr [V02 rbp-40H]
IN0015: 000074 mov      byte  ptr [rdx+12], 1
IN0016: 000078 cmp      dword ptr [(reloc 0x7ffa57ef8ef8)], 0
IN0017: 00007F je       SHORT G_M23371_IG05
IN0018: 000081 call     [CORINFO_HELP_STOP_FOR_GC]

G_M23371_IG05:        ; offs=000087H, size=000EH, gcrefRegs=00000000 {}, byrefRegs=00000000 {}, byref, isz

IN0019: 000087 mov      rax, qword ptr [V02 rbp-40H]
IN001a: 00008B mov      rdx, bword ptr [V03+0x10 rbp-78H]
IN001b: 00008F mov      qword ptr [rax+16], rdx
IN001c: 000093 jmp      SHORT G_M23371_IG06

G_M23371_IG06:        ; offs=000095H, size=0008H, gcrefRegs=00000000 {}, byrefRegs=00000000 {}, byref

IN001d: 000095 mov      rax, qword ptr [V02 rbp-40H]
IN001e: 000099 mov      byte  ptr [rax+12], 1

G_M23371_IG07:        ; offs=00009DH, size=0011H, epilog, nogc, emitadd

IN002a: 00009D lea      rsp, [rbp-38H]
IN002b: 0000A1 pop      rbx
IN002c: 0000A2 pop      rsi
IN002d: 0000A3 pop      rdi
IN002e: 0000A4 pop      r12
IN002f: 0000A6 pop      r13
IN0030: 0000A8 pop      r14
IN0031: 0000AA pop      r15
IN0032: 0000AC pop      rbp
IN0033: 0000AD ret
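
One plausible reading of the `xor r11, r11` at IN0009: the morph.cpp comment says r11 carries a conservative estimate of the stack size, and a call with no arguments, such as GetTickCount(), would have nothing to pass on the stack. The model below is a simplification under stated assumptions (four argument registers and 8-byte stack slots, as on Windows x64). It is not the JIT's implementation of `fgEstimateCallStackSize`, and `EstimateCallStackSize`, `kMaxRegArgs` and `kSlotBytes` are illustrative names.

```cpp
#include <cassert>
#include <cstddef>

// Assumed values for Windows x64: four argument registers and 8-byte stack slots.
constexpr size_t kMaxRegArgs = 4;
constexpr size_t kSlotBytes = 8;

// Simplified model, not the JIT's code: bytes of outgoing stack arguments for a call,
// counting only the arguments that do not fit in registers.
size_t EstimateCallStackSize(size_t numArgs)
{
    size_t stackArgs = (numArgs > kMaxRegArgs) ? numArgs - kMaxRegArgs : 0;
    return stackArgs * kSlotBytes;
}
```

Under this model a zero-argument call yields an estimate of 0, and loading the constant 0 into a register is typically encoded as an `xor` of the register with itself.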

@AaronRobinsonMSFT
Member Author

Finally...

@AndyAyersMS All of the functions below are ones that make sense to see improved.

jit-diff diff --pmi --base [ROOT]\bin_checked_a\Product\Windows_NT.x64.Checked --diff [ROOT]\Product\Windows_NT.x64.Checked --core_root [ROOT]\tests\Windows_NT.x64.Release\Tests\Core_Root --frameworks
Using --output [ROOT]\diffs
Beginning PMI Diffs for System.Private.CoreLib.dll, framework assemblies
\ Finished 129/129 Base 129/129 Diff [389.6 sec]
Completed PMI Diffs for System.Private.CoreLib.dll, framework assemblies in 389.58s
Diffs (if any) can be viewed by comparing:[ROOT]\diffs\dasmset_5\base [ROOT]\diffs\dasmset_5\diff
Analyzing diffs...
Found 1 files with textual diffs.
PMI Diffs for System.Private.CoreLib.dll, framework assemblies for  default jit
Summary:
(Lower is better)
Total bytes of diff: -1070 (-0.00% of base)
    diff is an improvement.
Top file improvements by size (bytes):
       -1070 : System.Private.CoreLib.dasm (-0.02% of base)
1 total files with size differences (1 improved, 0 regressed), 128 unchanged.
Top method improvements by size (bytes):
        -244 (-80.26% of base) : System.Private.CoreLib.dasm - Interop:GetCurrentProcessId():int (2 methods)
        -168 (-64.12% of base) : System.Private.CoreLib.dasm - Console:.cctor()
        -141 (-47.80% of base) : System.Private.CoreLib.dasm - EventPipeController:BuildTraceFileName():ref
        -117 (-69.23% of base) : System.Private.CoreLib.dasm - ILStubClass:IL_STUB_PInvoke(int):long
        -117 (-14.03% of base) : System.Private.CoreLib.dasm - ILStubClass:IL_STUB_PInvoke():int (5 methods)
Top method improvements by size (percentage):
        -244 (-80.26% of base) : System.Private.CoreLib.dasm - Interop:GetCurrentProcessId():int (2 methods)
        -117 (-69.23% of base) : System.Private.CoreLib.dasm - ILStubClass:IL_STUB_PInvoke(int):long
        -168 (-64.12% of base) : System.Private.CoreLib.dasm - Console:.cctor()
        -114 (-55.61% of base) : System.Private.CoreLib.dasm - ILStubClass:IL_STUB_PInvoke(int,byref):bool
        -141 (-47.80% of base) : System.Private.CoreLib.dasm - EventPipeController:BuildTraceFileName():ref
8 total methods with size differences (8 improved, 0 regressed), 202937 unchanged.
Completed analysis in 23.61s

@briansull briansull left a comment

src/jit/morph.cpp (outdated review thread, resolved)
@AndyAyersMS
Member

Thanks for double-checking the diffs.

@AaronRobinsonMSFT
Member Author

@BruceForstall Please review the updated clr-abi document.

Member

@BruceForstall BruceForstall left a comment

Thanks for updating the CLR ABI doc. I've written a few suggestions.

Documentation/botr/clr-abi.md (4 outdated review threads, resolved)
@AaronRobinsonMSFT
Member Author

@jkotas or @dotnet/jit-contrib any additional feedback here?

@jkotas
Member

jkotas commented Oct 14, 2019

Could you please prepare a PR, ready to be cherry-picked, to fix the CoreFX build breaks once this change hits CoreFX?

@AaronRobinsonMSFT
Member Author

@jkotas I don't fully recall how this breaks. Is it the issue where the type must be declared in https://github.com/dotnet/corefx/blob/master/src/System.Runtime.InteropServices/ref/System.Runtime.InteropServices.cs?

@jkotas
Member

jkotas commented Oct 14, 2019

Actually, the current version of the change is probably OK; no adjustments in CoreFX are necessary. I wanted to double-check it but ran into unrelated build breaks. It looks like there is a silent merge conflict with JIT changes made in the meantime. Could you please do:

git pull
git merge master
build release skiptests

You should see these build errors:

94>D:\coreclr\src\jit\morph.cpp(15455,44): error C2065: 'tree': undeclared identifier [D:\coreclr\bin\obj\Windows_NT.x64.Release\src\jit\protononjit\protononjit.vcxproj]
94>D:\coreclr\src\jit\morph.cpp(15455,72): error C2065: 'tree': undeclared identifier [D:\coreclr\bin\obj\Windows_NT.x64.Release\src\jit\protononjit\protononjit.vcxproj]
38>D:\coreclr\src\jit\morph.cpp(15455,44): error C2065: 'tree': undeclared identifier [D:\coreclr\bin\obj\Windows_NT.x64.Release\src\jit\standalone\clrjit.vcxproj]
52>D:\coreclr\src\jit\morph.cpp(15455,44): error C2065: 'tree': undeclared identifier [D:\coreclr\bin\obj\Windows_NT.x64.Release\src\jit\linuxnonjit\linuxnonjit.vcxproj]
38>D:\coreclr\src\jit\morph.cpp(15455,72): error C2065: 'tree': undeclared identifier [D:\coreclr\bin\obj\Windows_NT.x64.Release\src\jit\standalone\clrjit.vcxproj]
52>D:\coreclr\src\jit\morph.cpp(15455,72): error C2065: 'tree': undeclared identifier [D:\coreclr\bin\obj\Windows_NT.x64.Release\src\jit\linuxnonjit\linuxnonjit.vcxproj]

@AaronRobinsonMSFT
Member Author

@jkotas Done.

Member

@jkotas jkotas left a comment

The codegen team should sign-off on the JIT changes. The rest looks fine to me.

@AaronRobinsonMSFT
Member Author

@dotnet/jit-contrib Anyone have concerns with the current changes? The code is designed to be a functional no-op unless the associated attribute is added to a function. This assumption was confirmed with the jit-diff run.

@briansull briansull left a comment

I have reviewed the JIT changes
Looks Good
