JIT: make profile data available to inlinees #42277
Conversation
Update the jit to try and read profile data for inlinees, and if successful, scale it appropriately for the inline call site. This kicks in for crossgen BBOPT and TieredPGO Tier1. Update VM and Crossgen hosts to handle requests for inlinee profile counts. Crossgen2 does not seem to support profile data retrieval yet. Note crossgen experience may not be as good as one might expect, because crossgen BBINSTR loses counts for inlinees. But enabling this for crossgen even with this limitation is probably a win overall. Fix small issue in the jit where we were overly aggressive about merging the callee block's flags into the callsite block's flags.
cc @dotnet/jit-contrib @davidwrighton Crossgen diffs (ignore the assemblies with minor diffs, I think my baseline build is a bit off). Impact is on the Roslyn assemblies that have IBC data. Most of the code size diffs seem to be from changes to block layout; suspect our PGO based layout algorithm is reacting a bit too strongly here (and by way of contrast, the inliner is not reacting strongly enough).
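The scaling described in the PR summary can be sketched roughly as follows. This is an illustrative model, not the actual jit code; the function name and the plain list-of-counts representation are invented for the example. Each inlinee block count is multiplied by the ratio of the callsite's observed count to the inlinee's entry-block count.

```python
# Illustrative sketch of scaling inlinee profile counts to an inline
# call site -- not the actual jit implementation.
def scale_inlinee_profile(inlinee_counts, callsite_count):
    """Scale raw inlinee block counts so the entry block matches the
    number of times this particular call site was observed to run."""
    entry_count = inlinee_counts[0]  # count of the inlinee's entry block
    if entry_count == 0:
        # No profile signal for this inlinee; treat all blocks as cold.
        return [0.0] * len(inlinee_counts)
    scale = callsite_count / entry_count
    return [c * scale for c in inlinee_counts]

# The inlinee ran 100 times overall, but this call site only ran 50
# times, so every block count is halved for the inlined copy.
print(scale_inlinee_profile([100, 80, 20], 50))  # [50.0, 40.0, 10.0]
```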
@AndyAyersMS is there a doc on how …? upd: wow, TIL
@EgorBo yes, that's right -- just set …. Still lots to do in this space.
Still quite impressive! Question: do we emit PGO counters to R2R images for BCL libs? So when I promote, let's say, String.IsNullOrEmpty to tier1 (from the R2R'd version), can I get weights for the cases when the input is empty and when it's not?
Approved, but see my comment about possible race conditions with profiler rejit.
```cpp
{
    hr = E_NOTIMPL;
    COR_ILMETHOD_DECODER decoder(pMD->GetILHeader());
    codeSize = decoder.GetCodeSize();
```
It feels like there is some opportunity for a race condition here while interacting with Profiler Rejit, but otherwise this looks good. I'm going to mark this as approved, but we may need a conversation with @davmason to make sure this is safe.
Agree rejitting could cause troubles. I wonder if this is a broader issue -- we tend to query properties of inlinees over time, and not all in one go.
I actually have the IL size on the jit side so could pass the information down, if needed.
I suspect the broader issue is a thing, but given the rarity of rejit we would never find the issue in our testing.
Inlining has always ignored profiler modified IL, whether it's from SetILFunctionBody or via rejit. When grabbing the IL for inlining the jit grabs it directly from the MethodDesc and doesn't query the ReJITManager or call Module::GetDynamicIL.
It's always been an unexpected gotcha for profilers that if you rewrite method B and method A inlines B, even if the inlining occurs after the IL rewrite, it will inline the original IL. I never knew the reason for this; perhaps it's intentional because of the way the jit requests inlinee data. Starting in 3.0, inlining is explicitly blocked for any rejitted methods: I added logic to CEEInfo::canInline that returns INLINE_FAIL if the method's active IL has been modified with rejit.
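The guard just described amounts to a simple check. Here is an illustrative model in Python -- not the runtime's C++ code, and the function and field names are invented for the example:

```python
# Illustrative model of the inlining guard described above -- not the
# runtime's actual CEEInfo::canInline implementation.
def can_inline(callee):
    # Since 3.0, the runtime refuses to inline a method whose active IL
    # has been replaced via rejit, because the inliner reads the original
    # IL straight from the MethodDesc (bypassing the ReJITManager).
    if callee.get("active_il_modified_by_rejit", False):
        return "INLINE_FAIL"
    return "INLINE_PASS"

print(can_inline({"active_il_modified_by_rejit": True}))  # INLINE_FAIL
print(can_inline({}))                                     # INLINE_PASS
```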
Ok, thanks.
It might be worth tracing out the fuller picture and rationale somewhere, if it's not written down.
@EgorBo no we don't put counters in R2R, normally. If you crossgen with the right options it will enable BBINSTR mode in the jit and allow collection of counts when the code is run; these counts can be read back by the jit during a subsequent crossgen that enables BBOPT (this is the "IBC" mode you may have seen referenced in places -- Instrumented Block Counts). The Roslyn assemblies noted above are set up this way, as are (or should be) some/many of our officially built assemblies. R2R-collected counts are not (yet) available to the jit; we've had a case or two where, because of this, Tier1 code is slower than R2R code. Likewise, methods that bypass Tier0 won't get instrumented; by default this includes methods with loops and methods with prejitted (R2R) code.
To get TieredPGO profile data for as many Tier1 methods as possible, you thus need:
- COMPlus_ReadyToRun=0 (so prejitted methods go through Tier0)
- COMPlus_TC_QuickJitForLoops=1 (so methods with loops go through Tier0)
- COMPlus_TieredPGO=1
Also note the exact time at which a Tier1 rejit happens is unpredictable. The Tier0 method may well be running concurrently, so there is some risk of reading inconsistent counts, and some run-to-run variation in the counts Tier1 sees.
I tried to test a few random micro benchmarks from dotnet/performance with TieredPGO:

```csharp
[Config(typeof(MyConfig))]
public class PgoBench
{
    private class MyConfig : ManualConfig
    {
        public MyConfig()
        {
            AddJob(Job.Default.WithRuntime(CoreRuntime.Core50)
                .WithId("Default params"));
            AddJob(Job.Default.WithRuntime(CoreRuntime.Core50)
                .WithEnvironmentVariables(
                    new EnvironmentVariable("COMPlus_TieredCompilation", "0"))
                .WithId("No TC"));
            AddJob(Job.Default.WithRuntime(CoreRuntime.Core50)
                .WithEnvironmentVariables(
                    new EnvironmentVariable("COMPlus_ReadyToRun", "0"),
                    new EnvironmentVariable("COMPlus_TC_QuickJitForLoops", "1"),
                    new EnvironmentVariable("COMPlus_TieredPGO", "0"))
                .WithId("NoR2R, QJ4L"));
            AddJob(Job.Default.WithRuntime(CoreRuntime.Core50)
                .WithEnvironmentVariables(
                    new EnvironmentVariable("COMPlus_ReadyToRun", "0"),
                    new EnvironmentVariable("COMPlus_TC_QuickJitForLoops", "1"),
                    new EnvironmentVariable("COMPlus_TieredPGO", "1"))
                .WithId("NoR2R, QJ4L, TieredPGO"));
        }
    }

    [Benchmark]
    [Arguments(true)]
    public string BoolToString(bool value) => value.ToString();

    [Benchmark]
    [Arguments("4242.5555")]
    public float StringToFloat(string value) => float.Parse(value, CultureInfo.InvariantCulture);

    static readonly char[] s_colonAndSemicolon = { ':', ';' };

    [Benchmark]
    public int StringIndexOfAny() =>
        "All the world's a stage, and all the men and women merely players: they have their exits and their entrances; and one man in his time plays many parts, his acts being seven ages."
            .IndexOfAny(s_colonAndSemicolon);
}
```

Results:
👍 as expected. (.NET 5 Preview 8) PS: Is OSR already there in the jit, so …?
Nice to see. I've tried running some of the bigger benchmarks and don't see consistent improvement (yet)
No, OSR is not enabled by default (and there are a few bugs to fix). OSR methods jit at Tier1 so should pick up pgo data too. Enable via ….
Also note if you try running bigger things that the "profile data slab" the runtime creates is fixed size and may run out of space, meaning some Tier0 code doesn't get profiled. Current size is 512K bytes, which is 64K profile entries. A Tier0 method uses 2 entry slots for header data, and N slots for counters (where N is the number of basic blocks). So roughly speaking it will fill up at around 8K methods or so. There is no notification/detection if it fills up. The slab design was intended to make import and export easy; it's not ideal for in-process pgo. I have various plans to make the in-process PGO more space efficient, like ….
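As a back-of-envelope check on those slab numbers: the 8-byte entry size is derived from "512K bytes = 64K entries" above, while the average basic-block count per method is an assumption made for the estimate.

```python
# Sanity check of the profile slab capacity estimate from the discussion.
SLAB_BYTES = 512 * 1024   # fixed slab size
ENTRY_BYTES = 8           # derived: 512K bytes / 64K entries
HEADER_SLOTS = 2          # per-method header entries
AVG_BLOCKS = 6            # assumed average Tier0 basic-block count

entries = SLAB_BYTES // ENTRY_BYTES               # 65536 entries total
methods = entries // (HEADER_SLOTS + AVG_BLOCKS)  # methods until full
print(entries, methods)  # 65536 8192 -- "around 8K methods or so"
```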
Initial focus will be ensuring the jit uses the profile data to its best advantage. There is a lot of work to do there too. I am working on a more detailed plan and will try and share it out soon.
@AndyAyersMS thanks for the detailed explanation! A small issue bothers me: it's enough to call a method 30 times to promote it to tier1 and bake BB weights in there forever. So when we inline it in a different context (callsite), those weights can be irrelevant.
Does the jit need the ability to deoptimize methods to solve this problem? Or maybe weights should be reset for new callsites?
Good question. Profile based optimizations rely on the past being a good predictor of the future. So any sort of profiling scheme is vulnerable to the profile not accurately representing future behavior. We are using Tier0 to collect counts because it was fairly easy to set up, and provided an easy way to get profile data into the jit, so we could start improving the ways the jit uses profile data. But as you note, that means we only get to observe the first few calls for some methods. It remains to be seen how well (or poorly) this predicts future behavior in general. I think we will get around to deopt eventually, as we start making stronger bets based on profile data and need the ability to reconsider and reoptimize if we have bet wrong. I don't know if we'd deopt because of poor block layout, though -- I was imagining it would be more for things like guarded (or unguarded) speculative devirtualization.
Could also sample methods? e.g. periodically switch to an optimised version but with the addition of counters; then switch back. If the counts differ interestingly, could then reoptimize the method with the new weights?
@AndyAyersMS note 👆 that's 5.0 RC1 PGO; not this PR
@benaadams interesting -- I don't expect big improvements just yet. Was this just with …? It's possible ASP scenarios overflow the 512K slab, depending on how much ends up getting jitted at Tier0 (meaning some methods don't get pgo data).
Used the triple of environment variables above (COMPlus_ReadyToRun=0, COMPlus_TC_QuickJitForLoops=1, COMPlus_TieredPGO=1).
I'd love to see TieredPGO + Guarded Devirtualization combo 🙂

```csharp
void Foo(IDisposable d)
{
    d.Dispose();
}
```

TieredPGO should somehow detect that in most cases d is, let's say, a Foo with an empty Dispose impl:

```csharp
void Foo(IDisposable d)
{
    if (d is Foo) // guarded devirt.
        return;
    else
        d.Dispose();
}
```
Yes, we are headed in that direction... either we'll get "currently monomorphic" info from the runtime, or else we'll inform ourselves via profiling. For the latter we might be able to peek at the VSD cell state (for interface calls) or else profile the types that reach the call with custom instrumentation.
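The "profile the types that reach the call" idea could look something like the sketch below. This is purely illustrative Python, not runtime code; the helper name and the dominance threshold are invented for the example. The jit would record the concrete type observed at a virtual/interface call site and, if one type clearly dominates, emit a guarded direct call for it.

```python
# Hypothetical sketch of call-site type profiling for guarded
# devirtualization -- not an actual runtime mechanism.
from collections import Counter

def dominant_type(observed_types, threshold=0.5):
    """Return the concrete type that dominates the observations at a
    call site, or None if no type is frequent enough to justify a
    guarded direct call."""
    if not observed_types:
        return None
    ty, n = Counter(observed_types).most_common(1)[0]
    return ty if n / len(observed_types) > threshold else None

# Three of four observed receivers were Foo, so guard on Foo.
print(dominant_type(["Foo", "Foo", "Foo", "Bar"]))  # Foo
```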
Looks like in TE some of the C++ implementations run a PGO loopback warmup as part of the compile step and then recompile with the PGO data: https://github.com/TechEmpower/FrameworkBenchmarks/blob/11bcc746b3444ad393eaaf350060367611ddc8fa/frameworks/C%2B%2B/lithium/lithium.cc#L44-L56 For us, TE does run a warmup test before starting the measurements, which works for tiered compilation; would that also work for runtime PGO?
We'll eventually be building a robust solution for feeding profile data from one run to the next, so some sort of two-phase process might be doable. Though it would be nicer to just get what we need from Tier0.