JIT: make profile data available to inlinees #42277

Merged
AndyAyersMS merged 2 commits into dotnet:master from the PgoDataForInlining branch on Sep 16, 2020

Conversation

AndyAyersMS
Member

Update the jit to try and read profile data for inlinees, and if successful,
scale it appropriately for the inline call site. This kicks in for crossgen
BBOPT and TieredPGO Tier1.

Update VM and Crossgen hosts to handle requests for inlinee profile counts.
Crossgen2 does not seem to support profile data retrieval yet.

Note crossgen experience may not be as good as one might expect, because
crossgen BBINSTR loses counts for inlinees. But enabling this for crossgen
even with this limitation is probably a win overall.

Fix small issue in the jit where we were overly aggressive about merging the
callee block's flags into the callsite block's flags.
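
As a rough illustration of the scaling step described above: the inlinee's raw counts were gathered over all of its calls, so they need to be rescaled to the weight of the specific call site. This is only a sketch under assumptions, not the jit's actual code; `ScaleInlineeCounts` and its parameters are hypothetical names.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical helper (not the jit's real API): rescale an inlinee's raw
// block counts so they are relative to the weight of one inline call site.
std::vector<double> ScaleInlineeCounts(const std::vector<std::uint64_t>& calleeBlockCounts,
                                       std::uint64_t                     calleeEntryCount,
                                       double                            callSiteWeight)
{
    assert(callSiteWeight >= 0.0);

    // Assumption: a callee that was never entered gets a scale of zero,
    // so all of its blocks end up with zero weight.
    const double scale =
        (calleeEntryCount == 0) ? 0.0 : callSiteWeight / static_cast<double>(calleeEntryCount);

    std::vector<double> scaled;
    scaled.reserve(calleeBlockCounts.size());
    for (std::uint64_t count : calleeBlockCounts)
    {
        scaled.push_back(static_cast<double>(count) * scale);
    }
    return scaled;
}
```

For instance, a callee entered 100 times with block counts {100, 60, 40}, inlined at a call site of weight 50, would end up with block weights 50, 30 and 20.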

@Dotnet-GitSync-Bot Dotnet-GitSync-Bot added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Sep 15, 2020
@AndyAyersMS
Member Author

cc @dotnet/jit-contrib @davidwrighton

Crossgen diffs (ignore the assemblies with minor diffs, I think my baseline build is a bit off). Impact is on the Roslyn assemblies that have IBC data.

Most of the code size diffs seem to be from changes to block layout; suspect our PGO based layout algorithm is reacting a bit too strongly here (and by way of contrast, the inliner is not reacting strongly enough).

Crossgen CodeSize Diffs for System.Private.CoreLib.dll, framework assemblies for x64 default jit
Summary of Code Size diffs:
(Lower is better)
Total bytes of base: 33470799
Total bytes of diff: 33494356
Total bytes of delta: 23557 (0.07% of base)
    diff is a regression.
Top file regressions (bytes):
       13619 : Microsoft.CodeAnalysis.CSharp.dasm (0.66% of base)
        8759 : Microsoft.CodeAnalysis.VisualBasic.dasm (0.39% of base)
        2484 : Microsoft.CodeAnalysis.dasm (0.33% of base)
           2 : System.Runtime.Serialization.Formatters.dasm (0.00% of base)
           2 : System.Text.RegularExpressions.dasm (0.00% of base)
           1 : System.Drawing.Common.dasm (0.00% of base)
Top file improvements (bytes):
       -1134 : System.Private.CoreLib.dasm (-0.03% of base)
         -76 : System.Text.Json.dasm (-0.02% of base)
         -51 : System.Private.Xml.dasm (-0.00% of base)
         -24 : System.CodeDom.dasm (-0.01% of base)
         -13 : Microsoft.Diagnostics.FastSerialization.dasm (-0.04% of base)
         -12 : Microsoft.CSharp.dasm (-0.00% of base)
12 total files with Code Size differences (6 improved, 6 regressed), 255 unchanged.
Top method regressions (bytes):
         344 ( 8.79% of base) : Microsoft.CodeAnalysis.dasm - MetadataSizes:.ctor(ImmutableArray`1,ImmutableArray`1,int,int,int,int,bool,bool,bool):this
         192 ( 4.77% of base) : Microsoft.CodeAnalysis.CSharp.dasm - LocalRewriter:RewriteEnumeratorForEachStatement(BoundForEachStatement):BoundStatement:this
         173 ( 7.85% of base) : Microsoft.CodeAnalysis.CSharp.dasm - LanguageParser:ParseModifiers(SyntaxListBuilder):this
         170 ( 3.36% of base) : Microsoft.CodeAnalysis.CSharp.dasm - LanguageParser:ParseMemberDeclarationOrStatement(ushort,String):MemberDeclarationSyntax:this
         166 ( 3.46% of base) : Microsoft.CodeAnalysis.VisualBasic.dasm - SynthesizedEventAccessorSymbol:ConstructFieldLikeEventAccessorBody_Regular(SourceEventSymbol,bool,VisualBasicCompilation,DiagnosticBag):BoundBlock
         165 (23.17% of base) : Microsoft.CodeAnalysis.dasm - PeWriter:CreateSectionHeaders(MetadataSizes,int):List`1:this
         162 ( 3.64% of base) : Microsoft.CodeAnalysis.CSharp.dasm - MethodBodySynthesizer:ConstructFieldLikeEventAccessorBody_Regular(SourceEventSymbol,bool,CSharpCompilation,DiagnosticBag):BoundBlock
         152 ( 6.33% of base) : Microsoft.CodeAnalysis.CSharp.dasm - DocumentationCommentParser:ParseXmlElement():XmlNodeSyntax:this
         147 (67.12% of base) : Microsoft.CodeAnalysis.CSharp.dasm - DataFlowPass:VisitLocalDeclaration(BoundLocalDeclaration):BoundNode:this
         144 ( 5.11% of base) : Microsoft.CodeAnalysis.VisualBasic.dasm - AnonymousTypeGetHashCodeMethodSymbol:GetBoundMethodBody(DiagnosticBag,byref):BoundBlock:this
         141 ( 7.16% of base) : Microsoft.CodeAnalysis.VisualBasic.dasm - LocalRewriter:RewriteIfStatement(VisualBasicSyntaxNode,VisualBasicSyntaxNode,BoundExpression,BoundStatement,BoundStatement,bool,ImmutableArray`1):BoundStatement:this
         135 ( 2.41% of base) : Microsoft.CodeAnalysis.VisualBasic.dasm - AbstractFlowPass`1:VisitTryStatement(BoundTryStatement):BoundNode:this (3 methods)
         127 ( 3.96% of base) : Microsoft.CodeAnalysis.CSharp.dasm - Binder:BindNullCoalescingOperator(BinaryExpressionSyntax,DiagnosticBag):BoundExpression:this
         121 (20.40% of base) : Microsoft.CodeAnalysis.CSharp.dasm - Binder:BindNamedAttributeArgumentName(AttributeArgumentSyntax,NamedTypeSymbol,DiagnosticBag,byref,byref):Symbol:this
         108 (13.00% of base) : Microsoft.CodeAnalysis.CSharp.dasm - Binder:BindIndexerAccess(ExpressionSyntax,BoundExpression,AnalyzedArguments,DiagnosticBag):BoundExpression:this
         108 ( 4.54% of base) : Microsoft.CodeAnalysis.VisualBasic.dasm - Parser:ParseFromControlVars():SeparatedSyntaxList`1:this
         107 ( 2.71% of base) : Microsoft.CodeAnalysis.CSharp.dasm - MethodBodySynthesizer:MakeSubmissionInitialization(ArrayBuilder`1,CSharpSyntaxNode,MethodSymbol,SynthesizedSubmissionFields,CSharpCompilation)
         104 (17.51% of base) : Microsoft.CodeAnalysis.CSharp.dasm - BoundIsOperator:Update(BoundExpression,BoundTypeExpression,Conversion,TypeSymbol):BoundIsOperator:this
         102 ( 5.23% of base) : Microsoft.CodeAnalysis.VisualBasic.dasm - LocalRewriter:VisitIfStatement(BoundIfStatement):BoundNode:this
         101 ( 2.70% of base) : Microsoft.CodeAnalysis.VisualBasic.dasm - SynthesizedStringSwitchHashMethod:GetBoundMethodBody(DiagnosticBag,byref):BoundBlock:this
Top method improvements (bytes):
        -218 (-18.81% of base) : System.Private.CoreLib.dasm - OrdinalCasing:CompareStringIgnoreCase(byref,int,byref,int):int
        -212 (-17.19% of base) : System.Private.CoreLib.dasm - OrdinalCasing:IndexOf(ReadOnlySpan`1,ReadOnlySpan`1):int
        -200 (-16.54% of base) : System.Private.CoreLib.dasm - OrdinalCasing:LastIndexOf(ReadOnlySpan`1,ReadOnlySpan`1):int
        -198 (-17.11% of base) : Microsoft.CodeAnalysis.CSharp.dasm - PreciseAbstractFlowPass`1:VisitDelegateCreationExpression(BoundDelegateCreationExpression):BoundNode:this (2 methods)
        -182 (-28.13% of base) : System.Private.CoreLib.dasm - OrdinalCasing:EqualSurrogate(ushort,ushort,ushort,ushort):bool
        -170 (-5.58% of base) : Microsoft.CodeAnalysis.CSharp.dasm - Binder:BindConstructorInitializer(ArgumentListSyntax,MethodSymbol,DiagnosticBag):BoundExpression:this
        -152 (-10.83% of base) : Microsoft.CodeAnalysis.CSharp.dasm - Binder:ResolveDefaultMethodGroup(BoundMethodGroup,AnalyzedArguments,bool,byref,bool,bool):MethodGroupResolution:this
        -147 (-21.12% of base) : Microsoft.CodeAnalysis.CSharp.dasm - PreciseAbstractFlowPass`1:VisitEventAssignmentOperator(BoundEventAssignmentOperator):BoundNode:this (2 methods)
        -145 (-2.82% of base) : Microsoft.CodeAnalysis.VisualBasic.dasm - Binder:BindBinaryOperator(VisualBasicSyntaxNode,BoundExpression,BoundExpression,ushort,int,bool,DiagnosticBag,bool,byref):BoundExpression:this
        -143 (-17.61% of base) : Microsoft.CodeAnalysis.CSharp.dasm - PreciseAbstractFlowPass`1:VisitUsingStatement(BoundUsingStatement):BoundNode:this (3 methods)
        -135 (-10.00% of base) : Microsoft.CodeAnalysis.CSharp.dasm - PreciseAbstractFlowPass`1:AdjustConditionalState(BoundExpression):this (3 methods)
        -117 (-17.33% of base) : Microsoft.CodeAnalysis.CSharp.dasm - PreciseAbstractFlowPass`1:VisitLockStatement(BoundLockStatement):BoundNode:this (3 methods)
        -111 (-10.30% of base) : Microsoft.CodeAnalysis.CSharp.dasm - PreciseAbstractFlowPass`1:VisitNullCoalescingOperator(BoundNullCoalescingOperator):BoundNode:this (2 methods)
        -110 (-2.43% of base) : Microsoft.CodeAnalysis.VisualBasic.dasm - ExpressionEvaluator:PerformCompileTimeBinaryOperation(ushort,byte,CConst,CConst,ExpressionSyntax):CConst
         -92 (-3.97% of base) : Microsoft.CodeAnalysis.CSharp.dasm - Binder:LookupMembersInSubmissions(LookupResult,TypeSymbol,String,int,ConsList`1,int,Binder,bool,byref):this
         -91 (-10.29% of base) : Microsoft.CodeAnalysis.CSharp.dasm - PreciseAbstractFlowPass`1:VisitSequence(BoundSequence):BoundNode:this (2 methods)
         -91 (-13.52% of base) : System.Private.CoreLib.dasm - OrdinalCasing:ToUpperOrdinal(ReadOnlySpan`1,Span`1)
         -86 (-23.31% of base) : Microsoft.CodeAnalysis.CSharp.dasm - PreciseAbstractFlowPass`1:VisitQueryClause(BoundQueryClause):BoundNode:this (2 methods)
         -86 (-26.79% of base) : Microsoft.CodeAnalysis.CSharp.dasm - PreciseAbstractFlowPass`1:VisitArrayLength(BoundArrayLength):BoundNode:this (2 methods)
         -75 (-4.20% of base) : Microsoft.CodeAnalysis.CSharp.dasm - PreciseAbstractFlowPass`1:VisitCompoundAssignmentOperator(BoundCompoundAssignmentOperator):BoundNode:this (2 methods)
Top method regressions (percentages):
         147 (67.12% of base) : Microsoft.CodeAnalysis.CSharp.dasm - DataFlowPass:VisitLocalDeclaration(BoundLocalDeclaration):BoundNode:this
          77 (37.38% of base) : Microsoft.CodeAnalysis.CSharp.dasm - ContextAwareSyntax:LabeledStatement(SyntaxToken,SyntaxToken,StatementSyntax):LabeledStatementSyntax:this
          75 (36.41% of base) : Microsoft.CodeAnalysis.CSharp.dasm - ContextAwareSyntax:ConstructorConstraint(SyntaxToken,SyntaxToken,SyntaxToken):ConstructorConstraintSyntax:this
          75 (36.41% of base) : Microsoft.CodeAnalysis.CSharp.dasm - ContextAwareSyntax:QueryContinuation(SyntaxToken,SyntaxToken,QueryBodySyntax):QueryContinuationSyntax:this
          33 (28.95% of base) : Microsoft.CodeAnalysis.CSharp.dasm - Binder:ThisReference(CSharpSyntaxNode,NamedTypeSymbol,bool,bool):BoundThisReference:this
          50 (24.63% of base) : Microsoft.CodeAnalysis.VisualBasic.dasm - BoundSimpleCaseClause:Update(BoundExpression,BoundExpression):BoundSimpleCaseClause:this
          50 (24.63% of base) : Microsoft.CodeAnalysis.VisualBasic.dasm - BoundCaseBlock:Update(BoundCaseStatement,BoundBlock):BoundCaseBlock:this
          37 (24.34% of base) : Microsoft.CodeAnalysis.VisualBasic.dasm - BoundSequencePoint:Update(BoundStatement):BoundSequencePoint:this
          44 (24.04% of base) : Microsoft.CodeAnalysis.VisualBasic.dasm - BoundParenthesized:Update(BoundExpression,TypeSymbol):BoundParenthesized:this
          44 (24.04% of base) : Microsoft.CodeAnalysis.VisualBasic.dasm - BoundGotoStatement:Update(LabelSymbol,BoundLabel):BoundGotoStatement:this
          44 (23.91% of base) : Microsoft.CodeAnalysis.VisualBasic.dasm - BoundSequencePointExpression:Update(BoundExpression,TypeSymbol):BoundSequencePointExpression:this
          44 (23.66% of base) : Microsoft.CodeAnalysis.CSharp.dasm - ContextAwareSyntax:QueryExpression(FromClauseSyntax,QueryBodySyntax):QueryExpressionSyntax:this
          58 (23.58% of base) : Microsoft.CodeAnalysis.CSharp.dasm - ContextAwareSyntax:CheckedStatement(ushort,SyntaxToken,BlockSyntax):CheckedStatementSyntax:this
         165 (23.17% of base) : Microsoft.CodeAnalysis.dasm - PeWriter:CreateSectionHeaders(MetadataSizes,int):List`1:this
          42 (22.83% of base) : Microsoft.CodeAnalysis.CSharp.dasm - ContextAwareSyntax:WhereClause(SyntaxToken,ExpressionSyntax):WhereClauseSyntax:this
          65 (22.57% of base) : Microsoft.CodeAnalysis.CSharp.dasm - DocumentationCommentParser:ParseXmlAttributeEndQuote(ushort):SyntaxToken:this
          48 (21.92% of base) : Microsoft.CodeAnalysis.CSharp.dasm - ContextAwareSyntax:GlobalStatement(StatementSyntax):GlobalStatementSyntax:this
          71 (21.85% of base) : Microsoft.CodeAnalysis.CSharp.dasm - ContextAwareSyntax:BracketedParameterList(SyntaxToken,SeparatedSyntaxList`1,SyntaxToken):BracketedParameterListSyntax:this
          39 (21.79% of base) : Microsoft.CodeAnalysis.dasm - BlobWriter:WriteBytes(ubyte,int):this
         121 (20.40% of base) : Microsoft.CodeAnalysis.CSharp.dasm - Binder:BindNamedAttributeArgumentName(AttributeArgumentSyntax,NamedTypeSymbol,DiagnosticBag,byref,byref):Symbol:this
Top method improvements (percentages):
         -21 (-77.78% of base) : System.Private.CoreLib.dasm - ConstantHelper:GetInt16WithAllBitsSet():short
         -20 (-76.92% of base) : System.Private.CoreLib.dasm - ConstantHelper:GetUInt16WithAllBitsSet():ushort
         -19 (-76.00% of base) : System.Private.CoreLib.dasm - ConstantHelper:GetSByteWithAllBitsSet():byte
         -18 (-75.00% of base) : System.Private.CoreLib.dasm - ConstantHelper:GetByteWithAllBitsSet():ubyte
         -20 (-36.36% of base) : System.Private.CoreLib.dasm - BinaryPrimitives:TryReadInt16BigEndian(ReadOnlySpan`1,byref):bool
         -19 (-35.85% of base) : System.Private.CoreLib.dasm - BinaryPrimitives:TryReadUInt16BigEndian(ReadOnlySpan`1,byref):bool
        -182 (-28.13% of base) : System.Private.CoreLib.dasm - OrdinalCasing:EqualSurrogate(ushort,ushort,ushort,ushort):bool
         -86 (-26.79% of base) : Microsoft.CodeAnalysis.CSharp.dasm - PreciseAbstractFlowPass`1:VisitArrayLength(BoundArrayLength):BoundNode:this (2 methods)
         -86 (-23.31% of base) : Microsoft.CodeAnalysis.CSharp.dasm - PreciseAbstractFlowPass`1:VisitQueryClause(BoundQueryClause):BoundNode:this (2 methods)
        -147 (-21.12% of base) : Microsoft.CodeAnalysis.CSharp.dasm - PreciseAbstractFlowPass`1:VisitEventAssignmentOperator(BoundEventAssignmentOperator):BoundNode:this (2 methods)
         -13 (-19.70% of base) : System.Private.CoreLib.dasm - BinaryPrimitives:TryReadHalfBigEndian(ReadOnlySpan`1,byref):bool
         -24 (-19.51% of base) : System.Private.CoreLib.dasm - Rune:ToString():String:this
        -218 (-18.81% of base) : System.Private.CoreLib.dasm - OrdinalCasing:CompareStringIgnoreCase(byref,int,byref,int):int
        -143 (-17.61% of base) : Microsoft.CodeAnalysis.CSharp.dasm - PreciseAbstractFlowPass`1:VisitUsingStatement(BoundUsingStatement):BoundNode:this (3 methods)
        -117 (-17.33% of base) : Microsoft.CodeAnalysis.CSharp.dasm - PreciseAbstractFlowPass`1:VisitLockStatement(BoundLockStatement):BoundNode:this (3 methods)
        -212 (-17.19% of base) : System.Private.CoreLib.dasm - OrdinalCasing:IndexOf(ReadOnlySpan`1,ReadOnlySpan`1):int
         -56 (-17.13% of base) : Microsoft.CodeAnalysis.CSharp.dasm - Binder:BindThis(ThisExpressionSyntax,DiagnosticBag):BoundThisReference:this
        -198 (-17.11% of base) : Microsoft.CodeAnalysis.CSharp.dasm - PreciseAbstractFlowPass`1:VisitDelegateCreationExpression(BoundDelegateCreationExpression):BoundNode:this (2 methods)
        -200 (-16.54% of base) : System.Private.CoreLib.dasm - OrdinalCasing:LastIndexOf(ReadOnlySpan`1,ReadOnlySpan`1):int
         -22 (-14.67% of base) : Microsoft.CodeAnalysis.CSharp.dasm - MethodSymbolExtensions:IsTaskReturningAsync(MethodSymbol,CSharpCompilation):bool
1779 total methods with Code Size differences (262 improved, 1517 regressed), 195471 unchanged.

@EgorBo
Member

EgorBo commented Sep 15, 2020

@AndyAyersMS is there a doc on how PgoManager works? How to collect some simple PGO data, how to re-use it?
I assume only crossgen is able to collect it?

upd: wow, TIL COMPlus_TieredPGO - works for tier0, nice!


@AndyAyersMS
Member Author

@EgorBo yes, that's right -- just set COMPlus_TieredPGO=1 and tier0 will collect counts for tier1.

Still lots to do in this space.

@EgorBo
Member

EgorBo commented Sep 15, 2020

@EgorBo yes, that's right -- just set COMPlus_TieredPGO=1 and tier0 will collect counts for tier1.

Still lots to do in this space.

Still quite impressive! Question: do we emit PGO counters to R2R images for BCL libs? so when I promote let's say String.IsNullOrEmpty to tier1 (from R2R'd version) I can get weights for cases when input is empty and not

Member

@davidwrighton davidwrighton left a comment

Approved, but see my comment about possible race conditions with profiler rejit.

{
hr = E_NOTIMPL;
COR_ILMETHOD_DECODER decoder(pMD->GetILHeader());
codeSize = decoder.GetCodeSize();
Member

It feels like there is some opportunity for a race condition here while interacting with Profiler Rejit, but otherwise this looks good. I'm going to mark this as approved, but we may need a conversation with @davmason to make sure this is safe.

Member Author

Agree rejitting could cause troubles. I wonder if this is a broader issue -- we tend to query properties of inlinees over time, and not all in one go.

I actually have the IL size on the jit side so could pass the information down, if needed.

Member

I suspect the broader issue is a thing, but given the rarity of rejit we would never find the issue in our testing.

Member

Inlining has always ignored profiler modified IL, whether it's from SetILFunctionBody or via rejit. When grabbing the IL for inlining the jit grabs it directly from the MethodDesc and doesn't query the ReJITManager or call Module::GetDynamicIL.

It's always been an unexpected gotcha for profilers that if you rewrite method B and method A inlines B, even if the inlining occurs after the IL rewrite it will inline the original IL. I never knew the reason for this, perhaps it's intentional because of the way the jit requests inlinee data. Starting in 3.0 inlining is explicitly blocked for any rejit methods. I added logic to CEEInfo::canInline that will return INLINE_FAIL if the method's active IL has been modified with rejit.

Member Author

Ok, thanks.

It might be worth tracing out the fuller picture and rationale somewhere, if it's not written down.

@AndyAyersMS
Member Author

@EgorBo no we don't put counters in R2R, normally.

If you crossgen with the right options it will enable BBINSTR mode in the jit and allow collection of counts when the code is run; these counts can be read back by the jit during a subsequent crossgen that enables BBOPT (this is the "IBC" mode you may have seen referenced in places -- Instrumented Block Counts). The Roslyn assemblies noted above are set up this way, as are (or should be) some/many of our officially built assemblies.

R2R collected counts are not (yet) available to the jit; we've had a case or two where because of this Tier1 code is slower than R2R code. Likewise, methods that bypass Tier0 won't get instrumented; by default these are:

  • Methods with loops
  • Methods with AggressiveOptimization
  • Methods that bypass tiering altogether (dynamic methods, probably more)

To get TieredPGO profile data for as many Tier1 methods as possible, you thus need:

  • COMPlus_ReadyToRun=0
  • COMPlus_TC_QuickJitForLoops=1
  • COMPlus_TieredPGO=1

Also note the exact time at which a Tier1 rejit happens is unpredictable. The Tier0 method may well be running concurrently, so there is some risk of reading inconsistent counts, and some run to run variation in the counts Tier1 sees.

@EgorBo
Member

EgorBo commented Sep 16, 2020

I tried to test a few random micro benchmarks from dotnet/performance with TieredPGO:

[Config(typeof(MyConfig))]
public class PgoBench
{
    private class MyConfig : ManualConfig
    {
        public MyConfig()
        {
            AddJob(Job.Default.WithRuntime(CoreRuntime.Core50)
                .WithId("Default params"));

            AddJob(Job.Default.WithRuntime(CoreRuntime.Core50)
                .WithEnvironmentVariables(
                    new EnvironmentVariable("COMPlus_TieredCompilation", "0"))
                .WithId("No TC"));

            AddJob(Job.Default.WithRuntime(CoreRuntime.Core50)
                .WithEnvironmentVariables(
                    new EnvironmentVariable("COMPlus_ReadyToRun", "0"),
                    new EnvironmentVariable("COMPlus_TC_QuickJitForLoops", "1"),
                    new EnvironmentVariable("COMPlus_TieredPGO", "0"))
                .WithId("NoR2R, QJ4L"));

            AddJob(Job.Default.WithRuntime(CoreRuntime.Core50)
                .WithEnvironmentVariables(
                    new EnvironmentVariable("COMPlus_ReadyToRun", "0"),
                    new EnvironmentVariable("COMPlus_TC_QuickJitForLoops", "1"),
                    new EnvironmentVariable("COMPlus_TieredPGO", "1"))
                .WithId("NoR2R, QJ4L, TieredPGO"));
        }
    }

    [Benchmark]
    [Arguments(true)]
    public string BoolToString(bool value) => value.ToString();

    [Benchmark]
    [Arguments("4242.5555")]
    public float StringToFloat(string value) => float.Parse(value, CultureInfo.InvariantCulture);

    
    static readonly char[] s_colonAndSemicolon = { ':', ';' };
    [Benchmark]
    public int StringIndexOfAny() =>
        "All the world's a stage, and all the men and women merely players: they have their exits and their entrances; and one man in his time plays many parts, his acts being seven ages."
            .IndexOfAny(s_colonAndSemicolon);
}

Results:

|           Method |                                                   EnvironmentVariables |       Mean |
|----------------- |----------------------------------------------------------------------- |-----------:|
| StringIndexOfAny |                                                     Default JIT params | 12.2355 ns |
| StringIndexOfAny |                                            COMPlus_TieredCompilation=0 |  9.5735 ns |
| StringIndexOfAny | COMPlus_ReadyToRun=0,COMPlus_TC_QuickJitForLoops=1,COMPlus_TieredPGO=0 |  8.8814 ns |
| StringIndexOfAny | COMPlus_ReadyToRun=0,COMPlus_TC_QuickJitForLoops=1,COMPlus_TieredPGO=1 |  8.8079 ns |

|    StringToFloat |                                                     Default JIT params | 57.7228 ns |
|    StringToFloat |                                            COMPlus_TieredCompilation=0 | 67.9649 ns |
|    StringToFloat | COMPlus_ReadyToRun=0,COMPlus_TC_QuickJitForLoops=1,COMPlus_TieredPGO=0 | 59.5685 ns |
|    StringToFloat | COMPlus_ReadyToRun=0,COMPlus_TC_QuickJitForLoops=1,COMPlus_TieredPGO=1 | 49.3033 ns |

|     BoolToString |                                                     Default JIT params |  0.8825 ns |
|     BoolToString |                                            COMPlus_TieredCompilation=0 |  1.2576 ns |
|     BoolToString | COMPlus_ReadyToRun=0,COMPlus_TC_QuickJitForLoops=1,COMPlus_TieredPGO=0 |  1.4813 ns |
|     BoolToString | COMPlus_ReadyToRun=0,COMPlus_TC_QuickJitForLoops=1,COMPlus_TieredPGO=1 |  1.0969 ns |

👍 as expected.

(.NET 5 Preview 8)
/cc @benaadams @adamsitnik

PS: Is OSR already there in the jit so COMPlus_TC_QuickJitForLoops can not lead to regressions in benchmarks?
PS2: the benchmark doesn't include this PR's changes

@AndyAyersMS
Member Author

Nice to see. I've tried running some of the bigger benchmarks and don't see consistent improvement (yet)

Is OSR already there

No, OSR is not enabled by default (and there are a few bugs to fix). OSR methods jit at Tier1 so should pick up pgo data too. Enable via

COMPlus_TC_OnStackReplacement=1

Also note if you try running bigger things that the "profile data slab" the runtime creates is fixed size and may run out of space, meaning some Tier0 code doesn't get profiled. Current size is 512K bytes, which is 64K profile entries. A Tier0 method uses 2 entry slots for header data, and N slots for counters (where N is number of basic blocks). So roughly speaking it will fill up at around 8K methods or so. There is no notification/detection if it fills up. The slab design was intended to make import and export easy. It's not ideal for in-process pgo.
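
The slab arithmetic above can be spelled out as a small sketch. The 512K byte size, the 64K entry count and the 2 header slots come from the comment; the 8 bytes per entry is derived from those two figures, and the average of 6 basic blocks per method is an assumption chosen because it reproduces the "around 8K methods" estimate. All names here are hypothetical.

```cpp
#include <cassert>
#include <cstddef>

// Figures quoted in the comment above.
constexpr std::size_t kSlabBytes   = 512 * 1024; // fixed slab size
constexpr std::size_t kSlabEntries = 64 * 1024;  // profile entries in the slab
constexpr std::size_t kHeaderSlots = 2;          // per-method header data

// Derived: 512K bytes / 64K entries = 8 bytes per entry.
constexpr std::size_t kBytesPerEntry = kSlabBytes / kSlabEntries;

// Slots one Tier0 method consumes: header slots plus one counter per basic block.
std::size_t SlotsForMethod(std::size_t basicBlocks)
{
    return kHeaderSlots + basicBlocks;
}

// Rough number of methods that fit before the slab is full, assuming every
// method has the same (average) number of basic blocks.
std::size_t MethodsUntilFull(std::size_t avgBasicBlocks)
{
    assert(SlotsForMethod(avgBasicBlocks) > 0);
    return kSlabEntries / SlotsForMethod(avgBasicBlocks);
}
```

With an assumed average of 6 basic blocks, each method takes 8 slots and `MethodsUntilFull(6)` gives 8192, in line with the estimate above.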

I have various plans to make the in-process PGO more space efficient, like

  • don't bother probing methods with one basic block
  • use spanning tree tricks to reduce probe count when there is flow (requires count reconstruction, which in turn requires we have pred lists built very early; also might benefit from simple profile synthesis at Tier0 so we can construct a quasi-maximum weight spanning tree)
  • rely on IL consistency and stop recording IL offsets in entries (implicit schema)
  • allow sharded slabs so space can expand as needed
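
To make the spanning tree idea a bit more concrete, here is a minimal sketch of count reconstruction on a flow graph. It assumes the graph has a virtual exit-to-entry edge so that flow is conserved at every node, that only the non-tree edges carry probes, and that the spanning tree was already chosen elsewhere. `Edge` and `ReconstructEdgeCounts` are hypothetical names, not jit code.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical edge record for illustration; not the jit's data structures.
struct Edge
{
    int          from;
    int          to;
    std::int64_t count; // meaningful only when known is true
    bool         known; // true for probed (non-tree) edges
};

// Fills in counts for unprobed edges using flow conservation: at every node,
// total incoming count equals total outgoing count. Returns true if every
// edge count could be determined.
bool ReconstructEdgeCounts(int nodeCount, std::vector<Edge>& edges)
{
    assert(nodeCount >= 0);

    bool progress = true;
    while (progress)
    {
        progress = false;
        for (int node = 0; node < nodeCount; node++)
        {
            std::int64_t in = 0;
            std::int64_t out = 0;
            int          unknownCount = 0;
            std::size_t  unknownIndex = 0;

            for (std::size_t i = 0; i < edges.size(); i++)
            {
                const Edge& e = edges[i];
                if (e.from != node && e.to != node) continue;
                if (!e.known) { unknownCount++; unknownIndex = i; continue; }
                if (e.to == node) in += e.count;
                if (e.from == node) out += e.count;
            }

            // A node with exactly one unknown incident edge can be solved.
            if (unknownCount != 1) continue;
            Edge& u = edges[unknownIndex];
            if (u.from == u.to) continue; // an unprobed self-loop cancels out; unsolvable here

            u.count  = (u.to == node) ? (out - in) : (in - out);
            u.known  = true;
            progress = true;
        }
    }

    for (const Edge& e : edges)
    {
        if (!e.known) return false;
    }
    return true;
}
```

On a diamond-shaped method (entry, then, else, exit) with 5 edges including the virtual one, only 2 probes are needed: if the entry-to-else edge counts 30 and the virtual exit-to-entry edge counts 100, the remaining three edges reconstruct to 70, 70 and 30.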

Initial focus will be ensuring the jit uses the profile data to its best advantage. There is a lot of work to do there too. I am working on a more detailed plan and will try and share it out soon.

@AndyAyersMS AndyAyersMS merged commit 54148e8 into dotnet:master Sep 16, 2020
@AndyAyersMS AndyAyersMS deleted the PgoDataForInlining branch September 16, 2020 18:14
@EgorBo
Member

EgorBo commented Sep 17, 2020

@AndyAyersMS thanks for the detailed explanation! A small issue bothers me: it's enough to call a method 30 times to promote it to tier1 and bake BB weights there forever. So when we inline it in a different context (callsite) those weights can be irrelevant.
A small ugly benchmark: https://gist.github.com/EgorBo/4956832bf8f67674f86ef78fd7699156

TieredPGO=0:  272 ms
TieredPGO=1:  344 ms

does jit need an ability to deoptimize methods to solve this problem? Or maybe weights should be reset for new callsites?

@AndyAyersMS
Member Author

does jit need an ability to deoptimize methods...?

Good question. Profile based optimizations rely on the past being a good predictor of the future. So any sort of profiling scheme is vulnerable to the profile not accurately representing future behavior.

We are using Tier0 to collect counts because it was fairly easy to set up, and provided an easy way to get profile data into the jit, so we could start improving the ways the jit uses profile data. But as you note, that means we only get to observe the first few calls for some methods. It remains to be seen how well (or poorly) this predicts future behavior in general.

I think we will get around to deopt eventually, as we start making stronger bets based on profile data and need the ability to reconsider and reoptimize if we have bet wrong. I don't know if we'd deopt because of poor block layout, though -- I was imagining it would be more for things like guarded (or unguarded) speculative devirtualization.

@benaadams
Member

Could also sample methods? e.g. periodically switch to an optimised version but with the addition of counters; then switch back. If the counts differ interestingly, could then reoptimize the method with the new weights?

@benaadams
Member

Nice to see. I've tried running some of the bigger benchmarks and don't see consistent improvement (yet)

Plaintext Platform; not much in it
image

Json, interesting, but not much in it
image

Caching
image

Fortunes
image

Single DB query, more interesting
image

Multiple DB Query, even better
image

Data Updates
image

@benaadams
Member

@AndyAyersMS note 👆 that's 5.0 RC1 PGO; not this PR

@AndyAyersMS
Member Author

@benaadams interesting -- I don't expect big improvements just yet. Was this just with TieredPGO=1 or did you try and encourage more methods to pass through Tier0?

It's possible ASP scenarios overflow the 512K slab, depending on how much ends up getting jitted at Tier0 (meaning some methods don't get pgo data).

@benaadams
Member

Used the triple

COMPlus_ReadyToRun=0
COMPlus_TC_QuickJitForLoops=1
COMPlus_TieredPGO=1

@EgorBo
Member

EgorBo commented Sep 23, 2020

I'd love to see TieredPGO + Guarded Devirtualization combo 🙂
e.g.:

void Foo(IDisposable d)
{
    d.Dispose();
}

TieredPGO should somehow detect that in most cases d is let's say Foo with empty Dispose impl:

void Foo(IDisposable d)
{
    if (d is Foo) // guarded devirt.
        return;
    else
        d.Dispose();
}

@AndyAyersMS
Member Author

Yes, we are headed in that direction... either we'll get "currently monomorphic" info from the runtime or else we'll infer it ourselves via profiling. For the latter we might be able to peek at the VSD cell state (for interface calls) or else profile the types that reach the call with custom instrumentation.

@benaadams
Member

Looks like in TE some of the C++ implementations are running a PGO loopback warmup as part of compile step and then recompiling with the PGO data https://github.com/TechEmpower/FrameworkBenchmarks/blob/11bcc746b3444ad393eaaf350060367611ddc8fa/frameworks/C%2B%2B/lithium/lithium.cc#L44-L56

For us, TE does run a warmup test before starting the measurements, which works for tiered compilation; would that also work for runtime PGO?

@AndyAyersMS
Member Author

We'll eventually be building a robust solution for feeding profile data from one run to the next, so some sort of two-phase process might be doable. Though it would be nicer to just get what we need from Tier0.

@AndyAyersMS AndyAyersMS mentioned this pull request Oct 24, 2020
54 tasks
@ghost ghost locked as resolved and limited conversation to collaborators Dec 7, 2020