Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Expose an eeIsJitIntrinsic method for quickly determining if a method handle represents an intrinsic method #51124

Merged
merged 6 commits into from
Apr 14, 2021

Conversation

tannergooding
Copy link
Member

This should resolve #51006

@dotnet-issue-labeler dotnet-issue-labeler bot added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Apr 12, 2021
private bool isJitIntrinsic(CORINFO_METHOD_STRUCT_* ftn)
{
MethodDesc method = HandleToObject(ftn);
return method.IsIntrinsic;
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the only particular thing of note is this (as per https://github.com/dotnet/runtime/compare/main...tannergooding:isJitIntrinsic?expand=1#diff-132a77bcd3f74cf0e0b04fbccda246c97c91e40562d78cb01fff61cf69403573L872-L873) isn't fine grained enough to determine just CORINFO_FLG_INTRINSIC or CORINFO_FLG_JIT_INTRINSIC

@AndyAyersMS
Copy link
Member

Did you measure crossgen2 throughput with this? You could probably also use crossgen1, it might be simpler.

@tannergooding
Copy link
Member Author

Did you measure crossgen2 throughput with this? You could probably also use crossgen1, it might be simpler.

Working on doing that now, I was just planning on running the same benchmark from #51006

@tannergooding
Copy link
Member Author

tannergooding commented Apr 12, 2021

@AndyAyersMS, @jkotas, on my own box, I see the below.

Crossgen2

Base:

    Crossgen2 Throughput - Single - System.Private.Corelib
    Metric               |Average        |Min            |Max
    ---------------------|---------------|---------------|---------------
    Process Time         |2195.994 ms    |2152.472 ms    |2241.161 ms
    Loading Interval     |478.729 ms     |473.132 ms     |488.765 ms
    Emitting Interval    |585.857 ms     |568.642 ms     |601.387 ms
    Jit Interval         |831.327 ms     |810.561 ms     |864.635 ms
    Compilation Interval |2143.333 ms    |2097.013 ms    |2188.326 ms
    Time on Thread       |11848.433 ms   |11634.819 ms   |12010.432 ms

Resolve tokens (+6.03% from Base):

    Crossgen2 Throughput - Single - System.Private.Corelib
    Metric               |Average        |Min            |Max
    ---------------------|---------------|---------------|---------------
    Process Time         |2328.495 ms    |2242.713 ms    |2375.665 ms
    Loading Interval     |494.870 ms     |474.765 ms     |510.811 ms
    Emitting Interval    |582.028 ms     |570.934 ms     |595.669 ms
    Jit Interval         |937.977 ms     |903.238 ms     |985.727 ms
    Compilation Interval |2274.020 ms    |2188.263 ms    |2322.155 ms
    Time on Thread       |13275.996 ms   |12798.085 ms   |13556.782 m

Jan's change (+2.24% from Base, -3.58% from Resolve tokens):

    Crossgen2 Throughput - Single - System.Private.Corelib
    Metric               |Average        |Min            |Max
    ---------------------|---------------|---------------|---------------
    Process Time         |2245.198 ms    |2207.725 ms    |2283.867 ms
    Loading Interval     |484.585 ms     |477.682 ms     |502.124 ms
    Emitting Interval    |569.878 ms     |557.664 ms     |585.911 ms
    Jit Interval         |905.695 ms     |872.238 ms     |917.978 ms
    Compilation Interval |2192.197 ms    |2155.402 ms    |2230.611 ms
    Time on Thread       |13092.984 ms   |12916.836 ms   |13276.031 ms

This change (+1.94% from Base, -3.86% from Resolve tokens, -0.29% from Jan's change):

    Crossgen2 Throughput - Single - System.Private.Corelib
    Metric               |Average        |Min            |Max
    ---------------------|---------------|---------------|---------------
    Process Time         |2238.630 ms    |2178.115 ms    |2301.655 ms
    Loading Interval     |479.965 ms     |473.764 ms     |485.390 ms
    Emitting Interval    |568.050 ms     |564.785 ms     |574.498 ms
    Jit Interval         |894.025 ms     |854.444 ms     |941.489 ms
    Compilation Interval |2186.992 ms    |2126.187 ms    |2249.631 ms
    Time on Thread       |11828.096 ms   |11491.262 ms   |12034.082 ms

Notably, its not clear why this change, which is doing a lot less work than eeGetCallInfo is such a small difference (nor why "maxTime" is worse).
Would the JITEEVersionIdentifer change impact this at all?

@jkotas
Copy link
Member

jkotas commented Apr 12, 2021

The "Time on Thread" has improved significantly with this change. Is it a sign that this change works well?

@AndyAyersMS
Copy link
Member

Error bars in lab measurements suggest there is a few % variation in run times for crossgen2.

image

How many runs are you averaging?

@tannergooding
Copy link
Member Author

Is it a sign that this change works well?

@DrewScoggins might be able to give more insight into what each value is for.
My hypothesis is Jit Interval is the main one we want to be looking at here, given that these changes are in the JIT and should likely impact that.

There is a good bit of noise here as well, and multiple independent runs can result in some pretty variable results.

For example, in the change I just pushed which drops the impResolveToken call for CEE_CONSTRAINED (since we don't need it given we aren't calling eeGetCallInfo anymore), has higher process time and higher time on thread, but less Jit interval:

    Crossgen2 Throughput - Single - System.Private.Corelib
    Metric               |Average        |Min            |Max
    ---------------------|---------------|---------------|---------------
    Process Time         |2247.620 ms    |2192.448 ms    |2386.135 ms
    Loading Interval     |480.200 ms     |475.420 ms     |485.793 ms
    Emitting Interval    |587.992 ms     |562.316 ms     |660.547 ms
    Jit Interval         |870.072 ms     |846.114 ms     |920.709 ms
    Compilation Interval |2195.054 ms    |2140.109 ms    |2334.002 ms
    Time on Thread       |12126.867 ms   |11708.378 ms   |12998.084 ms

@tannergooding
Copy link
Member Author

How many runs are you averaging?

Console output indicates 1 warm up and 5 iterations, for all 4 (now 5) shared results.
The results for the crossgen tests are a bit different from the regular benchmarks.

@AndyAyersMS
Copy link
Member

Given that we're making measurements where we don't have a good model for the underlying distribution, median may be a more useful thing to report than mean.

Historically for TP measurements the jit team would use crossgen1 running under PIN (or some other instruction counting technique), and there the run to run repeatability was quite good. So you might try doing that instead.

@tannergooding
Copy link
Member Author

I can run some more benchmarks later this afternoon. I can collect some instruction counts using AMD uProf (pin largely only works on Intel).

I'll need to figure the command line the benchmark is passing to crossgen as part of that.

@DrewScoggins
Copy link
Member

if compiletype == const.CROSSGEN2_SINGLEFILE:
            referencefilenames = ['System.*.dll', 'Microsoft.*.dll', 'netstandard.dll', 'mscorlib.dll']
            # single assembly filename: example.dll
            filename, ext = os.path.splitext(self.singlefile)
            outputdir = os.path.join(os.getcwd(), const.CROSSGENDIR)
            if not os.path.exists(outputdir):
                os.mkdir(outputdir)
            outputfile = os.path.join(outputdir, filename+'.ni'+ext )
            
            crossgen2args = [
                os.path.join(self.coreroot, self.singlefile),
                '-o', outputfile,
                '-O'
            ]

            for reffile in referencefilenames:
                crossgen2args.extend(['-r', os.path.join(self.coreroot, reffile)])

Here is the code that we use to generate the command line. Looks like we call crossgen2 with the path to the file we are crossgening, give an output file name and use the -O flag as well as -r for the different ref files.

@MichalStrehovsky
Copy link
Member

Error bars in lab measurements suggest there is a few % variation in run times for crossgen2.

For the Native AOT compiler (which is pretty much crossgen2), turning off Tiered JIT stabilized throughput by a lot. We actually also had a compilation throughput improvement with that so it's just off by default there.

@tannergooding
Copy link
Member Author

@AndyAyersMS, @jkotas

I'm seeing an average of:

  • 22 retired instructions in "Base"
  • 234 retired instructions in "Resolve tokens"
  • 190 retired instructions in "Jan's change"
  • 224 retired instructions in "This change"

It isn't clear to me why this is the case initially, since this should be doing less work now.
Are we aware of anything that might impact this, like changing the JITEEVersionIdentifier or modifying the JIT/EE interface invalidating something (PGO data, some cache, etc)

@tannergooding
Copy link
Member Author

Rebased to resolve merge conflict

@jkotas
Copy link
Member

jkotas commented Apr 13, 2021

What does the retired instructions number mean?

@tannergooding
Copy link
Member Author

tannergooding commented Apr 13, 2021

What does the retired instructions number mean?

The number of instructions that were executed and correct (meaning this does not include speculatively executed instructions).
This is roughly equivalent to the instruction count output by the pin tool Andy mentioned.

In general, it isn't clear why calling eeIsJitIntrinsic is not cheaper than calling eeGetCallInfo, considering the latter does everything eeIsJitIntrinsic does, plus more.
Hence my wondering if this is potentially due to something like PGO data or otherwise.

@AndyAyersMS
Copy link
Member

We usually just look at the overall numbers for the whole process, not try and attribute them to particular methods.

@tannergooding
Copy link
Member Author

We usually just look at the overall numbers for the whole process, not try and attribute them to particular methods.

Its lost in the noise of the overall process. The average for all 4 variations, measured across 5 runs each, is about 92000 retired instructions with not enough variation between them to see anything discernable in terms of hardware metrics that ETL or uProf collect.

@AndyAyersMS
Copy link
Member

This small variation in instructions retired is seen across all 4 variants? (base, regression, fix1, fix2...)? If so then something is off in the measurement process.

I can try looking with PIN later and post what I see.

@AndyAyersMS
Copy link
Member

Here's the full process pin icount data for crossgen SPC. Seems to track your crossgen2 data above fairly well.

Looks like we're maybe 1.5% slower than we were before all this. I think we can live with that.

Baseline
Count 19254296972
Count 19256217769
Count 19253906826

Resolve Tokens
Count 20138271494
Count 20142288269
Count 20140098703

Jan's fix
Count 19988303504
Count 19988649133
Count 19990292972

This change
Count 19566027492
Count 19567077283
Count 19564959405

@jkotas jkotas merged commit f24879f into dotnet:main Apr 14, 2021
@tannergooding
Copy link
Member Author

Thanks! I'll need to figure out why uProf wasn't reporting the right numbers on my end

@JulieLeeMSFT JulieLeeMSFT added this to the 6.0.0 milestone Apr 15, 2021
@ghost ghost locked as resolved and limited conversation to collaborators May 15, 2021
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Regression in Crossgen2 Throughput
6 participants