JIT: defer some flow graph reordering until after loop recognition #69878
Conversation
The JIT currently will aggressively reorder the flow graph before running its loop recognition phases. When there is PGO data, this sometimes perturbs the block order so that loops are no longer recognized, and we miss out on some loop optimizations.

This change defers most block reordering until after the JIT has gone through the optimization phases. There is still a limited form of flow cleanup done early on.

There is also a compensating change in loop recognition in one place where it was relying on adjacent blocks being combined.

Fixes dotnet#67318.
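To make the motivation concrete, here is a hedged illustration (hypothetical code, not from the PR) of the kind of loop this affects: when the JIT recognizes a counted loop, optimizations such as loop cloning can produce a fast path with bounds checks removed; if early, profile-driven block reordering leaves the loop unrecognized, that fast path is never generated.

```csharp
// Hypothetical example, not from the PR. When this loop is recognized,
// the JIT can clone it: a runtime check that n <= a.Length guards a
// fast-path copy of the loop with per-iteration bounds checks removed,
// while the original checked loop remains as the slow path.
static int Sum(int[] a, int n)
{
    int sum = 0;
    for (int i = 0; i < n; i++)
    {
        sum += a[i]; // bounds check elided only on the cloned fast path
    }
    return sum;
}
```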
Tagging subscribers to this area: @JulieLeeMSFT
@BruceForstall PTAL. This is going to lead to a massive number of diffs; there's no way I've found to approach this incrementally. If/when we go to a more graph-based loop recognition we can reconsider, but in general we should not be concerned with block ordering early on in the phase pipeline. So I think deferring these sorts of ordering opts is the right long-term approach.
LGTM.
Will be interesting to see how the perf runs react.
Should run jitstress on this.
Needed to pull in a bit more of
@dotnet/dnceng CI seems to be in a bad way
@epananth can you handle this? Repro is as simple as going to

```json
{"$id":"1","innerException":null,"message":"Can't find the package 'microsoft.netcore.app.ref' in feed 'dotnet-tools'.","typeName":"Microsoft.VisualStudio.Services.NuGet.WebApi.Exceptions.PackageNotFoundException, Microsoft.VisualStudio.Services.NuGet.WebApi","typeKey":"PackageNotFoundException","errorCode":0,"eventId":3000}
```

but instead it prompts for auth despite dotnet-tools being a public feed.
Yes, on it.
Created an ICM to track this: https://portal.microsofticm.com/imp/v3/incidents/details/310473385/home
Local x64 Windows diff summary. An overall size regression is expected, as we should now be enabling more cloning.
Generating all these dasm and rc files took forever.
SPMI for arm32 timed out; too many diffs to cope with (it looks like it finished running but got killed before the summary could upload).
Artifacts for the libraries arm run are missing, so it's hard to say why it failed. Going to retry.
/azp run runtime-coreclr jitstress |
Azure Pipelines successfully started running 1 pipeline(s).
Got an arm artifact from the failure this time; looks like something to investigate:
Seems to be happening more now.
JitStress not happy.
No luck reproing this yet.
My local build is a bit behind main; going to merge up.
I can repro now, but it's kind of hard to believe this change is responsible. There are very similar failures in https://dev.azure.com/dnceng/public/_build/results?buildId=1790509&view=ms.vss-test-web.build-test-results-tab which is a few days old. So it looks like jit stress is in a bad way in general. Last green was 9859f70, and the next rolling build after that at 0dba0ee is in bad shape.
Suspect this is related to the W^X changes. I can see the jit is crashing in
@AndyAyersMS Add test from #67318?
I have a follow-on change in the works where I'll add these...
/azp run runtime-coreclr jitstress |
Azure Pipelines successfully started running 1 pipeline(s).
Improvements on Arm64 Windows: dotnet/perf-autofiling-issues#5755
@AndyAyersMS - we found the following regressions that seemed to line up with this PR. We detected this in our analysis while creating the perf report for August. Would you consider these regressions "by design", given that there are improvements associated with this PR as noted above? Regressions:
EDIT(s):
System.Text.RegularExpressions.Tests.Perf_Regex_Industry_RustLang_Sherlock.Count(Pattern: "zqj", Options: None)
System.Text.RegularExpressions.Tests.Perf_Regex_Industry_RustLang_Sherlock.Count(Pattern: "aqj", Options: None)
System.Text.RegularExpressions.Tests.Perf_Regex_Industry_RustLang_Sherlock.Count(Pattern: "aei", Options: Compiled)
System.Text.RegularExpressions.Tests.Perf_Regex_Industry_RustLang_Sherlock.Count(Pattern: "aqj", Options: Compiled)
System.Text.RegularExpressions.Tests.Perf_Regex_Industry_RustLang_Sherlock.Count(Pattern: "aqj", Options: NonBacktracking)
System.Globalization.Tests.StringSearch.IndexOf_Word_NotFound(Options: (en-US, Ordinal, False))
System.Globalization.Tests.StringSearch.LastIndexOf_Word_NotFound(Options: (en-US, IgnoreCase, False))
System.Globalization.Tests.StringSearch.IsSuffix_DifferentLastChar(Options: (en-US, Ordinal, False))
System.Linq.Tests.Perf_Enumerable.Count(input: IEnumerable)
System.IO.Tests.Perf_Path.GetDirectoryName
System.Tests.Perf_Double.Parse(value: "1.7976931348623157e+308")
Yes. This kind of change will always end up causing some regressions.
I don't know what the absolute numbers are, but some of these appear to be huge, e.g.
Am I reading that correctly as a 6x regression? Is that reproable?
I wonder if that was data from @tannergooding. The perf lab doesn't have any x86+AMD data. We have x64+AMD (not sure which HW version), which shows at best a small regression; Intel x86 Windows shows a more "typical" regression of ~25%.
Another regression caused by this from the perf report, this time for System.Collections.IterateForEach.HashSet(Size: 512).
Not this time, my two are (both
The reporting
As far as I can tell, this data was supplied by @adamsitnik. Reaching out to try to repro locally.
I've re-run the Regex benchmarks for x86 using the same machine for .NET 6, 7 Preview 5, and 7 Preview 7. For this particular case ("zqj" + Options.None) the Mean was:

Compared to .NET 6, Preview 7 is 8 times faster, but compared to Preview 5 it's 3 times slower.

```
BenchmarkDotNet=v0.13.1.1845-nightly, OS=Windows 11 (10.0.22000.856/21H2)
AMD Ryzen Threadripper PRO 3945WX 12-Cores, 1 CPU, 24 logical and 12 physical cores
.NET SDK=6.0.109
  [Host]     : .NET 6.0.8 (6.0.822.36306), X86 RyuJIT AVX2
  Job-NDWKRI : .NET 6.0.8 (6.0.822.36306), X86 RyuJIT AVX2
```

```
BenchmarkDotNet=v0.13.1.1845-nightly, OS=Windows 11 (10.0.22000.856/21H2)
AMD Ryzen Threadripper PRO 3945WX 12-Cores, 1 CPU, 24 logical and 12 physical cores
.NET SDK=7.0.100-preview.5.22276.3
  [Host]     : .NET 7.0.0 (7.0.22.27203), X86 RyuJIT AVX2
  Job-LKHRHJ : .NET 7.0.0 (7.0.22.27203), X86 RyuJIT AVX2
```

```
BenchmarkDotNet=v0.13.1.1845-nightly, OS=Windows 11 (10.0.22000.856/21H2)
AMD Ryzen Threadripper PRO 3945WX 12-Cores, 1 CPU, 24 logical and 12 physical cores
.NET SDK=7.0.100-preview.7.22370.3
  [Host]     : .NET 7.0.0 (7.0.22.36904), X86 RyuJIT AVX2
  Job-USYZGL : .NET 7.0.0 (7.0.22.36904), X86 RyuJIT AVX2
```
This regex is going to be heavily dominated by calls to span.IndexOf(string). Would this PR have had material impact on that method's codegen? It's unlikely that changes anywhere else in regex, or impacting other areas, would show up significantly with this expression. Do benchmarks directly on IndexOf show such a hit between Previews 5 and 7?
Seems like the
Please let me know if you have a specific test you'd like data for.
How about data for
And specifically, a test like this:

```csharp
private string _haystack;
private string _needle;

[GlobalSetup]
public async Task Setup()
{
    using HttpClient hc = new HttpClient();
    _haystack = await hc.GetStringAsync("https://www.gutenberg.org/files/1661/1661-0.txt");
    _needle = "zqj";
}

[Benchmark]
public int IndexOf() => _haystack.AsSpan().IndexOf(_needle);
```

This appears to have regressed significantly for me between .NET 6 and .NET 7 (this is on my Win11 x64); I've not checked if it regressed between Preview 5 and 7 on my machine. @EgorBo, I know we spoke about this particular case before, but I thought it was mostly addressed. Is this still expected?
@stephentoub Yeah, it's the worst case for the new algorithm: extremely low density of the first char ('z'). Plain IndexOf, where we align data and only perform a single comparison at once, is faster. I'll check again what I can do for it, but overall it's expected. E.g. other inputs for your benchmark:

```csharp
[Benchmark]
[Arguments("zqj")]
[Arguments("tbd")]
[Arguments("suppressing")]
public int IndexOf(string needle) => _haystack.AsSpan().IndexOf(needle);
```
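To illustrate the trade-off Egor describes, below is a hedged, simplified sketch (illustrative only; not the CoreLib implementation, and the type and method names are invented) of a vectorized search that filters candidate positions by matching both the first and last characters of the needle. Each chunk pays for two vector compares up front; when the first character is extremely rare, a plain single-char scan already rejects almost every position with one compare, so the second compare is mostly wasted work.

```csharp
using System;
using System.Numerics;
using System.Runtime.InteropServices;

static class TwoCharFilter // hypothetical name, for illustration only
{
    public static int IndexOf(ReadOnlySpan<char> haystack, ReadOnlySpan<char> needle)
    {
        if (needle.IsEmpty) return 0;
        int lastStart = haystack.Length - needle.Length;
        if (lastStart < 0) return -1;

        ushort first = needle[0];
        ushort last = needle[needle.Length - 1];
        int gap = needle.Length - 1;
        var vFirst = new Vector<ushort>(first);
        var vLast = new Vector<ushort>(last);
        ReadOnlySpan<ushort> h = MemoryMarshal.Cast<char, ushort>(haystack);

        int i = 0;
        // Two vector compares per chunk: a position is a candidate only
        // if BOTH the first and last needle chars match there.
        for (; i + Vector<ushort>.Count <= lastStart + 1; i += Vector<ushort>.Count)
        {
            var both = Vector.Equals(new Vector<ushort>(h.Slice(i, Vector<ushort>.Count)), vFirst)
                     & Vector.Equals(new Vector<ushort>(h.Slice(i + gap, Vector<ushort>.Count)), vLast);
            if (both != Vector<ushort>.Zero)
            {
                // Verify each surviving candidate with a full comparison.
                for (int j = 0; j < Vector<ushort>.Count; j++)
                {
                    if (both[j] != 0 && haystack.Slice(i + j, needle.Length).SequenceEqual(needle))
                        return i + j;
                }
            }
        }
        // Scalar tail for the last few positions.
        for (; i <= lastStart; i++)
        {
            if (haystack.Slice(i, needle.Length).SequenceEqual(needle))
                return i;
        }
        return -1;
    }
}
```

The upside of the two-char filter shows up on needles whose first character is common (as in the "tbd" and "suppressing" arguments above), where it avoids verifying a flood of false candidates; "zqj" is the opposite extreme.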
Thanks. I'm interested in the results on @adamsitnik's box. Your box is showing a 25% regression for that input; my box is showing a 50% regression for that input. And apparently something on Adam's box was resulting in this input being 6x slower for a regex that should basically boil down to that same IndexOf call.
The regression for [Last]IndexOf_Word_NotFound* showed up in the .NET 6 vs .NET 7 RC2 report:
System.Globalization.Tests.StringSearch.IndexOf_Word_NotFound(Options: (, IgnoreCase, False))
System.Globalization.Tests.StringSearch.IndexOf_Word_NotFound(Options: (en-US, IgnoreCase, False))
System.Globalization.Tests.StringSearch.LastIndexOf_Word_NotFound(Options: (, IgnoreCase, False))
System.Globalization.Tests.StringSearch.LastIndexOf_Word_NotFound(Options: (en-US, IgnoreCase, False))