Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

JIT: Don't do aggressive block compaction too early #105041

Closed
wants to merge 1 commit into from

Conversation

amanasifkhalid
Copy link
Member

@amanasifkhalid amanasifkhalid commented Jul 17, 2024

Several of the regressions in #103972 stem from early aggressive block compaction pessimizing loop inversion. For example, consider the following block layout early in compilation for System.Linq.Tests.Perf_Enumerable.Aggregate_Seed. If we don't aggressively compact in fgUpdateFlowGraphPhase, we get the following layout right before loop inversion:

---------------------------------------------------------------------------------------------------------------------------------------------------------------------
BBnum BBid ref try hnd preds           weight   [IL range]   [jump]                            [EH region]        [flags]
---------------------------------------------------------------------------------------------------------------------------------------------------------------------
BB01 [0000]  1                             1    [000..003)-> BB03(0.5),BB02(0.5)     ( cond )                     i
BB02 [0001]  1       BB01                  0    [003..00A)                           (throw )                     i rare hascall gcsafe
BB03 [0002]  1       BB01                  1    [00A..00D)-> BB05(0.5),BB04(0.5)     ( cond )                     i
BB04 [0003]  1       BB03                  0    [00D..013)                           (throw )                     i rare hascall gcsafe
BB05 [0004]  1       BB03                  1    [013..01C)-> BB06(1)                 (always)                     i hascall gcsafe
BB06 [0005]  1  0    BB05                  1    [01C..01E)-> BB08(1)                 (always) T0      try {       i keep
BB07 [0006]  1  0    BB08                  1    [01E..02E)-> BB08(1)                 (always) T0                  i hascall gcsafe bwd bwd-target
BB08 [0007]  2  0    BB06,BB07             1    [02E..036)-> BB07(0.5),BB16(0.5)     ( cond ) T0      }           i hascall bwd bwd-src
BB16 [0015]  1       BB08                  1    [038..03B)-> BB17(1)                 (always)                     i keep cfb
BB17 [0016]  1       BB16                  1    [03B..042)-> BB15(1)                 (always)                     i hascall gcsafe
BB15 [0012]  1       BB17                  1    [042..044)                           (return)                     i
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ funclets follow
BB12 [0009]  1     0                       1    [038..03B)-> BB14(0.5),BB13(0.5)     ( cond )    H0 F fault {     i keep flet
BB13 [0010]  1     0 BB12                  1    [03B..041)-> BB14(1)                 (always)    H0               i hascall gcsafe
BB14 [0011]  2     0 BB12,BB13             1    [041..042)                           (falret)    H0   }           i
---------------------------------------------------------------------------------------------------------------------------------------------------------------------

In the try region, there's an opportunity to aggressively compact, and get the following loop shape:

BB06 [0005]  2  0    BB05,BB07             1    [01C..036)-> BB07(0.5),BB16(0.5)     ( cond ) T0      try {       i keep hascall bwd
BB07 [0006]  1  0    BB06                  1    [01E..02E)-> BB06(1)                 (always) T0      }           i hascall gcsafe bwd bwd-target
BB16 [0015]  1       BB06                  1    [038..03B)-> BB17(1)                 (always)                     i keep cfb

Loop inversion won't kick in for this, and since fgMoveHotJumps won't modify the first block of an EH region, we don't get the opportunity to flip this shape, so we end up with subpar loop layout.

If we don't aggressively compact, loop inversion kicks in, and we get the following shape:

BB06 [0005]  1  0    BB05                  1    [01C..01E)-> BB15(1)                 (always) T0      try {       i keep
BB15 [0018]  1  0    BB06                  1    [???..???)-> BB09(0.5),BB07(0.5)     ( cond ) T0                  internal
BB07 [0006]  2  0    BB08,BB15             1    [01E..02E)-> BB08(1)                 (always) T0                  i hascall gcsafe bwd bwd-target
BB08 [0007]  1  0    BB07                  1    [02E..036)-> BB07(0.5),BB09(0.5)     ( cond ) T0      }           i hascall bwd bwd-src
BB09 [0015]  2       BB08,BB15             1    [038..03B)-> BB10(1)                 (always)                     i keep cfb

After flow opts and block layout run, our final shape looks like this:

BB04 [0005]  1  0    BB03                  1    [01C..01E)-> BB06(0.5),BB05(0.5)     ( cond ) T0      try {       i keep
BB05 [0019]  2  0    BB04,BB05             4    [01E..036)-> BB05(0.5),BB06(0.5)     ( cond ) T0      }           i loophead gcsafe bwd
BB06 [0015]  2       BB04,BB05             1    [038..044)                           (return)                     i keep gcsafe cfb

I don't think we are losing anything by reserving aggressive compaction for later flow opt phases, and the upside is more opportunities for loop inversion, which may unlock other loop opts.

@dotnet-issue-labeler dotnet-issue-labeler bot added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Jul 17, 2024
Copy link
Contributor

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
See info in area-owners.md if you want to be subscribed.

@amanasifkhalid
Copy link
Member Author

Here's another interesting consequence: Aggressive compaction would sometimes move a loop inside a try region up, such that the loop header is now the first block in the region. Loop cloning gives up if it finds the start of a try region in the loop, so with this change, I'm now seeing more loop cloning.

@amanasifkhalid
Copy link
Member Author

cc @dotnet/jit-contrib, @AndyAyersMS PTAL. Diffs have some pretty big code size increases. The jit-analyze output reveals much of the net size increase stems from more loop cloning. Looking at the example diffs, I see loop inversion is frequently increasing code size slightly, but ultimately improving PerfScore (in general, it looks like many of the small size improvements are PerfScore regressions, and the small size regressions tend to be PerfScore improvements). #103785 had some pretty substantial TP improvements, so the TP cost here isn't as bad as it looks.

@AndyAyersMS
Copy link
Member

I worry about this much churn this late in the cycle. Seems like it will give the usual mix of improvements and some regressions.

I wonder if we could just make loop inversion (and perhaps loop cloning) smarter? Inversion is very pattern-matchy. Though perhaps that causes the same or worse level of churn.

@amanasifkhalid
Copy link
Member Author

I worry about this much churn this late in the cycle. Seems like it will give the usual mix of improvements and some regressions.
I wonder if we could just make loop inversion (and perhaps loop cloning) smarter? Inversion is very pattern-matchy. Though perhaps that causes the same or worse level of churn.

I share your concerns about churn; for similar reasons, I'm nervous about tweaking loop inversion. The diffs for this change are large, but the problem it's trying to address is somewhat narrow: We compact blocks early on such that we get this weird loop shape that loop inversion can't fix, and the loop is now the beginning of a try region, so fgMoveHotJumps won't try to fix it. Since this is a pretty narrow case, I'm ok with us not taking this until later (I think in the next release, we ought to try relaxing some of the EH restrictions during layout, like being able to change the first block of a try region if it doesn't create flow into the middle of it).

I also think the diffs on this PR are misleadingly large. I ran SPMI on this change versus not doing aggressive compaction at all (i.e. without #103785), and this change seems to be reversing a lot of the churn I initially created: The diffs are much smaller. Here's the short summary:

Diffs are based on 2,696,572 contexts (1,091,355 MinOpts, 1,605,217 FullOpts).

MISSED contexts: base: 29,302 (1.07%), diff: 29,214 (1.07%)

Overall (-2,847 bytes)
Collection Base size (bytes) Diff size (bytes) PerfScore in Diffs
aspnet.run.windows.x64.checked.mch 64,452,395 -470 -3.19%
benchmarks.run.windows.x64.checked.mch 11,307,991 -215 -16.48%
benchmarks.run_pgo.windows.x64.checked.mch 41,871,649 -298 -5.17%
benchmarks.run_tiered.windows.x64.checked.mch 12,538,466 +73 +16.53%
coreclr_tests.run.windows.x64.checked.mch 392,985,123 +712 +1.44%
libraries.crossgen2.windows.x64.checked.mch 36,986,365 -87 -1.19%
libraries.pmi.windows.x64.checked.mch 62,824,530 +83 +22.81%
libraries_tests.run.windows.x64.Release.mch 313,793,334 -3,665 -8.06%
libraries_tests_no_tiered_compilation.run.windows.x64.Release.mch 146,552,499 +1,272 -7.66%
realworld.run.windows.x64.checked.mch 12,214,184 -90 -2.95%
smoke_tests.nativeaot.windows.x64.checked.mch 3,489,201 -162 +85.66%
FullOpts (-2,847 bytes)
Collection Base size (bytes) Diff size (bytes) PerfScore in Diffs
aspnet.run.windows.x64.checked.mch 36,418,630 -470 -3.19%
benchmarks.run.windows.x64.checked.mch 11,307,379 -215 -16.48%
benchmarks.run_pgo.windows.x64.checked.mch 22,288,213 -298 -5.17%
benchmarks.run_tiered.windows.x64.checked.mch 3,234,203 +73 +16.53%
coreclr_tests.run.windows.x64.checked.mch 120,195,377 +712 +1.44%
libraries.crossgen2.windows.x64.checked.mch 36,984,660 -87 -1.19%
libraries.pmi.windows.x64.checked.mch 62,711,569 +83 +22.81%
libraries_tests.run.windows.x64.Release.mch 125,753,062 -3,665 -8.06%
libraries_tests_no_tiered_compilation.run.windows.x64.Release.mch 135,705,717 +1,272 -7.66%
realworld.run.windows.x64.checked.mch 11,801,081 -90 -2.95%
smoke_tests.nativeaot.windows.x64.checked.mch 3,488,230 -162 +85.66%

@amanasifkhalid
Copy link
Member Author

@AndyAyersMS are you ok with including this in .NET 9, or should we hold off?

@AndyAyersMS
Copy link
Member

@AndyAyersMS are you ok with including this in .NET 9, or should we hold off?

I am thinking we should hold off. But let me look at which benchmarks had regressions a bit more closely first.

@AndyAyersMS
Copy link
Member

@AndyAyersMS are you ok with including this in .NET 9, or should we hold off?

I am thinking we should hold off. But let me look at which benchmarks had regressions a bit more closely first.

Ok, some of them are pretty large, and the number of improvements from #103785 was fairly small. Can you repro some of the worst-case regressions locally, and see that they're now fixed?

I built a collated table here: #103972 (comment)

@AndyAyersMS
Copy link
Member

Maybe also look at #50204? It might be an interesting alternative, if that's all that it takes to "fix" loop inversion.

@amanasifkhalid
Copy link
Member Author

Ok, some of them are pretty large, and the number of improvements from #103785 was fairly small. Can you repro some of the worst-case regressions locally, and see that they're now fixed?

Thanks for taking a look at the regressions! I'm not able to reproduce any improvement for the largest amd64 regressions locally, unfortunately. Looking at some of the JIT dumps, I see diffs from more conservative block compaction, but this isn't manifesting significant codegen diffs, which is odd...

Maybe also look at #50204? It might be an interesting alternative, if that's all that it takes to "fix" loop inversion.

Thanks for pointing that out. I've opened #105161 -- the asmdiffs looked promising locally. Maybe we can pursue that change for P7, and come back to our block compaction strategy (among other flow opts) in .NET 10.

@amanasifkhalid
Copy link
Member Author

Let's keep the new compaction on for now, and look into adjusting affected phases to handle the shapes they're struggling with.

@amanasifkhalid amanasifkhalid added this to the 10.0.0 milestone Aug 12, 2024
@amanasifkhalid
Copy link
Member Author

No longer needed, since we're moving loop inversion to a graph-based implementation.

@github-actions github-actions bot locked and limited conversation to collaborators Jan 5, 2025
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants