Spill single-def variable at definition to avoid further spilling #54345

Merged
merged 20 commits into from
Jul 10, 2021

kunalspathak
Member

@kunalspathak kunalspathak commented Jun 17, 2021

If a variable has a single definition (single-def) and the register allocator decides at any point to spill it, spill it at its first and only definition. The value on the stack is then always up to date, so we can skip further spilling throughout the method.

This is accomplished in the following way:

  1. During buildIntervals(), we mark interval->isSingleDef=true if the interval is a single-def.
  2. Whenever we spill an interval that is a single-def, we mark its firstRefPosition->singleDefSpill = true, indicating that it should be spilled right after the first ref position. This concept is similar to our writeThru EH-vars.
  3. During the write-back phase, when we iterate over all refPositions, we find the first refPositions that are marked as singleDefSpill and set the corresponding varDsc->lvSpillAtSingleDef=true.
  4. During codegen, we use varDsc->lvSpillAtSingleDef to decide whether to spill a variable and include it in GC pointer scanning (since it is always on the stack).
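
The four steps above can be sketched with a toy model. This is only an illustration under stated assumptions: the field names (`isSingleDef`, `singleDefSpill`, `lvSpillAtSingleDef`) come from the description, but the structs are simplified stand-ins and the helper functions (`markSingleDef`, `onSpill`, `writeBack`, `shouldSpillAtDef`) are hypothetical, not the actual JIT code. In particular, the real write-back phase iterates over refPositions, while this sketch iterates over intervals for brevity.

```cpp
#include <vector>

// Simplified stand-ins for the JIT data structures named in the description.
// Only the fields mentioned above are modeled; everything else is hypothetical.
struct RefPosition
{
    bool singleDefSpill = false; // step 2: spill right after this ref position
};

struct LclVarDsc
{
    bool lvSpillAtSingleDef = false; // step 3: consumed by codegen in step 4
};

struct Interval
{
    bool         isSingleDef      = false; // step 1: set while building intervals
    RefPosition* firstRefPosition = nullptr;
    LclVarDsc*   varDsc           = nullptr;
};

// Step 1 (hypothetical helper): an interval is single-def if it has exactly one definition.
void markSingleDef(Interval& interval, int defCount)
{
    interval.isSingleDef = (defCount == 1);
}

// Step 2 (hypothetical helper): called whenever the allocator decides to spill an interval.
void onSpill(Interval& interval)
{
    if (interval.isSingleDef && interval.firstRefPosition != nullptr)
    {
        interval.firstRefPosition->singleDefSpill = true;
    }
}

// Step 3 (hypothetical helper): propagate the ref position flag to the variable descriptor.
void writeBack(std::vector<Interval>& intervals)
{
    for (Interval& interval : intervals)
    {
        if (interval.firstRefPosition != nullptr && interval.varDsc != nullptr &&
            interval.firstRefPosition->singleDefSpill)
        {
            interval.varDsc->lvSpillAtSingleDef = true;
        }
    }
}

// Step 4 (hypothetical helper): codegen stores to the stack at the definition
// and treats the stack slot as always live for GC reporting.
bool shouldSpillAtDef(const LclVarDsc& varDsc)
{
    return varDsc.lvSpillAtSingleDef;
}
```

Note that an interval that is never spilled never gets the flag, so variables that stay in a register for their whole lifetime pay no extra store.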

As a positive side effect, when choosing a register to spill, we prefer one assigned to a "spill at single-def" interval because the cost of spilling it is lower than for other intervals.

This gives a significant performance improvement in scenarios that involve common sub-expression elimination (CSE). With CSE, we define a temp variable once and use it in multiple places, so this change eliminates the repeated spilling of these temporary variables throughout the method. Below is an example diff of GenericArraySortHelper.InsertionSort. The method involves a nested loop, and we CSE some temps that we now spill immediately at their definition. That eliminates the spilling of those variables inside the loop.

image

image

A similar improvement can be seen for the "this" argument, which is always a single-def.

Diff of InsertionSort: https://www.diffchecker.com/bZEOJkJ6

Contributes to #6761 and #6825.

Credit: section 4.3 of the paper "Optimized Interval Splitting in a Linear Scan Register Allocator" by Christian Wimmer and Hanspeter Mössenböck:

Most intervals have only one instruction that defines the value, but are used multiple times later on. If such an interval is spilled and reloaded several times, we insert a spill move directly after the definition. Therefore, the stack slot is up-to-date in all possible code paths, and all further stores to this stack slot can be eliminated.
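
As a rough illustration of the effect described in the quote, the sketch below counts stores to the stack slot for one single-def interval. It is a deliberately simplified model and not the allocator's logic: the `Event` trace and `countStores` are hypothetical, and it assumes that without the optimization every eviction from a register emits a store.

```cpp
#include <vector>

// Hypothetical lifetime events for one single-def interval.
enum class Event
{
    Def,    // the only definition of the value
    Evict,  // the interval loses its register (a spill)
    Reload  // the value is loaded back from its stack slot
};

// Counts the stores to the stack slot under two policies:
//  - spillAtDef == false: assume every eviction writes the register to the stack.
//  - spillAtDef == true:  a single store right after the definition, and only
//    if the interval is evicted at least once during its lifetime.
int countStores(const std::vector<Event>& trace, bool spillAtDef)
{
    bool everEvicted = false;
    for (Event e : trace)
    {
        if (e == Event::Evict)
        {
            everEvicted = true;
        }
    }

    int stores = 0;
    for (Event e : trace)
    {
        if (e == Event::Def && spillAtDef && everEvicted)
        {
            stores++; // the stack slot is up to date from here on
        }
        else if (e == Event::Evict && !spillAtDef)
        {
            stores++; // repeated store of an unchanged value
        }
    }
    return stores;
}
```

For a trace with one definition and three evictions (for example, one per loop iteration), the model gives three stores without the optimization and one with it.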

@dotnet-issue-labeler dotnet-issue-labeler bot added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Jun 17, 2021
@kunalspathak
Member Author

kunalspathak commented Jun 30, 2021

jitstressregs failures are related to #54007 and #47094

@kunalspathak
Member Author

kunalspathak commented Jul 1, 2021

Below is the summary of diffs. Detailed diffs can be seen here.

Code Size

| Scenario | OS/Arch | main | PR | Diff | Diff% |
|---|---|---|---|---|---|
| benchmarks | Linux.x64 | 2028424 | 2013740 | -14684 | -0.72% |
| benchmarks | windows.arm64 | 597164 | 594792 | -2372 | -0.40% |
| benchmarks | windows.x64 | 1084734 | 1077145 | -7589 | -0.70% |
| benchmarks | windows.x86 | 2614926 | 2591227 | -23699 | -0.91% |
| libraries.crossgen2 | Linux.arm | 6108300 | 6094282 | -14018 | -0.23% |
| libraries.crossgen2 | Linux.arm64 | 3063300 | 3057996 | -5304 | -0.17% |
| libraries.crossgen2 | Linux.x64 | 7054508 | 7010282 | -44226 | -0.63% |
| libraries.crossgen2 | windows.arm64 | 3417052 | 3411404 | -5648 | -0.17% |
| libraries.crossgen2 | windows.x64 | 3797672 | 3781087 | -16585 | -0.44% |
| libraries.pmi | Linux.arm | 5770054 | 5749106 | -20948 | -0.36% |
| libraries.pmi | Linux.arm64 | 2962256 | 2951384 | -10872 | -0.37% |
| libraries.pmi | Linux.x64 | 9075022 | 9012979 | -62043 | -0.68% |
| libraries.pmi | windows.arm64 | 3238056 | 3226780 | -11276 | -0.35% |
| libraries.pmi | windows.x64 | 4852165 | 4818860 | -33305 | -0.69% |
| libraries.pmi | windows.x86 | 11316890 | 11223315 | -93575 | -0.83% |
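
For readers checking the table, the Diff and Diff% columns appear to be computed as PR minus main and that difference relative to main. The helper below is a minimal sketch of that arithmetic; `percentDiff` is a hypothetical name, not part of any diff tooling. For the first row, (2013740 - 2028424) / 2028424 * 100 is about -0.72.

```cpp
#include <cmath>

// Relative change of the PR measurement against main, in percent.
// Negative values mean the PR produced a smaller number (an improvement here).
double percentDiff(double mainValue, double prValue)
{
    return (prValue - mainValue) / mainValue * 100.0;
}
```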

Perf Score

| Scenario | OS/Arch | main | PR | Diff | Diff% |
|---|---|---|---|---|---|
| benchmarks | Linux.x64 | 767855166.9 | 766740526.5 | -1114640.36 | -0.15% |
| benchmarks | windows.arm64 | 907973742.4 | 906921505.7 | -1052236.71 | -0.12% |
| benchmarks | windows.x64 | 946655065.1 | 945568497.9 | -1086567.13 | -0.11% |
| benchmarks | windows.x86 | 1279839380 | 1227145769 | -52693610.3 | -4.12% |
| libraries.crossgen2 | Linux.arm | 28549039.27 | 28416554.65 | -132484.62 | -0.46% |
| libraries.crossgen2 | Linux.arm64 | 29893024.81 | 29782327.28 | -110697.53 | -0.37% |
| libraries.crossgen2 | Linux.x64 | 27786108.63 | 27565373.61 | -220735.02 | -0.79% |
| libraries.crossgen2 | windows.arm64 | 31240565.32 | 31128520.16 | -112045.16 | -0.36% |
| libraries.crossgen2 | windows.x64 | 23300356.98 | 23111786.57 | -188570.41 | -0.81% |
| libraries.pmi | Linux.arm | 2.96123E+16 | 2.96123E+16 | -17521187214 | 0.00% |
| libraries.pmi | Linux.arm64 | 3.37907E+16 | 3.37907E+16 | -17218540882 | 0.00% |
| libraries.pmi | Linux.x64 | 1.63357E+16 | 1.63357E+16 | -17664597865 | 0.00% |
| libraries.pmi | windows.arm64 | 3.37907E+16 | 3.37907E+16 | -17219394161 | 0.00% |
| libraries.pmi | windows.x64 | 1.63349E+16 | 1.63349E+16 | -17253326644 | 0.00% |
| libraries.pmi | windows.x86 | 6.64414E+16 | 6.64414E+16 | -53932960191 | 0.00% |

Instruction Count

| Scenario | OS/Arch | main | PR | Diff | Diff% |
|---|---|---|---|---|---|
| benchmarks | Linux.x64 | 503104 | 500082 | -3022 | -0.60% |
| benchmarks | windows.arm64 | 149291 | 148698 | -593 | -0.40% |
| benchmarks | windows.x64 | 261483 | 260284 | -1199 | -0.46% |
| benchmarks | windows.x86 | 794377 | 788351 | -6026 | -0.76% |
| libraries.crossgen2 | Linux.arm | 2301422 | 2295304 | -6118 | -0.27% |
| libraries.crossgen2 | Linux.arm64 | 765825 | 764499 | -1326 | -0.17% |
| libraries.crossgen2 | Linux.x64 | 1709422 | 1701060 | -8362 | -0.49% |
| libraries.crossgen2 | windows.arm64 | 854263 | 852851 | -1412 | -0.17% |
| libraries.crossgen2 | windows.x64 | 891037 | 888170 | -2867 | -0.32% |
| libraries.pmi | Linux.arm | 2076031 | 2068035 | -7996 | -0.39% |
| libraries.pmi | Linux.arm64 | 740564 | 737846 | -2718 | -0.37% |
| libraries.pmi | Linux.x64 | 2231946 | 2219408 | -12538 | -0.56% |
| libraries.pmi | windows.arm64 | 809514 | 806695 | -2819 | -0.35% |
| libraries.pmi | windows.x64 | 1148648 | 1142978 | -5670 | -0.49% |
| libraries.pmi | windows.x86 | 3535862 | 3512175 | -23687 | -0.67% |

@kunalspathak kunalspathak marked this pull request as ready for review July 1, 2021 17:13
@kunalspathak
Member Author

@dotnet/jit-contrib

@kunalspathak
Member Author

FYI - @danmoseley

@kunalspathak kunalspathak changed the title from "Singledef spill" to "Spill single-def variable at definition to avoid further spilling" Jul 1, 2021
@kunalspathak
Member Author

outerloop failures are #54469

Contributor

@sandreenko sandreenko left a comment

This looks great, nice catch!

One question: I looked at the regressions, and they mostly look like:

-Generating: N089 (  3,  3) [000140] DA----------              *  STORE_LCL_VAR long   V12 tmp10        d:2 rdx REG rdx
+Generating: N089 (  3,  3) [000140] DA---------z              *  STORE_LCL_VAR long   V12 tmp10        d:2 rdx REG rdx
+IN000b:        mov      qword ptr [V12 rsp+20H], rdx

                                                    
							V12 in reg rdx is becoming live  [000140]
							Live regs: 00000002 {rcx} => 00000006 {rcx rdx}
							Live vars: {V06} => {V06 V12}
genIPmappingAdd: ignoring duplicate IL offset 0xc
Generating: N091 (???,???) [000196] ------------                 IL_OFFSET void   IL offset: 0xc REG NA
Generating: N093 (  1,  1) [000122] ------------       t122 =    LCL_VAR   long   V12 tmp10        u:2 rdx REG rdx $401
Generating: N095 (  1,  1) [000123] -c----------       t123 =    CNS_INT   long   0 REG NA $100
                                                              /--*  t122   long   
                                                              +--*  t123   long   
Generating: N097 (  3,  3) [000124] J------N----              *  EQ        void   REG NA $245
-Not emitting compare due to flags being already set
+IN000c:        test     rdx, rdx
Generating: N099 (  5,  5) [000125] ------------              *  JTRUE     void   REG NA
-IN000b:        je       L_M36406_BB08
+IN000d:        je       L_M36406_BB08

If I understand correctly, mov does not change the S/Z flags, so can we add a special condition to AreFlagsSetToZeroCmp that looks at the instruction before the previous instruction (why doesn't the English language have a word for this?) when the previous one is such a spill? Would that make this change all improvements and no regressions?
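
The suggestion above could look roughly like the following sketch. It is not the emitter's API and not something this PR implements: `Instr` and `flagsAlreadySetFor` are hypothetical, and the model only knows whether an instruction writes the flags and whether it is a spill store to the stack. The idea is to walk backwards over the emitted instructions, skip flag-preserving spill stores, and check whether the first remaining instruction set the flags from the register being compared with zero.

```cpp
#include <string>
#include <vector>

// Hypothetical, heavily simplified view of an already emitted instruction.
struct Instr
{
    std::string mnemonic;
    std::string destReg;      // register written; empty if the destination is memory
    bool        writesFlags;  // true if the instruction updates the S/Z flags
    bool        isSpillStore; // true for a "mov [stack slot], reg" spill
};

// Returns true if a "test reg, reg" would be redundant: the most recent
// instruction that is not a flag-preserving spill store wrote the flags
// while producing `reg`. Any other intervening instruction makes the answer
// conservatively false.
bool flagsAlreadySetFor(const std::vector<Instr>& emitted, const std::string& reg)
{
    for (auto it = emitted.rbegin(); it != emitted.rend(); ++it)
    {
        if (it->isSpillStore && !it->writesFlags)
        {
            continue; // look past the spill; it leaves the flags untouched
        }
        return it->writesFlags && it->destReg == reg;
    }
    return false;
}
```

With an `and` that writes rdx followed by a spill of rdx to the stack, the sketch reports that the flags are still valid for rdx, which is the case where the extra `test rdx, rdx` in the dump above would be unnecessary.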

This diff in benchmark looks like a potential measurable regression:
232 ( 3.51% of base) : 878.dasm - System.Number:NumberToStringFormat(byref,byref,System.ReadOnlySpan`1[Char],System.Globalization.NumberFormatInfo)

src/coreclr/jit/codegencommon.cpp (outdated, resolved)
unsigned char lvSpillAtSingleDef : 1; // variable has a single def (as determined by LSRA interval scan)
// and is spilled making it candidate to spill right after the
// first (and only) definition.
// Note: We cannot reuse lvSingleDefRegCandidate because it is set
Contributor

Does it mean that lvSingleDefRegCandidate==true does not imply that the variable has a single def for lvSpillAtSingleDef purposes?

Member Author

That's correct. lvSingleDefRegCandidate is only used to decide whether we should enregister an EH-var (which happens in an earlier phase). Things change after that, and we can't rely on that information. The safest place I could think of to check for single-def is while building intervals. I wanted to do it for EH-vars as well, but unfortunately we need that information before LSRA.

@kunalspathak
Member Author

AreFlagsSetToZeroCmp

I do not see that diff. Here are the diffs that I see. Maybe my branch didn't have #53214 and you compared it with main?

@sandreenko
Contributor

I do not see that diff. Here are the diffs that I see. Maybe my branch didn't have #53214 and you compared it with main?

I was comparing the top of your branch with sandreenko@87cd70c (the last change on your branch that is not from you). Strange.

benchmarks.run.windows.x64.checked.mch
Top method regressions (percentages):
           3 ( 3.75% of base) : 19944.dasm - System.Memory.Span`1[Char][System.Char]:Clear():this
         232 ( 3.51% of base) : 878.dasm - System.Number:NumberToStringFormat(byref,byref,System.ReadOnlySpan`1[Char],System.Globalization.NumberFormatInfo)
          86 ( 2.79% of base) : 10294.dasm - Number:ParseNumber(byref,long,int,byref,System.Text.StringBuilder,System.Globalization.NumberFormatInfo,bool):bool

@kunalspathak
Member Author

kunalspathak commented Jul 7, 2021

Here are the numbers for aspnet


### Summary of Perf Score diffs:
(Lower is better)

Total PerfScoreUnits of base: 231282619.1500001
Total PerfScoreUnits of diff: 231214548.70000002
Total PerfScoreUnits of delta: -68070.45 (-0.03% of base)
Total relative delta: 0.12
    diff is an improvement.
    relative diff is a regression.


--------------------------------------------------------------------------------

### Summary of Code Size diffs:
(Lower is better)

Total bytes of base: 461140
Total bytes of diff: 459362
Total bytes of delta: -1778 (-0.39% of base)
Total relative delta: -1.56
    diff is an improvement.
    relative diff is an improvement.


--------------------------------------------------------------------------------

### Summary of Instruction Count diffs:
(Lower is better)

Total Instructions of base: 113121
Total Instructions of diff: 112787
Total Instructions of delta: -334 (-0.30% of base)
Total relative delta: -1.24
    diff is an improvement.
    relative diff is an improvement.


--------------------------------------------------------------------------------

Detail diffs: https://gist.github.com/kunalspathak/b92c9221560a6c0c9136d6419f50fd03

@kunalspathak
Member Author

Thanks to @sandreenko for bringing the test diff to my attention. I will investigate why I didn't notice it earlier.

image

Member

@AndyAyersMS AndyAyersMS left a comment

Overall this looks good to me. Agree with Sergey that some refactoring to capture the common predicate chains would make this more readable/maintainable.

Seems like some interesting avenues for follow up too:

  • more aggressive CSE policy (since CSEs are usually single def)
  • reconsider which weights we use for ref positions (global/local...?)
  • decide if spilling at def is sub-optimal if all uses are cheap

src/coreclr/jit/lclvars.cpp (resolved)
src/coreclr/jit/lsra.cpp (resolved)
src/coreclr/jit/lsra.cpp (outdated, resolved)
@kunalspathak
Member Author

There were some errors in the environment when I collected the numbers above. Here is a fresh set (no major changes).
Full diff here: https://gist.github.com/kunalspathak/dfb34640229c498d374bdaee4be251b5

Code size

| Scenario | OS/arch | main | PR | diff | diff% | methods improved | methods regressed | methods unchanged |
|---|---|---|---|---|---|---|---|---|
| Benchmarks | Linux.x64 | 2028424 | 2013740 | -14684 | -0.72% | 869 | 18 | 948 |
| Benchmarks | windows.arm64 | 597164 | 594792 | -2372 | -0.40% | 121 | 3 | 232 |
| Benchmarks | windows.x64 | 852477 | 846816 | -5661 | -0.66% | 274 | 10 | 379 |
| Benchmarks | windows.x86 | 2223804 | 2203334 | -20470 | -0.92% | 1661 | 91 | 1469 |
| Libraries.crossgen2 | Linux.arm | 6108300 | 6094282 | -14018 | -0.23% | 2051 | 18 | 3344 |
| Libraries.crossgen2 | Linux.arm64 | 3063300 | 3057996 | -5304 | -0.17% | 485 | 1 | 1130 |
| Libraries.crossgen2 | Linux.x64 | 7054508 | 7010282 | -44226 | -0.63% | 2967 | 14 | 5796 |
| Libraries.crossgen2 | windows.arm64 | 3417052 | 3411404 | -5648 | -0.17% | 515 | 2 | 1210 |
| Libraries.crossgen2 | windows.x64 | 132021 | 131325 | -696 | -0.53% | 71 | 0 | 127 |
| Libraries.crossgen2 | windows.x86 | 1015596 | 1008170 | -7426 | -0.73% | 1066 | 20 | 3835 |
| Libraries.pmi | Linux.arm | 5770054 | 5749106 | -20948 | -0.36% | 2455 | 23 | 3764 |
| Libraries.pmi | Linux.arm64 | 2962256 | 2951384 | -10872 | -0.37% | 791 | 19 | 971 |
| Libraries.pmi | Linux.x64 | 9075022 | 9012979 | -62043 | -0.68% | 4491 | 31 | 5367 |
| Libraries.pmi | windows.arm64 | 3238056 | 3226780 | -11276 | -0.35% | 808 | 19 | 1055 |
| Libraries.pmi | windows.x64 | 4092097 | 4063536 | -28561 | -0.70% | 1494 | 24 | 1782 |
| Libraries.pmi | windows.x86 | 9859435 | 9779280 | -80155 | -0.81% | 7787 | 296 | 10327 |

Perf score

| Scenario | OS/arch | main | PR | diff | diff% | methods improved | methods regressed | methods unchanged |
|---|---|---|---|---|---|---|---|---|
| Benchmarks | Linux.x64 | 767855166.9 | 766740526.5 | -1114640.36 | -0.15% | 941 | 173 | 721 |
| Benchmarks | windows.arm64 | 907973742.4 | 906921505.7 | -1052236.71 | -0.12% | 133 | 56 | 167 |
| Benchmarks | windows.x64 | 2878282.8 | 2842281.39 | -36001.41 | -1.25% | 289 | 84 | 290 |
| Benchmarks | windows.x86 | 471643297.9 | 471528942.9 | -114355 | -0.02% | 1823 | 275 | 1123 |
| Libraries.crossgen2 | Linux.arm | 28549039.27 | 28416554.65 | -132484.62 | -0.46% | 2201 | 313 | 2899 |
| Libraries.crossgen2 | Linux.arm64 | 29893024.81 | 29782327.28 | -110697.53 | -0.37% | 603 | 139 | 874 |
| Libraries.crossgen2 | Linux.x64 | 27786108.63 | 27565373.61 | -220735.02 | -0.79% | 3029 | 518 | 5230 |
| Libraries.crossgen2 | windows.arm64 | 31240565.32 | 31128520.16 | -112045.16 | -0.36% | 642 | 153 | 932 |
| Libraries.crossgen2 | windows.x64 | 174485.05 | 169961.52 | -4523.53 | -2.59% | 79 | 21 | 98 |
| Libraries.crossgen2 | windows.x86 | 1243219.4 | 1233926.25 | -9293.15 | -0.75% | 1207 | 276 | 3438 |
| Libraries.pmi | Linux.arm | 2.96123E+16 | 2.96123E+16 | -17521187214 | 0.00% | 2307 | 545 | 3390 |
| Libraries.pmi | Linux.arm64 | 3.37907E+16 | 3.37907E+16 | -17218540882 | 0.00% | 712 | 334 | 735 |
| Libraries.pmi | Linux.x64 | 1.63357E+16 | 1.63357E+16 | -17664597865 | 0.00% | 4474 | 833 | 4582 |
| Libraries.pmi | windows.arm64 | 3.37907E+16 | 3.37907E+16 | -17219394161 | 0.00% | 732 | 346 | 804 |
| Libraries.pmi | windows.x64 | 9.09125E+12 | 9.09125E+12 | -4468993.94 | 0.00% | 1448 | 399 | 1453 |
| Libraries.pmi | windows.x86 | 1.74537E+13 | 1.74171E+13 | -36683260555 | -0.21% | 8260 | 1704 | 8446 |

Instruction count

| Scenario | OS/arch | main | PR | diff | diff% | methods improved | methods regressed | methods unchanged |
|---|---|---|---|---|---|---|---|---|
| Benchmarks | Linux.x64 | 503104 | 500082 | -3022 | -0.60% | 861 | 10 | 964 |
| Benchmarks | windows.arm64 | 149291 | 148698 | -593 | -0.40% | 121 | 3 | 232 |
| Benchmarks | windows.x64 | 209502 | 208582 | -920 | -0.44% | 266 | 10 | 387 |
| Benchmarks | windows.x86 | 685335 | 680070 | -5265 | -0.77% | 1627 | 77 | 1517 |
| Libraries.crossgen2 | Linux.arm | 2301422 | 2295304 | -6118 | -0.27% | 2052 | 11 | 3350 |
| Libraries.crossgen2 | Linux.arm64 | 765825 | 764499 | -1326 | -0.17% | 485 | 1 | 1130 |
| Libraries.crossgen2 | Linux.x64 | 1709422 | 1701060 | -8362 | -0.49% | 2935 | 13 | 5829 |
| Libraries.crossgen2 | windows.arm64 | 854263 | 852851 | -1412 | -0.17% | 515 | 2 | 1210 |
| Libraries.crossgen2 | windows.x64 | 33439 | 33293 | -146 | -0.44% | 71 | 0 | 127 |
| Libraries.crossgen2 | windows.x86 | 351044 | 349011 | -2033 | -0.58% | 1042 | 18 | 3861 |
| Libraries.pmi | Linux.arm | 2076031 | 2068035 | -7996 | -0.39% | 2457 | 15 | 3770 |
| Libraries.pmi | Linux.arm64 | 740564 | 737846 | -2718 | -0.37% | 791 | 19 | 971 |
| Libraries.pmi | Linux.x64 | 2231946 | 2219408 | -12538 | -0.56% | 4459 | 16 | 5414 |
| Libraries.pmi | windows.arm64 | 809514 | 806695 | -2819 | -0.35% | 808 | 19 | 1055 |
| Libraries.pmi | windows.x64 | 978759 | 973838 | -4921 | -0.50% | 1481 | 18 | 1801 |
| Libraries.pmi | windows.x86 | 3118510 | 3097750 | -20760 | -0.67% | 7629 | 218 | 10563 |

@kunalspathak
Member Author

Thanks to @sandreenko for bringing the test diff to my attention. I will investigate why I didn't notice it earlier.

I investigated the regressions related to test and noticed that we re-introduce the test because we just created a definition and want to spill it to memory immediately. Although mov doesn't affect any flags, we do not have a way to look past the previous instruction to determine whether it is safe to eliminate the test instruction.

@kunalspathak
Member Author

@AndyAyersMS , @sandreenko - I have addressed the review feedback.

Member

@AndyAyersMS AndyAyersMS left a comment

Just one more thing I'd like you to look at.

src/coreclr/jit/morph.cpp (outdated, resolved)
If a single-def variable is decided to get spilled in its lifetime, then
spill it at the firstRefPosition RefTypeDef so the value of the variable
is always valid on the stack. Going forward, no more spills will be needed
for such variable or no more resolutions (reg to stack) will be needed for
such single-def variables.
Member

@AndyAyersMS AndyAyersMS left a comment

LGTM.

@ghost

ghost commented Jul 10, 2021

Hello @kunalspathak!

Because this pull request has the auto-merge label, I will be glad to assist with helping to merge this pull request once all check-in policies pass.

p.s. you can customize the way I help with merging this pull request, such as holding this pull request until a specific person approves. Simply @mention me (@msftbot) and give me an instruction to get started! Learn more here.

@kunalspathak
Member Author

Failures are infrastructure related.

@kunalspathak
Member Author

Just found out a related issue - #7994.

Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI
3 participants