Cse tuning (#1463)
* cse-tuning branch

1. Changed csdLiveAcrossCall to a bool (zero-diff)

* 2.  Added the remaining zero-diff changes from my old coreclr branch (zero-diff)

* 3. Incoming stack arguments don't use any local stack frame slots

x64  5 improvements 0 regressions,  Total PerfScore diff: -10.72
x86 16 improvements 5 regressions,  Total PerfScore diff: -72.95

* 4.  Locals with no references aren't enregistered  (zero-diffs)

* 5. Fix handling of long integer types: they use only one register, not two.

    x64 250 improvements 51 regressions,  Total PerfScore diff:   -459.09
  arm64 162 improvements 16 regressions,  Total PerfScore diff:  -1712.52

* 6. Adjust computation of moderateRefCnt and aggressiveRefCnt values

     x64 280 improvements 81 regressions,  Total PerfScore diff:   -274.78
   arm64 264 improvements 61 regressions,  Total PerfScore diff:   -911.00
     x86  87 improvements 42 regressions,  Total PerfScore diff:   -123.46
   arm32 195 improvements 81 regressions,  Total PerfScore diff:   -239.10

* 7.  slotCount refactor (zero-diffs)

* 8.  Enable the use of the live across call information

      x64 125 improvements 136 regressions, Total PerfScore diff:   +427.43
    arm64  83 improvements 153 regressions, Total PerfScore diff:   +260.68
      x86 218 improvements 193 regressions, Total PerfScore diff:   +199.81
    arm32 145 improvements 181 regressions, Total PerfScore diff: -33283.10

arm32 method with improvement:
    -33864.40 (-2.87% of base) : System.Private.CoreLib.dasm - TypeBuilder:CreateTypeNoLock():TypeInfo:this (2 methods)

* 9.  Adjust the cse_use_costs for the LiveAcrossCall case

      x64  61 improvements  61 regressions, Total PerfScore diff:   -189.03
    arm64  90 improvements  49 regressions, Total PerfScore diff:   -463.42
      x86  88 improvements  80 regressions, Total PerfScore diff:   -238.61
    arm32 101 improvements  63 regressions, Total PerfScore diff:   -259.50

* 10.  If this CSE is live across a call then we may need to spill an additional caller save register

          x64  73 improvements  45 regressions, Total PerfScore diff:   -279.88
        arm64  45 improvements  76 regressions, Total PerfScore diff:    -90.94
          x86  13 improvements  14 regressions, Total PerfScore diff:    -21.55
        arm32  45 improvements  33 regressions, Total PerfScore diff:    -78.60

* 11.  (x64 only)  floating point loads/stores encode larger, so adjust the cse def/use cost for SMALL_CODE

   No diffs in System.Private.CoreLib

* 12. Remove extra cse def/use costs for methods that have a largeFrame or a hugeFrame

       x64 199 improvements  50 regressions, Total PerfScore diff:   -2061.36
     arm64  11 improvements   3 regressions, Total PerfScore diff:     -46.84
       x86 136 improvements  80 regressions, Total PerfScore diff:   -1795.00
     arm32  50 improvements  35 regressions, Total PerfScore diff:    -132.30

* clang-format

* Code review feedback

Removed increment of enregCount on _TARGET_X86_ when we have compLongUsed:
    Framework diffs
    Total PerfScoreUnits of diff: -654.75 (-0.00% of base)  diff is an improvement.
    79 total methods with Perf Score differences (55 improved, 24 regressed), 146432 unchanged.

Fixed setting of largeFrame/hugeFrame for ARM64
    Zero framework diffs.


* run jit-format

* correct some wording in comments

* reword a comment
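
Items 8 through 10 above adjust the CSE def/use costs when a candidate is live across a call. The following is a minimal sketch of that kind of cost comparison, not the code from this commit: `CseCandidate`, `TempCosts`, `EstimateTempCosts`, `IsProfitable`, and every constant are hypothetical names and values chosen only for illustration.

```cpp
// Hypothetical, simplified model of a CSE candidate; this is not the JIT's CSEdsc.
struct CseCandidate
{
    unsigned defCount;       // number of definitions of the expression
    unsigned useCount;       // number of later uses of the expression
    unsigned exprCost;       // assumed cost of evaluating the expression once
    bool     liveAcrossCall; // true if the value must survive a call
};

// Assumed per-def and per-use costs of keeping the value in a temp.
struct TempCosts
{
    unsigned defCost;
    unsigned useCost;
};

TempCosts EstimateTempCosts(const CseCandidate& c)
{
    TempCosts costs = {1, 1}; // assumption: the temp normally lives in a register
    if (c.liveAcrossCall)
    {
        // Assumption: a value that is live across a call may need an extra
        // register save or a spill, so both defs and uses are charged more.
        costs.defCost += 1;
        costs.useCost += 1;
    }
    return costs;
}

// Promote the CSE only when the total cost with the temp is lower than
// the cost of re-evaluating the expression at every def and use.
bool IsProfitable(const CseCandidate& c)
{
    const TempCosts costs      = EstimateTempCosts(c);
    const unsigned  withoutCse = (c.defCount + c.useCount) * c.exprCost;
    const unsigned  withCse    = c.defCount * (c.exprCost + costs.defCost) + c.useCount * costs.useCost;
    return withCse < withoutCse;
}
```

Under these assumed costs, a candidate with one def, two uses, and an expression cost of 2 is profitable when it is not live across a call (5 versus 6) and unprofitable when it is (8 versus 6).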
briansull authored Jan 15, 2020
1 parent e92e2e6 commit 8b59b12
Showing 2 changed files with 337 additions and 105 deletions.
14 changes: 7 additions & 7 deletions src/coreclr/src/jit/compiler.h
@@ -1001,8 +1001,8 @@ class TempDsc
TempDsc(int _tdNum, unsigned _tdSize, var_types _tdType) : tdNum(_tdNum), tdSize((BYTE)_tdSize), tdType(_tdType)
{
#ifdef DEBUG
- assert(tdNum <
- 0); // temps must have a negative number (so they have a different number from all local variables)
+ // temps must have a negative number (so they have a different number from all local variables)
+ assert(tdNum < 0);
tdOffs = BAD_TEMP_OFFSET;
#endif // DEBUG
if (tdNum != _tdNum)
@@ -6144,8 +6144,8 @@ class Compiler

unsigned csdHashKey; // the orginal hashkey

- unsigned csdIndex; // 1..optCSECandidateCount
- char csdLiveAcrossCall; // 0 or 1
+ unsigned csdIndex; // 1..optCSECandidateCount
+ bool csdLiveAcrossCall;

unsigned short csdDefCount; // definition count
unsigned short csdUseCount; // use count (excluding the implicit uses at defs)
@@ -6242,7 +6242,7 @@ class Compiler
unsigned optCSECandidateCount; // Count of CSE's candidates, reset for Lexical and ValNum CSE's
unsigned optCSEstart; // The first local variable number that is a CSE
unsigned optCSEcount; // The total count of CSE's introduced.
- unsigned optCSEweight; // The weight of the current block when we are doing PerformCS
+ unsigned optCSEweight; // The weight of the current block when we are doing PerformCSE

bool optIsCSEcandidate(GenTree* tree);

@@ -6301,8 +6301,8 @@ class Compiler
INDEBUG(void optDumpCopyPropStack(LclNumToGenTreePtrStack* curSsaName));

/**************************************************************************
- * Early value propagation
- *************************************************************************/
+ * Early value propagation
+ *************************************************************************/
struct SSAName
{
unsigned m_lvNum;
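
The `TempDsc` assertion in the first hunk relies on temps having negative numbers so that they never collide with local variable numbers. A minimal sketch of that numbering convention follows; `SlotRef` and `TempNumber` are hypothetical helpers that do not appear in the JIT sources, and the exact mapping is an assumption made for illustration.

```cpp
// Hypothetical reference to a stack slot: locals use 0, 1, 2, ... and
// temps use -1, -2, -3, ... so one signed number identifies either kind.
struct SlotRef
{
    int num;

    bool IsTemp() const
    {
        return num < 0;
    }

    bool IsLocal() const
    {
        return num >= 0;
    }
};

// Map the k-th temp (k = 0, 1, ...) to a negative number (assumed scheme).
int TempNumber(unsigned k)
{
    return -static_cast<int>(k) - 1;
}
```

With this scheme a check such as `assert(tdNum < 0)` is enough to confirm that a number refers to a temp rather than a local.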