Cse tuning (#1463)

* cse-tuning branch
  1. Changed csdLiveAcrossCall to a bool (zero-diff)

* 2. Added the remaining zero-diff changes from my old coreclr branch (zero-diff)

* 3. Incoming stack arguments don't use any local stack frame slots
  x64 5 improvements 0 regressions, Total PerfScore diff: -10.72
  x86 16 improvements 5 regressions, Total PerfScore diff: -72.95

* 4. Locals with no references aren't enregistered (zero-diffs)

* 5. Fix handling of long integer types: they use only one register, not two.
  x64 250 improvements 51 regressions, Total PerfScore diff: -459.09
  arm64 162 improvements 16 regressions, Total PerfScore diff: -1712.52

* 6. Adjust computation of moderateRefCnt and aggressiveRefCnt values
  x64 280 improvements 81 regressions, Total PerfScore diff: -274.78
  arm64 264 improvements 61 regressions, Total PerfScore diff: -911.00
  x86 87 improvements 42 regressions, Total PerfScore diff: -123.46
  arm32 195 improvements 81 regressions, Total PerfScore diff: -239.10

* 7. slotCount refactor (zero-diffs)

* 8. Enable the use of the live across call information
  x64 125 improvements 136 regressions, Total PerfScore diff: +427.43
  arm64 83 improvements 153 regressions, Total PerfScore diff: +260.68
  x86 218 improvements 193 regressions, Total PerfScore diff: +199.81
  arm32 145 improvements 181 regressions, Total PerfScore diff: -33283.10
  arm32 method with improvement:
  -33864.40 (-2.87% of base) : System.Private.CoreLib.dasm - TypeBuilder:CreateTypeNoLock():TypeInfo:this (2 methods)

* 9. Adjust the cse_use_costs for the LiveAcrossCall case
  x64 61 improvements 61 regressions, Total PerfScore diff: -189.03
  arm64 90 improvements 49 regressions, Total PerfScore diff: -463.42
  x86 88 improvements 80 regressions, Total PerfScore diff: -238.61
  arm32 101 improvements 63 regressions, Total PerfScore diff: -259.50

* 10. If this CSE is live across a call then we may need to spill an additional caller save register
  x64 73 improvements 45 regressions, Total PerfScore diff: -279.88
  arm64 45 improvements 76 regressions, Total PerfScore diff: -90.94
  x86 13 improvements 14 regressions, Total PerfScore diff: -21.55
  arm32 45 improvements 33 regressions, Total PerfScore diff: -78.60

* 11. (x64 only) floating point loads/stores encode larger, so adjust the cse def/use cost for SMALL_CODE
  No diffs in System.Private.CoreLib

* 12. Remove extra cse def/use costs for methods that have a largeFrame or a hugeFrame
  x64 199 improvements 50 regressions, Total PerfScore diff: -2061.36
  arm64 11 improvements 3 regressions, Total PerfScore diff: -46.84
  x86 136 improvements 80 regressions, Total PerfScore diff: -1795.00
  arm32 50 improvements 35 regressions, Total PerfScore diff: -132.30

* clang-format

* Code review feedback
  Removed increment of enregCount on _TARGET_X86_ when we have compLongUsed:
  Framework diffs
  Total PerfScoreUnits of diff: -654.75 (-0.00% of base) diff is an improvement.
  79 total methods with Perf Score differences (55 improved, 24 regressed), 146432 unchanged.

  Fixed setting of largeFrame/hugeFrame for ARM64
  Zero framework diffs.

* run jit-format

* correct some wording in comments

* reword a comment
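
Items 8 through 10 of the commit message describe charging a CSE more when it is live across a call. The following is only a rough sketch of that idea, not the JIT's actual heuristic: the field names csdLiveAcrossCall, csdDefCount, and csdUseCount come from the diff, while the struct CseCandidate, the field exprCost, the helper functions, and every cost constant are hypothetical and chosen purely for illustration.

```cpp
#include <cassert>

// Hypothetical, simplified stand-in for the JIT's CSE descriptor. Only the
// csd* field names are taken from the diff; exprCost is an assumption.
struct CseCandidate
{
    bool           csdLiveAcrossCall; // true if the CSE temp is live across a call
    unsigned short csdDefCount;       // definition count
    unsigned short csdUseCount;       // use count (excluding the implicit uses at defs)
    unsigned       exprCost;          // assumed cost of evaluating the expression once
};

// Illustrative costs only; these are NOT the values used by the JIT.
const unsigned kDefCost         = 1; // storing into the CSE temp at a def
const unsigned kUseCost         = 1; // reading the CSE temp at a use
const unsigned kAcrossCallExtra = 2; // extra per-use cost when live across a call
const unsigned kSpillCost       = 2; // one-time cost of spilling an additional register

// Cost when no CSE is performed: every def and use re-evaluates the expression.
unsigned costWithoutCse(const CseCandidate& c)
{
    return (c.csdDefCount + c.csdUseCount) * c.exprCost;
}

// Cost when the CSE is performed: defs evaluate and store, uses read the temp.
// A candidate that is live across a call pays more per use plus a spill charge.
unsigned costWithCse(const CseCandidate& c)
{
    unsigned useCost = kUseCost;
    unsigned extra   = 0;
    if (c.csdLiveAcrossCall)
    {
        useCost += kAcrossCallExtra;
        extra = kSpillCost;
    }
    return c.csdDefCount * (c.exprCost + kDefCost) + c.csdUseCount * useCost + extra;
}

// The CSE is considered worthwhile only if it lowers the estimated cost.
bool isProfitable(const CseCandidate& c)
{
    return costWithCse(c) < costWithoutCse(c);
}
```

Under these made-up constants, a candidate with one def, two uses, and an expression cost of 3 is accepted when it is not live across a call and rejected when it is, which is the kind of decision flip the live-across-call information is meant to enable.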
dotnet · Jan 15, 2020 · 8b59b12
1 parent e92e2e6
Showing 2 changed files with 337 additions and 105 deletions.
diff --git a/src/coreclr/src/jit/compiler.h b/src/coreclr/src/jit/compiler.h
@@ -1001,8 +1001,8 @@ class TempDsc
     TempDsc(int _tdNum, unsigned _tdSize, var_types _tdType) : tdNum(_tdNum), tdSize((BYTE)_tdSize), tdType(_tdType)
     {
 #ifdef DEBUG
-        assert(tdNum <
-               0); // temps must have a negative number (so they have a different number from all local variables)
+        // temps must have a negative number (so they have a different number from all local variables)
+        assert(tdNum < 0);
         tdOffs = BAD_TEMP_OFFSET;
 #endif // DEBUG
         if (tdNum != _tdNum)
@@ -6144,8 +6144,8 @@ class Compiler
 
         unsigned csdHashKey; // the orginal hashkey
 
-        unsigned csdIndex;          // 1..optCSECandidateCount
-        char     csdLiveAcrossCall; // 0 or 1
+        unsigned csdIndex; // 1..optCSECandidateCount
+        bool     csdLiveAcrossCall;
 
         unsigned short csdDefCount; // definition   count
         unsigned short csdUseCount; // use          count  (excluding the implicit uses at defs)
@@ -6242,7 +6242,7 @@ class Compiler
     unsigned optCSECandidateCount; // Count of CSE's candidates, reset for Lexical and ValNum CSE's
     unsigned optCSEstart;          // The first local variable number that is a CSE
     unsigned optCSEcount;          // The total count of CSE's introduced.
-    unsigned optCSEweight;         // The weight of the current block when we are doing PerformCS
+    unsigned optCSEweight;         // The weight of the current block when we are doing PerformCSE
 
     bool optIsCSEcandidate(GenTree* tree);
 
@@ -6301,8 +6301,8 @@ class Compiler
     INDEBUG(void optDumpCopyPropStack(LclNumToGenTreePtrStack* curSsaName));
 
     /**************************************************************************
-    *               Early value propagation
-    *************************************************************************/
+     *               Early value propagation
+     *************************************************************************/
     struct SSAName
     {
         unsigned m_lvNum;