Add BOLT Makefile #54107

Zentrik · 2024-04-16T22:19:49Z

This uses LLVM's BOLT to optimize libLLVM, libjulia-internal and libjulia-codegen.

This improves the allinference benchmarks by about 10%, largely due to the optimization of libjulia-internal.
The example in issue #45395 which stresses LLVM significantly more also sees a ~10% improvement.
We see a 20% improvement on

@time for i in 1:100000000
    string(i)
end

When building corecompiler.ji:
BOLT gives about a 16% improvement
PGO+LTO gives about a 21% improvement
PGO+LTO+BOLT gives about a 23% improvement

This only requires a single build of LLVM, compared to the 2 needed for PGO, and theoretically none if we change the binary builder script (i.e. we build with relocations and -fno-reorder-blocks-and-partition, then use BOLT to get binaries with no relocations and reordered blocks, and ship both binaries?). Also, this can theoretically improve the performance of a PGO+LTO build by a couple of percent.
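As a rough illustration of the build requirements mentioned above: BOLT's documentation asks for binaries linked with relocations (--emit-relocs) and, for GCC, compiled with -fno-reorder-blocks-and-partition. The helper below is a hypothetical sketch, not taken from this PR's Makefile; the function name and the CFLAGS/LDFLAGS output format are assumptions made for illustration.

```shell
# Hypothetical helper (not from the PR's Makefile): print the extra flags
# a BOLT-friendly build of a shared library would need for a given compiler.
bolt_friendly_flags() {
    compiler="${1:-gcc}"
    case "$compiler" in
        # The flag targets GCC's hot/cold block splitting; the TODO list in
        # this PR says to drop it when building with clang.
        *clang*) extra_cflags="" ;;
        *)       extra_cflags="-fno-reorder-blocks-and-partition" ;;
    esac
    printf 'CFLAGS=%s\n' "$extra_cflags"
    # Keep relocations in the linked binary so BOLT can move code around.
    printf 'LDFLAGS=%s\n' "-Wl,--emit-relocs"
}

bolt_friendly_flags gcc
```

The real Makefile may pass these flags through different variables; the point is only which flags matter and why.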

The only reproducible test problem I see is that the BOLT, PGO+LTO and PGO+LTO+BOLT builds all cause readelf to emit warnings as part of the osutils tests.

readelf: Warning: Unrecognised form: 0x22
readelf: Warning: DIE has locviews without loclist
readelf: Warning: Unrecognised form: 0x23
readelf: Warning: DIE at offset 0x227399 refers to abbreviation number 14754 which does not exist
readelf: Warning: Bogus end-of-siblings marker detected at offset 212aa9 in .debug_info section
readelf: Warning: Bogus end-of-siblings marker detected at offset 212ab0 in .debug_info section
readelf: Warning: Further warnings about bogus end-of-sibling markers suppressed

The unrecognised form warnings seem to be a bug in binutils, https://sourceware.org/bugzilla/show_bug.cgi?id=28981.
I believe the "DIE at offset" warning was fixed in binutils 2.36, https://sourceware.org/bugzilla/show_bug.cgi?id=26808, but ld -v says I have 2.38.
I assume these are all benign. I also don't see them on CI here https://buildkite.com/julialang/julia-buildkite/builds/1507#018f00e7-0737-4a42-bcd9-d4061dc8c93e so it could just be a local issue.

TODO:

Add PGO+LTO+BOLT makefile
Try and get libjulia-codegen optimised
Address the TODOs in the makefile
Run a full test suite on the resulting binary
Try to minimise what gets built with -fno-reorder-blocks-and-partition as we don't run BOLT on all binaries.
Disable -fno-reorder-blocks-and-partition when using clang.
Fix warnings like "warning: address range table at offset 0x0 has a premature terminator entry at offset 0x10"; filed as llvm/llvm-project#89508 ([BOLT] Optimized binary has premature terminator entry warning).

Zentrik · 2024-04-20T13:27:42Z

The corecompiler.ji build seems representative so I'll use that as an overall benchmark:

BOLT gives about a 13% improvement
PGO+LTO gives about a 21% improvement
PGO+LTO+BOLT gives about a 19% improvement (I'll look into why this is slower than PGO+LTO only for corecompiler.ji, probably because we don't profile on it)

Here's a litany of benchmarks:
The script from #45395 (JULIA_LLVM_ARGS=-time-passes optimized.build/julia ../bolt/script-45395.jl 2>&1 | awk -F' ' '($1+0 > 0.1 || !($1 ~ /^[0-9.]*$/))')
On 4a2c593 (did not build LLVM from source):

  8.591355 seconds (6.38 M allocations: 283.723 MiB, 1.46% gc time, 100.00% compilation time)
===-------------------------------------------------------------------------===
                          Pass execution timing report
===-------------------------------------------------------------------------===
  Total Execution Time: 4.8351 seconds (4.8153 wall clock)
   ---User Time---   --System Time--   --User+System--   ---Wall Time---  --- Name ---
   1.5925 ( 33.7%)   0.0047 (  4.1%)   1.5972 ( 33.0%)   1.5946 ( 33.1%)  InstCombinePass
   1.1303 ( 23.9%)   0.0063 (  5.5%)   1.1366 ( 23.5%)   1.1351 ( 23.6%)  GVNPass
   0.5453 ( 11.6%)   0.0241 ( 20.9%)   0.5694 ( 11.8%)   0.5686 ( 11.8%)  IndVarSimplifyPass
   0.4985 ( 10.6%)   0.0185 ( 16.1%)   0.5170 ( 10.7%)   0.5162 ( 10.7%)  LoopFullUnrollPass
   0.1219 (  2.6%)   0.0008 (  0.7%)   0.1228 (  2.5%)   0.1222 (  2.5%)  LateLowerGCPass
   4.7199 (100.0%)   0.1152 (100.0%)   4.8351 (100.0%)   4.8153 (100.0%)  Total
===-------------------------------------------------------------------------===
                        Analysis execution timing report
===-------------------------------------------------------------------------===
  Total Execution Time: 0.0798 seconds (0.0724 wall clock)
   ---User Time---   --System Time--   --User+System--   ---Wall Time---  --- Name ---
===-------------------------------------------------------------------------===
                          Pass execution timing report
===-------------------------------------------------------------------------===
  Total Execution Time: 2.7290 seconds (2.7247 wall clock)
   ---User Time---   --System Time--   --User+System--   ---Wall Time---  --- Name ---
   0.8376 ( 36.5%)   0.0996 ( 22.8%)   0.9373 ( 34.3%)   0.9373 ( 34.4%)  Greedy Register Allocator #16
   0.1348 (  5.9%)   0.2875 ( 65.9%)   0.4223 ( 15.5%)   0.4223 ( 15.5%)  X86 Assembly Printer #7
   0.3971 ( 17.3%)   0.0002 (  0.1%)   0.3974 ( 14.6%)   0.3974 ( 14.6%)  Live Register Matrix
   0.1145 (  5.0%)   0.0000 (  0.0%)   0.1145 (  4.2%)   0.1145 (  4.2%)  Spill Code Placement Analysis #2
   2.2925 (100.0%)   0.4365 (100.0%)   2.7290 (100.0%)   2.7247 (100.0%)  Total

With BOLT:

  7.723599 seconds (6.38 M allocations: 283.722 MiB, 1.59% gc time, 100.00% compilation time)
===-------------------------------------------------------------------------===
                          Pass execution timing report
===-------------------------------------------------------------------------===
  Total Execution Time: 4.2703 seconds (4.2540 wall clock)
   ---User Time---   --System Time--   --User+System--   ---Wall Time---  --- Name ---
   1.3162 ( 31.6%)   0.0055 (  5.0%)   1.3218 ( 31.0%)   1.3199 ( 31.0%)  InstCombinePass
   0.9894 ( 23.8%)   0.0044 (  4.0%)   0.9938 ( 23.3%)   0.9925 ( 23.3%)  GVNPass
   0.5282 ( 12.7%)   0.0147 ( 13.2%)   0.5429 ( 12.7%)   0.5419 ( 12.7%)  IndVarSimplifyPass
   0.4815 ( 11.6%)   0.0129 ( 11.6%)   0.4944 ( 11.6%)   0.4935 ( 11.6%)  LoopFullUnrollPass
   0.1207 (  2.9%)   0.0009 (  0.8%)   0.1216 (  2.8%)   0.1212 (  2.8%)  LateLowerGCPass
   4.1593 (100.0%)   0.1110 (100.0%)   4.2703 (100.0%)   4.2540 (100.0%)  Total
===-------------------------------------------------------------------------===
                        Analysis execution timing report
===-------------------------------------------------------------------------===
  Total Execution Time: 0.0738 seconds (0.0676 wall clock)
   ---User Time---   --System Time--   --User+System--   ---Wall Time---  --- Name ---
===-------------------------------------------------------------------------===
                          Pass execution timing report
===-------------------------------------------------------------------------===
  Total Execution Time: 2.5232 seconds (2.5200 wall clock)
   ---User Time---   --System Time--   --User+System--   ---Wall Time---  --- Name ---
   0.7793 ( 37.5%)   0.1080 ( 24.3%)   0.8873 ( 35.2%)   0.8874 ( 35.2%)  Greedy Register Allocator #14
   0.1240 (  6.0%)   0.2841 ( 63.9%)   0.4081 ( 16.2%)   0.4082 ( 16.2%)  X86 Assembly Printer #7
   0.3392 ( 16.3%)   0.0000 (  0.0%)   0.3392 ( 13.4%)   0.3392 ( 13.5%)  Interleaved Access Pass #8
   0.1030 (  5.0%)   0.0000 (  0.0%)   0.1030 (  4.1%)   0.1030 (  4.1%)  Spill Code Placement Analysis #3
   2.0787 (100.0%)   0.4445 (100.0%)   2.5232 (100.0%)   2.5200 (100.0%)  Total

With PGO+LTO:

  7.361828 seconds (6.38 M allocations: 283.771 MiB, 1.59% gc time, 100.00% compilation time)
===-------------------------------------------------------------------------===
                          Pass execution timing report
===-------------------------------------------------------------------------===
  Total Execution Time: 4.0312 seconds (4.0077 wall clock)
   ---User Time---   --System Time--   --User+System--   ---Wall Time---  --- Name ---
   1.3406 ( 34.3%)   0.0044 (  3.5%)   1.3451 ( 33.4%)   1.3417 ( 33.5%)  InstCombinePass
   0.7992 ( 20.5%)   0.0062 (  4.9%)   0.8054 ( 20.0%)   0.8036 ( 20.1%)  GVNPass
   0.5063 ( 13.0%)   0.0230 ( 18.4%)   0.5293 ( 13.1%)   0.5285 ( 13.2%)  IndVarSimplifyPass
   0.4711 ( 12.1%)   0.0264 ( 21.1%)   0.4975 ( 12.3%)   0.4964 ( 12.4%)  LoopFullUnrollPass
   3.9063 (100.0%)   0.1249 (100.0%)   4.0312 (100.0%)   4.0077 (100.0%)  Total
===-------------------------------------------------------------------------===
                        Analysis execution timing report
===-------------------------------------------------------------------------===
  Total Execution Time: 0.0797 seconds (0.0698 wall clock)
   ---User Time---   --System Time--   --User+System--   ---Wall Time---  --- Name ---
===-------------------------------------------------------------------------===
                          Pass execution timing report
===-------------------------------------------------------------------------===
  Total Execution Time: 2.4598 seconds (2.4316 wall clock)
   ---User Time---   --System Time--   --User+System--   ---Wall Time---  --- Name ---
   0.8006 ( 40.2%)   0.1040 ( 22.2%)   0.9046 ( 36.8%)   0.9048 ( 37.2%)  Greedy Register Allocator #13
   0.1059 (  5.3%)   0.2957 ( 63.0%)   0.4015 ( 16.3%)   0.4016 ( 16.5%)  X86 Assembly Printer #7
   0.2927 ( 14.7%)   0.0121 (  2.6%)   0.3048 ( 12.4%)   0.3048 ( 12.5%)  X86 vzeroupper inserter
   1.9905 (100.0%)   0.4693 (100.0%)   2.4598 (100.0%)   2.4316 (100.0%)  Total

With PGO+LTO+BOLT:

  7.285692 seconds (6.38 M allocations: 283.713 MiB, 1.63% gc time, 100.00% compilation time)
===-------------------------------------------------------------------------===
                          Pass execution timing report
===-------------------------------------------------------------------------===
  Total Execution Time: 3.9592 seconds (3.9360 wall clock)
   ---User Time---   --System Time--   --User+System--   ---Wall Time---  --- Name ---
   1.3179 ( 34.2%)   0.0057 (  5.3%)   1.3235 ( 33.4%)   1.3197 ( 33.5%)  InstCombinePass
   0.7914 ( 20.5%)   0.0034 (  3.2%)   0.7948 ( 20.1%)   0.7922 ( 20.1%)  GVNPass
   0.4990 ( 13.0%)   0.0176 ( 16.4%)   0.5166 ( 13.0%)   0.5158 ( 13.1%)  IndVarSimplifyPass
   0.4642 ( 12.1%)   0.0177 ( 16.5%)   0.4819 ( 12.2%)   0.4810 ( 12.2%)  LoopFullUnrollPass
   3.8521 (100.0%)   0.1072 (100.0%)   3.9592 (100.0%)   3.9360 (100.0%)  Total
===-------------------------------------------------------------------------===
                        Analysis execution timing report
===-------------------------------------------------------------------------===
  Total Execution Time: 0.0794 seconds (0.0715 wall clock)
   ---User Time---   --System Time--   --User+System--   ---Wall Time---  --- Name ---
===-------------------------------------------------------------------------===
                          Pass execution timing report
===-------------------------------------------------------------------------===
  Total Execution Time: 2.4658 seconds (2.4390 wall clock)
   ---User Time---   --System Time--   --User+System--   ---Wall Time---  --- Name ---
   0.8087 ( 40.9%)   0.1042 ( 21.3%)   0.9128 ( 37.0%)   0.9129 ( 37.4%)  Greedy Register Allocator #10
   0.2963 ( 15.0%)   0.0040 (  0.8%)   0.3003 ( 12.2%)   0.3003 ( 12.3%)  X86 DAG->DAG Instruction Selection #4
   1.9771 (100.0%)   0.4887 (100.0%)   2.4658 (100.0%)   2.4390 (100.0%)  Total
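For anyone reproducing the -time-passes reports above, here is a small sketch of what the awk filter from the benchmark command does. The function name is an assumption made for illustration; the awk program is the one quoted in the command.

```shell
# Keep only the informative lines of an LLVM -time-passes report:
# rows whose first field is a time above 0.1 s, plus every line whose first
# field is not purely numeric (banners, column headers, "Total Execution Time").
# Blank lines and rows for passes taking 0.1 s or less are dropped.
filter_time_passes() {
    awk -F' ' '($1+0 > 0.1 || !($1 ~ /^[0-9.]*$/))'
}
```

Usage would look like `JULIA_LLVM_ARGS=-time-passes julia script-45395.jl 2>&1 | filter_time_passes`, which is why the reports only list a handful of passes each.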

Sysimage and pkgimage building (rm usr/lib/julia/corecompiler.ji; make -f sysimage.mk sysimg-release; make -f pkgimage.mk release):

On 4a2c593 (did not build LLVM from source):

Core.Compiler ──── 208.334 seconds
Base  ────────── 51.496164 seconds
FileWatching  ──  7.485314 seconds
Libdl  ─────────  0.003082 seconds
Artifacts  ─────  0.306612 seconds
SHA  ───────────  0.241494 seconds
Sockets  ───────  0.347978 seconds
LinearAlgebra  ─  8.298057 seconds
Random  ────────  0.921556 seconds
Stdlibs total  ─ 17.613382 seconds
Sysimage built. Summary:
Base ────────  51.496164 seconds 74.5104%
Stdlibs ─────  17.613382 seconds 25.485%
Total ───────  69.112693 seconds
Outputting sysimage file...
Output ──────  40.640285 seconds
Precompiling all packages for 2 compilation configurations...
  106 dependency configurations successfully precompiled in 102 seconds

With BOLT:

Core.Compiler ──── 181.249 seconds
Base  ────────── 47.624226 seconds
FileWatching  ──  6.700183 seconds
Libdl  ─────────  0.003140 seconds
Artifacts  ─────  0.302009 seconds
SHA  ───────────  0.255627 seconds
Sockets  ───────  0.335796 seconds
LinearAlgebra  ─  8.249945 seconds
Random  ────────  0.844563 seconds
Stdlibs total  ─ 16.698956 seconds
Sysimage built. Summary:
Base ────────  47.624226 seconds 74.0362%
Stdlibs ─────  16.698956 seconds 25.9601%
Total ───────  64.325567 seconds
Outputting sysimage file...
Output ──────  34.831136 seconds
Precompiling all packages for 2 compilation configurations...
  106 dependency configurations successfully precompiled in 94 seconds

With PGO+LTO:

JULIA contrib/pgo-lto-bolt/pgo-only.build/usr/lib/julia/corecompiler.ji
Core.Compiler ──── 164.565 seconds
Base  ────────── 44.721178 seconds
FileWatching  ──  6.396723 seconds
Libdl  ─────────  0.002857 seconds
Artifacts  ─────  0.288473 seconds
SHA  ───────────  0.221199 seconds
Sockets  ───────  0.345902 seconds
LinearAlgebra  ─  7.628382 seconds
Random  ────────  0.785489 seconds
Stdlibs total  ─ 15.676387 seconds
Sysimage built. Summary:
Base ────────  44.721178 seconds 74.0417%
Stdlibs ─────  15.676387 seconds 25.9543%
Total ───────  60.399965 seconds
Outputting sysimage file...
Output ──────  32.994419 seconds
Precompiling all packages for 2 compilation configurations...
  106 dependency configurations successfully precompiled in 89 seconds

With PGO+LTO+BOLT:

Core.Compiler ──── 168.119 seconds
Base  ────────── 44.547567 seconds
FileWatching  ──  6.360340 seconds
Libdl  ─────────  0.002854 seconds
Artifacts  ─────  0.287256 seconds
SHA  ───────────  0.244686 seconds
Sockets  ───────  0.347385 seconds
LinearAlgebra  ─  7.679211 seconds
Random  ────────  0.802141 seconds
Stdlibs total  ─ 15.730860 seconds
Sysimage built. Summary:
Base ────────  44.547567 seconds 73.9004%
Stdlibs ─────  15.730860 seconds 26.0961%
Total ───────  60.280557 seconds
Outputting sysimage file...
Output ──────  31.952144 seconds
Precompiling all packages for 2 compilation configurations...
  106 dependency configurations successfully precompiled in 87 seconds

@time for i in 1:100000000
    string(i)
end

On 4a2c593 (did not build LLVM from source): 5.45s
With BOLT: 5.1s
With PGO+LTO: 5.1s
With PGO+LTO+BOLT: 5.15s

For this string benchmark, the PGO builds were noisy enough that I think all the optimized builds were as fast as each other.

Zentrik · 2024-04-20T15:12:17Z

Profiling the sysimg and pkgimg build worked.

  7.253939 seconds (6.38 M allocations: 283.772 MiB, 1.66% gc time, 100.00% compilation time)
===-------------------------------------------------------------------------===
                          Pass execution timing report
===-------------------------------------------------------------------------===
  Total Execution Time: 3.9429 seconds (3.9183 wall clock)
   ---User Time---   --System Time--   --User+System--   ---Wall Time---  --- Name ---
   1.3121 ( 34.3%)   0.0069 (  6.0%)   1.3189 ( 33.5%)   1.3150 ( 33.6%)  InstCombinePass
   0.7904 ( 20.6%)   0.0070 (  6.1%)   0.7974 ( 20.2%)   0.7947 ( 20.3%)  GVNPass
   0.5162 ( 13.5%)   0.0091 (  8.0%)   0.5254 ( 13.3%)   0.5244 ( 13.4%)  IndVarSimplifyPass
   0.4711 ( 12.3%)   0.0089 (  7.7%)   0.4800 ( 12.2%)   0.4791 ( 12.2%)  LoopFullUnrollPass
   3.8281 (100.0%)   0.1148 (100.0%)   3.9429 (100.0%)   3.9183 (100.0%)  Total
===-------------------------------------------------------------------------===
                        Analysis execution timing report
===-------------------------------------------------------------------------===
  Total Execution Time: 0.0767 seconds (0.0679 wall clock)
   ---User Time---   --System Time--   --User+System--   ---Wall Time---  --- Name ---
===-------------------------------------------------------------------------===
                          Pass execution timing report
===-------------------------------------------------------------------------===
  Total Execution Time: 2.4435 seconds (2.4167 wall clock)
   ---User Time---   --System Time--   --User+System--   ---Wall Time---  --- Name ---
   0.7997 ( 40.3%)   0.1040 ( 22.5%)   0.9037 ( 37.0%)   0.9038 ( 37.4%)  Greedy Register Allocator #14
   0.1187 (  6.0%)   0.2815 ( 61.1%)   0.4002 ( 16.4%)   0.4002 ( 16.6%)  X86 Assembly Printer #7
   0.2986 ( 15.1%)   0.0058 (  1.3%)   0.3044 ( 12.5%)   0.3038 ( 12.6%)  Live Variable Analysis
   1.9824 (100.0%)   0.4610 (100.0%)   2.4435 (100.0%)   2.4167 (100.0%)  Total

Core.Compiler ──── 160.462 seconds
Base  ────────── 43.983272 seconds
FileWatching  ──  6.255600 seconds
Libdl  ─────────  0.002862 seconds
Artifacts  ─────  0.287458 seconds
SHA  ───────────  0.224349 seconds
Sockets  ───────  0.313991 seconds
LinearAlgebra  ─  7.724898 seconds
Random  ────────  0.825273 seconds
Stdlibs total  ─ 15.641330 seconds
Sysimage built. Summary:
Base ────────  43.983272 seconds 73.7643%
Stdlibs ─────  15.641330 seconds 26.2321%
Total ───────  59.626791 seconds
Outputting sysimage file...
Output ──────  32.006934 seconds
Precompiling all packages for 2 compilation configurations...
  106 dependency configurations successfully precompiled in 86 seconds

@time for i in 1:100000000
    string(i)
end

This now seems to take 4.6s, though I'm very sceptical of this result.

I was concerned that, as corecompiler.ji is built with -O0, profiling on it might not be representative of normal Julia code, but compiling corecompiler.ji without -g0 -O0 took 162s.
The old PGO+LTO+BOLT build took 171s, so it does seem a lot better to profile on sysimg building as well.
Not sure why we use -O0 in the first place.

Zentrik · 2024-04-20T16:25:57Z

Here's the updated benchmark for BOLT, now profiling the sysimg build:

  7.691131 seconds (6.38 M allocations: 283.722 MiB, 1.59% gc time, 100.00% compilation time)
===-------------------------------------------------------------------------===
                          Pass execution timing report
===-------------------------------------------------------------------------===
  Total Execution Time: 4.2602 seconds (4.2444 wall clock)
   ---User Time---   --System Time--   --User+System--   ---Wall Time---  --- Name ---
   1.3148 ( 31.9%)   0.0057 (  4.3%)   1.3205 ( 31.0%)   1.3187 ( 31.1%)  InstCombinePass
   0.9882 ( 23.9%)   0.0030 (  2.3%)   0.9912 ( 23.3%)   0.9899 ( 23.3%)  GVNPass
   0.5326 ( 12.9%)   0.0157 ( 11.7%)   0.5483 ( 12.9%)   0.5475 ( 12.9%)  IndVarSimplifyPass
   0.4764 ( 11.5%)   0.0190 ( 14.2%)   0.4954 ( 11.6%)   0.4945 ( 11.7%)  LoopFullUnrollPass
   0.1115 (  2.7%)   0.0008 (  0.6%)   0.1124 (  2.6%)   0.1119 (  2.6%)  LateLowerGCPass
   4.1261 (100.0%)   0.1341 (100.0%)   4.2602 (100.0%)   4.2444 (100.0%)  Total
===-------------------------------------------------------------------------===
                        Analysis execution timing report
===-------------------------------------------------------------------------===
  Total Execution Time: 0.0750 seconds (0.0686 wall clock)
   ---User Time---   --System Time--   --User+System--   ---Wall Time---  --- Name ---
===-------------------------------------------------------------------------===
                          Pass execution timing report
===-------------------------------------------------------------------------===
  Total Execution Time: 2.5078 seconds (2.5059 wall clock)
   ---User Time---   --System Time--   --User+System--   ---Wall Time---  --- Name ---
   0.7985 ( 38.3%)   0.0878 ( 20.6%)   0.8863 ( 35.3%)   0.8864 ( 35.4%)  Greedy Register Allocator #14
   0.1112 (  5.3%)   0.2967 ( 69.7%)   0.4078 ( 16.3%)   0.4079 ( 16.3%)  X86 Assembly Printer #9
   0.3394 ( 16.3%)   0.0000 (  0.0%)   0.3395 ( 13.5%)   0.3395 ( 13.5%)  Machine Copy Propagation Pass #5
   0.1036 (  5.0%)   0.0000 (  0.0%)   0.1036 (  4.1%)   0.1036 (  4.1%)  Spill Code Placement Analysis #2
   2.0822 (100.0%)   0.4256 (100.0%)   2.5078 (100.0%)   2.5059 (100.0%)  Total

Core.Compiler ──── 174.924 seconds
Base  ────────── 48.390739 seconds
FileWatching  ──  6.645787 seconds
Libdl  ─────────  0.003114 seconds
Artifacts  ─────  0.297960 seconds
SHA  ───────────  0.239149 seconds
Sockets  ───────  0.334072 seconds
LinearAlgebra  ─  8.276301 seconds
Random  ────────  0.848730 seconds
Stdlibs total  ─ 16.652748 seconds
Sysimage built. Summary:
Base ────────  48.390739 seconds 74.3946%
Stdlibs ─────  16.652748 seconds 25.6015%
Total ───────  65.046012 seconds
Outputting sysimage file...
Output ──────  34.413725 seconds
Precompiling all packages for 2 compilation configurations...
  106 dependency configurations successfully precompiled in 92 seconds

The string benchmark takes 4.85s

Zentrik · 2024-04-20T16:26:01Z

So the new corecompiler.ji results:
BOLT gives about a 16% improvement
PGO+LTO gives about a 21% improvement
PGO+LTO+BOLT gives about a 23% improvement
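The percentages above can be reproduced from the Core.Compiler timings posted earlier in the thread. This is a minimal sketch, assuming the 208.334 s run on 4a2c593 is the baseline and that the 160.462 s figure belongs to the PGO+LTO+BOLT build; the function name is made up for illustration.

```shell
# Percentage reduction in build time relative to a baseline, to one decimal place.
pct_improvement() {
    awk -v base="$1" -v new="$2" 'BEGIN { printf "%.1f\n", (base - new) / base * 100 }'
}

pct_improvement 208.334 174.924   # BOLT          -> 16.0
pct_improvement 208.334 164.565   # PGO+LTO       -> 21.0
pct_improvement 208.334 160.462   # PGO+LTO+BOLT  -> 23.0
```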

tecosaur · 2024-04-21T02:43:32Z

Oh this looks very cool, thanks for working on this @Zentrik!

KristofferC · 2024-04-21T14:02:51Z

Just a side note, but using the tests of LoopVectorization seems non-ideal since it is deprecated, has cases where it asserts or segfaults, and is quite special, e.g. in that it heavily uses llvmcall.

gitboy16 · 2024-04-23T09:52:23Z

Thank you @Zentrik for working on this.
1/ Correct me if I am wrong but this is just for Linux and macOS, right? If yes, what's the plan for Windows?
2/ For Linux/MacOS, when should I run the steps in the README.md under folder bold/pgo-lto-bolt? Before the build?

Zentrik · 2024-04-23T10:31:20Z

There is no plan for Windows: BOLT (the tool we are using to optimize the Julia and LLVM libraries) does not support Windows, and there is no plan for it to. Looking into it, I don't think it supports macOS either, so I'll remove that from the README.
The steps in the README.md build Julia; you do not need to run any further build commands after them.
I.e. in the contrib/pgo-lto-bolt folder, run

make stage1
make stage2
make copy_originals
make bolt_instrument
make finish_stage2
make merge_data
make bolt

in your terminal. Then you can find the optimized binary in the contrib/pgo-lto-bolt/optimized.build directory.

Alternatively, you can download binaries for a PGO+LTO build for x86_64 linux gnu here.
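If you would rather not type the make targets above one by one, they can be wrapped in a small shell function that stops at the first failing step. This is only a sketch: the function name and the MAKE_CMD override are hypothetical and not part of the README; the target names are the ones listed above.

```shell
# Run the README's make targets in order, stopping at the first failure.
# MAKE_CMD is a hypothetical override so the sequence can be dry-run,
# e.g. MAKE_CMD=echo just prints the target names.
run_pgo_lto_bolt() {
    for target in stage1 stage2 copy_originals bolt_instrument finish_stage2 merge_data bolt; do
        ${MAKE_CMD:-make} "$target" || { echo "step '$target' failed" >&2; return 1; }
    done
}
```

You would call it from the contrib/pgo-lto-bolt folder, e.g. `cd contrib/pgo-lto-bolt && run_pgo_lto_bolt`.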

giordano · 2024-07-24T23:37:02Z

@Zentrik do you plan to make any other changes? Is this ready for review?

Zentrik · 2024-07-25T06:33:01Z

> @Zentrik do you plan to make any other changes? Is this ready for review?

This is ready

giordano · 2024-07-25T23:54:33Z

This seems to break x86_64 windows build

BaseBenchmarks.SUITE["inference"]["allinference"]["Base.init_stdio(::Ptr{Cvoid})"] takes about 3.1s without optimizing libjulia-internal.so and on master. With optimized libjulia-internal.so, it takes about 2.9s. I suspect InferenceBaseBenchmarks don't spend much time in LLVM so optimizing it doesn't help much. JULIA_LLVM_ARGS=-time-passes julia script-45395.jl spends about 4.3s in LLVM passes with optimized libLLVM (optimizing libjulia-internal has no effect) whilst taking about 4.9s on master. Total time (~7.9s and 8.7s on master) isn't affected by the optimization of libjulia-internal, which makes sense as we spend most time in LLVM passes. This reverts commit 0bb57ec.

Zentrik · 2024-07-26T08:14:32Z

Seems to have resolved itself.

KristofferC · 2024-07-26T13:54:54Z

Probably good with a NEWS entry?

giordano · 2024-07-26T14:01:44Z

We didn't have it for #45641 either, I guess mainly because these builds are only for advanced users for the time being. That said, a NEWS entry wouldn't be bad (and we'd still be in time to add the entry for #45641 in the v1.11 release notes).

KristofferC · 2024-07-26T14:22:16Z

Even if it is only for advanced users it is in my opinion good to have some kind of external reference to it (NEWS + (dev)docs).

Like, if I want to test this now I don't really know where to start.

Zentrik closed this Apr 16, 2024

Zentrik reopened this Apr 19, 2024

Zentrik force-pushed the bolt branch from d440da2 to 8ac5b97 Compare April 19, 2024 12:56

Zentrik closed this Apr 20, 2024

Zentrik reopened this Apr 20, 2024

Zentrik marked this pull request as ready for review April 20, 2024 17:59

Zentrik added the performance (Must go faster) and building (Build system, or building Julia or its dependencies) labels Apr 20, 2024

Zentrik force-pushed the bolt branch from c4916e4 to fdc3a98 Compare April 21, 2024 17:58

Zentrik added 10 commits July 26, 2024 08:03

Initial Bolt

c4ce68f

Got BOLT working on all specified .so's some of the time

1ccb1c2

Don't optimize sys.so

aea1b1d

Seems to prevent rebuilding pkgimages which is necessary as o/w segfaults. Maybe segfault is due to optimizing sys.so?

Experiment with bolt a bit

844b590

Hacky way to only use bolt specific flags with .so we will instrument

5d71322

Relocations bloat binary size, it doesn't get loaded into memory but e.g. libjulia-codegen gains ~60mb. Ofc can be stripped but that's annoying.

Fixup previous commit

d478312

Only BOLT libLLVM

a9f524f

TODO: remove BOLTLDFLAGS and BOLTCXXFLAGS, we can just add them straight to llvm's flags once I copy over clang detection.

Remove rebuild pkgimage step

a2a7585

This should hopefully be unnecessary now that we don't optimize sys.so.

Clean up

3e34a36

Zentrik added 17 commits July 26, 2024 08:03

Manually rebuild pkgimage, fixup profiling message and optimize libju…

6eb0209

…lia-codegen This prevents it from rebuilding other stuff like libjulia-codegen which depends on libjulia-internal and so libjulia-codegen loses it's instrumentation.

Fix message

feee71b

Add trailing new line

7c18d0f

Only use -fno-reorder-blocks-and-partition for binaries that will g…

e259150

…et BOLTed Now that we pass `-fno-reorder-blocks-and-partition` to libjulia-codegen it seems BOLTing libjulia-internal no longer segfaults.

Fix previous commit potentionally

984c714

For one thing BOLT found split functions in libjulia-internal and codegen which could cause problems. Also it seemed to remove BOLT's performance improvement, given `jl_type_infer` was split maybe BOLT skipped it and other important functions.

Fix nit

08fd8f1

Clean up documentation a bit

6d6a83b

Add PGO+LTO+BOLT Makefile

396e2e0

I didn't try to reuse the bolt or pgo-lto Makefiles by including them as that seemed difficult and would probably break on any non-trivial change to either.

Profile sysimg build as well

d8f8694

Fix premature terminator warning

d33de97

Was triggering in ssair and codegen tests, also doing `./julia s` where `s` is not a file.

Fix whitespace

9018311

Remove reference to LoopVectorization

c8fbf3b

[skip-ci]

Remove claim of macos support and clarify Readme

37a64ea

BOLT only supports ELF binaries. [no-ci]

Remove lines that should have been deleted from Readme

4685de3

[no-ci]

Delete checksum

cc176b6

Fix typo

dc59798

Delete checksum

a94e146

Zentrik force-pushed the bolt branch from fae74b4 to a94e146 Compare July 26, 2024 07:03

giordano added the merge me (PR is reviewed. Merge when all tests are passing) label Jul 26, 2024

giordano merged commit 1dee000 into JuliaLang:master Jul 26, 2024
6 of 8 checks passed

giordano removed the merge me (PR is reviewed. Merge when all tests are passing) label Jul 26, 2024

giordano mentioned this pull request Jul 28, 2024

Add PGO/LTO/BOLT Makefiles to NEWS.md and HISTORY.md #55282

Merged

giordano added a commit that referenced this pull request Jul 29, 2024

Add PGO/LTO/BOLT Makefiles to NEWS.md and HISTORY.md (#55282)

9be0976

Ref: #54107 (comment). If accepted, I'll add the NEWS.md entry for PGO/LTO in the release-1.11 branch too.

giordano mentioned this pull request Aug 2, 2024

Backports release 1.11 #55344

Merged

68 tasks

lazarusA pushed a commit to lazarusA/julia that referenced this pull request Aug 17, 2024

Add PGO/LTO/BOLT Makefiles to NEWS.md and HISTORY.md (JuliaLang#55282)

cac6a6b

Ref: JuliaLang#54107 (comment). If accepted, I'll add the NEWS.md entry for PGO/LTO in the release-1.11 branch too.
