Add BOLT Makefile #54107

Merged
merged 44 commits into from
Jul 26, 2024

Zentrik
Member

@Zentrik Zentrik commented Apr 16, 2024

This uses LLVM's BOLT to optimize libLLVM, libjulia-internal and libjulia-codegen.

This improves the allinference benchmarks by about 10% largely due to the optimization of libjulia-internal.
The example in issue #45395 which stresses LLVM significantly more also sees a ~10% improvement.
We see a 20% improvement on

@time for i in 1:100000000
    string(i)
end

When building corecompiler.ji:
BOLT gives about a 16% improvement
PGO+LTO gives about a 21% improvement
PGO+LTO+BOLT gives about a 23% improvement

This requires only a single build of LLVM, compared to the two needed for PGO, and theoretically none if we change the binary builder script (i.e. we build with relocations and -fno-reorder-blocks-and-partition, then use BOLT to produce binaries with no relocations and reordered blocks, and ship both binaries?). In theory, this can also improve the performance of a PGO+LTO build by a couple of percent.

The only reproducible test problem I see is that the BOLT, PGO+LTO and PGO+LTO+BOLT builds all cause readelf to emit warnings as part of the osutils tests.

readelf: Warning: Unrecognised form: 0x22
readelf: Warning: DIE has locviews without loclist
readelf: Warning: Unrecognised form: 0x23
readelf: Warning: DIE at offset 0x227399 refers to abbreviation number 14754 which does not exist
readelf: Warning: Bogus end-of-siblings marker detected at offset 212aa9 in .debug_info section
readelf: Warning: Bogus end-of-siblings marker detected at offset 212ab0 in .debug_info section
readelf: Warning: Further warnings about bogus end-of-sibling markers suppressed

The unrecognised form warnings seem to be a bug in binutils: https://sourceware.org/bugzilla/show_bug.cgi?id=28981.
I believe the DIE at offset warning was fixed in binutils 2.36 (https://sourceware.org/bugzilla/show_bug.cgi?id=26808), yet ld -v says I have 2.38.
I assume these are all benign. I also don't see them on CI here https://buildkite.com/julialang/julia-buildkite/builds/1507#018f00e7-0737-4a42-bcd9-d4061dc8c93e so it could just be a local issue.

TODO:

  • Add PGO+LTO+BOLT makefile
  • Try and get libjulia-codegen optimised
  • Address the TODOs in the makefile
  • Run a full test suite on the resulting binary
  • Try to minimise what gets built with -fno-reorder-blocks-and-partition as we don't run BOLT on all binaries.
  • Disable -fno-reorder-blocks-and-partition when using clang.
  • Fix warnings like "warning: address range table at offset 0x0 has a premature terminator entry at offset 0x10". Filed as llvm/llvm-project#89508 ([BOLT] Optimized binary has premature terminator entry warning).
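
The compiler-flag items in this list can be sketched as a small shell helper. This is only an illustration under stated assumptions, not the logic of the PR's Makefile: the function names are hypothetical, the clang check is a plain substring match on the compiler's `--version` text, and `-Wl,--emit-relocs` is assumed to be the way to "build with relocations".

```shell
# Hypothetical helpers; the PR's Makefile may detect the compiler differently.

# Print the extra compile flag for code that BOLT will later rewrite.
# $1 is the text printed by `$CC --version`.
bolt_cflags() {
    case "$1" in
        *clang*) ;;  # per the TODO list, the flag is disabled for clang
        *) printf '%s\n' '-fno-reorder-blocks-and-partition' ;;
    esac
}

# Print the link flag that keeps relocations in the output binary
# (assumed spelling; GNU ld and lld both accept --emit-relocs).
bolt_ldflags() {
    printf '%s\n' '-Wl,--emit-relocs'
}
```

A caller could then write something like `CFLAGS="$CFLAGS $(bolt_cflags "$(${CC:-cc} --version 2>/dev/null)")"`; matching on the version text rather than the compiler's name is a deliberate simplification, since `cc` may be either GCC or clang.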

@Zentrik
Member Author

Zentrik commented Apr 20, 2024

The corecompiler.ji build seems representative so I'll use that as an overall benchmark:

  • BOLT gives about a 13% improvement
  • PGO+LTO gives about a 21% improvement
  • PGO+LTO+BOLT gives about a 19% improvement (I'll look into why this is slower than PGO+LTO only for corecompiler.ji, probably because we don't profile on it)

Here's a litany of benchmarks:
The script from #45395 (JULIA_LLVM_ARGS=-time-passes optimized.build/julia ../bolt/script-45395.jl 2>&1 | awk -F' ' '($1+0 > 0.1 || !($1 ~ /^[0-9.]*$/))')
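
For readers unfamiliar with the awk filter in that command, here is a small sketch of what it keeps and drops. It keeps a line when the first field is a number above 0.1 (seconds of user time in the pass tables) or is not a plain number at all (headers, separators, totals labels). The `filter_passes` and `sample_report` names and the `HypotheticalSmallPass` row are made up for illustration.

```shell
# Same awk program as in the command above, wrapped in a function for reuse.
filter_passes() {
    awk -F' ' '($1+0 > 0.1 || !($1 ~ /^[0-9.]*$/))'
}

# A header row, one real row from the report, one made-up sub-0.1s row,
# and a blank line.
sample_report() {
    printf '%s\n' \
        '   ---User Time---   --System Time--   --User+System--   ---Wall Time---  --- Name ---' \
        '   1.5925 ( 33.7%)   0.0047 (  4.1%)   1.5972 ( 33.0%)   1.5946 ( 33.1%)  InstCombinePass' \
        '   0.0120 (  0.3%)   0.0001 (  0.1%)   0.0121 (  0.3%)   0.0121 (  0.3%)  HypotheticalSmallPass' \
        ''
}

# Prints only the header row and the InstCombinePass row.
sample_report | filter_passes
```

This is why the reports in this thread list only a handful of passes each: everything under 0.1 s of user time, along with blank lines, is filtered out.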
On 4a2c593 (did not build LLVM from source):

  8.591355 seconds (6.38 M allocations: 283.723 MiB, 1.46% gc time, 100.00% compilation time)
===-------------------------------------------------------------------------===
                          Pass execution timing report
===-------------------------------------------------------------------------===
  Total Execution Time: 4.8351 seconds (4.8153 wall clock)
   ---User Time---   --System Time--   --User+System--   ---Wall Time---  --- Name ---
   1.5925 ( 33.7%)   0.0047 (  4.1%)   1.5972 ( 33.0%)   1.5946 ( 33.1%)  InstCombinePass
   1.1303 ( 23.9%)   0.0063 (  5.5%)   1.1366 ( 23.5%)   1.1351 ( 23.6%)  GVNPass
   0.5453 ( 11.6%)   0.0241 ( 20.9%)   0.5694 ( 11.8%)   0.5686 ( 11.8%)  IndVarSimplifyPass
   0.4985 ( 10.6%)   0.0185 ( 16.1%)   0.5170 ( 10.7%)   0.5162 ( 10.7%)  LoopFullUnrollPass
   0.1219 (  2.6%)   0.0008 (  0.7%)   0.1228 (  2.5%)   0.1222 (  2.5%)  LateLowerGCPass
   4.7199 (100.0%)   0.1152 (100.0%)   4.8351 (100.0%)   4.8153 (100.0%)  Total
===-------------------------------------------------------------------------===
                        Analysis execution timing report
===-------------------------------------------------------------------------===
  Total Execution Time: 0.0798 seconds (0.0724 wall clock)
   ---User Time---   --System Time--   --User+System--   ---Wall Time---  --- Name ---
===-------------------------------------------------------------------------===
                          Pass execution timing report
===-------------------------------------------------------------------------===
  Total Execution Time: 2.7290 seconds (2.7247 wall clock)
   ---User Time---   --System Time--   --User+System--   ---Wall Time---  --- Name ---
   0.8376 ( 36.5%)   0.0996 ( 22.8%)   0.9373 ( 34.3%)   0.9373 ( 34.4%)  Greedy Register Allocator #16
   0.1348 (  5.9%)   0.2875 ( 65.9%)   0.4223 ( 15.5%)   0.4223 ( 15.5%)  X86 Assembly Printer #7
   0.3971 ( 17.3%)   0.0002 (  0.1%)   0.3974 ( 14.6%)   0.3974 ( 14.6%)  Live Register Matrix
   0.1145 (  5.0%)   0.0000 (  0.0%)   0.1145 (  4.2%)   0.1145 (  4.2%)  Spill Code Placement Analysis #2
   2.2925 (100.0%)   0.4365 (100.0%)   2.7290 (100.0%)   2.7247 (100.0%)  Total

With BOLT:

  7.723599 seconds (6.38 M allocations: 283.722 MiB, 1.59% gc time, 100.00% compilation time)
===-------------------------------------------------------------------------===
                          Pass execution timing report
===-------------------------------------------------------------------------===
  Total Execution Time: 4.2703 seconds (4.2540 wall clock)
   ---User Time---   --System Time--   --User+System--   ---Wall Time---  --- Name ---
   1.3162 ( 31.6%)   0.0055 (  5.0%)   1.3218 ( 31.0%)   1.3199 ( 31.0%)  InstCombinePass
   0.9894 ( 23.8%)   0.0044 (  4.0%)   0.9938 ( 23.3%)   0.9925 ( 23.3%)  GVNPass
   0.5282 ( 12.7%)   0.0147 ( 13.2%)   0.5429 ( 12.7%)   0.5419 ( 12.7%)  IndVarSimplifyPass
   0.4815 ( 11.6%)   0.0129 ( 11.6%)   0.4944 ( 11.6%)   0.4935 ( 11.6%)  LoopFullUnrollPass
   0.1207 (  2.9%)   0.0009 (  0.8%)   0.1216 (  2.8%)   0.1212 (  2.8%)  LateLowerGCPass
   4.1593 (100.0%)   0.1110 (100.0%)   4.2703 (100.0%)   4.2540 (100.0%)  Total
===-------------------------------------------------------------------------===
                        Analysis execution timing report
===-------------------------------------------------------------------------===
  Total Execution Time: 0.0738 seconds (0.0676 wall clock)
   ---User Time---   --System Time--   --User+System--   ---Wall Time---  --- Name ---
===-------------------------------------------------------------------------===
                          Pass execution timing report
===-------------------------------------------------------------------------===
  Total Execution Time: 2.5232 seconds (2.5200 wall clock)
   ---User Time---   --System Time--   --User+System--   ---Wall Time---  --- Name ---
   0.7793 ( 37.5%)   0.1080 ( 24.3%)   0.8873 ( 35.2%)   0.8874 ( 35.2%)  Greedy Register Allocator #14
   0.1240 (  6.0%)   0.2841 ( 63.9%)   0.4081 ( 16.2%)   0.4082 ( 16.2%)  X86 Assembly Printer #7
   0.3392 ( 16.3%)   0.0000 (  0.0%)   0.3392 ( 13.4%)   0.3392 ( 13.5%)  Interleaved Access Pass #8
   0.1030 (  5.0%)   0.0000 (  0.0%)   0.1030 (  4.1%)   0.1030 (  4.1%)  Spill Code Placement Analysis #3
   2.0787 (100.0%)   0.4445 (100.0%)   2.5232 (100.0%)   2.5200 (100.0%)  Total

With PGO+LTO:

  7.361828 seconds (6.38 M allocations: 283.771 MiB, 1.59% gc time, 100.00% compilation time)
===-------------------------------------------------------------------------===
                          Pass execution timing report
===-------------------------------------------------------------------------===
  Total Execution Time: 4.0312 seconds (4.0077 wall clock)
   ---User Time---   --System Time--   --User+System--   ---Wall Time---  --- Name ---
   1.3406 ( 34.3%)   0.0044 (  3.5%)   1.3451 ( 33.4%)   1.3417 ( 33.5%)  InstCombinePass
   0.7992 ( 20.5%)   0.0062 (  4.9%)   0.8054 ( 20.0%)   0.8036 ( 20.1%)  GVNPass
   0.5063 ( 13.0%)   0.0230 ( 18.4%)   0.5293 ( 13.1%)   0.5285 ( 13.2%)  IndVarSimplifyPass
   0.4711 ( 12.1%)   0.0264 ( 21.1%)   0.4975 ( 12.3%)   0.4964 ( 12.4%)  LoopFullUnrollPass
   3.9063 (100.0%)   0.1249 (100.0%)   4.0312 (100.0%)   4.0077 (100.0%)  Total
===-------------------------------------------------------------------------===
                        Analysis execution timing report
===-------------------------------------------------------------------------===
  Total Execution Time: 0.0797 seconds (0.0698 wall clock)
   ---User Time---   --System Time--   --User+System--   ---Wall Time---  --- Name ---
===-------------------------------------------------------------------------===
                          Pass execution timing report
===-------------------------------------------------------------------------===
  Total Execution Time: 2.4598 seconds (2.4316 wall clock)
   ---User Time---   --System Time--   --User+System--   ---Wall Time---  --- Name ---
   0.8006 ( 40.2%)   0.1040 ( 22.2%)   0.9046 ( 36.8%)   0.9048 ( 37.2%)  Greedy Register Allocator #13
   0.1059 (  5.3%)   0.2957 ( 63.0%)   0.4015 ( 16.3%)   0.4016 ( 16.5%)  X86 Assembly Printer #7
   0.2927 ( 14.7%)   0.0121 (  2.6%)   0.3048 ( 12.4%)   0.3048 ( 12.5%)  X86 vzeroupper inserter
   1.9905 (100.0%)   0.4693 (100.0%)   2.4598 (100.0%)   2.4316 (100.0%)  Total

With PGO+LTO+BOLT:

  7.285692 seconds (6.38 M allocations: 283.713 MiB, 1.63% gc time, 100.00% compilation time)
===-------------------------------------------------------------------------===
                          Pass execution timing report
===-------------------------------------------------------------------------===
  Total Execution Time: 3.9592 seconds (3.9360 wall clock)
   ---User Time---   --System Time--   --User+System--   ---Wall Time---  --- Name ---
   1.3179 ( 34.2%)   0.0057 (  5.3%)   1.3235 ( 33.4%)   1.3197 ( 33.5%)  InstCombinePass
   0.7914 ( 20.5%)   0.0034 (  3.2%)   0.7948 ( 20.1%)   0.7922 ( 20.1%)  GVNPass
   0.4990 ( 13.0%)   0.0176 ( 16.4%)   0.5166 ( 13.0%)   0.5158 ( 13.1%)  IndVarSimplifyPass
   0.4642 ( 12.1%)   0.0177 ( 16.5%)   0.4819 ( 12.2%)   0.4810 ( 12.2%)  LoopFullUnrollPass
   3.8521 (100.0%)   0.1072 (100.0%)   3.9592 (100.0%)   3.9360 (100.0%)  Total
===-------------------------------------------------------------------------===
                        Analysis execution timing report
===-------------------------------------------------------------------------===
  Total Execution Time: 0.0794 seconds (0.0715 wall clock)
   ---User Time---   --System Time--   --User+System--   ---Wall Time---  --- Name ---
===-------------------------------------------------------------------------===
                          Pass execution timing report
===-------------------------------------------------------------------------===
  Total Execution Time: 2.4658 seconds (2.4390 wall clock)
   ---User Time---   --System Time--   --User+System--   ---Wall Time---  --- Name ---
   0.8087 ( 40.9%)   0.1042 ( 21.3%)   0.9128 ( 37.0%)   0.9129 ( 37.4%)  Greedy Register Allocator #10
   0.2963 ( 15.0%)   0.0040 (  0.8%)   0.3003 ( 12.2%)   0.3003 ( 12.3%)  X86 DAG->DAG Instruction Selection #4
   1.9771 (100.0%)   0.4887 (100.0%)   2.4658 (100.0%)   2.4390 (100.0%)  Total

Sysimage and pkgimage building (rm usr/lib/julia/corecompiler.ji; make -f sysimage.mk sysimg-release; make -f pkgimage.mk release):

On 4a2c593 (did not build LLVM from source):

Core.Compiler ──── 208.334 seconds
Base  ────────── 51.496164 seconds
FileWatching  ──  7.485314 seconds
Libdl  ─────────  0.003082 seconds
Artifacts  ─────  0.306612 seconds
SHA  ───────────  0.241494 seconds
Sockets  ───────  0.347978 seconds
LinearAlgebra  ─  8.298057 seconds
Random  ────────  0.921556 seconds
Stdlibs total  ─ 17.613382 seconds
Sysimage built. Summary:
Base ────────  51.496164 seconds 74.5104%
Stdlibs ─────  17.613382 seconds 25.485%
Total ───────  69.112693 seconds
Outputting sysimage file...
Output ──────  40.640285 seconds
Precompiling all packages for 2 compilation configurations...
  106 dependency configurations successfully precompiled in 102 seconds

With BOLT:

Core.Compiler ──── 181.249 seconds
Base  ────────── 47.624226 seconds
FileWatching  ──  6.700183 seconds
Libdl  ─────────  0.003140 seconds
Artifacts  ─────  0.302009 seconds
SHA  ───────────  0.255627 seconds
Sockets  ───────  0.335796 seconds
LinearAlgebra  ─  8.249945 seconds
Random  ────────  0.844563 seconds
Stdlibs total  ─ 16.698956 seconds
Sysimage built. Summary:
Base ────────  47.624226 seconds 74.0362%
Stdlibs ─────  16.698956 seconds 25.9601%
Total ───────  64.325567 seconds
Outputting sysimage file...
Output ──────  34.831136 seconds
Precompiling all packages for 2 compilation configurations...
  106 dependency configurations successfully precompiled in 94 seconds

With PGO+LTO:

JULIA contrib/pgo-lto-bolt/pgo-only.build/usr/lib/julia/corecompiler.ji
Core.Compiler ──── 164.565 seconds
Base  ────────── 44.721178 seconds
FileWatching  ──  6.396723 seconds
Libdl  ─────────  0.002857 seconds
Artifacts  ─────  0.288473 seconds
SHA  ───────────  0.221199 seconds
Sockets  ───────  0.345902 seconds
LinearAlgebra  ─  7.628382 seconds
Random  ────────  0.785489 seconds
Stdlibs total  ─ 15.676387 seconds
Sysimage built. Summary:
Base ────────  44.721178 seconds 74.0417%
Stdlibs ─────  15.676387 seconds 25.9543%
Total ───────  60.399965 seconds
Outputting sysimage file...
Output ──────  32.994419 seconds
Precompiling all packages for 2 compilation configurations...
  106 dependency configurations successfully precompiled in 89 seconds

With PGO+LTO+BOLT:

Core.Compiler ──── 168.119 seconds
Base  ────────── 44.547567 seconds
FileWatching  ──  6.360340 seconds
Libdl  ─────────  0.002854 seconds
Artifacts  ─────  0.287256 seconds
SHA  ───────────  0.244686 seconds
Sockets  ───────  0.347385 seconds
LinearAlgebra  ─  7.679211 seconds
Random  ────────  0.802141 seconds
Stdlibs total  ─ 15.730860 seconds
Sysimage built. Summary:
Base ────────  44.547567 seconds 73.9004%
Stdlibs ─────  15.730860 seconds 26.0961%
Total ───────  60.280557 seconds
Outputting sysimage file...
Output ──────  31.952144 seconds
Precompiling all packages for 2 compilation configurations...
  106 dependency configurations successfully precompiled in 87 seconds

String benchmark:

@time for i in 1:100000000
    string(i)
end

On 4a2c593 (did not build LLVM from source): 5.45s
With BOLT: 5.1s
With PGO+LTO: 5.1s
With PGO+LTO+BOLT: 5.15s

For this string benchmark, the PGO builds were noisy enough that I think all the optimized builds were about as fast as each other.

@Zentrik
Member Author

Zentrik commented Apr 20, 2024

Profiling the sysimg and pkgimg build worked.

  7.253939 seconds (6.38 M allocations: 283.772 MiB, 1.66% gc time, 100.00% compilation time)
===-------------------------------------------------------------------------===
                          Pass execution timing report
===-------------------------------------------------------------------------===
  Total Execution Time: 3.9429 seconds (3.9183 wall clock)
   ---User Time---   --System Time--   --User+System--   ---Wall Time---  --- Name ---
   1.3121 ( 34.3%)   0.0069 (  6.0%)   1.3189 ( 33.5%)   1.3150 ( 33.6%)  InstCombinePass
   0.7904 ( 20.6%)   0.0070 (  6.1%)   0.7974 ( 20.2%)   0.7947 ( 20.3%)  GVNPass
   0.5162 ( 13.5%)   0.0091 (  8.0%)   0.5254 ( 13.3%)   0.5244 ( 13.4%)  IndVarSimplifyPass
   0.4711 ( 12.3%)   0.0089 (  7.7%)   0.4800 ( 12.2%)   0.4791 ( 12.2%)  LoopFullUnrollPass
   3.8281 (100.0%)   0.1148 (100.0%)   3.9429 (100.0%)   3.9183 (100.0%)  Total
===-------------------------------------------------------------------------===
                        Analysis execution timing report
===-------------------------------------------------------------------------===
  Total Execution Time: 0.0767 seconds (0.0679 wall clock)
   ---User Time---   --System Time--   --User+System--   ---Wall Time---  --- Name ---
===-------------------------------------------------------------------------===
                          Pass execution timing report
===-------------------------------------------------------------------------===
  Total Execution Time: 2.4435 seconds (2.4167 wall clock)
   ---User Time---   --System Time--   --User+System--   ---Wall Time---  --- Name ---
   0.7997 ( 40.3%)   0.1040 ( 22.5%)   0.9037 ( 37.0%)   0.9038 ( 37.4%)  Greedy Register Allocator #14
   0.1187 (  6.0%)   0.2815 ( 61.1%)   0.4002 ( 16.4%)   0.4002 ( 16.6%)  X86 Assembly Printer #7
   0.2986 ( 15.1%)   0.0058 (  1.3%)   0.3044 ( 12.5%)   0.3038 ( 12.6%)  Live Variable Analysis
   1.9824 (100.0%)   0.4610 (100.0%)   2.4435 (100.0%)   2.4167 (100.0%)  Total
Core.Compiler ──── 160.462 seconds
Base  ────────── 43.983272 seconds
FileWatching  ──  6.255600 seconds
Libdl  ─────────  0.002862 seconds
Artifacts  ─────  0.287458 seconds
SHA  ───────────  0.224349 seconds
Sockets  ───────  0.313991 seconds
LinearAlgebra  ─  7.724898 seconds
Random  ────────  0.825273 seconds
Stdlibs total  ─ 15.641330 seconds
Sysimage built. Summary:
Base ────────  43.983272 seconds 73.7643%
Stdlibs ─────  15.641330 seconds 26.2321%
Total ───────  59.626791 seconds
Outputting sysimage file...
Output ──────  32.006934 seconds
Precompiling all packages for 2 compilation configurations...
  106 dependency configurations successfully precompiled in 86 seconds
@time for i in 1:100000000
    string(i)
end

This now seems to take 4.6s; I'm very sceptical of this result though.

I was concerned that, because corecompiler.ji is built with -O0, profiling on it might not be representative of normal Julia code, but compiling corecompiler.ji without -g0 -O0 took 162s.
The old PGO+LTO+BOLT build took 171s, so it does seem a lot better to profile on sysimg building as well.
I'm not sure why we use -O0 in the first place.

@Zentrik
Member Author

Zentrik commented Apr 20, 2024

Here's the updated benchmark for BOLT, now profiling the sysimg build:

  7.691131 seconds (6.38 M allocations: 283.722 MiB, 1.59% gc time, 100.00% compilation time)
===-------------------------------------------------------------------------===
                          Pass execution timing report
===-------------------------------------------------------------------------===
  Total Execution Time: 4.2602 seconds (4.2444 wall clock)
   ---User Time---   --System Time--   --User+System--   ---Wall Time---  --- Name ---
   1.3148 ( 31.9%)   0.0057 (  4.3%)   1.3205 ( 31.0%)   1.3187 ( 31.1%)  InstCombinePass
   0.9882 ( 23.9%)   0.0030 (  2.3%)   0.9912 ( 23.3%)   0.9899 ( 23.3%)  GVNPass
   0.5326 ( 12.9%)   0.0157 ( 11.7%)   0.5483 ( 12.9%)   0.5475 ( 12.9%)  IndVarSimplifyPass
   0.4764 ( 11.5%)   0.0190 ( 14.2%)   0.4954 ( 11.6%)   0.4945 ( 11.7%)  LoopFullUnrollPass
   0.1115 (  2.7%)   0.0008 (  0.6%)   0.1124 (  2.6%)   0.1119 (  2.6%)  LateLowerGCPass
   4.1261 (100.0%)   0.1341 (100.0%)   4.2602 (100.0%)   4.2444 (100.0%)  Total
===-------------------------------------------------------------------------===
                        Analysis execution timing report
===-------------------------------------------------------------------------===
  Total Execution Time: 0.0750 seconds (0.0686 wall clock)
   ---User Time---   --System Time--   --User+System--   ---Wall Time---  --- Name ---
===-------------------------------------------------------------------------===
                          Pass execution timing report
===-------------------------------------------------------------------------===
  Total Execution Time: 2.5078 seconds (2.5059 wall clock)
   ---User Time---   --System Time--   --User+System--   ---Wall Time---  --- Name ---
   0.7985 ( 38.3%)   0.0878 ( 20.6%)   0.8863 ( 35.3%)   0.8864 ( 35.4%)  Greedy Register Allocator #14
   0.1112 (  5.3%)   0.2967 ( 69.7%)   0.4078 ( 16.3%)   0.4079 ( 16.3%)  X86 Assembly Printer #9
   0.3394 ( 16.3%)   0.0000 (  0.0%)   0.3395 ( 13.5%)   0.3395 ( 13.5%)  Machine Copy Propagation Pass #5
   0.1036 (  5.0%)   0.0000 (  0.0%)   0.1036 (  4.1%)   0.1036 (  4.1%)  Spill Code Placement Analysis #2
   2.0822 (100.0%)   0.4256 (100.0%)   2.5078 (100.0%)   2.5059 (100.0%)  Total
Core.Compiler ──── 174.924 seconds
Base  ────────── 48.390739 seconds
FileWatching  ──  6.645787 seconds
Libdl  ─────────  0.003114 seconds
Artifacts  ─────  0.297960 seconds
SHA  ───────────  0.239149 seconds
Sockets  ───────  0.334072 seconds
LinearAlgebra  ─  8.276301 seconds
Random  ────────  0.848730 seconds
Stdlibs total  ─ 16.652748 seconds
Sysimage built. Summary:
Base ────────  48.390739 seconds 74.3946%
Stdlibs ─────  16.652748 seconds 25.6015%
Total ───────  65.046012 seconds
Outputting sysimage file...
Output ──────  34.413725 seconds
Precompiling all packages for 2 compilation configurations...
  106 dependency configurations successfully precompiled in 92 seconds

The string benchmark takes 4.85s

@Zentrik
Member Author

Zentrik commented Apr 20, 2024

So the new corecompiler.ji results:
BOLT gives about a 16% improvement
PGO+LTO gives about a 21% improvement
PGO+LTO+BOLT gives about a 23% improvement
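
These percentages can be reproduced from the Core.Compiler timings reported earlier in the thread (208.334 s for the baseline, 174.924 s with BOLT, 164.565 s with PGO+LTO, 160.462 s with PGO+LTO+BOLT). A minimal sketch, assuming "improvement" means the relative reduction in build time; the `improvement` helper is hypothetical.

```shell
# Percentage reduction in time relative to a baseline; both arguments in seconds.
improvement() {
    LC_ALL=C awk -v base="$1" -v opt="$2" \
        'BEGIN { printf "%.1f\n", (base - opt) / base * 100 }'
}

improvement 208.334 174.924   # BOLT          → 16.0
improvement 208.334 164.565   # PGO+LTO       → 21.0
improvement 208.334 160.462   # PGO+LTO+BOLT  → 23.0
```

Each figure comes from a single run, so differences of a percent or two between configurations may be within noise.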

@Zentrik reopened this Apr 20, 2024
@Zentrik marked this pull request as ready for review April 20, 2024 17:59
@Zentrik added the performance (Must go faster) and building (Build system, or building Julia or its dependencies) labels Apr 20, 2024
@tecosaur
Contributor

Oh this looks very cool, thanks for working on this @Zentrik!

@KristofferC
Member

Just a side note, but using the tests of LoopVectorization seems non-ideal since it is deprecated, has cases where it asserts or segfaults, and is quite special, e.g. in that it heavily uses llvmcall.

@gitboy16
Contributor

Thank you @Zentrik for working on this.
1/ Correct me if I am wrong, but this is just for Linux and macOS, right? If yes, what's the plan for Windows?
2/ For Linux/macOS, when should I run the steps in the README.md under folder bolt/pgo-lto-bolt? Before the build?

@Zentrik
Member Author

Zentrik commented Apr 23, 2024

  1. There is no plan for Windows. BOLT (the tool we are using to optimize the Julia and LLVM libraries) does not support Windows and there are no plans for it to. Looking into it, I don't think it supports macOS either, so I'll remove that from the README.
  2. The steps in the README.md build Julia; you do not need to run any further build commands after them.
    I.e. in the contrib/pgo-lto-bolt folder, run
make stage1
make stage2
make copy_originals
make bolt_instrument
make finish_stage2
make merge_data
make bolt

in your terminal. Then you can find the optimized binary in the contrib/pgo-lto-bolt/optimized.build directory.
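
The sequence of targets above can also be driven from a small wrapper that stops at the first failing stage. This is only a sketch: `run_stages` is a hypothetical helper, not something shipped in the repository, and the `MAKE` override is an assumption added so that a different make binary can be substituted.

```shell
# Hypothetical wrapper: run make targets in order and stop at the first failure.
# MAKE can be overridden (e.g. MAKE=gmake); it defaults to plain `make`.
run_stages() {
    for stage in "$@"; do
        echo "==> make $stage"
        "${MAKE:-make}" "$stage" || {
            echo "stage '$stage' failed" >&2
            return 1
        }
    done
}
```

From the contrib/pgo-lto-bolt folder, the full pipeline would then be `run_stages stage1 stage2 copy_originals bolt_instrument finish_stage2 merge_data bolt`. Stopping early matters here because the later stages consume profile data produced by the earlier ones.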

Alternatively, you can download binaries for a PGO+LTO build for x86_64 linux gnu here.

@giordano
Contributor

@Zentrik do you plan to make any other changes? Is this ready for review?

@Zentrik
Member Author

Zentrik commented Jul 25, 2024

@Zentrik do you plan to make any other changes? Is this ready for review?

This is ready

@giordano
Contributor

This seems to break x86_64 windows build

Zentrik added 10 commits July 26, 2024 08:03
Seems to prevent rebuilding pkgimages which is necessary as o/w segfaults. Maybe segfault is due to optimizing sys.so?
Relocations bloat binary size, it doesn't get loaded into memory but e.g. libjulia-codegen gains ~60mb. Ofc can be stripped but that's annoying.
TODO: remove BOLTLDFLAGS and BOLTCXXFLAGS, we can just add them straight to llvm's flags once I copy over clang detection.
BaseBenchmarks.SUITE["inference"]["allinference"]["Base.init_stdio(::Ptr{Cvoid})"] takes about 3.1s without optimizing libjulia-internal.so and on master.
With optimized libjulia-internal.so, it takes about 2.9s.
I suspect InferenceBaseBenchmarks don't spend much time in LLVM so optimizing it doesn't help much.

JULIA_LLVM_ARGS=-time-passes julia script-45395.jl spends about 4.3s in LLVM passes with optimized libLLVM (optimizing libjulia-internal has no effect) whilst taking about 4.9s on master.
Total time (~7.9s and 8.7s on master) isn't affected by the optimization of libjulia-internal, which makes sense as we spend most time in LLVM passes.

This reverts commit 0bb57ec.
This should hopefully be unnecessary now that we don't optimize sys.so.
Zentrik added 17 commits July 26, 2024 08:03
…lia-codegen

This prevents it from rebuilding other stuff like libjulia-codegen, which depends on libjulia-internal, and so libjulia-codegen loses its instrumentation.
…et BOLTed

Now that we pass `-fno-reorder-blocks-and-partition` to libjulia-codegen it seems BOLTing libjulia-internal no longer segfaults.
For one thing BOLT found split functions in libjulia-internal and codegen which could cause problems. Also it seemed to remove BOLT's performance improvement, given `jl_type_infer` was split maybe BOLT skipped it and other important functions.
I didn't try to reuse the bolt or pgo-lto Makefiles by including them as that seemed difficult and would probably break on any non-trivial change to either.
Was triggering in ssair and codegen tests, also doing `./julia s` where `s` is not a file.
BOLT only supports ELF binaries.
[no-ci]
@Zentrik
Member Author

Zentrik commented Jul 26, 2024

Seems to have resolved itself.

@giordano added the merge me (PR is reviewed. Merge when all tests are passing) label Jul 26, 2024
@giordano merged commit 1dee000 into JuliaLang:master Jul 26, 2024
6 of 8 checks passed
@giordano removed the merge me (PR is reviewed. Merge when all tests are passing) label Jul 26, 2024
@KristofferC
Member

Probably good with a NEWS entry?

@giordano
Contributor

We didn't have it for #45641 either, I guess mainly because these builds are only for advanced users for the time being. That said, a NEWS entry wouldn't be bad (and we'd still be in time to add the entry for #45641 to the v1.11 release notes).

@KristofferC
Member

KristofferC commented Jul 26, 2024

Even if it is only for advanced users it is in my opinion good to have some kind of external reference to it (NEWS + (dev)docs).

Like, if I want to test this now I don't really know where to start.

giordano added a commit that referenced this pull request Jul 29, 2024
Ref:
#54107 (comment).
If accepted, I'll add the NEWS.md entry for PGO/LTO in the release-1.11
branch too.
@giordano giordano mentioned this pull request Aug 2, 2024
68 tasks
lazarusA pushed a commit to lazarusA/julia that referenced this pull request Aug 17, 2024