
More codegen optimization options #13505

Closed
straight-shoota opened this issue May 26, 2023 · 19 comments

@straight-shoota
Member

straight-shoota commented May 26, 2023

Code optimization in the Crystal compiler is provided by LLVM and currently has two modes: by default there are no code optimizations; --release builds a single LLVM module and applies the highest available optimization level (-O3, "aggressive").
This heavy optimization results in long build times (up to 10x longer than non-release builds). Besides the optimizations themselves being more extensive, the single module prevents parallel processing and reuse across repeated builds.
--release mode is great for release builds with the best possible runtime performance. But there are other use cases that call for decently performing builds with a more restrained compile time.
This has previously been discussed in the forum: https://forum.crystal-lang.org/t/faster-release-compile-times-but-slightly-worse-performance/3864

#13464 contains a PoC implementation and benchmarks for it. I'm very happy about the numbers because they allow us to make informed decisions about the expected performance profiles.
Thanks @kostya for driving this forward!

I think it's better to move this discussion to a dedicated issue for clarity, independent of the specific PR implementation.

I'm noting my observations (some may have been mentioned before):

  1. As a first step, I would not change anything about the existing behaviours (what --release and the default mean). Let's first just add additional options for intermediate optimization levels to the compiler. That can give us a better understanding of how they play out for Crystal projects.
    We can later change the meaning of --release and the default behaviour, or potentially introduce additional CLI options as well.

  2. In both examples, --level1 and --level2 appear almost identical. Any idea why that is? Surely there must be a significant difference between the optimization levels in LLVM. Maybe it doesn't have much effect on these specific types of programs? Or on Crystal programs in general? It would be worth investigating, because it's very odd that these two levels appear almost redundant.

  3. Other LLVM frontends (clang, rustc) use -O0, -O1, -O2, -O3 to indicate the optimization level. I think it's reasonable to follow that naming for the CLI options, even if a given level can have a different effective impact than the same level in other compilers.

  4. A distinction between aggressive optimization (O3) and single-module builds would be nice. It should be possible to use -O3 without a single module: -O3 should not imply --single-module, but --release would expand to -O3 --single-module (see the sketch after this list).
    This would give another option between --level2 and --release in the benchmarks, which might be a very reasonable choice for getting performant builds repeatedly.
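
To make point 4 concrete, here is how the resulting option matrix could look. These flag spellings are hypothetical at this point, following the clang/rustc naming from point 3:

```console
$ crystal build app.cr             # no optimization, separate modules (current default)
$ crystal build app.cr -O1         # light optimizations, separate modules
$ crystal build app.cr -O2         # moderate optimizations, separate modules
$ crystal build app.cr -O3         # aggressive optimizations, still separate modules
$ crystal build app.cr --release   # shorthand for -O3 --single-module
```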

This is meant to start a discussion, not express final opinions. Please let's discuss these (and other aspects) first before making changes to the PR.

@kostya
Contributor

kostya commented May 26, 2023

  1. When compiling in separate-modules mode (default, level1, level2), the compiler generates an LLVM module for each class, which is fine because it allows parallelization. A small method like https://github.com/crystal-lang/crystal/blob/master/src/char.cr#L189 just generates its own LLVM function, and it is inlined (at O1/O2) only within that class's module, where it is used. Every other module that uses the method generates a call instead, which is much slower than inlined usage. That is why optimizing with O1 or O2 (and probably O3) makes little difference: most methods are too small. The optimization levels apply code transformations, but there is not much to transform when all the code is a bunch of calls. I tried LTO as well, but it didn't have much effect. In C++, such small methods are placed in headers, so they are inlined everywhere they are used; in Crystal they are inlined only inside their own class. In --release there is no such problem, because all classes are generated into a single LLVM module (but it is too big and doesn't allow incremental reuse), and every one of these small methods gets inlined. If all small methods in Crystal were macros, O1/O2 would be comparable to C++.
    Also, as I understand it, this does not apply to methods that use yield: those are always inlined by Crystal itself, because LLVM has no concept of yield. There is also the AlwaysInline annotation, but it just tells LLVM to inline the method, so it is still in-module inlining only. If we added an annotation like Header (with a meaning similar to a C++ header) that inlines methods at the Crystal level (like methods with yield), and applied it to all small methods, we could probably reach fast and quite performant O2 builds.
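
A minimal sketch of the problem described above (hypothetical types, for illustration only):

```crystal
struct Point
  def initialize(@x : Int32, @y : Int32)
  end

  # A tiny method like this gets its own LLVM function inside Point's module.
  # At -O1/-O2 it is inlined only for callers compiled into that same module.
  def sum : Int32
    @x + @y
  end
end

class Consumer
  # Consumer is codegen'd into a different LLVM module, so in separate-modules
  # mode this line becomes an actual `call` to Point#sum instead of an inlined
  # add. A single-module --release build inlines it.
  def total(p : Point) : Int32
    p.sum
  end
end

puts Consumer.new.total(Point.new(1, 2)) # => 3
```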

@funny-falcon
Contributor

funny-falcon commented May 26, 2023

O1 vs O2 could show more of a difference if loop unrolling were disabled for O1. Disabling loop unrolling certainly improves compilation speed and hurts runtime performance. I'll try to redo the measurements by Monday to show numbers.

I don't think a new explicit "Header" annotation is worth it, but an implicit one for small methods and for "AlwaysInline" is reasonable. Crystal already accumulates the whole program for analysis, so it looks like it would not hurt compilation time much.

I doubt -O3 without --single-module is reasonable. If users don't want every last bit of performance, -O2 is already good (especially if "header" methods are implemented). If they do, then single-module compilation in --release/-O3 is just fine. But I could be wrong here.

I really want to change the default option. It is almost useless at -O0. Only very simple scripts benefit from "super fast compilation" compared to -O1, and debug information with -O1 should be enough in most cases (this still needs to be verified).

@zw963
Contributor

zw963 commented May 28, 2023

I split the opt levels and single-module in #13464:

Results for my project (the runtime performance difference between O1, O2, and O3 is not so big):

-O3 --single-module: initial compile: 0m56,660s, incremental: 0m57,306s, start time: 2.89s
-O2 --single-module: initial compile: 0m52,105s, incremental: 0m52,068s, start time: 2.79s
-O1 --single-module: initial compile: 0m44,674s, incremental: 0m44,483s, start time: 2.86s
--single-module: initial compile: 0m10,975s, incremental: 0m10,979s, start time: 31.77s

-O3: initial compile: 0m12,049s, incremental: 0m5,620s, start time: 9.74s
-O2: initial compile: 0m12,005s, incremental: 0m5,640s, start time: 9.77s
-O1: initial compile: 0m11,464s, incremental: 0m5,578s, start time: 10.14s
(no optimization): initial compile: 0m7,126s, incremental: 0m5,797s, start time: 32.39s

Run time for this benchmark: https://github.com/kostya/jit-benchmarks#brainfuck

-O3 --single-module: 0m4,656s
-O2 --single-module: 0m4,701s
-O1 --single-module: 0m4,814s
--single-module: 0m55,725s

-O3: 0m18,715s
-O2: 0m18,427s
-O1: 0m22,277s
(no optimization): 0m54,816s

After checking those benchmarks, from my point of view -O1 is much better than the default of no optimization.

Even a little bit of improvement for incremental builds is huge!

Of course, I'm missing the context for why no optimization was chosen as the default; there may be concerns behind that decision.

@funny-falcon
Contributor

funny-falcon commented May 29, 2023

I've tried disabling loop unrolling for O1 (funny-falcon@94fe0cd) on top of @kostya's AlwaysInline optimizations, but didn't see a significant effect on Kostya's brainfuck benchmark or on compiling Crystal itself.

@funny-falcon
Contributor

funny-falcon commented May 29, 2023

But compiling macros with O1 (funny-falcon@146069d) gave a 4-5 second improvement on Crystal self-compilation at the -O1 level (1:15 -> 1:10 real time, 2:52 -> 2:48 CPU time) and at the -O2 level (1:18 -> 1:14 real time).

@kostya
Contributor

kostya commented May 29, 2023

I think it would have an effect only in single-module mode.

@funny-falcon
Contributor

@kostya Looks like you're right: just disabling single-module for macro compilation (while keeping the O3 optimization level) saves time as well.

@kostya
Contributor

kostya commented May 30, 2023

I ran some benchmarks:

initial is the compile time with a clean cache; incremental is compiling the same file after a 1-line change; run time is the benchmark's run time in seconds.
--release = -O3 --single-module; a bare : = no optimization, many modules; -O2 = O2 and many modules; -O1 = O1 and many modules.

Before (branch opt_level):

binarytrees
--release: initial: 6.60s, incremental: 6.58s, run time: 0.65s
: initial: 1.25s, incremental: 1.01s, run time: 0.99s
-O2: initial: 2.43s, incremental: 1.36s, run time: 0.72s
-O1: initial: 2.18s, incremental: 1.20s, run time: 0.71s
brainfuck
--release: initial: 6.80s, incremental: 6.74s, run time: 4.65s
: initial: 1.22s, incremental: 0.99s, run time: 56.54s
-O2: initial: 2.40s, incremental: 1.33s, run time: 19.71s
-O1: initial: 2.12s, incremental: 1.17s, run time: 20.20s
brainfuck2
--release: initial: 6.55s, incremental: 6.55s, run time: 1.91s
: initial: 1.22s, incremental: 0.98s, run time: 16.87s
-O2: initial: 2.37s, incremental: 1.31s, run time: 5.19s
-O1: initial: 2.18s, incremental: 1.18s, run time: 5.86s
fannkuchredux
--release: initial: 6.75s, incremental: 6.56s, run time: 0.19s
: initial: 1.29s, incremental: 0.98s, run time: 4.78s
-O2: initial: 2.45s, incremental: 1.34s, run time: 1.06s
-O1: initial: 2.14s, incremental: 1.16s, run time: 1.09s
fasta
--release: initial: 6.76s, incremental: 6.55s, run time: 0.78s
: initial: 1.25s, incremental: 0.98s, run time: 4.09s
-O2: initial: 2.47s, incremental: 1.37s, run time: 1.46s
-O1: initial: 2.17s, incremental: 1.18s, run time: 1.46s
mandelbrot
--release: initial: 6.52s, incremental: 6.66s, run time: 0.46s
: initial: 1.26s, incremental: 0.98s, run time: 1.34s
-O2: initial: 2.41s, incremental: 1.33s, run time: 0.69s
-O1: initial: 2.12s, incremental: 1.20s, run time: 0.47s
matmul
--release: initial: 7.83s, incremental: 7.76s, run time: 0.25s
: initial: 1.30s, incremental: 0.99s, run time: 3.53s
-O2: initial: 2.75s, incremental: 1.35s, run time: 1.56s
-O1: initial: 2.47s, incremental: 1.17s, run time: 1.56s
nbody
--release: initial: 7.62s, incremental: 7.61s, run time: 0.69s
: initial: 1.30s, incremental: 1.01s, run time: 2.92s
-O2: initial: 2.82s, incremental: 1.43s, run time: 0.79s
-O1: initial: 2.64s, incremental: 1.20s, run time: 0.83s
pidigits
--release: initial: 7.84s, incremental: 7.81s, run time: 4.80s
: initial: 1.36s, incremental: 1.05s, run time: 5.25s
-O2: initial: 2.83s, incremental: 1.39s, run time: 4.85s
-O1: initial: 2.57s, incremental: 1.22s, run time: 5.28s
spectralnorm
--release: initial: 7.49s, incremental: 7.57s, run time: 0.11s
: initial: 1.26s, incremental: 1.01s, run time: 0.71s
-O2: initial: 2.62s, incremental: 1.36s, run time: 0.23s
-O1: initial: 2.35s, incremental: 1.17s, run time: 0.23s
base64
--release: initial: 6.13s, incremental: 6.21s, run time: 1.23s
: initial: 1.20s, incremental: 0.95s, run time: 45.40s
-O2: initial: 2.27s, incremental: 1.27s, run time: 19.47s
-O1: initial: 2.04s, incremental: 1.08s, run time: 19.06s
json
--release: initial: 6.67s, incremental: 6.71s, run time: 1.28s
: initial: 1.26s, incremental: 0.98s, run time: 7.90s
-O2: initial: 2.51s, incremental: 1.33s, run time: 3.43s
-O1: initial: 2.30s, incremental: 1.17s, run time: 3.38s
json2
--release: initial: 6.65s, incremental: 6.76s, run time: 0.88s
: initial: 1.24s, incremental: 0.95s, run time: 7.29s
-O2: initial: 2.56s, incremental: 1.37s, run time: 2.95s
-O1: initial: 2.19s, incremental: 1.17s, run time: 3.02s
primes
--release: initial: 6.27s, incremental: 6.33s, run time: 0.14s
: initial: 1.14s, incremental: 0.90s, run time: 1.15s
-O2: initial: 2.19s, incremental: 1.23s, run time: 0.39s
-O1: initial: 1.93s, incremental: 1.07s, run time: 0.41s
knucleotide
--release: initial: 7.88s, incremental: 7.83s, run time: 1.90s
: initial: 1.30s, incremental: 1.00s, run time: 16.52s
-O2: initial: 2.83s, incremental: 1.35s, run time: 6.27s
-O1: initial: 2.78s, incremental: 1.37s, run time: 6.82s
regexdna
--release: initial: 7.43s, incremental: 7.29s, run time: 2.76s
: initial: 1.28s, incremental: 1.02s, run time: 2.90s
-O2: initial: 2.53s, incremental: 1.35s, run time: 2.80s
-O1: initial: 2.25s, incremental: 1.18s, run time: 2.80s
revcomp
--release: initial: 6.91s, incremental: 6.91s, run time: 1.88s
: initial: 1.25s, incremental: 0.99s, run time: 10.59s
-O2: initial: 2.50s, incremental: 1.33s, run time: 3.75s
-O1: initial: 2.26s, incremental: 1.16s, run time: 3.62s
my
--release: initial: 54.01s, incremental: 56.26s, run time: 2.87s
: initial: 7.16s, incremental: 5.58s, run time: 32.46s
-O2: initial: 12.32s, incremental: 6.26s, run time: 9.94s
-O1: initial: 11.91s, incremental: 6.01s, run time: 10.21s
compiler
--release: initial: 5m32.11s, run time: 0m32.80s
: initial: 0m32.98s, incremental: 0m21.61s, run time: 0m52.07s
-O1: initial: 1m05.90s, incremental: 0m22.59s, run time: 0m39.15s
-O2: initial: 1m10.09s, incremental: 0m22.63s, run time: 0m39.10s

After optimizations (branch optimizations2):

kostya/crystal@opt_level...kostya:crystal:optimizations2
Briefly: AlwaysInline now behaves like it sounds, purely inlining code at the Crystal level. Before, it just marked the function in LLVM, so inlining only worked within the current LLVM module, with no cross-module inlining. (Not sure whether backtraces still work, by the way.) So now all methods marked with AlwaysInline act like headers in C++. I also added likely intrinsics, which add some performance as well.
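
A minimal sketch of the annotation with the branch's semantics (hypothetical type, for illustration; in mainline Crystal, @[AlwaysInline] only sets the LLVM attribute):

```crystal
struct Wrapper
  def initialize(@value : Int32)
  end

  # With the optimizations2 semantics, the body is inlined in the Crystal
  # frontend (like a yield method), so callers in other LLVM modules get the
  # inlined code too, not just callers within this module.
  @[AlwaysInline]
  def value : Int32
    @value
  end
end
```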

binarytrees
--release: initial: 6.24s, incremental: 6.18s, run time: 0.62s
: initial: 1.24s, incremental: 1.00s, run time: 0.97s
-O2: initial: 2.28s, incremental: 1.33s, run time: 0.67s
-O1: initial: 2.10s, incremental: 1.23s, run time: 0.69s
brainfuck
--release: initial: 6.39s, incremental: 6.40s, run time: 4.26s
: initial: 1.24s, incremental: 1.02s, run time: 42.35s
-O2: initial: 2.25s, incremental: 1.32s, run time: 4.90s
-O1: initial: 2.11s, incremental: 1.23s, run time: 5.55s
brainfuck2
--release: initial: 6.30s, incremental: 6.29s, run time: 1.88s
: initial: 1.25s, incremental: 1.01s, run time: 11.97s
-O2: initial: 2.27s, incremental: 1.31s, run time: 1.70s
-O1: initial: 2.10s, incremental: 1.19s, run time: 1.88s
fannkuchredux
--release: initial: 6.28s, incremental: 6.30s, run time: 0.19s
: initial: 1.28s, incremental: 1.06s, run time: 3.09s
-O2: initial: 2.32s, incremental: 1.34s, run time: 0.22s
-O1: initial: 2.08s, incremental: 1.23s, run time: 0.24s
fasta
--release: initial: 6.17s, incremental: 6.14s, run time: 0.77s
: initial: 1.26s, incremental: 1.01s, run time: 3.32s
-O2: initial: 2.30s, incremental: 1.33s, run time: 1.13s
-O1: initial: 2.13s, incremental: 1.21s, run time: 1.08s
mandelbrot
--release: initial: 6.04s, incremental: 6.10s, run time: 0.46s
: initial: 1.24s, incremental: 1.00s, run time: 1.32s
-O2: initial: 2.22s, incremental: 1.30s, run time: 0.75s
-O1: initial: 2.06s, incremental: 1.18s, run time: 0.47s
matmul
--release: initial: 6.80s, incremental: 6.75s, run time: 0.25s
: initial: 1.35s, incremental: 1.06s, run time: 2.87s
-O2: initial: 2.60s, incremental: 1.40s, run time: 0.25s
-O1: initial: 2.36s, incremental: 1.28s, run time: 0.25s
nbody
--release: initial: 6.94s, incremental: 6.93s, run time: 0.69s
: initial: 1.39s, incremental: 1.07s, run time: 2.90s
-O2: initial: 2.57s, incremental: 1.40s, run time: 0.74s
-O1: initial: 2.39s, incremental: 1.26s, run time: 0.81s
pidigits
--release: initial: 7.05s, incremental: 7.03s, run time: 4.97s
: initial: 1.42s, incremental: 1.11s, run time: 5.04s
-O2: initial: 2.64s, incremental: 1.41s, run time: 5.20s
-O1: initial: 2.41s, incremental: 1.29s, run time: 4.90s
spectralnorm
--release: initial: 6.77s, incremental: 6.75s, run time: 0.11s
: initial: 1.33s, incremental: 1.04s, run time: 0.58s
-O2: initial: 2.56s, incremental: 1.39s, run time: 0.11s
-O1: initial: 2.45s, incremental: 1.24s, run time: 0.11s
base64
--release: initial: 5.85s, incremental: 5.87s, run time: 1.21s
: initial: 1.22s, incremental: 1.00s, run time: 27.09s
-O2: initial: 2.23s, incremental: 1.27s, run time: 1.25s
-O1: initial: 2.06s, incremental: 1.14s, run time: 1.23s
json
--release: initial: 6.41s, incremental: 6.46s, run time: 1.39s
: initial: 1.33s, incremental: 1.02s, run time: 6.24s
-O2: initial: 2.39s, incremental: 1.30s, run time: 2.16s
-O1: initial: 2.18s, incremental: 1.20s, run time: 2.15s
json2
--release: initial: 6.39s, incremental: 6.41s, run time: 1.04s
: initial: 1.30s, incremental: 1.02s, run time: 5.59s
-O2: initial: 2.38s, incremental: 1.33s, run time: 1.94s
-O1: initial: 2.21s, incremental: 1.25s, run time: 1.91s
primes
--release: initial: 5.77s, incremental: 5.80s, run time: 0.15s
: initial: 1.23s, incremental: 0.96s, run time: 0.99s
-O2: initial: 2.20s, incremental: 1.27s, run time: 0.24s
-O1: initial: 2.02s, incremental: 1.16s, run time: 0.24s
knucleotide
--release: initial: 7.14s, incremental: 7.22s, run time: 1.95s
: initial: 1.36s, incremental: 1.05s, run time: 13.42s
-O2: initial: 2.69s, incremental: 1.37s, run time: 3.04s
-O1: initial: 2.42s, incremental: 1.23s, run time: 3.06s
regexdna
--release: initial: 6.54s, incremental: 6.54s, run time: 2.75s
: initial: 1.28s, incremental: 1.03s, run time: 2.82s
-O2: initial: 2.33s, incremental: 1.33s, run time: 2.77s
-O1: initial: 2.16s, incremental: 1.21s, run time: 2.77s
revcomp
--release: initial: 6.45s, incremental: 6.53s, run time: 1.89s
: initial: 1.27s, incremental: 1.02s, run time: 8.89s
-O2: initial: 2.42s, incremental: 1.30s, run time: 2.42s
-O1: initial: 2.21s, incremental: 1.18s, run time: 2.46s
my
--release: initial: 56.63s, incremental: 56.51s, run time: 2.74s
: initial: 7.19s, incremental: 5.64s, run time: 19.60s
-O2: initial: 12.96s, incremental: 6.20s, run time: 4.73s
-O1: initial: 12.34s, incremental: 6.11s, run time: 4.73s
compiler
--release: initial: 5m44.55s, run time: 0m35.25s
: initial: 0m34.24s, incremental: 0m19.47s, run time: 0m52.59s
-O1: initial: 1m13.41s, incremental: 0m20.38s, run time: 0m39.03s
-O2: initial: 1m20.53s, incremental: 0m20.61s, run time: 0m38.71s

Best results:

The biggest gains after the optimizations are in brainfuck, my, and base64.

Not all optimizations are done yet.

There are remaining optimizations in #hash and Hasher, and in .new, which didn't play nicely with AlwaysInline. Tuple#includes? is also hard to optimize because it goes through Enumerable; that's why I replaced in? in base64.
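
For context, a sketch of that in? rewrite (hypothetical code, not the actual benchmark):

```crystal
c = 'x'

# Dispatches through Object#in? -> Tuple#includes? -> Enumerable, which is
# hard to inline across LLVM modules in separate-modules mode.
slow = c.in?('a', 'b', 'c')

# Plain comparisons lower to simple branches that LLVM optimizes easily.
fast = c == 'a' || c == 'b' || c == 'c'

puts slow, fast
```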

Conclusion:

After these optimizations, -O2 became a decent compilation level: its runtime is 0.9-2 times slower than release (in one test even faster), but compile time is 3-4 times faster and incremental compile time is 5-10 times faster.
--release is only a little faster after all these optimizations, because single-module and O3 do a similar thing with inlining.
For recompiling the compiler itself, these optimizations have no visible effect.

@zw963
Contributor

zw963 commented May 30, 2023

> I ran some benchmarks:
>
> After these optimizations, -O2 became a decent compilation level: its runtime is 0.9-2 times slower than release (in one test even faster), but compile time is 3-4 times faster and incremental compile time is 5-10 times faster. --release is only a little faster after all these optimizations, because single-module and O3 do a similar thing with inlining.

It seems like no optimization still gives a better result than -O1/-O2 for incremental compiles?

@kostya
Contributor

kostya commented May 30, 2023

@zw963 For compile time, yes, but that was expected; look at the run time.

@funny-falcon
Contributor

> It seems like no optimization still gives a better result than -O1/-O2 for incremental compiles?

It depends on the definition of "better".
Saving 0.3s on compilation and losing 5s at runtime is not what I'd call "better".

@beta-ziliani
Member

My take on this, mostly what's been said already:

  • Use -Ox with x between 0 and 3 (matching the current default and the current --release, respectively).
  • Default to -O1.
  • Split --single-module into its own flag.

I consider the idea of a pure Crystal inlining interesting, but orthogonal. I'd rather put that in another issue.

@straight-shoota
Member Author

This was resolved in #13464

@zw963
Contributor

zw963 commented Jan 12, 2024

Hi @kostya, I see this issue was closed. What is the difference between the optimizations2 branch and the old opt_level branch? Was that merged into 1.11.0 too?

@kostya
Contributor

kostya commented Jan 12, 2024

These optimizations (kostya/crystal@opt_level...kostya:crystal:optimizations2) were not merged; they're quite questionable, but they hugely improve the runtime of incrementally compiled builds.

@zw963
Contributor

zw963 commented Jan 12, 2024

> These optimizations (kostya/crystal@opt_level...kostya:crystal:optimizations2) were not merged; they're quite questionable, but they hugely improve the runtime of incrementally compiled builds.

Yes, questionable, but we can discuss it. Should we create a new issue for this?

@kostya
Contributor

kostya commented Jan 12, 2024

This is what I gained with these optimizations, compared to --release:
-O2 became a decent compilation level: its runtime is 0.9-2 times slower than release (in one test even faster), but compile time is 3-4 times faster and incremental compile time is 5-10 times faster.

@n-pn

n-pn commented Jan 12, 2024

That is promising. Could you please create another pull request containing only those parts?

I would like to add a compiler switch that, when turned on, applies all the patches. The official compiler might not need it, but it would surely be useful for my daily coding sessions.

@kostya
Contributor

kostya commented Jan 12, 2024

Opened it here: #14225
