Add incremental optimization levels #13464

Merged
merged 2 commits into from
Nov 6, 2023

Contributor

@kostya kostya commented May 12, 2023

UPDATE: Revised description reflecting the actually merged changes (by @straight-shoota):

Adds four distinct optimization levels:

  • -O0: No optimization
  • -O1: Low optimization
  • -O2: Middle optimization
  • -O3: High optimization

Each level activates the respective LLVM RunPasses and CodeGenOptLevel optimizations.

-O3 corresponds to the existing release mode and -O0 corresponds to the default non-release mode. -O0 remains the default and --release is equivalent to -O3 --single-module.

Effectively, this introduces two optimization choices between the previous full or nothing. And it's now possible to use high optimization without --single-module.

Each optimization level increasingly trades compile-time performance for runtime performance. The exact effect depends on the individual code base. But in general, even slight optimizations can significantly improve runtime performance with barely noticeable impact on compile time.
When using any kind of optimization, --single-module probably has the biggest effect on both compile-time and runtime performance. It enables optimizations across module boundaries but makes it impossible to generate modules in parallel.
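
As a rough illustration of the flag semantics described above, the following Crystal sketch resolves a list of command-line flags to an optimization level and a single-module setting. The `OptimizationMode` enum name matches the one used in this PR's diff, but `resolve_flags`, its tuple return value, and the "later flags override earlier ones" rule are assumptions made for this sketch. It is not the compiler's actual option parsing.

```crystal
# Hypothetical model of the flag semantics; not the compiler's real option parser.
enum OptimizationMode
  O0
  O1
  O2
  O3
end

# Returns {optimization mode, single_module?} for the given flags.
# Later flags override earlier ones (an assumption made for this sketch).
def resolve_flags(args : Array(String)) : Tuple(OptimizationMode, Bool)
  mode = OptimizationMode::O0 # -O0 is the default
  single_module = false

  args.each do |arg|
    case arg
    when "-O0"             then mode = OptimizationMode::O0
    when "-O1"             then mode = OptimizationMode::O1
    when "-O2"             then mode = OptimizationMode::O2
    when "-O3"             then mode = OptimizationMode::O3
    when "--single-module" then single_module = true
    when "--release"
      # --release is equivalent to -O3 --single-module
      mode = OptimizationMode::O3
      single_module = true
    end
  end

  {mode, single_module}
end

# --release and -O3 --single-module resolve to the same configuration.
puts resolve_flags(["--release"]) == resolve_flags(["-O3", "--single-module"]) # prints true

# High optimization without --single-module is now a valid combination.
puts resolve_flags(["-O3"])
```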


Original Post:

This is just a draft, for discussion. Here I add two new optimization levels (--level2 and --level1) to the existing ones, --release and the default.
Reason: I have quite a big project that compiles very slowly and also has a long start time (it reads data from a database and does some heavy calculations on it). If I compile it without optimization, it starts very slowly; if I compile it with --release, compilation takes too long. So debugging such a project is a real pain. Adding new incremental optimization levels solves this problem.

Of course, the level2 and level1 optimizations are not even close to --release in terms of performance, because they optimize every module separately (a module is a class in Crystal), unlike --release, which optimizes one united module (with heavy inlining). But it is fast enough to debug my project.

These are the results for my project:
--release: initial compile: 61s, incremental compile(change 1 file): 61s, start time: 2s
--level2: initial compile: 12.6s, incremental compile(change 1 file): 5.3s, start time: 7s
--level1: initial compile: 12s, incremental compile(change 1 file): 5.2s, start time: 7.5s
default: initial compile: 7.4s, incremental compile(change 1 file): 5.3s, start time: 23.5s

Building the Crystal compiler (run time here means recompiling Crystal with the new binary, using a clean cache and level0, to reduce LLVM interference):
--release: initial compile: 7m38,818s, incremental compile(change 1 file): 7m34,734s, run time: 0m36,086s
--level2: initial compile: 1m5,084s, incremental compile(change 1 file): 0m22,129s, run time: 0m42,691s
--level1: initial compile: 1m1,384s, incremental compile(change 1 file): 0m22,106s, run time: 0m42,980s
default: initial compile: 0m27,484s, incremental compile(change 1 file): 0m22,071s, run time: 0m55,572s

@Blacksmoke16
Member

Related: https://forum.crystal-lang.org/t/faster-release-compile-times-but-slightly-worse-performance/3864

@funny-falcon
Contributor

funny-falcon commented May 12, 2023

I suggest the following option names:

  • --no-opt - matches the current <default> mode (and --level0 in 099c080)
  • <default> - optimization with O1 and separate/incremental compilation (--level1)
  • --opt - optimization with O2 and separate/incremental compilation (--level2)
  • --release - remains the same: O3 and "single_module" compilation

I strongly believe the default mode should have optimizations enabled, since it is what most users use first. Given that it doesn't harm compilation time much and significantly improves the performance of the resulting binary, I don't see why the non-optimized mode should remain the default.

@Sija
Contributor

Sija commented May 12, 2023

@funny-falcon Long time no see!

builder.use_inliner_with_threshold = 275
when OptimizationMode::Level1
builder.opt_level = 1
builder.use_inliner_with_threshold = 150
Contributor

There is another useful option to speed up compilation:

builder.disable_unroll_loops = true

But I don't know how to achieve the same in optimize_with_new_pass_manager for newer LLVM.

Contributor

@funny-falcon funny-falcon May 12, 2023

It looks like LLVM::PassBuilderOptions#set_loop_unrolling needs to be added, which should map to LLVMPassBuilderOptionsSetLoopUnrolling.

@kostya
Contributor Author

kostya commented May 13, 2023

@jkthorne

What about more following the optimization levels of compilers like GCC and LLVM?

Instead of "--level2" it would be "-O2"?

This would be quite bad because it creates confusion: level2 here is not even close to gcc -O2. In gcc, -O2 is a very good level of optimization, but in Crystal it would be much slower, because it uses separate module compilation (so no inlining).

@kostya kostya changed the title Add incremental release compilation Add incremental optimization levels May 13, 2023
LLVM::CodeGenOptLevel::Less
else
LLVM::CodeGenOptLevel::None
end
Contributor Author

Does anybody know what this opt_level is for? Is it an option for the linker?

Contributor

Roughly speaking, this level is required for the codegen passes, whereas the other one is for the optimization passes
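
To make the distinction concrete, here is a minimal Crystal sketch of deriving a codegen level from an optimization mode. It uses stand-in enums so it runs without the LLVM bindings; `CodeGenOptLevel` mirrors the four levels LLVM defines (None, Less, Default, Aggressive). The helper name `codegen_opt_level` and the exact one-to-one mapping are assumptions for illustration, not the code from this PR.

```crystal
# Stand-in enums so the sketch runs without the LLVM bindings.
enum OptimizationMode
  O0
  O1
  O2
  O3
end

# Mirrors the four codegen levels that LLVM defines.
enum CodeGenOptLevel
  None
  Less
  Default
  Aggressive
end

# Assumed mapping: the codegen level (used when emitting machine code)
# follows the optimization mode (used for the IR optimization passes).
def codegen_opt_level(mode : OptimizationMode) : CodeGenOptLevel
  case mode
  in .o0? then CodeGenOptLevel::None
  in .o1? then CodeGenOptLevel::Less
  in .o2? then CodeGenOptLevel::Default
  in .o3? then CodeGenOptLevel::Aggressive
  end
end

OptimizationMode.values.each do |mode|
  puts "#{mode} -> #{codegen_opt_level(mode)}"
end
```

The exhaustive `case ... in` form makes the compiler reject the method if a new optimization mode is added without a matching codegen level.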

@zw963
Contributor

zw963 commented May 13, 2023

I propose using --level1 or --level2 as the default, because for a project the first build time can always be ignored: we only need to do it once, right?

But we should keep the old default mode so users can select it manually.

@kostya
Contributor Author

kostya commented May 14, 2023

I like the idea of using level1 as the default.

Minuses:

  1. 1.5-2 times slower initial compile (done once). Those who use Crystal for scripting would need to pass the --level0 option manually.
  2. Less detailed backtraces; you need to compile with the --debug option to get the same backtrace as the default. (Adding this option makes compilation 15% slower.)

Pluses:

  1. Much faster run time, which is good for debugging big applications and for large numbers of specs.
  2. Incremental compile speed similar to the default. Most of the time is spent in incremental compiles.

@funny-falcon
Contributor

Still no progress? Pity.

@kostya
Contributor Author

kostya commented Oct 31, 2023

Why not merge it? It does not change the current compilation modes (default and --release); it only adds more options for build customization.

@funny-falcon
Contributor

Yeah, it's quite a strange unwillingness to improve the user experience.

@straight-shoota
Member

I'm sorry this has been sitting for so long. It's not unwillingness. There's a lot of review work and limited resources. Sometimes PRs fall through the cracks. 😢
Thanks for calling for attention on this. This is definitely one of the contributions that should not be neglected.

Member

@straight-shoota straight-shoota left a comment

Looks great overall! I have some suggestions for small improvements.
And the merge conflicts need to be resolved via git merge master.

- current_bc_flags = "#{@codegen_target}|#{@mcpu}|#{@mattr}|#{@release}|#{@link_flags}|#{@mcmodel}"
- bc_flags_filename = "#{output_dir}/bc_flags"
+ current_bc_flags = "#{@codegen_target}|#{@mcpu}|#{@mattr}|#{@link_flags}|#{@mcmodel}"
+ bc_flags_filename = "#{output_dir}/bc_flags#{optimization_mode_suffix}"
Member

question: Putting the optimization level in the filename seems like a great idea. It changes from the current behaviour where it's only written in the file contents.
I'm wondering about the effects of this.
I suppose it means the caches for different optimization modes won't override each other.
Does it mean different caches stay around? But what about the actual data files in output_dir?

Contributor Author

Before this PR, all object files were mixed together, but release used only 2 files, _main.o and _main.bc, so the mixing was not so important. This PR adds several modes, so every mode has separate compiled objects (.o and .bc) for each class.
As for the actual data, these are still only cache files; once they become useless, they can be removed with rm -rf ~/.cache/crystal. And of course the caches for each mode stay around. But in everyday usage the cache would be similar to what it was before this PR, because a default compile generates .o0 files, and release generates just 2 files: _main.o.o3 and _main.bc.o3.
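
The naming scheme described in this reply can be sketched in Crystal as follows. The `suffix` method and the `cache_file_name` helper are hypothetical names invented for this illustration, and deriving the suffix from the enum member name is an assumption; only the resulting file names (such as _main.o.o3) come from the discussion.

```crystal
enum OptimizationMode
  O0
  O1
  O2
  O3

  # Hypothetical helper: ".o0", ".o1", ... derived from the member name.
  def suffix : String
    ".#{to_s.downcase}"
  end
end

# Hypothetical helper that appends the mode suffix to a cache file name,
# so files built with different modes do not overwrite each other.
def cache_file_name(base : String, mode : OptimizationMode) : String
  "#{base}#{mode.suffix}"
end

puts cache_file_name("_main.o", OptimizationMode::O3)  # prints _main.o.o3
puts cache_file_name("_main.bc", OptimizationMode::O3) # prints _main.bc.o3
puts cache_file_name("_main.o", OptimizationMode::O0)  # prints _main.o.o0
```

Because the mode is part of the file name, switching between modes reuses each mode's own cached objects instead of invalidating them.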

@@ -755,7 +810,7 @@ module Crystal
end

if must_compile
- compiler.optimize llvm_mod if compiler.release?
+ compiler.optimize llvm_mod if compiler.optimization_mode != OptimizationMode::O0
Member

suggestion: Predicate method is type safe and more concise:

Suggested change
- compiler.optimize llvm_mod if compiler.optimization_mode != OptimizationMode::O0
+ compiler.optimize llvm_mod unless compiler.optimization_mode.o0?
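
For readers unfamiliar with the predicate form: Crystal generates a question method for every enum member, so no extra code is needed for `o0?` to exist. A small self-contained sketch, using a stand-in enum with the same member names as in the diff:

```crystal
# Stand-in enum with the same member names as in the diff.
enum OptimizationMode
  O0
  O1
  O2
  O3
end

mode = OptimizationMode::O2

# Question methods are generated automatically for each member.
puts mode.o0? # prints false
puts mode.o2? # prints true

# Both forms express the same condition.
puts mode != OptimizationMode::O0 # prints true
puts !mode.o0?                    # prints true
```

The predicate only exists on the enum type, so calling it on a value of any other type fails at compile time, whereas a `!=` comparison against an unrelated value would still compile.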

@@ -145,7 +145,7 @@ class Crystal::Program

# Although release takes longer, once the bc is cached in .crystal
# the subsequent times will make program execution faster.
- host_compiler.release = true
+ host_compiler.release!
Member

note: According to #13505 (comment) ff. it seems to be more efficient to build macros not in a single module.
It's probably best to leave the current behaviour in place here. We can follow up with a change to macro generation config. Since the host compiler inherits its configuration from the target compiler, there's more involved than just switching this to host_compiler.optimization_mode = :o3.

Contributor Author

@kostya kostya Nov 7, 2023

I'm not sure about changing this, because as I understand it, this is for compiling run macros, which should have a fast run time, and it is done only once. So release! is appropriate here.

Contributor

@funny-falcon funny-falcon Nov 7, 2023

@kostya macros should run "fast enough". Code compiled without "single-module" is certainly fast enough for macros, since they are not computation-heavy. I doubt you could measure the difference; I'd bet a case of beer on it. But the difference in time spent compiling macros is certainly measurable.

@kostya
Contributor Author

kostya commented Nov 4, 2023

rebased, squashed

Member

@straight-shoota straight-shoota left a comment

LGTM.

P.S. Next time, please do not force push. Just merge and amend new commits. That makes reviews easier. Thanks. 🙏 (ref: https://github.com/crystal-lang/crystal/blob/master/CONTRIBUTING.md#making-good-pull-requests).

@straight-shoota straight-shoota added this to the 1.11.0 milestone Nov 4, 2023
@zw963
Copy link
Contributor

zw963 commented Nov 5, 2023

Cool.

@straight-shoota straight-shoota merged commit e838701 into crystal-lang:master Nov 6, 2023
54 of 55 checks passed
Blacksmoke16 pushed a commit to Blacksmoke16/crystal that referenced this pull request Dec 11, 2023
8 participants