As was proposed here, I decided to test resvg with more advanced compiler optimizations: Link-Time Optimization (LTO), Profile-Guided Optimization (PGO), and Post-Link Optimization (PLO). I have recently tested PGO on projects across many software domains - all the results are available at https://github.com/zamazan4ik/awesome-pgo . Here are my results for this project - I hope they will be helpful to someone.
Test environment
Fedora 39
Linux kernel 6.8.9
AMD Ryzen 9 5900x
48 GiB RAM
SSD Samsung 980 Pro 2 TiB
Compiler - Rustc 1.78.0
resvg version: the latest for now from the master branch on commit 4b4e8970de29407e6257aac3d2f501b60e88236a
Disabled Turbo boost
Benchmark
For the benchmark, I use a simple scenario: converting an SVG file to a PNG file with the resvg input.svg output.png command. For PGO I use the cargo-pgo tool. The release build is done with cargo build --release, the PGO-instrumented build with cargo pgo build, and the PGO-optimized build with cargo pgo optimize build.
taskset -c 0 is used during all measurements to reduce the OS scheduler's influence on the results. All measurements are done on the same machine, with the same background "noise" (as far as I can guarantee).
As an input file for the training purposes for the resvg input.svg output.png command, I use this file.
Additionally, I decided to re-enable LTO for the tool. You disabled this optimization nearly 5 years ago due to some compiler bugs. I guess that over the last 5 years the LTO implementation in the compiler has become much more stable, so we can consider enabling it once again. For resvg during the benchmarks, I enabled it with the following addition to the Cargo.toml file:
[profile.release]
codegen-units = 1
lto = true
Post-Link Optimization is also done with cargo-pgo with the same training workload as for the PGO step.
Results
Firstly, let's check the scenario when the training workload and the benchmark workload are the same. Such a benchmark is still useful for scenarios where you need to convert the same file many times (like a part of CI without caching):
hyperfine --warmup 5 --min-runs 15 'taskset -c 0 ./resvg_release large_svg_file_1.svg large_png_file_1.png' 'taskset -c 0 ./resvg_release_lto large_svg_file_1.svg large_png_file_1.png' 'taskset -c 0 ./resvg_lto_optimized large_svg_file_1.svg large_png_file_1.png' 'taskset -c 0 ./resvg_lto_bolt_optimized large_svg_file_1.svg large_png_file_1.png'
where:
resvg_release - the regular release build
resvg_release_lto - release + LTO
resvg_lto_optimized - release + LTO + PGO optimized
resvg_lto_bolt_optimized - release + LTO + PGO optimized + BOLT optimized
Benchmark 1: taskset -c 0 ./resvg_release large_svg_file_1.svg large_png_file_1.png
Time (mean ± σ): 3.349 s ± 0.011 s [User: 3.082 s, System: 0.257 s]
Range (min … max): 3.333 s … 3.368 s 15 runs
Benchmark 2: taskset -c 0 ./resvg_release_lto large_svg_file_1.svg large_png_file_1.png
Time (mean ± σ): 3.062 s ± 0.018 s [User: 2.802 s, System: 0.250 s]
Range (min … max): 3.040 s … 3.120 s 15 runs
Benchmark 3: taskset -c 0 ./resvg_lto_optimized large_svg_file_1.svg large_png_file_1.png
Time (mean ± σ): 2.631 s ± 0.008 s [User: 2.368 s, System: 0.255 s]
Range (min … max): 2.622 s … 2.644 s 15 runs
Benchmark 4: taskset -c 0 ./resvg_lto_bolt_optimized large_svg_file_1.svg large_png_file_1.png
Time (mean ± σ): 2.611 s ± 0.007 s [User: 2.347 s, System: 0.256 s]
Range (min … max): 2.598 s … 2.622 s 15 runs
Summary
taskset -c 0 ./resvg_lto_bolt_optimized large_svg_file_1.svg large_png_file_1.png ran
1.01 ± 0.00 times faster than taskset -c 0 ./resvg_lto_optimized large_svg_file_1.svg large_png_file_1.png
1.17 ± 0.01 times faster than taskset -c 0 ./resvg_release_lto large_svg_file_1.svg large_png_file_1.png
1.28 ± 0.01 times faster than taskset -c 0 ./resvg_release large_svg_file_1.svg large_png_file_1.png
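As a sanity check, the ratios in hyperfine's summary are simply each binary's mean time divided by the fastest binary's mean time. A small sketch using the means reported above (the dictionary keys are my own shorthand, not hyperfine output):

```python
# Mean wall-clock times (seconds) from the hyperfine run above.
means = {
    "release": 3.349,
    "release_lto": 3.062,
    "lto_pgo": 2.631,
    "lto_pgo_bolt": 2.611,
}

fastest = min(means.values())  # the PGO+BOLT build
for name, t in sorted(means.items(), key=lambda kv: kv[1]):
    # hyperfine reports each mean divided by the fastest mean
    print(f"{name}: {t / fastest:.2f}x")
```

This reproduces the 1.28x / 1.17x / 1.01x figures from the summary.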
According to the results, LTO and PGO measurably improve performance. However, BOLT didn't improve things much further.
What if the training and benchmark workloads are different files? For this, I used the same training file as above but another file for the benchmark. Here we go:
hyperfine --warmup 5 --min-runs 15 'taskset -c 0 ./resvg_release large_svg_file_3.svg large_png_file_3.png' 'taskset -c 0 ./resvg_release_lto large_svg_file_3.svg large_png_file_3.png' 'taskset -c 0 ./resvg_lto_optimized large_svg_file_3.svg large_png_file_3.png' 'taskset -c 0 ./resvg_lto_bolt_optimized large_svg_file_3.svg large_png_file_3.png'
Benchmark 1: taskset -c 0 ./resvg_release large_svg_file_3.svg large_png_file_3.png
Time (mean ± σ): 2.398 s ± 0.006 s [User: 2.260 s, System: 0.131 s]
Range (min … max): 2.391 s … 2.414 s 15 runs
Benchmark 2: taskset -c 0 ./resvg_release_lto large_svg_file_3.svg large_png_file_3.png
Time (mean ± σ): 2.130 s ± 0.008 s [User: 1.991 s, System: 0.133 s]
Range (min … max): 2.123 s … 2.157 s 15 runs
Benchmark 3: taskset -c 0 ./resvg_lto_optimized large_svg_file_3.svg large_png_file_3.png
Time (mean ± σ): 1.846 s ± 0.006 s [User: 1.707 s, System: 0.134 s]
Range (min … max): 1.838 s … 1.859 s 15 runs
Benchmark 4: taskset -c 0 ./resvg_lto_bolt_optimized large_svg_file_3.svg large_png_file_3.png
Time (mean ± σ): 1.864 s ± 0.021 s [User: 1.723 s, System: 0.135 s]
Range (min … max): 1.851 s … 1.935 s 15 runs
Warning: Statistical outliers were detected. Consider re-running this benchmark on a quiet system without any interferences from other programs. It might help to use the '--warmup' or '--prepare' options.
Summary
taskset -c 0 ./resvg_lto_optimized large_svg_file_3.svg large_png_file_3.png ran
1.01 ± 0.01 times faster than taskset -c 0 ./resvg_lto_bolt_optimized large_svg_file_3.svg large_png_file_3.png
1.15 ± 0.01 times faster than taskset -c 0 ./resvg_release_lto large_svg_file_3.svg large_png_file_3.png
1.30 ± 0.01 times faster than taskset -c 0 ./resvg_release large_svg_file_3.svg large_png_file_3.png
We get a performance boost once again with a different file. I suppose this is because the two files exercise similar paths inside the tool, but I cannot say more since I am not an SVG expert at all :)
However, there are cases showing that training on only one file is not sufficient - e.g. let's use this file for the benchmark (the training file remains the same as in the tests above):
hyperfine --warmup 5 --min-runs 15 'taskset -c 0 ./resvg_release large_svg_file_2.svg large_png_file_2.png' 'taskset -c 0 ./resvg_release_lto large_svg_file_2.svg large_png_file_2.png' 'taskset -c 0 ./resvg_lto_optimized large_svg_file_2.svg large_png_file_2.png' 'taskset -c 0 ./resvg_lto_bolt_optimized large_svg_file_2.svg large_png_file_2.png'
Benchmark 1: taskset -c 0 ./resvg_release large_svg_file_2.svg large_png_file_2.png
Time (mean ± σ): 1.415 s ± 0.003 s [User: 1.040 s, System: 0.357 s]
Range (min … max): 1.409 s … 1.421 s 15 runs
Benchmark 2: taskset -c 0 ./resvg_release_lto large_svg_file_2.svg large_png_file_2.png
Time (mean ± σ): 1.439 s ± 0.004 s [User: 1.055 s, System: 0.365 s]
Range (min … max): 1.429 s … 1.445 s 15 runs
Benchmark 3: taskset -c 0 ./resvg_lto_optimized large_svg_file_2.svg large_png_file_2.png
Time (mean ± σ): 1.488 s ± 0.002 s [User: 1.107 s, System: 0.361 s]
Range (min … max): 1.483 s … 1.491 s 15 runs
Benchmark 4: taskset -c 0 ./resvg_lto_bolt_optimized large_svg_file_2.svg large_png_file_2.png
Time (mean ± σ): 1.497 s ± 0.002 s [User: 1.116 s, System: 0.363 s]
Range (min … max): 1.493 s … 1.502 s 15 runs
Summary
taskset -c 0 ./resvg_release large_svg_file_2.svg large_png_file_2.png ran
1.02 ± 0.00 times faster than taskset -c 0 ./resvg_release_lto large_svg_file_2.svg large_png_file_2.png
1.05 ± 0.00 times faster than taskset -c 0 ./resvg_lto_optimized large_svg_file_2.svg large_png_file_2.png
1.06 ± 0.00 times faster than taskset -c 0 ./resvg_lto_bolt_optimized large_svg_file_2.svg large_png_file_2.png
Here we see some performance decrease from all the optimizations (even from LTO, which is strange). It shows that the PGO training set should be wider.
Just for reference, I also measured the tool's slowdown during the PGO and PLO training phases:
hyperfine --warmup 5 --min-runs 15 'taskset -c 0 ./resvg_lto_instrumented large_svg_file_1.svg large_png_file_1.png' 'taskset -c 0 ./resvg_lto_bolt_instrumented large_svg_file_1.svg large_png_file_1.png'
where:
resvg_lto_instrumented - release + LTO + PGO instrumentation
resvg_lto_bolt_instrumented - release + LTO + PGO optimization + BOLT instrumentation
Benchmark 1: taskset -c 0 ./resvg_lto_instrumented large_svg_file_1.svg large_png_file_1.png
Time (mean ± σ): 3.670 s ± 0.062 s [User: 3.397 s, System: 0.262 s]
Range (min … max): 3.638 s … 3.891 s 15 runs
Benchmark 2: taskset -c 0 ./resvg_lto_bolt_instrumented large_svg_file_1.svg large_png_file_1.png
Time (mean ± σ): 4.593 s ± 0.010 s [User: 4.223 s, System: 0.338 s]
Range (min … max): 4.572 s … 4.610 s 15 runs
Summary
taskset -c 0 ./resvg_lto_instrumented large_svg_file_1.svg large_png_file_1.png ran
1.25 ± 0.02 times faster than taskset -c 0 ./resvg_lto_bolt_instrumented large_svg_file_1.svg large_png_file_1.png
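To put the training-phase cost in context, the instrumentation overhead can be estimated by comparing these means against the plain LTO release build (3.062 s mean from the first benchmark). A quick sketch (the variable names are mine):

```python
baseline = 3.062           # release + LTO mean, seconds
pgo_instrumented = 3.670   # PGO-instrumented mean, seconds
bolt_instrumented = 4.593  # BOLT-instrumented mean, seconds

for name, t in [("PGO", pgo_instrumented), ("BOLT", bolt_instrumented)]:
    overhead = (t / baseline - 1) * 100  # percent slower than the baseline
    print(f"{name} instrumentation overhead: ~{overhead:.0f}%")
```

So PGO instrumentation costs roughly 20% here, while BOLT instrumentation costs roughly 50% - worth keeping in mind when planning a training run.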
Further steps
I can suggest the following action points:
Enable LTO. I expect a general performance boost "for free" plus a binary size reduction.
Perform more PGO benchmarks with other datasets (if you are interested enough in it). If they show improvements, add a note to the documentation (the README file, I guess) about possible improvements in resvg's performance with PGO.
You could probably also get some insights into how the code can be optimized further based on the changes the compiler made with PGO - e.g. more aggressive inlining. This can be done by analyzing flamegraphs before and after applying PGO to understand the difference.
Testing Post-Link Optimization techniques (like LLVM BOLT) with wider datasets would be interesting too (Clang and Rustc already use BOLT as an addition to PGO). However, I recommend starting with the usual PGO, since it's a much more mature technology with far fewer limitations.
I would be happy to answer your questions about PGO.
P.S. Please do not treat this issue as a bug report or anything like that - it's just a benchmark report. Since the "Discussions" functionality is disabled in this repo, I created an issue instead.