Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Enable LTO optimizations in release builds to reduce binary size #5904

Merged
merged 1 commit into from
Aug 8, 2024

Conversation

davfsa
Copy link
Contributor

@davfsa davfsa commented Aug 8, 2024

Summary

In the same spirit as #5745, release builds could be a bit slightly more size efficient by enabling LTO, which removes dead code (either in uv through fully inlined functions or the libraries it depends on). Also has the side-effect (more what LTO was created for) of slighly speeding up uv.

In this case, I have measured a 5MB size decrease!.

Unfortunately, this change also comes with the disadvantage of more than doubling the build time of a clean build on my machine (see "Test Plan"). I have opened this pull request to show my findings and suggest this as an option.

I have also started looking into what effects optimizing for size rather than speed could have, but that calls for another pr

Test Plan

Comparing the binary size before and after (starting off in just a simple clone of the repository)

System info:

CPU: AMD Ryzen 7 3700X (16) @ 3.600GHz
Memory: 32GB @ 3200 MT/s
Uname: Linux galaxy 6.6.44-1-MANJARO #1 SMP PREEMPT_DYNAMIC Sat Aug  3 10:09:33 UTC 2024 x86_64 GNU/Linux

Before:

$ cargo build --release
<snip>
Finished `release` profile [optimized] target(s) in 1m 29s

$ du target/release/uv -h
30M	target/release/uv

After:

$ cargo build --release
<snip>
Finished `release` profile [optimized] target(s) in 3m 43s

$ du target/release/uv -h
25M	target/release/uv

@davfsa davfsa changed the title Enable LTO optimizations to decrease binary size Enable LTO optimizations in release builds to reduce binary size Aug 8, 2024
Copy link

codspeed-hq bot commented Aug 8, 2024

CodSpeed Performance Report

Merging #5904 will improve performances by 14.9%

Comparing davfsa:task/enable-lto (537ecf6) with main (acbd367)

Summary

⚡ 6 improvements
✅ 7 untouched benchmarks

Benchmarks breakdown

Benchmark main davfsa:task/enable-lto Change
build_platform_tags[burntsushi-archlinux] 1.3 ms 1.1 ms +13.69%
wheelname_parsing[flyte-long-compatible] 10 µs 8.8 µs +12.91%
wheelname_parsing[flyte-short-compatible] 6.3 µs 5.5 µs +14.54%
wheelname_parsing[flyte-short-incompatible] 6.3 µs 5.5 µs +14.28%
wheelname_parsing_failure[flyte-long-extension] 1.9 µs 1.6 µs +14.9%
wheelname_parsing_failure[flyte-short-extension] 1.9 µs 1.7 µs +14.64%

@konstin
Copy link
Member

konstin commented Aug 8, 2024

I'm seeing a small speedup, too:

hyperfine --prepare "rm -f uv.lock" --warmup 3 --runs 30 "~/projects/uv-main/uv-main lock" "~/projects/uv-main/uv-lto lock"
Benchmark 1: ~/projects/uv-main/uv-main lock
  Time (mean ± σ):     218.1 ms ±   3.6 ms    [User: 217.5 ms, System: 129.5 ms]
  Range (min … max):   212.5 ms … 227.1 ms    30 runs
 
Benchmark 2: ~/projects/uv-main/uv-lto lock
  Time (mean ± σ):     208.5 ms ±   4.3 ms    [User: 204.8 ms, System: 128.0 ms]
  Range (min … max):   201.7 ms … 222.1 ms    30 runs

@konstin konstin added the performance Potential performance improvement label Aug 8, 2024
@T-256
Copy link
Contributor

T-256 commented Aug 8, 2024

Comparing the binary size before and after (starting off in just a simple clone of the repository)

Can you also share final released wheels size (compressed)?

@davfsa
Copy link
Contributor Author

davfsa commented Aug 8, 2024

Can you also share final released wheels size (compressed)?

Absolutely! Currently struggling to find how to build those (can't find any job where that is done and the release-pypi ci eludes to cargo-dist being what I need, but I have found no documentation on how to generate only the wheels

@konstin
Copy link
Member

konstin commented Aug 8, 2024

You can use maturin build --release (or even uvx maturin build --release)

@davfsa
Copy link
Contributor Author

davfsa commented Aug 8, 2024

Thanks @konstin!

@T-256 Here are your results (a bit underwhelming actually):

Before:

$ du targets/wheels/uv-0.2.34-py3-none-linux_x86_64.whl -h
12M	uv-0.2.34-py3-none-linux_x86_64.whl

After:

$ du target/wheels/uv-0.2.34-py3-none-linux_x86_64.whl -h
11M	target/wheels/uv-0.2.34-py3-none-linux_x86_64.whl

Btw, thanks for sharing the discussion about this in the ruff repository! If necessary, I can spend some time playing around a bit with build optimizations to make it take less time to compile release builds. Build times did take a massive hit (at least on my machine, from ~1m to ~3m on a fresh build).

Just through a quick read of discussion you linked and the other ones linked by other people, it seems like there is a lot more to consider here than I initially thought.

@charliermarsh charliermarsh merged commit cac3c4d into astral-sh:main Aug 8, 2024
57 checks passed
@charliermarsh
Copy link
Member

I'm supportive of this. Saving on binary size is a win for the registry, CDN, and users. We can always revisit in the future.

@BurntSushi
Copy link
Member

I would say I'm somewhat mildly negative on this change, largely for the same reasons I discussed here for ruff: astral-sh/ruff#9031

Building uv already takes a very long time, and doubling its build time for small improvements to binary size and runtime speed doesn't seem like the best trade off to me. Although the improvement to binary size does look pretty good here. Compounded over lots of downloads, that could really add up.

With that said, I'm already using --profile profiling locally when building uv for benchmarking, which is different from the release configuration (but only insomuch as debug symbols aren't stripped). With this change, the configurations differ even more now, with optimizations potentially being different between the builds. This means that when we're doing benchmarking locally, we'll be measuring the thing we aren't actually shipping. It is very difficult for me to estimate just how big of a deal that is. It's probably not a huge deal in the vast majority of circumstances. But it's a concern I have.

I'm not strongly opposed to this change though. So I fine with keeping it for now and evaluating just how much the longer build times annoy us.

@davfsa
Copy link
Contributor Author

davfsa commented Aug 8, 2024

First of all, thanks for joining into the conversation @BurntSushi. Your comments on astral-sh/ruff#9224 and similar were insanely insightful!

With that said, I'm already using --profile profiling locally when building uv for benchmarking, which is different from the release configuration (but only insomuch as debug symbols aren't stripped). With this change, the configurations differ even more now, with optimizations potentially being different between the builds. This means that when we're doing benchmarking locally, we'll be measuring the thing we aren't actually shipping.

The profiling profile inherits from the release one, so I believe it should have lto enabled too (not a cargo expert). Then the only differences would be those that are specifically marked (being not stripping the binary and enabling debug at the time of writing)

I have also opened #5909 to try to further optimize binary size without giving up too much performance nor build times :)

@BurntSushi
Copy link
Member

The profiling profile inherits from the release one, so I believe it should have lto enabled too (not a cargo expert).

Oof. Right. That needs to be undone. It is not tenable to wait minutes between benchmarking runs IMO. (The compile times are already too long.)

@davfsa
Copy link
Contributor Author

davfsa commented Aug 8, 2024

Oof. Right. That needs to be undone. It is not tenable to wait minutes between benchmarking runs IMO. (The compile times are already too long.)

I'm sorry for the build slowdowns! Didn't initially think they were done for anything else than a release, where I thought it would be fine to give up some build time to reduce the final binary.

If #5909 is not good enough (from a clean build, it now only takes about 30s more than before and 1m 45s less than with LTO only, at a slight performance cost that is), I would understand this PR being reverted

@BurntSushi
Copy link
Member

I don't think we need to revert the PR. I think we just need to make the profiling profile not inherit from release.

BurntSushi added a commit that referenced this pull request Aug 9, 2024
This PR tweaks the change made in #5904 so that the `profiling` Cargo
profile does _not_ have LTO enabled. With LTO enabled, compile times
even after just doing a `touch crates/uv/src/bin/uv.rs` are devastating:

    $ cargo b --profile profiling -p uv
       Compiling uv-cli v0.0.1 (/home/andrew/astral/uv/crates/uv-cli)
       Compiling uv v0.2.34 (/home/andrew/astral/uv/crates/uv)
        Finished `profiling` profile [optimized + debuginfo] target(s) in 3m 47s

Even with `lto = "thin"`, compile times are not great, but an
improvement:

    $ cargo b --profile profiling -p uv
       Compiling uv v0.2.34 (/home/andrew/astral/uv/crates/uv)
        Finished `profiling` profile [optimized + debuginfo] target(s) in 53.98s

But our original configuration for `profiling`, prior to #5904, was with
LTO completely disabled:

    $ cargo b --profile profiling -p uv
       Compiling uv v0.2.34 (/home/andrew/astral/uv/crates/uv)
        Finished `profiling` profile [optimized + debuginfo] target(s) in 30.09s

This gives reasonable-ish compile times, although I still want them to
be better.

This setup does risk that we are measuring something in benchmarks that
we are shipping, but in order to make those two the same, we'd either
need to make compile times way worse for development, or take a hit
to binary size and a slight hit to runtime performance in our release
builds. I would weakly prefer that we accept the hit to runtime
performance and binary size in order to bring our measurements in line
with what we ship, but I _strongly_ feel that we should not have compile
times exceeding minutes for development. When doing performance testing,
long compile times, for me anyway, break "flow" state.

A confounding factor here was that #5904 enabled LTO for the `release`
profile, but the `dist` profile (used by `cargo dist`) was still setting
it to `lto = "thin"`. However, because of shenanigans in our release
pipeline, we we actually using the `release` profile for binaries we
ship. This PR does not make any changes here other than to remove `lto =
"thin"` from the `dist` profile to make the fact that they are the same
a bit clearer.

cc @davfsa
BurntSushi added a commit that referenced this pull request Aug 9, 2024
This PR tweaks the change made in #5904 so that the `profiling` Cargo
profile does _not_ have LTO enabled. With LTO enabled, compile times
even after just doing a `touch crates/uv/src/bin/uv.rs` are devastating:

    $ cargo b --profile profiling -p uv
       Compiling uv-cli v0.0.1 (/home/andrew/astral/uv/crates/uv-cli)
       Compiling uv v0.2.34 (/home/andrew/astral/uv/crates/uv)
        Finished `profiling` profile [optimized + debuginfo] target(s) in 3m 47s

Even with `lto = "thin"`, compile times are not great, but an
improvement:

    $ cargo b --profile profiling -p uv
       Compiling uv v0.2.34 (/home/andrew/astral/uv/crates/uv)
        Finished `profiling` profile [optimized + debuginfo] target(s) in 53.98s

But our original configuration for `profiling`, prior to #5904, was with
LTO completely disabled:

    $ cargo b --profile profiling -p uv
       Compiling uv v0.2.34 (/home/andrew/astral/uv/crates/uv)
        Finished `profiling` profile [optimized + debuginfo] target(s) in 30.09s

This gives reasonable-ish compile times, although I still want them to
be better.

This setup does risk that we are measuring something in benchmarks that
we are shipping, but in order to make those two the same, we'd either
need to make compile times way worse for development, or take a hit
to binary size and a slight hit to runtime performance in our release
builds. I would weakly prefer that we accept the hit to runtime
performance and binary size in order to bring our measurements in line
with what we ship, but I _strongly_ feel that we should not have compile
times exceeding minutes for development. When doing performance testing,
long compile times, for me anyway, break "flow" state.

A confounding factor here was that #5904 enabled LTO for the `release`
profile, but the `dist` profile (used by `cargo dist`) was still setting
it to `lto = "thin"`. However, because of shenanigans in our release
pipeline, we we actually using the `release` profile for binaries we
ship. This PR does not make any changes here other than to remove `lto =
"thin"` from the `dist` profile to make the fact that they are the same
a bit clearer.

cc @davfsa
zanieb pushed a commit that referenced this pull request Aug 9, 2024
This PR tweaks the change made in #5904 so that the `profiling` Cargo
profile does _not_ have LTO enabled. With LTO enabled, compile times
even after just doing a `touch crates/uv/src/bin/uv.rs` are devastating:

    $ cargo b --profile profiling -p uv
       Compiling uv-cli v0.0.1 (/home/andrew/astral/uv/crates/uv-cli)
       Compiling uv v0.2.34 (/home/andrew/astral/uv/crates/uv)
        Finished `profiling` profile [optimized + debuginfo] target(s) in 3m 47s

Even with `lto = "thin"`, compile times are not great, but an
improvement:

    $ cargo b --profile profiling -p uv
       Compiling uv v0.2.34 (/home/andrew/astral/uv/crates/uv)
        Finished `profiling` profile [optimized + debuginfo] target(s) in 53.98s

But our original configuration for `profiling`, prior to #5904, was with
LTO completely disabled:

    $ cargo b --profile profiling -p uv
       Compiling uv v0.2.34 (/home/andrew/astral/uv/crates/uv)
        Finished `profiling` profile [optimized + debuginfo] target(s) in 30.09s

This gives reasonable-ish compile times, although I still want them to
be better.

This setup does risk that we are measuring something in benchmarks that
we are shipping, but in order to make those two the same, we'd either
need to make compile times way worse for development, or take a hit
to binary size and a slight hit to runtime performance in our release
builds. I would weakly prefer that we accept the hit to runtime
performance and binary size in order to bring our measurements in line
with what we ship, but I _strongly_ feel that we should not have compile
times exceeding minutes for development. When doing performance testing,
long compile times, for me anyway, break "flow" state.

A confounding factor here was that #5904 enabled LTO for the `release`
profile, but the `dist` profile (used by `cargo dist`) was still setting
it to `lto = "thin"`. However, because of shenanigans in our release
pipeline, we we actually using the `release` profile for binaries we
ship. This PR does not make any changes here other than to remove `lto =
"thin"` from the `dist` profile to make the fact that they are the same
a bit clearer.

cc @davfsa
@zanieb
Copy link
Member

zanieb commented Aug 9, 2024

This change was not tested for all of our release builds unfortunately, it's failing in #5984 — considering reverting to unblock the release and revisiting.

tmeijn pushed a commit to tmeijn/dotfiles that referenced this pull request Aug 10, 2024
This MR contains the following updates:

| Package | Update | Change |
|---|---|---|
| [astral-sh/uv](https://github.com/astral-sh/uv) | patch | `0.2.33` -> `0.2.35` |

MR created with the help of [el-capitano/tools/renovate-bot](https://gitlab.com/el-capitano/tools/renovate-bot).

**Proposed changes to behavior should be submitted there as MRs.**

---

### Release Notes

<details>
<summary>astral-sh/uv (astral-sh/uv)</summary>

### [`v0.2.35`](https://github.com/astral-sh/uv/blob/HEAD/CHANGELOG.md#0235)

[Compare Source](astral-sh/uv@0.2.34...0.2.35)

##### CLI

-   Deprecate `--system` and `--no-system` in `uv venv` ([#&#8203;5925](astral-sh/uv#5925))
-   Make `--upgrade` imply `--refresh` ([#&#8203;5943](astral-sh/uv#5943))
-   Warn when there are missing bounds on transitive dependencies with `--resolution-strategy lowest` ([#&#8203;5953](astral-sh/uv#5953))

##### Configuration

-   Add support for `no-build-isolation-package` ([#&#8203;5894](astral-sh/uv#5894))

##### Performance

-   Enable LTO optimizations in release builds to reduce binary size ([#&#8203;5904](astral-sh/uv#5904))
-   Prefetch metadata in `--no-deps` mode ([#&#8203;5918](astral-sh/uv#5918))

##### Bug fixes

-   Display portable paths in POSIX virtual environment activation commands ([#&#8203;5956](astral-sh/uv#5956))
-   Respect subdirectories when locating Git workspaces ([#&#8203;5944](astral-sh/uv#5944))

##### Documentation

-   Improve the `uv venv` CLI documentation ([#&#8203;5963](astral-sh/uv#5963))

### [`v0.2.34`](https://github.com/astral-sh/uv/blob/HEAD/CHANGELOG.md#0234)

[Compare Source](astral-sh/uv@0.2.33...0.2.34)

##### Enhancements

-   Always strip in release mode ([#&#8203;5745](astral-sh/uv#5745))
-   Assume `git+` prefix when URLs end in `.git` ([#&#8203;5868](astral-sh/uv#5868))
-   Support build constraints ([#&#8203;5639](astral-sh/uv#5639))

##### CLI

-   Create help sections for build, install, resolve, and index ([#&#8203;5693](astral-sh/uv#5693))
-   Improve CLI documentation for global options ([#&#8203;5834](astral-sh/uv#5834))
-   Improve `--python` CLI documentation ([#&#8203;5869](astral-sh/uv#5869))
-   Improve display order of top-level commands ([#&#8203;5830](astral-sh/uv#5830))

##### Bug fixes

-   Allow downloading wheels for metadata with `--no-binary` ([#&#8203;5707](astral-sh/uv#5707))
-   Reject `pyproject.toml` in `--config-file` ([#&#8203;5842](astral-sh/uv#5842))
-   Remove double-proxy nodes in error reporting ([#&#8203;5738](astral-sh/uv#5738))
-   Respect pre-release preferences from input files ([#&#8203;5736](astral-sh/uv#5736))
-   Support overlapping local and non-local requirements in forks ([#&#8203;5812](astral-sh/uv#5812))

</details>

---

### Configuration

📅 **Schedule**: Branch creation - At any time (no schedule defined), Automerge - At any time (no schedule defined).

🚦 **Automerge**: Disabled by config. Please merge this manually once you are satisfied.

♻ **Rebasing**: Whenever MR becomes conflicted, or you tick the rebase/retry checkbox.

🔕 **Ignore**: Close this MR and you won't be reminded about this update again.

---

 - [ ] <!-- rebase-check -->If you want to rebase/retry this MR, check this box

---

This MR has been generated by [Renovate Bot](https://github.com/renovatebot/renovate).
<!--renovate-debug:eyJjcmVhdGVkSW5WZXIiOiIzNy40NDAuNyIsInVwZGF0ZWRJblZlciI6IjM3LjQ0MC43IiwidGFyZ2V0QnJhbmNoIjoibWFpbiIsImxhYmVscyI6WyJSZW5vdmF0ZSBCb3QiXX0=-->
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
performance Potential performance improvement
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants