reduce the final binary size #4436

Closed

tpgxyz
Contributor

@tpgxyz tpgxyz commented Feb 25, 2023

Hi, it looks like the release binary can be made a bit smaller.

[tpg@omv-rockpro64 LOL]$ stat -c%s uu-coreutils
19810944
[tpg@omv-rockpro64 LOL]$ stat -c%s uu-coreutils-opt 
15533632
[tpg@omv-rockpro64 LOL]$ echo $(($(stat -c%s uu-coreutils) - $(stat -c%s uu-coreutils-opt) ))
4277312
[tpg@omv-rockpro64 LOL]$ echo $((($(stat -c%s uu-coreutils) - $(stat -c%s uu-coreutils-opt))/1024/1024 ))
4

Compared to GNU coreutils compiled with the corresponding CFLAGS/LDFLAGS and the same LLVM toolchain (15.0.7):

[tpg@omv-rockpro64 coreutils]$ stat -c%s /usr/bin/coreutils 
1067680

@zleyyij
Contributor

zleyyij commented Feb 25, 2023

This change increases compile times and removes optimizations provided by loop vectorization, so there are some drawbacks. Disabling incremental compiling is especially of note.

@tpgxyz
Contributor Author

tpgxyz commented Feb 26, 2023

@zleyyij Hi, thanks for your comment.
I'm aware of what "-Oz" does, and I guess that for coreutils a speed gain of 0.01s should be secondary to the size reduction goal.

Compared to GNU coreutils, which is 10 MiB, I think it is a huge disadvantage for uutils to install 20 MiB for a binary with a "close-to-original" feature set.

@tertsdiepraam
Member

tertsdiepraam commented Feb 26, 2023

I'm aware of what "-Oz" does, and I guess that for coreutils a speed gain of 0.01s should be secondary to the size reduction goal.

Your disagreement here is interesting, because both perspectives make sense. I think we can cater to both use cases by providing multiple profiles. Maybe a release-small profile? Then users/distros can pick the profile they prefer.

I find the most interesting changes regarding binary size are more in the direction of removing dependencies and those kinds of things.

This change increases compile times

This is not really an issue on release builds is it? Anyway, a separate profile would solve this too.

Let's get into the specific changes though.

  1. opt-level=z seems a bit much to me. Could you do some measurements of the size with opt-level=3, opt-level=s and opt-level=z and see what the differences are?
  2. debug=false should already be the default for release, right?
  3. lto=thin, why did you not go for true here? Does true/"fat" increase compile times too much?
  4. Does incremental=false really make a big difference?

There's no strip here, can we make improvements by setting that?
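
For orientation, the four settings questioned in the list above would look roughly like this in Cargo.toml. This is a sketch reconstructed from the discussion, not a copy of the PR diff, so the exact spelling in the PR may differ.

```toml
# Reconstructed from the four points above (assumption: all of them were
# set on the release profile); not copied from the PR itself.
[profile.release]
opt-level = "z"      # optimize for size (point 1)
debug = false        # already Cargo's default for the release profile (point 2)
lto = "thin"         # ThinLTO instead of true/"fat" (point 3)
incremental = false  # also Cargo's default for the release profile (point 4)
```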


For reference, I was also looking at this: https://github.com/johnthagen/min-sized-rust

@tertsdiepraam
Member

There's also some previous discussion in this issue: #747

@tpgxyz
Contributor Author

tpgxyz commented Feb 26, 2023

Maybe a release-small profile?

Well this means 80% of people will not notice it, because by default they will compile as a "release" target

  1. opt-level=z seems a bit much to me. Could you do some measurements of the size with opt-level=3, opt-level=s and opt-level=z and see what the differences are?

I'm building coreutils with "-Oz" and it works as expected on production systems. On embedded systems, those few KiB/MiB of saved space mean a lot. Anyway, what kind of outcome should that comparison of different opt-levels give? I can assure you that "-O3" is not going to bring any significant speed improvement or any big difference in size.

3. lto=thin, why did you not go for true here? Does true/"fat" increase compile times too much?

LLVM/clang offers ThinLTO. Thin gives the best outcome for embedded systems compared to full or fat. Fat especially makes no sense here, as it keeps both intermediate code and normally compiled code, unless someone likes bloatware.

4. Does incremental=false really make a big difference?

In release mode you are building it once, so no incremental.
Incremental makes sense if you have a dedicated profile, e.g. for profile-guided optimization, or if you have some kind of cache to build on to speed things up while developing and testing new things.

There's no strip here, can we make improvements by setting that?

Good catch; my builds were done inside RPM, so strip always runs during %install.
I will update this PR.

@tertsdiepraam
Member

tertsdiepraam commented Feb 26, 2023

Well this means 80% of people will not notice it, because by default they will compile as a "release" target

Maybe, but this depends on documentation too. And we could communicate this with distros that are targeted at embedded systems so they can enable it by default.

On embedded systems, those few KiB/MiB of saved space mean a lot.

I understand, but that's a specific use case. I don't think the default configuration of uutils should be for embedded.

Anyway, what kind of outcome should that comparison of different opt-levels give? I can assure you that "-O3" is not going to bring any significant speed improvement or any big difference in size.

I'd like to know what tradeoff we're making exactly. The problem with setting multiple parameters at the same time (like you do in this PR) is that it's hard to guess the impact of each individual setting.

Fat especially makes no sense here, as it keeps both intermediate code and normally compiled code, unless someone likes bloatware.

Could you clarify this part? fat should not yield bigger binaries, right? How does that correlate with bloatware?

@tertsdiepraam
Member

Btw, I'm trying to run some different combinations of these settings myself at the moment. I'll report back in a bit.

@tpgxyz
Contributor Author

tpgxyz commented Feb 26, 2023

I understand, but that's a specific use case. I don't think the default configuration of uutils should be for embedded.

Embedded is one case. Imagine the user experience: with uutils you get a binary twice as big and a less complete feature set compared to GNU coreutils. This project advertises itself as a replacement for coreutils, so let's give people fewer excuses to use GNU coreutils.

Could you clarify this part? fat should not yield bigger binaries, right? How does that correlate with bloatware?

Well guess this is explained here

@tertsdiepraam
Member

Well guess this is explained here

I can't find what you're talking about. What I understand from that page is that there's a "fat" object with all the info necessary for compiling that is kept around. But after compilation that can be thrown away, right? So I don't see how that's bad for embedded.

@github-actions

GNU testsuite comparison:

GNU test failed: tests/misc/timeout. tests/misc/timeout is passing on 'main'. Maybe you have to rebase?

@tertsdiepraam
Member

tertsdiepraam commented Feb 26, 2023

So, I've run my experiment. I ran the compilation for each combination of the following parameters:

  • strip=none, strip=debuginfo, strip=symbols
  • panic=unwind, panic=abort
  • opt-level=3, opt-level=s, opt-level=z
  • lto=off, lto=thin, lto=fat

It took a while, because that's 54 combinations 😄 Unsurprisingly, the best combination is:

5.6M coreutils-symbols-abort-z-fat

That's just 5.6MB! So that's exciting.

For fun, here's the worst combination:

27M coreutils-none-unwind-3-off

All sizes:
5,6M coreutils-symbols-abort-z-fat
6,0M coreutils-symbols-abort-s-fat
6,2M coreutils-symbols-unwind-z-fat
6,4M coreutils-symbols-unwind-s-fat
6,7M coreutils-debuginfo-abort-z-fat
6,8M coreutils-debuginfo-abort-s-fat
7,0M coreutils-symbols-abort-z-thin
7,1M coreutils-symbols-abort-s-thin
7,3M coreutils-symbols-abort-z-off
7,4M coreutils-debuginfo-unwind-s-fat
7,5M coreutils-debuginfo-unwind-z-fat
7,6M coreutils-symbols-abort-s-off
7,7M coreutils-symbols-abort-3-fat
8,3M coreutils-symbols-unwind-s-thin
8,4M coreutils-debuginfo-abort-3-fat
8,7M coreutils-symbols-unwind-z-thin
8,7M coreutils-symbols-unwind-3-fat
8,7M coreutils-symbols-abort-3-off
9,2M coreutils-symbols-abort-3-thin
9,2M coreutils-debuginfo-abort-s-thin
9,7M coreutils-debuginfo-unwind-3-fat
 10M coreutils-symbols-unwind-z-off
 10M coreutils-debuginfo-abort-z-thin
 10M coreutils-symbols-unwind-s-off
 10M coreutils-debuginfo-abort-s-off
 11M coreutils-debuginfo-abort-z-off
 11M coreutils-debuginfo-abort-3-thin
 11M coreutils-symbols-unwind-3-thin
 12M coreutils-debuginfo-unwind-s-thin
 12M coreutils-debuginfo-abort-3-off
 13M coreutils-none-abort-z-fat
 13M coreutils-none-abort-s-fat
 13M coreutils-debuginfo-unwind-z-thin
 14M coreutils-symbols-unwind-3-off
 14M coreutils-none-unwind-s-fat
 14M coreutils-none-unwind-z-fat
 14M coreutils-debuginfo-unwind-s-off
 15M coreutils-debuginfo-unwind-3-thin
 15M coreutils-none-abort-3-fat
 15M coreutils-debuginfo-unwind-z-off
 16M coreutils-none-abort-s-thin
 16M coreutils-none-unwind-3-fat
 17M coreutils-none-abort-z-thin
 18M coreutils-none-abort-3-thin
 18M coreutils-none-abort-s-off
 18M coreutils-none-abort-z-off
 18M coreutils-none-unwind-s-thin
 19M coreutils-debuginfo-unwind-3-off
 20M coreutils-none-abort-3-off
 20M coreutils-none-unwind-z-thin
 21M coreutils-none-unwind-3-thin
 22M coreutils-none-unwind-s-off
 22M coreutils-none-unwind-z-off
 27M coreutils-none-unwind-3-off

But, more interestingly, I've analyzed the average change in binary size due to each of these parameters compared to the "baseline", which is the option that generates the largest binary (e.g. lto=off). The table below contains the percentage changes.

config             min (%)   avg (%)   max (%)
lto=off            -         -         -
lto=fat            -49.5     -34.1     -11.1
lto=thin           -24.0     -11.7     5.4
opt-level=3        -         -         -
opt-level=s        -26.7     -18.5     -9.25
opt-level=z        -28.8     -16.8     -5.63
panic=unwind       -         -         -
panic=abort        -37.0     -18.3     -5.27
strip=none         -         -         -
strip=debuginfo    -48.2     -39.1     -28.2
strip=symbols      -59.8     -53.3     -45.5
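
A note on reading these numbers: the comment does not spell out the exact method, but assuming each entry compares a build against the build that differs only in that one parameter, the change would be computed as sketched below. The two worked values use the rounded sizes from the full list above, so they are approximate.

```latex
% Percentage change of one option relative to its baseline build
\[
\Delta = \frac{S_{\text{option}} - S_{\text{baseline}}}{S_{\text{baseline}}} \times 100\%
\]

% none-unwind-3: lto=off (27M) versus lto=fat (16M)
\[
\frac{16 - 27}{27} \times 100\% \approx -40.7\%
\]

% symbols-abort-3: lto=off (8.7M) versus lto=thin (9.2M)
\[
\frac{9.2 - 8.7}{8.7} \times 100\% \approx +5.7\%
\]
```

The first value falls inside the reported lto=fat range. The second is a case where lto=thin comes out larger than lto=off, and it is close to the reported maximum of 5.4 given the rounding of the listed sizes.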

Some conclusions from this experiment:

  • lto=fat makes a big difference (~30%)!
  • lto=thin made the binary bigger in at least one case :)
  • opt-level=s and opt-level=z are very close and opt-level=s is somehow even better in most cases. Both are better than 3.
  • panic=abort and strip=symbols are both great basically without downsides (at least on release builds)

I think lto, panic and strip all make sense both for performance and binary size. So I'd like to propose that we set the following parameters on a release build:

lto=true
strip=true
panic="abort"
opt-level=3

Which gives us the following size:

7,7M coreutils-symbols-abort-3-fat

And then we make a custom profile release-small, which is the same but with opt-level=z to squeeze out the last bit to 5.6M if you want to.
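
As a sketch, the proposal above could be written in Cargo.toml roughly as follows. The option values are taken from the proposal; defining release-small with `inherits` is an assumption about how the custom profile would be set up.

```toml
# Proposed default release profile (values from the proposal above).
[profile.release]
opt-level = 3
lto = true        # same as "fat"
strip = true      # same as strip = "symbols"
panic = "abort"

# Hypothetical custom profile: identical to release, but optimized for size.
[profile.release-small]
inherits = "release"
opt-level = "z"
```

With a profile like this, the small build would be selected with `cargo build --profile release-small`.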

Do you agree?

@zleyyij
Contributor

zleyyij commented Feb 26, 2023

I like that

@tpgxyz
Contributor Author

tpgxyz commented Feb 27, 2023

Hi,
looks like you had a very productive weekend :)

  • lto=fat makes a big difference (~30%)!

That was the obvious result, given what the fat mode of LTO does.

  • lto=thin made the binary bigger in at least one case :)

Sounds like a glitch, or Rust produces some unpredictable build output compared to LLVM/clang :)

  • opt-level=s and opt-level=z are very close and opt-level=s is somehow even better in most cases

Strange, as above: "-Oz" is LLVM/clang specific and does more aggressive size optimization than the widely known "-Os". There could be two things here: either Rust produces some unpredictable build output, or the "-Oz" optimization level is somehow broken in LLVM/clang. I remember that at OpenMandriva Lx we set "-Oz" as the default optimization level after we switched to LLVM/clang as the default compiler in 2015 :)

Do you agree?

Will update this PR with your findings.

@sylvestre
Contributor

looking at the CI, it moves from:


{
  "Sun, 26 Feb 2023 16:20:25 +0000": {
    "sha": "c1d85adcc930ecbbd5303275a4acdba25bd507bd",
    "size": "103208",
    "multisize": "15108"
  }
}

=>

{
  "Mon, 27 Feb 2023 11:53:20 +0000": {
    "sha": "7ec94c0e45df00a6bd219621aea3fbf7cdd775f7",
    "size": "88412",
    "multisize": "8420"
  }
}

@tertsdiepraam
Member

tertsdiepraam commented Feb 28, 2023

looking at the CI, it moves from:

That's because the PR matches roughly these settings:

8,7M coreutils-symbols-unwind-z-thin

@tpgxyz Was this difference intentional? And if so, what is your reasoning for using these settings?

Sounds like a glitch, or Rust produces some unpredictable build output compared to LLVM/clang :)

I think lto reducing binary size is basically a side effect. It's meant as a performance optimization as far as I know, so it might be able to do more monomorphization etc. with lto which could increase binary size (although it won't in most cases).

Strange, as above: "-Oz" is LLVM/clang specific and does more aggressive size optimization than the widely known "-Os". There could be two things here: either Rust produces some unpredictable build output, or the "-Oz" optimization level is somehow broken in LLVM/clang.

Yeah it is strange. Luckily the differences between -Oz and -Os were very small.

One thing that we should maybe think about before we commit to this is bug reports. If we strip the binary to the absolute minimum by default (no debuginfo, no panic unwind), then the bug reports we receive will probably not contain a lot of useful information. We would have to ask people to install a debug version and run that to get a proper backtrace, which is a bit cumbersome.

@tpgxyz
Contributor Author

tpgxyz commented Feb 28, 2023

One thing that we should maybe think about before we commit to this is bug reports. If we strip the binary to the absolute minimum by default (no debuginfo, no panic unwind), then the bug reports we receive will probably not contain a lot of useful information. We would have to ask people to install a debug version and run that to get a proper backtrace, which is a bit cumbersome.

Welcome to the world of Linux distributions, then. I'd say about 80% of distributions ship software stripped of debuginfo, which is either discarded (>/dev/null) or packaged separately, e.g.:

coreutils-9.1-4-omv4090.aarch64.rpm <- binary package
coreutils-debuginfo-9.1-4-omv4090.aarch64.rpm <- separate package debug info
coreutils-debugsource-9.1-4-omv4090.aarch64.rpm <- separate package debug sources

When the uutils version of coreutils is widely adopted by distributions, you should expect that most issue reports will not include debuginfo details, as end users do not install these packages, and if they do, running gdb is still something of a hurdle. So it is up to you to be ready before wide adoption, as I assume the goal is to replace GNU coreutils in the near future :)

@tertsdiepraam
Member

tertsdiepraam commented Feb 28, 2023

That's true, but that's not an entirely convincing argument to strip by default.

Distributions can already do whatever they prefer and I would actually encourage them to strip the binary. But people installing the uutils/coreutils via cargo are maybe more likely to want to contribute and report bugs, so including debuginfo makes more sense there. They are also probably installing it on a machine where binary size is less of an issue (i.e. not embedded devices). That doesn't mean binary size is not important for them at all, but it is less important.

Also, it's worth noting that this project is very much a work in progress, so we'd expect more bug reports than say after a 1.0 release. So, it could be an option to wait with stripping the debuginfo until we're more stable.

Note that I'm not really disagreeing with you, I just wanted us to consider this before committing to stripping the binary. And I want the opinions of other maintainers as well (@sylvestre what do you think?).

Other projects don't do a lot of these size optimizations themselves either. ripgrep has this as release profile:

[profile.release]
debug = 1

And there are some issues on their repo with very similar discussion to this one:

exa has this:

# use LTO for smaller binaries (that take longer to build)
[profile.release]
lto = true

bat and fd have this:

[profile.release]
lto = true
codegen-units = 1

Also, coming back to what you said before:

Well this means 80% of people will not notice it, because by default they will compile as a "release" target

To improve this case, at least for distributions, we could include a page on packaging uutils/coreutils in the online docs (e.g. how to strip the binary if we don't do that by default, what features to include, what the package should be called, etc.).

@sylvestre
Contributor

I think we should match what other binaries are doing

@tertsdiepraam
Member

Yes, I think I'll open another PR with a default like ripgrep's and a separate profile for small releases, and include some documentation for package maintainers.

@sylvestre
Contributor

do we still want to do something here?

@tpgxyz
Contributor Author

tpgxyz commented Mar 25, 2023

Looks like no, as a different approach was chosen to achieve the same goal.
Anyway, I'm happy I inspired you :)

@tpgxyz tpgxyz closed this Mar 25, 2023