Optimize the size of a statically linked binary and library #10740
This registers new snapshots after the landing of #10528, and then goes on to tweak the build process to build a monolithic `rustc` binary for use in future snapshots. This mainly involved dropping the dynamic dependency on `librustllvm`, so that's now built as a static library (with a dynamically generated rust file listing LLVM dependencies). This currently doesn't actually make the snapshot any smaller (24MB => 23MB), but I noticed that the executable has 11MB of metadata so once progress is made on #10740 we should have a much smaller snapshot. There's not really a super-compelling reason to distribute just a binary because we have all the infrastructure for dealing with a directory structure, but to me it seems "more correct" that a snapshot compiler is just a `rustc` binary.
This affects me; I'm not sure how much 500K (see below) matters in the long run for my use cases (including a kernel module), but it's very far from paying only for what I use. On OS X, the base size of a statically linked binary is:
`--link-args -dead_strip` brings it down:
Using the libnative example at https://gist.github.com/anonymous/8162357 helps:
and `-Z lto` brings it down a bit more:
But this is still way too high for my liking. Most of the file is text:
A lot of the functions seem to come from unused methods in traits like `IoFactory`, so it would be very nice (perhaps difficult?) to have some way to prune them from vtables. With rust-core I get a 12K binary. That's a fairly trivial result in some sense, since it just means that without any pretty printing for failure, everything has been optimized away, but the bare minimum for that pretty printing is much closer to 12K than 472K. I could just use rust-core for vaguely embedded stuff, but I don't think this is a good solution in the long run.
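The vtable problem described above can be sketched with a toy example (all names below are hypothetical, not the actual `IoFactory` trait): a trait object's vtable references every method of the trait, so the linker has to keep even methods that are never called, whereas static dispatch via generics only instantiates what is actually used and leaves the rest eligible for `--gc-sections`/`-dead_strip`.

```rust
// Hypothetical sketch of why trait objects defeat dead-code stripping:
// the vtable for `dyn Factory` references every method, so even
// `never_called` must be codegen'd and kept by the linker.
trait Factory {
    fn used(&self) -> u32;
    fn never_called(&self) -> u32; // still ends up in the vtable
}

struct Tcp;
impl Factory for Tcp {
    fn used(&self) -> u32 { 1 }
    fn never_called(&self) -> u32 { 2 }
}

// Dynamic dispatch: the vtable pins every method, even unused ones.
fn via_vtable(f: &dyn Factory) -> u32 {
    f.used()
}

// Static dispatch: monomorphization only instantiates what is called,
// so `never_called` can be dropped by section-level garbage collection.
fn via_generics<F: Factory>(f: &F) -> u32 {
    f.used()
}

fn main() {
    assert_eq!(via_vtable(&Tcp), 1);
    assert_eq!(via_generics(&Tcp), 1);
    println!("both dispatch styles agree");
}
```

The semantics are identical either way; the difference is only in what the linker can prove unreachable afterwards.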
The best prospects you have for a small statically linked binary are to use LTO (as you found out), but you should not be running with

You are correct in that the vtables are the major cause of bloat right now. As a result, all I/O code is pulled into all binaries even if they don't use it (if they're statically linked). This is a consequence of our decision on the architecture of I/O, and I don't foresee it changing soon.

If you care about using the standard library and having small binaries (not kernel modules), then I would highly recommend dynamic linking as an option. Dynamic linking is optimized for exactly this use case (one library implementation shared among many binaries).

If you care about the size of your libraries because you're making a kernel module, then these numbers are all irrelevant. You cannot use

And finally, the embedded context is the same as the kernel context. If you're writing an embedded kernel, you cannot use libnative or even rust-core. You are forced to write your own implementation of various components. If you're worried about generating embedded binaries that are large, then I recommend that you use dynamic linking instead.
You can use rust-core for an embedded kernel; it doesn't have any required dependencies. There's just not much available yet, with allocators blocked on fixing destructors.
@alexcrichton Good point regarding libkernel. It seems like `-dead_strip` not working properly should be considered a bug, to be fixed by defining whatever metadata needs to be kept in libraries as an exported symbol. Thus unnecessary data could still be removed in statically linked executables, without having to deal with the overhead of LTO.

I think I am going to try to implement stripping dead virtual methods in a somewhat hacky way in LLVM to see how much it helps in practice.

On a somewhat related note, based on quickly skimming the resulting binary in IDA, I'm somewhat suspicious that part of the problem may be that rustc just generates more verbose code for idiomatic Rust than you see in idiomatic C, e.g. doing a lot of copying things around between the stack and registers, though I could be completely wrong. Is there any easy way to get the equivalent of `-Oz`/`-Os` in rustc?
I do not foresee officially supporting flags like

Rust generates a very large amount of IR, but that does not mean that an optimized Rust binary is slower or larger than C. Rust has always been on par with C/C++ code (that I have examined). The only downside is that O0 is like 30x slower than C++ O0 (due to our large amount of codegen). Thankfully LLVM is a pretty good optimizer.
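The "large IR, same optimized output" point above can be illustrated with a small sketch: the idiomatic iterator pipeline below emits considerably more IR at O0 than the hand-written loop, yet both compute the same value, and with optimizations on LLVM typically lowers them to comparable machine code (this is an illustrative claim about codegen in general, not a measurement from the compiler discussed in this thread).

```rust
// Idiomatic style: closures and iterator adapters generate a lot of IR
// before optimization.
fn sum_squares_iter(v: &[u32]) -> u32 {
    v.iter().map(|x| x * x).sum()
}

// "C-like" style: a plain indexed loop with minimal abstraction.
fn sum_squares_loop(v: &[u32]) -> u32 {
    let mut total = 0;
    let mut i = 0;
    while i < v.len() {
        total += v[i] * v[i];
        i += 1;
    }
    total
}

fn main() {
    let data = [1, 2, 3, 4];
    // 1 + 4 + 9 + 16 = 30 either way; the abstraction cost exists only
    // in the unoptimized IR, not in the observable behavior.
    assert_eq!(sum_squares_iter(&data), 30);
    assert_eq!(sum_squares_loop(&data), 30);
}
```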
I do not understand what you mean about `-dead_strip`. Supporting `-dead_strip` properly on OS X requires two things:
I don't think that making one symbol public is more of a pain than a gain; LTO is nice but it is rather slow. I guess this could be a different issue report, though. |
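A minimal sketch of the "keep the metadata as an exported symbol" idea from the exchange above, expressed with today's Rust attributes (an assumption; the 2013 compiler had no such mechanism, and the symbol name and contents here are hypothetical): marking a static `#[used]` tells the compiler to emit it even when nothing references it, which is the property crate metadata would need to survive dead-code stripping while the rest of the dead data is removed.

```rust
// Hypothetical stand-in for crate metadata. `#[used]` forces the
// compiler to keep this static in the object file even if no Rust
// code reads it, so section-level stripping won't discard it.
#[used]
static METADATA_BLOB: [u8; 4] = *b"meta";

fn main() {
    // Reading the blob here is only for demonstration; the point of
    // `#[used]` is that it survives stripping even when unread.
    assert_eq!(&METADATA_BLOB, b"meta");
    println!("metadata retained: {} bytes", METADATA_BLOB.len());
}
```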
Once #10528 lands, we'll be able to create static libraries and static binaries. While being very useful, we're creating massive binaries. There are a few opportunities for improvement that I can see here:

- `objcopy -R`, but `objcopy` doesn't exist by default on OSX, and the objcopy I found ended up producing a corrupted executable that didn't run.
- `-ffunction-sections` and `-fdata-sections`, which place each function and static global in its own section. The linker is then passed `--gc-sections` and magically removes everything that's unused.

Both of these optimizations are a little dubious, and this is why I chose the default output of libraries to be dynamic libraries for the compiler. These optimizations can benefit the size of an executable, but I've seen the compilation of `fn main() {}` increase by 5-10x when implementing these optimizations (even in the common no-opt compile case).

Additionally, these optimizations are going to be difficult to implement across platforms. Most of what I've described is linux-specific. There is a `-dead_strip` option on the OSX linker, but that's the only relevant size optimization flag I can find. I have not checked to see what the mingw linker provides in terms of size optimizations.

Empirical data

All of the data here is collected from a 32-bit ubuntu VM, but I imagine the numbers are very similar on other platforms. The program in question is simply `fn main() {}`.

- `-ffunction-sections` + `--gc-sections` - 1.6MB
- `-ffunction-sections` + `--gc-sections` + `#[no_uv]` - 730K

Note that `--gc-sections` always removes the metadata. I'm unsure of whether `--gc-sections` corrupts our exception-handling sections.

From this, the "most optimized normal case" that I can get to is 1.6MB, which is still very large. As a comparison, the "hello world" go executable is 400K. A no_uv 730K executable is pretty reasonable, so it could just be that having M:N/uv means that you're pulling in larger portions of libstd. I believe that this size of 1.6MB means that further investigation is warranted to figure out where all this size is coming from.
Nominating for discussion. I don't think that this should block 1.0, but this is certainly a concern that we should prioritize.