-
Notifications
You must be signed in to change notification settings - Fork 13k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
panic code is not deterministically lto'ed #52044
Comments
Oh dear now this is a little terrifying. The following script also reproduces the error here and is a bit smaller in that it only uses rustc. This script generates only the IR and object file for the compilation at hand: #!/bin/bash
set -ex
input="
#[no_mangle]
pub extern fn foo() {
panic!(\"foo\");
}
"
rustc="rustc /dev/stdin -O -C lto -C panic=abort -C codegen-units=1"
rustc="$rustc --emit llvm-ir,obj --crate-type staticlib"
for i in `seq 1 100`; do
rm -rf a b
mkdir a b
$rustc --out-dir a <<< $input
$rustc --out-dir b <<< $input
a=$(md5sum a/stdin.ll | awk '{print $1}')
b=$(md5sum b/stdin.ll | awk '{print $1}')
if [ "$a" != "$b" ]; then
echo IR is different
exit 1
fi
a=$(md5sum a/stdin.o | awk '{print $1}')
b=$(md5sum b/stdin.o | awk '{print $1}')
if [ "$a" != "$b" ]; then
echo object is different
exit 1
fi
done The scary part here is that the IR is the same when the object files are different. This means that somehow LLVM's code generator looks like it's being non-deterministic. I can't reproduce this with Somehow this means that the in-process state of LLVM is nondeterministic in just the right way between compilations that it affects the final output file. As to how... that's a bit of a mystery. This isn't at least an obvious determinism bug in rustc in that the IR is the same so we're feeding the same input to LLVM both times. It's seemingly something else that's going awry! |
Trace diff for rustc_codegen_llvm is non-empty but surface level appears to be just pointers changing addresses and hashes being different (still odd, but at least reasonable): https://gist.github.com/Mark-Simulacrum/5d04c7ad104aa8936d45c4ac71f301f3. Trace diff for the whole compilation (RUST_LOG=trace) is here: https://gist.github.com/Mark-Simulacrum/a1dcc978a6b65cd48307ad9407c60cf9. This shows that there is certainly non-determinism inside rustc, but also does not look like there's anything obvious here. This seems likely to have something to do with recent changes to I cannot reproduce with a minimal
|
FWIW, this doesn't reproduce with 1.25, which was the first version with LLVM 6. |
I'll attempt a bisection tomorrow. |
This reproduces with nightly-2018-06-04, which is the first one with #50338 (panic_implementation) merged. But it also does reproduce with nightly-2018-06-03. |
Both the merge of #49051 and the parent merge fail for me, so it seems your bisection went wrong. |
The merge of #49051, 36b6687, reproduces via the script I wrote above for me locally. The previous merge of #47813, 3926453, did not reproduce via the same script after 100 executions of trying to get a different hash. Additionally using 3926453 I am unable to reproduce via the example you gave in the OP. How are you testing 3926453? I'm using To hopefully weed out weird build issues as the problem I've tested the next previous merge as well of #48138, ff2d506, but it also isn't reproducing the bug here. |
Local bisection with builds from scratch point me to the upgrade to LLVM 6, quite consistently. Has the compiler used on automation to build C/C++ code changed later? |
Mmmm it fails both when llvm is built with GCC 5.4 (which is what I was originally bisecting with) and GCC 4.8 (which seems to be what rust automation used for some old builds I looked at the logs for). I wonder if the fact that this is on Ubuntu, where GCC defaults to have some features enabled that upstream GCC doesn't makes a difference. |
At least some of our CI builds have switched to Clang 6.0 about two months ago. |
I was actually able to reproduce with llc-6.0 from Debian, from the llvm-6.0 package version 1:6.0.1-2, from the llvm-ir from latest rustc nightly. |
Unfortunately, llc-7 doesn't like the llvm-ir output from rustc nightly:
|
@glandium when you mentioned you can reproduce with llc-6.0, that's taking Rust's IR and running llc twice, getting different object files? |
Yes, with the same code difference as what I can see with rust. But it takes more attempts than it does with rust. |
FWIW, I've been able to reproduce under rr, so I /might/ be able to compare what's different between two runs, but if I make rr more deterministic by disabling ASLR, it goes away (but disabling ASLR without rr makes it still happen) |
FWIW, this is what the stack traces look like that create the first different instructions: cmp version
mov version
Maybe the DAG traversal depends on addresses? |
Actually working with ir dumps before and after every pass says the difference comes from the "Expand ISel Pseudo-instructions" pass. |
So it turns out the first difference I'm seeing in |
|
I modified llc to not execute a program when doing |
In case this helps: Basic block containing the first Variant 1
Variant 2
I think the matching basic block in the
|
This seems to have been fixed by #51966 |
Bisection on the llvm side says this was fixed by... the additions for retpoline, which makes no sense, as it's not supposed to change anything if retpoline is not enabled. And as crazy as it seems, the simple fact that there's an extra PreEmitPass2 that doesn't run makes it disappear O_O. |
Even better, the retpoline patch has already been backported to llvm 6 used by rust... |
So... this might, in fact, not be fixed. That is, if I build beta locally, it's not happening. But if I use the released beta, it does. |
The patch that you listed as closing this (the LLVM 7 upgrade) wasn't backported to beta, so I'm unclear why you'd expect it to be fixed by it on beta. Perhaps I'm misinterpreting something? |
What I'm saying is that the build environment in which rustc is built makes the problem appear or not, for the same version of rustc. There hasn't been a nightly produced since llvm 7 landed, so I did the build locally, and I marked this issue as fixed because it didn't happen on that build. But since a local build of beta with llvm 6 is also not affected, while the official beta is, that means master with llvm 7 is maybe only not affected on my end because of the compiler I'm using, and the official rustc might still be affected. |
I also can't reproduce with the CI build of 64f7de9 (latest master commit as of now) so it's possible that the upgrade did in fact fix this, if not fully. |
I can reproduce on a locally built beta if I use clang 6 (which is apparently what rustc automation is using) instead of gcc 5.4 (which is what's installed in the VM I'm using). And it does, indeed, not happen with master built with clang 6. |
This is driving me crazy. Here's how things go in my VM:
At least it's more consistent with llvm 7, but considering it disappeared by adding a pass that does nothing (retpoline) while bisecting with gcc 5.4 doesn't make me confident that the actual problem is fixed, even if, in practice, it looks like it is. |
At least I can confirm that this does indeed not happen with the now latest nightly, with llvm 7. So... |
This is something I noticed when comparing two builds of Firefox on automation. Both builds were done with the same flags, same toolchains, same paths, same everything. Without any sort of caching. And yet, they differed in exactly one way: the contents of
std::panicking::default_hook
.See https://taskcluster-artifacts.net/IWpRV77rTgmt-9RKwTzJiA/0/public/diff.html
I tried multiple times, and there seems to be only two variants of the generated code, and that's the only code, in all of what's generated by rust in Firefox build, that differs.
It turns out this is reproducible at a much smaller scale:
It can take a few attempts repeating the last three commands before differences show up, because there are only two variants, and you may end up getting the same variant multiple times in a row.
Cc: @alexcrichton @nikomatsakis
The text was updated successfully, but these errors were encountered: