Non-deterministic perf.rust-lang.org runs (e.g. syn-opt). #69060

Open
Mark-Simulacrum opened this issue Feb 11, 2020 · 29 comments
Labels
A-reproducibility Area: Reproducible / deterministic builds C-bug Category: This is a bug. T-compiler Relevant to the compiler team, which will review and decide on the PR/issue.

Comments

@Mark-Simulacrum
Member

https://perf.rust-lang.org/detailed-query.html?commit=e369f6b617bc5124ec5d02626dc1c821589e6eb3&base_commit=4d1241f5158ffd66730e094d8f199ed654ed52ae&benchmark=syn-opt&run_name=patched%20incremental:%20println

This perf run was expected to be a no-op, but appears to have resulted in slightly fewer queries being run (2 def_span and 2 metadata reads).

cc @eddyb

@Mark-Simulacrum Mark-Simulacrum added T-compiler Relevant to the compiler team, which will review and decide on the PR/issue. A-reproducibility Area: Reproducible / deterministic builds labels Feb 11, 2020
@eddyb
Member

eddyb commented Feb 11, 2020

This may have been happening a lot; if anyone has the time, going through the history of syn-opt runs would be useful.

It's far noisier than it should be IMO, and non-determinism might play a significant role.

@wesleywiser
Member

Something interesting I noticed (apologies if this is already known):

[screenshot omitted]

Perhaps this is LLVM optimization related?

@Mark-Simulacrum
Member Author

It seems possible that the queries are run from codegen, but since they're queries, I believe the problem is within Rust code (to my knowledge we don't have callbacks or anything like that from LLVM).

@eddyb
Member

eddyb commented Feb 12, 2020

It might be a combination. LLVM is known to have trouble maintaining deterministic performance, in a similar vein to rustc hashing pointers, but potentially even affecting the output of LLVM optimizations.

I highly advise disabling at least userspace ASLR, and manually checking (by dumping /proc/self/maps from rustc? or even interned e.g. Ty pointers) that the heap addresses match between two runs, even with a slightly different rustc.
There's also the question of randomness sensitivity, but I suspect we don't want that even in real builds.
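
A minimal sketch of that check on Linux (the file names and the pgrep step are just illustrative):

# Disable userspace ASLR for the whole build, so heap/mmap addresses have a
# chance of matching between runs:
setarch "$(uname -m)" -R cargo rustc --release -- -Zself-profile=perf-1
# In another terminal, while the build is running, snapshot the address-space
# layout of the rustc process (dumping /proc/self/maps from inside rustc
# itself would need a small local patch):
cat /proc/"$(pgrep -n rustc)"/maps > maps-1.txt
# ...repeat for the second compiler as maps-2.txt, then:
diff maps-1.txt maps-2.txt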

IIRC @emberian managed to get the nondeterminism down to a bimodal (i.e. two possible profiles) situation for "irsy" (Is Rust Slim Yet), back in the day.
It's a bit of a shame that the "irsy" lessons were lost in time.

@eddyb
Member

eddyb commented Mar 19, 2020

Something suspicious when looking at the query view is one fewer of each of metadata_decode_entry and def_span. That should be impossible in many situations, yet it shows up.
Something is triggering slightly different compilation and I'm dying to know what.

It's possible all benchmarks are affected and syn-opt just has the biggest swing.

@eddyb
Member

eddyb commented Mar 19, 2020

@wesleywiser points out that this effect of one fewer of each metadata_decode_entry and def_span also affects cranelift-codegen-debug clean incremental (and probably other crates as well).

This suggests to me that there might be a very specific source of non-determinism in rustc or a proc macro.

@eddyb eddyb changed the title non-deterministic syn-opt run Non-deterministic perf.rust-lang.org runs (e.g. syn-opt). Mar 26, 2020
@eddyb
Member

eddyb commented Mar 26, 2020

I've done some local experiments and I can't cause this on nightly, even by doing this:

rm -rf target/release
RUSTFLAGS=-Cmetadata=$N cargo rustc --release -- -Zself-profile
mkdir perf-$N
mv syn-*.{events,string_{data,index}} perf-$N

for several values of $N, then running summarize diff and looking at the counts.
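
For reference, a hypothetical invocation of that diff step, with summarize installed from the rust-lang/measureme repo (the syn-<pid> prefix is a placeholder for the actual profile file names):

# Install the summarize tool and diff two of the profiles produced above;
# the arguments are the profile file prefixes (paths without the
# .events/.string_data/.string_index extensions).
cargo install --git https://github.com/rust-lang/measureme summarize
summarize diff perf-1/syn-<pid> perf-2/syn-<pid>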


Next thing I'm going to do is build a local rustc with channel = "nightly", which will hopefully include a git hash, and then add an empty git commit to get it to change.
My assumption is that Cargo will embed that into std's -Cmetadata and that will trigger this issue.

EDIT: nope, that doesn't trigger it either, but I can keep trying.

@Mark-Simulacrum
Member Author

Hm, FWIW perf does not use the nightly channel, but maybe you can try the above with pre-produced artifacts? (We can look at some PR changing unimportant things like the README or triagebot.toml, if such PRs exist given rollups.)

But building locally should work too; you may need to disable ignore-git in config.toml, though.

@eddyb
Member

eddyb commented Mar 26, 2020

@Mark-Simulacrum pointed out that some paths may differ, but building two copies of syn with the same compiler, while it does result in different artifacts (since --remap-path-prefix wasn't used), leaves the query counts unchanged.

@eddyb
Member

eddyb commented Mar 26, 2020

Just realized I've been doing all my testing without CARGO_INCREMENTAL=1, so if this requires both release mode and incremental builds, I might've missed it.

@eddyb
Member

eddyb commented Mar 26, 2020

I was able to reproduce this once, but I'm not sure I know the relevant factors. I had:

  • different rustc -vV output (but built from the same source)
  • different build dirs for syn (EDIT: not necessary)
  • --release mode
  • CARGO_INCREMENTAL=1

| Item | Self Time | Self Time Change | Time | Time Change | Item count | Cache hits | Blocked time | Incremental load time |
|---|---|---|---|---|---|---|---|---|
| metadata_decode_entry | +30.06691ms | +73.60% | +37.357079ms | +63.56% | +3 | +0 | +0ns | +0ns |
| param_env | +2.88433ms | +26.61% | +5.023573ms | +29.19% | +1 | +0 | +0ns | +0ns |
| def_span | +560.266µs | +20.41% | +574.635µs | +19.42% | +1 | +0 | +0ns | +0ns |
| associated_item_def_ids | +205.239µs | +23.59% | +319.72µs | +17.13% | +1 | +0 | +0ns | +0ns |
| issue33140_self_ty | +37.4µs | +121.78% | +48.729µs | +113.77% | +1 | +0 | +0ns | +0ns |
| trait_def | +9.161µs | +10.86% | +16.081µs | +12.47% | +1 | +0 | +0ns | +0ns |

@eddyb
Member

eddyb commented Mar 26, 2020

This seems to repro as well:

mkdir perf-{1,2}
rm perf-{1,2}/*
rm -rf target/release
CARGO_INCREMENTAL=1 cargo +rust-3-stage1 rustc --release -- -Zself-profile=perf-1
rm -rf target/release
CARGO_INCREMENTAL=1 cargo +rust-4-stage1 rustc --release -- -Zself-profile=perf-2

| Item | Self Time | Self Time Change | Time | Time Change | Item count | Cache hits | Blocked time | Incremental load time |
|---|---|---|---|---|---|---|---|---|
| metadata_decode_entry | -51.316108ms | -55.44% | -71.070789ms | -54.85% | +1 | +0 | +0ns | +0ns |
| trait_def | -70.321µs | -47.56% | -102.341µs | -45.63% | +1 | +0 | +0ns | +0ns |

Those builds only differ by an empty git commit and have channel = "nightly" so the git hash shows up in rustc +rust-{3,4}-stage1 -vV.

I assume that means every crate hash is different.

@eddyb
Member

eddyb commented Mar 26, 2020

Which queries are redone seems consistent for a given rustc -vV output, and I went and changed rust-4-stage1 (by adding another empty commit) and got this instead:

| Item | Self Time | Self Time Change | Time | Time Change | Item count | Cache hits | Blocked time | Incremental load time |
|---|---|---|---|---|---|---|---|---|
| param_env | +1.44951ms | +14.28% | +1.967982ms | +12.27% | +3 | +0 | +0ns | +0ns |
| metadata_decode_entry | +638.448µs | +1.62% | +752.248µs | +1.33% | +5 | +0 | +0ns | +0ns |
| def_span | +579.528µs | +25.07% | +603.46µs | +24.25% | +3 | +0 | +0ns | +0ns |
| associated_item_def_ids | +354µs | +47.50% | +364.602µs | +22.20% | +1 | +0 | +0ns | +0ns |
| specializes | +87.26µs | +22.44% | +155.46µs | +22.14% | +3 | +0 | +0ns | +0ns |
| issue33140_self_ty | +57.021µs | +226.63% | +61.661µs | +161.50% | +3 | +0 | +0ns | +0ns |
| trait_def | +16.969µs | +27.16% | +26.67µs | +25.86% | +1 | +0 | +0ns | +0ns |

So we clearly have some sort of dependency on that version, maybe via HashStable?
(Since it affects every single def path hash via the crate hashes computed by Cargo)

If we can't fix this, there's technically a potential workaround by making try builds lie about their git hash in rustc -vV output, but I don't like it.

@eddyb
Member

eddyb commented Mar 26, 2020

Looks like the difference is that for that last pair of runs, the one with extra query counts uses:

  • associated_item_def_ids for core::slice::SlicePartialEq
  • trait_def for core::slice::BytewiseEquality
  • def_span, param_env and issue33140_self_ty for:
  • specializes for:
    • (alloc::string::{{impl}}[16], alloc::string::{{impl}}[77]):
      impl PartialEq for String {
      impl<'a, 'b> PartialEq<$rhs> for $lhs {   (with $lhs = String, $rhs = str)
    • (alloc::string::{{impl}}[77], alloc::string::{{impl}}[16])
    • (alloc::string::{{impl}}[79], alloc::string::{{impl}}[16])
    • (alloc::string::{{impl}}[79], alloc::string::{{impl}}[77])
    • (alloc::string::{{impl}}[86], alloc::string::{{impl}}[16])
    • (alloc::string::{{impl}}[86], alloc::string::{{impl}}[77])
      • I haven't looked up any of the above (EDIT: they seem to all be PartialEq impls, with impl_eq! invocations producing {{impl}}[77] to {{impl}}[86], inclusive)
    • (alloc::vec::{{impl}}[30], alloc::vec::{{impl}}[31]):
      rust/src/liballoc/vec.rs, lines 2004 to 2006 in f4c675c:
        impl<T, I> SpecExtend<T, I> for Vec<T>
        where
            I: Iterator<Item = T>,
      rust/src/liballoc/vec.rs, lines 2035 to 2037 in f4c675c:
        impl<T, I> SpecExtend<T, I> for Vec<T>
        where
            I: TrustedLen<Item = T>,
    • (core::slice::{{impl}}[127], core::slice::{{impl}}[129]):
      rust/src/libcore/slice/mod.rs, lines 5793 to 5795 in f4c675c:
        impl<A, B> SlicePartialEq<B> for [A]
        where
            A: PartialEq<B>,
      rust/src/libcore/slice/mod.rs, lines 5825 to 5827 in f4c675c:
        impl<A> SlicePartialEq<A> for [A]
        where
            A: PartialEq<A> + BytewiseEquality,
    • (core::slice::{{impl}}[128], core::slice::{{impl}}[129]):
      rust/src/libcore/slice/mod.rs, lines 5807 to 5809 in f4c675c:
        impl<A> SlicePartialEq<A> for [A]
        where
            A: PartialEq<A> + Eq,
      rust/src/libcore/slice/mod.rs, lines 5825 to 5827 in f4c675c:
        impl<A> SlicePartialEq<A> for [A]
        where
            A: PartialEq<A> + BytewiseEquality,

This is troubling, since it suggests there's some non-determinism in specialization.


To get query keys I've added -Zself-profile-events=default,query-keys to the cargo rustc commands, then used the crox tool to get chrome_profiler.json files.

I've used commands like these to extract the differing query keys:

jq -s 'map(map(select(.name == "def_span")) | map(.args.arg0) | sort) | (.[1]-.[0])' perf-{1,2}/chrome_profiler.json

that example outputs:

[
  "alloc::vec::{{impl}}[30]",
  "core::slice::{{impl}}[127]",
  "core::slice::{{impl}}[128]"
]

I've then counted the impls in those modules by hand, ignoring #[derive]s and macro invocations (had to redo some counting because I didn't realize at first that I had to skip them).
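
For completeness, a minimal sketch of the crox step mentioned above (crox is the Chrome-profiler exporter from the rust-lang/measureme repo; the syn-<pid> prefix is a placeholder for the actual profile file names):

cargo install --git https://github.com/rust-lang/measureme crox
# crox takes the profile file prefix and writes chrome_profiler.json into the
# current directory, which is what the jq command above reads.
(cd perf-1 && crox syn-<pid>)
(cd perf-2 && crox syn-<pid>)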

@jonas-schievink
Contributor

This iterates over an FxHashMap:

let nonblanket = children.nonblanket_impls.iter_mut().flat_map(|(_, v)| v.iter());

@eddyb
Member

eddyb commented Mar 26, 2020

I guess the order this loop iterates in matters because of side-effects?

for possible_sibling in possible_siblings {

@eddyb
Member

eddyb commented Mar 26, 2020

Sorting possible_siblings going into that loop seems to fix the problem, but...

This is the FxHashMap:

pub nonblanket_impls: FxHashMap<SimplifiedType, Vec<DefId>>,

And SimplifiedType only has DefId leaves (and usize for some counts, irrelevant here).

So if the SimplifiedTypes differ, that means the DefIds differ, which means there's no way to use ordering to get the same result from two different-version compilers.
If we sort by stable hash, that's definitely going to be different since it takes crate metadata into account, which takes compiler version into account (AFAIK).

IndexMap might work, if we know we'll be inserting in a stable ordering, but that requires e.g. the caller of insert to also be stable.

EDIT: see #69060 (comment).

@eddyb
Member

eddyb commented Mar 26, 2020

As I was afraid, there's explicit sorting upstream, and I'm not sure it helps:

// The coherence checking implementation seems to rely on impls being
// iterated over (roughly) in definition order, so we are sorting by
// negated `CrateNum` (so remote definitions are visited first) and then
// by a flattened version of the `DefIndex`.
trait_impls
    .sort_unstable_by_key(|def_id| (-(def_id.krate.as_u32() as i64), def_id.index.index()));

for impl_def_id in trait_impls {
    if impl_def_id.is_local() {
        // This is where impl overlap checking happens:
        let insert_result = sg.insert(tcx, impl_def_id);

@eddyb
Member

eddyb commented Mar 26, 2020

@michaelwoerister @Zoxc Do you know what happens to DefId allocation for crates with the same contents but different -C metadata (so different def path root hash)?
I'm trying to figure out if sorting by DefIndex makes sense, in terms of being stable across compiler version changes (assuming the compiler and standard library do not otherwise change).

EDIT: see #69060 (comment).

@eddyb
Member

eddyb commented Mar 27, 2020

I'm starting to believe this is the culprit:

// Bring everything into deterministic order for hashing
impls.sort_by_cached_key(|&index| {
    tcx.hir().definitions().def_path_hash(LocalDefId { local_def_index: index })
});

The impls are already in a stable order: that in which they appear within the crate.
But the DefPathHashes aren't stable across compilers, so this injects a non-deterministic order.

That is, I believe DefIndex is stable across compilers (for unchanged source), meaning that FxHashMap, while potentially a problem in theory, isn't a problem in practice.

EDIT: it does seem that removing that sort (with no other changes) fixes the non-determinism for me.
EDIT: that sort was introduced by 77b7df3 in #41911, but since then, incremental compilation has changed how cross-crate information is tracked, so it should be fine to remove it IMO.

@eddyb
Member

eddyb commented Mar 27, 2020

@Mark-Simulacrum If we want to avoid changing the compiler at all, there might be a way to guarantee stability of DefPathHashes across compiler versions, on perf.rust-lang.org:

  1. redirect Cargo's use of rustc to a wrapper that changes the output of -vV to hide the commit hash entirely, making the before/after look identical to Cargo (a minimal sketch of such a wrapper is at the end of this comment)
    • note that the version inside rustc doesn't matter itself, since it's only what Cargo computes from rustc -vV that results in different DefPathHashes between compiler versions
  2. for std, use a Xargo-like setup (or -Z build-std, although I suspect that might be more expensive)
    • this will result in a std build that has the effects from 1. on DefPathHashes
    • bonus: you can track the impact of compiler or std changes on std's own compile times

Unlike working around individual between-compiler-versions instabilities, this also has the advantage that anything working with DefPathHashes will have the same performance.

Combine that with disabling ASLR, and the allocation + hashing behavior should be fully stable.
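
A minimal sketch of the wrapper from point 1 (a hypothetical script; it assumes the real rustc is on PATH):

#!/bin/sh
# Pass everything through to the real rustc, but strip the commit hash/date
# from `-vV` output so Cargo sees an identical version string (and therefore
# computes identical -Cmetadata) before and after a compiler bump.
if [ "$#" -eq 1 ] && [ "$1" = "-vV" ]; then
    rustc -vV | grep -v -e '^commit-hash:' -e '^commit-date:'
    exit 0
fi
exec rustc "$@"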

@eddyb
Member

eddyb commented Mar 27, 2020

From #70462 (comment):

Nothing I've looked at has any query count differences, which is nice, but there's still a lot of noise from codegen/LLVM, so that needs to be looked into separately. I suppose I was wrong to assume that the query counts and noisy codegen/LLVM were related.

So we'd need to investigate the codegen/LLVM non-determinism separately. And that will be harder since it's all timing differences, not a discrete quantity like query counts.

@Mark-Simulacrum What might be useful is grabbing -Zself-profile data from multiple runs, with the same compiler and with compilers that only differ in version commit hash.
If the codegen/LLVM time changes show up only in the latter case, then the -Z build-std trick should help there, but if even the same compiler is noisy, we'll need to dig deeper.
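
A minimal sketch of that data collection (same compiler run twice, plus the two toolchains from the earlier repro; directory names are illustrative):

mkdir -p perf-same-1 perf-same-2 perf-diff
# Same compiler, two clean runs, to measure the baseline noise:
rm -rf target/release
CARGO_INCREMENTAL=1 cargo +rust-3-stage1 rustc --release -- -Zself-profile=perf-same-1
rm -rf target/release
CARGO_INCREMENTAL=1 cargo +rust-3-stage1 rustc --release -- -Zself-profile=perf-same-2
# A compiler differing only in the version commit hash:
rm -rf target/release
CARGO_INCREMENTAL=1 cargo +rust-4-stage1 rustc --release -- -Zself-profile=perf-diff
# Then compare the codegen/LLVM self-profile timings across the three directories.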

EDIT: I wonder if CGU partitioning can vary wildly and that's causing unpredictable codegen/LLVM performance. -Zprint-mono-items=lazy definitely shows that rustc -vV differences lead to different CGU "names", but the output format makes it hard to determine the shape of the CGUs.
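
A rough way to summarize that output into per-CGU item counts (this assumes the `MONO_ITEM <item> @@ <cgu>[<linkage>] ...` line format; the grep/sed patterns are illustrative):

cargo rustc --release -- -Zprint-mono-items=lazy 2>&1 \
    | grep '^MONO_ITEM .* @@ ' \
    | sed 's/^.*@@ //' \
    | tr ' ' '\n' \
    | sed 's/\[[^]]*\]$//' \
    | sort | uniq -c | sort -rn | head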

@Mark-Simulacrum
Member Author

I had some spare time today and deployed the rustc -vV change -- that should now be canonical starting with 5b594d6bbb9d492633d0f48e9d6c8a864bea2f3d (currently the last commit on perf.rlo). From what I understood talking to @eddyb, we will almost certainly also need something like -Zbuild-std to sort out the nondeterminism fully, but I didn't have the time today to figure that out. Either way, it's probably good to try and observe the effects separately, too.

@eddyb
Member

eddyb commented Apr 16, 2020

IIUC, the last 3 points horizontally (on both black and blue lines) are post-rust-lang/rustc-perf#645:
(looks pretty flat to me)
[graph omitted]

@eddyb
Member

eddyb commented Apr 16, 2020

We're not out of the woods yet: this perf diff has a bunch of query count differences (42 for the top ones; syn-opt itself looks to be 112), and I'm not sure why it would have any.

Nothing in d223029...ce1ab35 seems like it could explain that.

@eddyb
Member

eddyb commented Apr 21, 2020

One-week update: it's pretty stable now, and we can clearly see changes that are several times smaller than the variance used to be:
[graph omitted]

@RalfJung
Member

So, given that the version number has a measurable impact on performance... (a) I guess this means we'll see a bump in these graphs on each version number bump, and (b) should we do some optimization to find the version number that makes the compiler go fastest? :P

@eddyb
Member

eddyb commented Apr 21, 2020

@RalfJung The version gets in via -C metadata, so you could instead just XOR the resulting combined hash with some constant (but let's not, see more below).

I don't think rust-lang/cargo#8073 lets any part of the version into -C metadata, but I could be wrong.
@ehuss is the person who can answer that best.


So why do we get such wild non-determinism for syn-opt and maybe a couple other benchmarks, when -C metadata differs?

Well, it's always in LLVM. Sure, some query counts vary, but those are tiny and ultimately irrelevant.

The other hint is that this happens in syn-opt but not syn-debug. So, LLVM optimizations.

One theory is that syn-opt is sensitive to CGU partitioning, more so than the average crate we track on perf.rust-lang.org. That is, LLVM gets to do more or fewer optimizations based on which functions are instantiated in the same CGU.

However, I seem to recall from experimentation that it's not actually the case, or at least I couldn't find differences in CGU partitioning. Which leads to an even simpler explanation:

CGU ordering alone impacts the total time spent in LLVM that much.


So if you want to make rustc slightly faster, don't waste your time with the hashes.
Write a smarter CGU partitioning (or even just sorting) algorithm.
(If that's even possible, since you'd have to anticipate a lot of what LLVM might do...)

@RalfJung
Member

RalfJung commented Apr 21, 2020

Maybe I should have made it clearer (I hoped the ":P" was enough but maybe not), but I was joking in (b) above. ;)
