
Avoid query cache sharding code in single-threaded mode #94084

Merged: 5 commits merged into rust-lang:master from the drop-sharded branch on Feb 27, 2022

Conversation

@Mark-Simulacrum (Member) commented Feb 17, 2022

In non-parallel compilers, this is just adding needless overhead at compilation time (since there is only one shard statically anyway). This amounts to a roughly 10-second reduction in bootstrap time, with overall neutral (some wins, some losses) performance results.

Parallel compiler performance should be largely unaffected by this PR; sharding is kept there.
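For context, here is a minimal sketch of the status quo this PR targets. It is illustrative only: aside from the SHARD_BITS cfg, which bjorn3 quotes verbatim later in the thread, the names and shapes below are assumptions rather than rustc's exact code. The point is that even with a single static shard, every cache lookup still routes through the generic sharding machinery.

```rust
use std::sync::Mutex; // stand-in for rustc_data_structures' Lock type

#[cfg(not(parallel_compiler))]
const SHARD_BITS: usize = 0; // a single shard, statically
#[cfg(parallel_compiler)]
const SHARD_BITS: usize = 5;

const SHARDS: usize = 1 << SHARD_BITS;

// Every query cache instantiates this generic container, so even when
// SHARDS == 1 the index computation and array indexing below are still
// compiled for each cache.
pub struct Sharded<T> {
    shards: [Mutex<T>; SHARDS],
}

impl<T> Sharded<T> {
    pub fn get_shard_by_hash(&self, hash: u64) -> &Mutex<T> {
        // Shard index from the high bits of the hash; with SHARDS == 1
        // this always evaluates to 0, but the code still exists.
        let i = ((hash >> 57) as usize) & (SHARDS - 1);
        &self.shards[i]
    }
}
```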

@rustbot rustbot added the T-compiler Relevant to the compiler team, which will review and decide on the PR/issue. label Feb 17, 2022
@Mark-Simulacrum (Member, Author)

@bors try @rust-timer queue

@rust-timer (Collaborator)

Awaiting bors try build completion.

@rustbot label: +S-waiting-on-perf

@rustbot rustbot added the S-waiting-on-perf Status: Waiting on a perf run to be completed. label Feb 17, 2022
@bors (Contributor) commented Feb 17, 2022

⌛ Trying commit e25a77dca5e5cbfb78d927a9541661428d87331c with merge 471ea6ab86e550c13a729833d90e362bbc7d9622...

@bors (Contributor) commented Feb 17, 2022

☀️ Try build successful - checks-actions
Build commit: 471ea6ab86e550c13a729833d90e362bbc7d9622

@rust-timer (Collaborator)

Queued 471ea6ab86e550c13a729833d90e362bbc7d9622 with parent 30b3f35, future comparison URL.

@joshtriplett (Member) commented Feb 17, 2022

We absolutely should make changes that improve the performance of the non-parallel compiler. However, it'd be nice if we had some means of measuring the impact on the parallel compiler, to evaluate the need for separate code paths. Is there any tracking issue for having rustc-perf measure parallel compiler performance?

@Mark-Simulacrum (Member, Author)

The expectation is not necessarily to land this code as-is (I think that's unlikely), but to identify how much of a win this is -- that will help calibrate the investment in the various next steps, e.g., (a) keeping parallel compilation equivalent with additional cfg work, or (b) not bothering with this patch at all.

If we do see a large enough improvement, benchmarking parallel compilers locally is possible -- just time consuming, since you need to build from scratch on master and with your changes (requiring a good 30-60 minutes minimum each, typically) and then run at least a subset of perf through that. That could help with the evaluation.

Tracking parallel compiler performance is not currently done, and I'm not aware of an issue for it. This is primarily because no one is actively working on that mode, so spending time on infrastructure to track it does not seem particularly worthwhile -- it would require essentially doubling our costs (number of metrics, servers, etc.), which seems pretty extreme for a feature with essentially zero active development.

@joshtriplett (Member)

The expectation is not necessarily to land this code as-is (I think that's unlikely), but to identify how much of a win this is

Ah, got it. In that case, any objections to marking this PR as a draft? That often serves as a good indicator of "this is being used to check performance of an idea".

I absolutely agree that we shouldn't run that tracking in general on every perf run until we have more active development on it. But it'd help to have the ability to enable it, and to be able to run it specifically for PRs we'd expect to affect it. (As well as, perhaps, a perf run per release.)

@Mark-Simulacrum (Member, Author)

The lack of an assigned reviewer (i.e., an explicit r? @ghost) is my signal that work is not intended for review -- I don't typically use the draft state on GitHub, though I don't really care either way.

Tracking it even irregularly still requires quite a bit of work to get all the pieces in the right order today, but it's not necessarily blocked on infra work (try builds with the right CI changes are sufficient), so someone well-motivated could start doing so.

@Mark-Simulacrum (Member, Author)

FWIW, one reason I am reluctant to track is that we already cannot really reliably keep up with triaging numerous, typically relatively small, perf regressions. I suspect that the parallel compiler mode will be even more difficult to diagnose regressions in -- at least in the current suite of tools -- so I am reluctant to add that extra data to our perf-tracking work.

@joshtriplett (Member)

Ah, sorry, missed the r? @ghost.

Good to know that it's something someone could put together if motivated to do so.

FWIW, one reason I am reluctant to track is that we already cannot really reliably keep up with triaging numerous, typically relatively small, perf regressions. I suspect that the parallel compiler mode will be even more difficult to diagnose regressions in -- at least in the current suite of tools -- so I am reluctant to add that extra data to our perf-tracking work.

I absolutely wouldn't expect the perf team to handle diagnosing or dealing with such regressions; the only time we'd want to consider them is when making changes like this that may trade off optimization of one for the other.

@rust-timer (Collaborator)

Finished benchmarking commit (471ea6ab86e550c13a729833d90e362bbc7d9622): comparison url.

Summary: This benchmark run shows 15 relevant improvements 🎉 but 23 relevant regressions 😿 to instruction counts.

  • Average relevant regression: 1.0%
  • Average relevant improvement: -1.2%
  • Largest improvement in instruction counts: -3.0% on incr-patched: add static arr item builds of coercions debug
  • Largest regression in instruction counts: 1.9% on full builds of inflate check

If you disagree with this performance assessment, please file an issue in rust-lang/rustc-perf.

Benchmarking this pull request likely means that it is perf-sensitive, so we're automatically marking it as not fit for rolling up. While you can manually mark this PR as fit for rollup, we strongly recommend not doing so since this PR led to changes in compiler perf.

Next Steps: If you can justify the regressions found in this try perf run, please indicate this with @rustbot label: +perf-regression-triaged along with sufficient written justification. If you cannot justify the regressions please fix the regressions and do another perf run. If the next run shows neutral or positive results, the label will be automatically removed.

@bors rollup=never
@rustbot label: +S-waiting-on-review -S-waiting-on-perf +perf-regression

@rustbot rustbot added perf-regression Performance regression. S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. and removed S-waiting-on-perf Status: Waiting on a perf run to be completed. labels Feb 17, 2022
@Mark-Simulacrum (Member, Author)

Looks like the bootstrap data is not actually getting properly sorted -- rust-lang/rustc-perf#1175 should fix that -- but overall this shaves a good 2% off bootstrap time (about 15 seconds), with a largely neutral overall effect on perf (the regressions do not seem large and, given the patch, are likely optimizer noise; there are improvements of roughly equal magnitude).

The win seems significant enough to be worth spending some time on bringing this from prototype to actually landing it -- I'm not sure how best to do that yet. cc @cjgillot @rust-lang/wg-incr-comp, since the changes primarily thread through incremental code.

My initial thinking is that we can either just land this (pretty much as-is, modulo some further comment cleanup, renaming struct fields, etc.) or try to cfg-gate all the sharding away. Given the relatively small size of this PR, the cfg approach is probably not too much trouble, but it would definitely require some plumbing and, I suspect, look pretty unfortunate.

@joshtriplett's point on parallel compiler performance is likely worth taking into account too. I can try to gather some statistics there, but it'll be a bit of a pain for sure; if we choose to just cfg all the relevant bits, that could perhaps mean skipping the parallel evaluation.

@klensy (Contributor) commented Feb 17, 2022

I've done some work in #93787 on separating parallel_compiler, but it needs review.

@Mark-Simulacrum (Member, Author)

I think this would be largely orthogonal to that PR (or would increase the work needed to separate things out), since this PR moves an API from being largely equivalent across the parallel/non-parallel split to being quite different.

@nnethercote (Contributor)

My initial thinking is we can either just land this (pretty much as-is, modulo some further comment cleanup / renaming struct variables, etc.) or try to cfg gate all the sharding away.

The former would almost certainly regress the parallel compiler, right? All the shards would be locked where currently only a single shard is.

@Mark-Simulacrum (Member, Author)

Presuming there's heavy contention on a given query -- yes. It's worth noting that each individual query still has its own lock, so if threads are doing parallel work and largely executing distinct queries, contention is probably minimal. We don't really hold the locks themselves for very long, either. #61779 added the sharding based on what looks like a single documented data point, though the 30% win there is certainly significant. On the other hand, IIRC first-generation Ryzen was particularly bad at latency when shuffling cache lines between cores, so I'm not sure how much of a win sharding ends up being.

I think it's pretty likely that we could fairly minimally adjust the PR to have Sharded<T> be just Lock<T> on non-parallel compilers and [Lock<T>; N] on parallel compilers, with functions mapping appropriately onto each use case. I can try to experiment with that and make a concrete proposal (i.e., a delta on this PR), though I admit a good part of me wants to just not bother with the cfgs across the rustc_query crates necessary to make that happen.
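A hedged sketch of what that split could look like (Mutex stands in for rustc's Lock; the method shape and constants are illustrative assumptions, not a concrete delta from the PR):

```rust
use std::sync::Mutex; // stand-in for rustc_data_structures' Lock type

// Non-parallel: Sharded<T> collapses to a single lock -- no array,
// no index computation.
#[cfg(not(parallel_compiler))]
pub struct Sharded<T> {
    shard: Mutex<T>,
}

#[cfg(not(parallel_compiler))]
impl<T> Sharded<T> {
    #[inline]
    pub fn get_shard_by_hash(&self, _hash: u64) -> &Mutex<T> {
        &self.shard // the hash is ignored entirely
    }
}

// Parallel: keep hash-based sharding to reduce lock contention.
#[cfg(parallel_compiler)]
const SHARD_BITS: usize = 5;
#[cfg(parallel_compiler)]
const SHARDS: usize = 1 << SHARD_BITS;

#[cfg(parallel_compiler)]
pub struct Sharded<T> {
    shards: [Mutex<T>; SHARDS],
}

#[cfg(parallel_compiler)]
impl<T> Sharded<T> {
    #[inline]
    pub fn get_shard_by_hash(&self, hash: u64) -> &Mutex<T> {
        let i = ((hash >> 57) as usize) & (SHARDS - 1);
        &self.shards[i]
    }
}
```

Since both versions would expose the same get_shard_by_hash entry point, most callers could compile unchanged in either mode, limiting the cfg surface to this one type.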

@bjorn3 (Member) commented Feb 20, 2022

Doesn't non-parallel rustc already use a single shard?

#[cfg(not(parallel_compiler))]
const SHARD_BITS: usize = 0;

@Mark-Simulacrum (Member, Author)

Yes, it does. As the perf results illustrate, the layers of abstraction there do add a fairly considerable chunk of compilation time, though runtime performance is largely unaffected.

I'm working on a revision of this patch that aims to cfg away the sharding more carefully, alongside some cleanups, so that we keep equivalent behavior on parallel builds.

@Mark-Simulacrum Mark-Simulacrum added S-waiting-on-author Status: This is awaiting some action (such as code changes or more information) from the author. and removed S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. labels Feb 20, 2022
@Mark-Simulacrum (Member, Author)

@bors try @rust-timer queue

Alright, pushed up a new set of commits which do a more thorough refactoring, split across multiple commits, and keep largely identical high-level behavior for parallel compilation (modulo a few mostly minor details, such as QueryLookup having kept shard indices rather than recomputing them from scratch; those are hard to pipe around with cfg and do not feel likely to be meaningful to me).

@rust-timer (Collaborator)

Awaiting bors try build completion.

@rustbot label: +S-waiting-on-perf

@rustbot rustbot added the S-waiting-on-perf Status: Waiting on a perf run to be completed. label Feb 20, 2022
@bors (Contributor) commented Feb 20, 2022

⌛ Trying commit 594ea74 with merge 18dcd0a8e0dab57c40141aa2d34a1f9e33c365b3...

@bors (Contributor) commented Feb 20, 2022

☀️ Try build successful - checks-actions
Build commit: 18dcd0a8e0dab57c40141aa2d34a1f9e33c365b3

@rust-timer (Collaborator)

Queued 18dcd0a8e0dab57c40141aa2d34a1f9e33c365b3 with parent c1aa854, future comparison URL.

@Mark-Simulacrum Mark-Simulacrum added S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. and removed S-waiting-on-author Status: This is awaiting some action (such as code changes or more information) from the author. labels Feb 20, 2022
@rust-timer (Collaborator)

Finished benchmarking commit (18dcd0a8e0dab57c40141aa2d34a1f9e33c365b3): comparison url.

Summary: This benchmark run shows 26 relevant improvements 🎉 but 21 relevant regressions 😿 to instruction counts.

  • Average relevant regression: 1.4%
  • Average relevant improvement: -0.5%
  • Largest improvement in instruction counts: -1.1% on incr-unchanged builds of ctfe-stress-4 check
  • Largest regression in instruction counts: 2.5% on incr-full builds of deeply-nested-async check

If you disagree with this performance assessment, please file an issue in rust-lang/rustc-perf.

Benchmarking this pull request likely means that it is perf-sensitive, so we're automatically marking it as not fit for rolling up. While you can manually mark this PR as fit for rollup, we strongly recommend not doing so since this PR led to changes in compiler perf.

Next Steps: If you can justify the regressions found in this try perf run, please indicate this with @rustbot label: +perf-regression-triaged along with sufficient written justification. If you cannot justify the regressions please fix the regressions and do another perf run. If the next run shows neutral or positive results, the label will be automatically removed.

@bors rollup=never
@rustbot label: +S-waiting-on-review -S-waiting-on-perf +perf-regression

@rustbot rustbot removed the S-waiting-on-perf Status: Waiting on a perf run to be completed. label Feb 21, 2022
@Mark-Simulacrum Mark-Simulacrum added the perf-regression-triaged The performance regression has been triaged. label Feb 21, 2022
@Mark-Simulacrum (Member, Author) commented Feb 21, 2022

Results look pretty mixed but I think overall neutral -- stress tests dominate the regressions and there are also some small improvements. Looking at cachegrind diffs locally, they don't look obviously related to the work in this PR, so I am marking the regression as triaged (likely inlining noise and similar).

@cjgillot (Contributor)

I'm wondering: is there a longer-term context for these changes? This PR optimizes the serial compiler and degrades the parallel compiler. Should we consider dropping/reimplementing the parallel compiler?

@Mark-Simulacrum (Member, Author)

The latest version of this PR should have roughly neutral effect on parallel compiler performance, since it keeps sharding things in that mode.

IMO, it may not be a bad idea to drop the parallel compiler support unless we have concrete investment expected in the next 6-18 month timeframe, since it does cause constant 'small' pain across many bits of the compiler. But this PR would ideally not be blocked on a decision there :)

@lqd (Member) commented Feb 25, 2022

Should we consider dropping/reimplementing the parallel compiler?

The parallel compiler currently suffers from non-deterministic ICEs (in addition to the other known issues around jobserver pipe contention, lack of horizontal scalability, etc.), but when/if it works, it seems to be surprisingly effective on compile times.

@cjgillot (Contributor)

LGTM.
Why did you remove the caching of the key hash via the QueryLookup type? Could we gain a bit of perf by avoiding hashing keys multiple times?
r=me either way

@Mark-Simulacrum (Member, Author)

It was already unused -- if you look at the commit deleting QueryLookup, we weren't actually threading it down anywhere. The query shard was used, but recomputing it from the hash we calculate anyway should be pretty cheap (it's just a shift and a mask), and threading it through only on parallel compilers seems like more work than we ought to do.

It's possible that actually caching the key hash would make sense, but I think we also gain a little by not threading the data down (saving some registers/stack allocation), so caching is not guaranteed to help. I expect the query key hash is typically very fast to compute, since the majority of our keys are things like DefId, which takes just a handful of instructions to hash with FxHash.
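To make the "shift and mask" point concrete, here is a small illustration (the constants and function name are hypothetical, not rustc's exact code):

```rust
const SHARD_BITS: usize = 5; // hypothetical value
const SHARDS: usize = 1 << SHARD_BITS;

// Recomputing the shard index from an already-computed key hash is a
// couple of ALU instructions, so caching it (as QueryLookup did for the
// shard) saves very little compared to threading the value around.
#[inline]
fn shard_index_from_hash(hash: u64) -> usize {
    ((hash >> 57) as usize) & (SHARDS - 1)
}

fn main() {
    let hash: u64 = 0x9e37_79b9_7f4a_7c15; // some key hash
    assert!(shard_index_from_hash(hash) < SHARDS);
    println!("shard = {}", shard_index_from_hash(hash));
}
```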

@bors r=cjgillot

@bors (Contributor) commented Feb 27, 2022

📌 Commit 594ea74 has been approved by cjgillot

@bors bors added S-waiting-on-bors Status: Waiting on bors to run and complete tests. Bors will change the label on completion. and removed S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. labels Feb 27, 2022
@bors (Contributor) commented Feb 27, 2022

⌛ Testing commit 594ea74 with merge 3b1fe7e...

@bors (Contributor) commented Feb 27, 2022

☀️ Test successful - checks-actions
Approved by: cjgillot
Pushing 3b1fe7e to master...

@bors bors added the merged-by-bors This PR was explicitly merged by bors. label Feb 27, 2022
@bors bors merged commit 3b1fe7e into rust-lang:master Feb 27, 2022
@rustbot rustbot added this to the 1.61.0 milestone Feb 27, 2022
@rust-timer (Collaborator)

Finished benchmarking commit (3b1fe7e): comparison url.

Summary: This benchmark run shows 55 relevant improvements 🎉 to instruction counts.

  • Arithmetic mean of relevant regressions: 1.1%
  • Arithmetic mean of relevant improvements: -0.8%
  • Arithmetic mean of all relevant changes: -0.6%
  • Largest improvement in instruction counts: -2.3% on full builds of keccak check

If you disagree with this performance assessment, please file an issue in rust-lang/rustc-perf.

@rustbot label: -perf-regression

@rustbot rustbot removed the perf-regression Performance regression. label Feb 27, 2022
@Mark-Simulacrum Mark-Simulacrum deleted the drop-sharded branch February 27, 2022 21:14
Labels
merged-by-bors This PR was explicitly merged by bors. perf-regression-triaged The performance regression has been triaged. S-waiting-on-bors Status: Waiting on bors to run and complete tests. Bors will change the label on completion. T-compiler Relevant to the compiler team, which will review and decide on the PR/issue.