Always format to internal String in FmtPrinter #94131
Conversation
Some changes occurred to the CTFE / Miri engine. cc @rust-lang/miri
r? @oli-obk (rust-highfive has picked a reviewer for you, use r? to override)
☀️ Try build successful - checks-actions
Queued e19edf03ef30826c4928f4a631b37e4871e43ab4 with parent b8c56fa, future comparison URL.
Finished benchmarking commit (e19edf03ef30826c4928f4a631b37e4871e43ab4): comparison URL. Summary: This benchmark run shows 32 relevant regressions 😿 to instruction counts.
If you disagree with this performance assessment, please file an issue in rust-lang/rustc-perf. Benchmarking this pull request likely means that it is perf-sensitive, so we're automatically marking it as not fit for rolling up. While you can manually mark this PR as fit for rollup, we strongly recommend not doing so since this PR led to changes in compiler perf. Next steps: if you can justify the regressions found in this try perf run, please indicate this with @bors rollup=never.
While we reduce bootstrap time by 9s, these regressions seem fairly bad. I see three other solutions that could be tried:
I plan to dig into the regressions; my guess is that they're fixable along one of those lines. It may also help to preallocate some space in the buffer, since we expect to print something.
I am pretty confident that the regressions here come from inlining decisions changing slightly. Looking at cachegrind diffs, the differences don't seem to belong to formatting code (they're mostly in ty::relate and similar generic code). I'm going to try to confirm that more closely with some disassembly, and see whether there are a few optimizations that can be applied regardless (e.g., by making use of the new information that our destination buffer is infallible), but my initial sense is that these regressions are largely outside our direct control here. We're just moving a bunch of code to being codegen'd only once, which naturally means there's a little less information for LLVM to make use of in some places -- things like which flags were set on FmtPrinter, for example -- which leads to slightly worse performance.
-impl<'a, 'tcx, F> Deref for FmtPrinter<'a, 'tcx, F> {
-    type Target = FmtPrinterData<'a, 'tcx, F>;
+impl<'a, 'tcx> Deref for FmtPrinter<'a, 'tcx> {
+    type Target = FmtPrinterData<'a, 'tcx>;
     fn deref(&self) -> &Self::Target {
It may be enough to slap #[inline] on all the functions in this file that appear in cachegrind for the first time after this PR.
None of these functions appear in cachegrind -- they're not the direct problem. For example on stm32f4-check-full, we see a diff like this:
245,045,449 PROGRAM TOTALS
--------------------------------------------------------------------------------
Ir file:function
--------------------------------------------------------------------------------
372,484,116 ???:<core::iter::adapters::map::Map<core::iter::adapters::enumerate::Enumerate<core::iter::adapters::zip::Zip<core::iter::adapters::copied::Copied<core::slice::iter::Iter<rustc_middle::ty::subst::>
-326,700,323 ???:rustc_middle::ty::relate::relate_substs::<rustc_infer::infer::equate::Equate>
123,186,012 ???:<rustc_middle::ty::subst::SubstFolder as rustc_middle::ty::fold::TypeFolder>::fold_ty
86,186,243 ???:rustc_middle::ty::relate::super_relate_tys::<rustc_infer::infer::equate::Equate>
-55,808,143 ???:<rustc_middle::ty::Ty as rustc_middle::ty::fold::TypeFoldable>::super_fold_with::<rustc_middle::ty::subst::SubstFolder>
30,561,468 ???:<alloc::collections::btree::map::entry::OccupiedEntry<rustc_infer::infer::region_constraints::Constraint, rustc_infer::infer::SubregionOrigin>>::remove_entry
-27,764,057 ???:<alloc::collections::btree::node::Handle<alloc::collections::btree::node::NodeRef<alloc::collections::btree::node::marker::Mut, rustc_infer::infer::region_constraints::Constraint, rustc_infer:>
20,749,456 ???:<&mut alloc::vec::Vec<ena::unify::VarValue<rustc_type_ir::TyVid>> as core::convert::AsRef<[ena::unify::VarValue<rustc_type_ir::TyVid>]>>::as_ref
18,135,210 ???:<rustc_middle::mir::Rvalue>::ty::<rustc_middle::mir::Body>
15,490,475 ???:<rustc_middle::ty::fold::RegionFolder as rustc_middle::ty::fold::FallibleTypeFolder>::try_fold_ty
-12,892,781 ???:<rustc_middle::ty::Ty as rustc_middle::ty::fold::TypeFoldable>::super_fold_with::<rustc_middle::ty::fold::RegionFolder>
12,520,606 ???:<rustc_infer::infer::combine::CombineFields>::instantiate
12,024,623 ???:<rustc_data_structures::stable_hasher::StableHasher>::finish::<u128>
-11,945,001 ???:<core::iter::adapters::map::Map<std::collections::hash::map::Iter<rustc_span::def_id::LocalDefId, rustc_hir::hir_id::ItemLocalId>, rustc_data_structures::stable_hasher::stable_hash_reduce<rust>
-10,922,924 ???:<rustc_middle::ty::subst::GenericArg as rustc_middle::ty::relate::Relate>::relate::<rustc_infer::infer::nll_relate::TypeRelating<rustc_borrowck::type_check::relate_tys::NllTypeRelatingDelegate>
9,837,670 ???:<alloc::vec::Vec<rustc_type_ir::TyVid> as alloc::vec::spec_from_iter::SpecFromIter<rustc_type_ir::TyVid, core::iter::adapters::filter_map::FilterMap<core::ops::range::Range<usize>, <rustc_infe>
-8,795,699 ???:<rustc_infer::infer::sub::Sub as rustc_middle::ty::relate::TypeRelation>::relate_with_variance::<rustc_middle::ty::Ty>
8,712,904 ???:<&mut rustc_middle::ty::relate::relate_substs<rustc_infer::infer::nll_relate::TypeRelating<rustc_borrowck::type_check::relate_tys::NllTypeRelatingDelegate>>::{closure#0} as core::ops::function>
8,332,971 ???:<rustc_infer::infer::equate::Equate as rustc_middle::ty::relate::TypeRelation>::relate_with_variance::<rustc_middle::ty::Ty>
8,262,191 ???:<rustc_infer::infer::sub::Sub as rustc_middle::ty::relate::TypeRelation>::tys
-8,099,008 ???:<rustc_infer::infer::InferCtxt>::unsolved_variables
Ah lol, yeah, inlining cost boundaries are fun. Yeah, just slap the attribute on all previously generic methods; that should make callers big enough again to stop them from being inlined. Alternatively, add inline(never) with a comment to the functions that disappeared from cachegrind.
That's not practical -- almost every single method became non-generic after removing the parameter; that's the whole reason for the big win on compile times. IMO, trying to second-guess LLVM here on inlining (in either direction) is not the right approach; we're not likely to find a magic bullet.
The diff above illustrates typical inlining decisions shifting slightly, but trying to recover from that by adding attributes isn't likely to help much -- the changes are too scattered, and I suspect they ultimately arise from things we can't easily influence.
I thought the gains came from not instantiating the generics in 5 different crates, generating LLVM IR for them, and optimizing them? Having them compiled only once is a different aspect from removing the implicit #[inline] that generic functions have.
#[inline] on all of these methods will -- AFAIK -- force codegen in every crate that uses them; it's equivalent to generics in that respect. It also adds an inline hint, but that's a separate effect and unlikely to matter as much given the large size of many of the functions here. We do save a little by generating just one copy per crate instead of 2-3, but that's still a higher cost than a single copy overall.
Force-pushed from 81611d8 to 87bbb05.
@bors try @rust-timer queue
FWIW, I tried adding some eprintln! calls to the FmtPrinter::new and FmtPrinter::into_buffer functions, and neither is used at all for deeply-nested-async check builds. I think that further confirms that this change is not itself likely causing the regressions; rather, other functions are being shuffled between CGUs or similar because these functions are omitted. I've added a commit to preallocate a 64-byte buffer, which might help with some benchmarks, and rebased; let's see if we still see the same impacts.
Awaiting bors try build completion. @rustbot label: +S-waiting-on-perf
⌛ Trying commit 87bbb05815e1b44c940809f9ab261411bf0068fc with merge c65d716943343dccd7e2f9a1ad2455e7ebf9ae49...
☀️ Try build successful - checks-actions
Queued c65d716943343dccd7e2f9a1ad2455e7ebf9ae49 with parent e08d569, future comparison URL.
Preliminary local results suggest the 64-byte preallocation is not likely to have a significant impact here, so it probably doesn't matter much. My suggested approach is still to call these relatively minor regressions justified: based on the investigation I've done locally (cachegrind diffs, plus the fact that the modified code doesn't execute at all in some of the regressed benchmarks), they appear not to be directly related to the changes in this PR. I'm inclined to take the win on bootstrap time and leave the runtime performance aside in this case. The folding code may have a few orthogonal wins that buy back some of this regression, but those wins will never relate to this PR in a direct fashion.
This avoids monomorphizing for different parameters, decreasing generic code instantiated downstream from rustc_middle.
Force-pushed from 87bbb05 to 2ee6d55.
@bors r+
📌 Commit 2ee6d55 has been approved by |
⌛ Testing commit 2ee6d55 with merge 8b9e41cd50112f711aef7f5d28d8458430bee486... |
💔 Test failed - checks-actions |
@bors treeclosed=100 retry |
@bors treeclosed- |
⌛ Testing commit 2ee6d55 with merge a26f75a3c88c9ff5b5bcc90d944e60a37ef8a0c3... |
@bors retry |
☀️ Test successful - checks-actions |
Finished benchmarking commit (4b043fa): comparison URL. Summary: This benchmark run did not return any relevant results. 30 results were found to be statistically significant but too small to be relevant. If you disagree with this performance assessment, please file an issue in rust-lang/rustc-perf. @rustbot label: -perf-regression
This avoids monomorphizing for different parameters, decreasing generic code
instantiated downstream from rustc_middle -- locally seeing 7% unoptimized LLVM IR
line wins on rustc_borrowck, for example.
We likely can't (or shouldn't) get rid of the Result-ness on most functions. Some further cleanup to avoid fmt::Error where we now know it can't occur may be possible, though somewhat painful -- fmt::Write is a pretty annoying API to work with in practice when you're trying to use it infallibly.