Performance of helloworld5000 could be improved #50994

nnethercote · 2018-05-23T10:26:46Z

helloworld5000 is the name I've given to the benchmark that is helloworld with the println!("Hello world"); repeated 5,000 times. It's an interesting stress test for the compiler.

On my machine, a debug build takes 4.5 seconds and an opt build takes 62(!) seconds.

In the debug build, execution time is dominated by take_and_reset_data. Cachegrind measures these instruction counts:

28,047,781,500  PROGRAM TOTALS

7,501,239,976  /home/njn/moz/rust0/src/libcore/slice/mod.rs:rustc::infer::region_constraints::RegionConstraintCollector::take_and_reset_data
5,625,275,003  /home/njn/.cargo/registry/src/git.luolix.top-1ecc6299db9ec823/ena-0.9.3/src/snapshot_vec.rs:rustc::infer::region_constraints::RegionConstraintCollector::take_and_reset_data

The reset_unifications call within take_and_reset_data is the expensive part. It all boils down to set_all within the ena crate:

            .      pub fn set_all(&mut self, mut new_elems: impl FnMut(usize) -> D::Value) {
      185,003          if !self.in_snapshot() {
            .              for (slot, index) in self.values.iter_mut().zip(0..) {
5,625,090,000                  *slot = new_elems(index);
            .              }
            .          } else {
            .              for i in 0..self.values.len() {
            .                  self.set(i, new_elems(i));
            .              }
            .          }
            .      }

and iterator code (called from set_all):

            .      fn next(&mut self) -> Option<$elem> {
            .          // could be implemented with slices, but this avoids bounds checks
            .          unsafe {
            .              if mem::size_of::<T>() != 0 {
    1,067,186                  assume(!self.ptr.is_null());
       20,013                  assume(!self.end.is_null());
            .              }
7,621,400,804              if self.ptr == self.end {
            .                  None
            .              } else {
            .                  Some($mkref!(self.ptr.post_inc()))
            .              }
            .          }
            .      }

I did some measurement and found that, in the vast majority of cases, reset_unification is a no-op -- it overwrites the the unification table with the same values that it already has. I wonder if we could do better somehow. It's a shame we have to keep the unbound variables around rather than just clearing them like we do with the other data in take_and_reset_data. I know that this is an extreme example, but profiles indicate that reset_unifications is somewhat hot on more normal programs too. @nikomatsakis, any ideas?

In the opt builds, these are the top functions according to Cachegrind:

555,228,191,235  PROGRAM TOTALS

87,515,336,354  /home/njn/moz/rust0/src/llvm/include/llvm/ADT/SmallPtrSet.h:llvm::PointerMayBeCaptured(llvm::Value const*, llvm::CaptureTracker*)
41,641,142,772  /home/njn/moz/rust0/src/llvm/lib/Analysis/CaptureTracking.cpp:llvm::PointerMayBeCaptured(llvm::Value const*, llvm::CaptureTracker*) [/home/njn/moz/rust0/build/x86_64-unknown-linux-gnu/stage2/lib/rustlib/x86_64-unknown-linux-gnu/codegen-backends/librustc_codegen_llvm-llvm.so]
35,396,524,596  /home/njn/moz/rust0/src/llvm/include/llvm/ADT/DenseMap.h:llvm::SmallDenseMap<llvm::Instruction const*, unsigned int, 32u, llvm::DenseMapInfo<llvm::Instruction const*>, llvm::detail::DenseMapPair<llvm::Instruction const*, unsigned int> >::grow(unsigned int) [/home/njn/moz/rust0/build/x86_64-unknown-linux-gnu/stage2/lib/rustlib/x86_64-unknown-linux-gnu/codegen-backends/librustc_codegen_llvm-llvm.so]
33,096,763,980  /home/njn/moz/rust0/src/llvm/lib/IR/Attributes.cpp:llvm::AttributeList::getAttributes(unsigned int) const [/home/njn/moz/rust0/build/x86_64-unknown-linux-gnu/stage2/lib/rustlib/x86_64-unknown-linux-gnu/codegen-backends/librustc_codegen_llvm-llvm.so]
30,008,031,294  /home/njn/moz/rust0/src/llvm/include/llvm/ADT/DenseMap.h:llvm::OrderedBasicBlock::comesBefore(llvm::Instruction const*, llvm::Instruction const*)
29,931,802,152  /home/njn/moz/rust0/src/llvm/lib/IR/Attributes.cpp:llvm::AttributeList::hasAttribute(unsigned int, llvm::Attribute::AttrKind) const [/home/njn/moz/rust0/build/x86_64-unknown-linux-gnu/stage2/lib/rustlib/x86_64-unknown-linux-gnu/codegen-backends/librustc_codegen_llvm-llvm.so]

That's a lot of time in PointerMayBeCaptured.

The text was updated successfully, but these errors were encountered:

nikic · 2018-05-25T12:38:41Z

I've opened an issue for the LLVM part upstream: https://bugs.llvm.org/show_bug.cgi?id=37588

nnethercote · 2018-05-26T01:18:57Z

Thank you, @nikic! That's extremely helpful.

dotdash · 2019-01-06T14:20:54Z

This seems to benefit a lot from ~~#57315~~ #57351

A non-incremental build of rustc with LTO enabled takes 20s for a debug build, while a rustc with the mentioned PR that was built in incremental mode takes "only" 9.4s. Building a LTO'd version of that rustc now...

dotdash · 2019-01-06T14:47:12Z

LTO'd rustc with ~~#57315~~ #57351 takes 3.1s for me to produce a debug build.

jonas-schievink · 2019-01-06T15:02:15Z

@dotdash Are you sure that's the right PR? It's a Cargo update.

dotdash · 2019-01-06T15:02:47Z

oops, typo'd, it's #57351

dotdash · 2019-01-06T15:06:03Z

Thinking about it, given that the test case originally only took 4.5s, I suspect that the const_eval part might have been a regression that was introduced in the meantime.

perf: Don't track specific live points for promoteds We don't query this information out of the promoted (it's basically a single "unit" regardless of the complexity within it) and this saves on re-initializing the SparseIntervalMatrix's backing IndexVec with mostly empty rows for all of the leading regions in the function. Typical promoteds will only contain a few regions that need up be uplifted, while the parent function can have thousands. For a simple function repeating println!("Hello world"); 50,000 times this reduces compile times from 90 to 15 seconds in debug mode. The previous implementations re-initialization led to an overall roughly n^2 runtime as each promoted initialized slots for ~n regions, now we scale closer to linearly (5000 hello worlds takes 1.1 seconds). cc rust-lang#50994

perf: Don't track specific live points for promoteds We don't query this information out of the promoted (it's basically a single "unit" regardless of the complexity within it) and this saves on re-initializing the SparseIntervalMatrix's backing IndexVec with mostly empty rows for all of the leading regions in the function. Typical promoteds will only contain a few regions that need up be uplifted, while the parent function can have thousands. For a simple function repeating println!("Hello world"); 50,000 times this reduces compile times from 90 to 15 seconds in debug mode. The previous implementations re-initialization led to an overall roughly n^2 runtime as each promoted initialized slots for ~n regions, now we scale closer to linearly (5000 hello worlds takes 1.1 seconds). cc rust-lang#50994, rust-lang#86244

…dtwco perf: Don't track specific live points for promoteds We don't query this information out of the promoted (it's basically a single "unit" regardless of the complexity within it) and this saves on re-initializing the SparseIntervalMatrix's backing IndexVec with mostly empty rows for all of the leading regions in the function. Typical promoteds will only contain a few regions that need up be uplifted, while the parent function can have thousands. For a simple function repeating println!("Hello world"); 50,000 times this reduces compile times from 90 to 15 seconds in debug mode. The previous implementations re-initialization led to an overall roughly n^2 runtime as each promoted initialized slots for ~n regions, now we scale closer to linearly (5000 hello worlds takes 1.1 seconds). cc rust-lang#50994, rust-lang#86244

perf: Don't track specific live points for promoteds We don't query this information out of the promoted (it's basically a single "unit" regardless of the complexity within it) and this saves on re-initializing the SparseIntervalMatrix's backing IndexVec with mostly empty rows for all of the leading regions in the function. Typical promoteds will only contain a few regions that need up be uplifted, while the parent function can have thousands. For a simple function repeating println!("Hello world"); 50,000 times this reduces compile times from 90 to 15 seconds in debug mode. The previous implementations re-initialization led to an overall roughly n^2 runtime as each promoted initialized slots for ~n regions, now we scale closer to linearly (5000 hello worlds takes 1.1 seconds). cc rust-lang/rust#50994, rust-lang/rust#86244

workingjubilee · 2024-05-05T10:57:20Z

Either my machine is that much of a beast, or this was simply fixed.

$ time rustc helloworld5000.rs

real    0m0.423s
user    0m0.384s
sys     0m0.039s
$ time rustc -Copt-level=3 helloworld5000.rs

real    0m1.724s
user    0m1.579s
sys     0m0.144s

kennytm added C-enhancement Category: An issue proposing an enhancement or a PR with one. I-compiletime Issue: Problems and improvements with respect to compile times. T-compiler Relevant to the compiler team, which will review and decide on the PR/issue. labels May 23, 2018

jonas-schievink added the A-inference Area: Type inference label Mar 14, 2020

memoryruins mentioned this issue Jun 16, 2021

Large memory usage and long time on compiling large number of println #86244

Open

Mark-Simulacrum mentioned this issue Jan 15, 2024

perf: Don't track specific live points for promoteds #120003

Merged

workingjubilee closed this as completed May 5, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Performance of helloworld5000 could be improved #50994

Performance of helloworld5000 could be improved #50994

nnethercote commented May 23, 2018

nikic commented May 25, 2018

nnethercote commented May 26, 2018

dotdash commented Jan 6, 2019 •

edited

Loading

dotdash commented Jan 6, 2019 •

edited

Loading

jonas-schievink commented Jan 6, 2019

dotdash commented Jan 6, 2019

dotdash commented Jan 6, 2019

workingjubilee commented May 5, 2024

Performance of helloworld5000 could be improved #50994

Performance of helloworld5000 could be improved #50994

Comments

nnethercote commented May 23, 2018

nikic commented May 25, 2018

nnethercote commented May 26, 2018

dotdash commented Jan 6, 2019 • edited Loading

dotdash commented Jan 6, 2019 • edited Loading

jonas-schievink commented Jan 6, 2019

dotdash commented Jan 6, 2019

dotdash commented Jan 6, 2019

workingjubilee commented May 5, 2024

dotdash commented Jan 6, 2019 •

edited

Loading

dotdash commented Jan 6, 2019 •

edited

Loading