Performance of helloworld5000 could be improved #50994

Closed

nnethercote opened this issue May 23, 2018 · 8 comments
Labels
A-inference (Area: Type inference)
C-enhancement (Category: An issue proposing an enhancement or a PR with one.)
I-compiletime (Issue: Problems and improvements with respect to compile times.)
T-compiler (Relevant to the compiler team, which will review and decide on the PR/issue.)

Comments

@nnethercote
Contributor

helloworld5000 is the name I've given to a benchmark: helloworld with its println!("Hello world"); statement repeated 5,000 times. It's an interesting stress test for the compiler.

On my machine, a debug build takes 4.5 seconds and an opt build takes 62(!) seconds.
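For reference, here is a minimal sketch of generating the benchmark source; the file name and the exact shape of the program are my assumptions, not taken from this issue:

    // Writes a `fn main` containing 5,000 copies of `println!("Hello world");`
    // to helloworld5000.rs. Illustrative only.
    fn main() {
        let mut src = String::from("fn main() {\n");
        for _ in 0..5000 {
            src.push_str("    println!(\"Hello world\");\n");
        }
        src.push_str("}\n");
        std::fs::write("helloworld5000.rs", src).expect("failed to write benchmark file");
    }

Compiling the result with plain rustc helloworld5000.rs gives the debug numbers; adding -Copt-level=3 gives the opt numbers.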

In the debug build, execution time is dominated by take_and_reset_data. Cachegrind measures these instruction counts:

28,047,781,500  PROGRAM TOTALS

7,501,239,976  /home/njn/moz/rust0/src/libcore/slice/mod.rs:rustc::infer::region_constraints::RegionConstraintCollector::take_and_reset_data
5,625,275,003  /home/njn/.cargo/registry/src/git.luolix.top-1ecc6299db9ec823/ena-0.9.3/src/snapshot_vec.rs:rustc::infer::region_constraints::RegionConstraintCollector::take_and_reset_data

The reset_unifications call within take_and_reset_data is the expensive part. It all boils down to set_all within the ena crate:

            .      pub fn set_all(&mut self, mut new_elems: impl FnMut(usize) -> D::Value) {
      185,003          if !self.in_snapshot() {
            .              for (slot, index) in self.values.iter_mut().zip(0..) {
5,625,090,000                  *slot = new_elems(index);
            .              }
            .          } else {
            .              for i in 0..self.values.len() {
            .                  self.set(i, new_elems(i));
            .              }
            .          }
            .      }

and iterator code (called from set_all):

            .      fn next(&mut self) -> Option<$elem> {
            .          // could be implemented with slices, but this avoids bounds checks
            .          unsafe {
            .              if mem::size_of::<T>() != 0 {
    1,067,186                  assume(!self.ptr.is_null());
       20,013                  assume(!self.end.is_null());
            .              }
7,621,400,804              if self.ptr == self.end {
            .                  None
            .              } else {
            .                  Some($mkref!(self.ptr.post_inc()))
            .              }
            .          }
            .      }

I did some measurements and found that, in the vast majority of cases, reset_unifications is a no-op -- it overwrites the unification table with the same values it already has. I wonder if we could do better somehow. It's a shame we have to keep the unbound variables around rather than just clearing them like we do with the other data in take_and_reset_data. I know that this is an extreme example, but profiles indicate that reset_unifications is somewhat hot on more normal programs too. @nikomatsakis, any ideas?
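One possible direction, shown here as a minimal self-contained sketch (these are not ena's actual types or API, and it assumes the values are cheap to compare): recompute each value but skip the store when it matches what's already there, so the common no-op case only reads.

    // Hypothetical helper, illustrative only; ena's real set_all is a method
    // on its snapshot vector and is generic over the store's value type.
    fn reset_all<T: PartialEq>(values: &mut [T], mut new_elems: impl FnMut(usize) -> T) {
        for (index, slot) in values.iter_mut().enumerate() {
            let new = new_elems(index);
            // Only pay for the store (and the dirtied cache line) on a real change.
            if *slot != new {
                *slot = new;
            }
        }
    }

    fn main() {
        let mut table = vec![0u32; 8];
        reset_all(&mut table, |i| (i as u32) % 2);
        assert_eq!(table, [0, 1, 0, 1, 0, 1, 0, 1]);
    }

This still walks the whole table, so it only helps if the comparison plus branch is cheaper than the unconditional store; it does not address the iteration cost visible in the profile above.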

In the opt build, these are the top functions according to Cachegrind:

555,228,191,235  PROGRAM TOTALS

87,515,336,354  /home/njn/moz/rust0/src/llvm/include/llvm/ADT/SmallPtrSet.h:llvm::PointerMayBeCaptured(llvm::Value const*, llvm::CaptureTracker*)
41,641,142,772  /home/njn/moz/rust0/src/llvm/lib/Analysis/CaptureTracking.cpp:llvm::PointerMayBeCaptured(llvm::Value const*, llvm::CaptureTracker*) [/home/njn/moz/rust0/build/x86_64-unknown-linux-gnu/stage2/lib/rustlib/x86_64-unknown-linux-gnu/codegen-backends/librustc_codegen_llvm-llvm.so]
35,396,524,596  /home/njn/moz/rust0/src/llvm/include/llvm/ADT/DenseMap.h:llvm::SmallDenseMap<llvm::Instruction const*, unsigned int, 32u, llvm::DenseMapInfo<llvm::Instruction const*>, llvm::detail::DenseMapPair<llvm::Instruction const*, unsigned int> >::grow(unsigned int) [/home/njn/moz/rust0/build/x86_64-unknown-linux-gnu/stage2/lib/rustlib/x86_64-unknown-linux-gnu/codegen-backends/librustc_codegen_llvm-llvm.so]
33,096,763,980  /home/njn/moz/rust0/src/llvm/lib/IR/Attributes.cpp:llvm::AttributeList::getAttributes(unsigned int) const [/home/njn/moz/rust0/build/x86_64-unknown-linux-gnu/stage2/lib/rustlib/x86_64-unknown-linux-gnu/codegen-backends/librustc_codegen_llvm-llvm.so]
30,008,031,294  /home/njn/moz/rust0/src/llvm/include/llvm/ADT/DenseMap.h:llvm::OrderedBasicBlock::comesBefore(llvm::Instruction const*, llvm::Instruction const*)
29,931,802,152  /home/njn/moz/rust0/src/llvm/lib/IR/Attributes.cpp:llvm::AttributeList::hasAttribute(unsigned int, llvm::Attribute::AttrKind) const [/home/njn/moz/rust0/build/x86_64-unknown-linux-gnu/stage2/lib/rustlib/x86_64-unknown-linux-gnu/codegen-backends/librustc_codegen_llvm-llvm.so]

That's a lot of time in PointerMayBeCaptured.

@kennytm added the C-enhancement, I-compiletime, and T-compiler labels May 23, 2018
@nikic
Contributor

nikic commented May 25, 2018

I've opened an issue for the LLVM part upstream: https://bugs.llvm.org/show_bug.cgi?id=37588

@nnethercote
Contributor Author

Thank you, @nikic! That's extremely helpful.

@dotdash
Contributor

dotdash commented Jan 6, 2019

This seems to benefit a lot from #57315

With a non-incremental, LTO-enabled build of rustc, a debug build takes 20s, while a rustc with the mentioned PR that was built in incremental mode takes "only" 9.4s. Building an LTO'd version of that rustc now...

@dotdash
Contributor

dotdash commented Jan 6, 2019

An LTO'd rustc with #57315 takes 3.1s for me to produce a debug build.

@jonas-schievink
Contributor

@dotdash Are you sure that's the right PR? It's a Cargo update.

@dotdash
Contributor

dotdash commented Jan 6, 2019

oops, typo'd, it's #57351

@dotdash
Contributor

dotdash commented Jan 6, 2019

Thinking about it, given that the test case originally only took 4.5s, I suspect that the const_eval part might have been a regression that was introduced in the meantime.

@jonas-schievink added the A-inference (Area: Type inference) label Mar 14, 2020
bors added a commit to rust-lang-ci/rust that referenced this issue Jan 15, 2024
perf: Don't track specific live points for promoteds

We don't query this information out of the promoted (it's basically a single "unit" regardless of the complexity within it), and this saves on re-initializing the SparseIntervalMatrix's backing IndexVec with mostly empty rows for all of the leading regions in the function. Typical promoteds will only contain a few regions that need to be uplifted, while the parent function can have thousands.

For a simple function repeating println!("Hello world"); 50,000 times, this reduces compile times from 90 to 15 seconds in debug mode. The previous implementation's re-initialization led to an overall roughly n^2 runtime, as each promoted initialized slots for ~n regions; now we scale closer to linearly (5,000 hello worlds takes 1.1 seconds).

cc rust-lang#50994
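To make the commit message's n^2 claim concrete, a quick illustrative calculation (the counts below are assumed orders of magnitude, not measurements from the PR):

    // Illustrative arithmetic only: if each of ~n promoteds re-initializes
    // rows for the parent function's ~n regions, total work is ~n*n; treating
    // each promoted as a single unit drops that to ~n.
    fn main() {
        let promoteds = 5_000u64;
        let regions = 5_000u64;
        let old_work = promoteds * regions; // ~25,000,000 row initializations
        let new_work = promoteds;           // ~5,000
        println!("old: {old_work}, new: {new_work}");
    }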
bors added a commit to rust-lang-ci/rust that referenced this issue Jan 16, 2024

bors added a commit to rust-lang-ci/rust that referenced this issue Jan 20, 2024

github-actions bot pushed a commit to rust-lang/miri that referenced this issue Jan 21, 2024
@workingjubilee
Member

Either my machine is that much of a beast, or this was simply fixed.

$ time rustc helloworld5000.rs

real    0m0.423s
user    0m0.384s
sys     0m0.039s
$ time rustc -Copt-level=3 helloworld5000.rs

real    0m1.724s
user    0m1.579s
sys     0m0.144s
