Simplify some `Iterator` methods. #64572

Closed

Conversation

nnethercote
Contributor

PR rust-lang#64545 got a big speed-up by replacing a hot call to `all()` with
explicit iteration. This is because the implementation of `all()` is
excessively complex: it wraps the given predicate in a closure that
returns a `LoopState`, passes that closure to `try_for_each()`, which
wraps the first closure in a second closure, passes that second closure
to `try_fold()`, which does the actual iteration using the second
closure.

A sufficiently smart compiler could optimize all this away; rustc is
currently not sufficiently smart.

This commit does the following.

- Changes the implementations of `all()`, `any()`, `find()` and
  `find_map()` to use the simplest possible code, rather than using
  `try_for_each()`. (I am reminded of "The Evolution of a Haskell
  Programmer".) These are both shorter and faster than the current
  implementations, and will permit the undoing of the `all()` removal in
  rust-lang#64545.

- Changes `ResultShunt::next()` so it doesn't call `self.find()`,
  because that was causing infinite recursion with the new
  implementation of `find()`, which itself calls `self.next()`. (I
  honestly don't know how the old implementation of
  `ResultShunt::next()` didn't cause an infinite loop, given that it
  also called `self.next()`, albeit via `try_for_each()` and
  `try_fold()`.)

- Changes `nth()` to use `self.next()` in a while loop rather than `for
  x in self`, because using self-iteration within an iterator method
  seems dubious, and `self.next()` is used in all the other iterator
  methods.
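
For reference, here is a minimal sketch of the two shapes being compared. This is not the actual libcore code (the real old path goes through `try_for_each()`, and `LoopState` is stood in for by `Result` here); it only illustrates the difference described above.

```rust
// Old shape: `all()` routes through internal iteration (`try_for_each()`/`try_fold()`),
// wrapping the predicate in a closure that returns a control value.
fn all_old<I: Iterator>(iter: &mut I, mut f: impl FnMut(I::Item) -> bool) -> bool {
    iter.try_fold((), |(), x| {
        // stand-in for LoopState::Continue(()) / LoopState::Break(())
        if f(x) { Ok(()) } else { Err(()) }
    })
    .is_ok()
}

// New shape: the simplest possible explicit loop over `next()`.
fn all_new<I: Iterator>(iter: &mut I, mut f: impl FnMut(I::Item) -> bool) -> bool {
    while let Some(x) = iter.next() {
        if !f(x) {
            return false;
        }
    }
    true
}
```

The diff hunks quoted later in the review comments show the corresponding change to the real `all()`.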
@rust-highfive
Collaborator

r? @cramertj

(rust_highfive has picked a reviewer for you, use r? to override)

@rust-highfive rust-highfive added the S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. label Sep 18, 2019
@nnethercote
Contributor Author

@simulacrum: time to test the new all-in-one perf CI command:

@bors try @rust-timer queue

@rust-timer
Collaborator

Awaiting bors try build completion

@bors
Contributor

bors commented Sep 18, 2019

⌛ Trying commit 7e98fb9 with merge 5be48aa...

bors added a commit that referenced this pull request Sep 18, 2019
Simplify some `Iterator` methods.

@mati865
Contributor

mati865 commented Sep 18, 2019

@nnethercote it's @Mark-Simulacrum, not @simulacrum.

@Centril
Contributor

Centril commented Sep 18, 2019

r? @scottmcm

@rust-highfive rust-highfive assigned scottmcm and unassigned cramertj Sep 18, 2019
@bors
Contributor

bors commented Sep 18, 2019

☀️ Try build successful - checks-azure
Build commit: 5be48aa

@rust-timer
Collaborator

Queued 5be48aa with parent 5283791, future comparison URL.

-            else { LoopState::Break(()) }
+        while let Some(x) = self.next() {
+            if !f(x) {
+                return false;
Contributor

@tesuji tesuji Sep 18, 2019

TIL that Iterator::all is that complex.
Here is the result from godbolt: https://godbolt.org/z/QlHZ0m

Contributor Author

Thanks! I'm inexperienced with godbolt, but the results seem to strongly support my simplification, i.e. the microbenchmark static instruction count drops from 29 to 9 and the cycle count drops from 735 to 260. Is that right? Is there anything else in there of note?

Member

The MCA results are misleading here because of the warning at the bottom of the output ("note: program counter updates are ignored"), which means it doesn't understand the loops. If you consider that the first one is 4x unrolled, the cycle difference is not unreasonable.

Member

The code in that godbolt link isn't actually correct:

while let Some(&x) = a.iter().next() {

a.iter() is called on every loop iteration so it's only looking at the first element in the slice.
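
A self-contained sketch of the fix (the function name and the bound `k` are invented for illustration; the actual godbolt code may differ):

```rust
fn all_less_than(a: &[i32], k: i32) -> bool {
    // Buggy shape from the snippet above: `a.iter().next()` builds a fresh iterator
    // on every pass, so only the first element is ever examined (and if it passes
    // the test, the loop never terminates).
    //
    //     while let Some(&x) = a.iter().next() { ... }
    //
    // Corrected: create the iterator once, then advance it.
    let mut it = a.iter();
    while let Some(&x) = it.next() {
        if x >= k {
            return false;
        }
    }
    true
}
```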

Contributor

Indeed. I updated the link: https://godbolt.org/z/QlHZ0m

@rust-timer
Collaborator

Finished benchmarking try commit 5be48aa, comparison URL.

@hanna-kruppe
Contributor

Note that using internal iteration over for loops in all of these iterator "sinks" was done not out of cleverness but because internal iteration tends to optimize much better for some iterators (e.g., anything involving chain). See for example the benchmark cited in #45595 for sum. It's very bad that this turns out to be worse for the iterations that are common in rustc and I won't try to block this PR, but I think we should expect that this causes regressions for other kinds of code. Ideally we could have our cake and eat it too.

@nnethercote
Contributor Author

@rkruppe: I read through #45595 but I don't see an explanation of the sum benchmark, and intuitively I'm struggling to understand how it could optimize better. Can you explain some more?

I'd be interested in concrete examples of where this change causes slowdowns, because the perf results for rustc itself are ridiculously good. A 48.8% instruction count win? A 50.5% wall-time win? That's crazy. I had only tested with Check runs of keccak and inflate, because they are benchmarks where I knew the slowness of all was a factor, and they gave the slight speed-ups I expected. I will double-check this with local runs and try to work out why it's such a big improvement.

@nnethercote
Contributor Author

nnethercote commented Sep 18, 2019

I just analyzed clap-rs-debug with Cachegrind. Here is the first part of the diff:

--------------------------------------------------------------------------------
Ir                       
--------------------------------------------------------------------------------
-31,396,128,952 (100.0%)  PROGRAM TOTALS
  
--------------------------------------------------------------------------------
Ir                      file:function
--------------------------------------------------------------------------------
 -908,744,061 ( 2.89%)  /home/njn/moz/rustN/src/llvm-project/llvm/include/llvm/ADT/DenseMap.h:(anonymous namespace)::DAGCombiner::AddToWorklist(llvm::SDNode*)
  677,448,255 (-2.16%)  ???:llvm::SelectionDAG::Combine(llvm::CombineLevel, llvm::AAResults*, llvm::CodeGenOpt::Level)
 -673,479,754 ( 2.15%)  /home/njn/moz/rustN/src/llvm-project/llvm/include/llvm/ADT/SmallPtrSet.h:llvm::SelectionDAG::Combine(llvm::CombineLevel, llvm::AAResults*, llvm::CodeGenOpt::Level)
 -584,694,311 ( 1.86%)  /build/glibc-KRRWSm/glibc-2.29/malloc/malloc.c:_int_malloc
 -561,763,542 ( 1.79%)  /home/njn/moz/rustN/src/llvm-project/llvm/lib/IR/Attributes.cpp:llvm::Attribute::hasAttribute(llvm::StringRef) const
 -558,013,599 ( 1.78%)  /home/njn/moz/rustN/src/llvm-project/llvm/lib/Support/FoldingSet.cpp:llvm::FoldingSetNodeID::AddInteger(unsigned int)
 -556,005,022 ( 1.77%)  /build/glibc-KRRWSm/glibc-2.29/malloc/malloc.c:_int_free
  539,305,095 (-1.72%)  ???:(anonymous namespace)::DAGCombiner::AddToWorklist(llvm::SDNode*)
 -513,211,306 ( 1.63%)  /home/njn/moz/rustN/src/llvm-project/llvm/lib/Support/SmallPtrSet.cpp:llvm::SmallPtrSetImplBase::FindBucketFor(void const*) const
 -506,034,675 ( 1.61%)  /home/njn/moz/rustN/src/llvm-project/llvm/lib/MC/MCExpr.cpp:llvm::MCExpr::evaluateAsRelocatableImpl(llvm::MCValue&, llvm::MCAssembler const*, llvm::MCAsmLayout const*, llvm::MCFixup const*, llvm::DenseMap<llvm::MCSection const*, unsigned long, llvm::DenseMapInfo<llvm::MCSection const*>, llvm::detail::DenseMapPair<llvm::MCSection const*, unsigned long> > const*, bool) const 
  485,780,159 (-1.55%)  ???:llvm::Value::getValueName() const
 -481,651,644 ( 1.53%)  /home/njn/moz/rustN/src/llvm-project/llvm/lib/CodeGen/MachineInstr.cpp:llvm::MachineInstr::addOperand(llvm::MachineFunction&, llvm::MachineOperand const&)
  476,394,033 (-1.52%)  ???:llvm::FoldingSetNodeID::AddInteger(unsigned int)
 -469,995,575 ( 1.50%)  /home/njn/moz/rustN/src/llvm-project/llvm/lib/IR/Attributes.cpp:llvm::AttributeSetNode::hasAttribute(llvm::StringRef) const
 -461,988,232 ( 1.47%)  /home/njn/moz/rustN/src/llvm-project/llvm/include/llvm/ADT/Hashing.h:std::enable_if<llvm::hashing::detail::is_hashable_data<unsigned int const>::value, llvm::hash_code>::type llvm::hashing::detail::hash_combine_range_impl<unsigned int const>(unsigned int const*, unsigned int const*)
 -457,036,946 ( 1.46%)  /home/njn/moz/rustN/src/llvm-project/llvm/lib/CodeGen/SelectionDAG/SelectionDAGISel.cpp:llvm::SelectionDAGISel::SelectCodeCommon(llvm::SDNode*, unsigned char const*, unsigned int)
 -421,492,023 ( 1.34%)  /home/njn/moz/rustN/src/llvm-project/llvm/lib/Support/SmallPtrSet.cpp:llvm::SmallPtrSetImplBase::insert_imp_big(void const*)
 -419,943,393 ( 1.34%)  /home/njn/moz/rustN/src/llvm-project/llvm/include/llvm/Bitstream/BitstreamWriter.h:llvm::BitstreamWriter::Emit(unsigned int, unsigned int)
 -385,636,839 ( 1.23%)  /build/glibc-KRRWSm/glibc-2.29/malloc/malloc.c:malloc
 -379,430,774 ( 1.21%)  /home/njn/moz/rustN/src/llvm-project/llvm/include/llvm/Support/DJB.h:llvm::StringMapImpl::LookupBucketFor(llvm::StringRef)
 -371,939,886 ( 1.18%)  /home/njn/moz/rustN/src/llvm-project/llvm/include/llvm/ADT/DenseMap.h:llvm::SelectionDAG::Combine(llvm::CombineLevel, llvm::AAResults*, llvm::CodeGenOpt::Level)
 -368,831,902 ( 1.17%)  /home/njn/moz/rustN/src/llvm-project/llvm/lib/Support/FoldingSet.cpp:llvm::FoldingSetBase::FindNodeOrInsertPos(llvm::FoldingSetNodeID const&, void*&)
 -351,172,291 ( 1.12%)  /home/njn/moz/rustN/src/llvm-project/llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp:llvm::SelectionDAG::Combine(llvm::CombineLevel, llvm::AAResults*, llvm::CodeGenOpt::Level)
 -348,577,656 ( 1.11%)  /home/njn/moz/rustN/src/llvm-project/llvm/include/llvm/ADT/DenseMap.h:llvm::Value::getValueName() const
 -330,542,916 ( 1.05%)  /home/njn/moz/rustN/src/llvm-project/llvm/lib/Support/SHA1.cpp:llvm::SHA1::hashBlock()
 -327,752,246 ( 1.04%)  /home/njn/moz/rustN/src/llvm-project/llvm/lib/IR/Metadata.cpp:llvm::Instruction::getMetadataImpl(unsigned int) const
 -327,405,404 ( 1.04%)  /home/njn/moz/rustN/src/llvm-project/llvm/lib/IR/Attributes.cpp:llvm::AttributeList::getAttributes(unsigned int) const
 -327,346,176 ( 1.04%)  /home/njn/moz/rustN/src/llvm-project/llvm/lib/Support/SHA1.cpp:blk(unsigned int*, int)
 -324,740,192 ( 1.03%)  /home/njn/moz/rustN/src/llvm-project/llvm/include/llvm/ADT/SmallPtrSet.h:llvm::ScheduleDAGSDNodes::BuildSchedUnits()
 -317,306,265 ( 1.01%)  /home/njn/moz/rustN/src/llvm-project/llvm/lib/CodeGen/RegAllocFast.cpp:(anonymous namespace)::RegAllocFast::allocateInstruction(llvm::MachineInstr&)
 -307,280,976 ( 0.98%)  /home/njn/moz/rustN/src/llvm-project/llvm/include/llvm/Bitstream/BitstreamWriter.h:llvm::BitstreamWriter::EmitVBR(unsigned int, unsigned int)
  299,376,976 (-0.95%)  ???:llvm::Attribute::hasAttribute(llvm::StringRef) const
 -297,071,541 ( 0.95%)  /home/njn/moz/rustN/src/llvm-project/llvm/lib/Support/FoldingSet.cpp:llvm::FoldingSetNodeID::AddInteger(unsigned long long)
  294,959,073 (-0.94%)  ???:llvm::AttributeSetNode::get(llvm::LLVMContext&, llvm::AttrBuilder const&)

In short, LLVM is doing way less stuff. My best guess is that clap-rs must use some or all of these iterator methods a lot, and these methods get inlined, and so there ends up being less code to compile. I.e., it's not that rustc is that much faster, but the rust libs are much less bloated.

I did a quick grep for these methods occurring in clap-rs, here are the counts (which could include false positive matches):

  • all: 2
  • any: 59
  • find: 33
  • find_map: 0

That's a lot of calls, so this seems plausible.

@Mark-Simulacrum
Member

Hm, so I wonder if #62429 is somewhat related here. I'd be interested in seeing if we could get similar results with the old code but replacing e.g. |_| true with a named true_bool, where fn true_bool() -> bool { true }, or so. But I do think it makes sense to just land this if there are no objections, given the wins here :)

@hanna-kruppe
Contributor

@nnethercote Oh, duh, that makes sense. "LLVM spends a lot less time optimizing iterator code if the iterator code is simpler" is a lot more plausible (and less scary) than rustc code magically getting twice as fast by fiddling with some iterator methods. But there are still instances where all is not optimized as well with the complex implementation as with what's in here, right?

@rkruppe: I read through #45595 but I don't see an explanation of the sum benchmark, and intuitively I'm struggling to understand how it could optimize better. Can you explain some more?

I have never investigated this in enough detail to give a very detailed or confident explanation of why this is, but I can give you my best guess as a compiler engineer.

A loop like for x in a.chain(b) lowers to a single loop which repeatedly checks a flag to see whether the next iteration should pull from a or b. This obscures what's being iterated over (knowing that is an essential first step for many loop optimizations), and hinders optimizations that could have applied to a loop over one of a or b but can't handle them both at the same time. For the "sum of chain of ranges" in the benchmark, the impact is especially drastic because "sum up the numbers from m to n" can be converted into a closed-form expression when recognized. But it also affects passes which are more important to real-world performance, such as the loop vectorizer.

The fix to all those woes is to split the loop into two loops, one going just over a and the other going just over b. But LLVM does not have a loop-splitting pass, so it's often out of luck. In contrast, internal iteration methods like fold can be (and are) overridden by Chain to iterate over each of the chain's components separately, which gives LLVM two nice separate loops that it can handle more easily.
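
A minimal sketch of that difference, written as a free function rather than the real `Chain::fold` in std; it only illustrates the "two separate loops" shape that internal iteration makes possible:

```rust
// External iteration over a chain is one loop that re-checks "which half am I in?"
// on every element. Internal iteration lets the adaptor run one plain loop per half.
fn chain_fold<T, Acc, A, B>(a: A, b: B, init: Acc, mut f: impl FnMut(Acc, T) -> Acc) -> Acc
where
    A: Iterator<Item = T>,
    B: Iterator<Item = T>,
{
    let mut acc = init;
    for x in a {
        acc = f(acc, x); // first loop: only `a`
    }
    for x in b {
        acc = f(acc, x); // second loop: only `b`
    }
    acc
}
```

In the real library this shows up as `Chain`'s `fold`/`try_fold` overrides; the sketch just makes the two-loop structure visible to show why LLVM gets an easier job.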

@nnethercote
Contributor Author

Here is a test program that demonstrates @rkruppe's point:

// bench.rs
// - Compile with `rustc --test -O bench.rs`
// - Run with `./bench --bench`
#![feature(test)]

extern crate test;

use test::Bencher;

#[bench]
fn long_all(b: &mut Bencher) {
    b.iter(|| {
        let n = test::black_box(100000);
        let k = test::black_box(1000000);
        (0..2*n).all(|x| x < k)
    })
}

#[bench]
fn long_all_chained(b: &mut Bencher) {
    b.iter(|| {
        let n = test::black_box(100000);
        let k = test::black_box(1000000);
        (0..n).chain(0..n).all(|x| x < k)
    })
}

Before:

running 2 tests
test long_all         ... bench:      47,849 ns/iter (+/- 2,881)
test long_all_chained ... bench:      49,985 ns/iter (+/- 4,589)

After:

running 2 tests
test long_all         ... bench:      47,247 ns/iter (+/- 3,094)
test long_all_chained ... bench:     142,170 ns/iter (+/- 5,201)

So we have a demonstrated slowdown for a chained use in a microbenchmark. It's also interesting that this microbenchmark doesn't show a speed-up for long_all, which is counter to the godbolt microbenchmark, and counter to the real-world experience in rustc itself.

@nnethercote
Contributor Author

To summarize the effects of this PR.

  • Programs that use these iterator methods in simple ways might run faster. We have seen this on one microbenchmark and on real-world code in rustc itself, and not seen it in another microbenchmark. For rustc, if you look at just the check perf results, which do no code generation and so do not benefit from having less code to generate, we get improvements of up to 4%. I know this is mostly due to the very hot all call in this code (a stand-alone sketch of the explicit-iteration rewrite from #64545 follows this list):
    // if we were stalled on some unresolved variables, first check
    // whether any of them have been resolved; if not, don't bother
    // doing more work yet
    if !pending_obligation.stalled_on.is_empty() {
        if pending_obligation.stalled_on.iter().all(|&ty| {
            // Use the force-inlined variant of shallow_resolve() because this code is hot.
            let resolved = ShallowResolver::new(self.selcx.infcx()).inlined_shallow_resolve(ty);
            resolved == ty // nothing changed here
        }) {
  • Programs that use these iterator methods with chain might run slower. We have seen this on one microbenchmark, though not yet in any real world examples.
  • Programs that use these methods will compile faster, sometimes drastically so. I've been working on making rustc faster since 2016, and I've never gotten close to 30-50% improvements for a single change. I'm normally happy with a 1% improvement on any benchmarks, and ecstatic with 5%. Furthermore, the improvements are for debug and opt builds, which are generally harder to speed up than check builds.
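
The stand-alone sketch mentioned in the first bullet: the explicit-iteration shape that #64545 used in place of the hot `all()` call, with plain stand-in types so it compiles on its own (the real compiler code walks `pending_obligation.stalled_on` and calls the force-inlined shallow resolver shown in the snippet):

```rust
// Equivalent to `stalled_on.iter().all(|&ty| resolve(ty) == ty)`, written as an
// explicit loop. `u32` and the `resolve` closure are stand-ins for the real types.
fn nothing_changed(stalled_on: &[u32], resolve: impl Fn(u32) -> u32) -> bool {
    for &ty in stalled_on {
        if resolve(ty) != ty {
            return false; // something was resolved; stop early
        }
    }
    true
}
```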

IMO the possible pessimization of the relatively rare chained case is well worth the optimization of the simpler cases and the reductions in compile time.

Furthermore, I'm wondering if any other common library functions are overly complicated and could also be simplified in order to gain more compile-time wins. I'm trying to think of how to find any such functions.

@scottmcm
Member

scottmcm commented Sep 19, 2019

Looking at the change in #64545 and at @lzutao's helpful godbolt link reminded me of something else that gives me an alternative hypothesis: stalled_on is a Vec, so it's using the slice iterator, which has manual unrolling for try_fold:

rust/src/libcore/slice/mod.rs

Lines 3184 to 3201 in eceec57

    fn try_fold<B, F, R>(&mut self, init: B, mut f: F) -> R where
        Self: Sized, F: FnMut(B, Self::Item) -> R, R: Try<Ok=B>
    {
        // manual unrolling is needed when there are conditional exits from the loop
        let mut accum = init;
        unsafe {
            while len!(self) >= 4 {
                accum = f(accum, next_unchecked!(self))?;
                accum = f(accum, next_unchecked!(self))?;
                accum = f(accum, next_unchecked!(self))?;
                accum = f(accum, next_unchecked!(self))?;
            }
            while !is_empty!(self) {
                accum = f(accum, next_unchecked!(self))?;
            }
        }
        Try::from_ok(accum)
    }

That's there because it was ported from search_while which gave @bluss a 37% perf win for all back in #37972 ("Improve all, any, find, position, rposition by explicitly unrolling the loop for the slice iterators").

I'll make a quick PR to remove the unrolling to give us some information on whether the problem is actually the bool-to-LoopState dance (which I feel shouldn't be that bad because LoopState<(), ()> is an i1 even in debug) or the problem is in calling the inner closure 4 times (which might be exacerbated by "Us[ing] the force-inlined variant of shallow_resolve() because this code is hot").

bors added a commit that referenced this pull request Sep 19, 2019
[DO NOT MERGE] Experiment with removing unrolling from slice::Iter::try_fold

For context see #64572 (comment)

r? @scottmcm
@bluss
Member

bluss commented Sep 19, 2019

Everything makes sense to revisit, but since internal iteration is a big algorithmic shift - it's not just chain; it allows efficient iteration of all kinds of segmented data structures, including VecDeque and multi-dimensional arrays (when we get there; implementing try_fold is still unstable) - backing out the unrolling would be better.

@scottmcm
Member

scottmcm commented Sep 19, 2019

(I honestly don't know how the old implementation of ResultShunt::next() didn't cause an infinite loop, given that it also called self.next(), albeit via try_for_each() and try_fold().)

It also overrides try_fold, so the find called the overridden try_fold, which called the interior iterator's try_fold, and never actually called back to next on the shunt iterator itself.

It's correct (though sometimes suboptimal) to make an iterator by overloading try_fold as the main implementation, and having next be defined in terms of that as self.try_for_each(Err).err().
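
A stand-alone check of that identity on a plain Vec iterator (not the ResultShunt code itself), showing that `try_for_each(Err).err()` pulls exactly one element:

```rust
fn main() {
    let mut it = vec![1, 2, 3].into_iter();
    // `Err` makes `try_for_each` short-circuit on the very first item and hand it
    // back, so `.err()` recovers exactly one element, i.e. it behaves like `next()`.
    let first: Option<i32> = it.try_for_each(Err).err();
    assert_eq!(first, Some(1));
    assert_eq!(it.next(), Some(2)); // the rest of the iterator is untouched
}
```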

@nnethercote
Contributor Author

So the closure passed to these iterator methods could be duplicated four times? Huh. That's aggressive. Looking at the results in #64600, that accounts for about half of the speedups. Even half of this win is still a big win.

I also wonder if it's worth using the simple implementations for debug builds, where losing some runtime speed in favour of faster compilation is a more acceptable trade-off.

@Centril
Contributor

Centril commented Sep 19, 2019

One option: we could ostensibly #[cfg(...)] some code to detect whether rustc is being built and change the code slightly accordingly.

@Mark-Simulacrum
Member

Er, no? libstd/core etc are shipped for end-users and for rustc itself as the same binary, so we can't do things differently really in that way. But I also don't think that would be entirely useful.

@scottmcm
Member

scottmcm commented Sep 19, 2019

Out of curiosity, has anyone looked at what clap's doing that makes it an order-of-magnitude outlier here? 2x faster for the patched incremental just because of this change seems outlandishly good.

(Oh, I spotted the post above about the call counts for these methods. Is it lots of calls with trivial bodies? Are the bodies really complicated? Is it typechecking or codegen or ...? For example, if a bunch of the cost is coming from the generic instead of bools, we can use a different type in Iterator...)

@bluss
Member

bluss commented Sep 19, 2019

This blog post writes in detail about what internal iteration gives to Rust. The headline can be taken with a grain of salt: "Rust’s iterators are inefficient, and here’s what we can do about it."

Just to give more background to why Iterator's .try_fold() and .all() have their current shape.

@rpjohnst
Contributor

As a longer-term solution, I wonder if closures themselves could be made cheaper to compile, so rustc could avoid building so much LLVM IR even when handed the internal iterator versions of these functions.

Improvements in that direction could potentially also hugely benefit generator-based iterator implementations, if such a thing becomes common.

For example, the Thorin research IR has some higher-order optimizations that can eliminate closures before they are lowered: http://compilers.cs.uni-saarland.de/papers/lkh15_cgo.pdf. Perhaps something like this could be adapted to run on MIR?

@nnethercote
Contributor Author

Out of curiosity, has anyone looked at what clap's doing that makes it an order-of-magnitude outlier here? 2x faster for the patched incremental just because of this change seems outlandishly good.

I haven't looked closely, but the 4x unrolling means that any large closure (especially if it's marked with #[inline(always)]) can cause a lot of code bloat. This is true for the hot all within rustc itself that I mentioned above.

bors added a commit that referenced this pull request Sep 24, 2019
Remove manual unrolling from slice::Iter(Mut)::try_fold

While this definitely helps sometimes (particularly for trivial closures), it's also a pessimization sometimes, so it's better to leave this to (hypothetical) future LLVM improvements instead of forcing this on everyone.

I think it's better for the advice to be that sometimes you need to unroll manually than that you sometimes need to not-unroll manually (like #64545).

---

For context see #64572 (comment)
@JohnCSimon
Member

Ping from triage
@scottmcm @nnethercote This conversation seems to have stalled and this PR has sat idle for a week.
Thanks!

@@ -319,7 +319,7 @@ pub trait Iterator {
     #[inline]
     #[stable(feature = "rust1", since = "1.0.0")]
     fn nth(&mut self, mut n: usize) -> Option<Self::Item> {
-        for x in self {
+        while let Some(x) = self.next() {
Member

@bluss bluss Sep 28, 2019

This is a good simplification

@@ -1855,18 +1855,15 @@ pub trait Iterator {
     /// ```
     #[inline]
     #[stable(feature = "rust1", since = "1.0.0")]
-    fn all<F>(&mut self, f: F) -> bool where
+    fn all<F>(&mut self, mut f: F) -> bool where
Member

@bluss bluss Sep 28, 2019

I'd like to keep these Iterator::all etc implementations; there are examples where removing them would be a big regression. We have already put in the work to make sure we thread internal iteration methods through various adaptors correctly, so removing the use of try_fold here at the top level would be a step back.

This change only partly rolls back what we have worked on for internal iteration — other entry points to it still remain, like sum, max, min, fold and try_fold. And those remaining happen to be the most useful in stable Rust (because implementing try_fold is not possible in stable, so custom iterators can't fully take advantage of this particular all magic). It would also be inconsistent to partly remove it; I think we should keep it all.

Here's an example with a microbenchmark of the kind of improvement it can bring: rust-itertools/itertools#348. In that example it was fold, but it will also apply to all once try_fold is stable.

I think we should go through fold/try_fold everywhere we can, for the structural improvements that this gives (it bypasses the complicated state machines that composed iterators might have in .next()).

Member

Of course the code inside Iterator::all is not beautiful to read as it is now, and if it could be simplified while keeping the same features, that would be fantastic.

@andjo403
Contributor

This PR changes documented stable behavior; see https://doc.rust-lang.org/std/iter/trait.Iterator.html#note-to-implementors

bors added a commit that referenced this pull request Sep 30, 2019
Remove manual unrolling from slice::Iter(Mut)::try_fold

@nnethercote
Contributor Author

It's clear that this can't land as is. But follow-up PRs such as #64600 (already landed) and #64885 will get a lot of the gains.

@scottmcm
Member

scottmcm commented Oct 1, 2019

this PR changes documented stable behavior

@andjo403 Note that the try_fold method cannot yet be implemented on stable (because Try is unstable), so there may still be options available.

tmandry added a commit to tmandry/rust that referenced this pull request Oct 2, 2019
use try_fold instead of try_for_each to reduce compile time

As it was stated in rust-lang#64572 that the biggest gain was due to less code being generated, I tried to reduce the number of functions to inline by using try_fold directly instead of calling try_for_each, which calls try_fold.

As there are some gains from using the try_fold function directly, this is maybe a way forward.
When I tried to compile the clap-rs benchmark, the timing gains I got were only some % away from those in rust-lang#64572.

There are more functions, e.g. fold, that call try_fold and could also be changed, but the question is how much "duplication" is tolerated in std to give faster compile times.

can someone start a perf run?

cc @nnethercote @scottmcm @bluss
r? @ghost
@nnethercote
Contributor Author

In short, LLVM is doing way less stuff. My best guess is that clap-rs must use some or all of these iterator methods a lot, and these methods get inlined, and so there ends up being less code to compile. I.e., it's not that rustc is that much faster, but the rust libs are much less bloated.

I did a quick grep for these methods occurring in clap-rs, here are the counts (which could include false positive matches):

  • all: 2
  • any: 59
  • find: 33
  • find_map: 0

Out of curiosity, I tried introducing the simplified versions of these four functions one at a time on clap-rs. Here are the instruction counts, as reported by Cachegrind.

                     check           debug           opt
Original             16,285M         61,818M         183,768M
all                  16,287M         61,732M         183,660M
any                  16,291M         57,746M         178,922M
find                 16,284M         54,560M         174,997M
find_map             16,291M         54,441M         174,861M

There were big drops in the debug and opt builds when any and find were converted, and negligible changes when all and find_map were converted, which matches up with the number of calls in the code.

@nnethercote nnethercote deleted the simplify-Iterator-methods branch October 9, 2019 22:41
@Darksonn
Contributor

Which iterators are slowed down by the try_for_each based any implementation? Perhaps it would make sense to override the any implementation on those iterators instead of replacing the default any implementation?
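
A hedged sketch of that suggestion: the default any() stays simple, and a hypothetical two-part iterator (standing in for something chain-like) overrides any() itself to keep its internal-iteration advantage:

```rust
struct TwoSlices<'a, T> {
    first: std::slice::Iter<'a, T>,
    second: std::slice::Iter<'a, T>,
}

impl<'a, T> Iterator for TwoSlices<'a, T> {
    type Item = &'a T;

    fn next(&mut self) -> Option<&'a T> {
        // External iteration: every call re-checks which half is active.
        if let Some(x) = self.first.next() {
            return Some(x);
        }
        self.second.next()
    }

    // Override just this one sink: two plain loops, no per-element "which half?"
    // check, while the default `any()` for every other iterator stays simple.
    fn any<F>(&mut self, mut f: F) -> bool
    where
        F: FnMut(&'a T) -> bool,
    {
        self.first.any(&mut f) || self.second.any(&mut f)
    }
}
```

Usage would be e.g. `TwoSlices { first: a.iter(), second: b.iter() }.any(|x| ...)`; whether this is worth the extra code depends on which adaptors actually show up hot, which is exactly the question being asked.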
