Simplify some `Iterator` methods. #64572

Closed

Conversation

nnethercote
Contributor

PR rust-lang#64545 got a big speed-up by replacing a hot call to `all()` with
explicit iteration. This is because the implementation of `all()` is
excessively complex: it wraps the given predicate in a closure that
returns a `LoopState`, passes that closure to `try_for_each()`, which
wraps the first closure in a second closure, passes that second closure
to `try_fold()`, which does the actual iteration using the second
closure.

A sufficiently smart compiler could optimize all this away; rustc is
currently not sufficiently smart.

This commit does the following.

- Changes the implementations of `all()`, `any()`, `find()` and
  `find_map()` to use the simplest possible code, rather than using
  `try_for_each()`. (I am reminded of "The Evolution of a Haskell
  Programmer".) These are both shorter and faster than the current
  implementations, and will permit the undoing of the `all()` removal in
  rust-lang#64545.

- Changes `ResultShunt::next()` so it doesn't call `self.find()`,
  because that was causing infinite recursion with the new
  implementation of `find()`, which itself calls `self.next()`. (I
  honestly don't know how the old implementation of
  `ResultShunt::next()` didn't cause an infinite loop, given that it
  also called `self.next()`, albeit via `try_for_each()` and
  `try_fold()`.)

- Changes `nth()` to use `self.next()` in a while loop rather than `for
  x in self`, because using self-iteration within an iterator method
  seems dubious, and `self.next()` is used in all the other iterator
  methods.
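
For reference, here is a minimal sketch of the two shapes being compared. This is not the actual libcore code (the real old path goes through `try_for_each()`, and `LoopState` is stood in for by `Result` here); it only illustrates the difference described above.

```rust
// Old shape: `all()` routes through internal iteration (`try_for_each()`/`try_fold()`),
// wrapping the predicate in a closure that returns a control value.
fn all_old<I: Iterator>(iter: &mut I, mut f: impl FnMut(I::Item) -> bool) -> bool {
    iter.try_fold((), |(), x| {
        // stand-in for LoopState::Continue(()) / LoopState::Break(())
        if f(x) { Ok(()) } else { Err(()) }
    })
    .is_ok()
}

// New shape: the simplest possible explicit loop over `next()`.
fn all_new<I: Iterator>(iter: &mut I, mut f: impl FnMut(I::Item) -> bool) -> bool {
    while let Some(x) = iter.next() {
        if !f(x) {
            return false;
        }
    }
    true
}
```

The diff hunks quoted later in the review comments show the corresponding change to the real `all()`.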
@rust-highfive
Collaborator

r? @cramertj

(rust_highfive has picked a reviewer for you, use r? to override)

@rust-highfive rust-highfive added the S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. label Sep 18, 2019
@nnethercote
Contributor Author

@simulacrum: time to test the new all-in-one perf CI command:

@bors try @rust-timer queue

@rust-timer
Collaborator

Awaiting bors try build completion

@bors
Contributor

bors commented Sep 18, 2019

⌛ Trying commit 7e98fb9 with merge 5be48aa...

bors added a commit that referenced this pull request Sep 18, 2019
Simplify some `Iterator` methods.

@mati865
Contributor

mati865 commented Sep 18, 2019

@nnethercote it's @Mark-Simulacrum, not @simulacrum.

@Centril
Contributor

Centril commented Sep 18, 2019

r? @scottmcm

@rust-highfive rust-highfive assigned scottmcm and unassigned cramertj Sep 18, 2019
@bors
Contributor

bors commented Sep 18, 2019

☀️ Try build successful - checks-azure
Build commit: 5be48aa

@rust-timer
Collaborator

Queued 5be48aa with parent 5283791, future comparison URL.

-            else { LoopState::Break(()) }
+        while let Some(x) = self.next() {
+            if !f(x) {
+                return false;
Contributor

@tesuji tesuji Sep 18, 2019

TIL that Iterator::all is that complex.
Here is the result from godbolt: https://godbolt.org/z/QlHZ0m

Contributor Author

Thanks! I'm inexperienced with godbolt, but the results seem to strongly support my simplification, i.e. the microbenchmark static instruction count drops from 29 to 9 and the cycle count drops from 735 to 260. Is that right? Is there anything else in there of note?

Member

The MCA results are misleading here because of the warning at the bottom of the output ("note: program counter updates are ignored"), which means it doesn't understand the loops. If you consider that the first one is 4x unrolled, the cycle difference is not unreasonable.

Member

The code in that godbolt link isn't actually correct:

while let Some(&x) = a.iter().next() {

a.iter() is called on every loop iteration so it's only looking at the first element in the slice.
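
A self-contained sketch of the fix (the function name and the bound `k` are invented for illustration; the actual godbolt code may differ):

```rust
fn all_less_than(a: &[i32], k: i32) -> bool {
    // Buggy shape from the snippet above: `a.iter().next()` builds a fresh iterator
    // on every pass, so only the first element is ever examined (and if it passes
    // the test, the loop never terminates).
    //
    //     while let Some(&x) = a.iter().next() { ... }
    //
    // Corrected: create the iterator once, then advance it.
    let mut it = a.iter();
    while let Some(&x) = it.next() {
        if x >= k {
            return false;
        }
    }
    true
}
```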

Contributor

Indeed. I updated the link: https://godbolt.org/z/QlHZ0m

@rust-timer
Collaborator

Finished benchmarking try commit 5be48aa, comparison URL.

@hanna-kruppe
Contributor

Note that using internal iteration over for loops in all of these iterator "sinks" was done not out of cleverness but because internal iteration tends to optimize much better for some iterators (e.g., anything involving chain). See for example the benchmark cited in #45595 for sum. It's very bad that this turns out to be worse for the iterations that are common in rustc and I won't try to block this PR, but I think we should expect that this causes regressions for other kinds of code. Ideally we could have our cake and eat it too.

@nnethercote
Contributor Author

@rkruppe: I read through #45595 but I don't see an explanation of the sum benchmark, and intuitively I'm struggling to understand how it could optimize better. Can you explain some more?

I'd be interested in concrete examples of where this change causes slowdowns, because the perf results for rustc itself are ridiculously good. A 48.8% instruction count win? A 50.5% wall-time win? That's crazy. I had only tested with Check runs of keccak and inflate, because they are benchmarks where I knew the slowness of all was a factor, and they gave the slight speed-ups I expected. I will double-check this with local runs and try to work out why it's such a big improvement.

@nnethercote
Contributor Author

nnethercote commented Sep 18, 2019

I just analyzed clap-rs-debug with Cachegrind. Here is the first part of the diff:

--------------------------------------------------------------------------------
Ir                       
--------------------------------------------------------------------------------
-31,396,128,952 (100.0%)  PROGRAM TOTALS
  
--------------------------------------------------------------------------------
Ir                      file:function
--------------------------------------------------------------------------------
 -908,744,061 ( 2.89%)  /home/njn/moz/rustN/src/llvm-project/llvm/include/llvm/ADT/DenseMap.h:(anonymous namespace)::DAGCombiner::AddToWorklist(llvm::SDNode*)
  677,448,255 (-2.16%)  ???:llvm::SelectionDAG::Combine(llvm::CombineLevel, llvm::AAResults*, llvm::CodeGenOpt::Level)
 -673,479,754 ( 2.15%)  /home/njn/moz/rustN/src/llvm-project/llvm/include/llvm/ADT/SmallPtrSet.h:llvm::SelectionDAG::Combine(llvm::CombineLevel, llvm::AAResults*, llvm::CodeGenOpt::Level)
 -584,694,311 ( 1.86%)  /build/glibc-KRRWSm/glibc-2.29/malloc/malloc.c:_int_malloc
 -561,763,542 ( 1.79%)  /home/njn/moz/rustN/src/llvm-project/llvm/lib/IR/Attributes.cpp:llvm::Attribute::hasAttribute(llvm::StringRef) const
 -558,013,599 ( 1.78%)  /home/njn/moz/rustN/src/llvm-project/llvm/lib/Support/FoldingSet.cpp:llvm::FoldingSetNodeID::AddInteger(unsigned int)
 -556,005,022 ( 1.77%)  /build/glibc-KRRWSm/glibc-2.29/malloc/malloc.c:_int_free
  539,305,095 (-1.72%)  ???:(anonymous namespace)::DAGCombiner::AddToWorklist(llvm::SDNode*)
 -513,211,306 ( 1.63%)  /home/njn/moz/rustN/src/llvm-project/llvm/lib/Support/SmallPtrSet.cpp:llvm::SmallPtrSetImplBase::FindBucketFor(void const*) const
 -506,034,675 ( 1.61%)  /home/njn/moz/rustN/src/llvm-project/llvm/lib/MC/MCExpr.cpp:llvm::MCExpr::evaluateAsRelocatableImpl(llvm::MCValue&, llvm::MCAssembler const*, llvm::MCAsmLayout const*, llvm::MCFixup const*, llvm::DenseMap<llvm::MCSection const*, unsigned long, llvm::DenseMapInfo<llvm::MCSection const*>, llvm::detail::DenseMapPair<llvm::MCSection const*, unsigned long> > const*, bool) const 
  485,780,159 (-1.55%)  ???:llvm::Value::getValueName() const
 -481,651,644 ( 1.53%)  /home/njn/moz/rustN/src/llvm-project/llvm/lib/CodeGen/MachineInstr.cpp:llvm::MachineInstr::addOperand(llvm::MachineFunction&, llvm::MachineOperand const&)
  476,394,033 (-1.52%)  ???:llvm::FoldingSetNodeID::AddInteger(unsigned int)
 -469,995,575 ( 1.50%)  /home/njn/moz/rustN/src/llvm-project/llvm/lib/IR/Attributes.cpp:llvm::AttributeSetNode::hasAttribute(llvm::StringRef) const
 -461,988,232 ( 1.47%)  /home/njn/moz/rustN/src/llvm-project/llvm/include/llvm/ADT/Hashing.h:std::enable_if<llvm::hashing::detail::is_hashable_data<unsigned int const>::value, llvm::hash_code>::type llvm::hashing::detail::hash_combine_range_impl<unsigned int const>(unsigned int const*, unsigned int const*)
 -457,036,946 ( 1.46%)  /home/njn/moz/rustN/src/llvm-project/llvm/lib/CodeGen/SelectionDAG/SelectionDAGISel.cpp:llvm::SelectionDAGISel::SelectCodeCommon(llvm::SDNode*, unsigned char const*, unsigned int)
 -421,492,023 ( 1.34%)  /home/njn/moz/rustN/src/llvm-project/llvm/lib/Support/SmallPtrSet.cpp:llvm::SmallPtrSetImplBase::insert_imp_big(void const*)
 -419,943,393 ( 1.34%)  /home/njn/moz/rustN/src/llvm-project/llvm/include/llvm/Bitstream/BitstreamWriter.h:llvm::BitstreamWriter::Emit(unsigned int, unsigned int)
 -385,636,839 ( 1.23%)  /build/glibc-KRRWSm/glibc-2.29/malloc/malloc.c:malloc
 -379,430,774 ( 1.21%)  /home/njn/moz/rustN/src/llvm-project/llvm/include/llvm/Support/DJB.h:llvm::StringMapImpl::LookupBucketFor(llvm::StringRef)
 -371,939,886 ( 1.18%)  /home/njn/moz/rustN/src/llvm-project/llvm/include/llvm/ADT/DenseMap.h:llvm::SelectionDAG::Combine(llvm::CombineLevel, llvm::AAResults*, llvm::CodeGenOpt::Level)
 -368,831,902 ( 1.17%)  /home/njn/moz/rustN/src/llvm-project/llvm/lib/Support/FoldingSet.cpp:llvm::FoldingSetBase::FindNodeOrInsertPos(llvm::FoldingSetNodeID const&, void*&)
 -351,172,291 ( 1.12%)  /home/njn/moz/rustN/src/llvm-project/llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp:llvm::SelectionDAG::Combine(llvm::CombineLevel, llvm::AAResults*, llvm::CodeGenOpt::Level)
 -348,577,656 ( 1.11%)  /home/njn/moz/rustN/src/llvm-project/llvm/include/llvm/ADT/DenseMap.h:llvm::Value::getValueName() const
 -330,542,916 ( 1.05%)  /home/njn/moz/rustN/src/llvm-project/llvm/lib/Support/SHA1.cpp:llvm::SHA1::hashBlock()
 -327,752,246 ( 1.04%)  /home/njn/moz/rustN/src/llvm-project/llvm/lib/IR/Metadata.cpp:llvm::Instruction::getMetadataImpl(unsigned int) const
 -327,405,404 ( 1.04%)  /home/njn/moz/rustN/src/llvm-project/llvm/lib/IR/Attributes.cpp:llvm::AttributeList::getAttributes(unsigned int) const
 -327,346,176 ( 1.04%)  /home/njn/moz/rustN/src/llvm-project/llvm/lib/Support/SHA1.cpp:blk(unsigned int*, int)
 -324,740,192 ( 1.03%)  /home/njn/moz/rustN/src/llvm-project/llvm/include/llvm/ADT/SmallPtrSet.h:llvm::ScheduleDAGSDNodes::BuildSchedUnits()
 -317,306,265 ( 1.01%)  /home/njn/moz/rustN/src/llvm-project/llvm/lib/CodeGen/RegAllocFast.cpp:(anonymous namespace)::RegAllocFast::allocateInstruction(llvm::MachineInstr&)
 -307,280,976 ( 0.98%)  /home/njn/moz/rustN/src/llvm-project/llvm/include/llvm/Bitstream/BitstreamWriter.h:llvm::BitstreamWriter::EmitVBR(unsigned int, unsigned int)
  299,376,976 (-0.95%)  ???:llvm::Attribute::hasAttribute(llvm::StringRef) const
 -297,071,541 ( 0.95%)  /home/njn/moz/rustN/src/llvm-project/llvm/lib/Support/FoldingSet.cpp:llvm::FoldingSetNodeID::AddInteger(unsigned long long)
  294,959,073 (-0.94%)  ???:llvm::AttributeSetNode::get(llvm::LLVMContext&, llvm::AttrBuilder const&)

In short, LLVM is doing way less stuff. My best guess is that clap-rs must use some or all of these iterator methods a lot, and these methods get inlined, and so there ends up being less code to compile. I.e., it's not that rustc is that much faster, but the rust libs are much less bloated.

I did a quick grep for these methods occurring in clap-rs, here are the counts (which could include false positive matches):

  • all: 2
  • any: 59
  • find: 33
  • find_map: 0

That's a lot of calls, so this seems plausible.

@Mark-Simulacrum
Member

Hm, so I wonder if #62429 is somewhat related here. I'd be interested in seeing if we could get similar results with the old code but replacing e.g. |_| true with a named true_bool, where fn true_bool() -> bool { true }, or so. But I do think it makes sense to just land this if there are no objections, given the wins here :)

@hanna-kruppe
Contributor

@nnethercote Oh, duh, that makes sense. "LLVM spends a lot less time optimizing iterator code if the iterator code is simpler" is a lot more plausible (and less scary) than rustc code magically getting twice as fast by fiddling with some iterator methods. But there are still instances where all is not optimized as well with the complex implementation as with what's in here, right?

@rkruppe: I read through #45595 but I don't see an explanation of the sum benchmark, and intuitively I'm struggling to understand how it could optimize better. Can you explain some more?

I have never investigated this in enough detail to give a very detailed or confident explanation of why this is, but I can give you my best guess as a compiler engineer.

A loop like for x in a.chain(b) lowers to a single loop which repeatedly checks a flag to see whether the next iteration should pull from a or b. This obscures what's being iterated over (knowing that is an essential first step for many loop optimizations), and hinders optimizations that could have applied to a loop over one of a or b but can't handle them both at the same time. For the "sum of chain of ranges" in the benchmark, the impact is especially drastic because "sum up the numbers from m to n" can be converted into a closed-form expression when recognized. But it also affects passes which are more important to real-world performance, such as the loop vectorizer.

The fix to all those woes is to split the loop into two loops, one going just over a and the other going just over b. But LLVM does not have a loop-splitting pass, so it's often out of luck. In contrast, internal iteration methods like fold can be (and are) overridden by Chain to iterate over each of the chain's components separately, which gives LLVM two nice separate loops that it can handle more easily.
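
A minimal sketch of that difference, written as a free function rather than the real `Chain::fold` in std; it only illustrates the "two separate loops" shape that internal iteration makes possible:

```rust
// External iteration over a chain is one loop that re-checks "which half am I in?"
// on every element. Internal iteration lets the adaptor run one plain loop per half.
fn chain_fold<T, Acc, A, B>(a: A, b: B, init: Acc, mut f: impl FnMut(Acc, T) -> Acc) -> Acc
where
    A: Iterator<Item = T>,
    B: Iterator<Item = T>,
{
    let mut acc = init;
    for x in a {
        acc = f(acc, x); // first loop: only `a`
    }
    for x in b {
        acc = f(acc, x); // second loop: only `b`
    }
    acc
}
```

In the real library this shows up as `Chain`'s `fold`/`try_fold` overrides; the sketch just makes the two-loop structure visible to show why LLVM gets an easier job.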

@nnethercote
Contributor Author

Here is a test program that demonstrates @rkruppe's point:

// bench.rs
// - Compile with `rustc --test -O bench.rs`
// - Run with `./bench --bench`
#![feature(test)]

extern crate test;

use test::Bencher;

#[bench]
fn long_all(b: &mut Bencher) {
    b.iter(|| {
        let n = test::black_box(100000);
        let k = test::black_box(1000000);
        (0..2*n).all(|x| x < k)
    })
}

#[bench]
fn long_all_chained(b: &mut Bencher) {
    b.iter(|| {
        let n = test::black_box(100000);
        let k = test::black_box(1000000);
        (0..n).chain(0..n).all(|x| x < k)
    })
}

Before:

running 2 tests
test long_all         ... bench:      47,849 ns/iter (+/- 2,881)
test long_all_chained ... bench:      49,985 ns/iter (+/- 4,589)

After:

running 2 tests
test long_all         ... bench:      47,247 ns/iter (+/- 3,094)
test long_all_chained ... bench:     142,170 ns/iter (+/- 5,201)

So we have a demonstrated slowdown for a chained use in a microbenchmark. It's also interesting that this microbenchmark doesn't show a speed-up for long_all, which is counter to the godbolt microbenchmark, and counter to the real-world experience in rustc itself.

@nnethercote
Contributor Author

To summarize the effects of this PR.

  • Programs that use these iterator methods in simple ways might run faster. We have seen this on one microbenchmark and on real-world code in rustc itself, and not seen it in another microbenchmark. For rustc, if you look at just the check perf results, which do no code generation and so do not benefit from having less code to generate, we get improvements of up to 4%. I know this is mostly due to the very hot all call in this code (a stand-alone sketch of the explicit-iteration rewrite from #64545 follows this list):
    // if we were stalled on some unresolved variables, first check
    // whether any of them have been resolved; if not, don't bother
    // doing more work yet
    if !pending_obligation.stalled_on.is_empty() {
        if pending_obligation.stalled_on.iter().all(|&ty| {
            // Use the force-inlined variant of shallow_resolve() because this code is hot.
            let resolved = ShallowResolver::new(self.selcx.infcx()).inlined_shallow_resolve(ty);
            resolved == ty // nothing changed here
        }) {
  • Programs that use these iterator methods with chain might run slower. We have seen this on one microbenchmark, though not yet in any real world examples.
  • Programs that use these methods will compile faster, sometimes drastically so. I've been working on making rustc faster since 2016, and I've never gotten close to 30-50% improvements for a single change. I'm normally happy with a 1% improvement on any benchmarks, and ecstatic with 5%. Furthermore, the improvements are for debug and opt builds, which are generally harder to speed up than check builds.
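
The stand-alone sketch mentioned in the first bullet: the explicit-iteration shape that #64545 used in place of the hot `all()` call, with plain stand-in types so it compiles on its own (the real compiler code walks `pending_obligation.stalled_on` and calls the force-inlined shallow resolver shown in the snippet):

```rust
// Equivalent to `stalled_on.iter().all(|&ty| resolve(ty) == ty)`, written as an
// explicit loop. `u32` and the `resolve` closure are stand-ins for the real types.
fn nothing_changed(stalled_on: &[u32], resolve: impl Fn(u32) -> u32) -> bool {
    for &ty in stalled_on {
        if resolve(ty) != ty {
            return false; // something was resolved; stop early
        }
    }
    true
}
```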

IMO the possible pessimization of the relatively rare chained case is well worth the optimization of the simpler cases and the reductions in compile time.

Furthermore, I'm wondering if any other common library functions are overly complicated and could also be simplified in order to gain more compile-time wins. I'm trying to think of how to find any such functions.

@scottmcm
Member

scottmcm commented Sep 19, 2019

Looking at the change in #64545 and at @lzutao's helpful godbolt link reminded me of something else that gives me an alternative hypothesis: stalled_on is a Vec, so it's using the slice iterator, which has manual unrolling for try_fold:

rust/src/libcore/slice/mod.rs

Lines 3184 to 3201 in eceec57

    fn try_fold<B, F, R>(&mut self, init: B, mut f: F) -> R where
        Self: Sized, F: FnMut(B, Self::Item) -> R, R: Try<Ok=B>
    {
        // manual unrolling is needed when there are conditional exits from the loop
        let mut accum = init;
        unsafe {
            while len!(self) >= 4 {
                accum = f(accum, next_unchecked!(self))?;
                accum = f(accum, next_unchecked!(self))?;
                accum = f(accum, next_unchecked!(self))?;
                accum = f(accum, next_unchecked!(self))?;
            }
            while !is_empty!(self) {
                accum = f(accum, next_unchecked!(self))?;
            }
        }
        Try::from_ok(accum)
    }

That's there because it was ported from search_while which gave @bluss a 37% perf win for all back in #37972 ("Improve all, any, find, position, rposition by explicitly unrolling the loop for the slice iterators").

I'll make a quick PR to remove the unrolling to give us some information on whether the problem is actually the bool-to-LoopState dance (which I feel shouldn't be that bad because LoopState<(), ()> is an i1 even in debug) or the problem is in calling the inner closure 4 times (which might be exacerbated by "Us[ing] the force-inlined variant of shallow_resolve() because this code is hot").

bors added a commit that referenced this pull request Sep 19, 2019
[DO NOT MERGE] Experiment with removing unrolling from slice::Iter::try_fold

For context see #64572 (comment)

r? @scottmcm
@bluss
Member

bluss commented Sep 19, 2019

Everything makes sense to revisit, but since internal iteration is a big algorithmic shift - it's not just chain; it allows efficient iteration of all kinds of segmented data structures, including VecDeque and multi-dimensional arrays (when we get there; implementing try_fold is still unstable) - backing out the unrolling would be better.

@scottmcm
Member

scottmcm commented Sep 19, 2019

(I honestly don't know how the old implementation of ResultShunt::next() didn't cause an infinite loop, given that it also called self.next(), albeit via try_for_each() and try_fold().)

It also overrides try_fold, so the find called the overridden try_fold, which called the interior iterator's try_fold, and never actually called back to next on the shunt iterator itself.

It's correct (though sometimes suboptimal) to make an iterator by overloading try_fold as the main implementation, and having next be defined in terms of that as self.try_for_each(Err).err().
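
A stand-alone check of that identity on a plain Vec iterator (not the ResultShunt code itself), showing that `try_for_each(Err).err()` pulls exactly one element:

```rust
fn main() {
    let mut it = vec![1, 2, 3].into_iter();
    // `Err` makes `try_for_each` short-circuit on the very first item and hand it
    // back, so `.err()` recovers exactly one element, i.e. it behaves like `next()`.
    let first: Option<i32> = it.try_for_each(Err).err();
    assert_eq!(first, Some(1));
    assert_eq!(it.next(), Some(2)); // the rest of the iterator is untouched
}
```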

@nnethercote
Contributor Author

So the closure passed to these iterator methods could be duplicated four times? Huh. That's aggressive. Looking at the results in #64600, that accounts for about half of the speedups. Even half of this win is still a big win.

I also wonder if it's worth using the simple implementations for debug builds, where losing some runtime speed in favour of faster compilation is a more acceptable trade-off.

@Centril
Contributor

Centril commented Sep 19, 2019

One option: we could ostensibly #[cfg(...)] some code to detect whether rustc is being built and change the code slightly accordingly.

@Mark-Simulacrum
Member

Er, no? libstd/core etc are shipped for end-users and for rustc itself as the same binary, so we can't do things differently really in that way. But I also don't think that would be entirely useful.

@scottmcm
Member

scottmcm commented Sep 19, 2019

Out of curiosity, has anyone looked at what clap's doing that makes it an order-of-magnitude outlier here? 2x faster for the patched incremental just because of this change seems outlandishly good.

(Oh, I spotted the post above about the call counts for these methods. Is it lots of calls with trivial bodies? Are the bodies really complicated? Is it typechecking or codegen or ...? For example, if a bunch of the cost is coming from the generic instead of bools, we can use a different type in Iterator...)

@bluss
Member

bluss commented Sep 19, 2019

This blog post writes in detail about what internal iteration gives to Rust. The headline can be taken with a grain of salt: "Rust’s iterators are inefficient, and here’s what we can do about it."

Just to give more background to why Iterator's .try_fold() and .all() have their current shape.

@rpjohnst
Contributor

As a longer-term solution, I wonder if closures themselves could be made cheaper to compile, so rustc could avoid building so much LLVM IR even when handed the internal iterator versions of these functions.

Improvements in that direction could potentially also hugely benefit generator-based iterator implementations, if such a thing becomes common.

For example, the Thorin research IR has some higher-order optimizations that can eliminate closures before they are lowered: http://compilers.cs.uni-saarland.de/papers/lkh15_cgo.pdf. Perhaps something like this could be adapted to run on MIR?

@nnethercote
Contributor Author

Out of curiosity, has anyone looked at what clap's doing that makes it an order-of-magnitude outlier here? 2x faster for the patched incremental just because of this change seems outlandishly good.

I haven't looked closely, but the 4x unrolling means that any large closure (especially if it's marked with #[inline(always)]) can cause a lot of code bloat. This is true for the hot all within rustc itself that I mentioned above.

bors added a commit that referenced this pull request Sep 24, 2019
Remove manual unrolling from slice::Iter(Mut)::try_fold

While this definitely helps sometimes (particularly for trivial closures), it's also a pessimization sometimes, so it's better to leave this to (hypothetical) future LLVM improvements instead of forcing this on everyone.

I think it's better for the advice to be that sometimes you need to unroll manually than that you sometimes need to not-unroll manually (like #64545).

---

For context see #64572 (comment)
@JohnCSimon
Member

Ping from triage
@scottmcm @nnethercote This conversation seems to have stalled and this PR has sat idle for a week.
Thanks!

@@ -319,7 +319,7 @@ pub trait Iterator {
     #[inline]
     #[stable(feature = "rust1", since = "1.0.0")]
     fn nth(&mut self, mut n: usize) -> Option<Self::Item> {
-        for x in self {
+        while let Some(x) = self.next() {
Member

@bluss bluss Sep 28, 2019

This is a good simplification

@@ -1855,18 +1855,15 @@ pub trait Iterator {
     /// ```
     #[inline]
     #[stable(feature = "rust1", since = "1.0.0")]
-    fn all<F>(&mut self, f: F) -> bool where
+    fn all<F>(&mut self, mut f: F) -> bool where
Member

@bluss bluss Sep 28, 2019

I'd like to keep these Iterator::all etc implementations; there are examples where removing them would be a big regression. We have already put in the work to make sure we thread internal iteration methods through various adaptors correctly, so removing the use of try_fold here at the top level would be a step back.

This change only partly rolls back what we have worked on for internal iteration — other entry points to it still remain, like sum, max, min, fold and try_fold. And those remaining happen to be the most useful in stable Rust (because implementing try_fold is not possible in stable, so custom iterators can't fully take advantage of this particular all magic). It would also be inconsistent to partly remove it; I think we should keep it all.

Here's an example with a microbenchmark of the kind of improvement it can bring: rust-itertools/itertools#348. In that example it was fold, but it will also apply to all once try_fold is stable.

I think we should go through fold/try_fold everywhere we can, for the structural improvements that this gives (it bypasses the complicated state machines that composed iterators might have in .next()).

Member

Of course the code inside Iterator::all is not beautiful to read as it is now, and if it could be simplified while keeping the same features, that would be fantastic.

@andjo403
Contributor

This PR changes documented stable behavior; see https://doc.rust-lang.org/std/iter/trait.Iterator.html#note-to-implementors

bors added a commit that referenced this pull request Sep 30, 2019
Remove manual unrolling from slice::Iter(Mut)::try_fold

@nnethercote
Contributor Author

It's clear that this can't land as is. But follow-up PRs such as #64600 (already landed) and #64885 will get a lot of the gains.

@scottmcm
Member

scottmcm commented Oct 1, 2019

this PR changes documented stable behavior

@andjo403 Note that the try_fold method cannot yet be implemented on stable (because Try is unstable), so there may still be options available.

tmandry added a commit to tmandry/rust that referenced this pull request Oct 2, 2019
use try_fold instead of try_for_each to reduce compile time

As it was stated in rust-lang#64572 that the biggest gain was due to less code being generated, I tried to reduce the number of functions to inline by using try_fold directly instead of calling try_for_each, which calls try_fold.

As there are some gains from using the try_fold function directly, this is maybe a way forward.
When I tried to compile the clap-rs benchmark, the timing gains I got were only some % away from those in rust-lang#64572.

There are more functions, e.g. fold, that call try_fold and could also be changed, but the question is how much "duplication" is tolerated in std to give faster compile times.

can someone start a perf run?

cc @nnethercote @scottmcm @bluss
r? @ghost
@nnethercote
Contributor Author

In short, LLVM is doing way less stuff. My best guess is that clap-rs must use some or all of these iterator methods a lot, and these methods get inlined, and so there ends up being less code to compile. I.e., it's not that rustc is that much faster, but the rust libs are much less bloated.

I did a quick grep for these methods occurring in clap-rs, here are the counts (which could include false positive matches):

  • all: 2
  • any: 59
  • find: 33
  • find_map: 0

Out of curiosity, I tried introducing the simplified versions of these four functions one at a time on clap-rs. Here are the instruction counts, as reported by Cachegrind.

                     check           debug           opt
Original             16,285M         61,818M         183,768M
all                  16,287M         61,732M         183,660M
any                  16,291M         57,746M         178,922M
find                 16,284M         54,560M         174,997M
find_map             16,291M         54,441M         174,861M

There were big drops in the debug and opt builds when any and find were converted, and negligible changes when all and find_map were converted, which matches up with the number of calls in the code.

@nnethercote nnethercote deleted the simplify-Iterator-methods branch October 9, 2019 22:41
@Darksonn
Contributor

Which iterators are slowed down by the try_for_each based any implementation? Perhaps it would make sense to override the any implementation on those iterators instead of replacing the default any implementation?
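
A hedged sketch of that suggestion: the default any() stays simple, and a hypothetical two-part iterator (standing in for something chain-like) overrides any() itself to keep its internal-iteration advantage:

```rust
struct TwoSlices<'a, T> {
    first: std::slice::Iter<'a, T>,
    second: std::slice::Iter<'a, T>,
}

impl<'a, T> Iterator for TwoSlices<'a, T> {
    type Item = &'a T;

    fn next(&mut self) -> Option<&'a T> {
        // External iteration: every call re-checks which half is active.
        if let Some(x) = self.first.next() {
            return Some(x);
        }
        self.second.next()
    }

    // Override just this one sink: two plain loops, no per-element "which half?"
    // check, while the default `any()` for every other iterator stays simple.
    fn any<F>(&mut self, mut f: F) -> bool
    where
        F: FnMut(&'a T) -> bool,
    {
        self.first.any(&mut f) || self.second.any(&mut f)
    }
}
```

Usage would be e.g. `TwoSlices { first: a.iter(), second: b.iter() }.any(|x| ...)`; whether this is worth the extra code depends on which adaptors actually show up hot, which is exactly the question being asked.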
