Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Override try_fold for StepBy #51435

Closed
wants to merge 1 commit into from

Conversation

SimonSapin
Copy link
Contributor

No description provided.

@SimonSapin SimonSapin added T-libs-api Relevant to the library API team, which will review and decide on the PR/issue. A-iterators Area: Iterators labels Jun 8, 2018
@rust-highfive
Copy link
Collaborator

r? @kennytm

(rust_highfive has picked a reviewer for you, use r? to override)

@rust-highfive rust-highfive added the S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. label Jun 8, 2018
@SimonSapin
Copy link
Contributor Author

@scottmcm This PR is from your repo. I’m not good at reading assembly but your comment at #27741 (comment) suggests that this is an improvement. Should we land it?

The diff looks good to me.

@kennytm
Copy link
Member

kennytm commented Jun 8, 2018

I've done a simple benchmark and the test code from the comment is improved from 800µs/iter to 600µs/iter.

running 2 tests
test pr_bench  ... bench:     592,516 ns/iter (+/- 42,563)
test std_bench ... bench:     809,164 ns/iter (+/- 54,219)

Source code:

#![feature(try_trait, test)]

extern crate test;

use std::ops::Try;

use test::{Bencher, black_box};

#[no_mangle]
pub fn std_compute(a: u64, b: u64) -> u64 {
    StepByWithoutTryFold(StepBy {
        iter: a..b,
        step: 6,
        first_take: true,
    }).map(|x| x ^ (x - 1)).sum()
}

#[no_mangle]
pub fn pr_compute(a: u64, b: u64) -> u64 {
    StepBy {
        iter: a..b,
        step: 6,
        first_take: true,
    }.map(|x| x ^ (x - 1)).sum()
}

#[bench]
fn std_bench(bencher: &mut Bencher) {
    let a = black_box(1);
    let b = black_box(5000000);
    bencher.iter(|| {
        black_box(std_compute(a, b));
    });
}

#[bench]
fn pr_bench(bencher: &mut Bencher) {
    let a = black_box(1);
    let b = black_box(5000000);
    bencher.iter(|| {
        black_box(pr_compute(a, b));
    });
}

struct StepBy<I> {
    iter: I,
    step: usize,
    first_take: bool,
}

struct StepByWithoutTryFold<I>(StepBy<I>);

impl<I: Iterator> Iterator for StepBy<I> {
    type Item = I::Item;

    #[inline]
    fn next(&mut self) -> Option<Self::Item> {
        if self.first_take {
            self.first_take = false;
            self.iter.next()
        } else {
            self.iter.nth(self.step)
        }
    }

    #[inline]
    fn try_fold<B, F, R>(&mut self, init: B, mut f: F) -> R where
        Self: Sized, F: FnMut(B, Self::Item) -> R, R: Try<Ok=B>
    {
        let mut accum = init;

        if self.first_take {
            self.first_take = false;
            if let Some(x) = self.iter.next() {
                accum = f(accum, x)?;
            } else {
                return Try::from_ok(accum);
            }
        }

        while let Some(x) = self.iter.nth(self.step) {
            accum = f(accum, x)?;
        }
        Try::from_ok(accum)
    }
}

impl<I: Iterator> Iterator for StepByWithoutTryFold<I> {
    type Item = I::Item;

    #[inline]
    fn next(&mut self) -> Option<Self::Item> {
        if self.0.first_take {
            self.0.first_take = false;
            self.0.iter.next()
        } else {
            self.0.iter.nth(self.0.step)
        }
    }
}

@scottmcm
Copy link
Member

scottmcm commented Jun 9, 2018

Hmm, I'd forgotten I'd started this 😅 I see two options:

  1. Just merge this, since it helps -- thanks for benching, @kennytm! (Well, with some tests, especially for the short-circuiting logic, since that's the thing I got wrong the most doing the other try_folds.)

  2. Reimplement StepBy so that the bool flag isn't needed, and thus the manual try_fold is likely also unneeded (since there's nothing to manually hoist). The flag is there so it can use nth, but if there was a method that called next n times, returning the first result, then the flag wouldn't be needed, and it would actually be easier for Range to implement. Sketch of what I mean: https://play.rust-lang.org/?gist=f2c028e6e42e9ce5f3a311b71dcfc3a9&version=nightly&mode=release

Thoughts?

@kennytm
Copy link
Member

kennytm commented Jun 12, 2018

I think we could do the refactoring later.

@bors r+

@bors
Copy link
Contributor

bors commented Jun 12, 2018

📌 Commit 2d55d28 has been approved by kennytm

@bors bors added S-waiting-on-bors Status: Waiting on bors to run and complete tests. Bors will change the label on completion. and removed S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. labels Jun 12, 2018
@bors
Copy link
Contributor

bors commented Jun 13, 2018

🔒 Merge conflict

@bors bors added S-waiting-on-author Status: This is awaiting some action (such as code changes or more information) from the author. and removed S-waiting-on-bors Status: Waiting on bors to run and complete tests. Bors will change the label on completion. labels Jun 13, 2018
@Emerentius
Copy link
Contributor

I've rerun @kennytm's code but replaced the std code by the specialization from my PR referencing this issue just above this comment. With that, this PR would lower performance.

test pr_bench             ... bench:     647,483 ns/iter (+/- 12,094)
test specialized_pr_bench ... bench:     413,365 ns/iter (+/- 13,795)
#![feature(try_trait, test, step_trait)]
use std::iter::Step;

extern crate test;

use std::ops::Try;

use test::{Bencher, black_box};

#[no_mangle]
pub fn std_compute(a: u64, b: u64) -> u64 {
    StepByWithoutTryFold(StepBy {
        iter: a..b,
        step: 6,
        first_take: true,
    }).map(|x| x ^ (x - 1)).sum()
}

#[no_mangle]
pub fn pr_compute(a: u64, b: u64) -> u64 {
    StepBy {
        iter: a..b,
        step: 6,
        first_take: true,
    }.map(|x| x ^ (x - 1)).sum()
}

#[bench]
fn specialized_pr_bench(bencher: &mut Bencher) {
    let a = black_box(1);
    let b = black_box(5000000);
    bencher.iter(|| {
        black_box(std_compute(a, b));
    });
}

#[bench]
fn pr_bench(bencher: &mut Bencher) {
    let a = black_box(1);
    let b = black_box(5000000);
    bencher.iter(|| {
        black_box(pr_compute(a, b));
    });
}

struct StepBy<I> {
    iter: I,
    step: usize,
    first_take: bool,
}

struct StepByWithoutTryFold<I>(StepBy<I>);

impl<I: Iterator> Iterator for StepBy<I> {
    type Item = I::Item;

    #[inline]
    fn next(&mut self) -> Option<Self::Item> {
        if self.first_take {
            self.first_take = false;
            self.iter.next()
        } else {
            self.iter.nth(self.step)
        }
    }

    #[inline]
    fn try_fold<B, F, R>(&mut self, init: B, mut f: F) -> R where
        Self: Sized, F: FnMut(B, Self::Item) -> R, R: Try<Ok=B>
    {
        let mut accum = init;

        if self.first_take {
            self.first_take = false;
            if let Some(x) = self.iter.next() {
                accum = f(accum, x)?;
            } else {
                return Try::from_ok(accum);
            }
        }

        while let Some(x) = self.iter.nth(self.step) {
            accum = f(accum, x)?;
        }
        Try::from_ok(accum)
    }
}

impl<T> Iterator for StepByWithoutTryFold<std::ops::Range<T>>
where
    T: Step,
{
    type Item = T;

    #[inline]
    fn next(&mut self) -> Option<Self::Item> {
        self.0.first_take = false;
        if self.0.iter.start >= self.0.iter.end {
            return None;
        }
        // add 1 to self.step to get original step size back
        // it was decremented for the general case on construction
        if let Some(n) = self.0.iter.start.add_usize(self.0.step+1) {
            let next = std::mem::replace(&mut self.0.iter.start, n);
            Some(next)
        } else {
            let last = self.0.iter.start.clone();
            self.0.iter.start = self.0.iter.end.clone();
            Some(last)
        }
    }
}

@SimonSapin
Copy link
Contributor Author

Thanks for looking into this @Emerentius. Closing in favor of #51601.

@SimonSapin SimonSapin closed this Jun 18, 2018
bors added a commit that referenced this pull request Jun 21, 2018
Specialize StepBy<Range(Inclusive)>

Part of #51557, related to #43064, #31155

As discussed in the above issues, `step_by` optimizes very badly on ranges which is related to
1. the special casing of the first `StepBy::next()` call
2. the need to do 2 additions of `n - 1` and `1` inside the range's `next()`

This PR eliminates both by overriding `next()` to always produce the current element and also step ahead by `n` elements in one go. The generated code is much better, even identical in the case of a `Range` with constant `start` and `end` where `start+step` can't overflow. Without constant bounds it's a bit longer than the manual loop. `RangeInclusive` doesn't optimize as nicely but is still much better than the original asm.
Unsigned integers optimize better than signed ones for some reason.

See the following two links for a comparison.

[godbolt: specialization for ..](https://godbolt.org/g/haHLJr)
[godbolt: specialization for ..=](https://godbolt.org/g/ewyMu6)

`RangeFrom`, the only other range with an `Iterator` implementation can't be specialized like this without changing behaviour due to overflow. There is no way to save "finished-ness".

The approach can not be used in general, because it would produce side effects of the underlying iterator too early.

May obsolete #51435, haven't checked.
@timvermeulen
Copy link
Contributor

timvermeulen commented May 7, 2019

Since #51601 was reverted, should this be considered again?

Centril added a commit to Centril/rust that referenced this pull request Sep 9, 2019
… r=scottmcm

Override `StepBy::{try_fold, try_rfold}`

Previous PR: rust-lang#51435

The previous PR was closed in favor of rust-lang#51601, which was later reverted. I don't think these implementations will make it harder to specialize `StepBy<Range<_>>` later, so we should be able to land this without any consequences.

This should fix rust-lang#57517 – in my benchmarks `iter` and `iter.step_by(1)` now perform equally well, provided internal iteration is used.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-iterators Area: Iterators S-waiting-on-author Status: This is awaiting some action (such as code changes or more information) from the author. T-libs-api Relevant to the library API team, which will review and decide on the PR/issue.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants