Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

RangeInclusive internal iteration performance improvement. #58122

Merged
merged 2 commits into from
Feb 23, 2019

Conversation

matthieu-m
Copy link
Contributor

@matthieu-m matthieu-m commented Feb 3, 2019

Specialize Iterator::try_fold and DoubleEndedIterator::try_rfold to improve code generation in all internal iteration scenarios.

This changes brings the performance of internal iteration with RangeInclusive on par with the performance of iteration with Range:

  • Single conditional jump in hot loop,
  • Unrolling and vectorization,
  • And even Closed Form substitution.

Unfortunately, it only applies to internal iteration. Despite various attempts at stream-lining the implementation of next and next_back, LLVM has stubbornly refused to optimize external iteration appropriately, leaving me with a choice between:

  • The current implementation, for which Closed Form substitution is performed, but which uses 2 conditional jumps in the hot loop when optimization fail.
  • An implementation using a is_done boolean, which uses 1 conditional jump in the hot loop when optimization fail, allowing unrolling and vectorization, but for which Closed Form substitution fails.

In the absence of any conclusive evidence as to which usecase matters most, and with no assurance that the lack of Closed Form substitution is not indicative of other optimizations being foiled, there is no way
to pick one implementation over the other, and thus I defer to the statu quo as far as next and next_back are concerned.

Specialize Iterator::try_fold and DoubleEndedIterator::try_rfold to
improve code generation in all internal iteration scenarios.

This changes brings the performance of internal iteration with
RangeInclusive on par with the performance of iteration with Range:

 - Single conditional jump in hot loop,
 - Unrolling and vectorization,
 - And even Closed Form substitution.

Unfortunately, it only applies to internal iteration. Despite various
attempts at stream-lining the implementation of next and next_back,
LLVM has stubbornly refused to optimize external iteration
appropriately, leaving me with a choice between:

 - The current implementation, for which Closed Form substitution is
   performed, but which uses 2 conditional jumps in the hot loop when
   optimization fail.
 - An implementation using a "is_done" boolean, which uses 1
   conditional jump in the hot loop when optimization fail, allowing
   unrolling and vectorization, but for which Closed Form substitution
   fails.

In the absence of any conclusive evidence as to which usecase matters
most, and with no assurance that the lack of Closed Form substitution
is not indicative of other optimizations being foiled, there is no way
to pick one implementation over the other, and thus I defer to the
statu quo as far as next and next_back are concerned.
@rust-highfive
Copy link
Collaborator

Thanks for the pull request, and welcome! The Rust team is excited to review your changes, and you should hear from @dtolnay (or someone else) soon.

If any changes to this PR are deemed necessary, please add them as extra commits. This ensures that the reviewer can see what has changed since they last reviewed the code. Due to the way GitHub handles out-of-date commits, this should also make it reasonably obvious what issues have or haven't been addressed. Large or tricky changes may require several passes of review and changes.

Please see the contribution instructions for more information.

@rust-highfive rust-highfive added the S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. label Feb 3, 2019
@matthieu-m
Copy link
Contributor Author

Take 2 on #57378 .

src/libcore/iter/range.rs Show resolved Hide resolved
@dtolnay dtolnay added S-waiting-on-author Status: This is awaiting some action (such as code changes or more information) from the author. and removed S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. labels Feb 6, 2019
@matthieu-m
Copy link
Contributor Author

@dtolnay : I've addressed the concerns raised by scottmcm, so this should be ready for review.

@dtolnay dtolnay added S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. and removed S-waiting-on-author Status: This is awaiting some action (such as code changes or more information) from the author. labels Feb 11, 2019
Copy link
Member

@dtolnay dtolnay left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The implementation looks good to me, but since this is intended to be a performance improvement only, do you have a benchmark that other people could run to confirm the improvement?

@matthieu-m
Copy link
Contributor Author

matthieu-m commented Feb 15, 2019

During development, I used the following benchmark https://gist.github.com/matthieu-m/3a9ab9afc4eee80565c8d95f94db6fa4 where FixedRangeInclusive is a copy/paste of the current version, with the performance improvements of this PR.

On my machine, I get:

Benchmark Exclusive Chain Current Proposed (vs Current)
sum 1.1315 ns 4.8270 ns 972.33 ps 1.5776 ns (+62%)
triangle foreach 1.1667 ns 1.7881 ns 978.75 ps 1.5791 ns (+61%)
triangle loop 1.1299 ns 1.1643 ms 968.94 ps 975.43 ps (-0.34%)
addmul foreach 1.1611 ms 1.1633 ms 1.1799 ms 1.1644 ms (-1.3%)
addmul loop 1.1689 ms 1.1785 ms 1.1658 ms 1.1638 ms (-0.17%)
triples 960.00 us 889.76 us 1.8435 ms 968.05 us (-47%)

Where we can see that:

  • The sum and triangle foreach benchmarks are all transformed into a closed form, but not the same one.
  • The triangle loop and addmul loop benchmarks use the same next() implementation, so are mostly interesting to see the small imprecision of such measurements.
  • The addmul foreach benchmark is simple enough that LLVM was already optimizing it well, so the small improvement is close to the noise.
  • The triples benchmark, which originally prompted this work, shows a x2 improvement.

Overall, the new RangeInclusive implementation performs on par with the Range one (modulo closed form), as the performance cliff on triples is smoothed out.

Copy link
Member

@dtolnay dtolnay left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

@dtolnay
Copy link
Member

dtolnay commented Feb 15, 2019

@bors r+

@bors
Copy link
Contributor

bors commented Feb 15, 2019

📌 Commit 4fed67f has been approved by dtolnay

@bors bors added S-waiting-on-bors Status: Waiting on bors to run and complete tests. Bors will change the label on completion. and removed S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. labels Feb 15, 2019
@bors
Copy link
Contributor

bors commented Feb 15, 2019

⌛ Testing commit 4fed67f with merge 74beb94d93bb043326dc2b1c80d53bbc054031d1...

@bors
Copy link
Contributor

bors commented Feb 15, 2019

💔 Test failed - checks-travis

@rust-highfive
Copy link
Collaborator

Your PR failed on Travis (raw log). Through arcane magic we have determined that the following fragments from the build log may contain information about the problem.

Click to expand the log.

I'm a bot! I can only do what humans tell me to, so if this was not helpful or you have suggestions for improvements, please ping or otherwise contact @TimNN. (Feature Requests)

@bors bors added S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. and removed S-waiting-on-bors Status: Waiting on bors to run and complete tests. Bors will change the label on completion. labels Feb 15, 2019
@pietroalbini
Copy link
Member

@bors retry -- apparently we can't clone the repo anymore on macOS

@bors bors added S-waiting-on-bors Status: Waiting on bors to run and complete tests. Bors will change the label on completion. and removed S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. labels Feb 15, 2019
Centril added a commit to Centril/rust that referenced this pull request Feb 20, 2019
RangeInclusive internal iteration performance improvement.

Specialize `Iterator::try_fold` and `DoubleEndedIterator::try_rfold` to improve code generation in all internal iteration scenarios.

This changes brings the performance of internal iteration with `RangeInclusive` on par with the performance of iteration with `Range`:

 - Single conditional jump in hot loop,
 - Unrolling and vectorization,
 - And even Closed Form substitution.

Unfortunately, it only applies to internal iteration. Despite various attempts at stream-lining the implementation of `next` and `next_back`, LLVM has stubbornly refused to optimize external iteration appropriately, leaving me with a choice between:

 - The current implementation, for which Closed Form substitution is performed, but which uses 2 conditional jumps in the hot loop when optimization fail.
 - An implementation using a `is_done` boolean, which uses 1 conditional jump in the hot loop when optimization fail, allowing unrolling and vectorization, but for which Closed Form substitution fails.

In the absence of any conclusive evidence as to which usecase matters most, and with no assurance that the lack of Closed Form substitution is not indicative of other optimizations being foiled, there is no way
to pick one implementation over the other, and thus I defer to the statu quo as far as `next` and `next_back` are concerned.
Centril added a commit to Centril/rust that referenced this pull request Feb 23, 2019
RangeInclusive internal iteration performance improvement.

Specialize `Iterator::try_fold` and `DoubleEndedIterator::try_rfold` to improve code generation in all internal iteration scenarios.

This changes brings the performance of internal iteration with `RangeInclusive` on par with the performance of iteration with `Range`:

 - Single conditional jump in hot loop,
 - Unrolling and vectorization,
 - And even Closed Form substitution.

Unfortunately, it only applies to internal iteration. Despite various attempts at stream-lining the implementation of `next` and `next_back`, LLVM has stubbornly refused to optimize external iteration appropriately, leaving me with a choice between:

 - The current implementation, for which Closed Form substitution is performed, but which uses 2 conditional jumps in the hot loop when optimization fail.
 - An implementation using a `is_done` boolean, which uses 1 conditional jump in the hot loop when optimization fail, allowing unrolling and vectorization, but for which Closed Form substitution fails.

In the absence of any conclusive evidence as to which usecase matters most, and with no assurance that the lack of Closed Form substitution is not indicative of other optimizations being foiled, there is no way
to pick one implementation over the other, and thus I defer to the statu quo as far as `next` and `next_back` are concerned.
Centril added a commit to Centril/rust that referenced this pull request Feb 23, 2019
RangeInclusive internal iteration performance improvement.

Specialize `Iterator::try_fold` and `DoubleEndedIterator::try_rfold` to improve code generation in all internal iteration scenarios.

This changes brings the performance of internal iteration with `RangeInclusive` on par with the performance of iteration with `Range`:

 - Single conditional jump in hot loop,
 - Unrolling and vectorization,
 - And even Closed Form substitution.

Unfortunately, it only applies to internal iteration. Despite various attempts at stream-lining the implementation of `next` and `next_back`, LLVM has stubbornly refused to optimize external iteration appropriately, leaving me with a choice between:

 - The current implementation, for which Closed Form substitution is performed, but which uses 2 conditional jumps in the hot loop when optimization fail.
 - An implementation using a `is_done` boolean, which uses 1 conditional jump in the hot loop when optimization fail, allowing unrolling and vectorization, but for which Closed Form substitution fails.

In the absence of any conclusive evidence as to which usecase matters most, and with no assurance that the lack of Closed Form substitution is not indicative of other optimizations being foiled, there is no way
to pick one implementation over the other, and thus I defer to the statu quo as far as `next` and `next_back` are concerned.
Centril added a commit to Centril/rust that referenced this pull request Feb 23, 2019
RangeInclusive internal iteration performance improvement.

Specialize `Iterator::try_fold` and `DoubleEndedIterator::try_rfold` to improve code generation in all internal iteration scenarios.

This changes brings the performance of internal iteration with `RangeInclusive` on par with the performance of iteration with `Range`:

 - Single conditional jump in hot loop,
 - Unrolling and vectorization,
 - And even Closed Form substitution.

Unfortunately, it only applies to internal iteration. Despite various attempts at stream-lining the implementation of `next` and `next_back`, LLVM has stubbornly refused to optimize external iteration appropriately, leaving me with a choice between:

 - The current implementation, for which Closed Form substitution is performed, but which uses 2 conditional jumps in the hot loop when optimization fail.
 - An implementation using a `is_done` boolean, which uses 1 conditional jump in the hot loop when optimization fail, allowing unrolling and vectorization, but for which Closed Form substitution fails.

In the absence of any conclusive evidence as to which usecase matters most, and with no assurance that the lack of Closed Form substitution is not indicative of other optimizations being foiled, there is no way
to pick one implementation over the other, and thus I defer to the statu quo as far as `next` and `next_back` are concerned.
Centril added a commit to Centril/rust that referenced this pull request Feb 23, 2019
RangeInclusive internal iteration performance improvement.

Specialize `Iterator::try_fold` and `DoubleEndedIterator::try_rfold` to improve code generation in all internal iteration scenarios.

This changes brings the performance of internal iteration with `RangeInclusive` on par with the performance of iteration with `Range`:

 - Single conditional jump in hot loop,
 - Unrolling and vectorization,
 - And even Closed Form substitution.

Unfortunately, it only applies to internal iteration. Despite various attempts at stream-lining the implementation of `next` and `next_back`, LLVM has stubbornly refused to optimize external iteration appropriately, leaving me with a choice between:

 - The current implementation, for which Closed Form substitution is performed, but which uses 2 conditional jumps in the hot loop when optimization fail.
 - An implementation using a `is_done` boolean, which uses 1 conditional jump in the hot loop when optimization fail, allowing unrolling and vectorization, but for which Closed Form substitution fails.

In the absence of any conclusive evidence as to which usecase matters most, and with no assurance that the lack of Closed Form substitution is not indicative of other optimizations being foiled, there is no way
to pick one implementation over the other, and thus I defer to the statu quo as far as `next` and `next_back` are concerned.
Centril added a commit to Centril/rust that referenced this pull request Feb 23, 2019
RangeInclusive internal iteration performance improvement.

Specialize `Iterator::try_fold` and `DoubleEndedIterator::try_rfold` to improve code generation in all internal iteration scenarios.

This changes brings the performance of internal iteration with `RangeInclusive` on par with the performance of iteration with `Range`:

 - Single conditional jump in hot loop,
 - Unrolling and vectorization,
 - And even Closed Form substitution.

Unfortunately, it only applies to internal iteration. Despite various attempts at stream-lining the implementation of `next` and `next_back`, LLVM has stubbornly refused to optimize external iteration appropriately, leaving me with a choice between:

 - The current implementation, for which Closed Form substitution is performed, but which uses 2 conditional jumps in the hot loop when optimization fail.
 - An implementation using a `is_done` boolean, which uses 1 conditional jump in the hot loop when optimization fail, allowing unrolling and vectorization, but for which Closed Form substitution fails.

In the absence of any conclusive evidence as to which usecase matters most, and with no assurance that the lack of Closed Form substitution is not indicative of other optimizations being foiled, there is no way
to pick one implementation over the other, and thus I defer to the statu quo as far as `next` and `next_back` are concerned.
bors added a commit that referenced this pull request Feb 23, 2019
Rollup of 16 pull requests

Successful merges:

 - #58100 (Transition librustdoc to Rust 2018)
 - #58122 (RangeInclusive internal iteration performance improvement.)
 - #58199 (Add better error message for partial move)
 - #58227 (Updated RELEASES.md for 1.33.0)
 - #58353 (Check the Self-type of inherent associated constants)
 - #58453 (SGX target: fix panic = abort)
 - #58476 (Remove `LazyTokenStream`.)
 - #58526 (Special suggestion for illegal unicode curly quote pairs)
 - #58595 (Turn duration consts into associated consts)
 - #58609 (Allow Self::Module to be mutated.)
 - #58628 (Optimise vec![false; N] to zero-alloc)
 - #58643 (Don't generate minification variables if minification disabled)
 - #58648 (Update tests to account for cross-platform testing and miri.)
 - #58654 (Do not underflow after resetting unmatched braces count)
 - #58658 (Add expected/provided byte alignments to validation error message)
 - #58667 (Reduce Miri-related Code Repetition `like (n << amt) >> amt`)

Failed merges:

r? @ghost
@bors bors merged commit 4fed67f into rust-lang:master Feb 23, 2019
@matthieu-m matthieu-m deleted the range_incl_perf branch February 24, 2019 09:21
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
S-waiting-on-bors Status: Waiting on bors to run and complete tests. Bors will change the label on completion.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants