
FlatMap, Flatten appear to optimize badly #87411

Closed
adrian17 opened this issue Jul 23, 2021 · 10 comments
Labels
I-slow Issue: Problems and improvements with respect to performance of generated code.

Comments

@adrian17

Context: I recently saw a perf regression caused by the following change:

```rust
pub fn pixels_rgba(&self) -> Vec<u8> {
    let mut output = Vec::new();
    for p in &self.pixels { output.extend_from_slice(&[p.red(), p.green(), p.blue(), p.alpha()]) }
    output
}
```
changed to:
```rust
pub fn pixels_rgba(&self) -> Vec<u8> {
    self.pixels.iter().flat_map(|p| [p.red(), p.green(), p.blue(), p.alpha()]).collect()
}
```

The change seemed reasonable, under the assumption that it would be more idiomatic and would guarantee that the output Vec is preallocated. Unfortunately, it made the function several times slower. The recently merged #87168 helped slightly, but the generated code is still much slower.

The same can also be observed for other code using flat_map or flatten, such as a simple iteration over the iterator, and regardless of whether the flattened type has a known size (an array) or not.
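For context, a version that keeps the explicit loop but also preallocates is a possible middle ground between the two snippets above. This is a sketch, not code from the report; `Pixel` here is a hypothetical stand-in for the real pixel type, and the function takes a slice instead of `&self` so it is self-contained:

```rust
// Hypothetical stand-in for the pixel type from the report.
struct Pixel(u8, u8, u8, u8);

impl Pixel {
    fn red(&self) -> u8 { self.0 }
    fn green(&self) -> u8 { self.1 }
    fn blue(&self) -> u8 { self.2 }
    fn alpha(&self) -> u8 { self.3 }
}

// Keeps the simple loop, but reserves the exact output size up front,
// so no reallocation happens while pushing.
fn pixels_rgba(pixels: &[Pixel]) -> Vec<u8> {
    let mut output = Vec::with_capacity(pixels.len() * 4);
    for p in pixels {
        output.extend_from_slice(&[p.red(), p.green(), p.blue(), p.alpha()]);
    }
    output
}

fn main() {
    let pixels = [Pixel(1, 2, 3, 4), Pixel(5, 6, 7, 8)];
    assert_eq!(pixels_rgba(&pixels), vec![1, 2, 3, 4, 5, 6, 7, 8]);
}
```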

I made an example benchmark in the repo https://github.com/adrian17/flat_map_perf , with the following results on my machine (with today's nightly rustc):


```
tests::bench_array_4x500000_collect_loop               1,269,560 ns/iter (+/- 146,537)
tests::bench_array_4x500000_collect_loop_with_prealloc 1,255,140 ns/iter (+/- 165,287)
tests::bench_array_4x500000_collect_with_flat_map      2,697,082 ns/iter (+/- 303,411)

tests::bench_array_4x500000_iteration_nested_loop        220,838 ns/iter (+/- 25,307)
tests::bench_array_4x500000_iteration_flat_map         3,029,744 ns/iter (+/- 463,749)


tests::bench_iter_4000x500_collect_loop                  243,537 ns/iter (+/- 34,574)
tests::bench_iter_4000x500_collect_loop_with_prealloc    243,246 ns/iter (+/- 34,197)
tests::bench_iter_4000x500_collect_with_flatten        3,521,586 ns/iter (+/- 597,755)

tests::bench_iter_4000x500_iteration_nested_loop         290,939 ns/iter (+/- 34,414)
tests::bench_iter_4000x500_iteration_flatten           2,099,386 ns/iter (+/- 512,732)


tests::bench_iter_4x500000_collect_loop                3,124,601 ns/iter (+/- 444,296)
tests::bench_iter_4x500000_collect_loop_with_prealloc  2,873,051 ns/iter (+/- 576,719)
tests::bench_iter_4x500000_collect_with_flatten        5,579,601 ns/iter (+/- 796,355)

tests::bench_iter_4x500000_iteration_nested_loop       2,118,351 ns/iter (+/- 396,325)
tests::bench_iter_4x500000_iteration_flatten           3,187,518 ns/iter (+/- 443,080)
```
@the8472 (Member) commented Jul 23, 2021

@adrian17 would it be possible to license those benchmarks in a way that's compatible with the rustc licenses so they can be included in the std benchmarks?

@JulianKnodt (Contributor) commented Jul 24, 2021

Looking at the last few benchmarks, it would appear that the main issue is that flatten doesn't vectorize well:
https://rust.godbolt.org/z/Mejb1GMso

I didn't check the other ones, but there may be two separate issues: one where flatten doesn't vectorize well, and another where collect cannot preallocate the correct amount.

As for the original change, in my opinion converting everything into an iterator is probably not more idiomatic, and is likely less efficient when already working with slices, because it relies both on traits propagating through the iterator adapters (which may not all be implemented) and on the compiler optimizing through more layers of abstraction. That being said, the perf regression is surprising and worth digging into.

Edit: When comparing iteration_flatten vs. iteration_nested_loop, replacing the for loop in the flatten version with a for_each produces very similar assembly for both, which I personally did not expect.

@the8472 (Member) commented Jul 24, 2021

replacing the for loop in flatten with a for_each outputs very similar assembly for both, which is personally unexpected.

Actually, this is expected: for_each and related methods were introduced precisely because internal iteration is easier to optimize than external iteration.

https://medium.com/@veedrac/rust-is-slow-and-i-am-the-cure-32facc0fdcb
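As a minimal sketch of the difference (not from the benchmark repo): both forms below compute the same result, but the `for_each` form lets the `Flatten` adapter drive the closure itself instead of being repeatedly polled through `next()`, which tends to optimize down to the same code as a hand-written nested loop:

```rust
fn main() {
    let data: Vec<[u32; 4]> = (0..4).map(|i| [i; 4]).collect();

    // External iteration: the for loop repeatedly calls next() on the
    // Flatten adapter, which must re-check its internal state each call.
    let mut sum_external = 0u32;
    for &x in data.iter().flatten() {
        sum_external += x;
    }

    // Internal iteration: the adapter drives the closure itself, which
    // typically optimizes down to the same code as a nested loop.
    let mut sum_internal = 0u32;
    data.iter().flatten().for_each(|&x| sum_internal += x);

    assert_eq!(sum_external, 24); // 4 * (0 + 1 + 2 + 3)
    assert_eq!(sum_external, sum_internal);
}
```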

@the8472 (Member) commented Jul 24, 2021

At least in the array case, the flatten version can be made faster than the collect loops:

```
test vec::bench_array_4x500000_collect_loop               ... bench:     603,389 ns/iter (+/- 4,740)
test vec::bench_array_4x500000_collect_loop_with_prealloc ... bench:     600,193 ns/iter (+/- 4,775)
test vec::bench_array_4x500000_collect_with_flat_map      ... bench:      44,652 ns/iter (+/- 210)
test vec::bench_array_4x500000_iteration_flat_map         ... bench:      35,318 ns/iter (+/- 451)
test vec::bench_array_4x500000_iteration_nested_loop      ... bench:      35,263 ns/iter (+/- 379)
test vec::bench_iter_4000x500_collect_loop                ... bench:      45,740 ns/iter (+/- 3,116)
test vec::bench_iter_4000x500_collect_loop_with_prealloc  ... bench:      45,567 ns/iter (+/- 481)
test vec::bench_iter_4000x500_collect_with_flatten        ... bench:     906,304 ns/iter (+/- 11,179)
test vec::bench_iter_4000x500_iteration_flatten           ... bench:      37,737 ns/iter (+/- 343)
test vec::bench_iter_4000x500_iteration_nested_loop       ... bench:      39,169 ns/iter (+/- 424)
test vec::bench_iter_4x500000_collect_loop                ... bench:   1,471,703 ns/iter (+/- 58,651)
test vec::bench_iter_4x500000_collect_loop_with_prealloc  ... bench:   1,597,366 ns/iter (+/- 68,349)
test vec::bench_iter_4x500000_collect_with_flatten        ... bench:   1,339,198 ns/iter (+/- 115,268)
test vec::bench_iter_4x500000_iteration_flatten           ... bench:   1,168,924 ns/iter (+/- 103,969)
test vec::bench_iter_4x500000_iteration_nested_loop       ... bench:   1,174,368 ns/iter (+/- 26,433)
```

The nested data structure is a much tougher nut to crack: a good size hint isn't available, and internal iteration doesn't have much of an advantage because reallocation checks are needed inside the loop. It basically uses the fallback code path, and previous attempts to improve it also led to compile-time regressions.

@adrian17 (Author):

@adrian17 would it be possible to license those benchmarks in a way that's compatible with the rustc licenses so they can be included in the std benchmarks?

My bad, I forgot to attach the MIT license :) Done.

@nikic nikic added the I-slow Issue: Problems and improvements with respect to performance of generated code. label Jul 24, 2021
@the8472 (Member) commented Jul 24, 2021

bench_array_4x500000_collect_with_flat_map

This will be addressed by #87431

bench_array_4x500000_iteration_flat_map
bench_iter_4000x500_iteration_flatten

.for_each() should be used here instead of for ... in. There has been some discussion about changing the loop desugaring to internal iteration in some cases, but I don't think this is going to happen anytime soon.

bench_iter_4x500000_*

These are just bad practice. It's a large collection of many tiny vecs on the heap, which means the CPU spends more time chasing pointers and checking dynamic lengths than moving data. The flatten performance is not ideal, but throwing additional compiler flags at the problem shrinks the difference to the point where I don't think it's worth optimizing.

Picking better data structures is a much lower-hanging fruit.
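As a sketch of what "better data structures" could mean here (not code from the discussion, and assuming all inner rows share a length): store one flat `Vec` plus a stride, instead of a `Vec<Vec<u8>>` where every row is its own heap allocation:

```rust
// One contiguous allocation with a fixed stride, instead of
// Vec<Vec<u8>>, where every row would be a separate heap allocation.
struct FlatRows {
    data: Vec<u8>,
    stride: usize,
}

impl FlatRows {
    fn new(stride: usize) -> Self {
        FlatRows { data: Vec::new(), stride }
    }

    fn push_row(&mut self, row: &[u8]) {
        assert_eq!(row.len(), self.stride);
        self.data.extend_from_slice(row);
    }

    // Iterating rows is just chunking the flat buffer: no pointer
    // chasing, and "flattening" is simply borrowing `data` directly.
    fn rows(&self) -> impl Iterator<Item = &[u8]> + '_ {
        self.data.chunks_exact(self.stride)
    }
}

fn main() {
    let mut rows = FlatRows::new(4);
    rows.push_row(&[1, 2, 3, 4]);
    rows.push_row(&[5, 6, 7, 8]);
    assert_eq!(rows.rows().count(), 2);
    assert_eq!(rows.data, vec![1, 2, 3, 4, 5, 6, 7, 8]);
}
```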

bench_iter_4000x500_collect_with_flatten

This one is the most difficult. In principle it is quite optimizable: 5x speedups should be easily possible, maybe 20x in some scenarios. But it might need wider changes than just a specialization or implementing another method. Changing extend_desugared to internal iteration might bring some wins, but won't match the manual implementation using extend_from_slice. I think this would at least need some new marker trait (TrustedChunkedMinLength) or possibly new iteration methods to acquire data in sized batches determined by iterator adapters (cycle, chain and flatten could produce that kind of thing when the sub-iterators are TrustedLen, but the outer part prevents the full length from being propagated).

I'm not sure how big the appetite for more optimization traits is among the libs team.
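To illustrate the missing-length problem being discussed (a sketch; the exact hints depend on the std version, and the array case relies on the array-length specialization in recent std): `Flatten` over nested `Vec`s cannot report an upper bound, so `collect` cannot preallocate, whereas fixed-size arrays carry their length in the type:

```rust
fn main() {
    // Nested Vecs: the outer iterator knows how many inner vecs it has,
    // but not how many total elements they contain, so the size hint
    // has no upper bound and collect can't preallocate.
    let nested: Vec<Vec<u8>> = vec![vec![0u8; 500]; 4];
    let hint = nested.iter().flatten().size_hint();
    assert_eq!(hint.1, None);

    // Fixed-size arrays: the per-item length is part of the type, so a
    // recent std can compute the flattened length exactly up front.
    let arrays: Vec<[u8; 4]> = vec![[0u8; 4]; 500];
    let hint = arrays.iter().flatten().size_hint();
    assert_eq!(hint, (2000, Some(2000)));
}
```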

@adrian17 does this represent a real use or is it merely an extrapolation from the initial report?

@adrian17 (Author) commented Jul 24, 2021

@adrian17 does this represent a real use or is it merely an extrapolation from the initial report?

Indeed, I added most of these cases artificially for a wider comparison. bench_array_4x500000_collect_* are the only ones that directly impacted me.

That said, I've seen cases of iteration written as for item in (...).flatten() in the wild, including some in rustc (search with the regex " in .+flat").

bors added a commit to rust-lang-ci/rust that referenced this issue Jul 27, 2021
implement fold() on array::IntoIter to improve flatten().collect() perf

With rust-lang#87168, flattening `array::IntoIter`s is now `TrustedLen`, and the `FromIterator` implementation for `Vec` has a specialization for `TrustedLen` iterators which uses internal iteration. This implements one of the main internal iteration methods on `array::IntoIter` to optimize the combination of those two features.

This should address the main issue in rust-lang#87411

```
# old
test vec::bench_flat_map_collect                         ... bench:   2,244,024 ns/iter (+/- 18,903)

# new
test vec::bench_flat_map_collect                         ... bench:     172,863 ns/iter (+/- 2,141)
```
@the8472 (Member) commented Jul 27, 2021

@adrian17 the next nightly should have the fix; can you test it?

@adrian17 (Author):

Yeah, it looks good now, thank you!

I still feel that the bench_*_iteration_* results are counter-intuitive, in that I would ideally expect a nested loop and for x in (...).flatten() to be roughly interchangeable, but I understand that's a bigger issue and out of scope here. By the way:

There has been some discussion about changing the loop desugaring to internal iteration in some cases, but I don't think this is going to happen anytime soon.

Could you please link to where these discussions took place?

@the8472 (Member) commented Jul 28, 2021

Yeah, it looks good now, thank you!

Great, then I'll close this one.

but I understand that's a bigger issue and out of scope here.

If it becomes an issue in real code that'll be worth a separate issue. Currently it's a theoretical concern and may not be worth the necessary optimization work.

Could you please link to where these discussions took place?

I asked on Zulip in a discussion about other iterator optimizations and was pointed to this comment: rust-lang/rfcs#3058
