
FlatMap, Flatten appear to optimize badly #87411

Closed
adrian17 opened this issue Jul 23, 2021 · 10 comments
Labels
I-slow Issue: Problems and improvements with respect to performance of generated code.

Comments

@adrian17

Context: I recently saw a perf regression caused by the following change:

```rust
pub fn pixels_rgba(&self) -> Vec<u8> {
    let mut output = Vec::new();
    for p in &self.pixels { output.extend_from_slice(&[p.red(), p.green(), p.blue(), p.alpha()]) }
    output
}
```
changed to:
```rust
pub fn pixels_rgba(&self) -> Vec<u8> {
    self.pixels.iter().flat_map(|p| [p.red(), p.green(), p.blue(), p.alpha()]).collect()
}
```

The change seemed reasonable, under the assumption that it would be more idiomatic and would guarantee that the output Vec is preallocated. Unfortunately, it made the function several times slower. The recently merged #87168 helped slightly, but the generated code is still much slower.

The same can also be observed for other code using flat_map or flatten, such as a simple iteration over the iterator, and regardless of whether the flattened type has a known size (an array) or not.
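For context, a version that keeps the explicit loop but also preallocates is a possible middle ground between the two snippets above. This is a sketch, not code from the report; `Pixel` here is a hypothetical stand-in for the real pixel type, and the function takes a slice instead of `&self` so it is self-contained:

```rust
// Hypothetical stand-in for the pixel type from the report.
struct Pixel(u8, u8, u8, u8);

impl Pixel {
    fn red(&self) -> u8 { self.0 }
    fn green(&self) -> u8 { self.1 }
    fn blue(&self) -> u8 { self.2 }
    fn alpha(&self) -> u8 { self.3 }
}

// Keeps the simple loop, but reserves the exact output size up front,
// so no reallocation happens while pushing.
fn pixels_rgba(pixels: &[Pixel]) -> Vec<u8> {
    let mut output = Vec::with_capacity(pixels.len() * 4);
    for p in pixels {
        output.extend_from_slice(&[p.red(), p.green(), p.blue(), p.alpha()]);
    }
    output
}

fn main() {
    let pixels = [Pixel(1, 2, 3, 4), Pixel(5, 6, 7, 8)];
    assert_eq!(pixels_rgba(&pixels), vec![1, 2, 3, 4, 5, 6, 7, 8]);
}
```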

I made an example benchmark in the repo https://github.com/adrian17/flat_map_perf , with the following results on my machine (with today's nightly rustc):


```
tests::bench_array_4x500000_collect_loop               1,269,560 ns/iter (+/- 146,537)
tests::bench_array_4x500000_collect_loop_with_prealloc 1,255,140 ns/iter (+/- 165,287)
tests::bench_array_4x500000_collect_with_flat_map      2,697,082 ns/iter (+/- 303,411)

tests::bench_array_4x500000_iteration_nested_loop        220,838 ns/iter (+/- 25,307)
tests::bench_array_4x500000_iteration_flat_map         3,029,744 ns/iter (+/- 463,749)


tests::bench_iter_4000x500_collect_loop                  243,537 ns/iter (+/- 34,574)
tests::bench_iter_4000x500_collect_loop_with_prealloc    243,246 ns/iter (+/- 34,197)
tests::bench_iter_4000x500_collect_with_flatten        3,521,586 ns/iter (+/- 597,755)

tests::bench_iter_4000x500_iteration_nested_loop         290,939 ns/iter (+/- 34,414)
tests::bench_iter_4000x500_iteration_flatten           2,099,386 ns/iter (+/- 512,732)


tests::bench_iter_4x500000_collect_loop                3,124,601 ns/iter (+/- 444,296)
tests::bench_iter_4x500000_collect_loop_with_prealloc  2,873,051 ns/iter (+/- 576,719)
tests::bench_iter_4x500000_collect_with_flatten        5,579,601 ns/iter (+/- 796,355)

tests::bench_iter_4x500000_iteration_nested_loop       2,118,351 ns/iter (+/- 396,325)
tests::bench_iter_4x500000_iteration_flatten           3,187,518 ns/iter (+/- 443,080)
```
@the8472 (Member) commented Jul 23, 2021

@adrian17 would it be possible to license those benchmarks in a way that's compatible with the rustc licenses so they can be included in the std benchmarks?

@JulianKnodt (Contributor) commented Jul 24, 2021

Looking at the last few benchmarks, it would appear that the main issue is that flatten doesn't vectorize well:
https://rust.godbolt.org/z/Mejb1GMso

I didn't check the other ones, but there may be two separate issues: one where flatten doesn't vectorize well, and another where collect cannot preallocate the correct amount.

As for the original change, in my opinion converting everything into an iterator is probably not more idiomatic, and is likely less efficient when already working with slices, because it relies both on traits propagating through the iterator adapters (which may not all be implemented) and on the compiler optimizing through more layers of abstraction. That being said, the perf regression is surprising and worth digging into.

Edit: When comparing iteration_flatten vs. iteration_nested_loop, replacing the for loop in the flatten version with a for_each produces very similar assembly for both, which I personally did not expect.

@the8472 (Member) commented Jul 24, 2021

replacing the for loop in flatten with a for_each outputs very similar assembly for both, which is personally unexpected.

Actually, this is expected: for_each and related methods were introduced precisely because internal iteration is easier to optimize than external iteration.

https://medium.com/@veedrac/rust-is-slow-and-i-am-the-cure-32facc0fdcb
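As a minimal sketch of the difference (not from the benchmark repo): both forms below compute the same result, but the `for_each` form lets the `Flatten` adapter drive the closure itself instead of being repeatedly polled through `next()`, which tends to optimize down to the same code as a hand-written nested loop:

```rust
fn main() {
    let data: Vec<[u32; 4]> = (0..4).map(|i| [i; 4]).collect();

    // External iteration: the for loop repeatedly calls next() on the
    // Flatten adapter, which must re-check its internal state each call.
    let mut sum_external = 0u32;
    for &x in data.iter().flatten() {
        sum_external += x;
    }

    // Internal iteration: the adapter drives the closure itself, which
    // typically optimizes down to the same code as a nested loop.
    let mut sum_internal = 0u32;
    data.iter().flatten().for_each(|&x| sum_internal += x);

    assert_eq!(sum_external, 24); // 4 * (0 + 1 + 2 + 3)
    assert_eq!(sum_external, sum_internal);
}
```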

@the8472 (Member) commented Jul 24, 2021

At least in the array case, the flatten version can be made faster than the collect loops:

```
test vec::bench_array_4x500000_collect_loop               ... bench:     603,389 ns/iter (+/- 4,740)
test vec::bench_array_4x500000_collect_loop_with_prealloc ... bench:     600,193 ns/iter (+/- 4,775)
test vec::bench_array_4x500000_collect_with_flat_map      ... bench:      44,652 ns/iter (+/- 210)
test vec::bench_array_4x500000_iteration_flat_map         ... bench:      35,318 ns/iter (+/- 451)
test vec::bench_array_4x500000_iteration_nested_loop      ... bench:      35,263 ns/iter (+/- 379)
test vec::bench_iter_4000x500_collect_loop                ... bench:      45,740 ns/iter (+/- 3,116)
test vec::bench_iter_4000x500_collect_loop_with_prealloc  ... bench:      45,567 ns/iter (+/- 481)
test vec::bench_iter_4000x500_collect_with_flatten        ... bench:     906,304 ns/iter (+/- 11,179)
test vec::bench_iter_4000x500_iteration_flatten           ... bench:      37,737 ns/iter (+/- 343)
test vec::bench_iter_4000x500_iteration_nested_loop       ... bench:      39,169 ns/iter (+/- 424)
test vec::bench_iter_4x500000_collect_loop                ... bench:   1,471,703 ns/iter (+/- 58,651)
test vec::bench_iter_4x500000_collect_loop_with_prealloc  ... bench:   1,597,366 ns/iter (+/- 68,349)
test vec::bench_iter_4x500000_collect_with_flatten        ... bench:   1,339,198 ns/iter (+/- 115,268)
test vec::bench_iter_4x500000_iteration_flatten           ... bench:   1,168,924 ns/iter (+/- 103,969)
test vec::bench_iter_4x500000_iteration_nested_loop       ... bench:   1,174,368 ns/iter (+/- 26,433)
```

The nested data structure is a much tougher nut to crack: a good size hint isn't available, and internal iteration doesn't have much of an advantage because reallocation checks are needed inside the loop. It basically uses the fallback code path, and previous attempts to improve it also led to compile-time regressions.

@adrian17 (Author):

@adrian17 would it be possible to license those benchmarks in a way that's compatible with the rustc licenses so they can be included in the std benchmarks?

My bad, I forgot to attach the MIT license :) Done.

@nikic nikic added the I-slow Issue: Problems and improvements with respect to performance of generated code. label Jul 24, 2021
@the8472 (Member) commented Jul 24, 2021

bench_array_4x500000_collect_with_flat_map

This will be addressed by #87431

bench_array_4x500000_iteration_flat_map
bench_iter_4000x500_iteration_flatten

.for_each() should be used here instead of for ... in. There has been some discussion about changing the loop desugaring to internal iteration in some cases, but I don't think this is going to happen anytime soon.

bench_iter_4x500000_*

These are just bad practice. It's a large collection of many tiny vecs on the heap, which means the CPU spends more time chasing pointers and checking dynamic lengths than moving data. The flatten performance is not ideal, but throwing additional compiler flags at the problem shrinks the difference to the point where I don't think it's worth optimizing.

Picking better data structures is a much lower-hanging fruit.
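As a sketch of what "better data structures" could mean here (not code from the discussion, and assuming all inner rows share a length): store one flat `Vec` plus a stride, instead of a `Vec<Vec<u8>>` where every row is its own heap allocation:

```rust
// One contiguous allocation with a fixed stride, instead of
// Vec<Vec<u8>>, where every row would be a separate heap allocation.
struct FlatRows {
    data: Vec<u8>,
    stride: usize,
}

impl FlatRows {
    fn new(stride: usize) -> Self {
        FlatRows { data: Vec::new(), stride }
    }

    fn push_row(&mut self, row: &[u8]) {
        assert_eq!(row.len(), self.stride);
        self.data.extend_from_slice(row);
    }

    // Iterating rows is just chunking the flat buffer: no pointer
    // chasing, and "flattening" is simply borrowing `data` directly.
    fn rows(&self) -> impl Iterator<Item = &[u8]> + '_ {
        self.data.chunks_exact(self.stride)
    }
}

fn main() {
    let mut rows = FlatRows::new(4);
    rows.push_row(&[1, 2, 3, 4]);
    rows.push_row(&[5, 6, 7, 8]);
    assert_eq!(rows.rows().count(), 2);
    assert_eq!(rows.data, vec![1, 2, 3, 4, 5, 6, 7, 8]);
}
```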

bench_iter_4000x500_collect_with_flatten

This one is the most difficult. In principle it is quite optimizable: 5x speedups should be easily possible, maybe 20x in some scenarios. But it might need wider changes than just a specialization or implementing another method. Changing extend_desugared to internal iteration might bring some wins, but won't match the manual implementation using extend_from_slice. I think this would at least need some new marker trait (TrustedChunkedMinLength) or possibly new iteration methods to acquire data in sized batches determined by iterator adapters (cycle, chain and flatten could produce that kind of thing when the sub-iterators are TrustedLen, but the outer part prevents the full length from being propagated).

I'm not sure how big the appetite for more optimization traits is among the libs team.
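To illustrate the missing-length problem being discussed (a sketch; the exact hints depend on the std version, and the array case relies on the array-length specialization in recent std): `Flatten` over nested `Vec`s cannot report an upper bound, so `collect` cannot preallocate, whereas fixed-size arrays carry their length in the type:

```rust
fn main() {
    // Nested Vecs: the outer iterator knows how many inner vecs it has,
    // but not how many total elements they contain, so the size hint
    // has no upper bound and collect can't preallocate.
    let nested: Vec<Vec<u8>> = vec![vec![0u8; 500]; 4];
    let hint = nested.iter().flatten().size_hint();
    assert_eq!(hint.1, None);

    // Fixed-size arrays: the per-item length is part of the type, so a
    // recent std can compute the flattened length exactly up front.
    let arrays: Vec<[u8; 4]> = vec![[0u8; 4]; 500];
    let hint = arrays.iter().flatten().size_hint();
    assert_eq!(hint, (2000, Some(2000)));
}
```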

@adrian17 does this represent a real use or is it merely an extrapolation from the initial report?

@adrian17 (Author) commented Jul 24, 2021

@adrian17 does this represent a real use or is it merely an extrapolation from the initial report?

Indeed, I added most of these cases artificially for a wider comparison. bench_array_4x500000_collect_* are the only ones that directly impacted me.

That said, I've seen cases of iteration written as for item in (...).flatten() in the wild, including some in rustc (search with the regex " in .+flat").

bors added a commit to rust-lang-ci/rust that referenced this issue Jul 27, 2021
implement fold() on array::IntoIter to improve flatten().collect() perf

With rust-lang#87168, flattening `array::IntoIter`s is now `TrustedLen`, and the `FromIterator` implementation for `Vec` has a specialization for `TrustedLen` iterators which uses internal iteration. This implements one of the main internal iteration methods on `array::IntoIter` to optimize the combination of those two features.

This should address the main issue in rust-lang#87411

```
# old
test vec::bench_flat_map_collect                         ... bench:   2,244,024 ns/iter (+/- 18,903)

# new
test vec::bench_flat_map_collect                         ... bench:     172,863 ns/iter (+/- 2,141)
```
@the8472 (Member) commented Jul 27, 2021

@adrian17 the next nightly should have the fix; can you test it?

@adrian17 (Author):

Yeah, it looks good now, thank you!

I still feel that the bench_*_iteration_* results are counter-intuitive, in that I would ideally expect a nested loop and for x in (...).flatten() to be roughly interchangeable, but I understand that's a bigger issue and out of scope here. By the way:

There has been some discussion about changing the loop desugaring to internal iteration in some cases, but I don't think this is going to happen anytime soon.

Could you please link to where these discussions took place?

@the8472 (Member) commented Jul 28, 2021

Yeah, it looks good now, thank you!

Great, then I'll close this one.

but I understand that's a bigger issue and out of scope here.

If it becomes an issue in real code that'll be worth a separate issue. Currently it's a theoretical concern and may not be worth the necessary optimization work.

Could you please link to where these discussions took place?

I asked on Zulip in a discussion about other iterator optimizations and was pointed to this comment: rust-lang/rfcs#3058
