Clean up Vec's benchmarks #83458

Merged: 1 commit into rust-lang:master on Mar 30, 2021

Conversation

@saethlin (Member) commented Mar 25, 2021

The Vec benchmarks need a lot of love. I sort of noticed this in #83357 but the overall situation is much less awesome than I thought at the time. The first commit just removes a lot of asserts and does a touch of other cleanup.

A number of these benchmarks are poorly named. For example, bench_map_fast is not in fact fast, bench_rev_1 and bench_rev_2 are vague, bench_in_place_zip_iter_mut doesn't call zip, and the bench_in_place* benchmarks don't do anything in-place... Should I fix these, or is there tooling that depends on the names not changing?

I've also noticed that bench_rev_1 and bench_rev_2 are remarkably fragile. It looks like poking other code in Vec can cause the codegen of this benchmark to switch to a version that has almost exactly half its current throughput and I have absolutely no idea why.

Here's the fast version:

  0.69 │ 110:   movdqu -0x20(%rbx,%rdx,4),%xmm0
  1.76 │        movdqu -0x10(%rbx,%rdx,4),%xmm1
  0.71 │        pshufd $0x1b,%xmm1,%xmm1
  0.60 │        pshufd $0x1b,%xmm0,%xmm0
  3.68 │        movdqu %xmm1,-0x30(%rcx)
 14.36 │        movdqu %xmm0,-0x20(%rcx)
 13.88 │        movdqu -0x40(%rbx,%rdx,4),%xmm0
  6.64 │        movdqu -0x30(%rbx,%rdx,4),%xmm1
  0.76 │        pshufd $0x1b,%xmm1,%xmm1
  0.77 │        pshufd $0x1b,%xmm0,%xmm0
  1.87 │        movdqu %xmm1,-0x10(%rcx)
 13.01 │        movdqu %xmm0,(%rcx)
 38.81 │        add    $0x40,%rcx
  0.92 │        add    $0xfffffffffffffff0,%rdx
  1.22 │      ↑ jne    110

And the slow one:

  0.42 │9a880:   movdqa     %xmm2,%xmm1
  4.03 │9a884:   movq       -0x8(%rbx,%rsi,4),%xmm4
  8.49 │9a88a:   pshufd     $0xe1,%xmm4,%xmm4
  2.58 │9a88f:   movq       -0x10(%rbx,%rsi,4),%xmm5
  7.02 │9a895:   pshufd     $0xe1,%xmm5,%xmm5
  4.79 │9a89a:   punpcklqdq %xmm5,%xmm4
  5.77 │9a89e:   movdqu     %xmm4,-0x18(%rdx)
 15.74 │9a8a3:   movq       -0x18(%rbx,%rsi,4),%xmm4
  3.91 │9a8a9:   pshufd     $0xe1,%xmm4,%xmm4
  5.04 │9a8ae:   movq       -0x20(%rbx,%rsi,4),%xmm5
  5.29 │9a8b4:   pshufd     $0xe1,%xmm5,%xmm5
  4.60 │9a8b9:   punpcklqdq %xmm5,%xmm4
  9.81 │9a8bd:   movdqu     %xmm4,-0x8(%rdx)
 11.05 │9a8c2:   paddq      %xmm3,%xmm0
  0.86 │9a8c6:   paddq      %xmm3,%xmm2
  5.89 │9a8ca:   add        $0x20,%rdx
  0.12 │9a8ce:   add        $0xfffffffffffffff8,%rsi
  1.16 │9a8d2:   add        $0x2,%rdi
  2.96 │9a8d6: → jne        9a880 <<alloc::vec::Vec<T,A> as core::iter::traits::collect::Extend<&T>>::extend+0xd0>

Many of the Vec benchmarks assert what values should be produced by the
benchmarked code. In some cases, these asserts dominate the runtime of
the benchmarks they are in, causing the benchmarks to understate the
impact of an optimization or regression.
@rust-highfive (Collaborator)

r? @dtolnay

(rust-highfive has picked a reviewer for you, use r? to override)

@rust-highfive rust-highfive added the S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. label Mar 25, 2021
@the8472 (Member) left a comment

bench_in_place_zip_iter_mut doesn't call zip

My bad, I could have picked a better name. If I recall correctly, it was the baseline to which bench_in_place_zip_recycle gets compared.

bench_in_place* don't do anything in-place...

There's a specialization for vec.into_iter()....collect() which reuses the allocation, it should be exercising that.

b.iter(|| {
    let v: Vec<u32> = Vec::with_capacity(src_len);
    assert_eq!(v.len(), 0);
    assert_eq!(v.capacity(), src_len);
(Member, inline review comment on the code above)

Note that benchmarks are also executed when you run ./x.py test, i.e. they can double as test cases. If you remove those asserts here because they get in the way of measuring things, you'll have to check whether there are roughly equivalent proper tests somewhere.

They seem to be quite trivial, so hopefully this isn't the only place where this stuff is tested, but I'd still check.

@saethlin (Member, Author)

There's a specialization for vec.into_iter()....collect() which reuses the allocation, it should be exercising that.

When I read the names of these benchmarks, I expected to find an implementation that doesn't allocate or free; in my mind, the whole point of an operation being in-place is that it doesn't allocate or free. So my concern/confusion is that the bench_in_place* benchmarks' runtime is dominated by malloc and free, because every iteration creates a new Vec, even though the whole point of the benchmarks seems to be exercising a specialization that reduces allocator activity. I don't think there's another benchmark to compare against to see the impact of the specialization.

Do we want to demonstrate the impact of the into_iter() -> collect specialization? Maybe we should have benchmarks that do zero allocation because of this specialization (like bench_in_place_recycle) and another set of benchmarks that aren't eligible for the specialization for some reason that has minimal impact on the runtime of the iteration (maybe alignment of the new T?).

@the8472 (Member) commented Mar 25, 2021

There's a specialization for vec.into_iter()....collect() which reuses the allocation, it should be exercising that.

When I read the name of these benchmarks, I was expecting to find an implementation that doesn't allocate or free. In my mind, the whole point of an operation being in-place is that it doesn't allocate or free. So my concern/confusion is that the bench_in_place* benchmarks runtime is dominated by malloc and free because every iteration creates a new Vec, even though the whole point of the benchmarks seems to be exercising a specialization that reduce allocator activity. I don't think there's another benchmark to compare against to see the impact of the specialization.

It should still be half the number of malloc/free calls compared to the naive approach, which was visible when I ran before/after benchmarks during development. Granted, it took quite some noise reduction on my system to even see the effects; the _recycle approach makes it more obvious.
The benchmarks also helped when checking the assembly to see whether the xor vectorized. In retrospect I should have written codegen tests for that instead.

Do we want to demonstrate the impact of the into_iter() -> collect specialization? Maybe we should have benchmarks that do zero allocation because of this specialization (like bench_in_place_recycle) and another set of benchmarks that aren't eligible for the specialization for some reason that has minimal impact on the runtime of the iteration (maybe alignment of the new T?).

Well, the iter_mut-based one should already be doing the same, but sure, something based on alignment might be a more obvious baseline.

@the8472 (Member) commented Mar 25, 2021

I've also noticed that bench_rev_1 and bench_rev_2 are remarkably fragile. It looks like poking other code in Vec can cause the codegen of this benchmark to switch to a version that has almost exactly half its current throughput and I have absolutely no idea why.

Try benchmarking with incremental compilation off and codegen-units = 1, if that's not already the case. Without that, the codegen partitioning ruins some inlining opportunities in easily perturbed ways.

@saethlin (Member, Author)

Try benchmarking with incremental compilation off and codegen-units = 1, if that's not already the case. Without that, the codegen partitioning ruins some inlining opportunities in easily perturbed ways.

I believe I already have those settings? They're the defaults, as far as I can tell. Though honestly, the whole x.py thing is deeply mysterious to me; I can't, for example, figure out how to get debug symbols into the benchmark binaries.

@the8472 (Member) commented Mar 28, 2021

x.py delegates to src/bootstrap, which pieces together the build procedure. I think the debuginfo is controlled by debuginfo-level-std or debuginfo-level-tests in config.toml
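As a sketch, the relevant config.toml knobs might look like the following. The key names are from memory of the 2021-era bootstrap; verify them against config.example.toml in your checkout before relying on them.

```toml
# config.toml (sketch; verify key names against config.example.toml)
[rust]
codegen-units = 1          # avoid CGU partitioning perturbing inlining
incremental = false        # incremental compilation off for benchmarking
debuginfo-level-std = 1    # debug symbols for the standard library
```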

@dtolnay (Member) left a comment

Looks good! I am happy you are willing to get these in better shape.

I am on board with removing asserts from the benchmarks. Trying to make the benchmark suite serve double duty as tests and benchmarks seems misguided: too much test-only code inline obscures what is actually being benchmarked. The fact that cargo test / x.py test also runs one iteration of each benchmark is just to smoke-test that they don't crash; it isn't meant to be where the main exhaustive testing of library functionality happens.

A number of these benchmarks are poorly named. Should I fix these, or is there tooling that depends on the names not changing?

I do not know of anything like that. Feel free to fix the names.

@dtolnay (Member) commented Mar 30, 2021

@bors r+

@bors (Contributor) commented Mar 30, 2021

📌 Commit 8c88418 has been approved by dtolnay

@bors bors added S-waiting-on-bors Status: Waiting on bors to run and complete tests. Bors will change the label on completion. and removed S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. labels Mar 30, 2021
@bors (Contributor) commented Mar 30, 2021

⌛ Testing commit 8c88418 with merge 689e847...

@bors (Contributor) commented Mar 30, 2021

☀️ Test successful - checks-actions
Approved by: dtolnay
Pushing 689e847 to master...

@bors bors added the merged-by-bors This PR was explicitly merged by bors. label Mar 30, 2021
@bors bors merged commit 689e847 into rust-lang:master Mar 30, 2021
@rustbot rustbot added this to the 1.53.0 milestone Mar 30, 2021
@saethlin saethlin deleted the improve-vec-benches branch May 16, 2022 04:36