
[Merged by Bors] - Rework extract_meshes #4240

Closed
wants to merge 2 commits

Conversation

nicopap
Contributor

@nicopap nicopap commented Mar 17, 2022

  • Cleanup redundant code
  • Use a type alias to make sure the caster_query and
    not_caster_query really do the same thing and access the same things

Objective

Cleanup code that would otherwise be difficult to understand

Solution

  • extract_meshes had two for loops which are functionally identical,
    just copy-pasted code. I extracted the common code between the two
    and put them into an anonymous function.
  • I flattened the tuple literal for the bundle batch, it looks much
    less nested and the code is much more readable as a result.
  • The parameters of extract_meshes were also very daunting, but they
    turned out to be the same query repeated twice. I extracted the query
    into a type alias.

EDIT: I reworked the PR to not do anything breaking, and keep the old allocation behavior. Removing the memorized length was clearly a performance loss, so I kept it.
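The "memorized length" allocation behavior being kept here can be sketched as follows — a minimal stand-in with plain `Vec`s and an illustrative `extract` function, not the actual Bevy code:

```rust
// Sketch of the kept allocation strategy: remember last frame's batch
// length and use it as the capacity hint for the next frame, so the
// Vec rarely reallocates mid-extraction.
fn extract(prev_len: &mut usize, visible: impl Iterator<Item = u32>) -> Vec<u32> {
    let mut batch = Vec::with_capacity(*prev_len);
    batch.extend(visible);
    // Memorize this frame's count for the next frame's allocation.
    *prev_len = batch.len();
    batch
}

fn main() {
    let mut prev_len = 0;
    // First frame: no hint yet, the Vec grows as needed.
    let first = extract(&mut prev_len, 0..100);
    assert_eq!(first.len(), 100);
    assert_eq!(prev_len, 100);
    // Second frame: capacity for 100 items is reserved up front.
    let second = extract(&mut prev_len, 0..100);
    assert!(second.capacity() >= 100);
}
```

In the real system, `prev_len` corresponds to a `Local<usize>` system parameter that persists between runs.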

@github-actions github-actions bot added the S-Needs-Triage This issue needs to be labelled label Mar 17, 2022
@superdump
Contributor

Did you benchmark this with something like many_cubes to check its performance is as good as or better than main?

@mockersf you said something about the Query iterator only being an exact size iterator if there was no filter?

@nicopap
Contributor Author

nicopap commented Mar 17, 2022

Did you benchmark this with something like many_cubes to check its performance is as good as or better than main?

Actually not. I'll do that and give you the run down.

@nicopap
Contributor Author

nicopap commented Mar 17, 2022

Tracy log for spans, from running `cargo run --release --example many_cubes --features 'trace trace_tracy'` for 10000 frames.

I did this 4 times, twice per branch. The abbreviations refer to the following Tracy spans:

  • ovhe: system overhead{name="bevy_pbr::render::mesh::extract_meshes"}
  • self: ovhe self time
  • cmds: system_commands{name="bevy_pbr::render::mesh::extract_meshes"}
  • syst: system{name="bevy_pbr::render::mesh::extract_meshes"}

All values are in microseconds μs.

| run | span | branch | mean | median | std |
|-----|------|--------|-----:|-------:|----:|
| 1 | cmds | cleanup-extract_meshes | 861 | 820 | 340 |
| 1 | syst | cleanup-extract_meshes | 1280 | 1220 | 421 |
| 1 | ovhe | cleanup-extract_meshes | 724 | 754 | 779 |
| 1 | self | cleanup-extract_meshes | 79 | 35 | 1010 |
| 1 | cmds | main | 845 | 795 | 337 |
| 1 | syst | main | 846 | 814 | 287 |
| 1 | ovhe | main | 517 | 395 | 563 |
| 1 | self | main | 94 | 8 | 705 |
| 2 | cmds | cleanup-extract_meshes | 1270 | 1160 | 492 |
| 2 | syst | cleanup-extract_meshes | 1320 | 1250 | 435 |
| 2 | ovhe | cleanup-extract_meshes | 760 | 835 | 797 |
| 2 | self | cleanup-extract_meshes | 78 | 53 | 1050 |
| 2 | cmds | main | 831 | 789 | 336 |
| 2 | syst | main | 831 | 806 | 277 |
| 2 | ovhe | main | 526 | 603 | 537 |
| 2 | self | main | 85 | 65 | 695 |

I guess the value that matters is self + syst + cmds?

| run | branch | self + syst + cmds |
|-----|--------|--------------------|
| 1 | cleanup-extract_meshes | 79 + 861 + 1280 = 2220 |
| 1 | main | 94 + 846 + 845 = 1785 |
| 2 | cleanup-extract_meshes | 78 + 1320 + 1270 = 2668 |
| 2 | main | 85 + 831 + 831 = 1747 |

Looks like there is a performance degradation? I really don't have enough expertise to say how useful this kind of microbenchmark is.

@nicopap
Contributor Author

nicopap commented Mar 17, 2022

QueryIter::size_hint links to #1686, though this is only tangentially relevant here. From the code I've read, the standard library's `<T as Iterator>::collect::<Vec<_>>()` uses the lower bound exposed by `Iterator::size_hint()` for the initial allocation (unless the iterator is `TrustedLen`, in which case it uses the upper bound). Whether the iterator implements `ExactSizeIterator` doesn't matter.

Currently, `QueryIter::size_hint`'s lower bound is always zero, so Rust can't use it to optimize a `QueryIter::collect::<Vec<_>>()` call. Ways to solve that:

  • Use specialization to implement a special `size_hint` for queries whose filters have a predictable size (not stable)
  • `unsafe impl TrustedLen` (not stable)
  • Probably (maybe?) some others I'll take time to investigate.
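The zero lower bound is easy to observe with a plain standard-library iterator — the same shape of problem, without Bevy:

```rust
fn main() {
    // A Range knows its exact length: lower == upper == 100.
    let exact = 0..100u32;
    assert_eq!(exact.size_hint(), (100, Some(100)));

    // Filter can only promise "somewhere between 0 and 100 items",
    // so collect() gets a lower bound of 0 to allocate from.
    let filtered = (0..100u32).filter(|n| n % 2 == 0);
    assert_eq!(filtered.size_hint(), (0, Some(100)));
}
```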

@superdump
Contributor

0.5-1.0ms difference is quite a large performance regression considering, say 120fps requires a frame to be processed in less than 8.33ms.

That said, the code in the PR does look cleaner, and I would think that if the collect can be made to allocate a large enough Vec up-front, and that shows the performance is as good or better than main then it's a good change. :)

@nicopap
Contributor Author

nicopap commented Mar 17, 2022

I looked into the size_hint problem and I think I figured out an implementation that gives a proper hint for archetype-based filtering. Now to check if it improves performance on this PR.
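For illustration only, here is the general shape such a fix takes — a hypothetical wrapper that tracks its remaining length and reports it through `size_hint` (this is a sketch, not the actual change in #4244):

```rust
// Hypothetical wrapper restoring a useful lower bound, the way an
// archetype-aware QueryIter::size_hint could.
struct KnownLen<I> {
    inner: I,
    remaining: usize,
}

impl<I: Iterator> Iterator for KnownLen<I> {
    type Item = I::Item;
    fn next(&mut self) -> Option<I::Item> {
        let item = self.inner.next();
        if item.is_some() {
            self.remaining -= 1;
        }
        item
    }
    // Report the tracked count, so collect::<Vec<_>>() can allocate up front.
    fn size_hint(&self) -> (usize, Option<usize>) {
        (self.remaining, Some(self.remaining))
    }
}

fn main() {
    // A plain Filter erases the lower bound...
    assert_eq!((0..1000u32).filter(|n| n % 2 == 0).size_hint().0, 0);
    // ...but the wrapper restores it (here we know 500 items survive).
    let wrapped = KnownLen {
        inner: (0..1000u32).filter(|n| n % 2 == 0),
        remaining: 500,
    };
    assert_eq!(wrapped.size_hint(), (500, Some(500)));
    let v: Vec<u32> = wrapped.collect();
    assert_eq!(v.len(), 500);
}
```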

@nicopap
Contributor Author

nicopap commented Mar 17, 2022

Second run of benchmarks, this time with the size_hint fix from the changes introduced in #4244. I'm very disappointed.

Clearly, this doesn't work. My guess is that maybe the filter overwrites the min size (which makes a lot of sense) so the size_hint fix doesn't fix this at all.

It's still possible to just add back the last size-based allocation strategy. I'll do that tomorrow and see where it gets.

| run | span | branch | mean | median | std |
|-----|------|--------|-----:|-------:|----:|
| 1 | cmds | cleanup + size-hint | 1350 | 1230 | 513 |
| 1 | syst | cleanup + size-hint | 1140 | 1090 | 411 |
| 1 | self | cleanup + size-hint | 80 | 60 | 929 |
| 2 | cmds | cleanup + size-hint | 1270 | 1160 | 477 |
| 2 | syst | cleanup + size-hint | 1180 | 1130 | 417 |
| 2 | self | cleanup + size-hint | 107 | 106 | 1010 |
| 1 | cmds | main | 1290 | 1180 | 488 |
| 1 | syst | main | 837 | 812 | 275 |
| 1 | self | main | 78 | 45 | 684 |
| 3 | cmds | cleanup + size-hint | 1330 | 1200 | 506 |
| 3 | syst | cleanup + size-hint | 1190 | 1140 | 432 |
| 3 | self | cleanup + size-hint | 114 | 113 | 1040 |
| 2 | cmds | main | 1260 | 1140 | 485 |
| 2 | syst | main | 844 | 819 | 280 |
| 2 | self | main | 82 | 60 | 700 |
| 1 | cmds | filter-size-hint | 1300 | 1190 | 505 |
| 1 | syst | filter-size-hint | 844 | 813 | 283 |
| 1 | self | filter-size-hint | 107 | 106 | 741 |

@ghost ghost added A-Rendering Drawing game state to the screen C-Code-Quality A section of code that is hard to understand or change M-Needs-Migration-Guide A breaking change to Bevy's public API that needs to be noted in a migration guide and removed S-Needs-Triage This issue needs to be labelled labels Mar 18, 2022
@nicopap
Contributor Author

nicopap commented Mar 18, 2022

Third run of benchmarks, this time with proper allocation patterns. Names are:

  • cleanup + naive data: current code
  • cleanup + min-data: An alternative command buffer where some of the work is delegated to the command system
  • main: main as of c1a2378

Looks like this time the cleaned-up code turned into a performance gain with the proper allocation pattern. A major downside, however, is that we allocate a command buffer the size of all entities with a mesh, regardless of how many of them are visible; this seems like a pitfall waiting to find a victim. I also expect the perf to vary somewhat between machines as a result.

| run | span | branch | mean | median | std |
|-----|------|--------|-----:|-------:|----:|
| 1 | cmds | cleanup + naive data | 850 | 801 | 333 |
| 1 | syst | cleanup + naive data | 910 | 885 | 295 |
| 1 | self | cleanup + naive data | 82 | 33 | 737 |
| 1 | cmds | cleanup + min-data | 1500 | 1350 | 546 |
| 1 | syst | cleanup + min-data | 615 | 604 | 186 |
| 1 | self | cleanup + min-data | 84 | 60 | 524 |
| 1 | cmds | main | 1280 | 1150 | 501 |
| 1 | syst | main | 833 | 806 | 275 |
| 1 | self | main | 88 | 75 | 701 |
| 2 | cmds | cleanup + naive data | 1300 | 1140 | 740 |
| 2 | syst | cleanup + naive data | 924 | 870 | 434 |
| 2 | self | cleanup + naive data | 79 | 49 | 789 |

| run | branch | total |
|-----|--------|------:|
| 1 | cleanup + naive data | 1842 |
| 1 | cleanup + min-data | 2199 |
| 1 | main | 2201 |
| 2 | cleanup + naive data | 2303 |

@superdump
Contributor

It feels strange that iterators don’t support manually specifying a size for the allocation of the collection if for some reason we know better than the iterator itself can. That would solve the over-allocation problem.

@nicopap
Contributor Author

nicopap commented Mar 18, 2022

It feels strange that iterators don’t support manually specifying a size for the allocation of the collection if for some reason we know better than the iterator itself can. That would solve the over-allocation problem.

Well, I guess that's why extend accepts an iterator. Combining with_capacity with extend is still pretty ergonomic.
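That combination could look like this (illustrative numbers, not the actual extract_meshes code):

```rust
fn main() {
    // Guess capacity from out-of-band knowledge (e.g. last frame's
    // entity count), then extend: the allocation no longer depends
    // on the iterator's pessimistic size hint.
    let expected_len = 500;
    let mut batch: Vec<u32> = Vec::with_capacity(expected_len);
    batch.extend((0..1000u32).filter(|n| n % 2 == 0));
    assert_eq!(batch.len(), 500);
    // The up-front capacity was enough; no mid-loop reallocation.
    assert!(batch.capacity() >= 500);
}
```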

@mockersf
Member

mockersf commented Mar 19, 2022

It feels strange that iterators don’t support manually specifying a size for the allocation of the collection if for some reason we know better than the iterator itself can. That would solve the over-allocation problem.

Iterators are using the min size from the size hint to know how much to allocate. For ExactSizeIterator or TrustedLen, the size hint is actually correct, hence why allocation is OK. You can't manually set the size hint, but there are ways around that if that's something we want in Bevy.
Alternatively, collect_into (rust-lang/rust#93057) has been merged very recently, we could add a TODO note to revisit this once it's stable

@nicopap
Contributor Author

nicopap commented Apr 27, 2022

I think it's too much of a loss to not keep the previous frame's vec sizes. But I still think the function can be improved, so I'll put this PR into draft mode.

@nicopap nicopap marked this pull request as draft April 27, 2022 09:23
@nicopap nicopap force-pushed the cleanup-extract_meshes branch 2 times, most recently from 2e70aef to cbce57a Compare April 27, 2022 19:41
@nicopap
Contributor Author

nicopap commented Apr 27, 2022

Note: I managed to trigger an ICE writing this PR, though I don't really think the "improvements" the ICE is blocking are quite that important. See rust-lang/rust#96488

@nicopap
Contributor Author

nicopap commented Jun 13, 2022

btw, this is not breaking anymore, and it keeps the exact same storage pattern as the original. It's just a minor refactoring.

@alice-i-cecile alice-i-cecile self-requested a review June 28, 2022 16:24
```rust
};
let with_marker = |(entity, (handle, uniform))| (entity, (handle, uniform, NotShadowCaster));
let is_visible = |(_, vis, ..): &ExtractMeshItem| vis.is_visible;
let mut caster_cmds = Vec::with_capacity(*prev_len_caster);
```
Member

@alice-i-cecile alice-i-cecile Jun 28, 2022

These are more accurately characterized as "bundles" or "entities", not commands.

In general, please try to spell out terms; cmds was very hard to decipher reading this from top to bottom.

```rust
>,
mut prev_len_caster: Local<usize>,
mut prev_len_not: Local<usize>,
caster_query: Query<ExtractMeshQuery, Without<NotShadowCaster>>,
```
Member

Can we merge this into a single query with an Option<NotShadowCaster> field? I suppose it would likely be slower...

Contributor Author

@nicopap nicopap Jun 28, 2022

I really don't know about the perf characteristics. But a better way to do this would be to merge the two queries and replace the Option<&'static NotShadowReceiver> by With<NotShadowReceiver> (the negative on the marker type really throws me off)

Edit: Wow! This is not how With filter works! I'm surprised.

Edit2: Option<&T> can be replaced by Or<(With<T>, ())> for the purpose of checking if a component is present. I'm wondering if the performance profile is identical

Edit3: Nevermind! This wouldn't work. It would just always return true

```rust
}
*previous_not_caster_len = not_caster_values.len();
commands.insert_or_spawn_batch(not_caster_values);
let flags = if not_receiver.is_some() {
```
Member

Suggested change:

```diff
-let flags = if not_receiver.is_some() {
+let shadow_receiver_flags = if not_receiver.is_some() {
```

```rust
};
let with_marker = |(entity, (handle, uniform))| (entity, (handle, uniform, NotShadowCaster));
let is_visible = |(_, vis, ..): &ExtractMeshItem| vis.is_visible;
let mut caster_cmds = Vec::with_capacity(*prev_len_caster);
```
Member

Suggested change:

```diff
-let mut caster_cmds = Vec::with_capacity(*prev_len_caster);
+/// Use the previous number of matching entities to provide a good guess of the required capacity
+let mut caster_cmds = Vec::with_capacity(*prev_len_caster);
```

```rust
.map(mesh_bundle)
.map(with_marker),
);
*prev_len_caster = caster_cmds.len();
```
Member

Suggested change:

```diff
-*prev_len_caster = caster_cmds.len();
+// Store the number of entities matching each query in the appropriate Local
+// This allows us to make reasonable guesses about the capacity required for the next iteration
+*prev_len_caster = caster_cmds.len();
```

Member

@alice-i-cecile alice-i-cecile left a comment

I find both before and after quite hard to read. I think this does need cleanup, but I don't think this is a clear improvement currently. The functional style is probably appropriate here, given the use of the spawn_batch method, but it's easy for that style to become quite impenetrable.

I've left some suggestions for improved variable naming and comments that should help make this more clear.

@alice-i-cecile alice-i-cecile removed the M-Needs-Migration-Guide A breaking change to Bevy's public API that needs to be noted in a migration guide label Jun 28, 2022
@nicopap
Contributor Author

nicopap commented Jun 28, 2022

The suggestion of merging the two queries really made me think. I'll push an update that replaces the functional style with a simple for loop, put the PR in draft mode, do some benchmarking and make some judgments based on that.

@nicopap nicopap marked this pull request as draft June 28, 2022 19:57
Cleanup redundant code
* `extract_meshes` had two for loops which are functionally identical,
  just copy-pasted code. This was because of the split query. We merge
  them and put them in a single loop.
* The tuple literal for the command buffers was difficult to read,
  because it spanned many levels of indentation and lines of code.
  It now fits on a single line.
@alice-i-cecile
Member

Wow that last batch of changes really makes this much cleaner. I'm very curious to see if there's a perf regression. If there isn't, I'm strongly in favor of these changes.

@nicopap nicopap marked this pull request as ready for review June 30, 2022 16:41
@nicopap
Contributor Author

nicopap commented Jun 30, 2022

I've done mini-benchmarking in the same vein as the first iteration of this PR. It seems that the single for loop is not only as performant as the initial implementation, but significantly faster (15%). I have an AMD Ryzen 9 3900X; I'm wondering if I might have been lucky staying within my CPU cache size, or maybe a single query with a single loop unlocks better compiler optimizations?

Don't take my word for it! I'm going to post the benchmark table tomorrow.

Contributor

@superdump superdump left a comment

I think the performance speedup is probably due to more coherent accesses if you have a ‘random’ or ‘incoherent’ mix of shadow casters and not shadow casters. But, when you say 15%, 15% of what metric? How are you testing? And what is the absolute time difference?

crates/bevy_pbr/src/render/mesh.rs (outdated review thread, resolved)
```rust
&ComputedVisibility,
&GlobalTransform,
&Handle<Mesh>,
Option<Without<NotShadowReceiver>>,
```
Contributor

Woah, what is this query syntax? We can do optional filters within the query parameters? What does that even mean?

Member

If the component exists, it returns Some(component). Otherwise it returns None.

The particularly tricky bit here is that Without is being used in the fetch parameter of the query, and so you're getting a bool returned there.

Contributor

So if it matches Without<NotShadowCaster> then it returns Some(true) and otherwise None?

Contributor

Why not just Option<&NotShadowCaster> and invert the logic?

Member

I agree with that suggestion; I think this is probably clearer.

Contributor Author

A few things:

  1. Theoretically, Option<&NotShadowCaster> requires building a reference based on an offset and such, while Option<With<NotShadowCaster>> does not; it might have different performance characteristics.
  2. Minor, but the item type for Option<With<NotShadowCaster>> is Option<()> not Option<bool>
  3. I preferred Without<Not…> over With<Not…> because in the first implementation I got very confused by the negative on the variables.

@superdump
Contributor

I’ll happily test on an M1 Max.

@nicopap
Contributor Author

nicopap commented Jul 1, 2022

I think the performance speedup is probably due to more coherent accesses if you have a ‘random’ or ‘incoherent’ mix of shadow casters and not shadow casters. But, when you say 15%, 15% of what metric? How are you testing? And what is the absolute time difference?

I've only tested many_cubes. You are right! I should try another benchmark where there are shadow casters.

@nicopap
Contributor Author

nicopap commented Jul 1, 2022

I've been benchmarking the extract_meshes on the many_cubes stress test. I've tried controlling for thermal and background tasks, but I really can't get meaningful data out of it. It seems everything is within a ±10% range.

I've updated many_cubes to add NotShadowCaster and NotShadowReceiver on some cubes, so that I could have a fair split of entities with/without etc.

legend:

  • stag: Span stage{name="extract"}
  • syst: Span system overhead{name="bevy_pbr::render::mesh::extract_meshes"}
  • cmds: Span system_commands{name="bevy_pbr::render::mesh::extract_meshes"}
  • Times in microseconds (μs)
  • Ran with `cargo run --example many_cubes --release --features "trace trace_tracy"` for exactly 500 seconds, with a 180-second interval between runs (for thermals), on my Linux machine (specs already mentioned) running exactly two instances of the st terminal; I even killed some daemons for this.
  • baseline: The commit just before this PR's (e28b88b), plus the many_cubes patch
  • unified query: The code submitted in this PR, plus the many_cubes patch
  • split query: A variation of this PR with two for loops.
| run | span | branch | mean | median | std |
|-----|------|--------|-----:|-------:|----:|
| 1 | stag | baseline | 2750 | 2660 | 737 |
| 2 | stag | baseline | 2920 | 2790 | 763 |
| 1 | stag | unified query | 2810 | 2690 | 769 |
| 2 | stag | unified query | 2670 | 2560 | 688 |
| 1 | stag | split query | 2840 | 2700 | 759 |
| 2 | stag | split query | 2990 | 2850 | 822 |
| 1 | syst | baseline | 505 | 417 | 545 |
| 2 | syst | baseline | 504 | 403 | 546 |
| 1 | syst | unified query | 472 | 349 | 519 |
| 2 | syst | unified query | 485 | 362 | 530 |
| 1 | syst | split query | 565 | 415 | 626 |
| 2 | syst | split query | 571 | 418 | 628 |
| 1 | cmds | baseline | 577 | 559 | 199 |
| 2 | cmds | baseline | 607 | 588 | 216 |
| 1 | cmds | unified query | 613 | 656 | 214 |
| 2 | cmds | unified query | 1130 | 1060 | 356 |
| 1 | cmds | split query | 1130 | 1060 | 365 |
| 2 | cmds | split query | 597 | 580 | 216 |

Ran on this PR branch and base commit with the following patch to many_cubes:

https://gist.github.com/nicopap/b44cd2e1de65053ce521291dfecc5d12

I've been using the following script for testing:

https://gist.github.com/nicopap/00a04685b65a5402fa55a07ac5adc26a

@alice-i-cecile
Member

I'm content to call that "performance neutral" then.

@alice-i-cecile alice-i-cecile added the S-Ready-For-Final-Review This PR has been approved by the community. It's ready for a maintainer to consider merging it label Jul 4, 2022
@alice-i-cecile
Member

bors r+

bors bot pushed a commit that referenced this pull request Jul 4, 2022
@bors bors bot changed the title Rework extract_meshes [Merged by Bors] - Rework extract_meshes Jul 4, 2022
@bors bors bot closed this Jul 4, 2022
inodentry pushed a commit to IyesGames/bevy that referenced this pull request Aug 8, 2022
james7132 pushed a commit to james7132/bevy that referenced this pull request Oct 28, 2022
ItsDoot pushed a commit to ItsDoot/bevy that referenced this pull request Feb 1, 2023
@nicopap nicopap deleted the cleanup-extract_meshes branch September 5, 2023 15:44