[Merged by Bors] - Rework extract_meshes #4240

nicopap · 2022-03-17T10:41:01Z

Cleanup redundant code
Use a type alias to make sure the caster_query and
not_caster_query really do the same thing and access the same things

Objective

Cleanup code that would otherwise be difficult to understand

Solution

extract_meshes had two for loops which are functionally identical,
just copy-pasted code. I extracted the common code between the two
and put them into an anonymous function.
I flattened the tuple literal for the bundle batch, it looks much
less nested and the code is much more readable as a result.
The parameters of extract_meshes were also very daunting, but they
turned out to be the same query repeated twice. I extracted the query
into a type alias.

EDIT: I reworked the PR to not do anything breaking, and keep the old allocation behavior. Removing the memorized length was clearly a performance loss, so I kept it.

superdump · 2022-03-17T12:09:57Z

Did you benchmark this with something like many_cubes to check its performance is as good as or better than main?

@mockersf you said something about the Query iterator only being an exact size iterator if there was no filter?

nicopap · 2022-03-17T12:23:59Z

Did you benchmark this with something like many_cubes to check its performance is as good as or better than main?

Actually not. I'll do that and give you the run down.

nicopap · 2022-03-17T14:17:26Z

tracy log for spans running cargo run --release --example many_cubes --features 'trace trace_tracy' for 10000 frames.

I did this 4 time. Twice per branch. The span are the measurments for the respective spans:

ovhe: system overhead{name="bevy_pbr::render::mesh::extract_meshes"}
self: ovhe self time
cmds: system_commands{name="bevy_pbr::render::mesh::extract_meshes"}
syst: system{name="bevy_pbr::render::mesh::extract_meshes"}

All values are in microseconds μs.

run	span	branch	mean	median	std
1	cmds	cleanup-extract_meshes	861	820	340
1	syst	cleanup-extract_meshes	1280	1220	421
1	ovhe	cleanup-extract_meshes	724	754	779
1	self	cleanup-extract_meshes	79	35	1010
1	cmds	main	845	795	337
1	syst	main	846	814	287
1	ovhe	main	517	395	563
1	self	main	94	8	705
2	cmds	cleanup-extract_meshes	1270	1160	492
2	syst	cleanup-extract_meshes	1320	1250	435
2	ovhe	cleanup-extract_meshes	760	835	797
2	self	cleanup-extract_meshes	78	53	1050
2	cmds	main	831	789	336
2	syst	main	831	806	277
2	ovhe	main	526	603	537
2	self	main	85	65	695

I guess the value that matters is self + syst + cmds?

run	branch	self + syst + cmds
1	cleanup-extract_meshes	79 + 861 + 1280 = 2220
1	main	94 + 846 + 845 = 1785
2	cleanup-extract_meshes	78 + 1320 + 1270 = 2668
2	main	85 + 831 + 831 = 1747

Looks like there is performance degradation? I really have not enough expertize to say how useful this kind of microbenchmark is.

nicopap · 2022-03-17T15:19:43Z

QueryIter::size_hint links to #1686 though this is only tangentially relevant here. From the code I've read, the standard library <T as Iterator>::collect<Vec<_>>() uses the lower bound exposed by Iterator::size_hint() for the initial allocation (unless the iterator is TrustedLen in which case it uses the upper bound). Whether the iterator implements ExactSizeIterator doesn't matter.

Currently, QueryIter::size_hint's lower bound is always zero. So rust won't be able to use it to optimize a QueryIter::collect<Vec<_>>() call. Ways to solve that:

Use specialization to implement a special size_hint for queries with filter that have predictable size not stable
unsafe impl TrustedLen not stable
Probably (maybe?) some others I'll take time to investigate.

superdump · 2022-03-17T17:23:38Z

0.5-1.0ms difference is quite a large performance regression considering, say 120fps requires a frame to be processed in less than 8.33ms.

That said, the code in the PR does look cleaner, and I would think that if the collect can be made to allocate a large enough Vec up-front, and that shows the performance is as good or better than main then it's a good change. :)

nicopap · 2022-03-17T17:50:38Z

I looked into the size_hint problem and I think I figured an implementation that gives proper hint archetype based filtering. Now to check if it improves performance on this PR.

nicopap · 2022-03-17T20:14:16Z

Second run of benchmarks, this time with the size_hint fix from the changes introduced in #4244. I'm very disapointed.

Clearly, this doesn't work. My guess is that maybe the filter overwrites the min size (which makes a lot of sense) so the size_hint fix doesn't fix this at all.

It's still possible to just add back the last size-based allocation strategy. I'll do that tomorrow and see where it gets.

run	span	branch	mean	median	std
1	cmds	cleanup + size-hint	1350	1230	513
1	syst	cleanup + size-hint	1140	1090	411
1	self	cleanup + size-hint	80	60	929
2	cmds	cleanup + size-hint	1270	1160	477
2	syst	cleanup + size-hint	1180	1130	417
2	self	cleanup + size-hint	107	106	1010
1	cmds	main	1290	1180	488
1	syst	main	837	812	275
1	self	main	78	45	684
3	cmds	cleanup + size-hint	1330	1200	506
3	syst	cleanup + size-hint	1190	1140	432
3	self	cleanup + size-hint	114	113	1040
2	cmds	main	1260	1140	485
2	syst	main	844	819	280
2	self	main	82	60	700
1	cmds	filter-size-hint	1300	1190	505
1	syst	filter-size-hint	844	813	283
1	self	filter-size-hint	107	106	741

nicopap · 2022-03-18T08:34:31Z

Third run of benchmarkes. this time with proper allocation patterns. Names are:

cleanup + naive data: current code
cleanup + min-data: An alternative command buffer where some of the work is delegated to the command system
main: main as of c1a2378

Looks like this time the cleaned up code turned into a performance gain with the proper allocation pattern. A major downside however is that we allocate a command buffer the size of all entities with a mesh, regardless of how many of them are visible, this seems like a pitfall waiting to find a victim. I also expect the perf to vary somewhat between machines as a result.

run	span	branch	mean	median	std
1	cmds	cleanup + naive data	850	801	333
1	syst	cleanup + naive data	910	885	295
1	self	cleanup + naive data	82	33	737
1	cmds	cleanup + min-data	1500	1350	546
1	syst	cleanup + min-data	615	604	186
1	self	cleanup + min-data	84	60	524
1	cmds	main	1280	1150	501
1	syst	main	833	806	275
1	self	main	88	75	701
2	cmds	cleanup + naive data	1300	1140	740
2	syst	cleanup + naive data	924	870	434
2	self	cleanup + naive data	79	49	789

run	branch	total
1	cleanup + naive data	1842
1	cleanup + min-data	2199
1	main	2201
2	cleanup + naive data	2303

superdump · 2022-03-18T11:18:06Z

It feels strange that iterators don’t support manually specifying a size for the allocation of the collection if for some reason we know better than the iterator itself can. That would solve the over-allocation problem.

nicopap · 2022-03-18T13:24:46Z

It feels strange that iterators don’t support manually specifying a size for the allocation of the collection if for some reason we know better than the iterator itself can. That would solve the over-allocation problem.

Well, I guess that's why extend accepts an iterator. Combining with_capacity with extend is still pretty ergonomic.

mockersf · 2022-03-19T22:11:21Z

It feels strange that iterators don’t support manually specifying a size for the allocation of the collection if for some reason we know better than the iterator itself can. That would solve the over-allocation problem.

Iterators are using the min size from the size hint to know how much to allocate. For ExactSizeIterator or TrustedLen, the size hint is actually correct, hence why allocation is OK. You can't manually set the size hint, but there are ways around that if that's something we want in Bevy.
Alternatively, collect_into (rust-lang/rust#93057) has been merged very recently, we could add a TODO note to revisit this once it's stable

nicopap · 2022-04-27T09:23:52Z

I think it's too much of a loss to not keep the previous frame's vec sizes. But I still think the function can be improved. So I'll put this PR into draft mod

nicopap · 2022-04-27T19:44:44Z

Note: I managed to trigger an ICE writing this PR. Though I don't really think the "improvements" the ICE is blocking is quite that important. See rust-lang/rust#96488

nicopap · 2022-06-13T09:39:31Z

btw, this is not breaking anymore, and it keeps the exact same storage pattern as the original. It's just a minor refactoring.

alice-i-cecile · 2022-06-28T18:56:10Z

crates/bevy_pbr/src/render/mesh.rs

+    };
+    let with_marker = |(entity, (handle, uniform))| (entity, (handle, uniform, NotShadowCaster));
+    let is_visible = |(_, vis, ..): &ExtractMeshItem| vis.is_visible;
+    let mut caster_cmds = Vec::with_capacity(*prev_len_caster);


These are more accurately characterized as "bundles" or "entities", not commands.

In general, please try to spell out terms; cmds was very hard to decipher reading this from top to bottom.

alice-i-cecile · 2022-06-28T18:56:50Z

crates/bevy_pbr/src/render/mesh.rs

-    >,
+    mut prev_len_caster: Local<usize>,
+    mut prev_len_not: Local<usize>,
+    caster_query: Query<ExtractMeshQuery, Without<NotShadowCaster>>,


Can we merge this into a single query with an Option<NotShadowCaster> field? I suppose it would likely be slower...

I really don't know about the perf characteristics. But a better way to do this would be to merge the two queries and replace the Option<&'static NotShadowReceiver> by With<NotShadowReceiver> (the negative on the marker type really throws me off)

Edit: Wow! This is not how With filter works! I'm surprised.

Edit2: Option<&T> can be replaced by Or<(With<T>, ())> for the purpose of checking if a component is present. I'm wondering if the performance profile is identical

Edit3: Nevermind! This wouldn't work. It would just always return true

alice-i-cecile · 2022-06-28T18:58:42Z

crates/bevy_pbr/src/render/mesh.rs

-    }
-    *previous_not_caster_len = not_caster_values.len();
-    commands.insert_or_spawn_batch(not_caster_values);
+        let flags = if not_receiver.is_some() {


Suggested change

let flags = if not_receiver.is_some() {

let shadow_receiver_flags = if not_receiver.is_some() {

alice-i-cecile · 2022-06-28T19:04:10Z

crates/bevy_pbr/src/render/mesh.rs

+    };
+    let with_marker = |(entity, (handle, uniform))| (entity, (handle, uniform, NotShadowCaster));
+    let is_visible = |(_, vis, ..): &ExtractMeshItem| vis.is_visible;
+    let mut caster_cmds = Vec::with_capacity(*prev_len_caster);


Suggested change

let mut caster_cmds = Vec::with_capacity(*prev_len_caster);

/// Use the previous number of matching entities to provide a good guess of the required capacity

let mut caster_cmds = Vec::with_capacity(*prev_len_caster);

alice-i-cecile · 2022-06-28T19:05:25Z

crates/bevy_pbr/src/render/mesh.rs

+            .map(mesh_bundle)
+            .map(with_marker),
+    );
+    *prev_len_caster = caster_cmds.len();


Suggested change

*prev_len_caster = caster_cmds.len();

// Store the number of entities matching each query in the appropriate Local

// This allows us to make reasonable guesses about the capacity required for the next iteration

*prev_len_caster = caster_cmds.len();

alice-i-cecile

I find both before and after quite hard to read. I think this does need cleanup, but I don't think this is a clear improvement currently. The functional style is probably appropriate here, given the use of the spawn_batch method, but it's easy for that style to become quite impenetrable.

I've left some suggestions for improved variable naming and comments that should help make this more clear.

nicopap · 2022-06-28T19:50:35Z

The suggestion of merging the two queries really made me think. I'll push an update that replaces the functional style with a simple for loop, put the PR in draft mode, do some benchmarking and make some judgments based on that.

Cleanup redundant code * `extract_meshes` had two for loops which are functionally identical, just copy-pasted code. This was because of the split query. We merge them and put them in a single loop. * The tuple literal for the command buffers was difficult to read, because it spanned many levels of idententations and lines of codes. they now fit in a single line.

alice-i-cecile · 2022-06-28T20:58:35Z

Wow that last batch of changes really makes this much cleaner. I'm very curious to see if there's a perf regression. If there isn't, I'm strongly in favor of these changes.

nicopap · 2022-06-30T17:02:03Z

I've done mini-benchmarking in the same vein as the first iteration of this PR. It seems that the single for loop is not only as performant as the initial implementation, but significantly faster (15%). I've a AMD Ryzen 9 3900X, I'm wondering if I might have been lucky staying within my CPU cache size, or maybe a single query with a single loop unlocks better compiler optimizations?

Don't take my word for it! I'm going to post the benchmark table tomorrow.

superdump

I think the performance speedup is probably due to more coherent accesses if you have a ‘random’ or ‘incoherent’ mix of shadow casters and not shadow casters. But, when you say 15%, 15% of what metric? How are you testing? And what is the absolute time difference?

crates/bevy_pbr/src/render/mesh.rs

superdump · 2022-06-30T19:59:25Z

crates/bevy_pbr/src/render/mesh.rs

+        &ComputedVisibility,
+        &GlobalTransform,
+        &Handle<Mesh>,
+        Option<Without<NotShadowReceiver>>,


Woah, what is this query syntax? We can do optional filters within the query parameters? What does that even mean?

If the component exists, it returns Some(component). Otherwise it returns None.

The particularly tricky bit here is that Without is being used in the fetch parameter of the query, and so you're getting a bool returned there.

So if it matches Without<NotShadowCaster> then it returns Some(true) and otherwise None?

Why not just Option<&NotShadowCaster> and invert the logic?

I agree with that suggestion; I think this is probably clearer.

A few things:

Theoretically, Option<&NotShadowCaster> requires building a reference based on an offset and stuff, while Option<With<NotShadowCaster>> does not, it might have different performance characteristic.

Minor, but the item type for Option<With<NotShadowCaster>> is Option<()> not Option<bool>

I preferred Without<Not…> over With<Not…> because in the first implementation I got very confused by the negative on the variables.

superdump · 2022-06-30T20:04:14Z

I’ll happily test on an M1 Max.

nicopap · 2022-07-01T06:54:19Z

think the performance speedup is probably due to more coherent accesses if you have a ‘random’ or ‘incoherent’ mix of shadow casters and not shadow casters. But, when you say 15%, 15% of what metric? How are you testing? And what is the absolute time difference?

I've only tested many_cubes. You are right! I should try another benchmark where there are shadow casters.

nicopap · 2022-07-01T17:59:38Z

I've been benchmarking the extract_meshes on the many_cubes stress test. I've tried controlling for thermal and background tasks, but I really can't get meaningful data out of it. It seems everything is within a ±10% range.

I've updated many_cubes to add NotShadowCaster and NotShadowReceiver on some cubes, so that I could have a fair split of entities with/without etc.

legend:

stag: Span stage{name="extract"}
syst: Span system overhead{name="bevy_pbr::render::mesh::extract_meshes"}
cmds: Span system_commands{name="bevy_pbr::render::mesh::extract_meshes"}
Times in microseconds (μs)
Ran with cargo run --example many_cubes --release --features "trace trace_tracy" for exactly 500 seconds, with a 180 seconds interval between runs (for thermal) on my linux machine (specs already mentioned) running exactly two instances of the st terminal, I even killed some daemons for this.
baseline: The commit just before this PR's (e28b88b), but the many_cubes patch
unified query: The code submitted in this PR, but the many_cubes patch
split query: A variation of this PR with two for loops.

run	span	branch	mean	median	std
1	stag	baseline	2750	2660	737
2	stag	baseline	2920	2790	763
1	stag	unified query	2810	2690	769
2	stag	unified query	2670	2560	688
1	stag	split query	2840	2700	759
2	stag	split query	2990	2850	822
1	syst	baseline	505	417	545
2	syst	baseline	504	403	546
1	syst	unified query	472	349	519
2	syst	unified query	485	362	530
1	syst	split query	565	415	626
2	syst	split query	571	418	628
1	cmds	baseline	577	559	199
2	cmds	baseline	607	588	216
1	cmds	unified query	613	656	214
2	cmds	unified query	1130	1060	356
1	cmds	split query	1130	1060	365
2	cmds	split query	597	580	216

Ran on this PR branch and base commit with the following patch to many_cubes:

https://gist.github.com/nicopap/b44cd2e1de65053ce521291dfecc5d12

I've been using the following script for testing:

https://gist.github.com/nicopap/00a04685b65a5402fa55a07ac5adc26a

alice-i-cecile · 2022-07-01T19:09:53Z

I'm content to call that "performance neutral" then.

alice-i-cecile · 2022-07-04T12:44:07Z

bors r+

* Cleanup redundant code * Use a type alias to make sure the `caster_query` and `not_caster_query` really do the same thing and access the same things **Objective** Cleanup code that would otherwise be difficult to understand **Solution** * `extract_meshes` had two for loops which are functionally identical, just copy-pasted code. I extracted the common code between the two and put them into an anonymous function. * I flattened the tuple literal for the bundle batch, it looks much less nested and the code is much more readable as a result. * The parameters of `extract_meshes` were also very daunting, but they turned out to be the same query repeated twice. I extracted the query into a type alias. EDIT: I reworked the PR to **not do anything breaking**, and keep the old allocation behavior. Removing the memorized length was clearly a performance loss, so I kept it.

bors · 2022-07-04T13:04:03Z

Pull request successfully merged into main.

Build succeeded:

* Cleanup redundant code * Use a type alias to make sure the `caster_query` and `not_caster_query` really do the same thing and access the same things **Objective** Cleanup code that would otherwise be difficult to understand **Solution** * `extract_meshes` had two for loops which are functionally identical, just copy-pasted code. I extracted the common code between the two and put them into an anonymous function. * I flattened the tuple literal for the bundle batch, it looks much less nested and the code is much more readable as a result. * The parameters of `extract_meshes` were also very daunting, but they turned out to be the same query repeated twice. I extracted the query into a type alias. EDIT: I reworked the PR to **not do anything breaking**, and keep the old allocation behavior. Removing the memorized length was clearly a performance loss, so I kept it.

github-actions bot added the S-Needs-Triage This issue needs to be labelled label Mar 17, 2022

nicopap marked this pull request as draft April 27, 2022 09:23

nicopap force-pushed the cleanup-extract_meshes branch 2 times, most recently from 2e70aef to cbce57a Compare April 27, 2022 19:41

nicopap marked this pull request as ready for review April 28, 2022 06:55

nicopap force-pushed the cleanup-extract_meshes branch from cbce57a to 32ead71 Compare April 28, 2022 07:10

nicopap mentioned this pull request May 13, 2022

bevy_render: Reserve only entities from the last frame #4322

Closed

nicopap force-pushed the cleanup-extract_meshes branch from 32ead71 to 76e9b59 Compare June 13, 2022 09:38

alice-i-cecile self-requested a review June 28, 2022 16:24

alice-i-cecile reviewed Jun 28, 2022

View reviewed changes

alice-i-cecile requested changes Jun 28, 2022

View reviewed changes

alice-i-cecile removed the M-Needs-Migration-Guide A breaking change to Bevy's public API that needs to be noted in a migration guide label Jun 28, 2022

nicopap force-pushed the cleanup-extract_meshes branch from 76e9b59 to 7807225 Compare June 28, 2022 19:57

nicopap marked this pull request as draft June 28, 2022 19:57

nicopap force-pushed the cleanup-extract_meshes branch from 7807225 to adfbf36 Compare June 28, 2022 20:03

nicopap marked this pull request as ready for review June 30, 2022 16:41

superdump reviewed Jun 30, 2022

View reviewed changes

Update some names

58d0efe

superdump approved these changes Jul 4, 2022

View reviewed changes

alice-i-cecile approved these changes Jul 4, 2022

View reviewed changes

alice-i-cecile added the S-Ready-For-Final-Review This PR has been approved by the community. It's ready for a maintainer to consider merging it label Jul 4, 2022

bors bot changed the title ~~Rework extract_meshes~~ [Merged by Bors] - Rework extract_meshes Jul 4, 2022

bors bot closed this Jul 4, 2022

nicopap deleted the cleanup-extract_meshes branch September 5, 2023 15:44

	let flags = if not_receiver.is_some() {
	let shadow_receiver_flags = if not_receiver.is_some() {

	let mut caster_cmds = Vec::with_capacity(*prev_len_caster);
	/// Use the previous number of matching entities to provide a good guess of the required capacity
	let mut caster_cmds = Vec::with_capacity(*prev_len_caster);

[Merged by Bors] - Rework extract_meshes #4240

[Merged by Bors] - Rework extract_meshes #4240

Conversation

nicopap commented Mar 17, 2022 • edited Loading

superdump commented Mar 17, 2022

nicopap commented Mar 17, 2022

nicopap commented Mar 17, 2022

nicopap commented Mar 17, 2022

superdump commented Mar 17, 2022

nicopap commented Mar 17, 2022

nicopap commented Mar 17, 2022 • edited Loading

nicopap commented Mar 18, 2022 • edited Loading

superdump commented Mar 18, 2022

nicopap commented Mar 18, 2022

mockersf commented Mar 19, 2022 • edited Loading

nicopap commented Apr 27, 2022

nicopap commented Apr 27, 2022

nicopap commented Jun 13, 2022

alice-i-cecile Jun 28, 2022 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

nicopap Jun 28, 2022 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

alice-i-cecile left a comment

Choose a reason for hiding this comment

nicopap commented Jun 28, 2022

alice-i-cecile commented Jun 28, 2022

nicopap commented Jun 30, 2022

superdump left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

superdump commented Jun 30, 2022

nicopap commented Jul 1, 2022

nicopap commented Jul 1, 2022 • edited Loading

alice-i-cecile commented Jul 1, 2022

alice-i-cecile commented Jul 4, 2022

bors bot commented Jul 4, 2022

nicopap commented Mar 17, 2022 •

edited

Loading

nicopap commented Mar 17, 2022 •

edited

Loading

nicopap commented Mar 18, 2022 •

edited

Loading

mockersf commented Mar 19, 2022 •

edited

Loading

alice-i-cecile Jun 28, 2022 •

edited

Loading

nicopap Jun 28, 2022 •

edited

Loading

nicopap commented Jul 1, 2022 •

edited

Loading