[Merged by Bors] - Rework extract_meshes #4240
Conversation
Did you benchmark this with something like @mockersf you said something about the
Actually not. I'll do that and give you the run down.
tracy log for spans running. I did this 4 times, twice per branch.
All values are in microseconds.
I guess the value that matters is
Looks like there is a performance degradation? I don't have enough expertise to say how useful this kind of microbenchmark is.
Currently,
A 0.5-1.0 ms difference is quite a large performance regression, considering that 120 fps, say, requires a frame to be processed in less than 8.33 ms. That said, the code in the PR does look cleaner, and I would think that if the collect can be made to allocate a large enough Vec up-front, and that shows the performance is as good or better than
I looked into the
Second run of benchmarks, this time with the
Clearly, this doesn't work. My guess is that maybe the
It's still possible to just add back the last size-based allocation strategy. I'll do that tomorrow and see where that gets us.
Third run of benchmarks, this time with proper allocation patterns. Names are:
Looks like this time the cleaned-up code turned into a performance gain with the proper allocation pattern. A major downside, however, is that we allocate a command buffer the size of all entities with a mesh, regardless of how many of them are visible; this seems like a pitfall waiting to find a victim. I also expect the perf to vary somewhat between machines as a result.
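For context, the "proper allocation pattern" mentioned here is the one the original code relied on: remember how many entities matched last frame in a `Local<usize>` and pre-allocate from that. Below is a minimal sketch assuming a Bevy-0.7-style extract system; the names and the extracted bundle are simplified stand-ins rather than the actual `bevy_pbr` code:

```rust
use bevy::prelude::*;

// Hypothetical, simplified extract system illustrating the allocation strategy.
fn extract_meshes(
    mut commands: Commands,
    // Persists between runs of the system; starts at 0 on the first frame.
    mut prev_len: Local<usize>,
    query: Query<(Entity, &ComputedVisibility, &Handle<Mesh>)>,
) {
    // Use last frame's count as the capacity guess instead of the iterator's
    // size hint, whose lower bound is a useless 0 once filtering is involved.
    let mut batch = Vec::with_capacity(*prev_len);
    for (entity, visibility, mesh_handle) in query.iter() {
        if !visibility.is_visible {
            continue;
        }
        // The real system pushes the full extracted bundle (handle + MeshUniform);
        // a weak handle clone stands in for it here.
        batch.push((entity, (mesh_handle.clone_weak(),)));
    }
    // Remember how much we actually used so next frame's guess stays close.
    *prev_len = batch.len();
    commands.insert_or_spawn_batch(batch);
}
```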
It feels strange that iterators don't support manually specifying a size for the allocation of the collection if, for some reason, we know better than the iterator itself can. That would solve the over-allocation problem.
Well, I guess that's why
Iterators use the minimum size from the size hint to know how much to allocate. For
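To make the size-hint mechanics concrete, here is a small plain-Rust sketch (not from the PR): `filter` collapses the lower bound of `size_hint` to 0, which is what `collect` uses for its initial allocation, so `Vec::with_capacity` plus `extend` is the workaround when we have a better estimate:

```rust
fn main() {
    let data: Vec<u32> = (0..1000).collect();

    // `filter` cannot know how many items survive, so the lower bound drops to 0.
    let filtered = data.iter().filter(|x| **x % 2 == 0);
    assert_eq!(filtered.size_hint(), (0, Some(1000)));

    // Workaround: allocate from our own estimate (e.g. last frame's length),
    // then extend from the filtered iterator.
    let estimate = 500;
    let mut evens: Vec<u32> = Vec::with_capacity(estimate);
    evens.extend(data.iter().copied().filter(|x| x % 2 == 0));
    assert_eq!(evens.len(), 500);
    assert!(evens.capacity() >= estimate);
}
```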
I think it's too much of a loss to not keep the previous frame's vec sizes. But I still think the function can be improved, so I'll put this PR into draft mode.
Force-pushed from 2e70aef to cbce57a
Note: I managed to trigger an ICE writing this PR, though I don't really think the "improvements" the ICE is blocking are that important. See rust-lang/rust#96488
Force-pushed from cbce57a to 32ead71
Force-pushed from 32ead71 to 76e9b59
btw, this is not breaking anymore, and it keeps the exact same storage pattern as the original. It's just a minor refactoring.
crates/bevy_pbr/src/render/mesh.rs
Outdated
};
let with_marker = |(entity, (handle, uniform))| (entity, (handle, uniform, NotShadowCaster));
let is_visible = |(_, vis, ..): &ExtractMeshItem| vis.is_visible;
let mut caster_cmds = Vec::with_capacity(*prev_len_caster);
These are more accurately characterized as "bundles" or "entities", not commands. In general, please try to spell out terms; `cmds` was very hard to decipher reading this from top to bottom.
crates/bevy_pbr/src/render/mesh.rs
Outdated
>,
mut prev_len_caster: Local<usize>,
mut prev_len_not: Local<usize>,
caster_query: Query<ExtractMeshQuery, Without<NotShadowCaster>>,
Can we merge this into a single query with an `Option<NotShadowCaster>` field? I suppose it would likely be slower...
I really don't know about the perf characteristics. But a better way to do this would be to merge the two queries and replace the `Option<&'static NotShadowReceiver>` by `With<NotShadowReceiver>` (the negative on the marker type really throws me off).

Edit: Wow! This is not how the `With` filter works! I'm surprised.

Edit 2: `Option<&T>` can be replaced by `Or<(With<T>, ())>` for the purpose of checking if a component is present. I'm wondering if the performance profile is identical.

Edit 3: Nevermind! This wouldn't work. It would just always return `true`.
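For readers following along, the straightforward `Option<&T>` presence check being weighed here looks roughly like the following. A minimal sketch assuming the `NotShadowCaster` marker from `bevy_pbr`; the system itself is hypothetical and not part of the PR:

```rust
use bevy::{pbr::NotShadowCaster, prelude::*};

// Hypothetical system: only the *presence* of the marker is inspected.
fn count_shadow_casters(query: Query<(Entity, Option<&NotShadowCaster>)>) {
    let mut casters = 0;
    for (_entity, not_caster) in query.iter() {
        // `Some(_)` means the entity carries `NotShadowCaster`, i.e. it should
        // NOT cast shadows, hence the inverted check that confuses readers.
        if not_caster.is_none() {
            casters += 1;
        }
    }
    info!("shadow casters: {}", casters);
}
```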
crates/bevy_pbr/src/render/mesh.rs
Outdated
}
*previous_not_caster_len = not_caster_values.len();
commands.insert_or_spawn_batch(not_caster_values);
let flags = if not_receiver.is_some() {
Suggested change:
- let flags = if not_receiver.is_some() {
+ let shadow_receiver_flags = if not_receiver.is_some() {
crates/bevy_pbr/src/render/mesh.rs
Outdated
};
let with_marker = |(entity, (handle, uniform))| (entity, (handle, uniform, NotShadowCaster));
let is_visible = |(_, vis, ..): &ExtractMeshItem| vis.is_visible;
let mut caster_cmds = Vec::with_capacity(*prev_len_caster);
Suggested change:
- let mut caster_cmds = Vec::with_capacity(*prev_len_caster);
+ /// Use the previous number of matching entities to provide a good guess of the required capacity
+ let mut caster_cmds = Vec::with_capacity(*prev_len_caster);
crates/bevy_pbr/src/render/mesh.rs
Outdated
.map(mesh_bundle)
.map(with_marker),
);
*prev_len_caster = caster_cmds.len();
Suggested change:
- *prev_len_caster = caster_cmds.len();
+ // Store the number of entities matching each query in the appropriate Local
+ // This allows us to make reasonable guesses about the capacity required for the next iteration
+ *prev_len_caster = caster_cmds.len();
I find both before and after quite hard to read. I think this does need cleanup, but I don't think this is a clear improvement currently. The functional style is probably appropriate here, given the use of the spawn_batch method, but it's easy for that style to become quite impenetrable.
I've left some suggestions for improved variable naming and comments that should help make this more clear.
The suggestion of merging the two queries really made me think. I'll push an update that replaces the functional style with a simple for loop, put the PR in draft mode, do some benchmarking, and make some judgments based on that.
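As a rough illustration of that plan (a sketch under the assumption of a Bevy-0.7-style extract system, not the actual diff), the merged query with a plain for loop could look like this; it reuses the previous-frame-length allocation idea from above and only extracts the mesh handle to stay short:

```rust
use bevy::{pbr::NotShadowCaster, prelude::*};

// Hypothetical, simplified version of the single-query approach.
fn extract_meshes(
    mut commands: Commands,
    mut prev_len_caster: Local<usize>,
    mut prev_len_not_caster: Local<usize>,
    query: Query<(
        Entity,
        &ComputedVisibility,
        &Handle<Mesh>,
        Option<&NotShadowCaster>,
    )>,
) {
    let mut casters = Vec::with_capacity(*prev_len_caster);
    let mut not_casters = Vec::with_capacity(*prev_len_not_caster);
    // A single loop replaces the two copy-pasted loops over split queries.
    for (entity, visibility, mesh_handle, not_caster) in query.iter() {
        if !visibility.is_visible {
            continue;
        }
        if not_caster.is_some() {
            // Carry the marker over so the shadow pass keeps skipping this entity.
            not_casters.push((entity, (mesh_handle.clone_weak(), NotShadowCaster)));
        } else {
            casters.push((entity, (mesh_handle.clone_weak(),)));
        }
    }
    *prev_len_caster = casters.len();
    *prev_len_not_caster = not_casters.len();
    commands.insert_or_spawn_batch(casters);
    commands.insert_or_spawn_batch(not_casters);
}
```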
Force-pushed from 76e9b59 to 7807225
Cleanup redundant code
* `extract_meshes` had two for loops which were functionally identical, just copy-pasted code. This was because of the split query. We merge them and put them in a single loop.
* The tuple literal for the command buffers was difficult to read, because it spanned many levels of indentation and lines of code. It now fits on a single line.
Force-pushed from 7807225 to adfbf36
Wow, that last batch of changes really makes this much cleaner. I'm very curious to see if there's a perf regression. If there isn't, I'm strongly in favor of these changes.
I've done mini-benchmarking in the same vein as the first iteration of this PR. It seems that the single-query version is faster. But don't take my word for it! I'm going to post the benchmark table tomorrow.
I think the performance speedup is probably due to more coherent accesses if you have a ‘random’ or ‘incoherent’ mix of shadow casters and not shadow casters. But, when you say 15%, 15% of what metric? How are you testing? And what is the absolute time difference?
crates/bevy_pbr/src/render/mesh.rs
Outdated
&ComputedVisibility,
&GlobalTransform,
&Handle<Mesh>,
Option<Without<NotShadowReceiver>>,
Woah, what is this query syntax? We can do optional filters within the query parameters? What does that even mean?
If the component exists, it returns `Some(component)`. Otherwise it returns `None`.
The particularly tricky bit here is that `Without` is being used in the fetch parameter of the query, and so you're getting a bool returned there.
So if it matches `Without<NotShadowCaster>` then it returns `Some(true)`, and otherwise `None`?
Why not just `Option<&NotShadowCaster>` and invert the logic?
I agree with that suggestion; I think this is probably clearer.
A few things:
- Theoretically, `Option<&NotShadowCaster>` requires building a reference based on an offset and stuff, while `Option<With<NotShadowCaster>>` does not, so it might have different performance characteristics.
- Minor, but the item type for `Option<With<NotShadowCaster>>` is `Option<()>`, not `Option<bool>`.
- I preferred `Without<Not…>` over `With<Not…>` because in the first implementation I got very confused by the negative on the variables.
I'll happily test on an M1 Max.
I've only tested
I've been benchmarking the
I've updated
Legend:
Ran on this PR branch and base commit with the following patch to https://gist.github.com/nicopap/b44cd2e1de65053ce521291dfecc5d12
I've been using the following script for testing: https://gist.github.com/nicopap/00a04685b65a5402fa55a07ac5adc26a
I'm content to call that "performance neutral" then.
bors r+
* Cleanup redundant code
* Use a type alias to make sure the `caster_query` and `not_caster_query` really do the same thing and access the same things

**Objective**

Cleanup code that would otherwise be difficult to understand.

**Solution**

* `extract_meshes` had two for loops which were functionally identical, just copy-pasted code. I extracted the common code between the two and put it into an anonymous function.
* I flattened the tuple literal for the bundle batch; it looks much less nested and the code is much more readable as a result.
* The parameters of `extract_meshes` were also very daunting, but they turned out to be the same query repeated twice. I extracted the query into a type alias.

EDIT: I reworked the PR to **not do anything breaking** and keep the old allocation behavior. Removing the memorized length was clearly a performance loss, so I kept it.