Move adding DynamicUniformIndex to Extract #5037

james7132 · 2022-06-17T20:03:13Z

Objective

The commands issued by prepare_uniform_components must be applied with exclusive access to the render world, and applying them can take quite a bit of time when many entities are involved, particularly for archetypes with many large components. This duplicates work that is already being done in Extract.

Solution

Implement Default on DynamicUniformIndex.
Move adding DynamicUniformIndex into Extract instead of Prepare.
Change prepare_uniform_components to query for &mut DynamicUniformIndex instead of using commands.
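
As a rough illustration of the three steps above, here is a toy model in plain Rust. It does not use Bevy: `UniformStaging`, `UniformIndex`, `Row`, the 256-byte stride, and the `extract`/`prepare` functions are all hypothetical stand-ins. It only shows the shape of the change: the index slot exists (with its default of 0) from Extract onwards, and Prepare overwrites it in place through a mutable borrow instead of queueing an insert.

```rust
// Toy model only: plain Rust stand-ins, not Bevy's actual types or scheduling.

/// Hypothetical stand-in for a dynamic uniform buffer: values are appended and
/// each push returns the offset the value was written at.
#[derive(Default)]
struct UniformStaging {
    values: Vec<[f32; 4]>,
}

impl UniformStaging {
    /// Assumed alignment between entries, purely illustrative.
    const STRIDE: u32 = 256;

    fn push(&mut self, value: [f32; 4]) -> u32 {
        let offset = self.values.len() as u32 * Self::STRIDE;
        self.values.push(value);
        offset
    }
}

/// Stand-in for `DynamicUniformIndex`: defaults to 0 until Prepare overwrites it.
#[derive(Clone, Copy, Default, Debug, PartialEq)]
struct UniformIndex(u32);

/// One "entity" row: the extracted component plus its index slot.
struct Row {
    component: [f32; 4],
    index: UniformIndex,
}

/// Extract: the index slot is created up front with its default value.
fn extract(components: &[[f32; 4]]) -> Vec<Row> {
    components
        .iter()
        .map(|&component| Row { component, index: UniformIndex::default() })
        .collect()
}

/// Prepare: indices are written in place, so no deferred insert pass is needed.
fn prepare(rows: &mut [Row], staging: &mut UniformStaging) {
    for row in rows.iter_mut() {
        row.index = UniformIndex(staging.push(row.component));
    }
}

fn main() {
    let mut rows = extract(&[[0.0; 4], [1.0; 4], [2.0; 4]]);
    let mut staging = UniformStaging::default();
    prepare(&mut rows, &mut staging);
    for row in &rows {
        println!("{:?}", row.index);
    }
}
```

The trade-off is that anything spawned into the render world without the index slot is simply not matched by the mutable query, which is the concern raised later in review.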

Performance

This was tested against the many_cubes stress test. Here are the respective timing changes:

stage/system	main	this PR
Full Frame	20.52ms	18.79ms
Extract (Stage)	3.7ms	3.83ms
Prepare (Stage)	2.64ms	1.13ms
extract_meshes (system)	1.45ms	1.69us
extract_meshes (commands)	881.75us	1.48ms
prepare_uniform_components (system)	1.12ms	929.56us
prepare_uniform_components (commands)	1.26ms	253ns

Changelog

TODO

Migration Guide

TODO

superdump

I like this. One question - instead of having a default value of 0, wouldn’t it be better to make it an Option and skip drawing the thing if its index was never initialised?

james7132 · 2022-06-18T01:58:13Z

> I like this. One question - instead of having a default value of 0, wouldn’t it be better to make it an Option and skip drawing the thing if its index was never initialised?

This adds a branch in the middle of the render stage, which I'm hesitant to bloat even more given how heavy it already is, and the index is guaranteed to be written during prepare anyway. It also makes the component bigger, which erodes the performance gains we see here. Perhaps under a #[cfg(debug_assertions)]?
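
To put a number on the "makes the component bigger" point, here is a small size check. `PlainIndex` and `OptionalIndex` are hypothetical mirrors of the component with a `u32` and an `Option<u32>` field respectively; the real type in the PR may differ.

```rust
use std::marker::PhantomData;
use std::mem::size_of;

/// Hypothetical mirror of the component with a plain index.
#[allow(dead_code)]
struct PlainIndex<C> {
    index: u32,
    marker: PhantomData<C>,
}

/// Hypothetical mirror of the component with an optional index.
#[allow(dead_code)]
struct OptionalIndex<C> {
    index: Option<u32>,
    marker: PhantomData<C>,
}

fn main() {
    // `PhantomData` is zero-sized, so only the index field contributes.
    println!("plain:    {} bytes", size_of::<PlainIndex<()>>());
    // `u32` has no spare bit pattern, so `Option<u32>` needs a separate tag.
    println!("optional: {} bytes", size_of::<OptionalIndex<()>>());
}
```

On common targets this reports 4 and 8 bytes, so the optional form doubles the per-entity footprint of the index.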

MinerSebas · 2022-06-18T12:50:40Z

crates/bevy_render/src/extract_component.rs

+impl<C: Component> Clone for DynamicUniformIndex<C> {
+    fn clone(&self) -> Self {
+        Self {
+            index: self.index,
+            marker: PhantomData,
+        }
+    }
+}


Is the manual Clone necessary here?
It doesn't relax the C: Component bound and Copy is still derived.

james7132 · Jun 18, 2022

#[derive(Clone)] adds a C: Clone bound to the generated impl (and #[derive(Default)] likewise adds C: Default), even though C only appears inside PhantomData<C>. The manual impls provide Clone and Default regardless of what C is.
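
A minimal sketch of the difference between derived and manual impls, with the `Component` bound dropped so it compiles without Bevy. `NotClone`, `DerivedIndex`, and `ManualIndex` are hypothetical names used only for illustration.

```rust
use std::marker::PhantomData;

/// Stand-in for a component type that is neither `Clone` nor `Default`.
struct NotClone;

/// Derived version: the generated impls are `impl<C: Clone> Clone` and
/// `impl<C: Default> Default`, so they do not apply to `DerivedIndex<NotClone>`.
#[derive(Clone, Default)]
struct DerivedIndex<C> {
    index: u32,
    #[allow(dead_code)]
    marker: PhantomData<C>,
}

/// Manual version: no bound on `C` at all.
struct ManualIndex<C> {
    index: u32,
    marker: PhantomData<C>,
}

impl<C> Clone for ManualIndex<C> {
    fn clone(&self) -> Self {
        Self { index: self.index, marker: PhantomData }
    }
}

impl<C> Default for ManualIndex<C> {
    fn default() -> Self {
        Self { index: 0, marker: PhantomData }
    }
}

fn main() {
    // Fine: `u32` is both `Clone` and `Default`.
    let derived = DerivedIndex::<u32>::default().clone();
    // Fine for any `C`, including one with no impls of its own.
    let manual = ManualIndex::<NotClone>::default().clone();
    println!("{} {}", derived.index, manual.index);
    // `DerivedIndex::<NotClone>::default()` would be rejected by the compiler.
}
```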

MinerSebas · 2022-06-18T12:51:40Z

crates/bevy_render/src/extract_component.rs

+impl<C: Component> Default for DynamicUniformIndex<C> {
+    fn default() -> Self {
+        Self {
+            index: 0,
+            marker: PhantomData,
+        }
+    }
+}


This default impl also could be replaced by a derive.

james7132 · 2022-06-19T01:47:32Z

Wanted to just note that if we merge #4902, we can avoid the secondary copy inside the extraction commands by adding an ExtractedComponentUniforms<C> resource we write to instead of inserting via Commands. This does make it harder to query for the uniforms after the fact, but it minimizes the number of heavy copies being done. This effectively forces the extract stage systems to do what UniformComponentPlugin already did before, but it definitely avoids heavier copies, both into the render world and within it.

EDIT: Tried this, the queue_material_meshes system is dependent on MeshUniform as a component.

I also tested to see if we could defer the inverse().transpose() computation in extract_meshes, and it does seem like it shuffles about 100us of time in many_cubes to the prepare stage. Unlike before, this can be done in-place, without any copying; only a temporary on the stack is used.

superdump · 2022-06-19T05:46:58Z

> I like this. One question - instead of having a default value of 0, wouldn’t it be better to make it an Option and skip drawing the thing if its index was never initialised?

Could you test and profile it to see if it does make a practical performance difference? It would be a win for correctness in case something doesn’t actually ever get set.

superdump · 2022-06-19T05:52:09Z

> Wanted to just note that if we merge #4902, we can avoid the secondary copy inside the extraction commands by adding an ExtractedComponentUniforms<C> resource we write to instead of inserting via Commands. This does make it harder to query for the uniforms after the fact, but it minimizes the number of heavy copies being done. This effectively forces the extract stage systems to do what UniformComponentPlugin already did before, but it definitely avoids heavier copies, both into the render world and within it.

> EDIT: Tried this, the queue_material_meshes system is dependent on MeshUniform as a component.

> I also tested to see if we could defer the inverse().transpose() computation in extract_meshes, and it does seem like it shuffles about 100us of time in many_cubes to the prepare stage. Unlike before, this can be done in-place, without any copying; only a temporary on the stack is used.

This again becomes a question of whether the model matrix is used/desired to be used elsewhere in the render schedule such that not calculating it upfront incurs calculation of it multiple times. If it just moves time from one place to another with no other overall performance benefits then the only pro is that it makes the extract stage shorter. That would be enough but only if we think no one ever needs the model matrix. I wonder if TAA would need it for motion vectors or how that works…

james7132 · 2022-06-19T08:37:29Z

> Could you test and profile it to see if it does make a practical performance difference? It would be a win for correctness in case something doesn’t actually ever get set.

Tried this with an if let Some(index) = dynamic.index(). The Render phase perf doesn't tangibly change (7.49ms -> 7.50ms), though I think this may cause incorrect bindings if we don't panic on not finding an index. Changing it to panic via unwrap negatively affects render perf on my machine (7.49ms -> 7.63ms).
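
For readers following along, the two variants being compared look roughly like this. This is a sketch, not the PR's code: `DynamicIndex`, its `index()` accessor returning `Option<u32>`, and `FakePass` are hypothetical stand-ins for the component and the render pass.

```rust
/// Hypothetical index component whose value may not have been written yet.
struct DynamicIndex {
    index: Option<u32>,
}

impl DynamicIndex {
    fn index(&self) -> Option<u32> {
        self.index
    }
}

/// Records the dynamic offsets that would be bound; stands in for a render pass.
#[derive(Default)]
struct FakePass {
    bound_offsets: Vec<u32>,
}

/// Variant 1: skip the bind when no index was written. Whatever offset was
/// bound earlier stays in effect, which is the "incorrect bindings" risk.
fn bind_or_skip(pass: &mut FakePass, dynamic: &DynamicIndex) -> bool {
    if let Some(index) = dynamic.index() {
        pass.bound_offsets.push(index);
        true
    } else {
        false
    }
}

/// Variant 2: treat a missing index as a bug and panic.
fn bind_or_panic(pass: &mut FakePass, dynamic: &DynamicIndex) {
    pass.bound_offsets.push(dynamic.index().unwrap());
}

fn main() {
    let mut pass = FakePass::default();
    let bound = bind_or_skip(&mut pass, &DynamicIndex { index: None });
    println!("bound: {bound}, offsets: {:?}", pass.bound_offsets);
    bind_or_panic(&mut pass, &DynamicIndex { index: Some(256) });
    println!("offsets: {:?}", pass.bound_offsets);
}
```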

superdump · 2022-06-19T10:59:47Z

> Could you test and profile it to see if it does make a practical performance difference? It would be a win for correctness in case something doesn’t actually ever get set.

> Tried this with an if let Some(index) = dynamic.index(). The Render phase perf doesn't tangibly change (7.49ms -> 7.50ms), though I think this may cause incorrect bindings if we don't panic on not finding an index. Changing it to panic via unwrap negatively affects render perf on my machine (7.49ms -> 7.63ms).

Good to know that panic incurs that performance hit. I was thinking that a missing index would somehow cause that entity not to be drawn by propagating up an error or something. But if we don’t already have error returns from draw functions then maybe it’s not worth it. I’m just kind of expecting it to be easy enough to make code where some entities never have their dynamic index updated and then they will be drawn using whatever the transform is for index 0. I suppose another way to handle it would be to make that model matrix produce vertices containing NaNs in the clip position so that they are dropped, but that feels like a hack where it would be better to just not draw the thing.

james7132 · 2022-06-20T05:16:53Z

> Good to know that panic incurs that performance hit. I was thinking that a missing index would somehow cause that entity not to be drawn by propagating up an error or something. But if we don’t already have error returns from draw functions then maybe it’s not worth it. I’m just kind of expecting it to be easy enough to make code where some entities never have their dynamic index updated and then they will be drawn using whatever the transform is for index 0. I suppose another way to handle it would be to make that model matrix produce vertices containing NaNs in the clip position so that they are dropped, but that feels like a hack where it would be better to just not draw the thing.

Having seen the perf hit, I tried the opposite and changed RenderCommand to return a Result<(), RenderCommandError> instead and changed the existing unwraps into ok_or(...)?s, and saw a slight improvement in perf (7.49ms -> 7.37ms in the same stress test). This is likely because we're removing all machinery for panicking from the functions, allowing for fewer instructions/smaller jumps. I could add it to this PR, though I'd rather not given how large/controversial it would be relative to this change.
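
A sketch of what the Result-returning shape might look like, assuming a hypothetical `RenderCommandError::MissingUniformIndex` variant and free functions in place of the real `RenderCommand` trait. The actual trait signature and error type in that experiment are not shown in this thread.

```rust
/// Hypothetical error type; the variant name is an assumption for illustration.
#[derive(Debug, PartialEq)]
enum RenderCommandError {
    MissingUniformIndex,
}

/// Hypothetical index component whose value may not have been written yet.
struct DynamicIndex {
    index: Option<u32>,
}

/// Instead of unwrapping, convert the missing index into an error and let `?`
/// propagate it to the caller.
fn set_bind_group(
    bound: &mut Vec<u32>,
    dynamic: &DynamicIndex,
) -> Result<(), RenderCommandError> {
    let index = dynamic.index.ok_or(RenderCommandError::MissingUniformIndex)?;
    bound.push(index);
    Ok(())
}

/// The caller decides what a failure means; here the item is simply not drawn.
fn draw_all(items: &[DynamicIndex]) -> (Vec<u32>, usize) {
    let mut bound = Vec::new();
    let mut skipped = 0;
    for item in items {
        if set_bind_group(&mut bound, item).is_err() {
            skipped += 1;
        }
    }
    (bound, skipped)
}

fn main() {
    let items = [
        DynamicIndex { index: Some(0) },
        DynamicIndex { index: None },
        DynamicIndex { index: Some(256) },
    ];
    let (bound, skipped) = draw_all(&items);
    println!("bound {:?}, skipped {}", bound, skipped);
}
```

This also gives the "just don't draw the thing" behaviour discussed above, since the decision to skip moves to the code that iterates over draw items.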

superdump · 2022-06-21T09:38:22Z

> Having seen the perf hit, I tried the opposite and changed RenderCommand to return a Result<(), RenderCommandError> instead and changed the existing unwraps into ok_or(...)?s, and saw a slight improvement in perf (7.49ms -> 7.37ms in the same stress test). This is likely because we're removing all machinery for panicking from the functions, allowing for fewer instructions/smaller jumps. I could add it to this PR, though I'd rather not given how large/controversial it would be relative to this change.

Sounds reasonable to do it in a separate PR. If you don't intend to do that straight away, could you add a TODO comment?

cart · 2022-08-08T22:12:25Z

crates/bevy_render/src/extract_component.rs

    render_device: Res<RenderDevice>,
    render_queue: Res<RenderQueue>,
    mut component_uniforms: ResMut<ComponentUniforms<C>>,
-    components: Query<(Entity, &C)>,
+    mut components: Query<(&C, &mut DynamicUniformIndex<C>)>,


Doesn't this break the UniformComponentPlugin in the general case?
This now assumes that DynamicUniformIndex is added in the extract step, but that isn't the case for something using, say, ExtractComponentPlugin.

We aren't currently using this anywhere else, but given that this is intended to be a generalized (and user facing) abstraction, I think we should discuss ways to make this "fool proof".

james7132 · Dec 7, 2022

I've dropped the ball on this PR, but thinking on this a bit more, I think it makes a lot of sense to take an approach where we keep indices as components while directly writing extracted components to their target staging buffers.

Indices are small, typically 4-8 bytes. Compare this with the equivalent MeshUniform, which is currently 132 bytes. If we are going to heavily leverage commands for rendering, we should be minimizing the number of large copies that are being performed. I'd much rather we copy heavy components once and then just shuffle the indices around.

If we still need the intermediate data during Prepare or Queue, we can always refer back to the buffer in memory. It's less ergonomic, but alleviates the heaviest parts of running the Render World right now.

james7132 · 2023-09-22T06:53:06Z

Closing this as the renderer is already moving in a non-direct ECS storage direction, and the introduction of the instancing and batching changes makes this difficult to merge.

Move adding DynamicUniformIndex to Extract

e3aefb6

james7132 added the A-Rendering and C-Performance labels Jun 17, 2022

james7132 requested a review from superdump June 17, 2022 20:03

superdump approved these changes Jun 18, 2022

james7132 mentioned this pull request Jun 18, 2022

[Merged by Bors] - Replace BlobVec's swap_scratch with a swap_nonoverlapping #4853

Closed

MinerSebas reviewed Jun 18, 2022

superdump requested a review from cart June 26, 2022 00:08

cart reviewed Aug 8, 2022

Weibye added the S-Adopt-Me label Aug 10, 2022

james7132 closed this Sep 22, 2023
