Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Reorder render sets, refactor bevy_sprite to take advantage #9236

Merged
merged 36 commits into from
Aug 27, 2023

Conversation

james-j-obrien
Copy link
Contributor

@james-j-obrien james-j-obrien commented Jul 21, 2023

This is a continuation of this PR: #8062

Objective

  • Reorder render schedule sets to allow data preparation when phase item order is known to support improved batching
  • Part of the batching/instancing etc plan from here: GPU Instancing #89 (comment)
  • The original idea came from @inodentry and proved to be a good one. Thanks!
  • Refactor bevy_sprite and bevy_ui to take advantage of the new ordering

Solution

  • Move Prepare and PrepareFlush after PhaseSortFlush

  • Add a PrepareAssets set that runs in parallel with other systems and sets in the render schedule.

    • Put prepare_assets systems in the PrepareAssets set
    • If explicit dependencies are needed on Mesh or Material RenderAssets then depend on the appropriate system.
  • Add ManageViews and ManageViewsFlush sets between ExtractCommands and Queue

  • Move queue_mesh*_bind_group to the Prepare stage

    • Rename them to prepare_
    • Put systems that prepare resources (buffers, textures, etc.) into a PrepareResources set inside Prepare
    • Put the prepare_..._bind_group systems into a PrepareBindGroup set after PrepareResources
  • Move prepare_lights to the ManageViews set

    • prepare_lights creates views and this must happen before Queue
    • This system needs refactoring to stop handling all responsibilities
    • Gather lights, sort, and create shadow map views. Store sorted light entities in a resource
  • Remove BatchedPhaseItem

  • Replace batch_range with batch_size representing how many items to skip after rendering the item or to skip the item entirely if batch_size is 0.

  • queue_sprites has been split into queue_sprites for queueing phase items and prepare_sprites for batching after the PhaseSort

    • PhaseItems are still inserted in queue_sprites
    • After sorting adjacent compatible sprite phase items are accumulated into SpriteBatch components on the first entity of each batch, containing a range of vertex indices. The associated PhaseItem's batch_size is updated appropriately.
    • SpriteBatch items are then drawn skipping over the other items in the batch based on the value in batch_size
  • A very similar refactor was performed on bevy_ui


Changelog

Changed:

  • Reordered and reworked render app schedule sets. The main change is that data is extracted, queued, sorted, and then prepared when the order of data is known.
  • Refactor bevy_sprite and bevy_ui to take advantage of the reordering.

Migration Guide

  • Assets such as materials and meshes should now be created in PrepareAssets e.g. prepare_assets<Mesh>
  • Queueing entities to RenderPhases continues to be done in Queue e.g. queue_sprites
  • Preparing resources (textures, buffers, etc.) should now be done in PrepareResources, e.g. prepare_prepass_textures, prepare_mesh_uniforms
  • Prepare bind groups should now be done in PrepareBindGroups e.g. prepare_mesh_bind_group
  • Any batching or instancing can now be done in Prepare where the order of the phase items is known e.g. prepare_sprites

Next Steps

  • Introduce some generic mechanism to ensure items that can be batched are grouped in the phase item order, currently you could easily have [sprite at z 0, mesh at z 0, sprite at z 0] preventing batching.
  • Investigate improved orderings for building the MeshUniform buffer
  • Implementing batching across the rest of bevy

@github-actions
Copy link
Contributor

Example contributors failed to run, please try running it locally and check the result.

2 similar comments
@github-actions
Copy link
Contributor

Example contributors failed to run, please try running it locally and check the result.

@github-actions
Copy link
Contributor

Example contributors failed to run, please try running it locally and check the result.

@james-j-obrien
Copy link
Contributor Author

Had to add an additional check in prepare_sprites to ensure that the image asset is prepared. This isn't ideal as we then do the lookup again in queue_sprites in order to create the bind group.

Copy link
Contributor

@JMS55 JMS55 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Generally looks good to me, but I can really only review bevy_pbr/bevy_render. I'm not familiar with bevy_sprite, bevy_ui, etc.

Also, curious why the QueueMeshes subset exists.

crates/bevy_render/src/lib.rs Outdated Show resolved Hide resolved
crates/bevy_render/src/render_phase/mod.rs Outdated Show resolved Hide resolved
crates/bevy_render/src/render_phase/mod.rs Show resolved Hide resolved
@superdump superdump self-requested a review July 22, 2023 10:40
@mockersf mockersf added the A-Rendering Drawing game state to the screen label Jul 23, 2023
@nicopap nicopap self-requested a review July 27, 2023 07:08
@nicopap nicopap added the C-Code-Quality A section of code that is hard to understand or change label Jul 27, 2023
@superdump superdump requested a review from robtfm July 27, 2023 14:20
Copy link
Contributor

@superdump superdump left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mostly looks OK to me. The approach with UI to do batching in Queue looks at first glance like it would break if other plugins queued anything to the TransparentUi render phase such that those items should be sorted between.

Other than that I'd like to see some benchmarks and either decide that performance is good enough that more performance improvements can be made in a follow-up PR to save on blocking merge of this PR, or that more work is needed.

crates/bevy_sprite/src/render/mod.rs Outdated Show resolved Hide resolved
crates/bevy_sprite/src/render/mod.rs Outdated Show resolved Hide resolved
crates/bevy_sprite/src/render/mod.rs Show resolved Hide resolved
crates/bevy_ui/src/render/mod.rs Show resolved Hide resolved
@superdump superdump mentioned this pull request Jul 29, 2023
@superdump
Copy link
Contributor

On my M1 Max, this PR drops bevymark -- 10000 16 performance from ~72fps to ~44fps, so I think we will have to do some work to optimise this before merging as it's too far away. I'll have a look at what I can do to help this along. :)

@superdump
Copy link
Contributor

superdump commented Aug 3, 2023

I first made a branch that:

  • Uses a SparseSet<Entity, ExtractedSprite> to extract sprites into, and then only sprite batches are spawned as entities after batching.
  • Avoids per-sprite GpuImage lookups, instead doing them once per batch (except for a case where the image is not loaded yet, that would still be per-sprite until a sprite is found whose image is loaded.)

superdump@4e05e65

After that prepare_sprites still needed some more optimisation but it's getting there. I tried some minor changes but didn't make much of a dent on the performance drop.

image
About 16% slower than main (based on medians, on an M1 Max running bevymark -- 10000 16 to spawn 16 waves of 10k sprites, and running it for 1500 frames as I usually do.)

Some time passed and I wanted to try changing the way the sprite rendering worked, to instead put per-instance data into a GpuArrayBuffer, and then look it up using an instance index calculated from a specially-crafted index buffer where the low 2 bits are the x and y offsets of the quad (00 is bottom left, 01 bottom right, 10 top left, 11 top right) and the high bits contain the instance index. I got this working but it was still quite a bit slower than main.

image
About 12% slower than main.

In profiles in Xcode I thought I had seen a lot of cache misses, and so I had one more idea to try - get rid of the GpuArrayBuffer of per-instance data, put it in an instance-rate vertex buffer instead, and use an index buffer with just 6 vertices. This had been claimed to be slower than generating a larger index buffer but on the M1 Max...

image
About 28% frame time reduction vs main!!

I tested on my 5900HS 0 mobile RTX 3080 as well, not as big a boost there:
image
About an 8%ish frame time reduction, but still good!

So, I propose that this PR be rebased, reviewed, and merged. And that I make a separate PR with this change to using an instance buffer. @james-j-obrien

Oh, one more thing I noted when moving to the instance-rate vertex buffer: it uses a BufferVec of a repr C struct so does not need to do any serialisation into std430 or std140 layouts using encase. This seems to cut out about 50% of the prepare_sprites execution time.

@github-actions
Copy link
Contributor

github-actions bot commented Aug 4, 2023

Example contributors failed to run, please try running it locally and check the result.

@james-j-obrien
Copy link
Contributor Author

Fixed the UI implementation to match that of bevy_sprite. Main caveat here is that I have set the z depth used for the phase sort in TransparentUi to use the stack_index of the corresponding ui node as an f32. This works and still allows for positioning arbitrary non node TransparentUi items behind or in front of the UI, but the value is also arbitrary so open to suggestions if there is a cleaner method I'm not seeing. One alternative is to sort by the usize stack_index instead of using a float at all but that would then also apply to non node TransparentUi items for little upside.

@JMS55
Copy link
Contributor

JMS55 commented Aug 22, 2023

Can we rename PrepareBuffers to PrepareResources or something more generic?

Copy link
Contributor

@superdump superdump left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking pretty good to me. Huge thanks @robtfm for going through all the systems and sets!

crates/bevy_pbr/src/ssao/mod.rs Show resolved Hide resolved
crates/bevy_render/src/lib.rs Outdated Show resolved Hide resolved
crates/bevy_render/src/lib.rs Outdated Show resolved Hide resolved
crates/bevy_render/src/lib.rs Outdated Show resolved Hide resolved
crates/bevy_render/src/lib.rs Outdated Show resolved Hide resolved
/// The copy of [`apply_deferred`] that runs between [`PrepareResources`](RenderSet::PrepareResources) and ['PrepareBindgroups'](RenderSet::PrepareBindgroups).
PrepareResourcesFlush,
/// A sub-set within Prepare for constructing bindgroups, or other data that relies on buffers.
PrepareBindgroups,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As the wgpu/bevy types are camelCased as BindGroup, I think this should do the same, everywhere. Just commenting this one time.

Suggested change
PrepareBindgroups,
PrepareBindGroups,

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've fixed the this instance, the only other example in the repo is the LineGizmoUniformBindgroup. I could update that as well but it seems sort of tangential to the purpose of this PR.

crates/bevy_render/src/mesh/mod.rs Show resolved Hide resolved
crates/bevy_render/src/render_asset.rs Show resolved Hide resolved
@superdump
Copy link
Contributor

The new SVG is looking good! :) It feels like it makes more sense.

Copy link
Contributor

@robtfm robtfm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks good to me. i tested a number of examples and didn't see any issues. couple of minor points but happy to approve as is.

crates/bevy_render/src/lib.rs Outdated Show resolved Hide resolved
crates/bevy_render/src/view/window/mod.rs Outdated Show resolved Hide resolved
Copy link
Contributor

@JMS55 JMS55 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

An absolutely amazing and foundational PR :). This resolves my long standing complaint of "why the heck are we making bind groups in a stage (system set? is that the official term now?) called queue?". Thanks so much for doing this!

self.per_object_binding_dynamic_offset,
Reverse(FloatOrd(self.distance)),
)
Reverse(FloatOrd(self.distance))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't we sort by pipeline here? Unrelated to this PR really, though.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll do that separately. :)

for item in &self.items {
let draw_function = draw_functions.get_mut(item.draw_function()).unwrap();
draw_function.draw(world, render_pass, view, item);
let mut index = 0;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Curious what the assembly looks like for this block.

Copy link
Contributor Author

@james-j-obrien james-j-obrien Aug 23, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here's some annotated output from just the main render loop: render.txt
And here's the full asm for the function: render_full.txt
(had to use txt files as github complains about .asm)

There is an extra branch for the zero case but we won't be taking that the majority of the time.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks. Don't think it's a huge problem then. Branch predictor probably handles it fine.

Copy link
Contributor

@superdump superdump left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@james-j-obrien - let's get this merged! Please update to main and ping me on Discord so I can click the button. :)

@superdump superdump added this pull request to the merge queue Aug 27, 2023
Merged via the queue into bevyengine:main with commit 4f1d9a6 Aug 27, 2023
23 of 24 checks passed
github-merge-queue bot pushed a commit that referenced this pull request Sep 2, 2023
# Objective

- Supercedes #8872 
- Improve sprite rendering performance after the regression in #9236 

## Solution

- Use an instance-rate vertex buffer to store per-instance data.
- Store color, UV offset and scale, and a transform per instance.
- Convert Sprite rect, custom_size, anchor, and flip_x/_y to an affine
3x4 matrix and store the transpose of that in the per-instance data.
This is similar to how MeshUniform uses transpose affine matrices.
- Use a special index buffer that has batches of 6 indices referencing 4
vertices. The lower 2 bits indicate the x and y of a quad such that the
corners are:
  ```
  10    11

  00    01
  ```
UVs are implicit but get modified by UV offset and scale The remaining
upper bits contain the instance index.

## Benchmarks

I will compare versus `main` before #9236 because the results should be
as good as or faster than that. Running `bevymark -- 10000 16` on an M1
Max with `main` at `e8b38925` in yellow, this PR in red:

![Screenshot 2023-08-27 at 18 44
10](https://github.com/bevyengine/bevy/assets/302146/bdc5c929-d547-44bb-b519-20dce676a316)

Looking at the median frame times, that's a 37% reduction from before.

---

## Changelog

- Changed: Improved sprite rendering performance by leveraging an
instance-rate vertex buffer.

---------

Co-authored-by: Giacomo Stevanato <giaco.stevanato@gmail.com>
@rparrett rparrett mentioned this pull request Sep 2, 2023
@cart cart mentioned this pull request Oct 13, 2023
43 tasks
@re0312 re0312 mentioned this pull request Jan 7, 2024
github-merge-queue bot pushed a commit that referenced this pull request Jan 7, 2024
# Objective

- since #9236 queue_mesh_bind_group has been renamed to
prepare_mesh_bind_group,but the comment referring to it has not been
updated. .
rdrpenguin04 pushed a commit to rdrpenguin04/bevy that referenced this pull request Jan 9, 2024
# Objective

- Supercedes bevyengine#8872 
- Improve sprite rendering performance after the regression in bevyengine#9236 

## Solution

- Use an instance-rate vertex buffer to store per-instance data.
- Store color, UV offset and scale, and a transform per instance.
- Convert Sprite rect, custom_size, anchor, and flip_x/_y to an affine
3x4 matrix and store the transpose of that in the per-instance data.
This is similar to how MeshUniform uses transpose affine matrices.
- Use a special index buffer that has batches of 6 indices referencing 4
vertices. The lower 2 bits indicate the x and y of a quad such that the
corners are:
  ```
  10    11

  00    01
  ```
UVs are implicit but get modified by UV offset and scale The remaining
upper bits contain the instance index.

## Benchmarks

I will compare versus `main` before bevyengine#9236 because the results should be
as good as or faster than that. Running `bevymark -- 10000 16` on an M1
Max with `main` at `e8b38925` in yellow, this PR in red:

![Screenshot 2023-08-27 at 18 44
10](https://github.com/bevyengine/bevy/assets/302146/bdc5c929-d547-44bb-b519-20dce676a316)

Looking at the median frame times, that's a 37% reduction from before.

---

## Changelog

- Changed: Improved sprite rendering performance by leveraging an
instance-rate vertex buffer.

---------

Co-authored-by: Giacomo Stevanato <giaco.stevanato@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-Rendering Drawing game state to the screen C-Code-Quality A section of code that is hard to understand or change
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants