Pack multiple meshes into vertex and index buffers. #13218

Closed
pcwalton wants to merge 8 commits

Conversation

@pcwalton (Contributor) commented May 3, 2024

The underlying allocation algorithm is [`offset-allocator`](https://github.com/pcwalton/offset-allocator/), which is a port of [Sebastian Aaltonen's `OffsetAllocator`](https://github.com/sebbbi/OffsetAllocator/). It's a fast, simple, hard real-time allocator in the two-level segregated fit family.

Allocations are divided into two categories: regular and large. Regular allocations go into one of the shared slabs managed by an allocator. Large allocations get their own individual slabs. Due to platform limitations, on WebGL 2 all vertex buffers are considered large allocations that get their own slabs; however, index buffers can still be packed together. The slab size is 32 MB by default, but the developer can adjust it manually.
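
To make the regular/large split concrete, here is a minimal Rust sketch of the decision described above. It is not the PR's actual code: the variant names, the `is_large_allocation` helper, and the `supports_base_vertex` flag (standing in for the WebGL 2 limitation) are all illustrative.

```rust
/// The default slab size from the description above: 32 MB.
const DEFAULT_SLAB_SIZE: u64 = 32 * 1024 * 1024;

/// Simplified stand-in for the PR's `GpuAllocationClass`; the real enum
/// carries more information (such as the vertex buffer layout).
enum GpuAllocationClass {
    VertexBuffer,
    IndexBuffer,
}

/// Decide whether an allocation is "large" (its own slab) or "regular"
/// (packed into a shared slab).
fn is_large_allocation(
    class: &GpuAllocationClass,
    size_in_bytes: u64,
    supports_base_vertex: bool, // false on WebGL 2
) -> bool {
    match class {
        // On WebGL 2, every vertex buffer must live in its own slab.
        GpuAllocationClass::VertexBuffer if !supports_base_vertex => true,
        // Anything that wouldn't fit in a shared slab also gets its own slab.
        _ => size_in_bytes > DEFAULT_SLAB_SIZE,
    }
}
```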

The mesh bin key and compare data have been reworked so that the slab IDs are compared first. That way, meshes in the same vertex and index buffers tend to be drawn together. Note that this only works well for opaque meshes; transparent meshes must be sorted into draw order, so there's less opportunity for grouping.
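
As a rough illustration of that reordering (the struct and its field names are hypothetical, not Bevy's actual bin key):

```rust
/// Because the derived `Ord` compares fields in declaration order, sorting
/// these keys groups draws that share the same vertex and index slabs before
/// it ever looks at the pipeline or the mesh itself.
#[derive(Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
struct OpaqueBinKey {
    vertex_slab: u32,        // compared first: same vertex buffer, adjacent draws
    index_slab: Option<u32>, // then the index buffer slab, if the mesh is indexed
    pipeline_id: u32,        // remaining fields only break ties
    mesh_id: u32,
}
```

Sorting the opaque bins by such a key yields long runs of meshes that live in the same buffers, which is what makes the rebinding savings (and, later, multi-draw batching) possible.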

The purpose of packing meshes together is to reduce the number of times vertex and index buffers have to be re-bound, which is expensive. In the future, we'd like to use multi-draw, which allows us to draw multiple meshes with a single draw call, as long as they're in the same buffers. Thus, this patch paves the way toward multi-draw, and with it a GPU-driven pipeline.

Even without multi-draw, this patch results in significant performance improvements. For me, the command submission time (i.e. GPU time plus driver and wgpu overhead) for Bistro goes from 4.07ms to 1.42ms without shadows (2.8x speedup); with shadows it goes from 6.91ms to 2.62ms (2.45x speedup). The number of vertex and index buffer switches in Bistro is reduced from approximately 3,600 to 927, with the vast majority of the remaining switches due to the transparent pass.

Bistro, without shadows. Yellow is this PR; red is main.

Bistro, with shadows. Yellow is this PR; red is main.


Changelog

Added

  • Multiple meshes can now be packed together into vertex and index buffers, which reduces state changes and provides performance improvements.

Migration Guide

  • The vertex and index data in GpuMesh are now GpuAllocations instead of Buffers, to facilitate packing multiple meshes into the same buffer. To fetch the buffer corresponding to a GpuAllocation, use the buffer() method on the new GpuAllocator resource. Note that the allocation may be located anywhere in the buffer; use the offset() method to determine its location. A usage sketch follows below.
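
A minimal sketch of that lookup follows. The module paths, the `vertex_buffer` field name, and the return type of `offset()` are assumptions based on the description above, not the exact API:

```rust
use bevy_render::allocator::{GpuAllocation, GpuAllocator}; // assumed paths
use bevy_render::mesh::GpuMesh;
use bevy_render::render_resource::Buffer;

/// Resolve where a mesh's vertex data now lives: the shared slab buffer
/// plus the offset of this allocation within it.
fn mesh_vertex_location<'a>(
    allocator: &'a GpuAllocator,
    gpu_mesh: &GpuMesh,
) -> (&'a Buffer, u64) {
    // `vertex_buffer` is now a `GpuAllocation`, not a `Buffer`.
    let allocation: &GpuAllocation = &gpu_mesh.vertex_buffer;
    // The slab that backs this allocation...
    let buffer = allocator.buffer(allocation);
    // ...and where in that slab the mesh's data begins.
    let offset = allocation.offset();
    (buffer, offset)
}
```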

@superdump added the A-Rendering (drawing game state to the screen), C-Performance (a change motivated by improving speed, memory usage or compile times), and X-Uncontroversial (this work is generally agreed upon) labels on May 3, 2024
@Elabajaba self-requested a review on May 4, 2024
@NthTensor (Contributor) commented:

Awesome! I will try to read through and test this over the weekend.

Will this also give us a second perf improvement when order-independent transparency lands? That should mean we can basically stop sorting, right?

@pcwalton (Contributor, Author) commented May 4, 2024

@NthTensor Yes, I would think so.

/// position*, we must only group meshes with identical vertex buffer layouts
/// into the same buffer.
#[derive(Clone, PartialEq, Eq, Hash, Debug)]
pub enum GpuAllocationClass {
@Swoorup commented May 6, 2024:

Curious if it makes sense to add other GpuAllocationClass variants: textures, samplers, pipeline layouts, etc.? I'm mostly looking at rerun's code as an example.

@pcwalton (Contributor, Author) replied:

I think we should add those as we come to them. YAGNI :)

/// The mesh.
///
/// Although we don't have multidraw capability yet, we place this at the
/// end to maximize multidraw opportunities in the future.
A contributor asked:

Why does order matter here? It's not repr(C) or anything.

@pcwalton (Contributor, Author) replied:

The derived PartialOrd compares from the first struct field to the last, so keys that match on the fields at the top of the struct will generally be placed together more often.
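
A tiny self-contained demonstration of that derive behavior (nothing here is Bevy-specific):

```rust
// Derived comparisons are lexicographic in field declaration order:
// `slab` is compared first, and `mesh` only breaks ties.
#[derive(Debug, PartialEq, Eq, PartialOrd, Ord)]
struct Key {
    slab: u32,
    mesh: u32,
}

fn main() {
    let mut keys = vec![
        Key { slab: 2, mesh: 0 },
        Key { slab: 1, mesh: 9 },
        Key { slab: 1, mesh: 3 },
    ];
    keys.sort();
    // Both slab-1 keys end up adjacent, regardless of their mesh IDs.
    assert_eq!(
        keys,
        vec![
            Key { slab: 1, mesh: 3 },
            Key { slab: 1, mesh: 9 },
            Key { slab: 2, mesh: 0 },
        ]
    );
}
```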

The contributor replied:

TIL, Thanks.

@NthTensor (Contributor) left a review comment:

Some of the code involving the allocator itself is a bit beyond me, but I don't see any issues reading it over. Very pleased with the performance trying this out on my fairly low-quality hardware.

@re0312 (Contributor) left a review comment:

Outstanding work! But it looks like something is broken in the 2d_shapes example (screenshot attached).

impl GpuAllocator {
    /// Returns the slab that the given allocation is stored in.
    pub fn buffer(&self, allocation: &GpuAllocation) -> &Buffer {
        &self.slabs[&allocation.slab_id]
A contributor commented:

Instead of panicking when encountering an invalid allocation, we could return an Option<&Buffer> here

Another contributor replied:

In general I prefer not to panic, but I'm not sure how we would handle the option here in the render system. It looks like this only panics if there's an unrecoverable memory error in this code or in the allocator. Perhaps we could unwrap for a more useful error message?

@pcwalton (Contributor, Author) commented May 7, 2024:

If it fails it always indicates a bug, because we should be spilling into large allocations if a small allocation can't handle it. So I'm not sure returning None would be helpful.

@NthTensor (Contributor) commented:

I can replicate the issue with 2D shapes (screenshots attached). That's weird.

@pcwalton (Contributor, Author) commented:

Updated to main and fixed the 2D meshes problem. It was a simple mistake when porting the logic from 3D over: in the indexed path, for the draw_indexed method call I forgot to switch the base_vertex parameter from 0 to the actual location in the buffer.
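
Roughly, the fix amounts to the following (a sketch against wgpu's `draw_indexed` signature; the function and parameter names are illustrative, and the conversion from the allocation's byte offset to a first-vertex index is glossed over):

```rust
/// Draw one indexed mesh whose vertex and index data are packed into shared buffers.
fn draw_packed_indexed_mesh(
    render_pass: &mut wgpu::RenderPass<'_>,
    index_range: std::ops::Range<u32>, // where this mesh's indices live in the shared index buffer
    first_vertex: u32,                 // where this mesh's vertices start in the shared vertex buffer
) {
    // The bug: `base_vertex` was left at 0, so indices resolved against the
    // start of the shared vertex buffer instead of this mesh's own vertices.
    // render_pass.draw_indexed(index_range.clone(), 0, 0..1);

    // The fix: pass the mesh's actual base vertex within the packed buffer.
    render_pass.draw_indexed(index_range, first_vertex as i32, 0..1);
}
```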

@pcwalton requested review from re0312 and superdump and removed the review request for re0312 on May 17, 2024
@pcwalton added this to the 0.15 milestone on May 20, 2024
@pcwalton added the S-Needs-Review (needs reviewer attention, from anyone, to move forward) label on May 20, 2024
@pcwalton (Contributor, Author) commented:

Comments addressed

@pcwalton requested a review from ricky26 on May 21, 2024
@NthTensor added the S-Ready-For-Final-Review (approved by the community; ready for a maintainer to consider merging) label and removed the S-Needs-Review label on May 21, 2024
@alice-i-cecile added the D-Complex (quite challenging from either a design or technical perspective) label on May 21, 2024
@alice-i-cecile modified the milestones: 0.15, 0.14 on May 21, 2024
@alice-i-cecile added the S-Blocked (this cannot move forward until something else changes) label on May 21, 2024
@alice-i-cecile (Member) commented:

[2:32 PM]pcwalton: I think 0.15 would be best for that.
[2:32 PM]pcwalton: Specifically my concern is the size of the slabs: the heuristic hasn't been well tuned and might balloon memory usage in some cases. We won't know without testing.
[2:32 PM]pcwalton: It's the kind of thing we'll only know through a cycle of testing on main so I'd be uncomfortable with 0.14. Besides, @griffin found that the savings are very situational

Blocking until 0.14 is shipped.

@pcwalton (Contributor, Author) commented:

This is ready to go, but I think it would be best to wait until 0.15 and not merge for 0.14. The reason is that the memory usage heuristics haven't been well tuned yet. We'll only know what the best heuristics are through a cycle of testing.

@BD103 (Member) commented May 22, 2024

I would like to nominate this for the release notes. I think the performance gains are significant enough that users would enjoy reading about it. (Not to mention I've been watching this in the background ever since offset-allocator hit the front page of HN!)

@alice-i-cecile added the M-Needs-Release-Note (work that should be called out in the blog due to impact) label on May 22, 2024
@alice-i-cecile (Member) commented:

Agreed :) In the future, feel free to just add the label yourselves: it's easy to make the editorial call to split or lump things during the final release notes process.

const SWEEP_INTERVAL: Duration = Duration::from_secs(10);

/// The default size of a slab, in bytes.
const DEFAULT_SLAB_SIZE: u64 = 32 * 1024 * 1024;
A contributor commented:

Should this (in the future) be synced up with wgpu's allocation sizes as well? (gfx-rs/wgpu#5875)

    class: &GpuAllocationClass,
    contents: &[u8],
) -> GpuAllocation {
    // If this is going to overflow a slab, give it
A contributor suggested completing the truncated comment:

Suggested change:
- // If this is going to overflow a slab, give it
+ // If this is going to overflow a slab, give it its own large slab.

@pcwalton closed this on Jul 8, 2024

@pcwalton (Contributor, Author) commented Jul 8, 2024:

I'm closing this because I've rewritten it.

pcwalton added a commit to pcwalton/bevy that referenced this pull request Jul 9, 2024
This commit uses the [`offset-allocator`] crate to combine vertex and
index arrays from different meshes into single buffers. Since the
primary source of `wgpu` overhead is from validation and synchronization
when switching buffers, this significantly improves Bevy's rendering
performance on many scenes.

This patch is a more flexible version of bevyengine#13218, which also used slabs.
Unlike bevyengine#13218, which used slabs of a fixed size, this commit implements
slabs that start small and can grow. In addition to reducing memory
usage, supporting slab growth reduces the number of vertex and index
buffer switches that need to happen during rendering, leading to
improved performance. To prevent pathological fragmentation behavior,
slabs are capped to a maximum size, and mesh arrays that are too large
get their own dedicated slabs.

As an additional improvement over bevyengine#13218, this commit allows the
application to customize all allocator heuristics. The
`MeshAllocatorSettings` resource contains values that adjust the minimum
and maximum slab sizes, the cutoff point at which meshes get their own
dedicated slabs, and the rate at which slabs grow. Hopefully-sensible
defaults have been chosen for each value.

Unfortunately, WebGL 2 doesn't support the *base vertex* feature, which
is necessary to pack vertex arrays from different meshes into the same
buffer. `wgpu` represents this restriction as the downlevel flag
`BASE_VERTEX`. This patch detects that bit and ensures that all vertex
buffers get dedicated slabs on that platform. Even on WebGL 2, though,
we can combine all *index* arrays into single buffers to reduce buffer
changes, and we do so.
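
For context, tuning the follow-up allocator might look something like the sketch below. The field names are guesses based on the paragraph above (minimum and maximum slab sizes, the dedicated-slab cutoff, and the growth rate), not the real `MeshAllocatorSettings` fields, and the import path is assumed.

```rust
use bevy::prelude::*;
use bevy::render::mesh::allocator::MeshAllocatorSettings; // assumed path

fn main() {
    App::new()
        .add_plugins(DefaultPlugins)
        // All field names below are illustrative; check the real
        // `MeshAllocatorSettings` docs for the actual knobs.
        .insert_resource(MeshAllocatorSettings {
            min_slab_size: 1024 * 1024,         // slabs start small (1 MB)...
            max_slab_size: 512 * 1024 * 1024,   // ...and are capped to limit fragmentation
            large_threshold: 256 * 1024 * 1024, // meshes above this get dedicated slabs
            growth_factor: 1.5,                 // how quickly a slab grows when it fills up
            ..Default::default()
        })
        .run();
}
```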
github-merge-queue bot pushed a commit that referenced this pull request Jul 16, 2024
…#14257)

This commit uses the [`offset-allocator`] crate to combine vertex and
index arrays from different meshes into single buffers. Since the
primary source of `wgpu` overhead is from validation and synchronization
when switching buffers, this significantly improves Bevy's rendering
performance on many scenes.

This patch is a more flexible version of #13218, which also used slabs.
Unlike #13218, which used slabs of a fixed size, this commit implements
slabs that start small and can grow. In addition to reducing memory
usage, supporting slab growth reduces the number of vertex and index
buffer switches that need to happen during rendering, leading to
improved performance. To prevent pathological fragmentation behavior,
slabs are capped to a maximum size, and mesh arrays that are too large
get their own dedicated slabs.

As an additional improvement over #13218, this commit allows the
application to customize all allocator heuristics. The
`MeshAllocatorSettings` resource contains values that adjust the minimum
and maximum slab sizes, the cutoff point at which meshes get their own
dedicated slabs, and the rate at which slabs grow. Hopefully-sensible
defaults have been chosen for each value.

Unfortunately, WebGL 2 doesn't support the *base vertex* feature, which
is necessary to pack vertex arrays from different meshes into the same
buffer. `wgpu` represents this restriction as the downlevel flag
`BASE_VERTEX`. This patch detects that bit and ensures that all vertex
buffers get dedicated slabs on that platform. Even on WebGL 2, though,
we can combine all *index* arrays into single buffers to reduce buffer
changes, and we do so.

The following measurements are on Bistro:

Overall frame time improves from 8.74 ms to 5.53 ms (1.58x speedup):
![Screenshot 2024-07-09 163521](https://github.com/bevyengine/bevy/assets/157897/5d83c824-c0ee-434c-bbaf-218ff7212c48)

Render system time improves from 6.57 ms to 3.54 ms (1.86x speedup):
![Screenshot 2024-07-09 163559](https://github.com/bevyengine/bevy/assets/157897/d94e2273-c3a0-496a-9f88-20d394129610)

Opaque pass time improves from 4.64 ms to 2.33 ms (1.99x speedup):
![Screenshot 2024-07-09 163536](https://github.com/bevyengine/bevy/assets/157897/e4ef6e48-d60e-44ae-9a71-b9a731c99d9a)

## Migration Guide

### Changed

* Vertex and index buffers for meshes may now be packed alongside other
buffers, for performance.
* `GpuMesh` has been renamed to `RenderMesh`, to reflect the fact that
it no longer directly stores handles to GPU objects.
* Because meshes no longer have their own vertex and index buffers, the
responsibility for the buffers has moved from `GpuMesh` (now called
`RenderMesh`) to the `MeshAllocator` resource. To access the vertex data
for a mesh, use `MeshAllocator::mesh_vertex_slice`. To access the index
data for a mesh, use `MeshAllocator::mesh_index_slice`.

[`offset-allocator`]: https://github.com/pcwalton/offset-allocator
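
Based on that migration guide, the new access pattern looks roughly like this. The import paths, the exact signatures, and the shape of the returned slice (assumed to expose the backing buffer and the range the mesh occupies) are assumptions:

```rust
use bevy::asset::AssetId;
use bevy::prelude::Mesh;
use bevy::render::mesh::allocator::MeshAllocator; // assumed path

/// Hypothetical helper: `RenderMesh` no longer owns buffers, so ask the
/// `MeshAllocator` where a mesh's data lives.
fn log_mesh_storage(allocator: &MeshAllocator, mesh_id: AssetId<Mesh>) {
    // Assumed to return `None` if the mesh hasn't been uploaded yet.
    if let Some(vertex_slice) = allocator.mesh_vertex_slice(&mesh_id) {
        println!("vertex data occupies {:?} of a shared buffer", vertex_slice.range);
    }
    if let Some(index_slice) = allocator.mesh_index_slice(&mesh_id) {
        println!("index data occupies {:?} of a shared buffer", index_slice.range);
    }
}
```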