
Acyclic Command Graph for Rendering Device #84976

Merged
merged 1 commit on Jan 9, 2024

Conversation

DarioSamo
Contributor

@DarioSamo DarioSamo commented Nov 16, 2023

Background

@reduz proposed the idea a while ago of refactoring RenderingDevice to automatically build a graph out of the commands submitted to the class (outlined here). The first stepping stone towards implementing this was @RandomShaper's PR (#83452), which splits the commands into easily serializable parameters. Merging this PR therefore requires merging that PR first (and if you wish to review this, only look at my individual commit on top of Pedro's changes).

Improvements

This PR makes the following improvements towards implementing @reduz's idea.

  • RenderingDevice's complexity has been drastically reduced, as it no longer needs to solve pipeline barriers or layout transitions for pretty much any of its functionality. This responsibility has been delegated to a new class called RenderingDeviceGraph.
  • The number of vkCmdPipelineBarrier calls has been reduced immensely. On average I've measured about 60-80% fewer barrier calls per frame compared to master.
  • RenderingDeviceGraph is capable of reordering commands based on the dependencies between the resources they use, so that submitted commands are processed as early as possible. This gives the driver a much better chance of parallelizing the work effectively.
  • RenderingDeviceGraph groups as many barriers as possible into 'levels' depending on the usage of the resources. These barriers are submitted before the commands of each level are processed, performing any layout transitions or synchronization that is required.
  • RenderingDevice's API has been simplified, as some parameters are no longer required.
    • Barrier bitmasks are gone. They no longer serve any purpose.
    • Draw list and compute list overlapping no longer needs to be specified.
    • 'Split draw lists' are gone, as they can be automatically recorded by the graph instead; this has already been shown to be viable (although it is disabled behind an experimental macro for now).
    • Draw lists no longer need to specify their initial and final actions in excruciating detail. The operations are much simpler now: load, clear or discard for initial; store or discard for final. The detail behind the original action no longer serves any purpose, as the graph automatically skips any transitions that are not required when commands that use the same image layout are chained together (e.g. render passes).
    • Draw lists no longer need to specify storage textures, as that is not required at all.
  • Both Forward+ and Mobile have been adapted to the new API, resulting in a net removal of code complexity that is no longer required, since the graph automatically solves what that code was doing by hand.
  • A lot of existing Vulkan synchronization errors caused by the previous barrier system are solved automatically by the graph.
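The reordering into barrier 'levels' described above can be sketched as a topological leveling: each command is assigned to the earliest level that satisfies its read/write dependencies, and one grouped barrier can then be emitted per level. This is a minimal sketch with hypothetical names, not the actual RenderingDeviceGraph implementation:

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <map>
#include <vector>

// Hypothetical sketch: commands declare the resources they read/write,
// and the graph assigns each command to the earliest possible "level".
// Commands in the same level have no dependencies on each other, so a
// single grouped barrier can be submitted before the level executes.
struct Command {
    std::vector<int> reads;
    std::vector<int> writes;
};

std::vector<uint32_t> assign_levels(const std::vector<Command> &commands) {
    std::map<int, uint32_t> last_writer_level; // resource -> level of last write
    std::map<int, uint32_t> last_reader_level; // resource -> latest level of any read
    std::vector<uint32_t> levels;
    for (const Command &cmd : commands) {
        uint32_t level = 0;
        // Reads must come after the last write to the resource (read-after-write).
        for (int r : cmd.reads) {
            auto it = last_writer_level.find(r);
            if (it != last_writer_level.end()) {
                level = std::max(level, it->second + 1u);
            }
        }
        // Writes must come after previous writes and reads (WAW, WAR).
        for (int w : cmd.writes) {
            auto it = last_writer_level.find(w);
            if (it != last_writer_level.end()) {
                level = std::max(level, it->second + 1u);
            }
            auto rt = last_reader_level.find(w);
            if (rt != last_reader_level.end()) {
                level = std::max(level, rt->second + 1u);
            }
        }
        for (int r : cmd.reads) {
            auto &rl = last_reader_level[r];
            rl = std::max(rl, level);
        }
        for (int w : cmd.writes) {
            last_writer_level[w] = level;
        }
        levels.push_back(level);
    }
    return levels;
}
```

The payoff is that two commands with no shared resources land in the same level regardless of submission order, which is what gives the driver the chance to run them in parallel.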

Implementation

Since commands need to be deferred until later for reordering, everything submitted to RD is serialized into a large uint8 vector that grows as much as necessary. The capacity of all these vectors is reused across recordings to avoid reallocation as much as possible. The recorded commands are the bare minimum the RenderingDeviceDriver needs to execute the command after reordering is performed.
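The recording scheme described above can be sketched roughly as follows. The struct and method names here are hypothetical, not the actual RenderingDevice internals; the point is that trivially copyable command structs are appended to one flat byte vector, and clearing it keeps the capacity so later recordings reuse the allocation:

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <type_traits>
#include <vector>

// Hypothetical sketch of recording deferred commands into a flat byte vector.
// Only trivially copyable command structs are stored, so memcpy is safe.
struct RecordedBuffer {
    std::vector<uint8_t> data;

    template <typename T>
    void record(const T &command) {
        static_assert(std::is_trivially_copyable<T>::value, "commands must be POD");
        size_t offset = data.size();
        data.resize(offset + sizeof(T));
        memcpy(data.data() + offset, &command, sizeof(T));
    }

    template <typename T>
    const T *read(size_t offset) const {
        return reinterpret_cast<const T *>(data.data() + offset);
    }

    // Clearing keeps the vector's capacity, so the next recording
    // reuses the allocation instead of growing from zero again.
    void reset() {
        data.clear();
    }
};
```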

As expected, this PR adds CPU overhead that was not there before, due to the additional time spent recording and reordering commands. While there's one very obvious optimization indicated in the TODO list that is still pending, the rest of the extra work won't be as easy to overcome. This extra cost can also be offset tremendously by using secondary command buffers, which will be enabled in this PR as soon as the issue described in the TODO is figured out.

Compatibility breakage

  • RenderingDevice's binary compatibility is, as expected, not guaranteed, due to a lot of arguments being removed from the functions. I've provided compatibility wrappers as required, although I'm not yet certain they work as intended.
  • If the compatibility wrappers work as intended, there should be no need to change the behavior of code dependent on RD: most of the time the additional detail that was provided to the functions is simply ignored, as the graph can solve it on its own.
  • In some cases compatibility may break because the graph performs additional validation of whether some operations are allowed. The validation that RenderingDeviceGraph includes:
    • Checking whether the same resource is used with different usages in the same command. This was already found to be the case with a couple of effects that used the indirect buffer as both dispatch and storage, leading to undefined behavior.
    • Checking whether overlapping slices of the same resource are used with different usages in the same command. This is not allowed, as the layout transitions are impossible to solve effectively and can lead to race conditions on the GPU. Luckily, I could not find such a case in the existing rendering code, but if some code runs into this, it must be fixed on the user side and not in the graph.
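The slice-overlap check in the second bullet boils down to interval intersection over mip levels and array layers. A minimal sketch, with a hypothetical slice struct rather than the engine's actual tracker types:

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical texture slice: a range of mipmaps and a range of array layers.
struct Slice {
    uint32_t mip_begin, mip_count;
    uint32_t layer_begin, layer_count;
};

// Two slices overlap only if both their mip ranges and their layer ranges
// intersect. A graph along these lines would reject a command that uses
// overlapping slices with different usages, since the required layout
// transitions cannot be solved.
bool slices_overlap(const Slice &a, const Slice &b) {
    bool mips = a.mip_begin < b.mip_begin + b.mip_count &&
                b.mip_begin < a.mip_begin + a.mip_count;
    bool layers = a.layer_begin < b.layer_begin + b.layer_count &&
                  b.layer_begin < a.layer_begin + a.layer_count;
    return mips && layers;
}
```

For example, two slices of the same texture that share a mip level but cover disjoint array layers do not overlap, so using them with different usages in one command would still be legal.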

Performance improvements

GPU performance improvements are expected across the board, as long as the CPU overhead isn't slowing the game down (which should go down with the future immutable change). The improvements will also vary depending on how much the particular IHV was suffering from inefficient barrier usage. One area that will particularly benefit is projects using GPU particles, as their processing will be much more parallelized than it was before.

At least on an NVIDIA 3090 Ti, I've noticed an overall ~10% frame time improvement in several projects, with potentially bigger wins on platforms like AMD, which can parallelize effectively on a single queue, or on mobile hardware that does not handle barriers as gracefully as NVIDIA does.

Future improvements

These will be researched after this PR is merged as a second iteration.

  • A dedicated transfer queue for resources that use the setup command buffer, so uploads can run in parallel and synchronize when it's time to process the drawing commands.
  • Support for multiple graphics and compute queues that will split the work of the graph, to support parallelization on hardware that can take advantage of it more effectively (like NVIDIA).

TODO

  • Debug broken uniform set in TPS demo (will be fixed in Pedro's PR soon).
  • Debug strange memory usage increase far beyond what should be expected.
  • Update documentation to match the new API.
  • Fix the C# glue that is not being generated properly due to the RD API change.
  • Double check if MSAA is working on Forward+ and Mobile.
  • Attempt new mutable/immutable tracker design that does not require explicit flags from the engine. This requires working out a way to refresh the trackers used by vertex, index and uniform set arrays once those resources turn into mutables. All dependencies must be made mutable.
  • Debug strange issue in NVIDIA where the editor will show up completely black when using secondary command buffers depending on the contents of the draw list. This currently blocks secondary command buffers from being enabled. I've been unable to determine the root of the issue so far. (Postponed until we get feedback from NVIDIA)

Production edit: closes godotengine/godot-roadmap#29

@Calinou Calinou added this to the 4.x milestone Nov 16, 2023
@DarioSamo DarioSamo force-pushed the rd_common_render_graph branch 8 times, most recently from 3e9b18a to 7a3b4e6 on November 28, 2023 13:14
@DarioSamo DarioSamo force-pushed the rd_common_render_graph branch 2 times, most recently from 3bd08dd to 7973cd0 on December 4, 2023 14:17
@DarioSamo DarioSamo marked this pull request as ready for review December 4, 2023 16:45
@DarioSamo DarioSamo requested review from a team as code owners December 4, 2023 16:45
@DarioSamo
Contributor Author

Opening this for review. We won't take any steps towards merging this until the items in the TODO are done and the PR this is based on is merged, but I expect it will take a while to effectively review all these changes.

@DarioSamo DarioSamo changed the title Acyclic Command Graph for Rendering Device [Prototype] Acyclic Command Graph for Rendering Device Dec 4, 2023
@BastiaanOlij
Copy link
Contributor

Some initial testing: after fixing an issue in @RandomShaper's PR, this is working on both desktop VR (tested with a Valve Index headset) and on Quest (tested on a Quest 3). I have more testing to do.

This was done with the mobile renderer. There weren't any obvious performance improvements, but I wasn't expecting any: the mobile renderer already uses a minimal number of passes, so there isn't much for the graph to optimise.

I did notice when testing on the Quest 3 that MSAA broke. As far as I can tell, it's resolving MSAA before rendering has finished, so there is an issue in the barriers. I did not test this with just @RandomShaper's PR, so I'm not 100% sure whether this is introduced by the acyclic graph or whether we're missing something in the new barrier code.

@DarioSamo
Contributor Author

Some further investigation on the scene itself indicates the player character or something related might be at fault, but a RenderDoc capture doesn't show anything strange about the synchronization: it is in fact pretty much correct as far as the synchronization between the compute skinning step and the depth pre-pass goes (using a single memory barrier).

@DarioSamo
Contributor Author

In a weird twist of fate, it seems enabling buffer barriers also fixes the issue, while retaining the ability to reorder the graph and avoiding having to rely on full barriers to synchronize on the AMD Radeon RX Vega M.

Since the main suspect is the compute skinning right now, it might be a good idea to try to exaggerate the issue on the project by creating multiple skinned characters so if any race condition exists, it'll be more likely to show up.

@DarioSamo
Contributor Author

The conclusions after talking with @akien-mga seem to be so far:

  1. The issue does not happen if buffer barriers are used, no matter how much the system is pushed by duplicating as many animated characters as possible. It returns instantly if buffer barriers are disabled again.
  2. The issue does not happen in the Windows 10 artifact from the PR on the same hardware with regular memory barriers instead.

We should probably discuss how to approach this, as it could be a hint of something being wrong in the driver/system combination itself. Reviewing the RenderDoc capture has not revealed anything apparent nor does the validation or synchronization layer show any errors about it. It might be possible to build a standalone sample using Vulkan that replicates the issue if we want to dedicate time to that.

Alternatively, we can enable buffer barriers by default at the cost of some performance and trusting that IHVs implement them correctly (or at least basically translate them to global barriers internally).

Member

@clayjohn clayjohn left a comment

Dario and I discussed this on chat. The issue that Remi is facing appears to go away when using buffer barriers, when using full barriers, or when not reordering commands. We suspect that it is caused by a driver bug as it only appears to reproduce on a specific combination of hardware.

For most drivers, buffer barriers are ignored/promoted to memory barriers. In theory there should be some overhead from adding them, but testing has shown that the overhead is minimal.

Accordingly, our plan is as follows:

  1. Enable buffer barriers by default.
  2. If @akien confirms that this new update works fine on his hardware, merge this before Dev2.
  3. Create an MRP for AMD/Mesa and submit a bug report.
  4. Remove the buffer barriers once the problematic driver is fixed, or once we find a better workaround.

@DarioSamo
Contributor Author

DarioSamo commented Jan 6, 2024

Well, this is a bad discovery to make at the last minute: it turns out that at some point my Vulkan validation misconfigured itself and turned off synchronization checking, and now I get some of the synchronization errors that @akien-mga was reporting (not the error that was reported on the scene, however; the visuals themselves are still fine). I realized this when I went to test another project and wondered why I was not getting synchronization errors in a more obvious scenario.

I'd suggest not merging this until these synchronization errors are addressed, as there are quite a lot more of them than I thought, due to the validation layer turning itself off at some point during development.

EDIT: Upon further testing, I can confirm that when forcing the full barrier access bits, most of the errors are gone at least, so the render graph logic itself seems fine; it just needs some further tweaking for correctness and analysis of what's missing in these cases.

@DarioSamo
Contributor Author

I was able to solve most of the synchronization errors. One of the solutions will probably remain a bit temporary until a more proper fix is figured out, but it's not exactly a pressing case, as it involves an edge case with slice transitions (mostly due to how reflection probes behave).

There's another synchronization error in the TPS project, but it seems unrelated to the graph and has more to do with that project's texture uploads in particular. It's worth checking whether that error also shows up in master at the moment, or if #86855 might be related.

@Ansraer
Contributor

Ansraer commented Jan 7, 2024

Oh thank god. I am building a PR on top of the RenderGraph and couldn't figure out why the layers were screaming at me when I hadn't even launched my new compute shader yet.

@DarioSamo
Contributor Author

Oh thank god. I am building a PR on top of the RenderGraph and couldn't figure out why the layers were screaming at me when I hadn't even launched my new compute shader yet.

Were you using only validation, or synchronization as well? I haven't seen errors with regular validation so far, but don't hesitate to report anything that might've been missed.

@akien-mga
Member

I retested the latest version of this PR (d7ea8b7). I confirm that:

  • With buffer barriers (current PR), the skinning glitch I reproduce on Mesa radv is no longer present.
  • If I disable buffer barriers, the glitch comes back.

@DarioSamo
Contributor Author

@akien-mga tested a standalone Vulkan sample that I created, but we were unable to reproduce the glitch he's getting when using only memory barriers instead of buffer barriers. It seems it'll be much harder to trace exactly what is failing here and which part of the operations is corrupting it.

Adds a new system to automatically reorder commands, perform layout transitions and insert synchronization barriers based on the commands issued to RenderingDevice.
Member

@clayjohn clayjohn left a comment


Most recent version looks good. I tested locally with the synchronization layer enabled and can confirm that the errors present in the last version are now gone.

At this point I am comfortable saying that this is ready for merging before Dev2.

@akien-mga akien-mga merged commit e9695d9 into godotengine:master Jan 9, 2024
15 checks passed
@akien-mga
Member

Thanks and congrats, this is an amazing change! 🎉 🥇
