Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix Vulkan CPU/GPU synchronization #80566

Closed

Conversation

darksylinc
Copy link
Contributor

@darksylinc darksylinc commented Aug 12, 2023

The original code had two bugs:

  1. It assumed Swapchain count can be used for proper CPU/GPU synchronization (it's not)
  2. It used a double buffer scheme (FRAME_LAG = 2) but the vkWaitForFences/vkResetFences calls were in the wrong order and in the wrong place
  3. The swapchain bug forced a quad-buffer scheme with roughly 1 or 2 frames of wait; which hid all the issues

This commit gets rid of RenderingDeviceVulkan::frame_count & RenderingDeviceVulkan::frame which should've never existed and makes everything depend on VulkanContext::frame_index &
VulkanContext::FRAME_LAG

See #80550 for details

Fix #80550
Likely fixes #81670.

Specific things that need testing:

Aside from overall stability and ensure there are no regressions

Note: This PR may uncover bugs that were previously hidden or that were rare to reproduce. This may give the illusion that the code has become more unstable when in reality it's just uncovering hidden flaws.

@BastiaanOlij
Copy link
Contributor

I need to test this sometime during the week with VR. I've had a lot of timing issues because our frame timing seemed to clash with OpenXRs xrWaitFrame logic, or at least thats always been my conclusion, but possibly this is the real issue.

@darksylinc
Copy link
Contributor Author

darksylinc commented Aug 13, 2023

OK I have evidence that lag has been reduced.

I couldn't use OBS because it introduced atrocious lag when enabled. I had to use an external camera.

I've ran multiple non-scientific tests and it felt better but I was heavily biased (I want to claim my version does something); so I had to do it with a more precise, scientific method.

Methodology

  1. Xubuntu 20.04 XFCE. No Compositor (Window Manager Tweaks -> Compositor -> Enable display compositing must be off)
  2. X11 - Vulkan (Forward+) - dedicated AMD Radeon RX 6800 XT - AMD Ryzen 9 5900X 12-Core Processor (24 Threads)
  3. Using RADV driver. Mesa 23.0.4 - kisak-mesa PPA
  4. TearFree enabled (xrandr --output DisplayPort-1 -set TearFree on)
  5. LG FLATRON E2342 1920x1080@60hz
  6. Open near-empty project (PopupCrash).
  7. Camera used for recording: Xiaomi POCO F2 set to 1080p@60 FPS
  8. VSync on
  9. Update Continuously is off
  10. Mouse movement is emulated with the following bash script (use sudo apt install xautomation if not installed):
while true
do
	xte 'mousemove 100 100'
	xte 'mousemove 100 125'
	xte 'mousemove 100 150'
	xte 'mousemove 100 175'
	xte 'mousemove 100 200'
	xte 'mousemove 100 225'
	xte 'mousemove 100 250'
	xte 'mousemove 100 275'
	xte 'mousemove 100 300'
	xte 'mousemove 100 325'
	xte 'mousemove 100 350'
	xte 'mousemove 100 325'
	xte 'mousemove 100 300'
	xte 'mousemove 100 275'
	xte 'mousemove 100 250'
	xte 'mousemove 100 225'
	xte 'mousemove 100 200'
	xte 'mousemove 100 175'
	xte 'mousemove 100 150'
	xte 'mousemove 100 125'
	xte 'mousemove 100 100'
done

These are the results (I've had to re-encode the videos because they were too big for Github):

Before PR
https://github.com/godotengine/godot/assets/3395130/63a681df-98e6-4836-aa24-cf0e43977b58

After PR
https://github.com/godotengine/godot/assets/3395130/d5b40e06-23f5-4742-8392-5cd9cb674bd8

We can notice:

  1. Before PR the highlight lags behind a lot
  2. Before PR the hightlight is almost always 9 highlights behind (when the cursor is in "New Inherited Scene..." the highlight is in "Quick Open Scene...")
  3. After PR lags behind a lot less
  4. Even though in a few occasions it manages to be 9 highlights behind, it happens twice in the video, while most often is only at most 7 highlights behind; while most of the time is 3 highlights behind.

Overall I believe the data is consistent with latency being reduced by 1 frame (16 ms). I encourage others to do their own research and validate the result.

Unfortunately I only run the test once (I'm a bit of a hurry, tomorrow it's election day).

IMHO it feels noticeable snappier too. Of course it will never reach the level of snappiness you can get with VSync Off, however theoretically the latency with VSync off should be reduced as well.

Copy link
Member

@clayjohn clayjohn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some initial code review. Overall, I think this change is fantastic. But we indeed need to get as much testing as possible

I tested locally on PopOS 22.04 with Intel integrated graphics, X11 WM. I didn't see any perceptible changes in performance/lag/stability. Although my experience was already quite good.

drivers/vulkan/vulkan_context.cpp Outdated Show resolved Hide resolved
drivers/vulkan/vulkan_context.cpp Outdated Show resolved Hide resolved
drivers/vulkan/vulkan_context.cpp Outdated Show resolved Hide resolved
@@ -8933,11 +8968,8 @@ void RenderingDeviceVulkan::initialize(VulkanContext *p_context, bool p_local_de
context = p_context;
device = p_context->get_device();
if (p_local_device) {
frame_count = 1;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The implications of removing this aren't totally clear to me. The logic with p_local_device is slightly different as it isn't connected to a window in that case. I'll have to think about this more.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi!

There are two independent things going on:

  1. What frame_count was supposed to do. I've been meaning to ask you that. (and it sounds like this should be user-provided rather than assuming a value).
    • Right now FRAME_LAG does the job of the max frames to be ahead. It is hardcoded but I will be modifying this to be user-tweakable
  2. frame_count was completely broken as I described in Vulkan GPU/CPU synchronization is wrong #80550 .

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The issue I am referring to comes from the fact that the local device doesn't have a swapchain and it never calls swap_buffers. It is essentially one frame, and it has to call submit() then sync() in order to submit more operations. In those cases, frame_count should always be 1 as there will literally never be more than one "frame".

Obtaining a global frame count from the context fundamentally breaks thing for local devices.

I'm not opposed to grabbing the frame count from context, but if we do that, then we need more checks if we are using a local device to ensure we aren't referencing resources that don't exist

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I commented more on this in another spot. I think this PR will break local rendering devices in a subtle way

@darksylinc
Copy link
Contributor Author

darksylinc commented Aug 19, 2023

Downloads for testing

@darksylinc
Copy link
Contributor Author

darksylinc commented Aug 20, 2023

I'm adding new functionality.

In Window -> V-Sync -> Buffer Count users can now set the FRAME_LAG (which was previously hardcoded) anywhere in range [1; 4].

Since it must be provided at very early initialization, I didn't add a gdscript interface for it. That can be added & figured out later by anyone else.

Godot

The plumbing for that is done now (you may have feedback for changes of course, though).

This setting should also address @clayjohn concerns; although they may be misplaced.

Even if Godot has no windows, VulkanContext::swap_buffers() still must be called. Whether frame_count is 1 or 2 is irrelevant to whether there are windows.
frame_count just means how much the CPU is allowed to get ahead of the GPU.

Edit: I've pushed it as two commits because I want CI to double check I got everything correctly for every platform. Having it as 2 commits also makes evaluation easier. If all is well, I will be squash the commits into one.

@Calinou
Copy link
Member

Calinou commented Aug 20, 2023

Since it must be provided at very early initialization, I didn't add a gdscript interface for it. That can be added & figured out later by anyone else.

This may require something like godotengine/godot-proposals#6423 to be changeable at run-time.

Comment on lines 834 to 841
Sets the number of buffers before stalling to wait for the GPU.
2 corresponds to double buffering and 3 to triple buffering.
Single buffering may give you the lowest lag but at the high cost of no parallelism between CPU & GPU
Assuming you're able to reach 60fps all the time, then double buffering gives you the lowest lag.
Use Triple buffering if V-Sync is enabled and the GPU is struggling to keep 60 fps (or whatever the monitor refresh rate is).
Triple buffering is almost a must if you're using V-Sync MAILBOX mode.
[b]Note:[/b] This setting is not supported by all APIs.
[b]Note:[/b] This property is only read when the project starts.
Copy link
Member

@Calinou Calinou Aug 20, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Every line break creates a new paragraph, so I suggest using those sparingly.

Suggested change
Sets the number of buffers before stalling to wait for the GPU.
2 corresponds to double buffering and 3 to triple buffering.
Single buffering may give you the lowest lag but at the high cost of no parallelism between CPU & GPU
Assuming you're able to reach 60fps all the time, then double buffering gives you the lowest lag.
Use Triple buffering if V-Sync is enabled and the GPU is struggling to keep 60 fps (or whatever the monitor refresh rate is).
Triple buffering is almost a must if you're using V-Sync MAILBOX mode.
[b]Note:[/b] This setting is not supported by all APIs.
[b]Note:[/b] This property is only read when the project starts.
Sets the number of buffers before stalling to wait for the GPU. A higher values allows for higher framerates in some situations at the cost of higher visual latency. 2 corresponds to double buffering and 3 to triple buffering. Single buffering may give you the lowest latency but at the high cost of no parallelism between CPU & GPU
Assuming you're able to reach the monitor's refresh rate all the time, double buffering gives you the lowest latency while still having some level of parallelism.
Use Triple buffering if V-Sync is enabled and the GPU is struggling to keep up with the monitor's refresh rate. Triple buffering is almost a must if you're using the [b]Mailbox[/b] V-Sync mode.
[b]Note:[/b] This setting is only supported with Vulkan-based rendering methods (Forward+ and Mobile), not Compatibility.
[b]Note:[/b] This property is only read when the project starts. There is currently no way to change this value at run-time.

Also, are the terms "double buffering" and "triple buffering" correct here? I see those being used to refer to V-Sync behavior, with double buffering offering less latency but a "permanent" drop to 30 FPS (or 20 FPS, 15 FPS, …) whenever you can't sustain 60 FPS. I thought this was configured separately, see godotengine/godot-proposals#4065.

@therabug
Copy link

2023-10-22.12-42-22.1.1.mp4

i think you guys still need to test before merging. This bug still can happen if you overload too much

@@ -154,7 +159,7 @@ class VulkanContext {
VkSwapchainKHR swapchain = VK_NULL_HANDLE;
SwapchainImageResources *swapchain_image_resources = VK_NULL_HANDLE;
VkPresentModeKHR presentMode = VK_PRESENT_MODE_FIFO_KHR;
VkSemaphore image_acquired_semaphores[FRAME_LAG];
VkSemaphore image_acquired_semaphores[MAX_FRAME_LAG] = {};
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you elaborate why multiple acquired semaphores are needed per window?
AFAIK, only one swapchain image is generally acquired at a time. (this is actually an issue I was looking into for #84137 and in fact, I saw (in superficial testing) no detriment in keeping just 1 image_aquired_semaphore per Window.

(I'm quite possibly bad and wrong about this)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi!

I had the same question actually!

Here's the thing:

  1. Indeed just one semaphore for all windows is how I think should be properly done (unless you're careful with barriers and want to explicitly start doing Compute Work using async compute for the next frame while the current frame is still being worked on)
  2. In OgreNext we use one semaphore and funny enough, that triggered driver bugs for various drivers (which have been long been fixed)
  3. Official and unofficial samples use one semaphore per swapchain (this is most likely why the drivers didn't catch that bug for OgreNext...), and since I saw that Godot was doing the same, I decided not to change this.

But yes, for the record for desktop*, I am more comfortable with one semaphore for everything because that's how I think proper synchronization should be.

* I say desktop because I'm having timing issues on Android that I can't fix in OgreNext and I never tested the theory of whether they're caused because of having one semaphore for everything. But on the other hand, Android has a lot of problems of all sorts.

@@ -1680,7 +1680,10 @@ Error VulkanContext::_create_semaphores() {
/*pNext*/ nullptr,
/*flags*/ VK_FENCE_CREATE_SIGNALED_BIT
};
for (uint32_t i = 0; i < FRAME_LAG; i++) {

CRASH_COND_MSG(frame_count == 0, "Used before initialization.");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wonderful! I ran into that same issue trying a provisional fix to what your PR does, this seems much cleaner than the current logic on master.

[code]2[/code] corresponds to double buffering and [code]3[/code] to triple buffering. A value of [code]1[/code] is not recommended.
Double buffering may give you the lowest lag/latency but if VSync is on and the system can't render at 60 fps, the framerate will go down in multiples of it (e.g. 30 fps, 15, 7.5, etc ). Triple buffering gives you higher framerate (specially if the system can't reach a constant 60 fps) at the cost of up to 1 frame of latency (when VSync is On in FIFO mode).
Use Double buffering if V-Sync is off. Triple buffering is a must if you plan on using V-Sync MAILBOX mode.
Try the [url=https://darksylinc.github.io/vsync_simulator/]V-Sync Simulator[/url], an interactive interface that simulates presentation to better understand how it is affected by different variables under various conditions.
Copy link
Contributor

@thygrrr thygrrr Nov 6, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this link should go into the Godot documentation.

Regarding the simulator itself, nice tool! I wish the sliders had manual input fields and/or more precision at the bottom of its range (logarithmic scale). I'm generally interested in high frame rate scenarios (<10 ms vblank time, and values like 8 and 6 ms are really difficult to adjust with how the sliders work)

Copy link
Member

@Calinou Calinou Nov 6, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this link should go into the Godot documentation.

Why not? We have plenty of external links in the class reference and documentation in general.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
Try the [url=https://darksylinc.github.io/vsync_simulator/]V-Sync Simulator[/url], an interactive interface that simulates presentation to better understand how it is affected by different variables under various conditions.
Try the [url=https://darksylinc.github.io/vsync_simulator/]V-Sync Simulator[/url], an interactive interface that simulates presentation, to better understand how it is affected by different variables under various conditions.

Add punctuation to separate clause

if (desiredNumOfSwapchainImages < surfCapabilities.minImageCount) {
desiredNumOfSwapchainImages = surfCapabilities.minImageCount;
}
uint32_t desiredNumOfSwapchainImages = MAX(surfCapabilities.minImageCount, swapchain_desired_count);
Copy link
Contributor

@thygrrr thygrrr Nov 6, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some sources I found seem to state it's sensible to pick the lower bound as 1 + surfCapabilities.minImageCount (this might, on particularly conservative implementations, relax the contention for driver resources a bit when acquiring the next image; the common case here is still that the default 3 images are the value that you are gonna get)

Since you have made the desired values configurable, a potential extra setting could make this configurable as well, defaulting to an offset of 0 or 1. (or a simple flag)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I found all references to this info trace back to the Vulkan Tutorial.

After research I found the following:

  1. The OSes that have sync issues (like X11 on Linux) already set minImageCount to a large enough value like 4 or 5.
  2. I suspect it is highly likely that the tutorial confused "wait on the driver to complete internal operations" with how VSync is supposed to work. This is known and can be observed in my VSync Simulator. If you set swapchain_count to 2 and the rendering time is higher than the refresh rate, then the entire framerate will alternate to an integer divisor of the refresh rate (e.g. 60fps, 30 fps, 15 fps). Depending on how the OS compositor internally works, this might happen with swapchain_count > 2.
    • Seeing the other ticket, it is clear some users explicitly want this behavior (because if the CPU & GPU are fast enough to not miss any VBLANK, then latency is minimum).

@@ -57,7 +57,8 @@
return _window_create(p_window_id, p_vsync_mode, surface, p_width, p_height);
}

VulkanContextIOS::VulkanContextIOS() {}
VulkanContextIOS::VulkanContextIOS() {
Copy link
Contributor

@thygrrr thygrrr Nov 6, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

accidental line break.

<member name="display/window/vsync/buffer_count" type="int" setter="" getter="" default="2">
Sets the number of buffers before stalling to wait for the GPU.
1 may give you the lowest lag/latency but at the high cost of no parallelism between CPU &amp; GPU.
Try the [url=https://darksylinc.github.io/vsync_simulator/]V-Sync Simulator[/url], an interactive interface that simulates presentation to better understand how it is affected by different variables under various conditions.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
Try the [url=https://darksylinc.github.io/vsync_simulator/]V-Sync Simulator[/url], an interactive interface that simulates presentation to better understand how it is affected by different variables under various conditions.
Try the [url=https://darksylinc.github.io/vsync_simulator/]V-Sync Simulator[/url], an interactive interface that simulates presentation, to better understand how it is affected by different variables under various conditions.

@Opalie
Copy link

Opalie commented Nov 15, 2023

Hi, would it be possible to get the build artifacts of this PR ? They are sadly not available anymore. Thank you very much

@Riteo
Copy link
Contributor

Riteo commented Nov 20, 2023

@Opalie unfortunately the artifacts have expired and it looks like we can't restart the jobs, sorry. I got suggested that a viable route would be to fork this branch and make a small change as to re-trigger them for your fork. This should be all doable from the web UI but might be a bit cumbersome.

I'd personally recommend building it from source, but only if you feel comfortable with the idea. Here's some docs if that were the case: https://docs.godotengine.org/en/stable/contributing/development/compiling/index.html

@vpellen
Copy link

vpellen commented Jan 26, 2024

Is this just waiting on review, or has it effectively been superseded by #87340?

@DarioSamo
Copy link
Contributor

Is this just waiting on review, or has it effectively been superseded by #87340?

It'd effectively be superseded by that, so if you're waiting on this one I'd suggest confirming if that PR fixes the problem as well. The PRs will be completely incompatible and Matias has reviewed that I implemented what he detected correctly in that PR as well, as well as fetch the settings/doc changes from it.

@vpellen
Copy link

vpellen commented Feb 10, 2024

Just a minor follow up, I did test the other PR and it doesn't seem to fix the obnoxious stuttering, although the curious thing is that neither of the directx implementations (both the angle compatibility mode and the recent dx12 stuff) suffer from the stuttering issues, so this may be some insufferable nvidia vulkan/opengl issue. The hunt for "why does vsync cause so much misery" continues. Still, at least there are options now.

<member name="display/window/vsync/swapchain_count" type="int" setter="" getter="" default="3">
Sets the number of swapchains (back buffer + front buffer).
[code]2[/code] corresponds to double buffering and [code]3[/code] to triple buffering. A value of [code]1[/code] is not recommended.
Double buffering may give you the lowest lag/latency but if VSync is on and the system can't render at 60 fps, the framerate will go down in multiples of it (e.g. 30 fps, 15, 7.5, etc ). Triple buffering gives you higher framerate (specially if the system can't reach a constant 60 fps) at the cost of up to 1 frame of latency (when VSync is On in FIFO mode).
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
Double buffering may give you the lowest lag/latency but if VSync is on and the system can't render at 60 fps, the framerate will go down in multiples of it (e.g. 30 fps, 15, 7.5, etc ). Triple buffering gives you higher framerate (specially if the system can't reach a constant 60 fps) at the cost of up to 1 frame of latency (when VSync is On in FIFO mode).
Double buffering may give you the lowest lag/latency but if VSync is on and the system can't render at 60 fps, the framerate will go down in multiples of it (e.g. 30 fps, 15, 7.5, etc.). Triple buffering gives you higher framerate (specially if the system can't reach a constant 60 fps) at the cost of up to 1 frame of latency (when VSync is On in FIFO mode).

@akien-mga
Copy link
Member

Superseded by #87340. Thanks for the great work on this!

@akien-mga akien-mga closed this Feb 14, 2024
@AThousandShips AThousandShips removed this from the 4.3 milestone Feb 14, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
Archived in project