Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Performance issue with debugPrintfEXT using Vulkan 1.1 #7562

Open
SaschaWillems opened this issue Feb 22, 2024 · 23 comments · Fixed by #8323
Open

Performance issue with debugPrintfEXT using Vulkan 1.1 #7562

SaschaWillems opened this issue Feb 22, 2024 · 23 comments · Fixed by #8323
Assignees
Labels
DebugPrintf debugPrintfEXT GPU-AV GPU Assisted Validation

Comments

@SaschaWillems
Copy link
Contributor

Environment:

  • OS: Windows 11
  • GPU and driver version: NVIDIA RTX 4070 538.42
  • SDK or header version if building from repo: SDK 1.3.275
  • Options enabled (synchronization, best practices, etc.): -

Describe the Issue

When using debugPrintfEXT along with VkValidationFeaturesEXT to fetch the debug output with a debug messenger callback, performance is so severely impacted, that the application is no longer interactive.

The setup:

Application:

	std::vector<VkValidationFeatureEnableEXT>  validation_feature_enables = {VK_VALIDATION_FEATURE_ENABLE_DEBUG_PRINTF_EXT};
	std::vector<VkValidationFeatureDisableEXT> disables{};

	VkValidationFeaturesEXT validation_features{VK_STRUCTURE_TYPE_VALIDATION_FEATURES_EXT};
	validation_features.enabledValidationFeatureCount = 1;
	validation_features.pEnabledValidationFeatures    = validation_feature_enables.data();
	validation_features.pDisabledValidationFeatures   = 0;
	validation_features.pDisabledValidationFeatures   = disables.data();

	VkApplicationInfo app_info{VK_STRUCTURE_TYPE_APPLICATION_INFO};
	app_info.pApplicationName = "Shader debugprintf";
	app_info.pEngineName      = "Vulkan Samples";
	app_info.apiVersion       = VK_API_VERSION_1_2;

	std::vector<const char *> validation_layers = {"VK_LAYER_KHRONOS_validation"};

	VkInstanceCreateInfo instance_create_info{VK_STRUCTURE_TYPE_INSTANCE_CREATE_INFO};
	instance_create_info.ppEnabledExtensionNames = enabled_extensions.data();
	instance_create_info.enabledExtensionCount   = static_cast<uint32_t>(enabled_extensions.size());
	instance_create_info.pApplicationInfo        = &app_info;
	instance_create_info.pNext                   = &validation_features;
	instance_create_info.ppEnabledLayerNames     = validation_layers.data();
	instance_create_info.enabledLayerCount       = static_cast<uint32_t>(validation_layers.size());

	VkInstance vulkan_instance;
	result = vkCreateInstance(&instance_create_info, nullptr, &vulkan_instance);

	VkDebugUtilsMessengerCreateInfoEXT debug_utils_messenger_create_info{VK_STRUCTURE_TYPE_DEBUG_UTILS_MESSENGER_CREATE_INFO_EXT};
	debug_utils_messenger_create_info.messageSeverity = VK_DEBUG_UTILS_MESSAGE_SEVERITY_INFO_BIT_EXT;
	debug_utils_messenger_create_info.messageType     = VK_DEBUG_UTILS_MESSAGE_TYPE_VALIDATION_BIT_EXT;
	debug_utils_messenger_create_info.pfnUserCallback = debug_utils_message_callback;

	VK_CHECK(vkCreateDebugUtilsMessengerEXT(vulkan_instance, &debug_utils_messenger_create_info, nullptr, &debug_utils_messenger));

Vertex shader:

	if (gl_VertexIndex == 0) {
		debugPrintfEXT("Position = %v4f", outPos);
	}

After some debugging with a minimal sample, I noticed that this only happens with Vulkan 1.1 (set in VkApplicationInfo). Using any other api version, performance is fine.

Summary:

  • VK_API_VERSION_1_0:
    • Debug output works, performance is fine, even though it shouldn't work at all with VK 1.0
  • VK_API_VERSION_1_1:
    • Debug output works, severe slowdown, application is pretty much unusable
  • VK_API_VERSION_1_2 and VK_API_VERSION_1_3:
    • Debug output works, performance is fine

I can only imagine that the combination of debugPrintfEXT, VkValidationFeaturesEXT and VK_API_VERSION_1_1 triggers an odd constellation in the layers that somehow causes a severe slowdown.

Expected behavior

Using debugPrintfEXT with VkValidationFeaturesEXT shows the same performance no matter what API version.

Valid Usage ID
n/a

Additional context
n/a

@spencer-lunarg spencer-lunarg added the DebugPrintf debugPrintfEXT label Feb 23, 2024
@spencer-lunarg spencer-lunarg self-assigned this Feb 23, 2024
@spencer-lunarg
Copy link
Contributor

I have zero explanation, but thanks for doing some debugging to narrow down this to being a strange Vulkan 1.1 issue

I will try and reproduce and figure out what is going on

@SaschaWillems
Copy link
Contributor Author

Oddly, I can no longer reproduce this. I had multiple SDKs installed, removed all of them and only installed the most recent one. So most probably this was a problem with my setup. Sorry for the noise.

@spencer-lunarg
Copy link
Contributor

ok, if you see it again PLEASE report it

@SaschaWillems SaschaWillems reopened this Mar 7, 2024
@SaschaWillems
Copy link
Contributor Author

Reopening this. I have submitted a PR for a shader debugprintf sample at KhronosGroup/Vulkan-Samples#945 (review) and one of the reviewers seems to have the exact same problem.

I couldn't reproduce it on my machine after cleaning up SDK installs.

@SaschaWillems
Copy link
Contributor Author

In case you need a repo, we have merged the shader debugprintf sample at https://github.com/KhronosGroup/Vulkan-Samples

I can now reproduce this most of the time, and others have reported that problem too. It may be the sample's fault after all, though I can't seem to see why.

@spencer-lunarg
Copy link
Contributor

we have merged the shader debugprintf sample

Yep I saw and plan to use that!

I have as of 2 days ago, started to really get working on Debug PrintF again, it had not had any real attention towards it and so hopefully very soon (pending I'm off next week) get fixed soon

Thanks for keeping this issue alive

@SRSaunders
Copy link

I can also confirm the following results for the same shader_debugprintf sample on macOS Ventura (modified to enable the portability enumeration extension and flag), using Vulkan SDK 1.3.283.0:

  • VK_API_VERSION_1_0:
    • Debug output works, performance is fine
  • VK_API_VERSION_1_1:
    • Debug output works, severe slowdown, application is pretty much unusable
  • VK_API_VERSION_1_2 and VK_API_VERSION_1_3:
    • Debug output works, performance is fine

@spencer-lunarg
Copy link
Contributor

ok, sorry for the delay, grabbed latest ToT of Vulkan-Samples and can experience this... time to dig in

@spencer-lunarg
Copy link
Contributor

ok so the slow frame is occurring from VK_CHECK(get_device().get_queue_by_present(0).wait_idle()); inside ApiVulkanSample::submit_frame()

The issue is not just DebugPrintf, but also GPU-AV which as of #7321 started to use VK_KHR_timeline_semaphore to not have things blocked on a single queue

--- a/layers/gpu/instrumentation/gpu_shader_instrumentor.cpp
+++ b/layers/gpu/instrumentation/gpu_shader_instrumentor.cpp
@@ -236,8 +236,8 @@ void GpuShaderInstrumentor::PreCallRecordCreateDevice(VkPhysicalDevice physicalD
         }
     } else if (api_version == VK_API_VERSION_1_1) {
         // Add our new extensions (will only add if found)
-        const std::string_view ts_ext{"VK_KHR_timeline_semaphore"};
-        vku::AddExtension(*modified_create_info, ts_ext.data());
+        // const std::string_view ts_ext{"VK_KHR_timeline_semaphore"};
+        // vku::AddExtension(*modified_create_info, ts_ext.data());
         add_missing_features();
     }
 }

confirmed this by adjusting the samples to

// framework/core/device.cpp 
        VkDeviceCreateInfo create_info{VK_STRUCTURE_TYPE_DEVICE_CREATE_INFO};
+       enabled_extensions.push_back(VK_KHR_TIMELINE_SEMAPHORE_EXTENSION_NAME);
 
// samples/extensions/shader_debugprintf/shader_debugprintf.cpp
        app_info.pEngineName      = "Vulkan Samples";
-       app_info.apiVersion       = VK_API_VERSION_1_1;
+       app_info.apiVersion       = VK_API_VERSION_1_2; 

and I get the same slowdown effects

@spencer-lunarg
Copy link
Contributor

ok, so was able to find the real source

Queue::Retire in layers/gpu/core/gpu_state_tracker.cpp was calling DispatchWaitSemaphoresKHR

this means when the app was using VK_API_VERSION_1_2 it was actually calling StubWaitSemaphoresKHR and not into the driver

If I swap the app to VK_API_VERSION_1_2 and then use DispatchWaitSemaphores it is also super slow

@jeremyg-lunarg
Copy link
Contributor

I think I see what is going on. The 'logic' I put into calling vvl::Queue::PostSubmit() occasionally misses a submission. This is usually hidden by calling the wrong version of vkWaitSemaphores(). THAT is hidden by the stub functions we create. So we're 3 bugs deep so far but getting closer

@spencer-lunarg
Copy link
Contributor

@SaschaWillems the latest VVL (at bottom of https://github.com/KhronosGroup/Vulkan-ValidationLayers/actions/runs/10115568549 when it finish building) should have this fixed, unfortunately it missed the 1.3.290 SDK, but thanks for reporting this as this would have been worse to figure out later why the synchronization inside GPU-AV was not working for something more complex

@SaschaWillems
Copy link
Contributor Author

Thank you for fixing this. Very much appreciated :)

@SaschaWillems
Copy link
Contributor Author

SaschaWillems commented Oct 22, 2024

I'm seeing the same issue again with the latest SDK, see KhronosGroup/Vulkan-Samples#1187

Might be a regression.

@spencer-lunarg
Copy link
Contributor

I will take a look today/tomorrow (and reply on the Samples PR), thanks for pinging here

@artem-lunarg
Copy link
Contributor

You can check if it started happening after #8546

@spencer-lunarg
Copy link
Contributor

going to just reopen, will take a look at the #8546 thanks @artem-lunarg for pointing that out

@spencer-lunarg
Copy link
Contributor

I just grabbed the last Samples, and hacked to be

-               set_api_version(instanceApiVersion <= VK_MAKE_API_VERSION(0, 1, 3, 290) ? VK_API_VERSION_1_2 : VK_API_VERSION_1_1);
+               set_api_version(VK_API_VERSION_1_1);
 
-       app_info.apiVersion       = instanceApiVersion <= VK_MAKE_API_VERSION(0, 1, 3, 290) ? VK_API_VERSION_1_2 : VK_API_VERSION_1_1;
+       app_info.apiVersion       = VK_API_VERSION_1_1;

and I can't reproduce even on old SDK versions locally

@SaschaWillems
Copy link
Contributor Author

There have been some new discoveries in the linked issue at KhronosGroup/Vulkan-Samples#1187.

tl;dr: It only seems to happen when VulkanConfigurator is active and might be an issue on our side. Not quite sure yet.

@spencer-lunarg
Copy link
Contributor

pulled down that PR and found

283 - 1 fps
290 - 1 fps
296 - 1000 fps
ToT - 1000 fps

gonna try and find the commit that changes it for me

@spencer-lunarg
Copy link
Contributor

add more fuel to the confusion, I tried the 1.3.290 SDK and get 1 FPS, but if I build with that commit tag (a926291) I don't see the slowdown ... officially confused 😢

@spencer-lunarg
Copy link
Contributor

spencer-lunarg commented Oct 28, 2024

Did some researching (@artem-lunarg might need help here)

On Windows AMD CPU there is no issue, but on NVIDIA it is getting blocked (so not a platform issue itself)

I turned off the DebugPrintf pass so it is not instrumenting anything, and it is the same, so it around the queue being blocked

The blocking is happening at auto wait_status = waiter.wait_until(GetCondWaitTimeout());

VkLayer_khronos_validation.dll!vvl::Queue::Wait(const Location & loc, unsigned __int64 until_seq) Line 123
VkLayer_khronos_validation.dll!ValidationStateTracker::PostCallRecordQueueWaitIdle(VkQueue_T * queue, const RecordObject & record_obj) Line 1593
VkLayer_khronos_validation.dll!vulkan_layer_chassis::QueueWaitIdle(VkQueue_T * queue) Line 1833
vulkan_samples.exe!vkb::Queue::wait_idle() Line 107	
vulkan_samples.exe!ApiVulkanSample::submit_frame() Line 581	
vulkan_samples.exe!ShaderDebugPrintf::draw() Line 378

If I change GetCondWaitTimeout to something like std::chrono::steady_clock::now() + std::chrono::milliseconds(10) I get the FPS back up to 60 FPS (but will also hit INTERNAL-ERROR-VkQueue-state-timeout when closing down the app)

I tried to just make it std::chrono::steady_clock::now() and got 60FPS but the values being printed where flashing incorrect values so seems a wait is in fact needed


Edit - So I noticed there were two things hitting this waiter.wait_until() call, one was also the Thread Safety layer, but turning that off VK_VALIDATION_FEATURE_DISABLE_THREAD_SAFETY_EXT (as well as everything else to be safe) and still see the issue... so can confirm with only DebugPrintf on, still see same effect

cc @SRSaunders

@artem-lunarg
Copy link
Contributor

@spencer-lunarg I can check this later today, maybe will come up with questions how to reproduce.

GetCondWaitTimeout is mostly to detect bugs, otherwise good mental model is that its default value should be infinity, e.g. much larger value than 10 seconds. We structured this system in a way that signal is always available. If not it's a bug in our code that should be fixed. Setting small wait timeout does not show much. This removes synchronization- in case of a blocking bug it won't block of course! but synchornization will also be missing.

One useful tip! When debugging this code almost always I want to modify GetCondWaitTimeout and set it to very large number. Otherwise while you investigating code on breakpoint it will report timeout because you waited on breakpoint more than 10 seconds and not because of some issue.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
DebugPrintf debugPrintfEXT GPU-AV GPU Assisted Validation
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants