Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

vkCmdFillBuffer crashes with Metal Shader Validation enabled #2416

Open
squidbus opened this issue Jan 6, 2025 · 5 comments
Open

vkCmdFillBuffer crashes with Metal Shader Validation enabled #2416

squidbus opened this issue Jan 6, 2025 · 5 comments

Comments

@squidbus
Copy link
Contributor

squidbus commented Jan 6, 2025

With MTL_SHADER_VALIDATION=1 in my environment, I get the following crash after some number of vkCmdFillBuffer calls:

Crashed Thread:        2  Dispatch queue: com.apple.root.default-qos

...

Thread 2 Crashed::  Dispatch queue: com.apple.root.default-qos
0   <translation info unavailable>	      0x7001999b28 ???
1   shadps4                       	     0x200003eea8b 0x20000000000 + 4123275
2   libsystem_platform.dylib      	    0x7ff80affae1d _sigtramp + 29
3   ???                           	    0x600003cb7d80 ???
4   libobjc.A.dylib               	    0x7ff80abf63ba objc_autoreleasePoolPop + 235
5   libdispatch.dylib             	    0x7ff80ae1b455 _dispatch_call_block_and_release + 12
6   libdispatch.dylib             	    0x7ff80ae1c7e2 _dispatch_client_callout + 8
7   libdispatch.dylib             	    0x7ff80ae2c033 _dispatch_root_queue_drain + 872
8   libdispatch.dylib             	    0x7ff80ae2c5b2 _dispatch_worker_thread2 + 147
9   libsystem_pthread.dylib       	    0x7ff80afc0c3e _pthread_wqthread + 261
10  libsystem_pthread.dylib       	    0x7ff80afbfbdb start_wqthread + 15

...

Thread 35:: GpuCommandProcessor
0   ???                           	    0x7ff89b47aa84 ???
1   libsystem_kernel.dylib        	    0x7ff80af85aaa __psynch_cvwait + 10
2   libsystem_pthread.dylib       	    0x7ff80afc47a8 _pthread_cond_wait + 1193
3   libc++.1.dylib                	    0x7ff80af08a8d std::__1::condition_variable::__do_timed_wait(std::__1::unique_lock<std::__1::mutex>&, std::__1::chrono::time_point<std::__1::chrono::system_clock, std::__1::chrono::duration<long long, std::__1::ratio<1l, 1000000000l>>>) + 93
4   libMoltenVK.dylib             	      0x7028cc5cc6 std::__1::cv_status std::__1::condition_variable::wait_for<long long, std::__1::ratio<1l, 1000000000l>>(std::__1::unique_lock<std::__1::mutex>&, std::__1::chrono::duration<long long, std::__1::ratio<1l, 1000000000l>> const&) + 18 [inlined]
5   libMoltenVK.dylib             	      0x7028cc5cc6 void std::__1::condition_variable::__do_timed_wait[abi:ne180100]<std::__1::chrono::steady_clock>(std::__1::unique_lock<std::__1::mutex>&, std::__1::chrono::time_point<std::__1::chrono::steady_clock, std::__1::chrono::duration<long long, std::__1::ratio<1l, 1000000000l>>>) + 18 (condition_variable.h:234) [inlined]
6   libMoltenVK.dylib             	      0x7028cc5cc6 std::__1::cv_status std::__1::condition_variable::wait_until<std::__1::chrono::steady_clock, std::__1::chrono::duration<long long, std::__1::ratio<1l, 1000000000l>>>(std::__1::unique_lock<std::__1::mutex>&, std::__1::chrono::time_point<std::__1::chrono::steady_clock, std::__1::chrono::duration<long long, std::__1::ratio<1l, 1000000000l>>> const&) + 18 (condition_variable.h:160) [inlined]
7   libMoltenVK.dylib             	      0x7028cc5cc6 bool std::__1::condition_variable::wait_until<std::__1::chrono::steady_clock, std::__1::chrono::duration<long long, std::__1::ratio<1l, 1000000000l>>, MVKMetalCompiler::compile(std::__1::unique_lock<std::__1::mutex>&, void () block_pointer)::$_0>(std::__1::unique_lock<std::__1::mutex>&, std::__1::chrono::time_point<std::__1::chrono::steady_clock, std::__1::chrono::duration<long long, std::__1::ratio<1l, 1000000000l>>> const&, MVKMetalCompiler::compile(std::__1::unique_lock<std::__1::mutex>&, void () block_pointer)::$_0) + 18 (condition_variable.h:168) [inlined]
8   libMoltenVK.dylib             	      0x7028cc5cc6 bool std::__1::condition_variable::wait_for[abi:ne180100]<long long, std::__1::ratio<1l, 1000000000l>, MVKMetalCompiler::compile(std::__1::unique_lock<std::__1::mutex>&, void () block_pointer)::$_0>(std::__1::unique_lock<std::__1::mutex>&, std::__1::chrono::duration<long long, std::__1::ratio<1l, 1000000000l>> const&, MVKMetalCompiler::compile(std::__1::unique_lock<std::__1::mutex>&, void () block_pointer)::$_0) + 39 (condition_variable.h:204) [inlined]
9   libMoltenVK.dylib             	      0x7028cc5cc6 MVKMetalCompiler::compile(std::__1::unique_lock<std::__1::mutex>&, void () block_pointer) + 326 (MVKSync.mm:579)
10  libMoltenVK.dylib             	      0x7028c83c34 MVKComputePipelineCompiler::newMTLComputePipelineState(id<MTLFunction>) + 100 (MVKPipeline.mm:2829)
11  libMoltenVK.dylib             	      0x7028c283de MVKCommandResourceFactory::newMTLComputePipelineState(char const*, MVKVulkanAPIDeviceObject*) + 209 (MVKCommandResourceFactory.mm:670) [inlined]
12  libMoltenVK.dylib             	      0x7028c283de MVKCommandResourceFactory::newCmdFillBufferMTLComputePipelineState(MVKVulkanAPIDeviceObject*) + 222 (MVKCommandResourceFactory.mm:522)
13  libMoltenVK.dylib             	      0x7028c14705 MVKCommandEncodingPool::getCmdFillBufferMTLComputePipelineState() + 69 (MVKCommandEncodingPool.mm:109)
14  libMoltenVK.dylib             	      0x7028c03db4 MVKCmdFillBuffer::encode(MVKCommandEncoder*) + 100 (MVKCmdTransfer.mm:1769)
15  libMoltenVK.dylib             	      0x7028c05a78 MVKCommandEncoder::encodeCommandsImpl(MVKCommand*) + 44 (MVKCommandBuffer.mm:379) [inlined]
16  libMoltenVK.dylib             	      0x7028c05a78 MVKCommandEncoder::encodeCommands(MVKCommand*) + 120 (MVKCommandBuffer.mm:372) [inlined]
17  libMoltenVK.dylib             	      0x7028c05a78 MVKCommandEncoder::encode(id<MTLCommandBuffer>, MVKCommandEncodingContext*) + 232 (MVKCommandBuffer.mm:346)
18  libMoltenVK.dylib             	      0x7028c05ea6 MVKCommandBuffer::submit(MVKQueueCommandBufferSubmission*, MVKCommandEncodingContext*) + 182 (MVKCommandBuffer.mm:240)
19  libMoltenVK.dylib             	      0x7028cb0d56 MVKQueueFullCommandBufferSubmission<16ul>::submitCommandBuffers() + 86 (MVKQueue.mm:664)
20  libMoltenVK.dylib             	      0x7028caf387 MVKQueueCommandBufferSubmission::execute() + 263 (MVKQueue.mm:484)
21  libMoltenVK.dylib             	      0x7028cad07a execute(MVKQueueSubmission*) + 17 (MVKQueue.mm:72) [inlined]
22  libMoltenVK.dylib             	      0x7028cad07a MVKQueue::submit(MVKQueueSubmission*) + 186 (MVKQueue.mm:99)
23  libMoltenVK.dylib             	      0x7028cad50f VkResult MVKQueue::submit<VkSubmitInfo>(unsigned int, VkSubmitInfo const*, VkFence_T*, MVKCommandUse) + 255 (MVKQueue.mm:138)
24  libMoltenVK.dylib             	      0x7028cd862a vkQueueSubmit + 74 (vulkan.mm:438)

Enabling zombie objects using NSZombieEnabled=YES reveals this is related to compute pipeline descriptors:

*** -[MTLComputePipelineDescriptorInternal release]: message sent to deallocated instance 0x60000031a140

Removing my call to vkCmdFillBuffer resolves the issue, so it seems to only be related to compute used for that command and not in general.

@squidbus
Copy link
Contributor Author

squidbus commented Jan 6, 2025

Using newComputePipelineStateWithDescriptor instead of newComputePipelineStateWithFunction resolves the crash for some reason, not sure what's going on inside the function version that is causing this double-release:

MTLComputePipelineDescriptor* plDesc = [MTLComputePipelineDescriptor new];	// temp retain
plDesc.computeFunction = mtlFunction;
[mtlDev newComputePipelineStateWithDescriptor: plDesc
			    completionHandler: ^(id<MTLComputePipelineState> ps, NSError* error) {
				    bool isLate = compileComplete(ps, error);
				    if (isLate) { destroy(); }
			    }];
[plDesc release];								// temp release

@squidbus
Copy link
Contributor Author

squidbus commented Jan 8, 2025

Not sure if it's related but I also seem to get this when using vkCmdFillBuffer with METAL_CAPTURE_ENABLED=1:

2025-01-07 17:04:05.721 shadps4[67662:2382745] Compute Pipeline Descriptor Validation
computeFunction is associated with a different device
*** Terminating app due to uncaught exception 'NSInvalidArgumentException', reason: '-[AGXG16XFamilyComputePipeline touch]: unrecognized selector sent to instance 0x7fc249a0ed30'
*** First throw call stack:
(
	0   CoreFoundation                      0x00007ff80ab82b8e __exceptionPreprocess + 242
	1   libobjc.A.dylib                     0x00007ff80a668f12 objc_exception_throw + 62
	2   CoreFoundation                      0x00007ff80ac2a412 -[NSObject(NSObject) __retain_OA] + 0
	3   CoreFoundation                      0x00007ff80aaf3b44 ___forwarding___ + 1379
	4   CoreFoundation                      0x00007ff80aaf3558 _CF_forwarding_prep_0 + 120
	5   GPUToolsCapture                     0x000000702196b5ea -[CaptureMTLComputeCommandEncoder setComputePipelineState:] + 48
	6   libMoltenVK.dylib                   0x0000007029dd4daf _ZN16MVKCmdFillBuffer6encodeEP17MVKCommandEncoder + 271
	...

Which is odd because I only have one device.

@cdavis5e
Copy link
Collaborator

cdavis5e commented Jan 8, 2025

Not sure if it's related but I also seem to get this when using vkCmdFillBuffer with METAL_CAPTURE_ENABLED=1:

2025-01-07 17:04:05.721 shadps4[67662:2382745] Compute Pipeline Descriptor Validation
computeFunction is associated with a different device
	...

Which is odd because I only have one device.

That's likely because when frame capture is enabled, Metal uses a special MTLDevice that wraps the real one to implement this. How the shader function from one device is being used with the other is currently beyond me.

@squidbus
Copy link
Contributor Author

squidbus commented Jan 8, 2025

That's likely because when frame capture is enabled, Metal uses a special MTLDevice that wraps the real one to implement this. How the shader function from one device is being used with the other is currently beyond me.

Yeah it's very strange... I did realize I was still on my modified version, and confirmed that the capture issue only happens with the workaround I posted above for the original issue. I do have shader validation is disabled when I have capture enabled, so I'm not trying to do both or anything. Not sure what's going on here with the fill buffer path.

@squidbus
Copy link
Contributor Author

squidbus commented Jan 8, 2025

Works if I move the descriptor out a few levels to MVKCommandResourceFactory::newMTLComputePipelineState. So the question I have is, would this be an acceptable solution creating a temporary descriptor instead of providing the function directly? Because I still don't know what's going on or if this is a MoltenVK or Metal bug with their layers.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants