[WIP] MSL: mesh shader initial #2074

Try · 2022-12-18T19:14:30Z

This is an initial draft to start a conversation on how to approach this.
Mainly I need opinions on the copy-pass and on SPIRVCrossDecorationInterfaceMemberIndex.

Workflow:

When cross-compiling, new types get synthesized:

spvPerVertex all regular varyings + gl_Position
spvPerPrimitive - varyings marked as perprimitiveEXT
using spvMesh_t = mesh<spvPerVertex, spvPerPrimitive, ... >;

gl_PrimitiveTriangleIndicesEXT becomes spvMesh. Affects handling OpSetMeshOutputsEXT.

An OpStore to the index buffer is remapped to spvMesh.set_index calls (3 calls in the case of triangles).
There is an ugly hack to implement this - see code.

gl_MeshVerticesEXT and varyings are represented as shared memory arrays. Once the shader is done, they get packed into a single struct. This works around SPIRV-vs-MSL API differences.
In theory, if the shader writes only with [gl_LocalInvocationIndex], the array can be removed.
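
For illustration only, here is a minimal sketch of the index-store remapping described above, for the triangle case. The helper name lower_triangle_index_store is hypothetical and is not code from this PR; it assumes the flat `prim * 3 + i` index layout and works on already-formatted MSL expression strings.

```cpp
#include <string>
#include <vector>

// Hypothetical helper (not part of SPIRV-Cross): given the MSL expression for
// the primitive index and for the uint3 value being stored, build the three
// spvMesh.set_index statements for a triangle topology.
// Assumption: indices are laid out flat, i.e. slot = primitive * 3 + corner.
std::vector<std::string> lower_triangle_index_store(const std::string &prim_expr,
                                                    const std::string &value_expr)
{
    static const char *components[3] = { "x", "y", "z" };
    std::vector<std::string> statements;
    for (int i = 0; i < 3; ++i)
    {
        statements.push_back("spvMesh.set_index((" + prim_expr + ") * 3u + " + std::to_string(i) +
                             "u, (" + value_expr + ")." + components[i] + ");");
    }
    return statements;
}
```

A store such as `gl_PrimitiveTriangleIndicesEXT[0] = uvec3(0, 1, 2)` would then expand to three statements, one per triangle corner.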

Concerns

Had to change the meaning of SPIRVCrossDecorationInterfaceMemberIndex. The current implementation doesn't track array elements - they all have the same ID. On vertex/fragment it seems to work fine; I don't have any tessellation/geometry test cases.
I'm not sure how to go about it: current workflow with lambdas isn't suitable for

Bugs/TODO:

Haven't tested gl_ClipDistance/gl_CullDistance - dunno how exactly they should work in spirv-cross.

The Task (Object) shader is different in Metal:
In GLSL EmitMeshTasksEXT is a terminator - calling it is supposed to halt shader execution.
set_threadgroups_per_grid doesn't stop shader execution.

Performance

Apple M1 testing (in OpenGothic) shows a ~2x fps regression. It may be because of shared memory, but I don't know for sure.
Asked about this on the Apple forum: https://developer.apple.com/forums/thread/722047

Apple M3 seems to work as fast as expected

Try · 2022-12-18T19:56:08Z

spirv_msl.cpp

+	{
+		auto &execution = get_entry_point();
+		auto str = to_expression(lhs_expression);
+		str = str.substr(str.find_first_of('[')+1, str.find_last_of(']') - str.find_first_of('[') - 1);


This code looks quite bad, but I don't see a better way at the moment.
Need some way to access the OpAccessChain parameters, or some way to emit a shortened chain.
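
To make the string hack above easier to discuss, here is a standalone sketch of the same bracket extraction with explicit failure handling. The function name extract_index_expression is hypothetical; it only mirrors the substr/find_first_of/find_last_of logic from the diff and is not a proposal for the final implementation.

```cpp
#include <string>

// Hypothetical helper: copies the text between the first '[' and the last ']'
// of an expression such as "gl_PrimitiveTriangleIndicesEXT[i + 1]" into `out`.
// Returns false (and leaves `out` untouched) when no bracket pair is found.
bool extract_index_expression(const std::string &expr, std::string &out)
{
    const std::string::size_type open = expr.find_first_of('[');
    const std::string::size_type close = expr.find_last_of(']');
    if (open == std::string::npos || close == std::string::npos || close < open)
        return false;
    out = expr.substr(open + 1, close - open - 1);
    return true;
}
```

Like the original snippet, this is purely textual: it assumes the outermost bracket pair belongs to the index buffer access, which is exactly why access to the original OpAccessChain operands would be preferable.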

Hey, small bump to this.
My question is basically: is there any way to reach the original OpAccessChain with all arguments?
Needed here and for DX mesh-shader as well.
Thanks!

Try · 2022-12-19T23:11:16Z

Today I tested a compilation variant where a local variable is used instead of a shared memory array. This happens to be valid for my shaders, since there is a 1-to-1 match between gl_LocalInvocationID and vertex.
Shader example: https://shader-playground.timjones.io/641b24c9f6700a03eb9f69414ebbf22b
Still, FPS is roughly as bad as it was, so probably the metal3 mesh implementation is just bad :(

rcaridade145 · 2022-12-21T12:05:58Z

@Try Can this be due to the TBDR arch ?
There is a discussion on that here https://forum.beyond3d.com/threads/apple-powervr-tbdr-gpu-architecture-speculation-thread.61873/page-5#post-2171147

I'll leave here some other links I came across while reading about this issue:
https://tellusim.com/mesh-shader-emulation/

Unreal Engine - https://blog.imaginationtech.com/powervr-performance-tips-for-unreal-engine-4/ - when developing with a focus on PowerVR, it recommends disabling Early-Z due to how PowerVR handles geometry. Since Apple is based on PowerVR, I hope some of this helps you.

Try · 2022-12-21T18:25:13Z

@rcaridade145

Can this be due to the TBDR arch ?

That depends a lot on what kind of TBDR it is. According to https://blog.imaginationtech.com/a-look-at-the-powervr-graphics-architecture-tile-based-rendering/ PowerVR has a fat tiler, with native support for vertex shaders.
In theory mesh should work very well for them: just take a meshlet and write it to the polygon list. Naturally, even one small implementation detail can ruin performance.

So, I think best way is to ask them.

disabling Early-Z

Not related. In any case, when the renderer is the same and the only difference is vertex-vs-mesh, performance should be roughly the same.

HansKristian-Work · 2023-01-05T10:55:27Z

Back from holiday, just ACK-ing that I've seen it and I'll look at it when I have time.

HansKristian-Work · 2023-01-12T11:54:20Z

reference/opt/shaders-msl/mesh/basic.msl30.spv14.mesh

+[[mesh]] void main0(uint gl_LocalInvocationIndex [[thread_index_in_threadgroup]], uint3 gl_GlobalInvocationID [[thread_position_in_grid]], spvMesh_t spvMesh)
+{
+    threadgroup gl_MeshPerVertexEXT gl_MeshVerticesEXT[2];
+    _4(gl_MeshVerticesEXT, gl_LocalInvocationIndex, spvMesh, gl_GlobalInvocationID);


What is the equivalent for SetMeshOutputsEXT in Metal? I see set_primitive_count there, but is there no set_vertex_count?

Similar to NV extension only set_primitive_count. Also index buffer is more similar to original NV extension(uint index[max_prim*3]).

HansKristian-Work · 2023-01-12T11:56:55Z

reference/opt/shaders-msl/mesh/basic.msl30.spv14.mesh

+    threadgroup gl_MeshPerVertexEXT gl_MeshVerticesEXT[2];
+    _4(gl_MeshVerticesEXT, gl_LocalInvocationIndex, spvMesh, gl_GlobalInvocationID);
+    threadgroup_barrier(mem_flags::mem_threadgroup);
+    for (uint spvI = gl_LocalInvocationIndex, spvThreadCount = (gl_WorkGroupSize.x*gl_WorkGroupSize.y*gl_WorkGroupSize.z); spvI < 2; spvI += spvThreadCount)


Not gonna lie, having to emit workarounds like this is just depressing. Is this even going to give meaningful uplifts?

I've tested performance of 2 approaches in my engine:

copy loop

assume that gl_MeshVerticesEXT are always addressed by gl_LocalInvocationIndex and use one thread-local variable.

Both are similar, and very slow, ~2x slower than draw-indexed spam with no culling. Asked them on the developer forum: https://developer.apple.com/forums/thread/722047, yet there are no useful answers.
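
For readers unfamiliar with the copy loop, the sketch below simulates on the CPU which vertex slots a single invocation would copy out under the strided pattern from the generated MSL. The function name vertices_for_thread is hypothetical and the code is only a model of the loop bounds, not part of the PR.

```cpp
#include <cstdint>
#include <vector>

// CPU-side model of the generated copy loop:
//   for (spvI = local_index; spvI < vertex_count; spvI += thread_count)
// Returns the vertex slots that one invocation is responsible for.
std::vector<uint32_t> vertices_for_thread(uint32_t local_index, uint32_t thread_count,
                                          uint32_t vertex_count)
{
    std::vector<uint32_t> owned;
    if (thread_count == 0)
        return owned; // guard against an infinite loop in this model
    for (uint32_t i = local_index; i < vertex_count; i += thread_count)
        owned.push_back(i);
    return owned;
}
```

With 4 invocations and 10 vertices, invocation 1 handles slots 1, 5 and 9, and every slot is covered exactly once across the workgroup.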

HansKristian-Work · 2023-01-12T12:02:37Z

spirv_msl.cpp

+	case OpSetMeshOutputsEXT:
+	{
+		flush_variable_declaration(builtin_primitive_indices_id);
+		statement("spvMesh.set_primitive_count(", to_unpacked_expression(ops[1]), ");");


How does this handle rules where the first invocation of the workgroup wins?

It doesn't in this draft. Metal spec is unclear about access from multiple threads.
Also emulating vulkan spec here is, unfortunately, another tough workaround: would require something like subgroupElect, but for entire workgroup.

subgroupElect

I didn't realize, back then, that Vulkan requires uniform control flow. Fixed now.

HansKristian-Work · 2023-01-12T12:11:48Z

Looking over this, I'm not very excited about the prospect of merging this. The impedance mismatch with the mesh type is a disaster for performance, and I'm not particularly excited about having to maintain painful workarounds like this. The tessellation code is bad enough as it is, with ridiculous heroics, but we are somewhat forced to implement it.

Mesh shaders on the other hands is still deeply in "would be nice" territory. No one can reasonably rely on it any time soon, and if it cannot provide meaningful uplift either, I'd defer it indefinitely.

Try · 2023-01-12T21:07:08Z

Looking over this, I'm not very excited about the prospect of merging this.

Yes, I do agree that the current mesh-shader in metal3 is not cross-compilation friendly (and bad overall), and the PR can be discarded if there are no objections from the MoltenVK guys.

On my side I still have a small hope that only M1 laptops have performance issues, and that desktop Macs can be more performant.
Another small hope: what if they have excellent task shader support, which would be way better and faster (unlikely).

Mesh shaders on the other hands is still deeply in "would be nice" territory.

To be more clear: for my engine I've picked mesh-shader as lesser evil - other gpu-driven approaches, like draw-indirect, seem to be worse.

TL;DR: feel free to discard the PR

Try · 2023-02-12T19:56:16Z

Pushed new take on metal-mesh.

For max_vertices <= num_threads, the for loop can be avoided: one thread writes exactly one vertex (or primitive).
I'm not sure if this is enough to claim that the translated code is readable and clean.
As for shared memory: tested on M1, and one of the contributors tested my engine on M2 Max - no measurable impact.
(note: XCode GPU-tools still have no mesh-shader support, so we can measure only FPS in game)
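
A rough sketch of the decision described above, i.e. when the copy-out can skip the loop. The function name vertex_copy_header is hypothetical, and the exact shape of the non-loop guard is an assumption for illustration, not necessarily what the PR emits.

```cpp
#include <cstdint>
#include <string>

// Hypothetical helper: pick the statement that wraps the per-vertex copy-out.
// - Fewer invocations than vertices: each invocation strides over several slots.
// - Otherwise: one invocation writes at most one slot, so a plain guard suffices
//   (assumed form; the real generated code may differ).
std::string vertex_copy_header(uint32_t num_invocations, uint32_t max_vertices)
{
    if (num_invocations < max_vertices)
    {
        return "for (uint spvI = gl_LocalInvocationIndex; spvI < " + std::to_string(max_vertices) +
               "; spvI += " + std::to_string(num_invocations) + ")";
    }
    return "if (gl_LocalInvocationIndex < " + std::to_string(max_vertices) + ")";
}
```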
As for bad performance overall:
In my game I've made a few test cases:
a) outdoor
b) cave in middle of game world
c) cave in corner of game world

'a' and 'b' showed similar performance, yet 'c' was fastest when looking away from the world and slowest when the camera points to the world center.
My current theory:

based on WWDC 2016 presentation, compute-warp launch is expensive on apple
most likely mesh-shader also has huge warp launch cost, and that defeats the feature.
they design GPU assuming that every shader will be exactly same as NVidia sample (with task-based culling)
I've asked on apple forum a few days ago, no response so far.

Can't really test the task shader at this point: cross-compilation is relatively straightforward, but building sensible culling in a task shader is very difficult.

Try · 2023-02-26T18:52:25Z

Submitted initial implementation for Task shader.

Still, there are differences. In GLSL EmitMeshTasksEXT is a terminator - calling it is supposed to halt shader execution.
In Metal, mgp.set_threadgroups_per_grid doesn't stop shader execution, and behaves similarly to the old gl_TaskCountNV.

For now PR assumes, that EmitMeshTasksEXT is called from main, and generates call+return sequence.

BeastLe9enD · 2023-03-15T13:00:09Z

For MoltenVK, this is really interesting. I'm working on a PR (see KhronosGroup/MoltenVK#1845) for this at the moment that uses this branch to convert SPIR-V mesh shaders to MSL code.

BeastLe9enD · 2023-03-25T12:21:33Z

@Try I ran into a problem where the generated mesh shader is invalid and does not compile.
I have the following HLSL code generated with spirv to a mesh shader, I also added the resulting spirv assembly:

struct MSOutput {
    float4 Position: SV_Position;
    [[vk::location(0)]] float3 Color: COLOR0;
};

[NumThreads(1, 1, 1)]
[OutputTopology("triangle")]
void mesh_main(out indices uint3 triangles[1], out vertices MSOutput vertices[3]) {
    SetMeshOutputCounts(3, 1);
    triangles[0] = uint3(0, 1, 2);

    vertices[0].Position = float4(-0.5, 0.5, 0.0, 1.0);
    vertices[0].Color = float3(1.0, 0.0, 0.0);

    vertices[1].Position = float4(0.5, 0.5, 0.0, 1.0);
    vertices[1].Color = float3(0.0, 1.0, 0.0);

    vertices[2].Position = float4(0.0, -0.5, 0.0, 1.0);
    vertices[2].Color = float3(0.0, 0.0, 1.0);
}

When I try to do spirv-cross --msl --msl-version 33000 test.spv, it does not work:

[[mesh]] void mesh_main(uint gl_LocalInvocationIndex [[thread_index_in_threadgroup]], spvMesh_t spvMesh)
{
    threadgroup uint spv_primitive_count;
    threadgroup TaskPayload payload;
    threadgroup float4 v_3[3];
    threadgroup float3 out_var_COLOR0[3];
    _1(spvMesh, spv_primitive_count, gl_Position, out_var_COLOR0, gl_LocalInvocationIndex);
    threadgroup_barrier(mem_flags::mem_threadgroup);
    if (spv_primitive_count == 0)
    {
        return;
    }
    spvMesh.set_primitive_count(spv_primitive_count);
    for (uint spvI = gl_LocalInvocationIndex, spvThreadCount = (gl_WorkGroupSize.x*gl_WorkGroupSize.y*gl_WorkGroupSize.z); spvI < 3; spvI += spvThreadCount)
    {
        spvPerVertex spvV = {};
        spvV.gl_Position = v_3[spvI];
        spvV.out_var_COLOR0 = out_var_COLOR0[spvI];
        spvMesh.set_vertex(spvI, spvV);
    }
}

The position is stored in threadgroup float4 v_3[3];, but when _1 is invoked, gl_Position is passed instead of v_3, which results in an error. I just put it here as a comment since this PR has not been merged yet.

; SPIR-V
; Version: 1.6
; Generator: Google spiregg; 0
; Bound: 60
; Schema: 0
               OpCapability MeshShadingEXT
               OpExtension "SPV_EXT_mesh_shader"
               OpMemoryModel Logical GLSL450
               OpEntryPoint MeshEXT %mesh_main "mesh_main" %2 %gl_Position %out_var_COLOR0 %payload
               OpExecutionMode %mesh_main LocalSize 1 1 1
               OpExecutionMode %mesh_main OutputTrianglesNV
               OpExecutionMode %mesh_main OutputVertices 3
               OpExecutionMode %mesh_main OutputPrimitivesNV 1
               OpSource HLSL 660
               OpName %TaskPayload "TaskPayload"
               OpName %payload "payload"
               OpName %out_var_COLOR0 "out.var.COLOR0"
               OpName %mesh_main "mesh_main"
               OpName %MSOutput "MSOutput"
               OpMemberName %MSOutput 0 "Position"
               OpMemberName %MSOutput 1 "Color"
               OpDecorate %2 BuiltIn PrimitiveTriangleIndicesEXT
               OpDecorate %gl_Position BuiltIn Position
               OpDecorate %out_var_COLOR0 Location 0
       %uint = OpTypeInt 32 0
     %uint_1 = OpConstant %uint 1
     %uint_3 = OpConstant %uint 3
     %uint_0 = OpConstant %uint 0
     %uint_2 = OpConstant %uint 2
     %v3uint = OpTypeVector %uint 3
         %14 = OpConstantComposite %v3uint %uint_0 %uint_1 %uint_2
        %int = OpTypeInt 32 1
      %int_0 = OpConstant %int 0
      %float = OpTypeFloat 32
 %float_n0_5 = OpConstant %float -0.5
  %float_0_5 = OpConstant %float 0.5
    %float_0 = OpConstant %float 0
    %float_1 = OpConstant %float 1
    %v4float = OpTypeVector %float 4
         %23 = OpConstantComposite %v4float %float_n0_5 %float_0_5 %float_0 %float_1
    %v3float = OpTypeVector %float 3
         %25 = OpConstantComposite %v3float %float_1 %float_0 %float_0
         %26 = OpConstantComposite %v4float %float_0_5 %float_0_5 %float_0 %float_1
      %int_1 = OpConstant %int 1
         %28 = OpConstantComposite %v3float %float_0 %float_1 %float_0
         %29 = OpConstantComposite %v4float %float_0 %float_n0_5 %float_0 %float_1
      %int_2 = OpConstant %int 2
         %31 = OpConstantComposite %v3float %float_0 %float_0 %float_1
%TaskPayload = OpTypeStruct
%_ptr_Workgroup_TaskPayload = OpTypePointer Workgroup %TaskPayload
%_arr_v3uint_uint_1 = OpTypeArray %v3uint %uint_1
%_ptr_Output__arr_v3uint_uint_1 = OpTypePointer Output %_arr_v3uint_uint_1
%_arr_v4float_uint_3 = OpTypeArray %v4float %uint_3
%_ptr_Output__arr_v4float_uint_3 = OpTypePointer Output %_arr_v4float_uint_3
%_arr_v3float_uint_3 = OpTypeArray %v3float %uint_3
%_ptr_Output__arr_v3float_uint_3 = OpTypePointer Output %_arr_v3float_uint_3
       %void = OpTypeVoid
         %40 = OpTypeFunction %void
%_ptr_Function__arr_v3uint_uint_1 = OpTypePointer Function %_arr_v3uint_uint_1
   %MSOutput = OpTypeStruct %v4float %v3float
%_arr_MSOutput_uint_3 = OpTypeArray %MSOutput %uint_3
%_ptr_Function__arr_MSOutput_uint_3 = OpTypePointer Function %_arr_MSOutput_uint_3
         %44 = OpTypeFunction %void %_ptr_Function__arr_v3uint_uint_1 %_ptr_Function__arr_MSOutput_uint_3
%_ptr_Output_v3uint = OpTypePointer Output %v3uint
%_ptr_Output_v4float = OpTypePointer Output %v4float
%_ptr_Output_v3float = OpTypePointer Output %v3float
    %payload = OpVariable %_ptr_Workgroup_TaskPayload Workgroup
          %2 = OpVariable %_ptr_Output__arr_v3uint_uint_1 Output
%gl_Position = OpVariable %_ptr_Output__arr_v4float_uint_3 Output
%out_var_COLOR0 = OpVariable %_ptr_Output__arr_v3float_uint_3 Output
         %48 = OpUndef %v3uint
         %49 = OpUndef %MSOutput
  %mesh_main = OpFunction %void None %40
         %50 = OpLabel
               OpSetMeshOutputsEXT %uint_3 %uint_1
         %51 = OpAccessChain %_ptr_Output_v3uint %2 %int_0
               OpStore %51 %14
         %52 = OpAccessChain %_ptr_Output_v4float %gl_Position %int_0
               OpStore %52 %23
         %53 = OpAccessChain %_ptr_Output_v3float %out_var_COLOR0 %int_0
               OpStore %53 %25
         %54 = OpAccessChain %_ptr_Output_v4float %gl_Position %int_1
               OpStore %54 %26
         %55 = OpAccessChain %_ptr_Output_v3float %out_var_COLOR0 %int_1
               OpStore %55 %28
         %56 = OpAccessChain %_ptr_Output_v4float %gl_Position %int_2
               OpStore %56 %29
         %57 = OpAccessChain %_ptr_Output_v3float %out_var_COLOR0 %int_2
               OpStore %57 %31
         %58 = OpCompositeConstruct %_arr_v3uint_uint_1 %48
         %59 = OpCompositeConstruct %_arr_MSOutput_uint_3 %49 %49 %49
               OpReturn
               OpFunctionEnd

Try · 2023-03-25T17:40:49Z

Hi, @BeastLe9enD !

I have the following HLSL code generated with spirv to a mesh shader, I also added the resulting spirv assembly

What HLSL compiler was used? The resulting spirv doesn't look correct:

OpEntryPoint MeshEXT %mesh_main "mesh_main" %2 %gl_Position %out_var_COLOR0 %payload
The shader claims to output vec4 gl_Position[], which is not possible in mesh, where gl_MeshVerticesEXT should be used instead.

%TaskPayload = OpTypeStruct
Empty payload struct? Also TaskPayload is in HLSL...

BeastLe9enD · 2023-03-25T22:42:20Z

@Try oh yea, true, I missed that! That spirv code does not look right, although it's running fine on my NVIDIA gpu.

I'm using dxcompiler.dll: 1.7 - 1.7.0.3795 (bef540d36) shipped with the newest vulkan sdk.

BeastLe9enD · 2023-03-31T12:15:51Z

@Try are u sure this isn't correct? I think DXC just names it %gl_Position instead of %gl_MeshVerticesEXT.
It decorates it with OpDecorate %gl_Position BuiltIn Position; the equivalent that glslangValidator emits is OpMemberDecorate %gl_MeshPerVertexEXT 0 BuiltIn Position, which should be the same imo.

Try · 2023-04-01T13:31:38Z

@BeastLe9enD

I think DXC just names it %gl_Position instead of %gl_MeshVerticesEXT.

This is not what the provided spirv code shows:

%_arr_v4float_uint_3 = OpTypeArray %v4float %uint_3
%_ptr_Output__arr_v4float_uint_3 = OpTypePointer Output %_arr_v4float_uint_3
%gl_Position = OpVariable %_ptr_Output__arr_v4float_uint_3 Output

In GLSL, this would be:
out vec4 gl_Position[3], which is not correct, as gl_Position must be a member of gl_MeshPerVertexEXT

BeastLe9enD · 2023-09-10T17:50:15Z

Is this still a thing? And is there a possibility of merging this in the future?

Try · 2023-09-16T17:57:33Z

Is this still a thing?

Can ask you same :D Any news about molten-vk prototype?

On my end not doing much here, as mesh-shader is too broken on apple: multiple complex hack are required to make it compile and even then shader is too slow. When performance feature runs slow - that's very bad

BeastLe9enD · 2023-09-16T20:42:17Z

Can ask you same :D Any news about molten-vk prototype?

I hope its doing well :D don't know, at least I have plans to finish it this year

multiple complex hack are required to make it compile and even then shader is too slow

you mean mesh shader on apple are slow in general or just the spirv -> msl mapped code is suboptimal?

Try · 2023-09-17T20:23:17Z

you mean mesh shader on apple are slow in general or just the spirv -> msl mapped code is suboptimal?

It's hard to say for sure:
XCode has no profiling for mesh/object shaders. Also, when I asked, Apple provided no useful answer on what good practice is. So, we can only speculate. A few key points to mention:

shared memory usage is bad - in most gpu's it puts limit on how many warps can run in parallel
nvidia model of mesh shader is not something what can work on tile-based gpu's, like my MacM1
in my project mesh-path show 2x performance regression versus vertex-path
no way to reason why it's slow - need profiling tools

BeastLe9enD · 2023-09-17T21:56:01Z

shared memory usage is bad - in most gpu's it puts limit on how many warps can run in parallel

fair point, I think we could stop using shared memory if we detect that meshlet data are only written in order like you would do in MSL although it would be really painful to implement into spirv-cross because we need to analyze the spirv code first

nvidia model of mesh shader is not something what can work on tile-based gpu's, like my MacM1

what do you mean by nvidia model of mesh shaders? the mesh shaders how they are implemented in dx12/vulkan in general or the guidelines that nvidia gave for getting decent performance like this ? https://developer.nvidia.com/blog/advanced-api-performance-mesh-shaders/
would be interesting here what apple suggests you to use mesh shaders for....

in my project mesh-path show 2x performance regression versus vertex-path

is your project open source? what are you using mesh shaders for?

no way to reason why it's slow - need profiling tools

yep thats indeed really bad. if you have neither a profiler or at least know what is good practise and whats not, the only real thing you can do is guess and pray that the thing you're doing is good xD

Try · 2023-09-18T20:26:36Z

could stop using shared memory

I've tested this path as well (my shader happens to be like so) - no measurable FPS improvement.

what do you mean by nvidia model of mesh shaders?

The pre-rasterization pipeline is very different across GPU vendors. On NV (apparently) any thread in a warp can output any part of the meshlet. On AMD (RDNA2) each thread is responsible for exactly one vertex+primitive, and the driver needs to emulate NV behavior in many cases. On tile-based GPUs every vendor does its own thing; in the simple case, draw calls are replayed multiple times, once per tile - so meshlets do not make any sense.
M1 is tile-based, so the driver needs to do a lot on the Apple side to make it at least valid.

would be interesting here what apple suggests you to use mesh shaders for...

https://developer.apple.com/forums/thread/722047 - nothing interesting: "This is probably working as expected with the M1 GPU; Mesh shaders on M1 are intended to enable use-cases that cannot be expressed as draws"

is your project open source? what are you using mesh shaders for?

Yes: https://github.com/Try/OpenGothic/blob/master/shader/materials/main.mesh
use-case is quite simple: hiz+frustum culling.

zmarlon · 2023-11-03T16:19:33Z

Since Apple has now introduced chips with the M3 generation that support hardware mesh shading, I wanted to find out again whether this branch will be developed further, or whether there are plans to merge this feature into the main branch.
I would like to continue working on the mesh shader implementation in MoltenVK based on this.

Try · 2023-11-04T18:20:35Z

M3 generation that support hardware mesh shading

Hm, took them less than forever :) Do you have M3 to test it?

or whether there are plans to merge this feature into the main branch

I've rebased it on current main to resolve merge conflicts, yet now it has sporadic/unreproducible failures in CI.
Let me do a clean rewrite.
In general I would like to have full mesh support in my engine and deprecate vertex, so merging is desirable (if mesh-shader is really fixed in the Apple driver & hw).

UPD:
Now CI is green

spnda · 2024-04-01T23:05:30Z

I think it is important to have a long-term solution for EmitMeshTasksEXT to terminate the task shader, but I don't think it should be a blocker for this PR.

As I would very much like to see this land, and given that the Metal mesh shading interface aligns more with that from VK_NV_mesh_shader, I propose we firstly support that extension and then add support for the EXT-variant later? This way we could release a working version into the wild much earlier and fix more bugs sooner, and then discuss how to handle things like EmitMeshTasksEXT later.

With VK_NV_mesh_shader, setting the amount of mesh shaders to be spawned is done through assigning gl_TaskCountNV a value with no shader termination occurring, similarly to Metal's set_threadgroups_per_grid. Also, @Try commented on one of the reviews "Similar to NV extension only set_primitive_count. Also index buffer is more similar to original NV extension(uint index[max_prim*3]).", further suggesting more similarities to the NV extension.

BeastLe9enD · 2024-04-07T11:24:09Z

@spnda I get your point; on the other hand, I think nobody is using the NV extension anymore (since the EXT one is available), so it's questionable who would benefit from the NV extension being available.

BeastLe9enD · 2024-08-11T16:05:16Z

What if we support an initial mesh shader version with only mesh shader support, so VkPhysicalDeviceMeshShaderFeaturesEXT returns false for taskShader? So we would benefit in cases where only mesh shaders are required (like it is for me), and we would not need to switch to the NV extension. Your thoughts on this?

Try · 2024-08-12T09:55:58Z

Your thoughts on this[VkPhysicalDeviceMeshShaderFeaturesEXT] ?

You probably should raise it on the MoltenVK github. For me, it's better to have mesh-shader upstream, so there won't be a need to update this path all the time.
Maybe one thing to keep in mind: taskShader=false is not conformant according to Vulkan. The spec says both task and mesh stages must be supported in this extension.

squidbus · 2024-10-13T08:38:00Z

Are there any plans to move forward with this? Even just in the NV extension form, mesh shaders would be useful to have support for.

zmarlon · 2024-10-13T10:50:31Z

You could also support a sub-set of the task shader. This would still be better, as it would still cover a large number of cases.

HansKristian-Work · 2024-10-14T09:42:10Z

VK_NV_mesh_shader, I propose we firstly support that extension and then add support for the EXT-variant later?

Only supporting a vendor extension when there is an EXT is a dead-end, and I don't see the point.

What is blocking task shaders from being potentially supported? As mentioned, that would not be conformant, indeed.

squidbus · 2024-10-14T09:47:33Z

What is blocking task shaders from being potentially supported? As mentioned, that would not be conformant, indeed.

EmitMeshTasksEXT here does not terminate the shader, unlike in the spec.

HansKristian-Work · 2024-10-14T09:50:01Z

EmitMeshTasksEXT here does not terminate the shader, unlike in the spec.

Is that the only problem? In practical scenarios, no shader relies on that. If EmitMeshTasksEXT is called in main(), a simple return; after would be correct, and for calls in a function, we can just throw an error. MoltenVK could in theory run spir-v inlining on task shaders to avoid that problem for conformance, but I'm not going to hell and back to try and workaround something that can be worked around at the SPIR-V level.

squidbus · 2024-10-14T09:55:26Z

I'm a bit of a late-comer here but it's the main problem I'm aware of from the comment history here, @Try is there anything else blocking this if not terminating in EmitMeshTasksEXT is not an issue?

zmarlon · 2024-10-14T12:30:48Z

EmitMeshTasksEXT here does not terminate the shader, unlike in the spec.

Is that the only problem? In practical scenarios, no shader relies on that. If EmitMeshTasksEXT is called in main(), a simple return; after would be correct, and for calls in a function, we can just throw an error. MoltenVK could in theory run spir-v inlining on task shaders to avoid that problem for conformance, but I'm not going to hell and back to try and workaround something that can be worked around at the SPIR-V level.

That sounds like a good idea. If this was implemented, could it actually be merged? @HansKristian-Work @Try

Try · 2024-10-14T17:31:32Z

@squidbus @HansKristian-Work

@Try is there anything else blocking this if not terminating in EmitMeshTasksEXT is not an issue?
Is that[EmitMeshTasksEXT ] the only problem?

In principle:

EmitMeshTasksEXT
gl_ClipDistance/gl_CullDistance are untested (and not much unit tests in general)
merge conflicts - can work on it this weekend, if we agree on eventually merge it

BeastLe9enD · 2024-10-14T18:58:10Z

Sounds nice. if this is merged, @zmarlon and me can finally finish the implementation on the MoltenVK side.

HansKristian-Work · 2024-10-15T08:28:41Z

merge conflicts - can work on it this weekend, if we agree on eventually merge it

I wouldn't spend time on this yet before I've committed to supporting it. I need to study the implementation in more detail to see how much of a mess mesh shaders end up adding ...

zmarlon · 2024-10-15T16:19:35Z

The problem is that we are blocked on the MVK side if this does not get merged. Do you have any plans for when you want to look into it?

HansKristian-Work · 2024-10-21T12:36:46Z

I'll try to have a look this week if I don't get sidetracked with other stuff ...

HansKristian-Work · 2024-10-23T12:09:24Z

spirv_msl.cpp

-	    BuiltIn(get_decoration(lhs_expression, DecorationBuiltIn)) == BuiltInSampleMask &&
-	    is_array(type))
+	// Meshlet indices
+	if (lhs_e != nullptr && lhs_e->loaded_from == builtin_mesh_primitive_indices_id &&


This is not the right place to do it. Leaf functions should receive a plain threadgroup uint3 gl_Indices[MaxPrimitives] array. The lowering to set_index needs to happen in the wrapped main.

Pretty sure you can read the content of that array in SPIR-V and this code would break that.

HansKristian-Work · 2024-10-23T12:12:28Z

spirv_msl.cpp

+		flush_variable_declaration(builtin_mesh_primitive_indices_id);
+		statement("if (gl_LocalInvocationIndex == 0)");
+		begin_scope();
+		statement("spv_primitive_count = ", to_unpacked_expression(ops[1]), ";");


There is no need for the primitive count to be threadgroup when shaders write to it. There can be a threadgroup variable in the wrapped main that is written before the barrier. It can just be plain thread.

Also, it's missing the vertex count. That is useful when copying out vertex data. No need to copy out unused vertices ... This fake builtin can just be a uvec2 really.

HansKristian-Work · 2024-10-23T12:13:09Z

spirv_msl.cpp

-			// Relevant for multiple entry-point modules which might declare unused builtins.
-			if (!active_input_builtins.get(bi_type) || !interface_variable_exists_in_entry_point(var_id))
-				return;
-


Is this just reindentation? Needs to be fixed.

HansKristian-Work · 2024-10-23T12:13:41Z

spirv_msl.cpp

-					statement(builtin_type_decl(bi_type), " ", to_expression(var_id), " = ",
-					          to_expression(builtin_invocation_id_id), ".x % ", this->get_entry_point().output_vertices,
-					          ";");
-				});


This also just looks like a massive reindentation to me.

HansKristian-Work · 2024-10-23T12:15:45Z

spirv_msl.cpp

+	// GLSL: Once this instruction is called, the workgroup must be terminated immediately, and the mesh shaders are launched.
+	// TODO: find relieble and clean of terminating shader.
+	statement("spvMpg.set_threadgroups_per_grid(uint3(", to_unpacked_expression(block.mesh.groups[0]), ", ",
+	          to_unpacked_expression(block.mesh.groups[1]), ", ", to_unpacked_expression(block.mesh.groups[2]), "));");


return; should be called after this. Also needs to check that this is only used in the entry function, otherwise just throw an error saying it's not implemented.
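
As a sketch of what this review comment asks for (names are hypothetical; this is not the code that ended up in the PR): emit the set_threadgroups_per_grid call followed by `return;`, and refuse to translate when the instruction is not in the entry function.

```cpp
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical helper: lowers EmitMeshTasksEXT(x, y, z) to MSL statements.
// `in_entry_function` is assumed to be supplied by the caller; how the compiler
// determines it is outside the scope of this sketch.
std::vector<std::string> lower_emit_mesh_tasks(const std::string &x, const std::string &y,
                                               const std::string &z, bool in_entry_function)
{
    if (!in_entry_function)
        throw std::runtime_error("EmitMeshTasksEXT outside of the entry function is not implemented.");

    return {
        "spvMpg.set_threadgroups_per_grid(uint3(" + x + ", " + y + ", " + z + "));",
        "return;", // emulates the terminator semantics of EmitMeshTasksEXT
    };
}
```

The explicit `return;` is what restores the "terminator" behaviour for the common case where the call sits directly in main().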

HansKristian-Work · 2024-10-23T12:16:29Z

spirv_msl.cpp

+		}
+
+		string quals;
+		quals = member_location_attribute_qualifier(type, index);


flat/centroid/sample is only relevant for fragment stage.

HansKristian-Work · 2024-10-23T12:16:50Z

spirv_msl.cpp

+	{
+		if (is_builtin)
+		{
+			switch (builtin)


This smells like code duplication. Any reason it's done like this?

HansKristian-Work · 2024-10-23T12:23:22Z

spirv_msl.cpp

+		statement("spvPerVertex spvV = {};");
+		for (uint32_t index = 0; index < uint32_t(type_vert.member_types.size()); ++index)
+		{
+			uint32_t orig_var =


This path is a bit too naive for more complex objects. E.g.

#version 450
#extension GL_EXT_mesh_shader : require
layout(max_vertices = 3, max_primitives = 1, triangles) out;
layout(local_size_x = 1) in;

out gl_MeshPerVertexEXT
{
    invariant vec4 gl_Position;
} gl_MeshVerticesEXT[3];

layout(location = 0) out float foos[3][4];
layout(location = 4) out Foo
{
    float bar;
} bars[3];

void main()
{
    SetMeshOutputsEXT(3, 1);
    gl_MeshVerticesEXT[0].gl_Position = vec4(1.0);
    gl_MeshVerticesEXT[1].gl_Position = vec4(1.0);
    gl_MeshVerticesEXT[2].gl_Position = vec4(1.0);
    gl_PrimitiveTriangleIndicesEXT[0] = uvec3(0, 1, 2);
    foos[0][0] = 10.0;
    foos[1][1] = 20.0;
    foos[2][2] = 20.0;
    bars[0].bar = 4.0;
    bars[1].bar = 5.0;
    bars[2].bar = 6.0;
}

for (uint spvI = gl_LocalInvocationIndex, spvThreadCount = (gl_WorkGroupSize.x*gl_WorkGroupSize.y*gl_WorkGroupSize.z); spvI < 3; spvI += spvThreadCount)
{
    spvPerVertex spvV = {};
    spvV.gl_Position = gl_MeshVerticesEXT[spvI].gl_Position;
    spvV.foos_0 = foos[spvI];
    spvV.foos_1 = foos[spvI];
    spvV.foos_2 = foos[spvI];
    spvV.foos_3 = foos[spvI];
    spvV.bars_bar = bars[spvI].bar;
    spvMesh.set_vertex(spvI, spvV);
}

It doesn't seem to lower arrayed objects: every flattened foos_N member is assigned the whole foos[spvI] array instead of a single element.

I feel like there should probably be a way to reuse the existing lambda stuff to lower the output writes. That might be something I have to look into once the rest of the implementation is in an acceptable state.
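As an illustration of what per-element lowering could look like for the foos example above, here is a small sketch. It assumes the intended output indexes the inner array once per flattened member (foos[spvI][0], foos[spvI][1], ...); that expectation is an inference from the review comment, and `lower_array_member` is a hypothetical helper, not part of SPIRV-Cross or of this PR.

```cpp
#include <string>
#include <vector>

// Hypothetical helper: builds one copy statement per element of an arrayed
// per-vertex output, so each flattened struct member receives a scalar
// rather than the whole array.
std::vector<std::string> lower_array_member(const std::string &name, unsigned element_count)
{
	std::vector<std::string> statements;
	for (unsigned i = 0; i < element_count; ++i)
	{
		const std::string idx = std::to_string(i);
		// spvV and spvI are the names used in the generated loop quoted above.
		statements.push_back("spvV." + name + "_" + idx + " = " + name + "[spvI][" + idx + "];");
	}
	return statements;
}
```

Calling lower_array_member("foos", 4) produces four statements, the second of which is spvV.foos_1 = foos[spvI][1];.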

HansKristian-Work · 2024-10-23T12:32:45Z

spirv_msl.cpp

+		num_invocaions = mode.workgroup_size.x * mode.workgroup_size.y * mode.workgroup_size.z;
+	}
+
+	{


This doesn't need a separate block.

HansKristian-Work · 2024-10-23T12:33:02Z

spirv_msl.cpp

+		if (num_invocaions < mode.output_vertices)
+		{
+			statement("for (uint spvI = gl_LocalInvocationIndex, spvThreadCount = "
+			          "(gl_WorkGroupSize.x*gl_WorkGroupSize.y*gl_WorkGroupSize.z); spvI < ",


This should loop over spv_vertex_count that was set earlier.
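To make the strided copy loop concrete, the sketch below simulates it on the CPU. It assumes the loop bound is a runtime vertex count (the role spv_vertex_count plays in the comment above) rather than the compile-time maximum; `simulate_vertex_copy` and its parameters are hypothetical names used only for this illustration.

```cpp
#include <cstdint>
#include <vector>

// Simulates every thread of a workgroup running
//   for (i = thread_index; i < vertex_count; i += thread_count)
// and returns how many times each vertex slot was written.
std::vector<uint32_t> simulate_vertex_copy(uint32_t thread_count, uint32_t vertex_count)
{
	std::vector<uint32_t> writes(vertex_count, 0);
	for (uint32_t tid = 0; tid < thread_count; ++tid)
	{
		// Each thread starts at its own index and strides by the workgroup size.
		for (uint32_t i = tid; i < vertex_count; i += thread_count)
			++writes[i];
	}
	return writes;
}
```

With any non-zero thread count, each slot below vertex_count is written exactly once and nothing beyond it is touched, so bounding the loop by the runtime count avoids copying vertices the shader never declared.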

HansKristian-Work

I tried a local merge conflict resolve and it seemed trivial.

My biggest concerns now are:
High code duplication.
Lots of diffs which only seem to be indentation changes, which makes it impossible to review.
Misc structural issues.
Also make sure to rebase and squash. Having 10+ commits with random commit messages isn't helpful. Having several clean commits to go through would be ideal, but that is asking for a lot of extra work and is not a requirement.

Try mentioned this pull request Dec 18, 2022

[MSL] Translate SPV_NV_mesh_shader Mesh shaders to Metal 3 mesh shaders #1962

Open

Try commented Dec 18, 2022

Try force-pushed the msl-mesh-shader branch from e02cf1c to f09df55 Compare January 2, 2023 21:27

HansKristian-Work reviewed Jan 12, 2023

Try mentioned this pull request Feb 2, 2023

MacOS/Metal bringup Try/OpenGothic#142

Closed

rebase mesh-shader PR on latest main

04fd811

Try marked this pull request as ready for review December 29, 2023 20:12

Try added 2 commits January 5, 2024 21:46

Merge branch 'main' into msl-mesh-shader

1f0a4f7

Merge branch 'main' into msl-mesh-shader

d52dd84

HansKristian-Work reviewed Oct 23, 2024

HansKristian-Work requested changes Oct 23, 2024

