Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Memory packing difference C++ <-> HLSL #58

Open
fstrugar opened this issue Jul 12, 2022 · 16 comments
Open

Memory packing difference C++ <-> HLSL #58

fstrugar opened this issue Jul 12, 2022 · 16 comments
Assignees
Labels

Comments

@fstrugar
Copy link

Unlike in HLSL, HLSL++ will still pack all types in 16b n128, even if running in HLSLPP_SCALAR. That breaks the C++<->HLSL interop with constant/structured buffer packing which is an useful use case. Any ideas on how it could be (easily) hacked to support HLSL-like alignment? :) Thanks!

@redorav
Copy link
Owner

redorav commented Jul 12, 2022

Hi @fstrugar, that is indeed a limitation of the system. In SIMD mode it would be impossible to do, because internally all types are 16 bytes as you've observed. In HLSLPP_SCALAR it would in theory be possible, if you changed the n128 type and made it so that there's n128, n64, n32 or something like that. I don't think it's an easy change though although I could be mistaken.

What I've ended up doing in my toy renderer is to have actual interop types (e.g. shader::float3) that are trivially convertible to the hlslpp types, and they work fine, if that's what you need. This library wasn't thought as a way to interop but as a way to migrate hlsl code from the GPU to the CPU or viceversa so it can be a bit awkward.

@redorav redorav self-assigned this Jul 12, 2022
@fstrugar
Copy link
Author

Thanks! Yeah I'm currently using something like the shader::float3 as you suggested, and HLSL++ looked like it could clean & unify that up, as well as provide full interop on math functions. I'll play around a bit, see what happens, thanks for the quick answer, much appreciated!

@redorav
Copy link
Owner

redorav commented Jul 13, 2022

Yeah I don't know, maybe a header called hlsl++_interop would be useful so that not everyone has to write the same code many times. Let me know how your experience goes and what problems you encounter, it could guide a future implementation.

@devshgraphicsprogramming

Every type would need a template tag argument which would tell what packing convention to follow, then you could modify the aligments.

Pretty invasive change though.

@redorav
Copy link
Owner

redorav commented Mar 22, 2023

@devshgraphicsprogramming In terms of code, what are you suggesting? (just so I understand what you mean by tagging)

@devshgraphicsprogramming
Copy link

devshgraphicsprogramming commented Mar 22, 2023

enum class Packing
{
   Scalar, // align to component type
   ConstantBuffer, // std140 in GLSL
   StructuredBuffer // std430 in GLSL
};

Then

template<Packing align_tag=Scalar/*or whatever you want the default to be*/>

in front of every type you have or as last template arg (you probably don't use variadics anywhere).

then depending on the tag you can constexpr or consteval your way around what to put in alignas or assume_aligned

P.S. C++ is a bit weird, you can't have a size thats not a multiple of alignment, so there's no way to achieve this

struct MyStruct
{
    float3<ConstantBuffer> rgb;
    float a;
};
static_assert(sizeof(MyStruct)==16);

if float3<ConstantBuffer> would align to 16, then its size is also 16. HLSL or GLSL would pack this for you.

See this: https://godbolt.org/z/dhvqGvGGc
The only way it works if you declare alignas on the site where the type is being used as a member : https://godbolt.org/z/hfz9K356a
but I have no clue how to automate that except with fugly macros :D

@devshgraphicsprogramming

P.S. is hlsl++'s float2 actually 8 bytes or 16 ?

P.P.S. Are matrix types stored as rows and is their alignment and size same as the row types?

@redorav
Copy link
Owner

redorav commented Mar 22, 2023

in front of every type you have or as last template arg (you probably don't use variadics anywhere).

I see what you mean, and I agree it's quite intrusive. This isn't the first time this size and alignment problem has come up. It's an unfortunate tradeoff that is imposed by the SIMD requirement. What I do in my own code which I mention elsewhere is that I use some lightweight translation types through the store() and load(), I've yet to migrate this to hlsl++ but it's in the back of my mind.

P.S. is hlsl++'s float2 actually 8 bytes or 16 ?

hlsl++ uses the minimum size it can get away with on the platform, which is 4 floats (i.e. 16 bytes). Once upon a time I thought the __m64 types were a viable option but MSVC decided to deprecate them in 64-bit builds for no particular reason (as they are still supported by the CPUs and the other compilers like Clang and GCC) I've always wanted to just make my own implementation of the types and live with it but it's work I need to do and I don't know how well it plays with the rest of the float types (in terms of conversions). If this is something that would be useful to you I can add it to the list to look at.

P.P.S. Are matrix types stored as rows and is their alignment and size same as the row types?

hlsl++ stores data in whatever way is best to save memory, which means there is no particular physical layout they prefer. For example, float4x2 and float2x4 are the same thing (two "rows"), and float2x2 is a single __m128. The multiplication functions take care of abstracting all that and extract the data as needed.

That's why interfacing with shaders and data storage is a tricky thing, this library is meant for manipulation of data, and not necessarily shader upload which is what some people have complained about. I don't provide a solution in the library but it shouldn't be hard to make your own as all the translation primitives are there.

@devshgraphicsprogramming

__m64 types were a viable option but MSVC decided to deprecate them in 64-bit builds for no particular reason (as they are still supported by the CPUs and the other compilers like Clang and GCC)

cause thats MMX, not even SSE... I don't know if CPUs removed MMX and now you're just getting silent CISC2CISC firmware patch-ups that promote old MMX to SSE.

float2 really needs to be stored as scalar or NEON width-2, you're not gaining any perf from promoting them to float4 (because SSE isn't 4x a lot of the time, and compilers autovectorize quite well). There's a potential 2x ALU gain (realistically 1.7x) but you'll end up fetching 2x the cache lines, so you're just as likely to have a performance regression as you are to have a gain.

Silent promotion and demotion (like you do for your own float3 code) is much better because you're not bloating the storage, and are likely to win on RAM bandwidth which bottlenecks more than pure FLOPS.

hlsl++ stores data in whatever way is best to save memory, which means there is no particular physical layout they prefer. For example, float4x2 and float2x4 are the same thing (two "rows"), and float2x2 is a single __m128. The multiplication functions take care of abstracting all that and extract the data as needed.

HLSL is GLSL row_major by default, meaning that if I were to cast a matrix pointer to a float* I'll first be iterating through a row's components and then rows.

For matrices aligned to 16 bytes, using float4 as internal storage its not a problem (2x2, 4x2, 2x4, 3x4, 4x4).
If you enforce alignment rules of HLSL and GLSL then 3 column matrices will have a 4th "shadow" column and that will make it okay for the 2x3, 3x3 and 4x3.

The corner case is really just the 3x2 which should have a sizeof equal to 3 times float2 which is 24.

I don't provide a solution in the library but it shouldn't be hard to make your own as all the translation primitives are there.

Do you recommend I hard fork and call it something else then?

@redorav
Copy link
Owner

redorav commented Mar 22, 2023

float2 really needs to be stored as scalar or NEON width-2, you're not gaining any perf from promoting them to float4 (because SSE isn't 4x a lot of the time, and compilers autovectorize quite well). There's a potential 2x ALU gain (realistically 1.7x) but you'll end up fetching 2x the cache lines, so you're just as likely to have a performance regression as you are to have a gain.

You are 100% in the right here with respect to NEON, and swizzles in NEON could do with some improvements. However I have limited time to dedicate to the library :) If there's interest and you'd like to help out I'm all for it. The idea behind float2 being SSE is that we never convert back and forth between types.

Do you recommend I hard fork and call it something else then?

This is really up to you, I wouldn't want it in all honesty but I wonder if you have hard requirements that hlsl++ doesn't meet and I can't fix.

I'm open to everything else we've talked about except making the codebase templated. I can see an acceptable compromise where float4x4 optionally derives from some floatNxM<4, 4> type (and so on for the other types) and you use that in your code. That way the codebase stays the same and you get what you need.

@devshgraphicsprogramming

The idea behind float2 being SSE is that we never convert back and forth between types.

You've saved a register pack/unpack (or unaligned load, masked store), but now you're wasting 2x bandwidth.

I'd just let float2 be scalar (unless NEON) and 8 bytes.

@redorav
Copy link
Owner

redorav commented Mar 22, 2023

We're derailing this issue already quite a bit, and there are quite a few requests in the mix. If you can measure that this change you're talking about is considerably faster I can take a look. It's not trivial to change. Otherwise we can stick to the issue or open a new one.

@devshgraphicsprogramming

This is really up to you, I wouldn't want it in all honesty but I wonder if you have hard requirements that hlsl++ doesn't meet and I can't fix.

In my case its a hard requirement that alignof(float2)==8, but yeah I can benchmark and open an issue.

@devshgraphicsprogramming
Copy link

devshgraphicsprogramming commented Mar 22, 2023

enum class Packing
{
   Scalar, // align to component type
   ConstantBuffer, // std140 in GLSL
   StructuredBuffer // std430 in GLSL
};

Then

template<Packing align_tag=Scalar/*or whatever you want the default to be*/>

in front of every type you have or as last template arg (you probably don't use variadics anywhere).

then depending on the tag you can constexpr or consteval your way around what to put in alignas or assume_aligned

P.S. C++ is a bit weird, you can't have a size thats not a multiple of alignment, so there's no way to achieve this

struct MyStruct
{
    float3<ConstantBuffer> rgb;
    float a;
};
static_assert(sizeof(MyStruct)==16);

if float3<ConstantBuffer> would align to 16, then its size is also 16. HLSL or GLSL would pack this for you.

See this: https://godbolt.org/z/dhvqGvGGc The only way it works if you declare alignas on the site where the type is being used as a member : https://godbolt.org/z/hfz9K356a but I have no clue how to automate that except with fugly macros :D

There is no way for this to happen automatically without C++ Reflection TS, a mountain of templates which rebuild your types with correct member alignments.

The only compromise with the current state of C++ is to provide two flavours of float3, one which secretly stores a float4 (std140) and another which stores x,y,z and promotes to float4 when doing any operation (scalar).

Std430 (storage buffer) is not possible in C++ because it would allow another member in a structure to alias the padding.

There' is a dirty hack possible with a macro and a template

#define HLSL_MEMBER(TYPE) alignas(align_trait<TYPE>::alignment) align_trait<TYPE>::type

template<AlignmentTrait align_trait=constant_buffer_alignment_trait>
struct MyStruct
{
   HLSL_MEMBER(float3) rgb;
   float a;
};

but then this will make your structs have sizeof==alignof and I'd need to re-read the rules for std430 and Structured buffer whether that is in line with what they dictate about struct types.

Also IIRC this would STILL require that you have a special float3 type of size float4 for array members. And similar for all types for std140 packing.

This is all nuts, you will go insane and I just recommend not worrying about this at all and pretending everyone supports VK_EXT_scalar_block_layout

This is DirectX's default for Storage Buffers anyway, no idea how to make DX use that rule for Uniform Buffers. Do you just compile with -no-legacy-cbuf-layout ?

But then I think both DirectX and HLSL will make the NPoT column-count matrices actually occupy just as much space as is needed.

Sorry I'm new to HLSL and DirectX.

@devshgraphicsprogramming

support for scalarBlockLayout seems to be ubiquitous across all Vulkan devices now, if you're targetting SPIR-V with DXC/HLSL use -fvk-use-scalar-layout however your HLSLPP_SCALAR does not mean that alignof(float3)==sizeof(float) and sizeof(float3)==12 which would be a problem.

@redorav
Copy link
Owner

redorav commented Sep 1, 2023

HLSLPP_SCALAR means non-SIMD, it is unrelated to any extension of SPIR-V or Vulkan. I only created it to compare performance between the SIMD version and one that isn't. I'm going to have to make a clear note in the front page to say that shader interop of this kind is not a feature of this library.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

3 participants