v128.const and v128.shuffle blow up code size #255
Comments
Agreed that these instructions are very large, but they're likely to be relatively rare in the module. I'd be curious to see how often they occur in real modules.
Shuffles are very common in realtime 3D applications, from regular 3x3, 3x4, and 4x4 matrix math to quaternion math. Over a whole application they might not contribute all that much, but in a hot code path they could definitely add significant overhead. As long as the bytecode can be compiled into efficient machine code, the size should go down at the asm level and not be a huge deal IMO.
@binji just a 4x4 transpose ended up with 8 of them, and they would be present in most code that rearranges elements of an input, for example code dealing with matrices or bitmapped graphics. Free-form shuffles are relatively rare real-world instructions, but the operations demonstrated in the example above are quite common in SIMD code. Part of the problem is that we end up using [...] @Maratyszcza thank you for the great illustration!
Given all the related discussion we've had about shuffles, I would really want to see the code size impact on a whole application (even one constructed to heavily use shuffles) before reconsidering whether we should have more specific shuffle operations.
Worth noting is that repeated shuffle masks should compress reasonably well. I've compiled both examples into a binary (extracting just the "machine" code) and used zstd to compress them (zstd rather than gzip since it has smaller headers; I'm unsure how to coerce gzip into emitting raw deflate without the cruft):
So yes, there's a large impact pre-compression, but it's less significant post-compression. On a larger program I feel like the efficiency of LZ matches is going to be even better. Here the matches are likely mostly 4 bytes wide and occasionally 8 bytes wide for shuffle masks, whereas on a larger program I'd expect few unique shuffle masks and deduplication (through LZ) across different functions.
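The compression argument above can be tried on synthetic data. The following is a minimal sketch, not the measurement from this comment: it uses Python's zlib with a negative `wbits` value, which produces a raw deflate stream without gzip or zlib framing, instead of zstd. The opcode byte, the four masks, and the surrounding `local.get` bytes are illustrative assumptions and are not taken from the compiled examples.

```python
import zlib

# Assumed instruction layout: SIMD prefix byte, one opcode byte, 16 lane-index bytes.
# The opcode value is a placeholder chosen for illustration.
SIMD_PREFIX = 0xFD
SHUFFLE_OPCODE = 0x0D

# Four byte-shuffle masks of the kind a 4x4 float transpose might use
# (interleaving 32-bit and 64-bit pieces of two input vectors). Hypothetical.
MASKS = [
    bytes([0, 1, 2, 3, 16, 17, 18, 19, 4, 5, 6, 7, 20, 21, 22, 23]),
    bytes([8, 9, 10, 11, 24, 25, 26, 27, 12, 13, 14, 15, 28, 29, 30, 31]),
    bytes([0, 1, 2, 3, 4, 5, 6, 7, 16, 17, 18, 19, 20, 21, 22, 23]),
    bytes([8, 9, 10, 11, 12, 13, 14, 15, 24, 25, 26, 27, 28, 29, 30, 31]),
]


def shuffle_instruction(mask: bytes) -> bytes:
    """Encode one 18-byte shuffle instruction under the assumed layout."""
    assert len(mask) == 16
    return bytes([SIMD_PREFIX, SHUFFLE_OPCODE]) + mask


def synthetic_body(repeats: int) -> bytes:
    """Build a fake function body that cycles through the few unique masks."""
    out = bytearray()
    for i in range(repeats):
        out += bytes([0x20, 0x00, 0x20, 0x01])  # local.get 0; local.get 1
        out += shuffle_instruction(MASKS[i % len(MASKS)])
    return bytes(out)


def raw_deflate(data: bytes) -> bytes:
    """Compress to a raw deflate stream (negative wbits: no header or trailer)."""
    comp = zlib.compressobj(level=9, wbits=-15)
    return comp.compress(data) + comp.flush()


raw = synthetic_body(64)
packed = raw_deflate(raw)
print(f"uncompressed: {len(raw)} bytes, raw deflate: {len(packed)} bytes")
```

A stream this repetitive flatters the compressor, so the ratio it prints is an upper bound on what a real module would see, not an estimate.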
v128.const and v128.shuffle instructions are always 18 bytes long, which is excessive in the majority of cases. For comparison, native shuffle instructions on x86-64 are at most five bytes long, even though they are more restricted in shuffle patterns. For example, here's an implementation of a 4x4 matrix transpose in WAsm SIMD and SSE intrinsics:
The x86-64 SSE version compiles to 67 bytes of code, but the WAsm SIMD version produces 231 bytes of code, about 3.5X larger.
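To make the 18-byte figure concrete, here is a rough sketch of the size arithmetic. It assumes the instruction is encoded as one prefix byte, a LEB128 opcode, and a 16-byte immediate; the opcode value is a placeholder. The count of 8 shuffles in the transpose comes from a comment in this thread, not from the code listing.

```python
def leb128_len(value: int) -> int:
    """Number of bytes needed to encode an unsigned integer as LEB128."""
    n = 1
    while value >= 0x80:
        value >>= 7
        n += 1
    return n


def simd_instr_size(opcode: int, immediate_bytes: int) -> int:
    """Size of a prefixed SIMD instruction: prefix + LEB128 opcode + immediate."""
    return 1 + leb128_len(opcode) + immediate_bytes


# Placeholder opcode (assumption); any opcode below 0x80 takes one LEB128 byte.
SHUFFLE_SIZE = simd_instr_size(0x0D, 16)

WASM_TRANSPOSE_BYTES = 231   # figure quoted in the issue
SSE_TRANSPOSE_BYTES = 67     # figure quoted in the issue
SHUFFLES_IN_TRANSPOSE = 8    # count mentioned in the thread

shuffle_bytes = SHUFFLES_IN_TRANSPOSE * SHUFFLE_SIZE
share = shuffle_bytes / WASM_TRANSPOSE_BYTES
ratio = WASM_TRANSPOSE_BYTES / SSE_TRANSPOSE_BYTES

print(f"one shuffle: {SHUFFLE_SIZE} bytes")
print(f"shuffles account for {shuffle_bytes} of {WASM_TRANSPOSE_BYTES} bytes ({share:.0%})")
print(f"WAsm/SSE size ratio: {ratio:.2f}")
```

Under these assumptions, the shuffle instructions alone account for well over half of the WAsm function body.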
I suggest that WAsm SIMD define more restricted variants of v128.shuffle and v128.const which could be encoded in a more concise representation.
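As one illustration of what "more restricted" could mean, the sketch below checks whether a 16-byte shuffle mask only moves whole 32-bit lanes and, if so, packs it into 12 bits (four 3-bit lane indices). This encoding is purely hypothetical: the function name and the bit layout are invented for the example and are not part of the proposal or of any specification.

```python
from typing import Optional


def pack_lane32_shuffle(mask: bytes) -> Optional[int]:
    """Pack a byte-shuffle mask into 12 bits if it only moves whole 32-bit lanes.

    Returns None when the mask needs the general 16-byte form.
    Hypothetical encoding: 3 bits per output lane, lane 0 in the low bits.
    """
    if len(mask) != 16:
        raise ValueError("shuffle mask must be exactly 16 bytes")
    packed = 0
    for lane in range(4):
        group = mask[4 * lane : 4 * lane + 4]
        first = group[0]
        # The group must start on a lane boundary, select from the 32 input
        # bytes, and cover four consecutive bytes of the same source lane.
        aligned = first % 4 == 0 and first < 32
        if not aligned or any(group[k] != first + k for k in range(4)):
            return None
        packed |= (first // 4) << (3 * lane)  # source lane index 0..7
    return packed


# A lane-interleaving mask of the kind used in a transpose (illustrative).
interleave_low = bytes([0, 1, 2, 3, 16, 17, 18, 19, 4, 5, 6, 7, 20, 21, 22, 23])
# A byte reversal, which cannot be expressed as a 32-bit lane shuffle.
byte_reverse = bytes(range(15, -1, -1))

print(pack_lane32_shuffle(interleave_low))
print(pack_lane32_shuffle(byte_reverse))
```

With a 2-byte immediate instead of 16, such a variant would take 4 bytes under the prefix-plus-opcode layout assumed here, at the cost of a fallback to the general form for masks like the byte reversal.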