-
Notifications
You must be signed in to change notification settings - Fork 17.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
proposal: add package for using SIMD instructions #53171
Comments
As for what the API might look like, the following yenc encoder may provide some insight. Rustfn yenc(input: [u8;64]) -> [u8;128] {
use core::simd::*;
let mut indata = u8x64::from_array(input); //u8x64::from_array(input);
indata += u8x64::splat(42);
// mark special characters
let mut mask = indata.lanes_eq(u8x64::splat(0));
mask |= indata.lanes_eq(u8x64::splat(10));
mask |= indata.lanes_eq(u8x64::splat(13));
mask |= indata.lanes_eq(u8x64::splat(61));
indata += mask.select(
u8x64::splat(64), // add 64 where the mask is set to true
u8x64::splat(0) // and don't do anything where it isn't
);
let escape = mask.select(
u8x64::splat(61), // add '=' (ASCII 61) where the mask is set to true
u8x64::splat(0) // and don't do anything where it isn't
);
let mut result: [u8;128] = [0;128];
for i in 0..64 {
result[i*2] = escape[i];
result[i*2+1] = indata[i];
}
return result;
} Gopackage
|
The difficulty with a general purpose approach to SIMD, which is what you are suggesting, is that the performance can be dramatically different on different processors. Also, for specific processors, performance is not optimal as not all special purpose instructions are available. (The difficulty with a processor-specific approach to SIMD is that you have to write different code for each processor.) (As a side note, I don't see any reason to have a package like simd/sse2 in your description. Instead, we would arrange to use the appropriate implementation when building the simd package.) |
I hope the proposal can support different architectures. Arm SVE has a feature called (VLA)Vector Length Agnostics. Different H/W may implement the vector size differently. See: a sneak peek into sve and vla programming On the API level, it is better that we can decide the vector length at runtime instead of hard coding it as 8x64. Reference: OpenJDK Vector API On the compiler side, so far Go ABI needs the frame size to be decided at compile time, which doesn't support VLA programming very well yet. |
That might not be as easy as it sounds since that would make binaries potentially less portable. Adding a fallback to use in case a specific instructionset is not supported would help in offsetting this. (Yes, I'm aware that this is also not as simple as it sounds. |
I think we should provide different packages for different instruction sets . |
@mpldr We already have a mechanism for making binaries more or less portable: the |
It's not the same of writing assembly file for each CPU? I think that defeats the purpose of SIMD package. I think the Zig approach is better (and I think Rust is similar). For instance, Zig provides one |
I think the Zig approach is better (and I think Rust is similar). For instance, Zig [provides one ***@***.***`](https://ziglang.org/documentation/master/#Vectors) function which works on any CPU, can [read more about that here](https://zig.news/michalz/fast-multi-platform-simd-math-library-in-zig-2adn). That makes the code as portable as non-simd version.
I also like the approach of just allowing the various operations to be
applied to it, but this would require a language change. (Also adding
60-something new types to the language seems rather contrary to "the Go
spirit")
|
This seems to make the code more portable, and I support it. |
Indeed, a vectorize package that is portable and that uses the CPU relevant instructions would be the best solution. On platforms that do not have such instructions, the default implementation could still be useful for portability and optimized for performance. |
The current consensus on portability is to use appropriate implementations when building SIMD packages, which I support. |
I came across https://github.com/google/highway a few weeks ago. Is there anything we can learn from that about portable API for SIMD? |
This proposal has been added to the active column of the proposals project |
Overall, as mentioned by people a platform independent implementation if simd is at best a half implementation. While some of the aspects are platform independent, and it may be possible to port a fraction, there is simply too much difference between platforms to make anything that would be genuinely useful. While it is "neat" Falling back to Go implementation of SIMD would in many cases be much worse than straight up Go, and overall design of this will just slow down the availability. Example: MPSADBW -if there is no HW support, the fallback will be horrible. SIMD intrinsics should be able to live alongside Go code. The compiler controls register use. This will allow using simd and other instructions without the now forced function call overhead. SIMD should be guarded by platform tags. SIMD types should be available for all platforms, but the functions shouldn't be abstracted, and should match underlying instructions, maybe with some compound functions. Here are some of my previous observations when looking at intrinsics for Go: Feature detectionThere is a huge number of individual features. Compile-time feature specification will always just select a very low common denominator, and It should be easy (maybe looking at imports), which features to check for. The detection should be part of the intrinsic offering. A quite tricky thing is that some instructions contains several "forms", but with different features. For example ANDPS xmm1, xmm2 (SSE) also has a non-destructive 3 register version Data typesThe tricky part about data types is that it will often require type conversion, or be untyped. Data loaded as a Some intrinsics has no clear type. For example I don't have a ready-to-go solution, but having to copy input from a []byte to a []float32 or vice versa must be avoidable. The compiler cannot enforce forced constant values. Take PSHUFD, which has an 8 bit immediate value. This must be resolvable at compile time. With current function definitions that isn't really doable, so some handling would be needed for this. Pointer arguments are a little tricky. There aren't that many, but prefetch instructions and gather/scatter and of course loading and saving. Edit: Final word on portable SIMD: I am not against it, but Go should supply the tools to write portable SIMD packages, that cover the feasible subsets. |
@rsc Highway seems to be a good idea. Since Go now has generics, and Highway is Apache licensed, is there any reason why someone interested could not port it to Go? That would be the first step, I think. @klauspost It sounds more like you want to use inline SIMD assembly than have a portable vector API. While the go compiler already supports assembly in separate files, I don't think inline SIMD assembly in Go is a good idea. Edit: a third approach would be for the Go compiler to optimize certain array and float operations using SIMD if possible. This should be documented then, though. |
@beoran That is the title of the proposal. While portable solutions and automatic vectoring are neat, it only provides a band-aid solution, with quite limited usability. |
I think the most promising approach to deal with the plethora of instruction set variants and CPU types while avoiding the explosion of the number of compiler flags, different binaries, etc. is to use a JIT-based code generation approach like https://asmjit.com makes possible. This concept is used very successfully in a highly optimized 2D graphics library, for generating CPU-specific graphic pipelines on the fly, that outperform any other library. |
Will SIMD support in Go include back-propagation of ISA (Instruction Set Architecture) incompatibilities during compile-time (from callees to callers) via Go's type system, probably by using an updated version of generic types with support for partial specialization? If this won't be the case, then the runtime performance of AVX-tuned programs will be random on ARM/RISC-V/etc (and vice versa, the performance of ARM/RISC-V/etc-tuned programs will be random on x86 CPUs). The main purpose, and almost the only purpose, of adding SIMD support to an already general-purpose programming language is performance (faster execution and/or lower energy consumption): If the type system of the programming language is unable to propagate information about potential performance issues, then adding SIMD support to the runtime and/or to the language itself (language syntax) is questionable/suboptimal, assuming that it is a multi-platform programming language. |
FWIW, Go used to do very minimal JIT and found that made Go untenable in certain environments, so we removed that. Adopting a JIT here would be a big step and probably not what we want. |
Can we come up with a list of example use cases that must be supported well by any approach that we take? Right now it's all very hypothetical. |
@rsc Many, many moons ago I made a test how it could look. It parsed Intel/Arm intrinsic information and generated stubs for intrinsic functions. It is pretty much a 1:1 mapping of C intrinsics to Go functions. This choice was to limit the documentation needed and also to make C intrinsics easy to convert. Example A is a conversion of the assembly version. The idea was that with the native types it would be possible to hook platform intrinsics into the compiler, similar to how selective intrinsics are done in Example B is a more complex example that loads a linear 16 bit RGB image data, applies a 3x3 matrix, an sRGB gamma curve and finally converts to 8 bit values and stores the result. There are also other bonus examples in this file. This is by no means "final proposal quality", and the types are just temporary, but was to give an idea of how Go could look with intrinsics. This package makes no real attempt to unify similar types across platforms. The ARM part is rather lacking, but was included for discussion. |
@klauspost, With all due respect, but that example A in Go is about as readable as it's assembly counterpart. It is all very low level. I don't see any significant benefits over just using assembly. I'd rather have a more high level package with Matrix and Vector, etc. types that use whatever acceleration is available. Somewhat like https://github.com/go-gl/mathgl perhaps. Edit: or something like this: https://github.com/rkusa/gm |
@beoran Someone who does understand it can write your abstracted high-level libraries on top of this. You cannot write this on top of an abstracted, lowest common denominator package. |
@klauspost Now, someone who understands can also write such a higher level library in assembly, like in gm. My point is that I don't see how such a low level API is significantly easier to use than assembly. Even for something simple like the sum of int in Go, there is the significant convenience that the compiler chooses the assembly instructions necessary for the platform. This is not the case with such an extremely low level API. Looking at the examples, it seems like adding an int128 type to the compiler, and then have a (standard) library with vectors and matrixes would be much more convenient and teachable to new and existing Go programmers. |
I quite like the shown API.
I would personally disagree with this, but 20 people will likely give you 21 opinions. I find it way easier to read and understand (it could use some more comments though)
Writing assembly and using intrinsics is leagues apart. This also would require to write assembler for every architecture (which is not exactly "easy") instead of having the compiler do this. |
Here are two (slightly redacted) examples from code I personally wrote. These are both quite straightforward, as SIMD goes. I can dig up more examples if it'd be helpful. (1)
This has two variants, one that uses SSE2 and one that uses AVX2. The SSE2 variant constructs an X register filled with 128 in each byte (PUNPCKLBW, PSHUFL). Then on each loop, it does two loads into X registers (MOVOU), subtracts 128 from both inputs to convert signed comparison to unsigned comparison (PSUBB), does a per-byte comparison (PCMPGTB), extracts the top bit from each byte (PMOVMSKB), then exits with true if any of those bits are set. The AVX2 variant is similar, but doesn't require the 128 register (VMOVDQU/VPSUBB/VPCMPGTB/VPMOVMSKB). Don't forget the VZEROUPPER. (2)
countMatches8 accepts a []uint64 (ptr to 0th element and length), a mask, and a word to match. For each uint64 The caller arranges to handle the scalar entries at the end. (I happen to have alignment guarantees.) This version requires AVX2. It broadcasts mask and word to Y registers (VPBROADCASTQ), zeros a single sum accumulation register (VXORPS), and loops over each input. It does a load (VMOVDQU), a bitwise mask (VANDPS), an eq comparison (VPCMPEQQ), and adds that to the accumulation register (VPADDQ). At the end, it does a horizontal sum across the accumulation register by extracting the two halves into X registers (VEXTRACTI128/VEXTRACTI128) and adding them (VPADDQ), then reduces from X to GP the same way (PEXTRQ/PEXTRQ/ADDQ), then negates the result (because "equal" generates 0xFFFFFFFF, which is -1, but we want to count up the results, not down). And then VZEROUPPER. There is also countMatches16, which does the same, but is masking+comparing pairs of uint64s (i.e. 16 bytes at a time). This is trickier. The main trick is using VPERMQ to rearrange the internal components of the register and then do VANDPS, so that both members of a pair of uint64s must match in order to qualify as a match. I'll drop the code at the end of this comment. I also implemented these on arm64. The broad approach is very similar, although the VPERMQ was implemented using two rounds of VTRN1/VTRN2. Lastly, I also have "hasMatches" variants for each of these which exit early as soon as a single match is found. I chose these two examples because they're (a) not particularly unusual or tricky, just a bunch of vector manipulation and yet (b) they're not completely trivial, particularly the vector rearrangement. Code for amd64 countMatches16. I'm happy to share any of the other full code mentioned here on request.
|
Its true that it could be written in assembly. But assembly has all of the following properties:
This long list of legitimate downsides leads to things like the (very reasonable) Go Assembly policy. But the thing is that none of those properties are intrinsic (no pun intended) to the problem space. Only the first one (non-portability) is. And even for that one, the way you build portable APIs is you define them in terms of non-portable APIs (e.g. package os building on package syscall). Today if you want to build portable APIs, you can either do it in the compiler, or do it in assembly (and pick up all the negative properties I mentioned above). That seems like a false choice. |
Just a note (1): It is "non-portable" only as long as the assembly code is expected to run on a different CPU architecture at the same performance level as on a CPU which can run the code natively.
Just a note (2): It is possible (although in practice it is highly uncommon) to write safe code in assembly, as long as the compiler is able to keep track of how information flows within the assembly program. |
Talking about prior art for portable APIs, there is also the ongoing work on the new Java Vector API (JEPs 338, 414, 417, and 426) that, funnily enough, is somewhat reminiscent of |
If we add a package here, we really want to do something that's not CPU-specific. There are already things like Avro to help write assembly. #53171 (comment) showed a Rust program that looks portable. Is there something we can learn from Rust's approach? |
My key takeaways would be:
I think the goals/non-goals from JEP 417 could serve as some inspiration as well. |
@rsc Avo doesn't address the fact that Assembly functions can't be inlined or use the register-based ABI. The practical impact is that rather than implementing algorithms largely in Go and using a small bit of assembly to actually access PCMPESTRI or similar, it seems necessary to implement entire compute kernels in assembly. Any interest in addressing either of those issues, rather than CPU-specific Go intrinsics? (Admittedly, there might be other followup compiler work necessary to truly eliminate any motivation to have large compute kernels reimplemented in assembly, e.g. improving bounds-check elimination and similar). FWIW, I absolutely agree that there should be a portable SIMD library. But I also feel like that'll leave perf on the table. E.g. Intel in their infinite wisdom has dedicated silicon to "Test if any characters within the specified ranges are in the input string" (same link as above). That seems pretty much tailor-made for accelerating Regex character classes (among other things), but I think you'd correctly reject someone adding significant assembly to the Regex package to take advantage of that silicon even if they did use Avo. It seems like to take advantage of things like this, there has to be a middle ground between portable SIMD in Go and assembly (even with Avo).
It seems pretty great, as far as portable SIMD is concerned. For Go, it seems simpler to just generate distinct types rather than trying to do this via Generics. e.g. generate types Open question: how do we avoid conditional branches inside each of these functions for checking whether e.g. AVX2 is available or similar. One option is that it'll use GOAMD64 and friends to only use instructions guaranteed to be available and otherwise fallback to emulation using scalar code (or a smaller vector size). Another option is the compiler could try to coalesce the conditional checks to a higher level, rather than doing it per function. (Seems like maybe generally useful functionality, e.g. for coalescing multiple write-barrier invocations and such, so maybe this already exists). Open question 2: it seems less obvious how to adapt this type of API to variable-length SIMD, like ARM SVE or the RISCV vector extension. This seems a bit unfortunate, as that microarchitecture pattern really seems like the best case scenario from a portability perspective. |
It sounds like we are not converging on anything in particular here. I've heard that other people are thinking about this too but also don't have a clear plan forward. Let's put this to the side for now and continue to experiment in hopes of finding something we can propose in the future. |
Based on the discussion above, this proposal seems like a likely decline. |
No change in consensus, so declined. |
For people here that were interested in a portable SIMD, you could join the discussion in #58610 . |
That's for Go 2 and requires significant breaking changes |
What was wrong with just implementing a library/package similar to Rust? https://doc.rust-lang.org/std/simd/struct.Simd.html |
Just to clarify your point, the proposed change will not break any existing program. The same way that generic was introduced without breaking past code, this proposal is also not introducing any. |
SIMD has the potential to greatly increase performance of a lot of data processing applications. Since #35307 was closed with the remark
Generics are now available, so I want to use this opportunity to necrobump and suggest a
simd
package, which allows using SIMD instructions via a highlevel API. I think a simple API like Rust's experimental SIMD support, or the previously linked WebAssembly and Dart would help a lot of developers in improving their data processing performance in a simple and easy way.The text was updated successfully, but these errors were encountered: