-
Notifications
You must be signed in to change notification settings - Fork 17.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
proposal: Go 2: Add implicit single program multiple data #58610
Comments
I am not an expert in the topics covered by this proposal but I can tell you right now this seems very unlikely to get accepted to me. Adding keywords to the language is a violation of the Go 1 compatibility guarantee in 99.9% of circumstances because it causes any existing code where those words are used as identifiers to break, and |
Indeed, this is a valid concern, but I do not think you can actually use those keyword in this context in your code today. Sure you can create a type or a variable with those name, but you wouldn't be able to have them in the language grammar use in the same place. It will likely require a bit of gymnastic in the go parser to not break existing code, but I do think it is doable to implement this proposal without breaking existing program. |
What if I write? func Example(uniform int32) {
...
} Is that an unnamed argument of type |
In my proposal, you can not have an uniform outside of an ispmd context (as all variable outside of an ispmd context are by definition uniform). You would have to prefix the function with ispmd first to have this problem, which then isn't one as ispmd context didn't exist and so you would not run into this problem. |
I'm sorry, I don't understand this proposal. Can you give a short, clear explanation of what the ismpd mandel function is doing at the end of the proposal. Go is intended to be a simple language that is very to understand. I'm sure I am missing something but right now this proposal does not seem simple nor easy to understand. |
It is a direct implementation of the Mandelbrot Set. As pointed out in the Context section, Mandelbrot Set are often used as a benchmark. The main reason for it to be used as a benchmark is that without a proper expression of parallelism in the control flow of the program, you can not use SIMD instruction for it (as the parallelism is not per iteration inside of a pixel, but between pixels). In this example, I used ispmd to tell that the iteration on x allow for data parallelism (processing a bunch of pixels per line in parallel). The ispmd before the |
Thanks. What is the exact meaning of |
Is the usage of |
There is two reason to specify The second reason, is to be able to specify that some of the parameters or the returned value are In this example, |
No, it isn't a mistake. It actually what make things work. It does mean here that With that in mind, it iterate a bunch of pixels from each line at a time, doing all the vertical first (via the |
I don't understand how the Go compiler could use this information. I don't understand how it could affect the code generated for this function. |
From the compiler point of view, each line is directly manipulating SIMD register. Let's start inside This first line switch to an Now come the call to With Back in SerialMandelbrot, the returned value can be saved in output. index being a SIMD variable of the same size as the returned value, it provide the location for all elements being returned to get saved at the right place in Whatever SIMD instruction set, the compiler will with this always be able to generate SIMD code for a |
I still don't really understand, but let me try asking a different question. Does the |
The |
Maybe a different angle to answer your question about semantic is to take a step back on what means |
I understand that |
Totally, in the same way goroutine doesn't mandate to have multi-core or thread, |
Thanks. Since this doesn't affect semantics, that suggests that this should be a compiler directive, like |
Wouldn't that make it less obvious when reading the code that there is potential for race condition? |
How do you imagine adding |
Actually thinking more about this, if the question was "can you add ispmd to any function and expect the same result", the answer would be no. Does that change your recommendation? |
When you say that you can not expect the same result, do you mean that you could write a function that will return something different for the same arguments based purely on the presence of the |
The second code example show a potential race condition that should result ideally in a compile time failure. As for a run-time failure, a silly example like the following would give unpredictable result: ispmd func lastSeenOddEven(seen uniform [2] uniform int, value int) {
seen[value % 2] = value
} |
This proposal is focused on code generation rather than language, and the details of how it changes the language are difficult to understand. Go is intended to be a simple language that it is easy to understand. Introducing complex semantics for performance reasons is not a direction that the language is going to take. Therefore, this is a likely decline. Leaving open for four weeks for final comments. |
+1 for this proposal, but i dont have my hope up for it getting in.. Its seems to be useful for so many usecases:
|
From what I can tell, some possible usages of SIMD can be done automatically as a compiler optimization. I don't know if the Go compiler does such optimizations already or not, but, if it doesn't, would it be acceptable to you to change this proposal to add them, @Bluebugs? You could probably revisit the idea of more explicit usage later once the benefits can be shown in practice in the cases covered by the automatic optimizations, though I doubt that language changes to that end will ever be accepted so it would have to be via a new standard library package or something. |
@DeedleFake I have pointed in my proposal to explanation why auto vectorisation is not an approach that as yielded anything so far. To paraphrase things, the code you write does not express enough information to allow for reliable vectorisation. As SIMD is something that people are less familiar here with. Let's take the example of thread. You wouldn't be arguing with me here that the compiler can auto add thread to the code base if that code base was not written with it in mind. We have the The problem with simd is the same, the compiler can only be very restrictive on what it can vectorize, because it can not detect all race condition and has to switch to safest possible solution. The developer has to write his code with the constraint of the simd parallelism in mind in the same way a developer write a thread safe function. That what this proposal is about. The difference between goroutine discussion and spmd/simd discussion is that most people were told in school what thread are and there is a larger culture and understanding of those. simd has been an assembly art mostly, with a bit of takeoff with GPU programming, but still a sub culture of computer science that most people do not understand and know about. This means it require additional learning for anyone to pick it. I do think this proposal would lead to the most readable and accessible syntax for writing code that generate simd instruction, but the community doesn't seem ready for something like it. I consider that adding auto vectorization to Go will be a waste of everyone time and effort. I will prefer this proposal to be closed than to lead the ecosystem into something I think is wrong. If someone want to push in that direction, they should feel free to open a proposal for that purpose. |
I am fascinated with SIMD and if this makes it easier to play and educate myself about SIMD that’s a bonus. Would allow more Golang’s devs to optimise hotspots and grow the knowledge around using SIMD too. |
I disagree with the above viewpoint (highlighted in bold). A compiler can detect all SIMD race conditions fairly easily. A major problem is that usually the set of run-time race conditions is only a very small subset of compile-time race conditions, in which case one part of possible questions reduces to whether the compiler is a JIT compiler. |
@atomsymbol oh, indeed, that's a good point. An assumption of this proposal is also that JIT is a no go. I would think that even with a JIT, you would still have the need to express somehow cross lane operation to take full advantage? |
Hey all, this is related... https://github.com/SnellerInc/sneller using avx512 to search over JSON. Lots of SIMD acceleration ot make it fast. |
I'm not sure if I understand correctly the topic, but what about to use import "os/cpu"
func Sum(a []int32) int32 {
switch {
case cpu.HasAVX():
return asmReduceAddAVX(&a[0], len(a))
case cpu.HasSSE4():
return asmReduceAddSSE4(&a[0], len(a))
default:
return noneReduceAdd(&a[0],len(a))
}
}
func asmReduceAddAVX(a *int32, size int32) int32
func asmReduceAddSSE4(a *int32, size int32) int32
func noneReduceAdd(a *int32, size int32) int32 {
var sum int32
for _, b := range unsafe.Slice(a, size) {
sum += b
}
return b
} |
@bdandy the goal is that most of the code should not be written as assembly, but in Go directly. The compiler would most of the time generate the code. I would expect that in general developer won't use this feature and will rely on module that provide the few cross lane primitive that are useful (ispc as a good list of reduce and shuffle operation to start from). The exception being for cross lane operation in this module. For those operation, I think they will be implemented in assembly, but the caller will already have done the architecture dispatching (as you need to specialize the entire ispmd context for one architecture). As the caller already did the dispatching, the selection of which function to call seems to be, to me, something that is more of a compiler directive. Still your solution could likely work too. It would require exposing and passing also the lane mask to the assembly function. The benefit is that we wouldn't need to add compiler directive. And I think os/cpu can already be implemented today without any change to any think. I also do think that ispmd make it possible to completely ignore what the underlying SIMD instruction set is. I think it is an important feature, as it get people to focus on the algorithm and code, instead of assembly. In a way, when developers moved away from assembly for scalar code a few decade ago in favour of higher level language, that what this proposal would do here for SIMD code. We have never programmed GPU in assembly, but we find it normal to write SIMD code using assembly even today. |
Ah so an assembler generator written in golang ? |
@gedw99 the go code written in the ispmd section can directly be converted to SIMD code. It is as much of an assembler generator as the go compiler is. The ispmd section give enough guarantee and hint to the compiler to safely generate SIMD code. For a full understanding of how the compiler can do that, I really recommend ready Matt post on the topic here: https://pharr.org/matt/blog/2018/04/21/ispc-volta-c-and-spmd |
Thanks ! ok read that and the story of how the compiler came to be open sourced .. good story So basically from go call this c code ? https://github.com/ispc/ispc |
This looks similar effort in golang ? |
@gedw99 yes, this proposal is totally influenced by this thesis and discussion with Matt. There are a few limitation to use ispc (mostly weird syntax and requirement on cgo). But the thing is, there is no reason for the go compiler to not be able to actually do the same things that ispc does, without calling ispc. This enable better syntax, full integration with go and no cgo. |
No change in consensus. |
Author background
I have been doing Go development for a few years now and have had a long experience before that doing C, C++ development and also have experience writing OpenGL shaders.
Proposal
This is a proposal to add the capability to do implicit Single Program Multiple Data, SPMD, computation a la ISPC. It does not have any intent to cover intrinsic, but in a way cover portable SIMD. For the rest of the proposal, I will refer to implicit Single Program Multiple Data as iSPMD.
This proposal is a Go 2 language change proposal that will require change in the go language and large change in the compiler and go tools. I do understand that the bar for such a change is high, but I do think bringing iSPMD into Go might outweigh those difficulties and that people will consider the potential improvement this would bring to the Go ecosystem: readable, maintainable, safe, reliable, portable and fast vectorized code. This proposal should make iSPMD code usable and acceptable in all of Go stdlib and accessible to the entire ecosystem.
There has been already a few proposal trying to bring some form of vectorized code in Go, none looked from the iSPMD angle at it:
#35307
#53171
Due to the high burden of maintaining assembly code and the safety risk it pose, it is very hard to get SIMD optimization in the stdlib even when there is demand as the follow issues shows:
#53178
#24499
#20206
#19636
iSPMD could make it more manageable for projects like gonum to support more architecture and not have to hand write as much assembly for each architecture. Rasterx and go-text will likely benefit from it too. Any compute intensive workload would likely be better off with iSPMD than handwritten assembly. Vector databases like Weaviate have expressed interest in seeing the situation around SIMD improved.
An additional challenge of this proposal is that no language exists, to my knowledge, with both a iSPMD and a non iSPMD context natively. Nevertheless I believe it is doable and this is the aim of this proposal.
TL;DR;
Add the notion of iSPMD context where all code operations are done in parallel when using the keyword
ispmd
. Add the new keyworduniform
to specify that a data is actually representing a single value for all parallel streams of computation in an iSPMD context. Structure and data declared outside of an iSPMD context are all by definitionuniform
implicitly. Structure and data declared inside of an iSPMD context are all by definition varying of the length of a SIMD register selected by the compiler. Add aneach
keyword when iterating to specify that each element of a gang comes from a different element in arange
, simply loading one SIMD register per step of thefor
loop instead of loading the same value in each member of the gang.Proposed syntax for iSPMD context could look like the following short example below and longer one at the end:
For a developer trying to optimize a workload, they just have to pick which unit can be processed in parallel and switch to iSPMD context for it: number, character, word, string, pixel, line of pixels, image, glyph, and so on.
Context
For discussion, I highly recommend reading all of the history of ISPC here as it is full of very useful information to understand the context and benefit of iSPMD: https://pharr.org/matt/blog/2018/04/30/ispc-all with a lot of insight on bringing this to more conventional workload.
ISPC is not the only iSPMD solution in use today. Obviously anyone that wrote a GPU shader in OpenGL, Vulkan or DirectX did write an iSPMD program. Same applies to people writing OpenCL or CUDA compute programs.
In contrast, the claims from Matt Pharr regarding the lack of reliability of auto vectorisation by compiler is still true today with no solution in sight. Auto vectorisation is still not a solution relied upon for generating efficient code. Intrinsic is used a lot, but the amount of developers writing them is still limited and there is significant risk in writing intrinsic or assembly directly.
ISPC was inspired by C, but it did introduce the keyword
launch
which does pretty much the same things as the keywordgo
in Golang. As an example: https://github.com/ispc/ispc/blob/main/examples/cpu/mandelbrot_tasks/mandelbrot_tasks.ispc will not feel very far from what a Go iSPMD program could look like. ISPC was focused on just writing a small bit of your program that needed speed, but it enable more complex piece of optimized software like this BC7 encoder: https://github.com/richgel999/bc7enc_rdo/blob/master/bc7e.ispc which would be very hard to do with any other approach .An interesting experiment from Saksham Jain and Sivaprasad Sudhir that used the Go parser to generate ISPC code from Go code for some specialized function. This was presented in there thesis final report: https://www.andrew.cmu.edu/user/sakshamj/15618/final.pdf and the result of their work can be seen here: https://github.com/sakjain92/govec . Please look at the result of their benchmark which shows the potential for iSPMD in Go. This indicates the potential for several fold speeds up on many compute tasks. This also proves that there is not necessarily the need for much change to actually get iSPMD code mixed with normal Go code and get significant performance out of it. As a readable example of what that looks like and can be compared to the previous ISPC file a mandelbrot example: https://github.com/sakjain92/govec/blob/master/blas/go/mandelbrot.go . The main drawback of this prototype is that it relies on CGo, has some weird syntax requirements to get things working and limitations in the code it generates. It does mean we couldn’t use it to improve any of the standard libraries.
There has also been attempts to make iSPMD work within C++ code with projects like: https://github.com/richgel999/CppSPMD_Fast/ or https://github.com/google/highway. It works and the syntax doesn’t feel bad, again with a mandelbrot example as it seems to be the benchmark: https://github.com/richgel999/CppSPMD_Fast/blob/master/mandelbrot_imp.h . This obviously demonstrates mixing iSPMD context and non iSPMD context, but not really with the help of the language.
Additionally for reference you can read https://www.codeproject.com/Articles/1223361/Benchmarking-NET-Core-SIMD-performance-vs-Intel-IS and https://api-depositonce.tu-berlin.de/server/api/core/bitstreams/7489a854-6557-496e-8b3e-30b47e67b0bf/content to get an idea about the performance of alternative solution and how well ISPC fare. It seems that the only benchmark (short of full intrinsic/assembly approach) that managed to generate better code is rust packed_simd+rayon approach: https://github.com/rust-lang/packed_simd/tree/master/examples/mandelbrot . There is definitively a speedup from their approach at the moment compared to ISPC output, but the code complexity seems quite higher and much less readable.
Proposal
The goal of this proposal is to enable iSPMD code in portion of a golang program that the compiler will be able to generate directly as a vectorized set of operation (see https://pharr.org/matt/blog/2018/04/21/ispc-volta-c-and-spmd for an explanation on how the compiler would do so). There should not be any dependency on CGo. It should not break existing programs as they could not be using the new keyword in a manner that would get the compiler confused. It should be easy for all Go compilers to properly handle an iSPMD section even if they do not have support for SIMD. It should give clear expectations to the developers of what is happening in a iSPMD section.
An iSPMD context in go will execute the code in parallel for each member of a gang using SIMD instruction and an execution mask to follow the execution flow of the code. It means that race conditions are possible when writing code using iSPMD and tools will need to help detect them. A classic example:
I would expect the above example to actually not compile due to a type mismatch (
uniform int32
vsvarying int32
), but it is nevertheless showing what race condition to expect.We will need to be able to tag pieces of code to indicate that they are iSPMD pieces of code. For that purpose, a new keyword
ispmd
could be used to switch into an iSPMD context (I did think of usingigo
, but I am afraid this would be confusing as this is not creating any goroutine).There is also a need to support the
for
iteration picking ateach
element in the slice independently. Basically loading a bunch of the slice in an entire SIMD register at a time. Putting all of that together, it should look like that:The problem now is that from this context if we do a function call, we must go out of the iSPMD context. A solution to that is by reusing the
ispmd
keyword, we should be able to indicate to the compiler that a function is actually ok to continue executing it as part of the iSPMD context. This would look like:When an iSPMD context calls a function, it should generate a function that matches the vector length, propagate the mask and not break out of the iSPMD context. If the function is not tagged, it should break out, iterate over the vector and go back into the iSPMD context after that (This means that when calling a non iSPMD function from an iSPMD context, due to the nature of all the code path in the iSPMD context being explored in parallel for a gang, they would look like being called in random order a bit like if they were called from multiple goroutine meaning there is potential for race condition that the developer need to be aware of). If the function is called from a non iSPMD context, the compiler could for now error out and not allow it even if we should always be able to generate a serial version as fallback.
It should be allowed to have iSPMD functions defined in interfaces.
It should be possible to provide multiple variants of the same function, but with different iSPMD architecture specified to be able to call specialized assembly functions. I could not think of different way than the following:
As shown in this example, we might need to be able to call assembly code for this proposal to express all its potential. This means that there should be a calling convention for calling assembly functions from an iSPMD context. This is mostly a detail of implementation and might vary between compilers. The main issue, here, is that it introduces a concept of overloading which doesn’t exist at this point in the Go ecosystem.
Also as stated earlier, it is left out of this proposal how to optimize the transition from iSPMD to assembly and this proposal does not address nor should block the support for intrinsic types of solution which are orthogonal to this proposal.
In terms of code generation, it will be expected that a version of the iSPMD context is generated per SIMD extension supported for the architecture targeted by the compiler along with a serial implementation to cover CPU that do not have the extension when necessary (The same logic applies on iSPMD tagged functions). If the compiler knows the SIMD extension and tries to generate code for it, it should never fail to generate the code. The compiler should generate the right code to select from the best SIMD extension for the architecture it is targeting and provide meaningful fallback depending on the host CPU capability.
It is also logical that you can not use an
ispmd
construction inside a iSPMD context (be it a function or inside a previousispmd
) as it doesn’t make sense to switch again to an iSPMD context when you are already in one.We should trust that the compiler will generate code that for each of the SIMD extensions leads to the exact same computational result, but for testability purposes we should be able to have support in Test and Benchmark to force the SIMD extension choice when running them as we might not want to tests all combination all the time. And we might want to use PGO after running through a benchmark. It would also be necessary to upgrade staticcheck to help detect data race conditions inside an iSPMD context.
Additionally while reading https://pharr.org/matt/blog/2018/04/26/ispc-volta-more-on-performance and https://ispc.github.io/perfguide.html , it seems important to address the following keywords that exist in ISPC.
Specifying instruction set and mask size at compile time: with ISPC and cpp_spmd, it is necessary to specify the instruction set targeted and the size of the mask. This is not really simple for the developers and requires work for any newly supported architecture. It would be best to avoid having to specify it in most cases. It seems possible to me that the compiler could have a pass that explore each iSPMD context and figure out the smallest varying data type processed in that iSPMD context (For function it might just actually infer it directly from the parameters type). With that information the compiler should be able to infer the size of the mask (length of SIMD register/sizeof type) and avoid having the developer specify this information. This should enable a more portable code and easier to read.
uniform
: which is a very common keyword in many GPU languages, indicates that the variable has the same value for the current iSPMD context and for all members of the gang. In the case of this proposal, any variable created outside of anispmd
block is automatically anuniform
and any variable inside is a varying. A difficulty might arise when calling a function as you can now pass a variable that has been declared outside of the iSPMD context and is an uniform to a iSPMD function that expects only varying. It is also possible to need to declare a variable before we do an assignment and this would also mean that there will be a case where a uniform variable needs to be declared before the association that would allow for type information to be propagated. This information can apply on different levels of a variable type. If in an iSPMD context you are building an array filled with the content of a variable that comes from a non iSPMD context, you need to specify that the type inside the array is also an uniform. This means that, in an iSPMD context, we need the ability to define that a type isuniform
. As an example:I do not see how it would be possible to make the type system work without the
uniform
keyword which is the reason why the entire proposal is designed with the idea of deep change to the language.It would be logical to ask the question if we do not need a
varying
to do the opposite of auniform
keyword. To define that a variable or a type coming from a iSPMD context could be accessed from a non iSPMD context or carry around by non iSPMD context or just define a useful structure type for our iSPMD context, we would need a way to define structure with iSPMD data type (ideally not just anonymous structure). As we have been using the keywordispmd
so far, we can likely continue and use that to define types that embedvarying
data. Something like that would do:In this case, if you can pass the structure around in non iSPMD context, only the id which is a
uniform
would be accessible and all the other fields wouldn’t be accessible to the non iSPMD context. For simplicity, the structure can likely only be allocated inside an iSPMD context and the nonuniform
field/varying field can only be used inside an iSPMD context.cif
: having to explore both branches of if might be quite costly especially if one is rarely walked (error handling for example). cif exists to tell ISPC to generate additional tests to avoid walking a branch that is usually not worth it (checking the content of the entire mask and jumping depending on it). This should be something equivalent to calling aif AnyTrue(test) { if test {} }
withispmd func AnyTrue(bool test) uniform bool
. Not the prettiest syntax, but as this would be possible within the existing proposal without adding any keyword, I don’t think we need at this point to have an equivalent tocif
.foreach
: this has two uses. First one it iterates on a different step for each gang. And second it does enable to generate a more optimized loop. The use ofeach range
should be directly matched to ISPC foreach construct. So I do not think we need a foreach keyword or its feature at this point.The question of how to iterate on just numbers when you do not have a slice to start from is open and I do not have a good answer at this point, nor do I know if it is really necessary in practice. A possibility would be to extend the use of each for the following:
The second foreach construct is the foreach tiled. I think we can skip it at the moment and look later on how to implement it as this might be related to things like multidimensional array/slice.
Summary
This proposal changes the language by adding 3 keyword:
ispmd
,uniform
andeach
, the ability to switch to a iSPMD context when entering a block of code specified withispmd
and continuing into a iSPMD context when calling a function and ability to call assembly from within a iSPMD context directly. This change will impact the entire ecosystem and all tooling. It should yield significant performance gain for all computation and I would expect yield performance close to what ISPC has proven was doable.The Go compiler might slow down a little bit at first, as it will need to have an additional pass and generate multiple variant of the same code, but as it does parse and encode, it should itself be able to use iSPMD context and so it should over time get faster when this proposal get used.
This is trying to still be a “minimal” proposal, but I would expect that over time new functions like the example
ReduceAdd
orAnyTrue
above would be proposed in the stdlib depending on the Go ecosystem needs.Fully understanding all the potential of the language will be harder, but at the same time it will make using Go as a teaching tool about parallelism one of the best solutions as alternatives would be more complex to approach or difficult to have access to. This could mean university potentially switching to teaching Go as part of the curriculum.
The benefit of having iSPMD directly in the language shouldn’t escape to anyone as parser, encoder, decoder, compression, compare, search, sort would be a good target to improve once the compiler has iSPMD capability. I can see a lot of functions inside the standard library that could use this, but as this is not a magical solution, the code will have to express this parallelism to actually benefit from this. This means code change and not just adding a bunch of iSPMD flags around (This is logical otherwise auto vectorisation would have succeeded by now).
Typical benchmark example
The text was updated successfully, but these errors were encountered: