proposal: unsafe: inline assembly with unsafe.Asm function #26891
Comments
If this is accepted, I think the signature should be func Asm(string) error.
@cznic Why should there be a return error value? In what cases would an error be deferred from compile time to run time?
Scratch the return value in my post, IDK what I was thinking. What I really wanted to say is that the arguments and all variations of arguments (Asm2, Asm3, ...) should be replaced by just a string. Assembler code needs more than just instructions: directives, declarations, and even comments are sometimes a must-have.
@cznic, for a single string argument, I have these questions:
Note that most of the time, one can use Go variables without having to specify registers.
This is out of scope of this proposal. Another important case is getting rid of special treatment of intrinsified functions inside the compiler.
Just use Go comments.
How will this proposal ensure that the assembly is correct at compile time rather than run time? Across architectures?
I think containment is extremely useful when dealing with platform-specific code. How does the feature benefit the maintainer of the codebase? It is easy to tell where an assembly function is called, whereas in this scenario it would be difficult to see where it is being used. I'm confused about the end goal. Would we use this inside loops, so we don't have to put the loops inside pure assembly functions? I would rather have a function that implements the loop inside of it than invoke the instructions within the loop. Are there any other advantages of doing it this way, other than convenience for the writer?
What do you mean by "assembly is correct"?
Could you clarify, please? It's possible to write a portable third-party library that provides primitives such as cross-platform SIMD operations. The advantage is that they can be inlined, which makes them more composable than pure asm alternatives (where the user always pays for the function call).
Another goal is making it possible to get rid of "intrinsics" in the compiler and implement them without so much special casing.
For context, this is where it was unclear:
If I have an assembly function that contains an invalid or unsupported instruction, and I run
Inline asm is a bad idea in my opinion. In C/C++ it leads to run-on sections like... You indicated that you can protect the code with a build tag, but that only means users of other architectures won't have access to the code at all. In truth, most of the time the inline asm will only be written for amd64, which will make for huge porting problems to other architectures. The way things are now, asm is really only used (other than in the Go runtime itself) for accelerating code that has already been written in Go. Because it's written in Go, it's portable. Inline asm will destroy the admirable portability of the Go language. It also destroys readability.
All errors happen in the same way. Suppose this is the compilation pipeline:
The After BE finishes optimizations and lowering,
Does anything prevent a user from separating the opcode from the call string by using a constant, such as:
    const myInstruction = "MOVQ"
@billotosyr you can already write asm-only functions without any Go fallback, but I don't think that happens much now.
In the prototype I've rolled, no. Any constant string will do. The intention is to provide a very minimalistic API that makes it possible to write a less error-prone, intrinsic-like library as a third-party package. For example:
    package x86

    import "unsafe"

    func Mov64(dst, src interface{}) {
        unsafe.Asm("MOVQ", dst, src)
    }
The other way is to provide named constants:
    package x86

    const Mov64 = "MOVQ"
Other benefits that came to mind:
Something like this (or really any way to make user-defined intrinsics) would be very useful, especially given how expensive Go function calls are and how the lack of inlining makes most AVX assembly slower than it could be in a tight loop.
One of your examples is
You also say that an input argument to I don't see how that can be compatible. If the
If your goal is to eliminate the need for intrinsic functions known to the compiler, then this proposal is missing a few tricks. For example, any intrinsic that maps to atomic instructions will also have to explicitly declare that it has observable side effects and that the instruction(s) must be emitted in order with respect to the surrounding code. Another contrived example is input and output register constraints (for example, variable-width shifts on x86 are restricted to using only CL for the shift counter). Practically speaking, the only way gcc and clang are able to provide inline assembly is by exposing the compiler's model of instruction constraints to the user (i.e. "this sequence clobbers %xmm0 and %xmm1 and produces an output in %1, which must be a 64-bit general-purpose register"). The compiler back-end cannot produce these constraints for you unless it is taught about every single machine instruction, so we would have to expose most or all of the SSA back-end's machine model to the user. I worry that exposing those implementation details would make it too difficult to modify the internal representation of the compiler down the road. Moreover, the SSA back-end does not know about every kind of register constraint (paired registers, sequential registers, etc.).
All of these concerns have been brought forth by various folks in previous inline assembly proposals, and I don't see them addressed here. I would prefer, instead, that this effort be focused on improving the calling convention to pass arguments in registers and use callee-save registers.
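To make the ordering point concrete with what Go already offers: the sketch below uses sync/atomic, whose operations the gc compiler treats as intrinsics (on amd64, typically a single locked instruction) while preserving their side effects and ordering. It does not use the proposed unsafe.Asm; the function name countConcurrently and the worker counts are illustrative.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// countConcurrently increments a shared counter from several goroutines.
// atomic.AddInt64 carries the ordering and side-effect guarantees that an
// inline-asm version of the same locked add would have to declare explicitly.
func countConcurrently(workers, perWorker int) int64 {
	var n int64
	var wg sync.WaitGroup
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for i := 0; i < perWorker; i++ {
				atomic.AddInt64(&n, 1)
			}
		}()
	}
	wg.Wait()
	return atomic.LoadInt64(&n)
}

func main() {
	fmt.Println(countConcurrently(8, 1000)) // 8000
}
```

Because the compiler knows these operations, no clobber or ordering annotations appear in user code.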
@philhofer I mostly agree with you, but I note that the assembler is just a set of Go packages, and in fact the assembler and the compiler share output generation (in cmd/internal/obj), so it is entirely feasible for the compiler to understand the register requirements of all the instructions that the assembler recognizes.
I prototyped a quick high-level API: github.com/dave/asm It's relatively easy to generate function stubs for all the instructions using the x86spec command in github.com/golang/arch; I expect the other architectures are similar. Perhaps if this was readily available, the signature of
Here's a (not complete) list of things gcc inline assembly is able to specify. I think any proposal should probably either include these, or explicitly state why they aren't needed.
Unlike gcc inline assembly, it sounds like you're proposing only one instruction per
All registers used in
No control flow is possible from/into the
All
All
It's safe to use a flag-generating instruction followed by a flag-consuming one.
This is why we have distinct
I'm not exactly sure how to approach this one. The idea is to minimize explicit register usage as much as possible. Instead of using
I need to re-check one thing before answering your question. The Go assembler does require some "implicit" arguments to be specified explicitly, like
Though
I want to push back on the idea that inline assembly is a feature that must be added at all. Inline assembly has enormous semantic complexity that this proposal does not adequately grapple with, although the responses here are trying to. Inline assembly also eliminates the pressure to actually produce good designs where assembly should not be necessary. The hard part about design is finding interfaces that are generally useful and work well across a wide variety of settings. Yes, we spend a lot of time on that, but the end result should be better overall. We've seen this repeatedly, with math/bits, with FMA, with 128-bit integer operations. Inline assembly also removes what has been a useful separation between Go code and assembly code. Go code is almost always portable, assembly code almost always not. Projects that place a premium on portability can have a simple rule like "no assembly files". For those willing to use assembly, the current separation makes it easy to add new assembly for new architectures (by file name build rules). If inline assembly is sprinkled into Go code then you'd have to first These are very significant costs. To what benefit? It is already possible to write assembly code. The incremental benefits of adding a second, completely different way to write assembly code seem small. If we were starting from scratch and the proposal was "write all assembly this way and don't have separate *.s files" that would be different. But that ship has sailed. The proposal makes very little case for the benefits:
The proposal changes this situation not at all. You still have to write assembly to get these speedups. And bytes.IndexByte is >100 lines of x86 assembly. Surely that's not going to turn into some kind of Go-asm hybrid? (Or if it is, the proposal should make clear why that's an improvement.)
General enough for who? For compiler writers? For writers of assembly? Maybe. But for users, actually defining useful primitives that are broadly applicable is exactly the point. That's our responsibility as language and library designers. The proposal is essentially arguing "it is too difficult to duck that responsibility." I disagree, of course: it is good that it is difficult to duck that responsibility.
Here it's worth considering exactly what class of functions benefits from inlining. We took a very long time to intrinsify Surely the proposal is not suggesting that bytes.IndexByte be inlined at each call site.
Assembly files are scalable too, and as noted above they make it far less likely that people will accidentally write x86-specific (or anything-else-specific) code.
I believe the overall burden here as a fraction of compiler work is very small. And the clutter has moved out of the compiler into Go source code. That's not an obvious win.
It really sounds again like you expect to write IndexByte in assembly. If so, please show it, so we can see why the result is an improvement.
In general, fulfilling a feature request is not an argument in favor of a proposal. Thanks.
AFAIK, there is a "safe Go" definition that requires no asm files and no unsafe.
I'd like to articulate that the main goal of this proposal (at least in my opinion) is to reduce the amount of asm in Go projects, not to increase it. I absolutely agree that having a proper API is much better, but there are hundreds of instructions with very specific use cases, and creating a reasonable API for all of them isn't feasible or desirable. So if someone absolutely needs to use them (most likely for performance reasons), they will rewrite part of their code in asm. And due to call overhead, they will need to rewrite a significant portion of their code. For example, here are the most used amd64 instructions from Go asm files (see below for methodology):
The first instruction that actually accomplishes something that pure Go cannot is MOVOU at position 5, which is an order of magnitude less popular than MOVQ at position 1. I don't think there are a lot of situations where it makes sense to write
So most asm code introduces accidental complexity and should be replaced with pure Go, reducing the maintenance burden and making it easier to code-review/audit/port to a different arch/replace with pure Go/use a new API. There are different ways to achieve this. We could introduce inline asm that can be mixed with pure Go, as this proposal suggests. We could go the gcc way and introduce a few thousand new intrinsics. Or we could switch to a register-based ABI and allow marking the clobbered registers of an asm function, reducing call overhead to a negligible level but breaking existing asm. But an order of magnitude reduction in asm lines is IMHO a desirable goal anyway.
Methodology
Obtained by running the following query against GitHub data:
There are several problems with this method:
But IMHO it provides a reasonable estimate.
As a "normal" Go programmer with near-zero knowledge of assembly programming, it seems to me that this proposal is kinda saying: "If you need super-fast code, you need to write in assembly. The compiler won't emit instructions for the latest processors." Which is kinda discomforting, because I don't know much assembly, so does that mean there is a limit beyond which I cannot write more optimized code? Will it be possible to get further discussion going on #25489, so that all the work of generating optimal instructions falls on the compiler rather than the user? Will there still be a need for this proposal if the compiler can do it on its own?
@agnivade, this is the case with the advanced C++ compilers too. Even if they do vectorize some loops, it doesn't mean they can optimize any code to the best form possible. So, even C++ programmers that usually have
Fair enough. I guess I was trying to prioritize things. If the end goal is to generate instructions for the latest processors, I would want the compiler to generate them first, and then have intrinsics to take it to the next step. But I understand that doing it in the compiler is a sufficiently big undertaking.
In my long comment above I mentioned the other things we've added to eliminate the need to write assembly:
If there are more of these, let's focus on these specific use cases and not on trying to debug an inline assembly proposal. The table from @TocarIP is interesting; if it points out other specific kinds of functionality we need to better support in the library, let's open new issues for those.
Proposal: inline assembly
Author: Iskander Sharipov
With input from Ilya Tocar.
Last updated: 9 August, 2018
Abstract
This proposal describes how an inline assembly feature can be integrated into
the Go language in a backwards-compatible way and without any syntax extensions.
Users who do not write or maintain assembly, or who are not interested in raw clock
performance, would not see any difference.
Background
Right now the only way to get high performance for a CPU-bound operation is to
write an assembly implementation using the latest instructions available (with an appropriate
run-time CPU feature switch and fallbacks to something more conservative).
Sometimes the performance advantages of the assembly version are astonishing:
for functions like bytes.IndexByte, it's an orders-of-magnitude improvement. In that comparison,
old is the portable pure Go version and new is the assembly code with AVX2.
Other cases are addressed with an increasing number of intrinsified functions.
The downside is that they pollute the compiler and speed up only a finite
set of intrinsified functions. Not a general enough solution.
The advantage of Go intrinsics is that they can be inlined, unlike
manually written assembly functions. This leads to a question: what if
there were a way to describe an ordinary Go function (hence, inlineable) that
uses machine instructions explicitly? This can address all the problems described above:
can be defined as normal functions inside Go sources.
This reduces the burden on the Go compiler maintainers.
assembly code. Also easier to maintain and test.
This proposal describes how to introduce that facility into the language without
breaking changes and as unobtrusively as possible.
Proposal
This document proposes a single new Go function, unsafe.Asm, which takes an
opcode string, a destination operand (dst), and a variadic list of source operands (args),
all described below.
This function is the low-level mechanism for Go programmers to inject
machine-dependent code directly into the function body at the unsafe.Asm call site.
For example, a single unsafe.Asm call can result in a single MOVQ $10, AX instruction.
It can be used to build a higher-level, intrinsic-like API.
The best part is that such an API can be implemented as a third-party library.
Like other arch-dependent code, unsafe.Asm calls should be protected by a build tag
or an appropriate filename suffix, like _amd64.
The unsafe package is preferable, because:
unsafe.Pointer can be useful when dealing with memory operands.
other Go packages.
unsafe.Asm arguments
opcode refers to the instruction name for the host machine.
All opcodes are in Go assembler syntax and require size suffixes.
It's also possible to pass opcode suffixes along with the instruction name.
These suffixes should be separated by a period, just like in ordinary Go asm.
dst accepts any assignable Go value, with the exception of compound expressions
like index expressions and function calls that return a pointer. One can use
temporary variables and/or address taking to overcome this limitation.
args are more permissive than dst and also accept integer and floating-point
constants for immediates, as well as more complex Go expressions that yield
a value permitted for unsafe.Asm arguments.
The permitted values include all numeric types except complex numbers.
A value must fit in a hardware register, so its size must not exceed unsafe.Sizeof(int(0)).
On 32-bit platforms, 64-bit types can't be used.
For all other values, pointers should be used.
Pointer types (including unsafe.Pointer) force memory operand interpretation.
Non-pointer types follow default Go value semantics.
Note that dst/src order follows Go conventions, not the assembly language convention:
the destination goes first, then the sources. This also helps to make the destination
parameter more distinguishable inside the unsafe.Asm signature.
As a special case, instructions that have no explicit arguments use a nil destination.
Comparison-like instructions that are usually used to update flags and do not have
an explicit destination also use a nil destination argument. See Efficient control flow for more details.
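The destination-first call order can be related to the Go assembler's destination-last syntax with a small formatter. This is a hypothetical sketch: Reg, formatAsm, and operand are illustrative names, only registers and integer immediates are handled, and sources are assumed to be emitted in call order (instructions whose Go assembler operand order differs would need special handling).

```go
package main

import (
	"fmt"
	"strings"
)

// Reg is a hypothetical register-name type for this sketch; how the
// proposal spells registers is not shown here.
type Reg string

// formatAsm renders an (opcode, dst, args...) call, written destination-first,
// as Go assembler text, which lists the destination last. A nil dst means
// there is no explicit destination (e.g. CMPQ), so only the sources are
// printed, in call order.
func formatAsm(opcode string, dst interface{}, args ...interface{}) string {
	ops := make([]string, 0, len(args)+1)
	for _, a := range args {
		ops = append(ops, operand(a))
	}
	if dst != nil {
		ops = append(ops, operand(dst))
	}
	if len(ops) == 0 {
		return opcode
	}
	return opcode + " " + strings.Join(ops, ", ")
}

// operand formats a register or an integer immediate ($-prefixed in Go asm).
func operand(v interface{}) string {
	switch v := v.(type) {
	case Reg:
		return string(v)
	case int:
		return fmt.Sprintf("$%d", v)
	default:
		panic(fmt.Sprintf("unsupported operand type %T", v))
	}
}

func main() {
	fmt.Println(formatAsm("MOVQ", Reg("AX"), 10))      // MOVQ $10, AX
	fmt.Println(formatAsm("CMPQ", nil, Reg("AX"), 10)) // CMPQ AX, $10
	fmt.Println(formatAsm("MFENCE", nil))              // MFENCE
}
```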
Guarantees
It is important to clearly describe the guarantees that a programmer may rely on.
unsafe.Asm is deterministic: these calls can't be scheduled somewhere else.
This means that a sequence of unsafe.Asm calls is executed in the order they
appear in the source code.
unsafe.Asm calls and unsafe.Asm itself is marked as a flag-clobbering operation.
Efficient control flow
There is no JMP support because the inlined assembler does not see Go labels.
In order to make writing efficient programs possible,
SSA backends can recognize this operation sequence and produce optimal code:
the SETNE can be eliminated, as well as the read of the found variable.
The generated machine code becomes close to that produced from hand-written assembly.
Error reporting
There are different kinds of programming errors that may occur during unsafe.Asm usage.
The Go compiler frontend, gc, can catch invalid opcodes and obviously
wrong operand types. For example, the JAVA opcode does not exist and will
result in a compile-time error triggered from gc. Operands
are checked using generic rules that are shared among all instructions.
Most other errors are generated by assembler backends.
For AMD64, such a backend is cmd/internal/obj/x86.
This is the direct consequence of the opaqueness of the asm ops during compilation.
That property reduces the amount of code needed to implement inline assembly,
but does delay error reporting, leading to somewhat more cryptic error messages.
In turn, this may be a good opportunity to improve assembler error reporting.
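The frontend's opcode check can be sketched as a table lookup plus suffix splitting. The opcode table here is a tiny hypothetical stand-in for the assembler's real instruction tables, and checkOpcode is an illustrative name; the VADDPD.Z example assumes Go's AVX-512 suffix syntax (the .Z zeroing suffix).

```go
package main

import (
	"fmt"
	"strings"
)

// knownOpcodes is a tiny, hypothetical stand-in for the per-architecture
// instruction tables a real frontend would consult.
var knownOpcodes = map[string]bool{
	"MOVQ": true, "ADDQ": true, "CMPQ": true, "SETNE": true, "VADDPD": true,
}

// checkOpcode splits an opcode such as "VADDPD.Z" into its name and
// period-separated suffixes, then rejects unknown names and empty suffixes.
func checkOpcode(op string) (name string, suffixes []string, err error) {
	parts := strings.Split(op, ".")
	name, suffixes = parts[0], parts[1:]
	if !knownOpcodes[name] {
		return "", nil, fmt.Errorf("unknown opcode %q", name)
	}
	for _, s := range suffixes {
		if s == "" {
			return "", nil, fmt.Errorf("empty suffix in %q", op)
		}
	}
	return name, suffixes, nil
}

func main() {
	for _, op := range []string{"MOVQ", "VADDPD.Z", "JAVA", "MOVQ."} {
		name, sfx, err := checkOpcode(op)
		fmt.Println(op, "->", name, sfx, err)
	}
}
```

Operand-level checks (which registers and addressing modes an instruction accepts) are what this kind of table cannot cover, which is why most errors fall through to the backend.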
Example
Given the intrinsified math.Trunc function, we can try to define an AMD64 version
without direct compiler support.
trunc1 and trunc2 generate the same code sequence.
The only difference is that trunc1 does the runtime.support_sse41 check,
which can be done inside trunc2 as well.
Compatibility
The API changes are fully backwards compatible.
Implementation
Most of the work would be done by the author of this proposal.
The initial implementation will include AMD64 support for unsafe.Asm code generation.
Other backends can adopt the implementation ideas to add support for the missing architectures.
Go parts that need modifications:
unsafe: new function, Asm
cmd/compile/internal/gc: unsafe.Asm typechecking and SSA generation
cmd/compile/internal/ssa: changes to regalloc plus new asm-related ops
cmd/compile/internal/amd64: code generation for unsafe.Asm-generated ops
cmd/asm/internal: parser is used to parse asm operand strings
Additional notes
The initial implementation prototype achieves 85-100% of the performance of hand-written assembly code.
There is some room for improvement, especially for memory operations, which
could bump the lower bound closer to 90-95%. The remaining performance difference is mostly
due to advanced branching tricks used in some assembly code and more efficient
code layout/register usage.
Open questions
How to express write-only destination operands to avoid extra zeroing?
Proposed solution: ?
What about gccgo and other Go implementations?
Proposed solution: we can probably start by not permitting unsafe.Asm
inside compilers that do not support it.
How to express multi-output instructions?
Proposed solution A: interpret a []interface{} argument as a multi-value destination.
Note that []interface{} causes no allocations and is consumed at compile time.
This is consistent with how unsafe.Sizeof works.
Proposed solution B: add an unsafe.Asm2 function that has 2 destination arguments.