proposal: unsafe: inline assembly with unsafe.Asm function #26891

Closed
quasilyte opened this issue Aug 9, 2018 · 30 comments

Comments

@quasilyte
Contributor

quasilyte commented Aug 9, 2018

Proposal: inline assembly

Author: Iskander Sharipov

With input from Ilya Tocar.

Last updated: 9 August, 2018

Abstract

This proposal describes how an inline assembly feature can be integrated into
the Go language in a backwards-compatible way and without any syntax extensions.

Users who do not write or maintain assembly, or who are not interested in raw clock
performance, would not see any difference.

Background

Right now the only way to get high performance for CPU-bound operation is to
write an assembly implementation using latest instructions available (with appropriate
run time CPU flags switch with fallbacks to something more conservative).

Sometimes the performance advantages of assembly version are astonishing,
for functions like bytes.IndexByte it's orders of magnitude improvement:

name            old time/op    new time/op     delta
IndexByte/32-8    32.2ns ± 0%      4.1ns ± 0%    -87.14%  (p=0.000 n=9+10)
IndexByte/4K-8    2.43µs ± 0%     0.08µs ± 2%    -96.55%  (p=0.000 n=10+10)

name            old speed      new speed       delta
IndexByte/32-8   993MB/s ± 0%   7724MB/s ± 0%   +677.74%  (p=0.000 n=9+9)
IndexByte/4K-8  1.68GB/s ± 0%  48.80GB/s ± 2%  +2801.13%  (p=0.000 n=10+10)

The old is portable pure Go version and new is assembly code with AVX2.

Other cases are addressed with increasing amount of intrinsified functions.
The downside is that they pollute the compiler and speedup only a finite
set of intrinsified functions. Not a general enough solution.

When referring to intrinsics, functions like math.Sqrt are implied.

The advantage of Go intrinsics is that they can be inlined, unlike
manually written assembly functions. This leads to conclusion: what if
there was a way to describe ordinary Go function (hence, inlineable) that
does use machine instructions explicitly? This can address all problems described above:

  • It's scalable. Users may define their own intrinsics if they really need to.
  • No need to clutter the compiler internals with intrinsic definitions, they
    can be defined as a normal functions inside Go sources.
    This reduces the burden from the Go compiler maintainers.
  • Writing these functions is less error-prone than writing hundreds lines of
    assembly code. Also easier to maintain and test.
  • It fulfills inlineable assembler feature requests such as issue17373 and issue4978.

This proposal describes how to introduce that facility into the language without
breaking changes and as unobtrusively as possible.

Proposal

This document proposes a single new Go function, unsafe.Asm defined as:

func Asm(opcode string, dst interface{}, args ...interface{})

This function is the low level mechanism for Go programmers to inject
machine-dependent code right into the function body at the unsafe.Asm call site.

For example, this line of code results in a single MOVQ $10, AX instruction:

unsafe.Asm("MOVQ", "AX", 10)

It can be used to build more high-level, intrinsic-like API.
The best part is that it can be implemented as a third-party library.

Like other arch-dependent code, unsafe.Asm should be protected by a build
tag or appropriate filename suffix, like _amd64.

The unsafe package is preferable, because:

  1. Inline assembly, just like normal assembly, is unsafe.
  2. unsafe.Pointer can be useful when dealing with memory operands.
  3. It does explicitly state that it may not be as backwards-compatible as
    other Go packages.

unsafe.Asm arguments

opcode refers to the instruction name for the host machine.
All opcodes are in Go assembler syntax and require size suffixes.
It's also possible to pass opcode suffixes along with instruction name.
These suffixes should be separated by period, just like in ordinary Go asm.

dst accepts any assignable Go value, with the exception of compound expressions
such as index expressions and function calls that return a pointer. One can use
temporary variables and/or address taking to overcome this limitation.

args are more permissive than dst and also accept integer and floating-point
constants for immediates as well as more complex Go expressions that yield
a value that is permitted for unsafe.Asm arguments.

The permitted values include all numeric types sans complex numbers.
A value must fit in a hardware register, so the size limit matches the size of int.
On 32-bit platforms, 64-bit types can't be used.
For all other values, pointers should be used.

Pointer types (including unsafe.Pointer) force memory operand interpretation.
Non-pointer types follow default Go value semantics.

var x int64
unsafe.Asm("MOVQ", x, 10)  // MOVQ x(SP), AX; MOVQ $10, AX
unsafe.Asm("MOVQ", &x, 10) // LEAQ x(SP), AX; MOVQ $10, (AX)

Note that dst/src order follows Go conventions, not assembly language convention:
destination goes first, then sources. This also helps to make destination
parameter more distinguishable inside unsafe.Asm signature.

As a special case, instructions that have no explicit arguments use nil destination:

unsafe.Asm("SFENCE", nil)

Comparison-like instructions that are usually used to update flags and do not have
an explicit destination also use a nil destination argument:

// Compare `x` with 1; updates flags.
unsafe.Asm("CMPQ", nil, 1, x)

See Efficient control flow for more details.

Guarantees

It is important to clearly describe the guarantees that a programmer may rely on.

  • The order of unsafe.Asm calls is deterministic;
    these calls can't be scheduled somewhere else.
    This means that a sequence of unsafe.Asm calls is executed in the order they
    appear in the source code.
  • CPU flags are preserved between unsafe.Asm calls and unsafe.Asm itself
    is marked as flag clobbering operation.
  • Explicitly allocated registers are not clobbered by the Go compiler.

Efficient control flow

There is no JMP support because inlined assembler does not see Go labels.

In order to make writing efficient programs possible,
SSA backends can recognize this operation sequence and produce optimal code:

var found bool                        // 1. Some bool variable.
unsafe.Asm("VPTEST", nil, "Y3", "Y3") // 2. Some flag-generating operation.
unsafe.Asm("SETNE", found)            // 3. Flags assignment to bool variable.
if found {                            // 4. Branching using that bool variable.
	// Body to be executed (hint: can use goto to Go label here).
}

SETNE can be eliminated as well as found variable read.
Generated machine code becomes close to one that is produced out of hand-written assembly.
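As an illustration only, the flag-to-bool pattern above can be modelled in pure Go. The ymm type and the vptestSetNE function are hypothetical names introduced for this sketch; they describe what the VPTEST/SETNE pair computes, not how the compiler would lower it.

```go
package main

import "fmt"

// ymm models a 256-bit vector register as four 64-bit lanes.
// (Hypothetical type used only for this illustration.)
type ymm [4]uint64

// vptestSetNE models `VPTEST v, v` followed by `SETNE found`:
// VPTEST sets ZF when (v AND v) is all zeros, and SETNE stores !ZF.
func vptestSetNE(v ymm) (found bool) {
	zf := ((v[0] & v[0]) | (v[1] & v[1]) | (v[2] & v[2]) | (v[3] & v[3])) == 0
	return !zf
}

func main() {
	for _, v := range []ymm{{}, {0, 0, 1, 0}} {
		// The bool produced from the flags drives an ordinary Go branch.
		if found := vptestSetNE(v); found {
			fmt.Println("non-zero lane found")
		} else {
			fmt.Println("all lanes zero")
		}
	}
}
```

In this model the bool is just the negated zero flag, which is why a backend that recognizes the sequence could branch on the flag directly and drop the intermediate variable.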

Error reporting

There are different kinds of programming errors that may occur during
unsafe.Asm usage.

Go compiler frontend, gc, can catch invalid opcodes and obviously
wrong operand types. For example, JAVA opcode does not exist and will
result in compile-time error triggered from gc. Operands
are checked using generic rules that are shared among all instructions.

Most other errors are generated by assembler backends.
For AMD64 such backend is cmd/internal/obj/x86.

This is the direct consequence of opaqueness of the asm ops during compilation.
That property reduces the amount of code needed to implement inline assembly,
but does delay error reporting, leading to somewhat more cryptic error messages.
In turn, this may be a good opportunity to improve assembler error reporting.

Example

Given the intrinsified math.Trunc function, we can try to define an AMD64 version
without direct compiler support.

package example

import (
	"math"
	"unsafe"
)

func trunc1(x float64) float64 {
	return math.Trunc(x)
}

func trunc2(x float64) float64 {
	unsafe.Asm("ROUNDSD", x, 3, x)
	return x
}

trunc1 and trunc2 generate the same code sequence:

MOVSD	x(SP), X0
ROUNDSD	$3, X0, X0
MOVSD	X0, ret+(SP)

The only difference is that trunc1 does runtime.support_sse41 check
which can be done inside trunc2 as well.

Compatibility

The API changes are fully backwards compatible.

Implementation

Most of the work would be done by the author of this proposal.

Initial implementation will include AMD64 support for unsafe.Asm code generation.

Other backends can adopt the same implementation ideas to add support for the missing architectures.

Go parts that need modifications:

  • unsafe: new function, Asm
  • cmd/compile/internal/gc: unsafe.Asm typechecking and SSA generation
  • cmd/compile/internal/ssa: changes to regalloc plus new asm-related ops
  • cmd/compile/internal/amd64: code generation for unsafe.Asm-generated ops
  • cmd/asm/internal: parser is used to parse asm operand strings

Additional notes

Initial implementation prototype gives 85-100% of hand-written assembly code performance.
There is some room for improvements, especially for the memory operations, which
can bump lower bound closer to 90-95%. The remaining performance difference is mostly
due to advanced branching tricks used in some assembly code and more efficient
code layout/registers usage.

Open questions

How to express write-only destination operands to avoid extra zeroing?

Proposed solution: ?

What about gccgo and other Go implementations?

Proposed solution: we can probably start by not permitting unsafe.Asm inside compilers that do not support it.

How to express multi-output instructions?

Proposed solution A: interpret []interface{} argument as a multi-value destination.

var quo, rem uint8
// Note that IDIV expects first operand to be in AX.
unsafe.Asm("MOVB", "AX", uint8(x))
unsafe.Asm("IDIV", []interface{}{quo, rem}, uint8(y))
// AL is moved to quo.
// AH is moved to rem.

Note that []interface{} causes no allocations and is consumed during the compile time.

This is consistent with how unsafe.Sizeof works.

Proposed solution B: add unsafe.Asm2 function that has 2 destination arguments.

func Asm2(opcode string, dst1, dst2 interface{}, args ...interface{})
@gopherbot gopherbot added this to the Proposal milestone Aug 9, 2018
@cznic
Contributor

cznic commented Aug 9, 2018

If considered to be accepted, I think the signature should be

func Asm(string) error

@ghost

ghost commented Aug 9, 2018

@cznic Why should there be a return error value? In what cases would an error be deferred from compile time to run time?

@cznic
Contributor

cznic commented Aug 9, 2018

Scratch the return value in my post, IDK what I was thinking. What I really wanted to say is that the arguments and all variations of arguments (Asm2, Asm3, ...) should be replaced by just a string. There are more things that are needed in assembler code than just instructions. For example directives, declarations and even comments are sometimes a must have.

@quasilyte
Contributor Author

quasilyte commented Aug 9, 2018

@cznic for single string argument, I have these questions:

  1. How to determine dst argument? There can be 0, 1 or more of them. Without this info, it's impossible to model data flow properly in SSA regalloc.
  2. How to pass Go value into that code? I mean something like this: unsafe.Asm("LEAQ", "AX", &a[0]).

Note that most of the time, one can use Go variables without having to specify registers.
The only notable exception is vector registers like X/Y/Z on AMD64. The programmer has to use them directly. For scalars and pointers, there is no need to spell registers by names; regalloc will do that for you.

@quasilyte
Contributor Author

quasilyte commented Aug 9, 2018

There are more things that are needed in assembler code than just instructions. For example directives, declarations and even comments are sometimes a must have.

This is out of scope of this proposal.
At least this was my initial goal: make it possible to use SIMD inside Go loops without having to write whole function in asm.

Another important case is getting rid of special treatment of intrinsified functions inside the compiler.

and even comments are sometimes a must have.

Just use Go comments.
Single unsafe.Asm encodes single instruction.

@as
Contributor

as commented Aug 9, 2018

How will this proposal ensure that the assembly is correct at compile time rather than run time? Across architectures?

make it possible to use SIMD inside Go loops without having to write whole function in asm.

I think containment is extremely useful when dealing with platform-specific code. How does the feature benefit the maintainer of the codebase? It is easy to tell where an assembly function is called, whereas in this scenario it would be difficult to see where it is being used.

I'm confused about the end goal. We would use this inside of loops, so we don't have to use them inside pure assembly functions? I would rather have a function that implements the loop inside of it rather than invoke the instructions within the loop. Are there any other advantages of doing it this way other than convenience for the writer?

@quasilyte
Contributor Author

quasilyte commented Aug 9, 2018

How will this proposal ensure that the assembly is correct at compile time rather than run time?

What do you mean by "assembly is correct"?
If you mean correct as in assembly code, just "assembles correctly", then it's the asm backend responsibility. The unsafe.Asm produces SSA value that is turned into matching obj.Prog object after optimization passes. These are handled by the asm backend as usual.

Across architectures?

Could you clarify, please?
The unsafe.Asm is as portable as normal asm (read: not portable at all). If one wants several implementations inside one loop, it's still possible to wrap SIMD instruction calls in a function (that function will be inlined, so no performance penalties there).

It's possible to write a portable third-party library that provides primitives such as cross-platform SIMD operations. The advantage is that they can be inlineable, which makes them more composable than pure asm alternatives (where the user always pays for the function call).

Are there any other advantages of doing it this way other than convenience for the writer?

Making it possible to get rid of "intrinsics" from the compiler and make it possible to implement them without so much special casing.

@as
Contributor

as commented Aug 9, 2018

What do you mean by "assembly is correct"?

For context, this is where it was unclear:

This function is the low level mechanism for Go programmers to inject machine-dependent code right into the function body at the unsafe.Asm call site.

If I have an assembly function that contains an invalid or unsupported instruction, and I run go build. I will get an error and no binary will be produced. If the same scenario occurs in this proposal, what will happen when the user runs go build?

@billotosyr

Inline asm is a bad idea in my opinion. In C/C++ it leads to run-on sections like..
#elif defined(i386)
asm ...
#elif defined(x86_64) || defined(amd64)
asm ...
#elif defined(powerpc) || defined(ppc)
asm ...
#elif defined(s390x)
asm ...
#elif defined(sparc)
asm ...
#elif defined(ia64)
asm ...

You indicated that you can protect the code with a build tag, but that only means users of other architectures won't have access to the code at all. In truth, most of the time the inline asm will only be written for amd64, which will make for huge porting problems to other architectures.

The way things are now, asm is really only used (other than in the go runtime itself) for accelerating code that has already been written in Go. Because it's written in Go it's portable. Inline asm will destroy the admirable portability of the Go language.

It also destroys readability.

@quasilyte
Contributor Author

If I have an assembly function that contains an invalid or unsupported instruction, and I run go build. I will get an error and no binary will be produced. If the same scenario occurs in this proposal, what will happen when the user runs go build?

All errors happen in the same way, unsafe.Asm("FOO", nil) results in invalid instruction during go build. Same for invalid arguments.

Suppose this is the compilation pipeline:

compiler FE -> compiler BE -> assembler

The unsafe.Asm is replaced with OpAsm SSA value during the FE->BE transition (gc/ssa.go),
this catches invalid opcodes.

After BE finishes optimizations and lowering, BE->assembler transformation produces obj.Prog lists, these are then verified by the asm backends. This catches all other errors like invalid arguments combinations, etc.

@as
Contributor

as commented Aug 9, 2018

Does anything prevent a user from separating the opcode from the call string by using a constant, such as:

const myInstruction = "MOVQ"

@TocarIP
Contributor

TocarIP commented Aug 9, 2018

@billotosyr you already can write asm-only function without any go fallback, but I don't think this happens now.

@quasilyte
Contributor Author

quasilyte commented Aug 9, 2018

Does anything prevent a user from separating the opcode from the call string by using a constant, such as

In the prototype I've rolled, no. Any constant string will do.
I believe this property does not make things worse.

The intention is to provide a very minimalistic API that makes it possible to write a less error-prone intrinsic-like library as a third-party package. For MOVQ, we can have these signatures:

package x86
func Mov64(dst, src interface{}) {
  unsafe.Asm("MOVQ", dst, src)
}

The other way is to provide named constants in github.com/foobar/x86 package:

package x86
const Mov64 = "MOVQ"

Other benefits that came to my mind:

  • It's easier to do static code analysis inside Go code. unsafe.Asm has quite straightforward signature and can be verified for semantics with tools like staticcheck.
  • We could implement auto loop vectorization with this, using code generation. Without gc compiler support, that is.

@docmerlin

docmerlin commented Aug 9, 2018

Something like this (or really any way to make user defined intrinsics) would be very useful, especially with how expensive golang function calls are, and the lack of inlining making most AVX assembly slower than it could be in a tight loop.

@ianlancetaylor
Member

One of your examples is

unsafe.Asm("MOVB", "AX", uint8(x))
unsafe.Asm("IDIV", []interface{}{quo, rem}, uint8(y))

You also say that an input argument to unsafe.Asm can be a complex Go expression.

I don't see how that can be compatible. If the IDIV can take a complex Go expression as an input, then the compiler will have to generate code to compute that input. That code may well use AX, thereby clobbering the value stored by the MOVB.

@philhofer
Contributor

If your goal is to eliminate the need for intrinsic functions known to the compiler, then this proposal is missing a few tricks.

For example, any intrinsic that maps to atomic instructions will also have to explicitly declare that it has observable side-effects and that the instruction(s) must be emitted in order with respect to the surrounding code.
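
As a point of reference, this is how such operations are written in Go today: the sync/atomic functions carry the side-effect and ordering guarantees themselves, so the caller declares nothing. The helper name countConcurrently is an assumption made for this sketch.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// countConcurrently increments a shared counter from several goroutines.
// Each atomic.AddInt64 is an observable side effect that the compiler
// must neither drop nor reorder with respect to surrounding atomics.
// (Hypothetical helper used only for this illustration.)
func countConcurrently(workers, perWorker int) int64 {
	var n int64
	var wg sync.WaitGroup
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for i := 0; i < perWorker; i++ {
				atomic.AddInt64(&n, 1)
			}
		}()
	}
	wg.Wait()
	return atomic.LoadInt64(&n)
}

func main() {
	fmt.Println(countConcurrently(8, 1000)) // 8000
}
```

A user-defined intrinsic built on unsafe.Asm would need some way to convey the same guarantees to the compiler, which is the gap being pointed out here.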

Another contrived example is input and output register constraints (for example, variable-width shifts on x86 are restricted to using only CL for the shift counter).

Practically speaking, the only way gcc and clang are able to provide inline assembly is by exposing the compiler's model of instruction constraints to the user (i.e. "this sequence clobbers %xmm0 and %xmm1 and produces an output in %1, which must be a 64-bit general purpose register"). The compiler back-end cannot produce these constraints for you unless it is taught about every single machine instruction, so we would have to expose most or all of the SSA back-end's machine model to the user. I worry that exposing those implementation details would make it too difficult to modify the internal representation of the compiler down the road. Moreover, the SSA back-end does not know about every kind of register constraint (paired registers, sequential registers, etc.). All of these concerns have been brought forth by various folks in previous inline assembly proposals, and I don't see them addressed here.

I would prefer, instead, that this effort be focused on improving the calling convention to pass arguments in registers and use callee-save registers.

@ianlancetaylor
Member

@philhofer I mostly agree with you, but I note that the assembler is just a set of Go packages, and in fact the assembler and the compiler share output generation (in cmd/internal/obj), so it is entirely feasible for the compiler to understand the register requirements of all the instructions that the assembler recognizes.

@docmerlin

docmerlin commented Aug 10, 2018

I would prefer, instead, that this effort be focused on improving the calling convention to pass arguments in registers and use callee-save registers.

  • I agree that this would be a good thing; however, wouldn't that be a breaking change and thus slated for go2, not go1? I mean, wouldn't it break all the ASM out there?

@dave
Contributor

dave commented Aug 10, 2018

I prototyped a quick high-level API: github.com/dave/asm

It's relatively easy to generate function stubs for all the instructions using the x86spec command in github.com/golang/arch - I expect the other architectures are similar.

Perhaps if this was readily available, the signature of unsafe.Asm could be more low-level and just take a ...interface{} instead of having special cases for the destination?

@randall77
Contributor

Here's a (not complete) list of things gcc inline assembly is able to specify. I think any proposal should probably either include these, or explicitly state why they aren't needed.

  1. Clobbers - list registers this instruction clobbers.
  2. GotoLabels - List of labels the assembly may jump to. I can see punting on this one.
  3. Volatile - don't reorder memory ops around this asm. Good for memory barriers.
  4. In-out operands. How would we do .Asm("ADDQ $5", dst, src) and require dst==src?
  5. Flags - how are condition flag inputs/outputs handled? How do we ensure they are preserved between Asm calls?
  6. Loads & Stores - how do you tell the compiler that the instruction is a load or store, and needs to be ordered with surrounding code (related to item 3)
  7. Scratch registers

Unlike gcc inline assembly, it sounds like you're proposing only one instruction per Asm call. That leaves a lot of control you may want at the mercy of the compiler. For instance, how would you do a LL/SC loop? Branch prediction hints?
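
For context, retry loops of this kind are written in Go today against sync/atomic, and the choice between an LL/SC pair and a single compare-and-swap instruction is left to the toolchain for each architecture. The sketch below is such a loop; storeMax is a hypothetical helper, not something from the proposal.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// storeMax raises *addr to v if v is larger. It retries until the
// compare-and-swap succeeds or another writer has stored a value >= v.
// (Hypothetical helper used only for this illustration.)
func storeMax(addr *int64, v int64) {
	for {
		old := atomic.LoadInt64(addr)
		if v <= old {
			return
		}
		if atomic.CompareAndSwapInt64(addr, old, v) {
			return
		}
	}
}

func main() {
	var m int64
	for _, v := range []int64{3, 9, 4} {
		storeMax(&m, v)
	}
	fmt.Println(m) // 9
}
```

With one instruction per Asm call, the load, the conditional store and the retry branch would be separate calls joined by Go control flow, so the compiler would be free to place code between them.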

@quasilyte
Contributor Author

  1. Clobbers - list registers this instruction clobbers.

All registers used in dst are marked as clobbered.
One can't express a non-clobbering dst at the moment, see Open questions.
This affects the performance slightly, but is not a fatal flaw that can't be solved.

  2. GotoLabels - List of labels the assembly may jump to

No control flow is possible from/into the unsafe.Asm. One can use bools to do loops, if branches and gotos from the Go code.

  3. Volatile - don't reorder memory ops around this asm.

All unsafe.Asm are "memory" (in SSA terms), so their order is preserved.
Is that enough or could you elaborate on this one?

  4. In-out operands. How would we do .Asm("ADDQ $5", dst, src) and require dst==src?

All dst are marked as clobbered (and also as inputs in the regalloc), so it's safe to encode in-out operands. There is an opposite problem, see (1).

  5. Flags - how are condition flag inputs/outputs handled? How do we ensure they are preserved between Asm calls?

It's safe to use a flag-generating instruction followed by a flag-consuming one.
If the flag of interest is saved into a bool variable, it can be used afterwards, even if there are flag-clobbering instructions along the way. For the performance-sensitive use cases, like branching inside a loop, there is a special case that can be optimized that is described in the proposal.

  6. Loads & Stores - how do you tell the compiler that the instruction is a load or store, and needs to be ordered with surrounding code

This is why we have a distinct dst argument; if it's a memory operand, it is a store.

  7. Scratch registers.

I'm not exactly sure how to approach this one.

The idea is to minimize explicit registers usage as much as possible. Instead of using AX one should use local int or uint (or whatever) variable. The exception is vector registers, and their dependencies (whether they are clobbered, etc.) are tracked inside regalloc phase. The aliasing of X/Y/Z registers is also handled as an arch-specific behavior for AMD64.

@quasilyte
Contributor Author

I don't see how that can be compatible. If the IDIV can take a complex Go expression as an input, then the compiler will have to generate code to compute that input. That code may well use AX, thereby clobbering the value stored by the MOVB.

I need to re-check one thing before answering your question.
Will post back after that.

Go assembler does require some "implicit" arguments to be specified explicitly, like ST0 for most x87 instructions. It would be great if it was this way for IDIV also, so we can encode instruction input dependencies in a more convenient way; but this is not the case, unfortunately.

@quasilyte
Contributor Author

It's relatively easy to generate function stubs for all the instructions using the x86spec command in github.com/golang/arch - I expect the other architectures are similar.

Though x86spec is quite broken and misses a lot of instructions.
It's better to use XED or fix x86spec before doing what you're describing (see CL104496).

@rsc rsc changed the title proposal: inline assembly with unsafe.Asm function proposal: unsafe: inline assembly with unsafe.Asm function Aug 13, 2018
@rsc
Contributor

rsc commented Aug 13, 2018

I want to push back on the idea that inline assembly is a feature that must be added at all.

Inline assembly has enormous semantic complexity that this proposal does not adequately grapple with, although the responses here are trying to.

Inline assembly also eliminates the pressure to actually produce good designs where assembly should not be necessary. The hard part about design is finding interfaces that are generally useful and work well across a wide variety of settings. Yes, we spend a lot of time on that, but the end result should be better overall. We've seen this repeatedly, with math/bits, with FMA, with 128-bit integer operations.

Inline assembly also removes what has been a useful separation between Go code and assembly code. Go code is almost always portable, assembly code almost always not. Projects that place a premium on portability can have a simple rule like "no assembly files". For those willing to use assembly, the current separation makes it easy to add new assembly for new architectures (by file name build rules). If inline assembly is sprinkled into Go code then you'd have to first mv x.go x_386.go and then arrange something for all the other architectures as well. This may well lead to more duplicated Go code than there was assembly code before. The current split also strongly encourages writing a pure Go version at the same time as the first assembly version, and that pure Go version helps keep code analyzable too.

These are very significant costs. To what benefit? It is already possible to write assembly code. The incremental benefits of adding a second, completely different way to write assembly code seem small. If we were starting from scratch and the proposal was "write all assembly this way and don't have separate *.s files" that would be different. But that ship has sailed.

The proposal makes very little case for the benefits:

Right now the only way to get high performance for CPU-bound operation is to
write an assembly implementation using latest instructions available (with appropriate
run time CPU flags switch with fallbacks to something more conservative).

Sometimes the performance advantages of assembly version are astonishing,
for functions like bytes.IndexByte it's orders of magnitude improvement:

name            old time/op    new time/op     delta
IndexByte/32-8    32.2ns ± 0%      4.1ns ± 0%    -87.14%  (p=0.000 n=9+10)
IndexByte/4K-8    2.43µs ± 0%     0.08µs ± 2%    -96.55%  (p=0.000 n=10+10)

name            old speed      new speed       delta
IndexByte/32-8   993MB/s ± 0%   7724MB/s ± 0%   +677.74%  (p=0.000 n=9+9)
IndexByte/4K-8  1.68GB/s ± 0%  48.80GB/s ± 2%  +2801.13%  (p=0.000 n=10+10)

The old is portable pure Go version and new is assembly code with AVX2.

The proposal changes this situation not at all. You still have to write assembly to get these speedups. And bytes.IndexByte is >100 lines of x86 assembly. Surely that's not going to turn into some kind of Go-asm hybrid? (Or if it is, the proposal should make clear why that's an improvement.)

Other cases are addressed with increasing amount of intrinsified functions.
The downside is that they pollute the compiler and speedup only a finite
set of intrinsified functions. Not a general enough solution.

General enough for who? For compiler writers? For writers of assembly? Maybe. But for users, actually defining useful primitives that are broadly applicable is exactly the point. That's our responsibility as language and library designers. The proposal is essentially arguing "it is too difficult to duck that responsibility." I disagree, of course: it is good that it is difficult to duck that responsibility.

When referring to intrinsics, functions like math.Sqrt are implied.

The advantage of Go intrinsics is that they can be inlined, unlike
manually written assembly functions.

Here it's worth considering exactly what class of functions benefits from inlining. We took a very long time to intrinsify math.Sqrt precisely because for a very long time we had no benchmark showing that inlining it would help at all. It seems to me that very small operations are the ones that will be worth inlining. But at the same time there should not be a huge number of these, and so we should actually expect to spend time making them work well at the library level, knowing that people who need those instructions today can already write assembly code.

Surely the proposal is not suggesting that bytes.IndexByte be inlined at each call site.

This leads to conclusion: what if
there was a way to describe ordinary Go function (hence, inlineable) that
does use machine instructions explicitly? This can address all problems described above:

  • It's scalable. Users may define their own intrinsics if they really need to.

Assembly files are scalable too, and as noted above they make it far less likely that people will accidentally write x86-specific (or anything-else-specific) code.

  • No need to clutter the compiler internals with intrinsic definitions, they
    can be defined as a normal functions inside Go sources.
    This reduces the burden from the Go compiler maintainers.

I believe the overall burden here as a fraction of compiler work is very small. And the clutter has moved out of the compiler into Go source code. That's not an obvious win.

  • Writing these functions is less error-prone than writing hundreds lines of
    assembly code. Also easier to maintain and test.

It really sounds again like you expect to write IndexByte in assembly. If so, please show it, so we can see why the result is an improvement.

In general fulfilling a feature request is not an argument in favor of a proposal.

Thanks.
Russ

@quasilyte
Contributor Author

Inline assembly also removes what has been a useful separation between Go code and assembly code. Go code is almost always portable, assembly code almost always not. Projects that place a premium on portability can have a simple rule like "no assembly files".

AFAIK, there is a "safe Go" definition that requires no asm files and no unsafe.
So safe Go would still be pure and portable.
But I got your points, thanks for the detailed response.

@TocarIP
Contributor

TocarIP commented Aug 23, 2018

I'd like to articulate that the main goal of this proposal (at least in my opinion) is to reduce the amount of asm in Go projects, not to increase it.

I absolutely agree that having a proper API is much better, but there are hundreds of instructions with very specific use cases, and creating a reasonable API for all of them isn't feasible or desirable. So if someone absolutely needs to use them (most likely for performance reasons), they will rewrite part of their code in asm. And due to call overhead they will need to rewrite a significant portion of their code. For example, here are the most used amd64 instructions from Go asm files (see below for methodology):

Instruction    Count
MOVQ          785002
MOVL          221292
ADDQ          179534
RET           134262
MOVOU          88749
SHA            73673
JMP            73642
SYSCALL        59480
CALL           57401
MULQ           50406
MOVO           50309
CMPQ           50002
ADCQ           48290
PXOR           44407
ROUND          40942
ANDQ           37361
SUBQ           34873
AESENC         34559
LEAQ           31744
MOVUPS         26528

The first instruction that actually accomplishes something pure Go cannot is MOVOU at position 5, which is an order of magnitude less popular than MOVQ at position 1. I don't think there are many situations where it makes sense to write ADDQ AX, BX instead of result += i, or to move scalar values manually via MOVQ.
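To make the ADDQ versus result += i point concrete, here is a minimal sketch of a scalar loop in pure Go. The function name sum is only an illustration. The compiler chooses the moves and adds on its own, and the generated code can be inspected with go build -gcflags=-S.

```go
package main

import "fmt"

// sum adds the elements of xs using plain Go.
// No hand-written MOVQ or ADDQ is needed; instruction
// selection for the loop body is left to the compiler.
func sum(xs []int64) int64 {
	var result int64
	for _, x := range xs {
		result += x
	}
	return result
}

func main() {
	fmt.Println(sum([]int64{1, 2, 3}))
}
```

Code like this stays portable across architectures, which is the property hand-written scalar assembly gives up.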

So most asm code introduces accidental complexity and should be replaced with pure Go, reducing the maintenance burden and making the code easier to review, audit, port to a different arch, replace with pure Go, or move to a new API.

Now there are different ways to achieve this. We could introduce inline asm that can be mixed with pure Go, as this proposal suggests. We may go the gcc way and introduce a few thousand new intrinsics. Or we may switch to a register-based ABI and allow marking the clobbered registers of an asm function, reducing call overhead to negligible levels but breaking existing asm. But an order of magnitude reduction in asm lines is IMHO a desirable goal anyway.

Methodology

Obtained by running the following query against GitHub data:

SELECT  REGEXP_EXTRACT(line, r'\s*([A-Z]+)') insn,
        COUNT(*) count
FROM
(
  SELECT SPLIT(cont,'\n') line,
  FROM
    (
      SELECT cont.content AS cont
      FROM
        [bigquery-public-data:github_repos.contents] AS cont
      JOIN
        (
        SELECT ID  AS fid
          FROM
            [bigquery-public-data:github_repos.files] AS file
           WHERE path LIKE '%_amd64.s'
              AND path NOT LIKE '%vendor/%'
         ) ff
      ON
        cont.ID == ff.fid
    )
)
GROUP BY
  1
ORDER BY
  count DESC
LIMIT
  500;

There are several problems with this method:

  • The dataset is ~2 years old
  • It will miss amd64 asm files that use explicit build tags instead of the _amd64.s name
  • It will falsely count non-Go asm code from files ending in _amd64.s
  • Only the first instruction on each line is taken into account, so in something like ADDQ AX,BX;ADDQ CX,DX the second ADDQ won't count

But IMHO it provides a reasonable estimate.
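The extraction step of the query can be reproduced in Go as a rough sketch. The pattern is the one passed to REGEXP_EXTRACT above. The helper name firstMnemonic and the sample lines are made up for illustration. It shows the last limitation in the list, and also that the pattern stops at the first non-letter, which is presumably how entries such as SHA end up in the table.

```go
package main

import (
	"fmt"
	"regexp"
)

// insnRE mirrors the pattern used in the query: optional
// whitespace followed by a run of capital ASCII letters.
var insnRE = regexp.MustCompile(`\s*([A-Z]+)`)

// firstMnemonic returns the first run of capital letters
// found on the line, or "" if the line contains none.
func firstMnemonic(line string) string {
	m := insnRE.FindStringSubmatch(line)
	if m == nil {
		return ""
	}
	return m[1]
}

func main() {
	// Hypothetical sample lines, not taken from the dataset.
	lines := []string{
		"\tMOVQ AX, BX",
		"\tADDQ AX, BX; ADDQ CX, DX", // second ADDQ is not counted
		"\tSHA256MSG1 X1, X2",        // digits cut the match short
		"loop:",                      // no capitals, no match
	}
	counts := map[string]int{}
	for _, l := range lines {
		if insn := firstMnemonic(l); insn != "" {
			counts[insn]++
		}
	}
	fmt.Println(counts)
}
```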

@agnivade
Contributor

As a "normal" Go programmer with near-zero knowledge of assembly programming, it seems to me that this proposal is kinda saying: "If you need super-fast code, you need to write in assembly. The compiler won't emit instructions for the latest processors". That is kinda discomforting because I don't know much assembly, so does that mean there is a limit beyond which I cannot write more optimized code?

Will it be possible to get further discussion going on #25489? That way all the work of generating optimal instructions falls on the compiler rather than the user. Will there still be a need for this proposal if the compiler can do it on its own?

@quasilyte
Contributor Author

@agnivade, this is the case with the advanced C++ compilers too.

Even if they do vectorize some loops, it doesn't mean they can optimize any code to the best form possible. So even C++ programmers, who usually have -march=native and more freedom in compile-time optimizations (no hard time constraints), have to rely on intrinsics or asm if the optimizer fails for whatever reason. It is also unlikely that the gc compiler will get close to the number of optimizations gcc or clang perform, for design reasons.

@agnivade
Contributor

Fair enough. I guess I was trying to prioritize things. If the end-goal is to generate instructions for latest processors, I would want the compiler to generate it first. And then have intrinsics to take it to the next step. But I understand that doing it from the compiler is a sufficiently big undertaking.

@rsc
Contributor

rsc commented Sep 19, 2018

In my long comment above I mentioned the other things we've added to eliminate the need to write assembly:

The hard part about design is finding interfaces that are generally useful and work well across a wide variety of settings. Yes, we spend a lot of time on that, but the end result should be better overall. We've seen this repeatedly, with math/bits, with FMA, with 128-bit integer operations.

If there are more of these, let's focus on these specific use cases and not on trying to debug an inline assembly proposal. The table from @TocarIP is interesting - if it points out other specific kinds of functionality we need to better support in the library, let's open new issues for those.

@rsc closed this as completed Sep 19, 2018
@golang locked and limited conversation to collaborators Sep 19, 2019