cmd/compile: allow methods implemented in assembly #4978
Comments
It's far from clear to me that this is a use case we want to support. It complicates the tool chain and creates new ways that internal changes might break packages in the future, without a significant benefit. Why not just call an assembly function from your method? Is the assembly function doing so little that the overhead hurts?
Yes. The overhead can be 30% in some cases and goes down toward 5%-10% in general. What I'm using is this non-leaf method:

```go
// add: z += a
func (z *Integer) Add(a *Integer) {
	FixedAdd(z, a) // Sigh... not possible to generate methods in assembler
}

// increment: z += 1
func (z *Integer) Inc() {
	FixedInc(z) // Sigh... not possible to generate methods in assembler
}
```

One alternative would be to inline these calls. Perhaps gcflags="-l -l -l -l -l -l -l -l -l -l -l -l -l -l" will (asymptotically, with enough L's) do the job. Another is to unwrap my method calls at all the call sites. I have three versions of math libraries under my code, one of which has the assembler versions. In the others, the code is inline or else uses math/big. I could make an m4 macro ADD(a,b) that expands to a.Add(b) in two cases and Add(a,b) in the third. Easy to do, and easy to run m4 from a Makefile.

The reason for the bug/feature request was aspirational, toward a clean/complete solution as opposed to the workarounds. I'm actually comfortable with the workarounds. Every tool has limits, and limits can be circumvented. The problem is that the code itself is trivial in assembler but impossible in Go (or C or ...) because I need to look at the carry flag. Not doing that means very awkward workarounds:

```go
func FixedAdd(z, a *Integer) { // general, portable, slow version
	carry := uint64(0)
	for i, v := range a {
		t := z[i] + v + carry
		z[i] = t
		carry = 0
		if t < v {
			carry = 1
		}
	}
}
```

vs.

```
// FixedAdd: z += a, z and a are 4-word (256 bit) fixed precision integers
// func FixedAdd(z, a *Integer)
TEXT ·FixedAdd(SB),$0-16
	MOVQ a+8(FP), CX  // make a[] accessible
	MOVQ z+0(FP), AX  // make z[] accessible
	MOVQ (CX), BX     // load a[0]
	MOVQ (AX), R8     // load z[0]
	ADDQ BX, R8       // sum = a[0] + z[0], set carry
	MOVQ R8, (AX)     // store sum in z[0]
	MOVQ 8(CX), BX    // load a[1]
	MOVQ 8(AX), R8    // load z[1]
	ADCQ BX, R8       // sum = a[1] + z[1] + carry
	MOVQ R8, 8(AX)    // store sum in z[1]
	MOVQ 16(CX), BX   // load a[2]
	MOVQ 16(AX), R8   // load z[2]
	ADCQ BX, R8       // sum = a[2] + z[2] + carry
	MOVQ R8, 16(AX)   // store sum in z[2]
	MOVQ 24(CX), BX   // load a[3]
	MOVQ 24(AX), R8   // load z[3]
	ADCQ BX, R8       // sum = a[3] + z[3] + carry
	MOVQ R8, 24(AX)   // store sum in z[3]
	RET

// FixedInc: z++, z is a 4-word (256 bit) fixed precision integer
// func FixedInc(z *Integer)
TEXT ·FixedInc(SB),$0-8
	MOVQ z+0(FP), AX  // make z[] accessible
	MOVQ (AX), BP     // load z[0]
	ADDQ $1, BP       // increment
	MOVQ BP, (AX)     // store sum in z[0]
	MOVQ 8(AX), BP    // load z[1]
	ADCQ $0, BP       // add carry
	MOVQ BP, 8(AX)    // store sum in z[1]
	MOVQ 16(AX), BP   // load z[2]
	ADCQ $0, BP       // add carry
	MOVQ BP, 16(AX)   // store sum in z[2]
	MOVQ 24(AX), BP   // load z[3]
	ADCQ $0, BP       // add carry
	MOVQ BP, 24(AX)   // store sum in z[3]
	RET
```

for the 4-word unrolled case. I know most people don't care about things like this, but my code is 3x faster than math/big, and that matters to me when the program has month-long runtimes on my 8-core Xeon. 10 days is a lot shorter than a month, especially in terms of threats like power outages or accidental family unplugging. ;-) The (seemingly) needless call in the method costs at most two or three days wasted. I can deal with that, though it feels incomplete that developers can't compile with -S, tweak the output, and replace the original with an assembler version.

One other thought--slightly hacky--is that my assembler versions *are already the methods* that I need, given the calling sequence.
If there were a relaxation of something in the existing toolchain that would look at the declaration file and call the methods there as the simple functions that they are, all would be well. For clarity, the present case, in fixed_decl.go:

```go
package integer

// implemented in add_$GOARCH.s
func FixedAdd(z, a *Integer)
func FixedInc(z *Integer)
```

Present case, in the mind of the compiler: "FixedAdd is a function of two arguments..."

In the alternate case, fixed_decl.go would be:

```go
package integer

// implemented in add_$GOARCH.s
func (z *Integer) FixedAdd(a *Integer)
func (z *Integer) FixedInc()
```

Alternate case, in the mind of the compiler, still: "FixedAdd is a function of two arguments..." The change being to use the declaration "func (z *Integer) FixedAdd(a *Integer)" for calling purposes but knowingly link in "func FixedAdd(z, a *Integer)" for symbol table/type purposes. This is acknowledging that if the tool chain would just play along, then it would all work out fine as it already is. (The futility of which makes waiting an extra week seem like an extra two weeks! ;-)

Best to you, whatever makes sense for Go,
Michael
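(For later readers: the carry handling that motivates the assembly above can now be written in portable Go using math/bits.Add64, which arrived in Go 1.9, years after this thread. Below is a minimal sketch, assuming Integer is [4]uint64 as implied by the 4-word layout; the helper names are hypothetical and not from the thread.)

```go
package integer

import "math/bits"

// Integer is assumed to be a 4-word (256-bit) fixed-precision value,
// matching the layout the assembly above operates on.
type Integer [4]uint64

// fixedAddBits: z += a, propagating the carry between words with
// bits.Add64 rather than reading the CPU carry flag directly.
func fixedAddBits(z, a *Integer) {
	var carry uint64
	for i := range z {
		z[i], carry = bits.Add64(z[i], a[i], carry)
	}
}

// fixedIncBits: z += 1, feeding the carry out of each word into the next.
func fixedIncBits(z *Integer) {
	carry := uint64(1)
	for i := range z {
		z[i], carry = bits.Add64(z[i], carry, 0)
	}
}
```

On amd64, current compilers recognize bits.Add64 as an intrinsic, so a loop like this typically lowers to ADD/ADC sequences close to the hand-written version, though that was not an option in 2013.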
I'm still reluctant to do this for Go 1.1, but if you'd like to patch it into your tree, the change itself is tiny: https://golang.org/cl/7492043

Russ
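(With that patch applied, the "alternate case" declaration file sketched above presumably becomes legal as written: a method can be declared in Go without a body and implemented in the .s file. The sketch below covers the Go side only, reusing the assumed Integer type and the Add/Inc method names from the earlier comments.)

```go
package integer

type Integer [4]uint64 // assumed layout, as above

// Implemented in add_$GOARCH.s. With the patch, these body-less method
// declarations are accepted, so no forwarding wrapper function is needed.
func (z *Integer) Add(a *Integer)
func (z *Integer) Inc()
```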
Awesome! Just applied the patch, changed the Go code and the assembly-generator code, rebuilt my tree, and ran some tests:

```
old:
BenchmarkAdd	500000000	7.38 ns/op
BenchmarkInc	500000000	6.01 ns/op

new:
BenchmarkAdd	500000000	6.38 ns/op
BenchmarkInc	500000000	5.14 ns/op
```

Deltas: Add -13.550%, Inc -14.475%.

That's great: 13.55% of 30 days saves 4.065 days; that's 4 days not spent doing those empty call/ret pairs. Fantastic.
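(For reference, benchmarks like the ones quoted above would look roughly like the sketch below, assuming the Integer type and the Add/Inc methods from the earlier comments; the actual benchmark source was not posted in the thread.)

```go
package integer

import "testing"

var (
	benchX = Integer{1, 2, 3, 4}
	benchY = Integer{5, 6, 7, 8}
)

func BenchmarkAdd(b *testing.B) {
	for i := 0; i < b.N; i++ {
		benchX.Add(&benchY)
	}
}

func BenchmarkInc(b *testing.B) {
	for i := 0; i < b.N; i++ {
		benchX.Inc()
	}
}
```

Since each iteration is a single method call, the reported ns/op is essentially the per-call cost, which is why removing one call/ret pair shows up as roughly a 1 ns improvement.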
Further:

```
old:
2013/02/15 18:00:34.665274 work ends, elapsed time 149.9099 sec
2013/02/15 18:00:34.665296 execution time: user 149.5949 sec, system 0.3235 sec, CPU 100.006%, memory 36.000 MiB
2013/02/15 18:00:34.665303 exiting

new:
2013/03/07 06:16:40.291520 work ends, elapsed time 121.1103 sec
2013/03/07 06:16:40.291537 execution time: user 121.0659 sec, system 0.1636 sec, CPU 100.098%, memory 30.691 MiB
2013/03/07 06:16:40.291543 exiting
```

which is a 19.07% gain in user time for me. Thank you very much! This is single-threaded on an 8-core system. Multi-threaded yields:

```
2013/03/07 06:14:02.558200 work ends, elapsed time 19.7233 sec
2013/03/07 06:14:02.558223 execution time: user 126.9043 sec, system 4.1617 sec, CPU 664.524%, memory 169.324 MiB
2013/03/07 06:14:02.558229 exiting
```
Change https://golang.org/cl/111648 mentions this issue: |
An alternative idea would be to define this as a special case of inlining. Then, as a first implementation, the call could be turned into an immediate jump under the following conditions:
The debug story is still unclear to me then, as the only statement from the enclosing method would be the jump.
At this point, we've survived for a very long time without a way to write methods directly in assembly, and we've also successfully hidden the ABI for methods, and changed it over time. This is good and probably should not be undone. See also #27539. |