A faster s2b function #637

zhangyunhao116 · 2019-08-18T16:52:11Z

The new function uses the result's stack slot to store the final value directly, so it never allocates a temporary struct. When the Go compiler does not apply deeper optimizations to the calling code, s2bFast is about 100% faster; when it does (inlining, for example), s2bFast is 5%~15% faster. s2bFast always returns the same result as s2b, so the two functions are identical from the caller's perspective. The ASM shows this: the new function needs less stack space and has no locals.

Environment: go1.12.7 darwin/amd64
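(The Go code)

// s2b builds a temporary SliceHeader (bh) on the stack and reinterprets it as a []byte.
func s2b(s string) []byte {
	sh := (*StringHeader)(unsafe.Pointer(&s))
	bh := SliceHeader{
		Data: sh.Data,
		Len:  sh.Len,
		Cap:  sh.Len,
	}
	return *(*[]byte)(unsafe.Pointer(&bh))
}

// s2bFast writes the header fields straight into the named result b, so no temporary is needed.
func s2bFast(s string) (b []byte) {
	bh := (*SliceHeader)(unsafe.Pointer(&b))
	sh := *(*StringHeader)(unsafe.Pointer(&s))
	bh.Data = sh.Data
	bh.Len = sh.Len
	bh.Cap = sh.Len
	return b
}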
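
The claim that both functions look identical to the caller can be checked with a small standalone program. The following is only a sketch: it assumes the unqualified StringHeader and SliceHeader above are the ones from the reflect package, and sameResult is a hypothetical helper that is not part of the pull request. Note that go vet may flag this header-based pattern, so treat it as an illustration rather than recommended code.

```go
package main

import (
	"fmt"
	"reflect"
	"unsafe"
)

// s2b builds a temporary slice header on the stack, then reinterprets it.
func s2b(s string) []byte {
	sh := (*reflect.StringHeader)(unsafe.Pointer(&s))
	bh := reflect.SliceHeader{Data: sh.Data, Len: sh.Len, Cap: sh.Len}
	return *(*[]byte)(unsafe.Pointer(&bh))
}

// s2bFast fills in the header of the named result b directly.
func s2bFast(s string) (b []byte) {
	bh := (*reflect.SliceHeader)(unsafe.Pointer(&b))
	sh := *(*reflect.StringHeader)(unsafe.Pointer(&s))
	bh.Data = sh.Data
	bh.Len = sh.Len
	bh.Cap = sh.Len
	return b
}

// sameResult (hypothetical helper) reports whether both conversions
// yield the same contents, length, and capacity for s.
func sameResult(s string) bool {
	a, b := s2b(s), s2bFast(s)
	return string(a) == string(b) && len(a) == len(b) && cap(a) == cap(b)
}

func main() {
	for _, s := range []string{"", "111", "hello, world"} {
		fmt.Printf("%q same=%v\n", s, sameResult(s))
	}
}
```

The returned slices share memory with the original string, so the sketch only reads from them and never writes.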

(In ASM)

"".s2b STEXT nosplit size=88 args=0x28 locals=0x20
0x0000 00000 (main.go:33) TEXT "".s2b(SB), NOSPLIT|ABIInternal, $32-40
0x0000 00000 (main.go:33) SUBQ $32, SP
0x0004 00004 (main.go:33) MOVQ BP, 24(SP)
0x0009 00009 (main.go:33) LEAQ 24(SP), BP
0x000e 00014 (main.go:33) FUNCDATA $0, gclocals·9fad110d66c97cf0b58d28cccea80b12(SB)
0x000e 00014 (main.go:33) FUNCDATA $1, gclocals·7d2d5fca80364273fb07d5820a76fef4(SB)
0x000e 00014 (main.go:33) FUNCDATA $3, gclocals·ebb0e8ce1793da18f0378b883cb3e122(SB)
0x000e 00014 (main.go:33) FUNCDATA $4, "".s2b.stkobj(SB)
0x000e 00014 (main.go:35) PCDATA $2, $0
0x000e 00014 (main.go:35) PCDATA $0, $0
0x000e 00014 (main.go:35) XORPS X0, X0
0x0011 00017 (main.go:35) MOVUPS X0, "".bh(SP)
0x0015 00021 (main.go:35) MOVQ $0, "".bh+16(SP)
0x001e 00030 (main.go:36) MOVQ "".s+40(SP), AX
0x0023 00035 (main.go:36) MOVQ AX, "".bh(SP)
0x0027 00039 (main.go:37) MOVQ "".s+48(SP), AX
0x002c 00044 (main.go:37) MOVQ AX, "".bh+8(SP)
0x0031 00049 (main.go:38) PCDATA $0, $1
0x0031 00049 (main.go:38) MOVQ "".s+48(SP), CX
0x0036 00054 (main.go:38) MOVQ CX, "".bh+16(SP)
0x003b 00059 (main.go:40) PCDATA $2, $1
0x003b 00059 (main.go:40) MOVQ "".bh(SP), DX
0x003f 00063 (main.go:40) PCDATA $2, $0
0x003f 00063 (main.go:40) PCDATA $0, $2
0x003f 00063 (main.go:40) MOVQ DX, "".~r1+56(SP)
0x0044 00068 (main.go:40) MOVQ AX, "".~r1+64(SP)
0x0049 00073 (main.go:40) MOVQ CX, "".~r1+72(SP)
0x004e 00078 (main.go:40) MOVQ 24(SP), BP
0x0053 00083 (main.go:40) ADDQ $32, SP
0x0057 00087 (main.go:40) RET
0x0000 48 83 ec 20 48 89 6c 24 18 48 8d 6c 24 18 0f 57 H.. H.l$.H.l$..W
0x0010 c0 0f 11 04 24 48 c7 44 24 10 00 00 00 00 48 8b ....$H.D$.....H.
0x0020 44 24 28 48 89 04 24 48 8b 44 24 30 48 89 44 24 D$(H..$H.D$0H.D$
0x0030 08 48 8b 4c 24 30 48 89 4c 24 10 48 8b 14 24 48 .H.L$0H.L$.H..$H
0x0040 89 54 24 38 48 89 44 24 40 48 89 4c 24 48 48 8b .T$8H.D$@H.L$HH.
0x0050 6c 24 18 48 83 c4 20 c3 l$.H.. .

"".s2bFast STEXT nosplit size=43 args=0x28 locals=0x0
0x0000 00000 (main.go:43) TEXT "".s2bV1(SB), NOSPLIT|ABIInternal, $0-40
0x0000 00000 (main.go:43) FUNCDATA $0, gclocals·39d1b96ca581879f548ad2c8aeb3a5fe(SB)
0x0000 00000 (main.go:43) FUNCDATA $1, gclocals·7d2d5fca80364273fb07d5820a76fef4(SB)
0x0000 00000 (main.go:43) FUNCDATA $3, gclocals·33cdeccccebe80329f1fdbee7f5874cb(SB)
0x0000 00000 (main.go:43) FUNCDATA $4, "".s2bV1.stkobj(SB)
0x0000 00000 (main.go:43) PCDATA $2, $0
0x0000 00000 (main.go:43) PCDATA $0, $1
0x0000 00000 (main.go:43) MOVQ $0, "".b+24(SP)
0x0009 00009 (main.go:43) XORPS X0, X0
0x000c 00012 (main.go:43) MOVUPS X0, "".b+32(SP)
0x0011 00017 (main.go:45) MOVQ "".s+16(SP), AX
0x0016 00022 (main.go:45) PCDATA $0, $2
0x0016 00022 (main.go:45) MOVQ "".s+8(SP), CX
0x001b 00027 (main.go:46) MOVQ CX, "".b+24(SP)
0x0020 00032 (main.go:47) MOVQ AX, "".b+32(SP)
0x0025 00037 (main.go:48) MOVQ AX, "".b+40(SP)
0x002a 00042 (main.go:49) RET
Benchmark code:

func Benchmarks2b(b *testing.B) {
	for i := 0; i < b.N; i++ {
		s2b("111")
	}
}

func Benchmarks2bFast(b *testing.B) {
	for i := 0; i < b.N; i++ {
		s2bFast("111")
	}
}

Benchmark result (all optimizations enabled)

goos: darwin
goarch: amd64
pkg: main/utils
Benchmarks2b-8 2000000000 0.29 ns/op
Benchmarks2bFast-8 2000000000 0.26 ns/op

Benchmark result (inlining disabled for the benchmark, to simulate the unoptimized case)

goos: darwin
goarch: amd64
pkg: main/utils
Benchmarks2b-8 500000000 3.48 ns/op
Benchmarks2bFast-8 2000000000 1.56 ns/op
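
The description does not say how inlining was disabled for the second run. One possible way to set up a comparable measurement is sketched below, under these assumptions: the //go:noinline directive is used on each function, the header types come from the reflect package, and the names s2bNoInline, s2bFastNoInline, and sink are hypothetical. The package-level sink is an addition that keeps the compiler from discarding the results. Timings will differ by machine and Go version.

```go
package main

import (
	"fmt"
	"reflect"
	"testing"
	"unsafe"
)

// sink keeps the converted slices alive so the calls are not optimized away.
var sink []byte

// s2bNoInline is the original version, with inlining disabled (assumption).
//
//go:noinline
func s2bNoInline(s string) []byte {
	sh := (*reflect.StringHeader)(unsafe.Pointer(&s))
	bh := reflect.SliceHeader{Data: sh.Data, Len: sh.Len, Cap: sh.Len}
	return *(*[]byte)(unsafe.Pointer(&bh))
}

// s2bFastNoInline is the new version, with inlining disabled (assumption).
//
//go:noinline
func s2bFastNoInline(s string) (b []byte) {
	bh := (*reflect.SliceHeader)(unsafe.Pointer(&b))
	sh := *(*reflect.StringHeader)(unsafe.Pointer(&s))
	bh.Data = sh.Data
	bh.Len = sh.Len
	bh.Cap = sh.Len
	return b
}

func main() {
	// testing.Benchmark runs a benchmark function outside of "go test".
	slow := testing.Benchmark(func(b *testing.B) {
		for i := 0; i < b.N; i++ {
			sink = s2bNoInline("111")
		}
	})
	fast := testing.Benchmark(func(b *testing.B) {
		for i := 0; i < b.N; i++ {
			sink = s2bFastNoInline("111")
		}
	})
	fmt.Println("s2b    ", slow)
	fmt.Println("s2bFast", fast)
}
```

Using the directive instead of a global compiler flag keeps the benchmark loop itself optimized, so only the call under test is forced to stay out of line.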

erikdubbelboer

Good find, just one question.

bytesconv.go

zhangyunhao116 · 2019-08-19T02:47:30Z

I used to think so too, but judging from the Go compiler's ASM output, keeping sh as the value *(*reflect.StringHeader)(unsafe.Pointer(&s)) costs less than keeping it as a pointer. There may be some compiler optimization at work in this case. Let's look at the ASM.
(Go code)

// s2bV1 copies the string header into sh by value.
func s2bV1(s string) (b []byte) {
	bh := (*SliceHeader)(unsafe.Pointer(&b))
	sh := *(*StringHeader)(unsafe.Pointer(&s))
	bh.Data = sh.Data
	bh.Len = sh.Len
	bh.Cap = sh.Len
	return b
}

// s2bV2 keeps sh as a pointer to the string header.
func s2bV2(s string) (b []byte) {
	bh := (*SliceHeader)(unsafe.Pointer(&b))
	sh := (*StringHeader)(unsafe.Pointer(&s))
	bh.Data = sh.Data
	bh.Len = sh.Len
	bh.Cap = sh.Len
	return b
}

(In ASM, without GC code)

"".s2bV1 STEXT nosplit size=43 args=0x28 locals=0x0
0x0000 00000 (t_main2.go:18) TEXT "".s2bV1(SB), NOSPLIT|ABIInternal, $0-40
0x0011 00017 (t_main2.go:20) MOVQ "".s+16(SP), AX
0x0016 00022 (t_main2.go:20) MOVQ "".s+8(SP), CX
0x001b 00027 (t_main2.go:21) MOVQ CX, "".b+24(SP)
0x0020 00032 (t_main2.go:22) MOVQ AX, "".b+32(SP)
0x0025 00037 (t_main2.go:23) MOVQ AX, "".b+40(SP)
0x002a 00042 (t_main2.go:24) RET

"".s2bV2 STEXT nosplit size=48 args=0x28 locals=0x0
0x0000 00000 (t_main2.go:27) TEXT "".s2bV2(SB), NOSPLIT|ABIInternal, $0-40
0x0011 00017 (t_main2.go:30) MOVQ "".s+8(SP), AX
0x0016 00022 (t_main2.go:30) MOVQ AX, "".b+24(SP)
0x001b 00027 (t_main2.go:31) MOVQ "".s+16(SP), AX
0x0020 00032 (t_main2.go:31) MOVQ AX, "".b+32(SP)
0x0025 00037 (t_main2.go:32) MOVQ "".s+16(SP), AX
0x002a 00042 (t_main2.go:32) MOVQ AX, "".b+40(SP)
0x002f 00047 (t_main2.go:33) RET

We can see that version one uses two registers (AX and CX) while version two uses only one (AX), so V1 needs fewer instructions: V1 uses 5 MOVQ instructions and V2 uses 6. (The size of V1 is 43 and the size of V2 is 48; this is the only difference.)

erikdubbelboer · 2019-08-19T08:45:11Z

Interesting. Thanks!

zhangyunhao116 added 4 commits August 18, 2019 23:37

Update bytesconv.go

07392d8

Fix

54aa990

Min stack space

3e98d1c

Use pointer for smaller stack space

c48d373

erikdubbelboer reviewed Aug 18, 2019

erikdubbelboer added the pending/submitter-response label Aug 18, 2019

erikdubbelboer merged commit c5413ff into valyala:master Aug 19, 2019

zhangyunhao116 mentioned this pull request Aug 21, 2019

util: Faster-Slice-Function pingcap/tidb#11808

Merged

erikdubbelboer mentioned this pull request Feb 12, 2021

fix s2b go vet warning #967

Merged
