
runtime: CPU instruction "NOP" utilizes ~50% CPU and pprof doesn't explain that #30708

Closed
xaionaro opened this issue Mar 9, 2019 · 6 comments
Labels: FrozenDueToAge, NeedsInvestigation
Milestone: Backlog

Comments

@xaionaro
Contributor

xaionaro commented Mar 9, 2019

What version of Go are you using (go version)?

$ go version
go version go1.12 linux/amd64

Does this issue reproduce with the latest release?

Yes.

What operating system and processor architecture are you using (go env)?

go env Output
$ go env
GOARCH="amd64"
GOBIN=""
GOCACHE="/home/xaionaro/.cache/go-build"
GOEXE=""
GOFLAGS=""
GOHOSTARCH="amd64"
GOHOSTOS="linux"
GOOS="linux"
GOPATH="/home/xaionaro/go"
GOPROXY=""
GORACE=""
GOROOT="/home/xaionaro/.gimme/versions/go1.12.linux.amd64"
GOTMPDIR=""
GOTOOLDIR="/home/xaionaro/.gimme/versions/go1.12.linux.amd64/pkg/tool/linux_amd64"
GCCGO="gccgo"
CC="gcc"
CXX="g++"
CGO_ENABLED="1"
GOMOD=""
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"
PKG_CONFIG="pkg-config"
GOGCCFLAGS="-fPIC -m64 -pthread -fmessage-length=0 -fdebug-prefix-map=/tmp/go-build465181040=/tmp/go-build -gno-record-gcc-switches"

What did you do?

go get github.com/xaionaro-go/atomicmap
cd "$(go env GOPATH)"/src/github.com/xaionaro-go/atomicmap
git pull # just in case
git checkout performance_experiments
go get ./...
go test ./ -bench=Benchmark_atomicmap_Get_intKeyType_blockSize16777216_keyAmount1048576_trueThreadSafety -benchmem -benchtime 5s -timeout 60s -cpuprofile /tmp/cpu.prof
go tool pprof /tmp/cpu.prof
(pprof) web
(pprof) web increaseReadersStage0Sub0
(pprof) web increaseReadersStage0Sub0Sub0
(pprof) web increaseReadersStage0Sub0Sub1

What did you expect to see?

I expected to see information about what is utilizing the CPU.

What did you see instead?

I see an empty function (one that does nothing by itself) utilizing ~50% of the CPU. To be more specific, the instruction "NOPL" utilizes the CPU. That doesn't make any sense.

[Screenshots from 2019-03-10: pprof web graphs for the overall profile and for increaseReadersStage0Sub0 and its Sub0/Sub1 callees]

The method:

func (slot *storageItem) increaseReadersStage0Sub0() isSet {
    slot.increaseReadersStage0Sub0Sub0()
    return slot.increaseReadersStage0Sub0Sub1()
}

According to pprof (see the screenshots), neither of these sub-calls utilizes anything significant, yet the method itself utilizes about 50% of the CPU.

I split these functions up this way intentionally to demonstrate the problem. The problem persists even if I remove these extra call levels (and even if I manually inline the code into the method getByHashValue).

         .          .    107:func (slot *storageItem) increaseReadersStage0Sub0() isSet {
         .       30ms    108:   slot.increaseReadersStage0Sub0Sub0()
     9.05s      9.10s    109:   return slot.increaseReadersStage0Sub0Sub1()
         .          .    110:}

Another attempt (I split the return onto two lines and removed the type isSet) gives the same result:

         .          .    107:func (slot *storageItem) increaseReadersStage0Sub0() uint32 {
         .       10ms    108:   slot.increaseReadersStage0Sub0Sub0()
     8.16s      8.89s    109:   r := slot.increaseReadersStage0Sub0Sub1()
         .          .    110:   return r
         .          .    111:}

disasm

ROUTINE ======================== github.com/xaionaro-go/atomicmap.(*storageItem).increaseReaders
     4.61s     14.29s (flat, cum) 191.04% of Total
         .          .     4bb370: MOVQ FS:0xfffffff8, CX                  ;storage.go:146
      10ms       10ms     4bb379: CMPQ 0x10(CX), SP                       ;github.com/xaionaro-go/atomicmap.(*storageItem).increaseReaders storage.go:146
         .          .     4bb37d: JBE 0x4bb3ef                            ;storage.go:146
         .          .     4bb37f: SUBQ $0x18, SP
         .          .     4bb383: MOVQ BP, 0x10(SP)
         .          .     4bb388: LEAQ 0x10(SP), BP
      10ms      4.59s     4bb38d: NOPL                                    ;github.com/xaionaro-go/atomicmap.(*storageItem).increaseReaders storage.go:147
         .      4.58s     4bb38e: NOPL                                    ;github.com/xaionaro-go/atomicmap.(*storageItem).increaseReadersStage0 storage.go:113
         .       10ms     4bb38f: NOPL                                    ;github.com/xaionaro-go/atomicmap.(*storageItem).increaseReadersStage0Sub0 storage.go:108
         .          .     4bb390: MOVL $0x1, AX                           ;storage.go:147
      10ms       10ms     4bb395: MOVQ 0x20(SP), CX                       ;github.com/xaionaro-go/atomicmap.(*storageItem).increaseReadersStage0Sub0Sub0 storage.go:102
         .          .     4bb39a: LOCK XADDL AX, 0x4(CX)                  ;storage.go:102
     4.07s      4.57s     4bb39f: NOPL                                    ;github.com/xaionaro-go/atomicmap.(*storageItem).increaseReadersStage0Sub0 storage.go:109
     490ms      500ms     4bb3a0: NOPL                                    ;github.com/xaionaro-go/atomicmap.(*storageItem).increaseReadersStage0Sub0Sub1 storage.go:105
      10ms       10ms     4bb3a1: NOPL                                    ;github.com/xaionaro-go/atomicmap.(*storageItem).IsSet storage.go:41
         .          .     4bb3a2: MOVL 0(CX), AX                          ;storage.go:21
         .          .     4bb3a4: TESTL AX, AX                            ;storage.go:117
         .          .     4bb3a6: JNE 0x4bb3d4
         .          .     4bb3a8: MOVL $-0x1, DX                          ;storage.go:118
         .          .     4bb3ad: LOCK XADDL DX, 0x4(CX)
         .          .     4bb3b2: CMPL $0xa, AX                           ;storage.go:148
         .          .     4bb3b5: JE 0x4bb3c5
      10ms       10ms     4bb3b7: MOVL AX, 0x28(SP)                       ;github.com/xaionaro-go/atomicmap.(*storageItem).increaseReaders storage.go:151
         .          .     4bb3bb: MOVQ 0x10(SP), BP                       ;storage.go:151
         .          .     4bb3c0: ADDQ $0x18, SP
         .          .     4bb3c4: RET
         .          .     4bb3c5: MOVQ CX, 0(SP)                          ;storage.go:149
         .          .     4bb3c9: CALL github.com/xaionaro-go/atomicmap.(*storageItem).increaseReadersStage1(SB)
         .          .     4bb3ce: MOVL 0x8(SP), AX
         .          .     4bb3d2: JMP 0x4bb3b7
         .          .     4bb3d4: CMPL $0x1, AX                           ;storage.go:115
         .          .     4bb3d7: JE 0x4bb3b2
         .          .     4bb3d9: CMPL $0x4, AX                           ;storage.go:117
         .          .     4bb3dc: JE 0x4bb3a8
         .          .     4bb3de: MOVL $-0x1, DX                          ;storage.go:121
         .          .     4bb3e3: LOCK XADDL DX, 0x4(CX)
         .          .     4bb3e8: MOVL $0xa, AX
         .          .     4bb3ed: JMP 0x4bb3b2                            ;storage.go:147
         .          .     4bb3ef: CALL runtime.morestack_noctxt(SB)       ;storage.go:146
         .          .     4bb3f4: ?
         .          .     4bb3f5: JA 0x4bb3f6
         .          .     4bb3f7: ?
@xaionaro xaionaro changed the title Something (effectively empty method) utilizes ~50% CPU and pprof doesn't show what it is NOPL utilizes ~50% CPU and pprof explain that Mar 10, 2019
@xaionaro xaionaro changed the title NOPL utilizes ~50% CPU and pprof explain that Instruction "NOPL" utilizes ~50% CPU and pprof explain that Mar 10, 2019
@xaionaro xaionaro changed the title Instruction "NOPL" utilizes ~50% CPU and pprof explain that CPU instruction "NOPL" utilizes ~50% CPU and pprof doesn't explain that Mar 10, 2019
@xaionaro xaionaro changed the title CPU instruction "NOPL" utilizes ~50% CPU and pprof doesn't explain that CPU instruction "NOP" utilizes ~50% CPU and pprof doesn't explain that Mar 10, 2019
@xaionaro
Contributor Author

xaionaro commented Mar 10, 2019

Just in case: the same problem occurs if I remove the LOCK XADDL:

ROUTINE ======================== github.com/xaionaro-go/atomicmap.(*storageItem).increaseReaders
     4.48s     13.27s (flat, cum) 175.99% of Total
         .          .     4bb370: MOVQ FS:0xfffffff8, CX                  ;storage.go:147
      10ms       10ms     4bb379: CMPQ 0x10(CX), SP                       ;github.com/xaionaro-go/atomicmap.(*storageItem).increaseReaders storage.go:147
         .          .     4bb37d: JBE 0x4bb3e8                            ;storage.go:147
         .          .     4bb37f: SUBQ $0x18, SP
         .          .     4bb383: MOVQ BP, 0x10(SP)
      10ms       10ms     4bb388: LEAQ 0x10(SP), BP                       ;github.com/xaionaro-go/atomicmap.(*storageItem).increaseReaders storage.go:147
      20ms      4.42s     4bb38d: NOPL                                    ;github.com/xaionaro-go/atomicmap.(*storageItem).increaseReaders storage.go:148
         .      4.39s     4bb38e: NOPL                                    ;github.com/xaionaro-go/atomicmap.(*storageItem).increaseReadersStage0 storage.go:114
         .          .     4bb38f: NOPL                                    ;storage.go:109
         .          .     4bb390: MOVQ 0x20(SP), AX                       ;storage.go:148
         .          .     4bb395: INCL 0x4(AX)                            ;storage.go:102
     4.39s      4.39s     4bb398: NOPL                                    ;github.com/xaionaro-go/atomicmap.(*storageItem).increaseReadersStage0Sub0 storage.go:110
         .          .     4bb399: NOPL                                    ;storage.go:106
         .          .     4bb39a: NOPL                                    ;storage.go:41
         .          .     4bb39b: MOVL 0(AX), CX                          ;storage.go:21
      10ms       10ms     4bb39d: TESTL CX, CX                            ;github.com/xaionaro-go/atomicmap.(*storageItem).increaseReadersStage0 storage.go:118
         .          .     4bb39f: JNE 0x4bb3cd                            ;storage.go:118
         .          .     4bb3a1: MOVL $-0x1, DX                          ;storage.go:119
         .          .     4bb3a6: LOCK XADDL DX, 0x4(AX)
         .          .     4bb3ab: CMPL $0xa, CX                           ;storage.go:149
         .          .     4bb3ae: JE 0x4bb3be
         .          .     4bb3b0: MOVL CX, 0x28(SP)                       ;storage.go:152
         .          .     4bb3b4: MOVQ 0x10(SP), BP
         .          .     4bb3b9: ADDQ $0x18, SP
      40ms       40ms     4bb3bd: RET                                     ;github.com/xaionaro-go/atomicmap.(*storageItem).increaseReaders storage.go:152
         .          .     4bb3be: MOVQ AX, 0(SP)                          ;storage.go:150
         .          .     4bb3c2: CALL github.com/xaionaro-go/atomicmap.(*storageItem).increaseReadersStage1(SB)
         .          .     4bb3c7: MOVL 0x8(SP), CX
         .          .     4bb3cb: JMP 0x4bb3b0
         .          .     4bb3cd: CMPL $0x1, CX                           ;storage.go:116
         .          .     4bb3d0: JE 0x4bb3ab
         .          .     4bb3d2: CMPL $0x4, CX                           ;storage.go:118
         .          .     4bb3d5: JE 0x4bb3a1
         .          .     4bb3d7: MOVL $-0x1, DX                          ;storage.go:122
         .          .     4bb3dc: LOCK XADDL DX, 0x4(AX)
         .          .     4bb3e1: MOVL $0xa, CX
         .          .     4bb3e6: JMP 0x4bb3ab                            ;storage.go:148
         .          .     4bb3e8: CALL runtime.morestack_noctxt(SB)       ;storage.go:147
         .          .     4bb3ed: ?

@ALTree ALTree changed the title CPU instruction "NOP" utilizes ~50% CPU and pprof doesn't explain that runtime: CPU instruction "NOP" utilizes ~50% CPU and pprof doesn't explain that Mar 10, 2019
@ALTree ALTree added this to the Go1.13 milestone Mar 10, 2019
@ALTree ALTree added the NeedsInvestigation label Mar 10, 2019
@JAicewizard

JAicewizard commented Apr 19, 2019

I have a similar situation in os.openFileNolog, where it shows 1.92s on a NOPL.
The same function also shows a different cumulative time in disasm mode than in the normal list mode (a 1.94-second difference, which matches the combined time of all the NOPLs).

ROUTINE ======================== os.openFileNolog

  40ms      5.03s (flat, cum) 13.38% of Total
     .          .     4c7f70: MOVQ FS:0xfffffff8, CX                  ;file_unix.go:190
  10ms       10ms     4c7f79: CMPQ 0x10(CX), SP                       ;os.openFileNolog file_unix.go:190
     .          .     4c7f7d: JBE 0x4c814b                            ;file_unix.go:190
     .          .     4c7f83: SUBQ $0x58, SP
     .          .     4c7f87: MOVQ BP, 0x50(SP)
     .          .     4c7f8c: LEAQ 0x50(SP), BP
     .      1.92s     4c7f91: NOPL                                    ;os.openFileNolog file_unix.go:201
     .          .     4c7f92: NOPL                                    ;file_unix.go:201
     .          .     4c7f93: NOPL                                    ;file_posix.go:38
     .          .     4c7f94: MOVQ $-0x64, 0(SP)                      ;file_unix.go:201
     .          .     4c7f9c: MOVQ 0x60(SP), AX                       ;syscall_linux.go:138
     .          .     4c7fa1: MOVQ AX, 0x8(SP)
     .          .     4c7fa6: MOVQ 0x68(SP), CX
     .          .     4c7fab: MOVQ CX, 0x10(SP)
     .          .     4c7fb0: MOVQ 0x70(SP), DX                       ;file_unix.go:201
     .          .     4c7fb5: BTSQ $0x13, DX
     .          .     4c7fba: MOVQ DX, 0x18(SP)                       ;syscall_linux.go:138
     .          .     4c7fbf: MOVL 0x78(SP), DX                       ;types.go:105
     .          .     4c7fc3: MOVL DX, BX
     .          .     4c7fc5: ANDL $0x1ff, DX
     .          .     4c7fcb: MOVL DX, SI                             ;file_posix.go:40
     .          .     4c7fcd: BTSL $0xb, DX
                                                                      ;file_posix.go:39
     .          .     4c7fd1: TESTL $gitlab.com/golang-commonmark/linkify.match+24016(SB), BX
     .          .     4c7fd7: CMOVNE DX, SI                           ;file_unix.go:201
     .          .     4c7fda: MOVL SI, DX                             ;file_posix.go:43
     .          .     4c7fdc: BTSL $0xa, SI
     .          .     4c7fe0: TESTL $0x400000, BX                     ;file_posix.go:42
     .          .     4c7fe6: CMOVNE SI, DX                           ;file_unix.go:201
     .          .     4c7fe9: MOVL DX, SI                             ;file_posix.go:46
     .          .     4c7feb: BTSL $0x9, DX
     .          .     4c7fef: TESTL $0x100000, BX                     ;file_posix.go:45
     .          .     4c7ff5: CMOVNE DX, SI                           ;file_unix.go:201
     .          .     4c7ff8: MOVL SI, 0x20(SP)                       ;syscall_linux.go:138
     .      1.92s     4c7ffc: CALL syscall.openat(SB)                 ;syscall.Open syscall_linux.go:138
     .          .     4c8001: MOVQ 0x38(SP), AX                       ;file_unix.go:201
  10ms       10ms     4c8006: MOVQ 0x30(SP), CX                       ;os.openFileNolog file_unix.go:201
     .          .     4c800b: MOVQ 0x28(SP), DX                       ;file_unix.go:201
     .          .     4c8010: TESTQ CX, CX                            ;file_unix.go:202
     .          .     4c8013: JE 0x4c8103                             ;file_posix.go:39
     .          .     4c8019: MOVQ AX, 0x48(SP)                       ;file_unix.go:201
     .          .     4c801e: MOVQ CX, 0x40(SP)
     .          .     4c8023: LEAQ 0x4a3eb6(IP), AX                   ;file_unix.go:213
     .          .     4c802a: MOVQ AX, 0(SP)
     .          .     4c802e: CALL runtime.newobject(SB)
     .          .     4c8033: MOVQ 0x8(SP), DI
     .          .     4c8038: MOVQ $0x4, 0x8(DI)
     .          .     4c8040: CMPL $0x0, runtime.writeBarrier(SB)
     .          .     4c8047: JNE 0x4c80f2
     .          .     4c804d: LEAQ 0x4f8509(IP), CX
     .          .     4c8054: MOVQ CX, 0(DI)
     .          .     4c8057: MOVQ 0x68(SP), CX
     .          .     4c805c: MOVQ CX, 0x18(DI)
     .          .     4c8060: CMPL $0x0, runtime.writeBarrier(SB)
     .          .     4c8067: JNE 0x4c80d6
     .          .     4c8069: MOVQ 0x60(SP), CX
     .          .     4c806e: MOVQ CX, 0x10(DI)
     .          .     4c8072: MOVQ 0x40(SP), CX
     .          .     4c8077: MOVQ CX, 0x20(DI)
     .          .     4c807b: CMPL $0x0, runtime.writeBarrier(SB)
     .          .     4c8082: JNE 0x4c80ba
     .          .     4c8084: MOVQ 0x48(SP), AX
     .          .     4c8089: MOVQ AX, 0x28(DI)
     .          .     4c808d: MOVQ $0x0, 0x80(SP)
     .          .     4c8099: LEAQ go.itab.*os.PathError,error(SB), AX
     .          .     4c80a0: MOVQ AX, 0x88(SP)
     .          .     4c80a8: MOVQ DI, 0x90(SP)
     .          .     4c80b0: MOVQ 0x50(SP), BP
     .          .     4c80b5: ADDQ $0x58, SP
     .          .     4c80b9: RET
     .          .     4c80ba: LEAQ 0x28(DI), CX
     .          .     4c80be: MOVQ DI, AX
     .          .     4c80c1: MOVQ CX, DI
     .          .     4c80c4: MOVQ AX, DX
     .          .     4c80c7: MOVQ 0x48(SP), AX
     .          .     4c80cc: CALL runtime.gcWriteBarrier(SB)
     .          .     4c80d1: MOVQ DX, DI
     .          .     4c80d4: JMP 0x4c808d
     .          .     4c80d6: LEAQ 0x10(DI), CX
     .          .     4c80da: MOVQ DI, AX
     .          .     4c80dd: MOVQ CX, DI
     .          .     4c80e0: MOVQ AX, DX
     .          .     4c80e3: MOVQ 0x60(SP), AX
     .          .     4c80e8: CALL runtime.gcWriteBarrier(SB)
     .          .     4c80ed: MOVQ DX, DI
     .          .     4c80f0: JMP 0x4c8072
     .          .     4c80f2: LEAQ 0x4f8464(IP), AX
     .          .     4c80f9: CALL runtime.gcWriteBarrier(SB)
     .          .     4c80fe: JMP 0x4c8057
  10ms       10ms     4c8103: MOVQ DX, 0(SP)                          ;os.openFileNolog file_unix.go:227
     .          .     4c8107: MOVQ 0x60(SP), AX                       ;file_unix.go:227
     .          .     4c810c: MOVQ AX, 0x8(SP)
     .          .     4c8111: MOVQ 0x68(SP), AX
     .          .     4c8116: MOVQ AX, 0x10(SP)
     .          .     4c811b: MOVQ $0x1, 0x18(SP)
     .      1.15s     4c8124: CALL os.newFile(SB)                     ;os.openFileNolog file_unix.go:227
     .          .     4c8129: MOVQ 0x20(SP), AX                       ;file_unix.go:227
     .          .     4c812e: MOVQ AX, 0x80(SP)
  10ms       10ms     4c8136: XORPS X0, X0                            ;os.openFileNolog file_unix.go:227
     .          .     4c8139: MOVUPS X0, 0x88(SP)                     ;file_unix.go:227
     .          .     4c8141: MOVQ 0x50(SP), BP
     .          .     4c8146: ADDQ $0x58, SP
     .          .     4c814a: RET
     .          .     4c814b: CALL runtime.morestack_noctxt(SB)       ;file_unix.go:190
     .          .     4c8150: ?
     .          .     4c8151: SBBL SI, DI
     .          .     4c8153: ?


@josharian
Contributor

Speculation: those nops are inlining marks, the subroutine is growing the stack (slow) and the newstack call attribution isn’t working in conjunction with the inlining marks.

cc @hyangah @randall77
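
If that speculation is right, the attribution should change once the helpers stop being inlined, because the compiler then emits real CALL instructions instead of inline-mark NOPs. One quick test is to rebuild with inlining disabled (go test -gcflags='-l') and re-profile; another is to mark the helpers noinline, sketched below with a stand-in body (the real one lives in the issue's repository):

package contention

import "sync/atomic"

type storageItem struct {
    state   uint32 // placeholder for the field at offset 0 in the disassembly
    readers int32  // the contended counter at offset 0x4
}

// Forcing the tiny helper out of line removes the inline-mark NOPL at
// its call site; if the NOPs really are inlining marks soaking up
// misattributed samples, the hot spot should move to the CALL or into
// the callee.
//
//go:noinline
func (slot *storageItem) increaseReadersStage0Sub0Sub0() {
    atomic.AddInt32(&slot.readers, 1)
}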

@andybons andybons modified the milestones: Go1.13, Go1.14 Jul 8, 2019
@rsc rsc modified the milestones: Go1.14, Backlog Oct 9, 2019
@exapsy

exapsy commented Jul 27, 2021

Same!
[screenshot: pprof graph with ~20 seconds attributed to NOPL]
It's a 20-second process spent in NOPL.

I can't pin down exactly why it would show NOPL. I would love to see an explanation of why this happens and how to interpret this information.

@prattmic
Member

See #41338 and #36821.

Due to its use of ITIMER_PROF, pprof results have a skid of up to several instructions. We cannot correct the skid in Go because the amount of skid is not deterministic. We could perhaps document the limitation better.

In #30708 (comment), the usage is almost certainly actually on the LOCK XADD. When #30708 (comment) removed the atomic, the usage is still almost certainly on the INCL. Despite the increment not being locked, the x86 memory model is pretty strong, so the cross-core contention on the counter is likely slowing down the operation significantly.
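
One way to confirm that cross-core contention on the counter, rather than the NOP itself, is the real cost: shard the counter so each worker touches its own cache line, and watch both the benchmark time and the NOPL "hot spot" change. A sketch under that assumption (hypothetical code, not from the issue):

package contention

import (
    "sync/atomic"
    "testing"
)

// paddedCounter occupies a full 64-byte cache line, so neighbouring
// shards never share a line and the cores stop ping-ponging it.
type paddedCounter struct {
    n int32
    _ [60]byte
}

func BenchmarkSharded(b *testing.B) {
    shards := make([]paddedCounter, 64)
    var next int32
    b.RunParallel(func(pb *testing.PB) {
        // Hand each worker its own shard.
        shard := &shards[int(atomic.AddInt32(&next, 1))%len(shards)]
        for pb.Next() {
            atomic.AddInt32(&shard.n, 1)
            atomic.AddInt32(&shard.n, -1)
        }
    })
}

If contention is the cost, this version should run dramatically faster than the single-counter variant even though it executes the same instructions per iteration.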

@prattmic
Member

Duplicate of #41338.
