Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

runtime: "fatal: morestack on g0" on Linux #23360

Closed
shenli opened this issue Jan 7, 2018 · 41 comments
Closed

runtime: "fatal: morestack on g0" on Linux #23360

shenli opened this issue Jan 7, 2018 · 41 comments
Labels
WaitingForInfo Issue is not actionable because of missing required information, which needs to be provided.
Milestone

Comments

@shenli
Copy link

shenli commented Jan 7, 2018

Please answer these questions before submitting your issue. Thanks!

What version of Go are you using (go version)?

go version go1.9.1 linux/amd64

Does this issue reproduce with the latest release?

I do not know how to reproduce it. But it occurs twice.

What operating system and processor architecture are you using (go env)?

GOARCH="amd64"
GOBIN=""
GOEXE=""
GOHOSTARCH="amd64"
GOHOSTOS="linux"
GOOS="linux"
GOPATH="/go"
GORACE=""
GOROOT="/usr/local/go"
GOTOOLDIR="/usr/local/go/pkg/tool/linux_amd64"
GCCGO="gccgo"
CC="gcc"
GOGCCFLAGS="-fPIC -m64 -pthread -fmessage-length=0 -fdebug-prefix-map=/tmp/go-build898823429=/tmp/go-build -gno-record-gcc-switches"
CXX="g++"
CGO_ENABLED="1"
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"
PKG_CONFIG="pkg-config"

What did you do?

Run a binary as daemon service.

What did you expect to see?

The process runs forever.

What did you see instead?

I get "fatal: morestack on g0" from stderr. The process is still there but does not respond anymore. When I use curl http://ip:port/debug/pprof/goroutine?debug=1 to check the stack, but it halts. There is nothing useful in stderr or dmesg.

I tried pstack and got the following result. It is very strange that there is only one thread alive. I have googled it but get nothing useful.

Thread 1 (process 12230):
#0  runtime.futex () at /usr/local/go/src/runtime/sys_linux_amd64.s:439
#1  0x00000000004293f2 in runtime.futexsleep (addr=0x1b0a950 <runtime.m0+272>, val=0, ns=-1) at /usr/local/go/src/runtime/os_linux.go:45
#2  0x000000000040f9ab in runtime.notesleep (n=0x1b0a950 <runtime.m0+272>) at /usr/local/go/src/runtime/lock_futex.go:151
#3  0x00000000004315f5 in runtime.stopm () at /usr/local/go/src/runtime/proc.go:1680
#4  0x00000000004327d2 in runtime.findrunnable (gp=0x45cbd1 <runtime.goexit+1>, inheritTime=false) at /usr/local/go/src/runtime/proc.go:2135
#5  0x000000000043328c in runtime.schedule () at /usr/local/go/src/runtime/proc.go:2255
#6  0x00000000004335a6 in runtime.park_m (gp=0xca01842780) at /usr/local/go/src/runtime/proc.go:2318
#7  0x0000000000459ffb in runtime.mcall () at /usr/local/go/src/runtime/asm_amd64.s:286
#8  0x0000000001b0a100 in ?? ()
#9  0x00007ffdc5be26e0 in ?? ()
#10 0x0000000001b0a100 in ?? ()
#11 0x00007ffdc5be26d0 in ?? ()
#12 0x0000000000430544 in runtime.mstart () at /usr/local/go/src/runtime/proc.go:1152
#13 0x0000000000459ec1 in runtime.rt0_go () at /usr/local/go/src/runtime/asm_amd64.s:186
#14 0x000000000000000a in ?? ()
#15 0x00007ffdc5be2718 in ?? ()
#16 0x000000000000000a in ?? ()
#17 0x00007ffdc5be2718 in ?? ()
#18 0x0000000000000000 in ?? ()
@shenli
Copy link
Author

shenli commented Jan 7, 2018

My Linux kernel is "Linux 3.10.0-514.el7.x86_64 #1 SMP Tue Nov 22 16:42:41 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux"

futex(unsafe.Pointer(addr), _FUTEX_WAIT, val, nil, nil, 0)

	// Some Linux kernels have a bug where futex of
	// FUTEX_WAIT returns an internal error code
	// as an errno. Libpthread ignores the return value
	// here, and so can we: as it says a few lines up,
	// spurious wakeups are allowed.
	if ns < 0 {
		futex(unsafe.Pointer(addr), _FUTEX_WAIT, val, nil, nil, 0)
		return
	}

Could I say that I meet the bug of Linux kernel?

@shenli shenli changed the title Meet "fatal morestack on g0" on Linux Meet "fatal: morestack on g0" on Linux Jan 7, 2018
@cznic
Copy link
Contributor

cznic commented Jan 7, 2018

Could I say that I meet the bug of Linux kernel?

Hard to tell w/o a reproduction case.

@shenli
Copy link
Author

shenli commented Jan 7, 2018

@cznic Thanks! I am trying to reproduce it. But I am not sure whether I could make it.

@odeke-em odeke-em changed the title Meet "fatal: morestack on g0" on Linux runtime: "fatal: morestack on g0" on Linux Jan 7, 2018
@odeke-em
Copy link
Member

odeke-em commented Jan 8, 2018

I'll also page some other folks to be aware of this as we wait for more information
/cc @randall77 @aclements

@odeke-em odeke-em added the WaitingForInfo Issue is not actionable because of missing required information, which needs to be provided. label Jan 8, 2018
@shenli
Copy link
Author

shenli commented Jan 8, 2018

I upgrade my gdb to 8.0 and find that there are 74 threads but all of them are waiting here: runtime.futex () at /usr/local/go/src/runtime/sys_linux_amd64.s:439

@davecheney
Copy link
Contributor

This looks like memory corruption. Have you tried running your program under the race detector? See https://blog.golang.org/race-detector .

@shenli
Copy link
Author

shenli commented Jan 8, 2018

@davecheney Thanks!
It occurs in the production environment. We could not enable race detection. I will try to build a testing environment to reproduce it.

@zhexuany
Copy link

zhexuany commented Jan 8, 2018

We just figured out how to print goroutine using gdb. Below is partial stack of all goroutine:

---Type <return> to continue, or q <return> to quit---
  208832207 waiting  runtime.gopark
  208816967 waiting  runtime.gopark
  208824056 waiting  runtime.gopark
  208825674 waiting  runtime.gopark
  208813050 waiting  runtime.gopark
  208821762 waiting  runtime.gopark
  208845379 waiting  runtime.gopark
  208803855 waiting  runtime.gopark
  208819750 waiting  runtime.gopark
  208820173 waiting  runtime.gopark
  208830287 waiting  runtime.gopark
  208835489 waiting  runtime.gopark
  208818500 waiting  runtime.gopark
  208844090 waiting  runtime.gopark
  208803883 waiting  runtime.gopark
  208815260 waiting  runtime.gopark
  208843403 waiting  runtime.gopark
  208817095 waiting  runtime.gopark
  208819772 waiting  runtime.gopark
  208830692 waiting  runtime.gopark
  208808392 waiting  runtime.gopark
  208830801 waiting  runtime.gopark
  208843133 waiting  runtime.gopark
  208839727 waiting  runtime.gopark
  208826235 waiting  runtime.gopark
  208836382 waiting  runtime.gopark
  208828454 waiting  runtime.gopark
  208817075 waiting  runtime.gopark
  208829301 waiting  runtime.gopark

All goroutines are wating on gopark. In. addition, all os threads are waiting on futex.
We came up with two possible reasons: 1. deadlock, 2. g0 is crashed which implies there is no more goroutine can schedule other goroutines.

@zhexuany
Copy link

zhexuany commented Jan 8, 2018

Deadlock could be caused from kernel bug because our binary in running against linux 3.10 which has a kernel bug causing potential deadlock. For second reason, we are confused why all thread's schedulers are crashed. Per my understanding, all threads have their own g0. The process is still running. Any suggestions about how to debug this will be really appreciated. Of course, tell us what is reason behind this will be also really appreciated.

@ianlancetaylor
Copy link
Contributor

It's normal for an inactive goroutine to be sitting in gopark. It's normal for an inactive thread to be waiting on a futex.

Your original report, morestack on g0, is a clear sign of a bug or of memory corruption. A deadlock could simply be the result of a program bug.

You suggest that a g0 could crash, causing deadlock, but in general that can't happen. If a g0 crashes, the entire program crashes, as in your original report.

@aclements
Copy link
Member

@shenli, it would be useful if you could dump all of the thread stacks from gdb using thread apply all bt.

Unfortunately, I wouldn't expect the info goroutines command to do much useful since the g0 isn't a regular user goroutine and hence isn't printed by info goroutines.

@shenli
Copy link
Author

shenli commented Jan 9, 2018

@aclements The output of thread apply all bt is here.

@shenli
Copy link
Author

shenli commented Jan 9, 2018

@ianlancetaylor In our scenario, g0 is gone but the whole process is still there.

@ianlancetaylor
Copy link
Contributor

I don't know what you mean when you say that g0 is gone. Note that g0 is not printed by info goroutines. How are you determining that it is gone?

@shenli
Copy link
Author

shenli commented Jan 9, 2018

@ianlancetaylor I meet fatal: morestack on g0 error. So I guess g0 is gone. Please correct me if this is wrong.

@ianlancetaylor
Copy link
Contributor

That is a fatal error message that should crash the entire program. It is not intended to leave the program running without g0.

Thanks for the stack trace. It does most threads waiting for a lock on runtime.mheap_. The interesting thread stack is this one:

Thread 2 (LWP 12232):
#0  runtime.futex () at /usr/local/go/src/runtime/sys_linux_amd64.s:439
#1  0x00000000004293f2 in runtime.futexsleep (addr=0x1b13d20 <runtime.mheap_>, val=2, ns=-1)
    at /usr/local/go/src/runtime/os_linux.go:45
#2  0x000000000040f721 in runtime.lock (l=0x1b13d20 <runtime.mheap_>) at /usr/local/go/src/runtime/lock_futex.go:102
#3  0x0000000000414771 in runtime.allocmcache (~r0=0x0) at /usr/local/go/src/runtime/mcache.go:78
#4  0x000000000042c005 in runtime.startpanic_m () at /usr/local/go/src/runtime/panic.go:652
#5  0x000000000045a0cb in runtime.systemstack () at /usr/local/go/src/runtime/asm_amd64.s:360
#6  0x000000000042bb8e in runtime.startpanic () at /usr/local/go/src/runtime/panic.go:578
#7  0x000000000044197d in runtime.sighandler (sig=5, info=0xc42006bbf0, ctxt=0xc42006bac0, gp=0xc420000480)
    at /usr/local/go/src/runtime/signal_sighandler.go:81
#8  0x000000000044234d in runtime.sigtrampgo (sig=5, info=0xc42006bbf0, ctx=0xc42006bac0)
    at /usr/local/go/src/runtime/signal_unix.go:334
#9  0x000000000045de03 in runtime.sigtramp () at /usr/local/go/src/runtime/sys_linux_amd64.s:265
#10 0x000000000045def0 in ?? () at /usr/local/go/src/runtime/sys_linux_amd64.s:349
#11 0x0000000000000001 in ?? ()
#12 0x0000000000000000 in ?? ()

This shows that the program got signal 5 (SIGTRAP). Nothing was prepared to handle that signal, so the program started to crash. At the start of the panic code, this condition was true:

	} else if _g_.m.mcache == nil { // can happen if called from signal handler or throw
		_g_.m.mcache = allocmcache()
	}

This makes me wonder whether the signal occurred while holding the lock. I don't know if that can happen while _g_.m.mcache == nil.

I don't understand how to relate this to fatal: morestack on g0. That should not cause a SIGTRAP signal to occur. That only thing that should cause SIGTRAP would be running the program under a debugger (in which case the debugger ought to handle the signal) or using kill to send SIGTRAP to the program.

Unfortunately nothing in the stack trace tells me where the fatal: morestack on g0 is coming from.

@shenli
Copy link
Author

shenli commented Jan 9, 2018

@ianlancetaylor We use gdb to get the thread info. So I think it may cause the SIGTRAP signal.

@ianlancetaylor
Copy link
Contributor

If gdb triggers the SIGTRAP signal, which I think should only happen if you try to use gdb's step or next commands, then it ought to catch the SIGTRAP signal before it is delivered to the program. So while it is certainly possible that gdb is causing it, it's not obvious that it is.

In any case I now see that this is something of a distraction. After the program gets fatal: morestack on g0, there may be some bug that is causing the program to hang instead of crash. The backtrace suggests that this bug has something to do with the error occurring while holding the memory lock. But this is all really just a minor side issue. The question is why the program got fatal: morestack on g0 in the first place. That's the problem that we need to identify and fix.

That brings me back to thinking about memory corruption. Have you been able to run your test program compiled with the race detector?

@aclements
Copy link
Member

morestack performs an INT $3 after calling badmorestackg0 to print the "morestack on g0" message. That's probably where the SIGTRAP is coming from. Though, again, that means the SIGTRAP, thread 2's stack, and the deadlock attempting to allocmcache from the signal handler are just side-effects of the real problem.

It does mean that whatever went wrong went wrong on thread 2. I'm surprised gdb couldn't walk over the signal stack switch, but we might be able to get enough from the signal context to unwind the stack ourselves.

@shenli, I assume you're doing this all from a core file? Or do you still have the process around? Either way, could you try running:

thread 2
frame 7
print/x *info
print/x *ctxt
print/x *gp

@aclements aclements added this to the Go1.10 milestone Jan 9, 2018
@shenli
Copy link
Author

shenli commented Jan 9, 2018

@aclements The result is here:

(gdb) thread 2
[Switching to thread 2 (LWP 12232)]
#0  runtime.futex () at /usr/local/go/src/runtime/sys_linux_amd64.s:439
439	/usr/local/go/src/runtime/sys_linux_amd64.s: No such file or directory.
(gdb) frame 7
#7  0x000000000044197d in runtime.sighandler (sig=5, info=0xc42006bbf0, ctxt=0xc42006bac0, gp=0xc420000480)
    at /usr/local/go/src/runtime/signal_sighandler.go:81
81	/usr/local/go/src/runtime/signal_sighandler.go: No such file or directory.
(gdb) print/x *info
$1 = {si_signo = 0x5, si_errno = 0x0, si_code = 0x80, si_addr = 0x0}
(gdb) print/x *ctxt
Attempt to dereference a generic pointer.
(gdb) print/x *gp
$2 = {stack = {lo = 0xc420070000, hi = 0xc420072000}, stackguard0 = 0xc420070370, stackguard1 = 0xc420070370, _panic = 0x0, 
  _defer = 0x0, m = 0xc420062000, sched = {sp = 0xc420071fc0, pc = 0xffffffffffffffff, g = 0xc420000480, ctxt = 0x0, ret = 0x0, 
    lr = 0x0, bp = 0xc420071fd8}, syscallsp = 0x0, syscallpc = 0x0, stktopsp = 0x0, param = 0x0, atomicstatus = 0x0, 
  stackLock = 0x0, goid = 0x0, waitsince = 0x0, waitreason = "", schedlink = 0x0, preempt = 0x0, paniconfault = 0x0, 
  preemptscan = 0x0, gcscandone = 0x0, gcscanvalid = 0x0, throwsplit = 0x0, raceignore = 0x0, sysblocktraced = 0x0, 
  sysexitticks = 0x0, traceseq = 0x0, tracelastp = 0x0, lockedm = 0x0, sig = 0x0, writebuf =  []uint8, sigcode0 = 0x0, 
  sigcode1 = 0x0, sigpc = 0x0, gopc = 0x0, startpc = 0x0, racectx = 0x0, waiting = 0x0, cgoCtxt =  []uintptr, labels = 0x0, 
  timer = 0x0, gcAssistBytes = 0xfffffffffffff980}
(gdb) 

@shenli
Copy link
Author

shenli commented Jan 9, 2018

@ianlancetaylor This happened in a production environment. So we could not enable race detection. I built a test environment and try to produce the same workload. But have not got the error.

@shenli
Copy link
Author

shenli commented Jan 15, 2018

We met this issue a few times. We decided to build two binaries: one with Golang 1.7 and another one with Golang 1.10. We will run the two binaries in the production environment and check if the fatal still there.

@shenli
Copy link
Author

shenli commented Jan 17, 2018

We build a new binary with Golang 1.10 and put it into the production environment. It has been running for more than two days w/o the fatal problem. I'm keeping my eyes on it.

@aclements
Copy link
Member

I took another look at the thread stack dump and there's something I can't figure out. What's holding mheap_.lock? It's unfortunate that the signal handler attempted to acquire it while crashing the system and deadlocked, but what's holding it? Every single other thread is either trying to acquire it, or is happily parked. That leaves only the thread that crashed (thread 2), which we can't see the whole stack of. It's possible it self-deadlocked, but we should only be in allocmcache if the crashing thread doesn't have an mcache, and I don't see how a thread could both hold the mheap_.lock and not have an mcache.

@aclements
Copy link
Member

@shenli, thanks for the gdb output. If you still have the crashed process or core dump, could you run the following in gdb?

thread 2
frame 7
x/64a ctxt
set language c
x/1024a *(uint64*)(ctxt+160)
set language go

@aclements
Copy link
Member

I will do those operations when the issue occurs.

Thanks. You'll need to tweak the instructions for a different crash instance. To find the right thread number, do thread apply all bt again and look for the stack containing sighandler (also post the new thread stack dumps if you are able to reproduce). Then the frame number I'm interested in is the sighandler frame.

@shenli
Copy link
Author

shenli commented Jan 18, 2018

@aclements Thanks! Got it!

@gopherbot
Copy link
Contributor

Change https://golang.org/cl/88835 mentions this issue: runtime: never allocate during an unrecoverable panic

gopherbot pushed a commit that referenced this issue Jan 23, 2018
Currently, startpanic_m (which prepares for an unrecoverable panic)
goes out of its way to make it possible to allocate during panic
handling by allocating an mcache if there isn't one.

However, this is both potentially dangerous and unnecessary.
Allocating an mcache is a generally complex thing to do in an already
precarious situation. Specifically, it requires obtaining the heap
lock, and there's evidence that this may be able to deadlock (#23360).
However, it's also unnecessary because we never allocate from the
unrecoverable panic path.

This didn't use to be the case. The call to allocmcache was introduced
long ago, in CL 7388043, where it was in preparation for separating Ms
and Ps and potentially running an M without an mcache. At the time,
after calling startpanic, the runtime could call String and Error
methods on panicked values, which could do anything including
allocating. That was generally unsafe even at the time, and CL 19792
fixed this be pre-printing panic messages before calling startpanic.
As a result, we now no longer allocate after calling startpanic.

This CL not only removes the allocmcache call, but goes a step further
to explicitly disallow any allocation during unrecoverable panic
handling, even in situations where it might be safe. This way, if
panic handling ever does an allocation that would be unsafe in unusual
circumstances, we'll know even if it happens during normal
circumstances.

This would help with debugging #23360, since the deadlock in
allocmcache is currently masking the real failure.

Beyond all.bash, I manually tested this change by adding panics at
various points in early runtime init, signal handling, and the
scheduler to check unusual panic situations.

Change-Id: I85df21e2b4b20c6faf1f13fae266c9339eebc061
Reviewed-on: https://go-review.googlesource.com/88835
Run-TryBot: Austin Clements <austin@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Keith Randall <khr@golang.org>
Reviewed-by: Ian Lance Taylor <iant@golang.org>
@shenli
Copy link
Author

shenli commented Jan 27, 2018

@aclements The related PR is merged. So I think we will not face the deadlock issue anymore. I will wait for the new release.
Thanks for all of you!

@shenli shenli closed this as completed Jan 27, 2018
@ALTree
Copy link
Member

ALTree commented Jan 27, 2018

@shenli in the commit message, Austin wrote

This would help with debugging #23360, since the deadlock in allocmcache is currently masking the real failure.

It doesn't say that the CL fixed this issue. So let's not close it (yet).

@ALTree ALTree reopened this Jan 27, 2018
@shenli
Copy link
Author

shenli commented Jan 28, 2018

@ALTree Yes, I will update this issue when we could figure out the real problem. I'm waiting for the new Golang release.

@CAFxX
Copy link
Contributor

CAFxX commented May 28, 2018

I don't know if this is related or not, but I can reproduce this quite reliably on Mac OS 10.12.6 and go version go1.10.2 darwin/amd64 by doing:

go tool objdump <some go executable> | less

and then quitting less (q) without reaching the end of the assembler listing. In this case this produces this single line on stderr:

fatal: morestack on g0

Should I open a separate ticket?

@josharian
Copy link
Contributor

@CAFxX I believe that that is #25229

@aclements
Copy link
Member

aclements commented Jun 21, 2018

@shenli , have you been able to reproduce this with Go 1.10?

@ianlancetaylor ianlancetaylor modified the milestones: Go1.11, Go1.12 Jul 6, 2018
@shenli
Copy link
Author

shenli commented Jul 12, 2018

@aclements It never happens with Go 1.10. I guess it is fixed. But I will keep watching it. Thanks!

@aclements
Copy link
Member

Thanks for checking! I'll go ahead and close this, but let us know if you're able to reproduce it.

@ribx
Copy link

ribx commented Aug 22, 2018

I have this occasionally on some Gentoo servers with go 1.10.1, kernel 4.8.17-hardened and prometheus-node-exporter. I cannot reproduce it. And it happens only after some weeks of running.

@liuyanhit
Copy link

I also met the issue ,fatal: morestack on g0 , it happens only after some weeks of running. GO VERSION is 1.11.4 .Kernel is 4.9.0-040900-generic

@ahjdzx
Copy link

ahjdzx commented May 2, 2019

I also meet the same issue,

  • go version: 1.10
  • kernel version: 3.10.0-862.el7.x86_64

@davecheney
Copy link
Contributor

@liuyanhit @ahjdzx i'm sorry to hear you are having issues. This issue has been closed for over six months, please file a new issue including as much detail as possible. Thank you

@golang golang locked as resolved and limited conversation to collaborators May 2, 2019
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
WaitingForInfo Issue is not actionable because of missing required information, which needs to be provided.
Projects
None yet
Development

No branches or pull requests