-
Notifications
You must be signed in to change notification settings - Fork 17.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
x/tools/gopls: regtest flakes due to hanging go commands #54461
Comments
Two observations:
|
The only reason the kill system call can fail (at least in this situation) is when the child process has already exited, so failure of kill is unlikely to be the culprit. More likely kill terminated the go process itself, but not the tree of processes rooted at it. If one of them (a test?) retains an open file descriptor to the stdout pipe created by os/exec then the cmd.Run operation will hang indefinitely. To dig further, we could add logic to run during the failure (on linux) that does |
Interesting, I was debugging this in https://go.dev/cl/424075. On windows, our call to Process.Kill() is failing with "invalid argument": A bit of googling suggests that this is because we can't kill subprocesses on windows. @bcmills any advice for how to properly kill the go command on windows? |
After reading the source a bit more: this is EINVAL, which appears to mean that the Process.wait() has exited and the handle released, so this is a race, although it is surprising that we hit it so reliably. |
Change https://go.dev/cl/424075 mentions this issue: |
Can't be done without creating a whole extra process group, unfortunately. (Probably we should add a side-channel — perhaps an open file descriptor or a pidfile? — to request clean shutdown on Windows.) |
When a go command hangs during gopls regression tests, print out additional information about processes and file descriptors. For golang/go#54461 Change-Id: I92aa4665e9056d15a274c154fce2783bed79718e Reviewed-on: https://go-review.googlesource.com/c/tools/+/424075 gopls-CI: kokoro <noreply+kokoro@google.com> Reviewed-by: Alan Donovan <adonovan@google.com> Run-TryBot: Robert Findley <rfindley@google.com> TryBot-Result: Gopher Robot <gobot@golang.org>
Change https://go.dev/cl/431075 mentions this issue: |
Add a TODO and wait for a shorter period of time following Kill, per post-submit advice from bcmills on CL 424075. For golang/go#54461 Change-Id: Ia0e388c0119660844dad32629ebca4f122fded12 Reviewed-on: https://go-review.googlesource.com/c/tools/+/431075 TryBot-Result: Gopher Robot <gobot@golang.org> Run-TryBot: Robert Findley <rfindley@google.com> gopls-CI: kokoro <noreply+kokoro@google.com> Reviewed-by: Bryan Mills <bcmills@google.com>
Nice. Well, that test process seems very much alive, falsifying my hypothesis. |
Looks like the hanging go command is in the middle of a compile. Wish we had the full subprocesss command line -- I'll look into that. Not sure how to interpret the fstat output. |
That's a dead cmd/compile process: there's no command because argv has been destroyed along with the rest of the address space. Perhaps the go list parent simply hasn't called waitpid yet, so the process table entry has to be retained. I suspect the problem is in go list. |
Aha, thanks (excuse my ps noobness). Note that we instrumented this panic in two places: once before |
That one is |
2022-09-17T02:56:51-4d18923-cc1b20e/netbsd-amd64-9_0 Still only netbsd. Posting the |
Ooh, nice! https://go.dev/issue/55323#issuecomment-1254107802 has a |
Found new dashboard test flakes for:
2024-12-23 17:52 x_tools-gotip-openbsd-amd64 tools@b75baa00 go@b9955f0a x/tools/gopls/internal/test/integration/codelens.TestRegenerateCgo/default [ABORT] (log)
|
Found new dashboard test flakes for:
2024-12-23 19:36 x_tools-go1.24-openbsd-amd64 tools@851152fe release-branch.go1.24@16afa6a7 x/tools/gopls/internal/test/integration/codelens.TestRegenerateCgo/default [ABORT] (log)
|
Found new dashboard test flakes for:
2024-12-27 16:23 x_tools-gotip-openbsd-amd64 tools@412b5e97 go@d7c3e93c x/tools/gopls/internal/test/integration/codelens.TestRegenerateCgo/default [ABORT] (log)
|
Found new dashboard test flakes for:
2025-01-02 15:44 x_tools-go1.24-openbsd-amd64 tools@bf516ba1 release-branch.go1.24@16afa6a7 x/tools/gopls/internal/test/integration/codelens.TestRegenerateCgo/default [ABORT] (log)
|
Found new dashboard test flakes for:
2025-01-06 17:23 x_tools-go1.24-openbsd-amd64 tools@1743d1a7 release-branch.go1.24@16afa6a7 x/tools/gopls/internal/test/integration/codelens.TestRegenerateCgo/default [ABORT] (log)
|
Found new dashboard test flakes for:
2025-01-06 23:35 x_tools-go1.22-windows-amd64-race tools@ee69ea24 release-branch.go1.22@8f3f22ee x/tools/gopls/internal/telemetry.TestTelemetry/default#01 [ABORT] (log)
|
Found new dashboard test flakes for:
2025-01-07 20:06 x_tools-gotip-openbsd-amd64 tools@fc2161a7 go@b2aa18b9 x/tools/gopls/internal/test/integration/codelens.TestRegenerateCgo/default [ABORT] (log)
|
Found new dashboard test flakes for:
2025-01-09 17:24 x_tools-gotip-openbsd-amd64 tools@6efe0f4b go@c7c4420a x/tools/gopls/internal/test/integration/codelens.TestRegenerateCgo/default [ABORT] (log)
|
When a go command hangs during gopls regression tests, print out additional information about processes and file descriptors. For golang/go#54461 Change-Id: I92aa4665e9056d15a274c154fce2783bed79718e Reviewed-on: https://go-review.googlesource.com/c/tools/+/424075 gopls-CI: kokoro <noreply+kokoro@google.com> Reviewed-by: Alan Donovan <adonovan@google.com> Run-TryBot: Robert Findley <rfindley@google.com> TryBot-Result: Gopher Robot <gobot@golang.org>
Add a TODO and wait for a shorter period of time following Kill, per post-submit advice from bcmills on CL 424075. For golang/go#54461 Change-Id: Ia0e388c0119660844dad32629ebca4f122fded12 Reviewed-on: https://go-review.googlesource.com/c/tools/+/431075 TryBot-Result: Gopher Robot <gobot@golang.org> Run-TryBot: Robert Findley <rfindley@google.com> gopls-CI: kokoro <noreply+kokoro@google.com> Reviewed-by: Bryan Mills <bcmills@google.com>
We're almost certain that the go process shown by ps is not the one that we're waiting for in runCmdContext, but only by printing the pid can we be sure. Updates golang/go#54461 Change-Id: I1ce9580de6ee6bc4557d76c2a6b05f5a3e2eb6c2 Reviewed-on: https://go-review.googlesource.com/c/tools/+/434637 TryBot-Result: Gopher Robot <gobot@golang.org> Run-TryBot: Alan Donovan <adonovan@google.com> gopls-CI: kokoro <noreply+kokoro@google.com> Reviewed-by: Robert Findley <rfindley@google.com>
As suggested by bcmills, this change increases the time between SIGINT and SIGKILL to 5s (was 1s), and also suppresses the process dump if SIGKILL returned "already done". Updates golang/go#57999 Updates golang/go#54461 Change-Id: Ie0e55a69d3bbfb4224e5f4ea272c7c2f3210e342 Reviewed-on: https://go-review.googlesource.com/c/tools/+/483215 Run-TryBot: Alan Donovan <adonovan@google.com> gopls-CI: kokoro <noreply+kokoro@google.com> Reviewed-by: Robert Findley <rfindley@google.com> TryBot-Result: Gopher Robot <gobot@golang.org>
If ctx is done, that means (by definition) that the caller is no longer interested in the output of the command. Since 'go list' output can be quite long, we shouldn't waste time copying it if we no longer care about it. If the command doesn't shut down on its own in response to the initial Interrupt signal, or if it triggers a bug in the kernel leading to a deadlock, or I/O is otherwise extremely slow (perhaps due to swapping), the savings from this earlier shutdown could be significant. Even if not, the impact (or lack thereof) of this change could tell us more about what's going on. For golang/go#54461. Change-Id: I0df067cb77496dacd29497f0995da6fc1a4a37f1 Reviewed-on: https://go-review.googlesource.com/c/tools/+/484741 gopls-CI: kokoro <noreply+kokoro@google.com> Reviewed-by: Alan Donovan <adonovan@google.com> Auto-Submit: Bryan Mills <bcmills@google.com> Run-TryBot: Bryan Mills <bcmills@google.com> Commit-Queue: Bryan Mills <bcmills@google.com> TryBot-Result: Gopher Robot <gobot@golang.org>
For golang/go#54461 Change-Id: I2de5a10673345342e30e50cb39359c10e8eb7319 Reviewed-on: https://go-review.googlesource.com/c/tools/+/589956 Auto-Submit: Robert Findley <rfindley@google.com> LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com> Reviewed-by: Alan Donovan <adonovan@google.com>
Found new dashboard test flakes for:
2025-01-13 20:35 x_tools-go1.23-openbsd-amd64 tools@8a5a6d75 release-branch.go1.23@1dde0b48 x/tools/gopls/internal/test/integration/codelens.TestRegenerateCgo/default [ABORT] (log)
|
Found new dashboard test flakes for:
2025-01-22 14:05 x_tools-go1.23-windows-amd64-race tools@38d06313 release-branch.go1.23@ab44565b x/tools/gopls/internal/test/integration/codelens.TestUpgradeCodelens_Workspace/Upgrade_transitive_dependencies/default [ABORT] (log)
2025-01-22 14:05 x_tools-go1.23-windows-amd64-race tools@38d06313 release-branch.go1.23@ab44565b x/tools/gopls/internal/test/integration/completion.TestFuzzFunc/default [ABORT] (log)
2025-01-22 15:58 x_tools-gotip-linux-ppc64le_power10 tools@9f4a509f go@f6d17c54 x/tools/gopls/internal/test/integration/completion.TestDefinition/default [ABORT] (log)
2025-01-22 15:58 x_tools-gotip-linux-ppc64le_power10 tools@9f4a509f go@f6d17c54 x/tools/gopls/internal/test/integration/diagnostics.TestNoMod/without_workspace_module/default [ABORT] (log)
2025-01-22 15:58 x_tools-gotip-linux-ppc64le_power10 tools@9f4a509f go@f6d17c54 x/tools/gopls/internal/test/integration/misc.TestGoToStdlibDefinition_Issue37045/default [ABORT] (log)
|
Found new dashboard test flakes for:
2025-01-22 17:02 x_tools-go1.22-openbsd-amd64 tools@e4adc385 release-branch.go1.22@c3c6a500 x/tools/gopls/internal/test/integration/codelens.TestRegenerateCgo/default [ABORT] (log)
|
Change https://go.dev/cl/643915 mentions this issue: |
Existing debug information has not proven useful for understanding hanging go commands. On posix systems we can send SIGQUIT, and peek at stderr, to try to see what the go command is doing. Tested locally by setting the timeout to 10ms. Updates golang/go#54461 Change-Id: I1bc38d6da82b0dffb55081364b0af8f20a3afcfc Reviewed-on: https://go-review.googlesource.com/c/tools/+/643915 LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com> Reviewed-by: Alan Donovan <adonovan@google.com> Auto-Submit: Robert Findley <rfindley@google.com>
Found new dashboard test flakes for:
2025-01-23 17:02 x_tools-go1.24-openbsd-amd64 tools@7a015ab4 release-branch.go1.24@8a4c24f9 x/tools/gopls/internal/test/integration/codelens.TestRegenerateCgo/default [ABORT] (log)
|
Found new dashboard test flakes for:
2025-01-24 16:52 x_tools-go1.22-openbsd-amd64 tools@3e68f53e release-branch.go1.22@c3c6a500 x/tools/gopls/internal/test/integration/codelens.TestRegenerateCgo/default [ABORT] (log)
|
Change https://go.dev/cl/644016 mentions this issue: |
We encountered a hanging go command on openbsd, which unfortunately did not have logic to SIGQUIT. Add openbsd to the special set of GOOS in handleHandingGoCommand. Updates golang/go#54461 Change-Id: I36e32559f23a3ace28a1088a1f910642eb0074ec Reviewed-on: https://go-review.googlesource.com/c/tools/+/644016 LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com> Reviewed-by: Alan Donovan <adonovan@google.com> Auto-Submit: Robert Findley <rfindley@google.com>
Found new dashboard test flakes for:
2025-01-24 18:50 x_tools-go1.23-windows-amd64-race tools@1c9f0026 release-branch.go1.23@ab44565b x/tools/gopls/internal/test/integration/codelens.TestUpgradeCodelens_Workspace/Upgrade_direct_dependencies/default [ABORT] (log)
2025-01-24 21:09 x_tools-go1.22-windows-amd64-race tools@114ac827 release-branch.go1.22@c3c6a500 x/tools/gopls/internal/test/integration/completion.TestFuzzFunc/default [ABORT] (log)
2025-01-24 21:09 x_tools-go1.22-windows-amd64-race tools@114ac827 release-branch.go1.22@c3c6a500 x/tools/gopls/internal/test/integration/misc.TestCodeActionsInGeneratedFiles/default [ABORT] (log)
|
Here's a SIGQUIT dump from a hanging Go routine. FYI @matloob @samthanawalla this goroutine was hanging for a minute when invoked from one of our test repos (which should be very small). |
2022-08-15T17:42:12-987de34-1f833e4/darwin-amd64-12_0
2022-08-14T00:06:23-35f806b-59865f1/netbsd-amd64-9_0
2022-08-12T20:40:05-bebd890-2f6783c/netbsd-386-9_0
2022-08-12T18:15:28-bebd890-b6f87b0/netbsd-amd64-9_0
2022-08-12T12:39:26-88d981e-f67c766/netbsd-amd64-9_0
2022-08-12T00:04:29-c4ec74a-a5cd894/netbsd-386-9_0
2022-08-11T19:05:54-c4ec74a-62654df/netbsd-amd64-9_0
2022-08-11T17:53:50-37a81b6-a526ec1/netbsd-amd64-9_0
2022-08-11T16:19:14-37a81b6-2340d37/netbsd-amd64-9_0
2022-08-11T12:53:59-b2156b5-3c200d6/netbsd-386-9_0
2022-08-10T22:22:48-b2156b5-6b80b62/netbsd-amd64-9_0
2022-08-10T17:41:25-0ad49fd-f19f6c7/netbsd-amd64-9_0
2022-08-10T15:08:24-3950865-c81dfdd/netbsd-amd64-9_0
2022-08-10T02:14:09-6fa767d-5531838/plan9-386-0intro
2022-08-09T14:33:24-92d58ea-0981d9f/openbsd-386-70
2022-08-09T14:12:01-92d58ea-662a729/netbsd-amd64-9_0
2022-08-09T13:39:27-92d58ea-9e8020b/netbsd-386-9_0
2022-08-09T11:28:56-92d58ea-0f8dffd/netbsd-amd64-9_0
2022-08-08T18:10:56-fff6d6d-4bcc138/netbsd-amd64-9_0
2022-08-08T15:33:45-06d96ee-0581d69/netbsd-amd64-9_0
2022-08-08T15:07:46-06d96ee-cd54ef1/netbsd-amd64-9_0
2022-08-08T14:12:21-06d96ee-e761556/netbsd-amd64-9_0
2022-08-08T06:16:59-06d96ee-0f6ee42/darwin-amd64-11_0
2022-08-08T06:16:59-06d96ee-0f6ee42/netbsd-386-9_0
2022-08-06T15:20:00-06d96ee-0c4db1e/plan9-386-0intro
2022-08-05T19:51:08-06d96ee-4fb7e22/plan9-386-0intro
2022-08-04T20:05:03-81c7dc4-39728f4/netbsd-386-9_0
2022-08-04T20:05:03-81c7dc4-39728f4/netbsd-amd64-9_0
2022-08-04T20:04:16-3519aa2-39728f4/netbsd-386-9_0
2022-08-04T19:57:25-763f65c-39728f4/netbsd-386-9_0
2022-08-04T18:51:46-99fd76f-39728f4/openbsd-386-70
2022-08-04T17:05:18-3e0a503-fb1bfd4/netbsd-amd64-9_0
2022-08-04T15:50:11-3e0a503-fcdd099/netbsd-386-9_0
2022-08-04T15:50:11-3e0a503-44ff9bf/netbsd-amd64-9_0
2022-08-04T15:31:49-87f47bb-44ff9bf/plan9-386-0intro
2022-08-04T14:58:59-87f47bb-4345620/netbsd-386-9_0
2022-08-04T10:32:51-3e0a503-a10afb1/linux-386-buster
2022-08-03T21:02:27-8b9a1fb-f28fa95/plan9-386-0intro
2022-08-03T21:02:27-8b9a1fb-4345620/netbsd-386-9_0
2022-08-03T18:07:40-d08f5dc-fcdd099/netbsd-386-9_0
2022-08-03T13:50:38-ddb90ec-c6a2dad/dragonfly-amd64-622
2022-08-03T13:50:38-ddb90ec-c6a2dad/plan9-386-0intro
2022-08-03T12:09:24-ddb90ec-29b9a32/plan9-386-0intro
2022-08-02T18:52:36-0d04f65-29b9a32/plan9-386-0intro
2022-08-02T18:19:01-d025cce-be59153/netbsd-amd64-9_0
2022-08-02T18:16:22-10cb435-d723df7/netbsd-amd64-9_0
2022-08-02T18:07:14-4d0b383-d723df7/netbsd-386-9_0
2022-08-02T18:07:14-4d0b383-d723df7/netbsd-amd64-9_0
2022-08-02T17:23:42-4d0b383-1b7e71e/darwin-amd64-nocgo
2022-08-02T16:05:48-4d0b383-f2a9f3e/netbsd-amd64-9_0
2022-07-29T20:19:23-9580c84-9240558/windows-arm64-11
2022-07-28T20:06:00-8ea5687-d9242f7/darwin-amd64-nocgo
2022-07-27T15:04:58-39a4e36-4248146/freebsd-386-13_0
2022-07-26T18:43:08-6c8a6c4-d9242f7/aix-ppc64
2022-07-25T20:44:49-2a6393f-24dc27a/darwin-amd64-10_14
2022-07-25T18:11:01-4375b29-795a88d/plan9-386-0intro
2022-07-25T14:16:17-178fdf9-64f2829/plan9-386-0intro
2022-07-22T20:12:19-7b605f4-c5da4fb/plan9-386-0intro
2022-07-21T20:11:06-ec1f924-c4a6d30/plan9-386-0intro
2022-07-15T15:11:26-22d1494-2aa473c/windows-386-2008-newcc
2022-07-15T14:27:36-1a4e02f-4651ebf/windows-arm64-10
2022-07-15T14:20:24-db8f89b-4651ebf/windows-arm64-10
2022-07-14T21:03:14-db8f89b-783ff7d/windows-arm64-11
2022-07-14T21:01:58-db8f89b-aa80228/darwin-arm64-11
2022-07-14T19:05:09-db8f89b-a906d3d/windows-arm64-10
2022-07-14T15:54:36-db8f89b-266c70c/windows-arm64-10
2022-07-14T01:47:39-db8f89b-558785a/windows-arm64-11
We recently started waiting for all go command invocations when shutting down gopls regtests. It appears that sometimes we kill the go command and still don't get a result from
cmd.Wait()
. For example, here:https://build.golang.org/log/00046e0b005c7660d676a3a415561950048f756a
In that failure, the test runner looks otherwise healthy (other tests ran fast), and yet the goroutine stack clearly shows a go command hanging for 9 minutes here:
https://cs.opensource.google/go/x/tools/+/master:internal/gocommand/invoke.go;l=260;drc=f38573358cbedf46d64c5759ef41b72afcf0c5c0
@bcmills do you happen to have any idea of what might cause this?
The text was updated successfully, but these errors were encountered: