os: (*Process).Wait sometimes hangs on netbsd #50138
Could also be related to #48789. |
The stuck calls appear to be running That would at least help us to determine whether the hang is in the subprocess or the parent process. (I think @aclements and @mknyszek were working on retrofitting that logic to various tests?) |
The arm and arm64 ones may be due to slow machine. Yeah, sending a SIGQUIT at timeout is probably a good idea. |
I have CL 370665 to apply timeouts to nearly every subprocess invocation in the runtime test (though wasn't planning to land that until the tree opens). These failures are all in |
Looks like the same failure mode in
Broadening the regexp to search for
2022-04-18T22:07:54-f49e802/netbsd-arm-bsiegert
@bsiegert, @coypoop: given that |
Is this still a suspicion? I assume the netbsd/arm builder is extra slow. |
Do you have steps to reproduce? I'm having a little trouble following the initial report, because it seems to cover several operating systems and architectures. Does this happen every time on NetBSD, or on NetBSD/arm, or only sometimes, or what? If it happens only sometimes, how long do successful test runs take on the machines where it fails? |
I can't speak for @cherrymui, but given the similar failures on the |
Unfortunately no. The failures listed above were found organically in the Go build dashboard — the repro rate is high enough to be significant but not high enough to reproduce on demand.
The freebsd failure in
Intermittently, on NetBSD across all of the architectures for which we have builders. |
(Note that we also tried to use |
Very curious that these recent ones seem to occur in pairs. 🤔 (attn @golang/netbsd) |
Then again, the pairings might just be a coincidence.
Change https://go.dev/cl/409595 mentions this issue: |
…50138 Since a large fraction of Go tests invoke commands, this issue causes noise on the builders that cannot be easily bypassed or filtered out. Failures matching this issue have been observed on all four of the current NetBSD builders. (The last such failure observed on a non-NetBSD builder was on freebsd-amd64-11_4, and that builder is no longer used; no matching failures have been observed on more recent FreeBSD builders.) Updates golang/go#50138. Change-Id: Ied687a63a55407d19c5f1905e79111d302087937 Reviewed-on: https://go-review.googlesource.com/c/build/+/409595 TryBot-Result: Gopher Robot <gobot@golang.org> Reviewed-by: Dmitri Shuralyov <dmitshur@golang.org> Reviewed-by: Dmitri Shuralyov <dmitshur@google.com> Run-TryBot: Bryan Mills <bcmills@google.com> Auto-Submit: Bryan Mills <bcmills@google.com>
Change https://go.dev/cl/442575 mentions this issue: |
Dragonfly and FreeBSD both used numerical values for these constants chosen to be the same as on Solaris. For some reason, NetBSD did not, and happens to interpret value 0 as P_ALL instead of P_PID (see https://github.com/NetBSD/src/blob/3323ceb7822f98b3d2693aa26fd55c4ded6d8ba4/sys/sys/idtype.h#L43-L44). Using the correct value for P_PID should cause wait6 to wait for the correct process, which may help to avoid the deadlocks reported in For #50138. Updates #13987. Change-Id: I0eacd1faee4a430d431fe48f9ccf837f49c42f39 Reviewed-on: https://go-review.googlesource.com/c/go/+/442478 Auto-Submit: Bryan Mills <bcmills@google.com> Reviewed-by: Benny Siegert <bsiegert@gmail.com> Run-TryBot: Bryan Mills <bcmills@google.com>
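To make the constant mismatch in the commit message above concrete, here is a small sketch of a per-OS lookup for the `idtype_t` value that selects a single process. The function name `pidIDType` is made up for illustration. The value 0 for Dragonfly and FreeBSD follows from the commit message; the value 1 for NetBSD is an assumption read off the linked `idtype.h` enum (where 0 is `P_ALL`), so check it against the header before relying on it.

```go
package main

import "fmt"

// pidIDType (hypothetical) returns the idtype_t value meaning "wait for
// exactly this process ID" in a wait6 call on the given GOOS. The second
// result is false for systems this sketch does not cover.
func pidIDType(goos string) (int, bool) {
	switch goos {
	case "dragonfly", "freebsd":
		// Chosen to match Solaris: P_PID is 0.
		return 0, true
	case "netbsd":
		// On NetBSD, 0 is P_ALL ("any child"). P_PID is assumed
		// to be the next enum value, 1.
		return 1, true
	default:
		return 0, false
	}
}

func main() {
	for _, goos := range []string{"dragonfly", "freebsd", "netbsd", "linux"} {
		v, ok := pidIDType(goos)
		fmt.Printf("%-10s P_PID=%d known=%v\n", goos, v, ok)
	}
}
```

Passing 0 on NetBSD would ask `wait6` about any child rather than the intended one, which is consistent with the commit message's description of waiting on the wrong process.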
https://go.dev/cl/442478 seems like a plausible fix for this issue. Waiting to see if there's more. The TryBots for that change hit a deadlock that looks plausibly related to the missed |
I've filed the new failure mode separately as #56180. |
The fprintf issue you hit may be a bug in the interaction between libpthread and the dynamic loader rtld which we have since fixed in HEAD and netbsd-9 (but not in netbsd-9.0): https://nxr.netbsd.org/xref/src/lib/libpthread/pthread.c?r=1.181#418 Please let us know if you can still reproduce it on a current system.
Surely that wouldn't affect the deadlock you saw with wait4, would it? If Go uses wait4 instead of wait6, do you still see the deadlock? (Forgive me if I missed something -- there have been a lot of updates in quick succession which I didn't follow all of.) |
The |
Updating to a more recent NetBSD release is #54773. |
Unfortunately the libpthread/rtld fix didn't make it into 9.3! I didn't realize it until a couple days after 9.3 went out, sorry. But if you can reproduce the C program's fprintf hang, just a regular NetBSD install on a VM without all the golang test harness, that would be helpful -- especially if you can do it on a system with the debug.tgz (or debug.tar.xz) set installed so we get full stack traces. I'll see if I can reproduce it, but half an hour of running it, both on a NetBSD<=9.3 library without the fix and on a current library with the fix, hasn't turned anything up yet. |
Instead of updating the builder VMs to 9.3, should we be updating them to a more recent snapshot of NetBSD-9 then?
This will dump more goroutines if the test happens to fail. For #50138. Change-Id: Ifae30b5ba8bddcdaa9250dd90be8d8ba7d5604d2 Reviewed-on: https://go-review.googlesource.com/c/go/+/442476 Reviewed-by: Ian Lance Taylor <iant@google.com> TryBot-Result: Gopher Robot <gobot@golang.org> Run-TryBot: Bryan Mills <bcmills@google.com> Auto-Submit: Bryan Mills <bcmills@google.com>
If we use the "pipetest" helper command instead of "sleep", we can use its stdout pipe to determine when the process is ready to handle a SIGSTOP, and we can additionally check that sending a SIGCONT actually causes the process to continue. This also allows us to remove the "sleep" helper command, making the test file somewhat more concise. Noticed while looking into #50138. Change-Id: If4fdee4b1ddf28c6ed07ec3268c81b73c2600238 Reviewed-on: https://go-review.googlesource.com/c/go/+/442576 Reviewed-by: Ian Lance Taylor <iant@google.com> TryBot-Result: Gopher Robot <gobot@golang.org> Run-TryBot: Bryan Mills <bcmills@google.com> Auto-Submit: Bryan Mills <bcmills@google.com>
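The readiness technique in the commit message above can be approximated outside the test suite. The sketch below is not the actual "pipetest" helper: it substitutes `cat` as the echoing child (an assumption, Unix only), and the names `echoOnce` and `stopAndContinue` are hypothetical. One echoed line shows the child is running before SIGSTOP is sent, and a second echoed line after SIGCONT shows it really resumed.

```go
package main

import (
	"bufio"
	"fmt"
	"io"
	"os/exec"
	"syscall"
)

// echoOnce writes one line to the child's stdin and expects the same
// line back on its stdout.
func echoOnce(w io.Writer, r *bufio.Reader, line string) error {
	if _, err := io.WriteString(w, line+"\n"); err != nil {
		return err
	}
	got, err := r.ReadString('\n')
	if err != nil {
		return err
	}
	if got != line+"\n" {
		return fmt.Errorf("got %q, want %q", got, line+"\n")
	}
	return nil
}

// stopAndContinue starts an echoing child, confirms it is ready, stops
// and resumes it, and confirms that it still responds afterwards.
func stopAndContinue() error {
	cmd := exec.Command("cat") // stand-in for the test's helper command
	stdin, err := cmd.StdinPipe()
	if err != nil {
		return err
	}
	stdout, err := cmd.StdoutPipe()
	if err != nil {
		return err
	}
	if err := cmd.Start(); err != nil {
		return err
	}
	r := bufio.NewReader(stdout)

	steps := []func() error{
		func() error { return echoOnce(stdin, r, "ready") },
		func() error { return cmd.Process.Signal(syscall.SIGSTOP) },
		func() error { return cmd.Process.Signal(syscall.SIGCONT) },
		func() error { return echoOnce(stdin, r, "continued") },
	}
	for _, step := range steps {
		if err := step(); err != nil {
			_ = cmd.Process.Kill()
			_ = cmd.Wait()
			return err
		}
	}

	stdin.Close() // EOF lets the child exit on its own
	return cmd.Wait()
}

func main() {
	fmt.Println("stopAndContinue error:", stopAndContinue())
}
```

If SIGCONT failed to resume the child, the second echo would block rather than return an error, so a real test would pair this with a timeout.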
There are getting to be enough special cases in this wrapper that the increase in clarity from having a single file is starting to be outweighed by the complexity from chained conditionals. Updates #50138. Updates #13987. Change-Id: If4f1be19c0344e249aa6092507c28363ca6c8438 Reviewed-on: https://go-review.googlesource.com/c/go/+/442575 Run-TryBot: Bryan Mills <bcmills@google.com> TryBot-Result: Gopher Robot <gobot@golang.org> Auto-Submit: Bryan Mills <bcmills@google.com> Reviewed-by: Ian Lance Taylor <iant@google.com>
The new failure mode in #56180 is consistent with the aforementioned I'm going to close this issue as “one bug fixed”, with the remaining hang tracked in #56180. (If there are other failure modes after the builders are upgraded in #54773, we can open new issues for those.) |
Change https://go.dev/cl/454755 mentions this issue: |
The netbsd-386 and netbsd-amd64 builders with 9.3 contain various bugfixes, including to libpthread, that prevent test flakes. Remove the older version now. While here, remove issue golang/go#50138 (fixed) from the netbsd-arm* builders. Fixes golang/go#54773. Change-Id: Ibccf0817a69a3dd74651bd5a3f50ab77c3a92beb Reviewed-on: https://go-review.googlesource.com/c/build/+/454755 TryBot-Result: Gopher Robot <gobot@golang.org> Reviewed-by: Carlos Amedee <carlos@golang.org> Auto-Submit: Carlos Amedee <carlos@golang.org> Reviewed-by: Bryan Mills <bcmills@google.com> Run-TryBot: Bryan Mills <bcmills@google.com>
greplogs --dashboard -md -l -e 'panic: test timed out.*\n\n(?:goroutine .*:\n(?:.+\n\t.+\n)+\n)*goroutine \d+ \[syscall, \d+ minutes\]:\n(?:.+\n\t.+\n)*os\.\(\*Process\)\.Wait(?:.*\n)+FAIL\s+cmd/link'
2021-12-12T06:14:07-9c6e8f6/netbsd-386-9_0-n2
2021-10-29T18:34:24-903f313/netbsd-amd64-9_0
2021-10-01T15:59:38-e5ad363/netbsd-arm-bsiegert
2021-09-21T20:39:31-48cf96c/netbsd-arm-bsiegert
2021-09-14T14:27:57-181e8cd/netbsd-arm-bsiegert
2021-04-29T15:47:16-12eaefe/freebsd-amd64-11_4
2021-04-28T13:49:52-4fe324d/netbsd-386-9_0
2021-03-05T02:30:31-b62da08/netbsd-386-9_0
2021-02-19T00:40:05-95a44d2/netbsd-arm64-bsiegert
2019-09-04T21:52:18-aae0b5b/linux-ppc64le-power9osu
#44801 may be closely related.
Note that many of these failures are on architectures not believed to be affected by #49209.
@bsiegert, @coypoop: any ideas?