-
Notifications
You must be signed in to change notification settings - Fork 17.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
runtime: TestGcSys is still flaky #37331
Comments
Shall we disable the test for now? |
2020-03-15T08:14:24-dc32553/freebsd-amd64-race @aclements, @mknyszek: what do you want to do about this test? (Do we understand the cause of these flakes?) |
@bcmills I'll take a look. This is usually due to some GC pacing heuristic doing something weird. A GC trace should get us part of the way there. |
OK sorry for the delay, finally looking into this now. |
Ugh, OK. So this definitely looks like another
You'll notice that in The last time I debugged this, it was a problem with the fact that we didn't fall back to the hard goal even when we were doing more scan work than expected. It's hard for me to see how this is the case again, so there's likely something else going on. A thought: |
Hmm, this batch includes a Windows builder too: |
I think we are also running into this on armv7 and armhf on Alpine Linux edge when building go 1.14.3: |
Having dug into this before, I suspect this is related to #42430 but I'm not sure in what way. Going back to the thought I had in March (#37331 (comment)) I did change a bunch of details in this release regarding how |
Oh, I also want to note #40460 which is probably related. I could likely prove it with an execution trace. Will look into this soon. |
I am getting this fairly consistently with:
and
Running |
Another flake on s390x builder: https://build.golang.org/log/1e93bd84feb5952dc11d951cd7acd6d949ce0277 |
I see this error intermittently when doing test builds on our ppc64le machines both power8 and power9. |
Looking at this again. I was able to reproduce once on linux/amd64. |
Of course, I've been unable to reproduce since that last failure. I will try again on the s390x builder since it seems to be more common there. |
OK finally coming back to this. What's nice is this reproduces very readily on s390x. Hopefully I'll get somewhere. |
This is definitely #40460. General observations:
Every single failure always looks like the following situation :
So, what's really going wrong here? A couple things:
I'm going to try to figure out if there's anything specifically wrong with the assist ratio next, or if its working as intended and there's something more fundamental wrong. The former issue is much harder to fix (and wouldn't resolve this 100%). |
One other thing to notice about this is the failure always has the mark assist + fractional GC help happen at the start of a GC cycle. So what if the situation is something like this: the fractional GC does a whole bunch of work such that there's very little work left. So the assist ratio (computed as "heap remaining / scan work remaining") ends up very large since we still have ~ 1 MiB of runway but there's very little scan work left. This is WAI and it would just be a coincidence, but unfortunately it does mean that with Here's a quick fix idea: what if a goroutine could never have more credit than the amount of heap runway left (basically the difference between the heap goal and the heap size at the point the assist finishes)? Then by construction a goroutine could never allocate past the heap goal without going into assist first and finishing off the GC cycle. The downside is you could have many goroutines try to end GC at the same time (go into @prattmic @randall77 what do you two think? |
By the way, my previous suggestion probably can't land in the freeze, even though it's fixing a bug. I don't know what to do about the flaky test until then. I think the issues causing this test to be flaky are somewhat fundamental. Do we change the test to be less flaky, then write a new one that's consistently failing due to this behavior (but turn it off)? @bcmills any ideas? |
@mknyszek, it's probably fine to add a call to If you have an idea for a non-flaky test that checks a similar behavior, you could maybe add the new test as a backstop against more severe regressions along with the |
This test is constantly flaking for me. Does it have any value? I'm ready to mark it as broken. |
Change https://golang.org/cl/336349 mentions this issue: |
I don't know what this test is doing, but it very frequently flakes for me while testing mundane compiler CLs. According to the issue log, it's been flaky for ~3 years. Updates #37331. Change-Id: I81c43ad646ee12d4c6561290a54e4bf637695bc6 Reviewed-on: https://go-review.googlesource.com/c/go/+/336349 Trust: Matthew Dempsky <mdempsky@google.com> Run-TryBot: Matthew Dempsky <mdempsky@google.com> TryBot-Result: Go Bot <gobot@golang.org> Reviewed-by: Michael Knyszek <mknyszek@google.com>
This currently fails more often on unusual platforms than on first-class ones, but note that Marking as release-blocker for Go 1.19. (@golang/runtime can decide to fix the test, skip it, or delete it entirely.) |
I thought we were already skipping this test? It suffers greatly from #52433. |
By that I mean skipping as of https://go.dev/cl/336349. |
Just checked on tip; the test is still skipped. How did it fail on linux-s390x-ibm at all? I'm not sure what else to do here. |
Ah, looks like the recent failures are on |
Yeah, that skip should be backported. |
@gopherbot, please backport to Go 1.17. This test is skipped on the main branch, but still failing semi-regularly on the release branch. |
Backport issue(s) opened: #52826 (for 1.17). Remember to create the cherry-pick CL(s) as soon as the patch is submitted to master, according to https://go.dev/wiki/MinorReleases. |
Change https://go.dev/cl/406974 mentions this issue: |
I don't know what this test is doing, but it very frequently flakes for me while testing mundane compiler CLs. According to the issue log, it's been flaky for ~3 years. Updates #37331. Fixes #52826. Change-Id: I81c43ad646ee12d4c6561290a54e4bf637695bc6 Reviewed-on: https://go-review.googlesource.com/c/go/+/336349 Trust: Matthew Dempsky <mdempsky@google.com> Run-TryBot: Matthew Dempsky <mdempsky@google.com> TryBot-Result: Go Bot <gobot@golang.org> Reviewed-by: Michael Knyszek <mknyszek@google.com> (cherry picked from commit d8ceb13) Reviewed-on: https://go-review.googlesource.com/c/go/+/406974 Run-TryBot: Dmitri Shuralyov <dmitshur@golang.org> TryBot-Result: Gopher Robot <gobot@golang.org> Reviewed-by: Dmitri Shuralyov <dmitshur@google.com>
I think we can close this again for 1.19, since the issue was on builders for an older release. |
@mknyszek The test I think this issue should be open (and can be in Backlog, unassigned state) if it's a goal to eventually figure out how the test can be made non-flaky, or the test can be deleted if it's not needed and then it's fine to close this issue. What do you think? |
Ah good point @dmitshur. Reopening. |
See previously #28574, #27636, #27156, #23343.
CC @mknyszek @aclements
2020-02-15T16:40:12-6917529/freebsd-amd64-race
2020-02-05T18:27:48-702226f/freebsd-amd64-race
2020-01-31T15:04:07-f2a4ab3/freebsd-amd64-race
2020-01-07T19:53:19-7d98da8/darwin-amd64-10_15
2019-12-31T12:11:24-bbd25d2/solaris-amd64-oraclerel
2019-12-11T00:01:17-9c8c27a/solaris-amd64-oraclerel
2019-11-04T15:18:34-d3660e8/plan9-386-0intro
2019-09-12T14:08:16-3d522b1/solaris-amd64-smartosbuildlet
The text was updated successfully, but these errors were encountered: