Nondeterministic(?) SIGSEGV&co. in CI #90812
Comments
One more is #90041. |
#90589 in x86_64-gnu-distcheck |
Going systematically through the bors CI failures of the last 24 hours: Rustc segfaults:
8 other (inconspicuous) failures. |
Once again: dist-armhf fail
|
This is way too often for cosmic rays to be the answer at this point... |
#90881 (comment) failed building memchr for stage0 std (mingw-check). I investigated a bit to see what, if anything, has changed. stage0 rustc has not changed in a long while. The docker images haven't changed, either. The first error appears to be in #89167 (https://github.com/rust-lang-ci/rust/actions/runs/1447510130) at around 2021-11-11T05:46. The only thing I've seen change recently is the outer GitHub Ubuntu 18 image, which was updated at 2021-11-09T11:16. The new image is https://github.com/actions/virtual-environments/releases/tag/ubuntu18%2F20211108.1. I don't see anything suspicious in there. There are also about 43 hours of successful builds between the image being updated and the first SIGSEGV appearing. It would also be surprising for the outer image to introduce a SIGSEGV. |
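As a quick sanity check on the gap quoted above, here is a minimal sketch that subtracts the two timestamps from the comment. It assumes both are in the same timezone, which the comment does not state; the variable names are only illustrative.

```python
from datetime import datetime

# Timestamps as quoted in the comment; assumed to share one timezone.
image_updated = datetime.fromisoformat("2021-11-09T11:16")
first_segv = datetime.fromisoformat("2021-11-11T05:46")

# Length of the window between the image update and the first SIGSEGV.
gap_hours = (first_segv - image_updated).total_seconds() / 3600
print(f"{gap_hours:.1f} hours between image update and first SIGSEGV")
```

Under that assumption the gap comes out at 42.5 hours, in line with the "about 43 hours" estimate.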
We can try backing out #89167, but I'd really rather not. It introduced very little code that's even getting compiled at stage0 though (since it has cfg(not(bootstrap)) on the whole module it adds, IIRC). |
Running under valgrind with the stage0 compiler, I am seeing a bunch of the following (or similar) on thread_local compilation as part of bootstrap (due to #90812 (comment))
|
Hmm, that looks more like a valgrind instrumentation issue - the valgrind version of |
Yeah, it's possible. I have a suspicion that the problem may actually lie sort of "close by" in the sense that LLVM seems to be using the g++ stl malloc, even though it's built with clang -- new_op.cc comes from But I can't explain what changed around November 11th to cause this to start causing problems, more work to be done there. |
#90821 failed again, this time with a double free in stage 0 cargo: #90821 (comment) |
Probably another case in #90926 (comment), though that one looks even more interesting -- seems to be corruption of something with no Rust code at all, so further suggests some kind of environmental change perhaps... |
The segfault spam on CI definitely started before #89167 was merged and against PRs which did not include it. It may have exacerbated the problem somewhat by adding more pressure on afflicted parts of LLVM/Rust, but there is no way it traveled back in time and started breaking CI before it appeared in the compiler. It's not impossible that it affects memchr specifically in a bad way, but I feel it is somewhat unlikely. |
Yeah, I agree this is not likely to be caused by that PR (which is why I have not posted a revert yet), given that we had "broken" builds due to this issue after the first attempt to merge it but before that succeeded. It seems quite unlikely we managed to accidentally break out of all the sandboxes and taint the CPU/memory somehow :) |
#90938 (comment) is another failure, yesterday. |
I have a little script which greps through the CI logs for 'SIGSEGV', and there has been no new segfault since this one. No idea why it was happening or why it stopped. |
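The script itself was not posted, so the following is only a rough sketch of what such a log scan might look like, not the actual tool. It assumes the CI logs have already been downloaded as plain text; the function name `find_crashes` and the sample lines are hypothetical.

```python
from typing import Iterable, List, Tuple


def find_crashes(lines: Iterable[str], marker: str = "SIGSEGV") -> List[Tuple[int, str]]:
    """Return (line_number, text) for every log line containing `marker`."""
    hits = []
    for number, line in enumerate(lines, start=1):
        if marker in line:
            # Strip the trailing newline so hits print cleanly.
            hits.append((number, line.rstrip("\n")))
    return hits


# Illustrative log excerpt (made up for this example, not taken from a real run).
sample_log = [
    "Compiling memchr v2.4.1\n",
    "error: could not compile `memchr`\n",
    "  (signal: 11, SIGSEGV: invalid memory reference)\n",
    "Build completed unsuccessfully\n",
]

for number, text in find_crashes(sample_log):
    print(f"line {number}: {text}")
```

To scan a saved log, one could pass an open file object instead of the list, since iterating a text file yields its lines. Matching on a plain substring keeps the sketch simple; other markers such as a double-free message would need a separate call or a small extension.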
That is absolutely wizard, thank you. |
Ah, it looks like this has not recurred for an entire season?! |
See #89167, getting lovely stuff like core dumps here:
x86_64 -> ppc64le dist fail, first couple of tries
x86_64 -> ppc64le dist fail, last try
And in #90792:
x86_64 -> s390x dist fail, first few times
x86_64 -> s390x dist fail, last time
I don't know if this is going to be a useful issue, but I figured I would note this down as happening over the last day-ish.
For all I know it may have been cosmic rays.