Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Tons of spurious failures due to build scripts failing to run rustc #663

Open
RalfJung opened this issue Aug 10, 2022 · 11 comments
Open

Tons of spurious failures due to build scripts failing to run rustc #663

RalfJung opened this issue Aug 10, 2022 · 11 comments

Comments

@RalfJung
Copy link
Member

RalfJung commented Aug 10, 2022

The report at https://crater-reports.s3.amazonaws.com/pr-100043/index.html is 100% spurious regressions. Almost all of them are due to the libc build script failing to detect the rustc version, a bunch are due to other build scripts failing to run rustc, and then there are a few other random strange things (in syn some trait is supposedly not implemented any more, and the rest is handled by #664).

@ehuss
Copy link
Contributor

ehuss commented Aug 10, 2022

I don't recall exactly how crater/rustwide work, but I believe they use rustup. If so, and if the crater machine has been updated to rustup 1.25, then this is likely caused by rust-lang/rustup#3031.

@RalfJung
Copy link
Member Author

That sounds like a deterministic problem, whereas this build script failure only occurs for a tiny fraction of all crates that use libc. (At least I would be very surprised if only 600 crates used libc.)

@Mark-Simulacrum
Copy link
Member

Yes, and I manually verified that libc itself built fine on master and on try, so it's definitely not fully deterministic. It might be worth trying to run these through again to try and re-confirm and see if failures are deterministic for this subset of ~600.

@Mark-Simulacrum
Copy link
Member

Interestingly, based on unpacking the all.tar.gz from both #99447 and #100043, and running rg 'Failed to get rustc version' | rg -v 'try#', it seems like only try# builds are affected. We run the start toolchain "check" task on the same worker thread & machine as the end toolchain, and in that order.

One run failed twice in libc (https://crater-reports.s3.amazonaws.com/pr-100043/try%23a8425075041d4b9af9bf32e2a267dfb9db2cfffb/gh/dalloriam.worktimer/log.txt), presumably there's a build-dep on libc and a regular dependency there? But not sure there's much information we can gain from that.

@RalfJung
Copy link
Member Author

In the rerun, the libc issue still occurs a few times, but the majority of crates passes now. So, definitely non-deterministic.

To me this really looks like it just failed to spawn the process due to resource limits...

@Mark-Simulacrum
Copy link
Member

We run on a large number of machines and don't regularly collect detailed logs from each one right now, but I'm not sure what kind of resource limit could have caused this to just recently start happening. We are adding new machine types to our build fleet -- but those aren't being oversubscribed to any great extent, at least from my perspective.

I'm going to try and locate which machines are having this error and pull logs, but our machines are ephemeral, so that might be a little hard.

@Mark-Simulacrum
Copy link
Member

Mark-Simulacrum commented Aug 11, 2022

I located one machine which had a few cases of this, but I don't see any adjacent suspicious events in the syslog (e.g., OOM kill or similar). It's possible it's another kind of resource exhaustion (file descriptor limits or whatever), though.

I also queried the all.tar logs for which workers had problems, and there is a suspicious correlation there, presuming I'm getting reasonable data. (Note that these workers may be on different machines: I'm not sure there's any good way to recover that from the current logs and other data we have, though it seems like a good idea to add some support for that).

worker-9 I think has to be on the larger, oversubscribed, Azure machines -- the GCP machines only have 8 workers. I definitely found 3 cases on one of the GCP machines though, so it's not strictly limited.

In any case, given these results for now I've turned off the Azure capacity -- currently it looks like I can't directly ssh onto those machines, but they are definitely oversubscribed so I'd not be too surprised if there's some resourcing issue there. I don't think it fully explains the problem (we should see errors on both master# and try#, IMO, and have seen a similar rate historically if not lower now since we have more GCP capacity than we did).

$ for file in $(rg 'Failed to get rustc version' -l); do rg -o 'worker-\d+' $file; done | sort | uniq -c
     97 worker-5
   3503 worker-6
   2394 worker-9

Edit: and the worker counts don't seem related to the number of jobs that worker executed, at least for worker-9 -- so I don't think we can explain this with just "that worker is executing more work". worker-6 though is more of an unknown.

$ rg --no-filename -o 'worker-\d+' | sort | uniq -c | sort -nr
 660359 worker-7
 658404 worker-4
 658289 worker-6
 658284 worker-0
 657983 worker-5
 657168 worker-3
 656586 worker-1
 656361 worker-2
  40224 worker-9
  39600 worker-8
  38991 worker-14
  38157 worker-13
  37966 worker-10
  37608 worker-11
  37066 worker-12

@Mark-Simulacrum
Copy link
Member

rust-lang/rust#100046 (comment) is another crater run, mostly run after we shut down the Azure (oversubscribed) instances. It looks like it had ~9 libc issues; I didn't look for other cases.

@ehuss
Copy link
Contributor

ehuss commented Aug 13, 2022

My script has a few other patterns that will match other build scripts and found 38 instances in pr-100046:

build-fail/gh/charlag/rusty-bird/5d388381d1f36b4bd3182c3b41ecc030716e07df/master#59d99f56d8888b0e5a5866d6418df3ca1bb90013.txt: panicked at 'unexpected -vV format
build-fail/gh/charlag/rusty-bird/5d388381d1f36b4bd3182c3b41ecc030716e07df/try#d76952bc2a74b981479554f7e1e2c6f8a8c3e3bf.txt: panicked at 'unexpected -vV format
build-fail/gh/danpersa/rusty-team/af69f4c11df3e4fe555c31b0eac218414e4d0c73/master#59d99f56d8888b0e5a5866d6418df3ca1bb90013.txt: panicked at 'unexpected -vV format
build-fail/gh/dkastner/vault_to_env/5a4ac46ea4a5a0302a27369d6ab29b9f8b1bedec/master#59d99f56d8888b0e5a5866d6418df3ca1bb90013.txt: panicked at 'unexpected -vV format
build-fail/gh/fenollp/nnrs/0ecc83d6b83c12552ce1a667aaf1cbef213015d2/master#59d99f56d8888b0e5a5866d6418df3ca1bb90013.txt: panicked at 'unexpected -vV format
build-fail/gh/ikruglov/grpcdemo2rust/112b0f17048d2a3a51a7095e74e706d9da24b55c/master#59d99f56d8888b0e5a5866d6418df3ca1bb90013.txt: panicked at 'unexpected -vV format
build-fail/gh/nuts/scale_server_rust/acc37275442877fc69ab9aa802d11f4054828600/master#59d99f56d8888b0e5a5866d6418df3ca1bb90013.txt: panicked at 'unexpected -vV format
build-fail/gh/nuts/scale_server_rust/acc37275442877fc69ab9aa802d11f4054828600/try#d76952bc2a74b981479554f7e1e2c6f8a8c3e3bf.txt: panicked at 'unexpected -vV format
build-fail/gh/onur/at-rust/275281522d3eafc747466ea29aac31f990ca05ec/master#59d99f56d8888b0e5a5866d6418df3ca1bb90013.txt: panicked at 'unexpected -vV format
build-fail/gh/portaloffreedom/remote-music/94867f8d5f42ab604c02ab357eec3150b63ae01a/master#59d99f56d8888b0e5a5866d6418df3ca1bb90013.txt: panicked at 'unexpected -vV format
build-fail/gh/portaloffreedom/remote-music/94867f8d5f42ab604c02ab357eec3150b63ae01a/try#d76952bc2a74b981479554f7e1e2c6f8a8c3e3bf.txt: panicked at 'unexpected -vV format
build-fail/gh/rrader/EventBaboon/0ba91ee13bbc19ed871f6aea2cc780f8e9538d39/master#59d99f56d8888b0e5a5866d6418df3ca1bb90013.txt: panicked at 'unexpected -vV format
build-fail/gh/rrader/EventBaboon/0ba91ee13bbc19ed871f6aea2cc780f8e9538d39/try#d76952bc2a74b981479554f7e1e2c6f8a8c3e3bf.txt: panicked at 'unexpected -vV format
build-fail/gh/rust-los-angeles/stockfighter-skeleton-app-rs/a1a092c937ed0c977fa105f6a0b54facb18be410/master#59d99f56d8888b0e5a5866d6418df3ca1bb90013.txt: panicked at 'unexpected -vV format
build-fail/gh/rust-los-angeles/stockfighter-skeleton-app-rs/a1a092c937ed0c977fa105f6a0b54facb18be410/try#d76952bc2a74b981479554f7e1e2c6f8a8c3e3bf.txt: panicked at 'unexpected -vV format
build-fail/gh/schultyy/avm/82ab688d2f0053bf7f4bd7edda7ba1c85b51371e/master#59d99f56d8888b0e5a5866d6418df3ca1bb90013.txt: panicked at 'unexpected -vV format
build-fail/gh/schultyy/avm/82ab688d2f0053bf7f4bd7edda7ba1c85b51371e/try#d76952bc2a74b981479554f7e1e2c6f8a8c3e3bf.txt: panicked at 'unexpected -vV format
build-fail/gh/wraithan/prime-challenge/325a0214cc7cc7a7f69db60e58e381f350a90bf5/master#59d99f56d8888b0e5a5866d6418df3ca1bb90013.txt: panicked at 'unexpected -vV format
build-fail/gh/wraithan/prime-challenge/325a0214cc7cc7a7f69db60e58e381f350a90bf5/try#d76952bc2a74b981479554f7e1e2c6f8a8c3e3bf.txt: panicked at 'unexpected -vV format
build-fail/gh/zackify/silo/1550cb7cd7ad759b9ab29c0fb3726cc0767ac1ff/master#59d99f56d8888b0e5a5866d6418df3ca1bb90013.txt: panicked at 'unexpected -vV format
build-fail/gh/zackify/silo/1550cb7cd7ad759b9ab29c0fb3726cc0767ac1ff/try#d76952bc2a74b981479554f7e1e2c6f8a8c3e3bf.txt: panicked at 'unexpected -vV format
build-fail/gh/zcash-hackworks/pay-to-sudoku/e80e165ab50495606c59edded993e193dc506dc0/master#59d99f56d8888b0e5a5866d6418df3ca1bb90013.txt: panicked at 'unexpected -vV format
build-fail/gh/zcash-hackworks/pay-to-sudoku/e80e165ab50495606c59edded993e193dc506dc0/try#d76952bc2a74b981479554f7e1e2c6f8a8c3e3bf.txt: panicked at 'unexpected -vV format
build-fail/reg/provwasm-tutorial/0.1.2/try#d76952bc2a74b981479554f7e1e2c6f8a8c3e3bf.txt: Failed to get rustc
build-fail/reg/pwrs/1.1.0/try#d76952bc2a74b981479554f7e1e2c6f8a8c3e3bf.txt: Failed to get rustc
build-fail/reg/replicante_agent/0.5.0/try#d76952bc2a74b981479554f7e1e2c6f8a8c3e3bf.txt: Failed to get rustc
build-fail/reg/rlp/0.5.1/try#d76952bc2a74b981479554f7e1e2c6f8a8c3e3bf.txt: Failed to get rustc
build-fail/reg/safety-rules/0.2.2/try#d76952bc2a74b981479554f7e1e2c6f8a8c3e3bf.txt: Failed to get rustc
build-fail/reg/skywalking_rust/0.1.0/try#d76952bc2a74b981479554f7e1e2c6f8a8c3e3bf.txt: Failed to get rustc
regressed/gh/a3ch/jureba_bot/343c04dd4b945f728be41506a34335ede871c37f/try#d76952bc2a74b981479554f7e1e2c6f8a8c3e3bf.txt: Failed to get rustc
regressed/reg/qecp/0.1.6/try#d76952bc2a74b981479554f7e1e2c6f8a8c3e3bf.txt: Failed to get rustc
regressed/reg/rawbson/0.2.1/try#d76952bc2a74b981479554f7e1e2c6f8a8c3e3bf.txt: Failed to get rustc
regressed/reg/redislock/1.3.0/try#d76952bc2a74b981479554f7e1e2c6f8a8c3e3bf.txt: Failed to get rustc
regressed/reg/rust_ledger/0.8.0/try#d76952bc2a74b981479554f7e1e2c6f8a8c3e3bf.txt: Failed to get rustc
regressed/reg/secp256kfun/0.7.1/try#d76952bc2a74b981479554f7e1e2c6f8a8c3e3bf.txt: Failed to get rustc
regressed/reg/shopless-types/0.10.0/try#d76952bc2a74b981479554f7e1e2c6f8a8c3e3bf.txt: Failed to get rustc
regressed/reg/silver_animation/0.1.0/try#d76952bc2a74b981479554f7e1e2c6f8a8c3e3bf.txt: Failed to get rustc
regressed/reg/smart-wallet/0.11.1/try#d76952bc2a74b981479554f7e1e2c6f8a8c3e3bf.txt: Failed to get rustc

Which is on-par with what was seen previously.

Copying some comments from the zulip thread for posterity:


I scanned the old logs. This seems to have been an issue since at least February (the oldest one I checked).

One strange thing I noticed in pr-100043 (Ralf's PR) and pr-99880 is that there is one build script which prints the stderr when it fails to run rustc and it printed:

/opt/rustwide/rustup-home/toolchains/a8425075041d4b9af9bf32e2a267dfb9db2cfffb/bin/rustc: symbol lookup error: /opt/rustwide/rustup-home/toolchains/a8425075041d4b9af9bf32e2a267dfb9db2cfffb/lib/librustc_driver-c25a516c3d8f3eb8.so: undefined symbol: _ZN9hashbrown3raw11Fallibility17capacity_overflow17hf1df634019334dcdE

In all 7 cases it is the same symbol error.

matthiaskrgr was also observing that same exact error at https://rust-lang.zulipchat.com/#narrow/stream/131828-t-compiler/topic/master.20toolchain.20broken.3F/near/291873065 with icemaker.


Number of "failed to launch rustc -V" -like errors over time:

February is just the last one you gave me in the list, I assume it has been happening before then.

Here's the frequency over time:

2022-02-10 31
2022-02-15 22
2022-02-18 30
2022-02-21 34
2022-02-26 23
2022-03-01 15
2022-03-03 20
2022-03-07 29
2022-03-10 18
2022-03-14 37
2022-03-18 25
2022-03-25 21
2022-03-27 9
2022-04-02 32
2022-04-05 26
2022-04-07 16
2022-04-10 21
2022-04-13 24
2022-04-16 25
2022-04-18 16
2022-04-21 19
2022-04-23 22
2022-04-27 28
2022-04-30 29
2022-05-05 34
2022-05-10 19
2022-05-11 21
2022-05-15 17
2022-05-19 22
2022-05-23 30
2022-05-26 23
2022-05-29 25
2022-06-02 32
2022-06-05 34
2022-06-13 29
2022-06-14 28
2022-06-17 29
2022-06-20 29
2022-06-22 28
2022-06-23 31
2022-06-25 27
2022-06-28 24
2022-07-01 34
2022-07-05 31
2022-07-08 22
2022-07-12 13
2022-07-16 37
2022-07-17 16
2022-07-20 31
2022-07-22 13
2022-07-24 9
2022-07-26 8
2022-07-30 27
2022-08-01 114
2022-08-04 26
2022-08-06 388
2022-08-07 30
2022-08-08 22
2022-08-09 965
2022-08-10 53


Raw per-worker occurrence data: https://gist.github.com/ehuss/2a0ebce9fe9bc1440705f6fe86fe3094

@RalfJung
Copy link
Member Author

One unfortunate side-effect of this is that it can mask actual regressions -- e.g. in https://crater-reports.s3.amazonaws.com/pr-104429-1/index.html, skreborn.bluetooth-sys.9d5b3dcba1a98824bb765ca5f69a1a6c789589b4 should be listed as a regular regression but instead its build script failed.

tatref pushed a commit to tatref/libc that referenced this issue Mar 20, 2023
bors added a commit to rust-lang/libc that referenced this issue Apr 3, 2023
build.rs: print rustc stderr if exit status != 0

I was trying to run benchmarks locally with rustc-perf and found that many of them failed to build with a message from libc's build.rs "Failed to get rustc version." I made this change locally to help debug, and I think it would be generally useful. In my case it quickly revealed that rustc was failing to find libLLVM and so `rustc --version` was emitting nothing on stdout.

I think this may have been part of what was intended with #3000 and might help debug rust-lang/crater#663.
bors added a commit to rust-lang/libc that referenced this issue Apr 3, 2023
build.rs: print rustc stderr if exit status != 0

I was trying to run benchmarks locally with rustc-perf and found that many of them failed to build with a message from libc's build.rs "Failed to get rustc version." I made this change locally to help debug, and I think it would be generally useful. In my case it quickly revealed that rustc was failing to find libLLVM and so `rustc --version` was emitting nothing on stdout.

I think this may have been part of what was intended with #3000 and might help debug rust-lang/crater#663.
bors added a commit to rust-lang/libc that referenced this issue Apr 4, 2023
build.rs: print rustc stderr if exit status != 0

I was trying to run benchmarks locally with rustc-perf and found that many of them failed to build with a message from libc's build.rs "Failed to get rustc version." I made this change locally to help debug, and I think it would be generally useful. In my case it quickly revealed that rustc was failing to find libLLVM and so `rustc --version` was emitting nothing on stdout.

I think this may have been part of what was intended with #3000 and might help debug rust-lang/crater#663.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants