
Spurious 3-hour timeout for dist-* jobs due to 20 minutes spent on compiling stage0-libstd #48192

Closed
kennytm opened this issue Feb 13, 2018 · 10 comments
Labels
A-spurious Area: Spurious failures in builds (spuriously == for no apparent reason) C-tracking-issue Category: A tracking issue for an RFC or an unstable feature. T-infra Relevant to the infrastructure team, which will review and decide on the PR/issue.

Comments

@kennytm
Member

kennytm commented Feb 13, 2018

Symptom: one of the dist-*-linux jobs timed out after 3 hours. The stage0-std phase took over 1000 seconds to compile.

Typically all 3 compilers (host stage0, host stage1, target stage1) are completely built, but the RLS build or the final deployment doesn't finish in time.

The log should be inspected to ensure the timeout is not caused by a network error (e.g. cloning a submodule taking over 1 hour to complete).

Examples:

  • PowerPC (Build completed at 2:57:22, but 3 minutes isn't enough for deployment)
  • MIPS (Timed out before completing RLS)
  • PowerPC 64 (Timed out before completing RLS) (not 20-minute stage0-libstd)
  • i686 (Timed out before completing RLS) (not 20-minute stage0-libstd)
@kennytm kennytm added A-spurious Area: Spurious failures in builds (spuriously == for no apparent reason) T-infra Relevant to the infrastructure team, which will review and decide on the PR/issue. C-tracking-issue Category: A tracking issue for an RFC or an unstable feature. labels Feb 13, 2018
@Mark-Simulacrum
Member

I'm going to try to investigate at some point this week, so I'm assigning myself.

@Mark-Simulacrum Mark-Simulacrum self-assigned this Feb 13, 2018
@kennytm
Member Author

kennytm commented Feb 13, 2018

Timing breakdown for these 4 logs:

| Phase | i686 | ppc64 | mips | ppc |
|---|---|---|---|---|
| make-prepare | 69.19 | 74.98 | 60.87 | 52.38 |
| pytest/bootstrap | 0.00 | 0.00 | 0.00 | 0.00 |
| stage0-std | 91.80 | 104.70 | 1260.49 | 1250.99 |
| stage0-test | 20.93 | 24.50 | 22.87 | 17.52 |
| stage0-rustc | 1478.91 | 1616.18 | 1511.29 | 1234.51 |
| llvm | 223.83 | 253.40 | 379.83 | 320.81 |
| stage0-trans-llvm | 205.13 | 244.81 | 206.33 | 168.54 |
| llvm-emscripten | 136.07 | | | |
| stage0-trans-emscripten | 116.82 | | | |
| stage1-std | 200.31 | 101.66 | 86.61 | 68.47 |
| stage1-test | 23.13 | 27.84 | 23.53 | 18.37 |
| stage1-rustc | 1727.49 | 1871.25 | 1667.80 | 1325.40 |
| stage1-trans-llvm | 217.83 | 255.12 | 221.41 | 177.87 |
| stage1-trans-emscripten | 126.75 | | | |
| stage1-std | 89.00 | 99.63 | 85.85 | 65.50 |
| stage0-unstable-book-gen | 13.26 | 14.43 | 14.54 | 11.37 |
| stage0-rustbook | 240.16 | 251.63 | 208.86 | 196.78 |
| stage2-rustdoc | 287.74 | 306.87 | 289.94 | 239.42 |
| doc/std | 15.35 | 15.90 | 14.23 | 12.92 |
| doc/compiler | 22.85 | 26.30 | 24.32 | 18.23 |
| stage1-rustc | 1701.37 | 1885.18 | 1566.31 | 1342.24 |
| doc/proc_macro | 1.35 | 1.38 | 1.43 | 1.19 |
| stage2-error_index_generator | 8.24 | 8.74 | 8.56 | 6.48 |
| stage0-fabricate | 129.00 | 137.81 | 139.86 | 109.77 |
| llvm | 223.84 | 228.95 | 368.32 | 287.82 |
| stage1-trans-llvm | 222.48 | 281.78 | 218.72 | 196.22 |
| llvm-emscripten | 133.85 | | | |
| stage1-trans-emscripten | 131.67 | | | |
| stage2-rustdoc | 304.44 | 327.96 | 275.49 | 236.46 |
| dist/rustsrc | 171.99 | | | |
| stage2-cargo | 589.43 | 629.74 | 498.94 | 406.81 |
| stage2-rustfmt | 338.24 | 361.24 | 314.88 | 264.20 |
| stage2-cargo-fmt | 0.00 | 0.00 | 0.00 | 0.00 |
| stage2-rls | | 665.78 | 523.11 | 440.94 |

The MIPS and PowerPC builds spent >20 minutes on stage0-std, which may be a different bug (maybe related to rust-lang/cargo#5030, as suggested by @alexcrichton).

PowerPC64 looks most suspicious, as everything looks so uniform.

Graphed:

[screenshot: graph of the timing breakdown above]

@alexcrichton
Member

Looking at some of those with the Travis output:

  • PowerPC
    • The Travis container was rebuilt here, which took ~20m; this would be an anomaly
    • This also took 20m to build libstd in stage0 (!)
  • MIPS
    • Took 20m to build libstd in stage0 (!)

Not sure what happened with PowerPC 64 and i686... I'm looking into the 20m libstd stage0 bug.

@kennytm kennytm changed the title Spurious 3-hour timeout for dist-* jobs Spurious 3-hour timeout for dist-* jobs due to 20 minutes spent on compiling stage0-libstd Feb 14, 2018
@kennytm
Member Author

kennytm commented Feb 14, 2018

I've narrowed down this bug to focus on the 20-minute libstd builds.

@alexcrichton
Member

Some information about why libstd is taking 20 minutes to build: I randomly ran across this the other day and was very surprised by what was happening. The behavior I was seeing is that building libstd hung in stage0, and when looking at top, Cargo was just sitting there eating tons of CPU.

Looking into why Cargo was eating CPU, I discovered that Cargo's read2 function wasn't quite right: when one of the fds it was interested in was closed, it would just spin infinitely without blocking until the other fd was closed. The fix (rust-lang/cargo#5030) should address the spin behavior, but I don't believe it will fix the "cargo hangs for 20 minutes" bug.

I have no idea why Cargo is waiting 20 minutes for child file descriptors to get closed. The fix in rust-lang/cargo#5030, if I understand it correctly, will basically just cause Cargo to block correctly rather than spin, but it won't help the condition Cargo is waiting on to come about any sooner.
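To make the spin concrete, here is a minimal sketch of a poll()-based two-pipe reader that keeps polling a descriptor after it hits EOF. This is not Cargo's actual read2 code; it's an illustration that assumes a Unix platform and the `libc` crate, with all names invented for the example:

// Minimal sketch (not Cargo's read2) of a reader of two pipes that busy-spins
// once one pipe reaches EOF. Assumes Unix and the `libc` crate.
use std::os::unix::io::RawFd;
use std::time::Duration;

fn drain_both(out_fd: RawFd, err_fd: RawFd) {
    let mut fds = [
        libc::pollfd { fd: out_fd, events: libc::POLLIN, revents: 0 },
        libc::pollfd { fd: err_fd, events: libc::POLLIN, revents: 0 },
    ];
    let mut open = [true, true];
    let mut buf = [0u8; 4096];

    while open[0] || open[1] {
        // Once one pipe's write end is closed, poll() reports POLLHUP for it
        // immediately on every call, so this loop never sleeps: that is the
        // spin. The rust-lang/cargo#5030-style fix is to stop polling an fd
        // after it reaches EOF (e.g. set `fds[i].fd = -1`).
        unsafe { libc::poll(fds.as_mut_ptr(), fds.len() as libc::nfds_t, -1) };

        for i in 0..fds.len() {
            if !open[i] || fds[i].revents == 0 {
                continue;
            }
            let n = unsafe {
                libc::read(fds[i].fd, buf.as_mut_ptr() as *mut libc::c_void, buf.len())
            };
            if n <= 0 {
                open[i] = false; // EOF (or error): stop reading this pipe
            }
            fds[i].revents = 0;
        }
    }
}

fn main() {
    unsafe {
        let mut out = [0; 2];
        let mut err = [0; 2];
        assert_eq!(libc::pipe(out.as_mut_ptr()), 0);
        assert_eq!(libc::pipe(err.as_mut_ptr()), 0);

        // Simulate the situation in this issue: one stream is closed right
        // away, the other stays open (as if leaked into a long-lived daemon).
        libc::close(err[1]);
        let leaked = out[1];
        std::thread::spawn(move || {
            std::thread::sleep(Duration::from_secs(5));
            unsafe { libc::close(leaked) };
        });

        // Without the fix this burns 100% CPU for those 5 seconds.
        drain_both(out[0], err[0]);
    }
    println!("both pipes closed");
}

The loop burns CPU for as long as one pipe is at EOF while the other is held open elsewhere. Making it block properly (the rust-lang/cargo#5030-style fix) removes the spin, but the reader is still stuck waiting on the leaked descriptor, which is the 20-minute hang discussed here.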

I've got no idea why one fd of a child process would get closed and the other wouldn't for 20 minutes. I wonder, though, whether this is related to the sccache-and-build-scripts change, as sccache could in theory hold a descriptor open for a long time by accident. I haven't been able to confirm this locally though.

@alexcrichton
Member

Ok, I'm like 99% sure it's the sccache change now. The sccache server is being started from the libstd/libcore build scripts and is somehow inheriting too many fds. My current assumption for what's happening is:

  • Cargo spawns a build script with piped stdout/stderr fds
  • The build script then spawns a subprocess like the configure script
  • The configure script now does stuff that doesn't use CLOEXEC. For example, it may dup its own stdout/stderr onto separate fds
  • The configure script eventually invokes sccache to test it
  • Sccache realizes no server is running, so it daemonizes
  • Sccache's daemon inherits the original stdout/stderr fds, dup'd by the configure script without CLOEXEC, and keeps them open
  • Everything exits except for sccache's daemon, which stays alive for a 10m idle timeout
  • Cargo now attempts to finish reading the original stdout/stderr fds. One is closed, but the other leaked into sccache's server, so from Cargo's perspective it never gets closed.

I'm not entirely sure why a configure script (aka sh) would dup a stdio fd into a separate location without CLOEXEC, but it's currently the only way that I can explain this. I think we've got two fixes: either revert the sccache patch or try to start the sccache server earlier on in the compilation process. I vote we revert the sccache patch.
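For reference, the second option would look something like this minimal sketch. It is purely illustrative (not rustbuild's or run.sh's actual code) and assumes sccache is on PATH and supports its `--start-server` flag:

use std::process::Command;

fn main() {
    // Start the sccache daemon up front, from a process whose only open
    // descriptors are the usual stdio, so it cannot inherit build-script pipes.
    let status = Command::new("sccache")
        .arg("--start-server")
        .status()
        .expect("failed to spawn sccache");
    assert!(status.success(), "sccache server did not start");

    // ... kick off the actual build here; build scripts that later invoke
    // sccache will talk to this already-running server instead of
    // daemonizing one of their own ...
}

Starting the daemon from a clean parent before any build script runs means a build-script pipe can't leak into it; the later re-attempt referenced at the end of this thread takes this route by starting the server early in `run.sh`.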

@kennytm I'm gonna be tight on time soon, mind doing the revert for me?

@alexcrichton
Member

#48209 includes a revert of the sccache commit; presumably, if things stabilize after that lands, we can close this.

bors added a commit that referenced this issue Feb 14, 2018
Try to fix 48116 and 48192

The bug #48116 happens because of a misoptimization of the `import_path_to_string` function, where a `names` slice is empty but the `!names.is_empty()` branch is executed.

https://github.com/rust-lang/rust/blob/4d2d3fc5dadf894a8ad709a5860a549f2c0b1032/src/librustc_resolve/resolve_imports.rs#L1015-L1042

Yesterday, @eddyb locally reproduced the bug and [came across the `position` function](https://mozilla.logbot.info/rust-infra/20180214#c14296834) where the `assume()` call looks suspicious. We have *not* concluded that this `assume()` causes #48116, but given [the reputation of `assume()`](#45501 (comment)), it seems highly relevant. Here we try to see if commenting it out fixes the errors.

Later, @alexcrichton bisected and found a potential bug [on the LLVM side](#48116 (comment)). We are currently testing whether reverting that LLVM commit is enough to stop the bug. If so, this PR can be reverted (keeping the `assume()`) and we could backport the LLVM patch instead.

(This PR also includes an earlier commit from #48127 to help debug ICEs happening in compile-fail/parse-fail tests.)

The PR also reverts #48059, which seems to cause #48192.

r? @alexcrichton
cc @eddyb, @arthurprs (#47333)
@alexcrichton
Member

omg cargo clobbers cloexec

@alexcrichton
Member

Hm, no, never mind: those flags don't include the cloexec flag.

@alexcrichton
Member

alexcrichton commented Feb 14, 2018

Ok, I'm now 100% sure that this is a "bug", or rather simply a fact of how sh works.

Jemalloc's configure script has an innocuous-looking line which I had to look up because I had no idea what it does. Apparently it duplicates stdout onto file descriptor 6 (i.e. `exec 6>&1`)! The shell also apparently doesn't set CLOEXEC on it, meaning that file descriptor (a duplicate of stdout) will now leak into all further processes.

I created a build script that looks like this:

use std::process::Command;

fn main() {
    let s = Command::new("sh")
        .arg("-c")
        .arg("exec 3>&1; exec 2>&-; sleep 600 &")
        .status()
        .unwrap();
    println!("{}", s);
}

and I can deterministically reproduce the hanging behavior. The `exec 3>&1` says "put file descriptor 1, stdout, onto file descriptor 3"; the shell apparently doesn't set that duplicate to cloexec either. The next command, `exec 2>&-`, says "close file descriptor 2", which helps show Cargo spinning. The last one says "go sleep in some background process for a while, but return from the shell immediately".

Overall this build script exits immediately, but Cargo sits there spinning and burning CPU. After rust-lang/cargo#5030, Cargo "properly blocks" waiting for the file descriptor to be closed, but the fd still isn't getting closed.

tl;dr: I think this issue is now fixed for sure with #48209 landed. In the meantime I think we just can't use sccache unless we start up the server outside of the main build process, where we're sure that fds won't leak into it. From what I can tell the shell is "at fault" here, but I'm also like 99% sure that's a feature of the shell as well, so that's probably not gonna change. There's not much that Cargo/sccache can do here, as they're basically just bystanders.

Closing...

alexcrichton added a commit to alexcrichton/rust that referenced this issue Mar 2, 2018
This is a re-attempt at rust-lang#48192 hopefully this time with 100% less randomly
[blocking builds for 20 minutes][block]. To work around rust-lang#48192 the sccache
server is started in the `run.sh` script very early on in the compilation
process.

[block]: rust-lang#48192
bors added a commit that referenced this issue Mar 4, 2018
rustbuild: Pass `ccache` to build scripts

This is a re-attempt at #48192 hopefully this time with 100% less randomly
[blocking builds for 20 minutes][block]. To work around #48192 the sccache
server is started in the `run.sh` script very early on in the compilation
process.

[block]: #48192