close_output test is randomly failing #8564
Comments
Ugh, I just encountered a failure with it running in isolation without any threads...
I can't seem to reproduce this locally, at least not when running all tests in a loop that contain Otherwise my only real guess is that the pipe fills up before returning an error, but local tests show that doesn't happen so I don't think this is right. Especially if it happens without any threads that is indeed very odd... One thing that may help is to |
I have instrumented it by writing to a file to verify what is happening. It seems to be working in the sequence I expect, and Cargo is sometimes successfully writing to stdout, even though I think it should be closed. Unless I have some misunderstanding about how pipes work (I've been assuming that after a successful close, the other side should be closed immediately, assuming there aren't any duplicates of the fd). I keep thinking the most likely cause is a duplicate fd somewhere, but I can't find it.
To repro, here's a change that can make it occur much faster:
diff --git a/tests/testsuite/build.rs b/tests/testsuite/build.rs
index 7bcd6215a..f78354db4 100644
--- a/tests/testsuite/build.rs
+++ b/tests/testsuite/build.rs
@@ -4971,7 +4971,7 @@ fn close_output() {
let mut buf = [0];
drop(socket.read_exact(&mut buf));
let use_stderr = std::env::var("__CARGO_REPRO_STDERR").is_ok();
- for i in 0..10000 {
+ for i in 0..1 {
if use_stderr {
eprintln!("{}", i);
} else {
The actual count there shouldn't matter. After stdout is closed, any write should fail. And because the proc-macro and test are synchronized, it is guaranteed that the write comes after the close. For doing single-threaded testing, I build the test and run this:
For me, this can take anywhere from a few seconds to a few minutes to fail. If you want a near-guaranteed failure, run several copies in parallel. I have a macro in my editor which adds something like the following:
diff --git a/tests/testsuite/build.rs b/tests/testsuite/build.rs
index 7bcd6215a..d44093d2a 100644
--- a/tests/testsuite/build.rs
+++ b/tests/testsuite/build.rs
@@ -4931,6 +4931,25 @@ fn user_specific_cfgs_are_filtered_out() {
p.process(&p.bin("foo")).run();
}
+#[test] fn close_output0() { close_output(); }
+#[test] fn close_output1() { close_output(); }
+#[test] fn close_output2() { close_output(); }
+#[test] fn close_output3() { close_output(); }
+#[test] fn close_output4() { close_output(); }
+#[test] fn close_output5() { close_output(); }
+#[test] fn close_output6() { close_output(); }
+#[test] fn close_output7() { close_output(); }
+#[test] fn close_output8() { close_output(); }
+#[test] fn close_output9() { close_output(); }
+#[test] fn close_output10() { close_output(); }
+#[test] fn close_output11() { close_output(); }
+#[test] fn close_output12() { close_output(); }
+#[test] fn close_output13() { close_output(); }
+#[test] fn close_output14() { close_output(); }
+#[test] fn close_output15() { close_output(); }
+#[test] fn close_output16() { close_output(); }
+
+
I usually use a few hundred copies, but 16 seems to be enough here. Run Here's a copy of the instrumentation from a failed run:
Hopefully that makes sense. Compared to successful runs (where Cargo returns the expected error), the logging looks mostly the same. Some messages are rearranged due to process/thread scheduling, but the important lines are in the expected order. Thanks for helping to take a look! I keep feeling like I am missing something obvious, or have some fundamental misunderstanding about pipes, or how the |
Oh, and I've actually been doing most of my testing on a fast Linux machine, which seems to repro a little more easily than on my Mac laptop (which also can fail, but not as easily). I just ran the test on my 12-thread Mac and it failed 10 out of 400 runs, whereas on my 32-thread Linux system, it failed 298 out of 400 runs. (This is with the
Ok so I have some explanations, but not others. If you change it to only print one item, e.g. change the loop to That explains why when changing to So to confirm, can you reproduce with |
Hm, sorry for the misleading direction. I wasn't aware of that interaction with My original intent with the 10,000 was just to ensure there wasn't any silliness with I'll continue poking on this. |
I'm returning to my original hypothesis that the issue is the file descriptor staying open because other threads are forking processes in the background, and there's a small window where the fd is duplicated. I'm able to repro with the following patch, running 1 test at a time (though it takes a while):
diff --git a/tests/testsuite/build.rs b/tests/testsuite/build.rs
index 01c16b515..1e00e3543 100644
--- a/tests/testsuite/build.rs
+++ b/tests/testsuite/build.rs
@@ -4939,6 +4939,14 @@ fn user_specific_cfgs_are_filtered_out() {
fn close_output() {
// What happens when stdout or stderr is closed during a build.
+ std::thread::spawn(|| {
+ let _test_guard = cargo_test_support::paths::init_root();
+ loop {
+ let _output = cargo_test_support::cargo_process("help")
+ .exec_with_output().unwrap();
+ }
+ });
+
// Server to know when rustc has spawned.
let listener = std::net::TcpListener::bind("127.0.0.1:0").unwrap();
let addr = listener.local_addr().unwrap();
@@ -4975,7 +4983,7 @@ fn close_output() {
let mut buf = [0];
drop(socket.read_exact(&mut buf));
let use_stderr = std::env::var("__CARGO_REPRO_STDERR").is_ok();
- for i in 0..10000 {
+ for i in 0..100 {
if use_stderr {
eprintln!("{}", i);
} else {
I'm not able to repro without other threads running in the background. @alexcrichton I have two questions:
I do actually always forget that CLOEXEC is, well, on exec not fork. Before we dive into this hypothesis though I'm curious:
That was when I had
Hm, good point. I did some tests, and I'm able to write 64KB to stdout or stderr before it starts blocking. I would have expected 4K. TIL, according to pipe(7), the buffer is 16 pages (65,536 bytes with 4096-byte pages). So, maybe the answer here is to just print more than 64KB of data?
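A rough back-of-the-envelope check of that idea, assuming each loop iteration emits the index followed by a newline (as the eprintln!("{}", i) call in the diffs above does) and assuming 4096-byte pages. The function and constant names are made up for illustration:

```rust
/// Bytes produced by `println!("{}", i)` for every `i` in `0..count`:
/// the decimal digits of `i` plus one trailing newline.
fn bytes_printed(count: u64) -> usize {
    (0..count).map(|i| i.to_string().len() + 1).sum()
}

/// Pipe capacity from pipe(7): 16 pages.
/// The 4096-byte page size is an assumption; it differs on some systems.
const ASSUMED_PIPE_CAPACITY: usize = 16 * 4096;

fn main() {
    for count in [1u64, 100, 10_000, 100_000] {
        let bytes = bytes_printed(count);
        println!(
            "{:>7} lines -> {:>7} bytes, exceeds pipe capacity: {}",
            count,
            bytes,
            bytes > ASSUMED_PIPE_CAPACITY
        );
    }
}
```

Under those assumptions, 10,000 iterations only produce 48,890 bytes, which is below the 65,536-byte capacity, so the original loop could finish without ever blocking on a full pipe. Something on the order of 100,000 iterations (588,890 bytes) would be needed to exceed it.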
Ok, that makes sense with |
TLDR: Should we run some flaky tests single-threaded? (Nope)
The build::close_output test is randomly failing on CI. There were some fixes applied in #8286 on May 26, but there appear to be more recent failures:
rust-lang/rust#74312 (comment)
rust-lang/rust#74408 (comment)
rust-lang/rust#74908 (comment)
rust-lang/rust#74923 (https://github.com/rust-lang-ci/rust/runs/924743383)
The failure is:
I am uncertain how this is possible, so maybe someone could double check that what I wrote makes sense. The test covers what happens when stdout or stderr is closed in the middle of the build. It uses a proc-macro as a sync point so that the test can know when compilation has started, and to emit data to stdout or stderr during the build. It should follow this sequence:
For some reason, at step 8, it successfully writes to stdout, and step 9 returns success.
I've been doing a few tests, and it gets worse based on the number of concurrent tests running.
When run single threaded, I cannot get it to fail (even with the system under heavy load).
I'm feeling this is somewhat related to #7858. Is there still a race condition, even with atomic O_CLOEXEC? That is, AIUI, the file descriptors are still inherited across fork, and only closed when exec is called. If so, then there is a small window where the file descriptors have extra duplicates which prevent them from fully closing immediately.
I'm thinking a simple solution would be to isolate these tests into a separate test executable which runs with --test-threads=1 (or maybe a simple no-harness test?). This should prevent concurrent tests from interfering with one another. The downside is that this makes it more cumbersome to run all of the test suite. (Testing shows this probably won't fix this test.)
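A minimal sketch of the duplicate-descriptor effect suspected above, assuming a Unix-like system and Rust 1.87 or newer (for std::io::pipe). It is not the Cargo test itself: try_clone stands in for the copy a forked child would hold until it calls exec, and the function name is made up for illustration. Rust binaries ignore SIGPIPE by default, so the failed write shows up as an error instead of killing the process.

```rust
use std::io::{self, ErrorKind, Write};

/// Writes one byte while a duplicate of the read end is still open, then
/// again after every copy of the read end has been closed.
fn write_before_and_after_last_close() -> io::Result<(io::Result<()>, io::Result<()>)> {
    let (reader, mut writer) = io::pipe()?;

    // Stand-in for the copy that a forked child holds until it calls exec.
    let duplicate = reader.try_clone()?;

    // The test closes its own read end, but the duplicate keeps the pipe open,
    // so this write lands in the pipe buffer and succeeds.
    drop(reader);
    let while_duplicated = writer.write_all(b"x");

    // Once the last copy is gone, the next write has no reader left.
    drop(duplicate);
    let after_last_close = writer.write_all(b"x");

    Ok((while_duplicated, after_last_close))
}

fn main() -> io::Result<()> {
    let (while_duplicated, after_last_close) = write_before_and_after_last_close()?;
    println!("write while a duplicate is open succeeded: {}", while_duplicated.is_ok());
    let broken = matches!(&after_last_close, Err(e) if e.kind() == ErrorKind::BrokenPipe);
    println!("write after the last close failed with BrokenPipe: {}", broken);
    Ok(())
}
```

The first write succeeds even though the original read end was already dropped, which matches the symptom of Cargo writing to a stdout that the test believed was closed. Only after the last duplicate goes away does the write fail.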