
Spurious test failures #19120

Closed
9 of 13 tasks
alexcrichton opened this issue Nov 19, 2014 · 14 comments
Labels
metabug Issues about issues themselves ("bugs about bugs")

Comments

@alexcrichton
Member

Our builders are susceptible to a lot of spurious test failures, and I've been keeping this knowledge in my head for far too long, so I'd like to write it down. I'm writing this as a metabug, but I expect other bugs to be spun off from it as well. Sometimes it's just a lot easier to add a bullet than to file a whole new bug!

If you add to this, please don't link directly to buildbot logs, as they disappear over time. Instead, please link to a gist with the relevant information pasted in. I also typically peruse http://buildbot.rust-lang.org/grid?branch=auto&width=10 for build failures.

Spurious test failures

  • stdtest deadlocks on Windows once a TCP test fails (logs) (test)
  • stdtest spuriously segfaults on FreeBSD (logs). The segfaults decreased in frequency when FreeBSD was upgraded; it may just need to be upgraded again.
  • stdtest deadlocks (logs)
    • thread_local::dtors_in_dtors_in_dtors deadlocks on OSX (logs)
  • io::process::tests::test_kill spuriously fails (logs)
  • run-pass-fulldeps/issue-13560.rs (logs)
  • wait-forked-but-child-failed (logs)
  • src/test/run-pass/issue-12684 (logs)

Spurious failure behavior

@alexcrichton alexcrichton added the metabug Issues about issues themselves ("bugs about bugs") label Nov 19, 2014
@aturon
Member

aturon commented Nov 19, 2014

cc @jakub- since you've been doing rollups lately.

@aturon
Member

aturon commented Nov 26, 2014

Added run-pass-fulldeps/issue-13560.rs, which seems to be failing intermittently on various completely unrelated branches.

@alexcrichton
Member Author

cc @nodakai, looks like the new wait-forked-but-child-failed test failed recently (linked above). I have a feeling that the [rustc] defunct process is just a zombie from the system in general, but it may mean that the test is somewhat flaky as a result.

@nodakai
Contributor

nodakai commented Dec 6, 2014

@alexcrichton I'll change the logic so that the test examines only the direct children of the test executable. I should have filtered processes by Parent PID rather than Session ID.
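
A minimal sketch of that PPid-based filtering on Linux, written against today's std API (the helper name and the /proc-scanning details are illustrative, not the actual patch):

use std::fs;

// Return the PIDs of the direct children of `parent` by scanning /proc.
// Assumes a Linux-style /proc with a "PPid:" line in /proc/<pid>/status.
fn direct_children(parent: u32) -> Vec<u32> {
    let mut children = Vec::new();
    for entry in fs::read_dir("/proc").into_iter().flatten().flatten() {
        // Only numeric directory names under /proc are processes.
        let pid: u32 = match entry.file_name().to_string_lossy().parse() {
            Ok(pid) => pid,
            Err(_) => continue,
        };
        let status = match fs::read_to_string(format!("/proc/{}/status", pid)) {
            Ok(s) => s,
            Err(_) => continue, // the process may have exited; skip it
        };
        let ppid = status
            .lines()
            .find(|line| line.starts_with("PPid:"))
            .and_then(|line| line.split_whitespace().nth(1))
            .and_then(|s| s.parse().ok());
        if ppid == Some(parent) {
            children.push(pid);
        }
    }
    children
}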

@nodakai
Contributor

nodakai commented Dec 6, 2014

By the way, one reason for the "stdtest deadlocks" seems to be a socket recv(2) without a timeout. I could reproduce such a case under strace(1):

I ran while sleep 1; do strace -f -o /tmp/test-$(date +%H-%M-%S).log ./x86_64-unknown-linux-gnu/stage2/test/stdtest-x86_64-unknown-linux-gnu; done for about an hour until I caught one. (In retrospect, I might as well have set RUST_TEST_TASKS=1 to catch them one by one, but the bug might depend on multithreading.)

ps fx output at that time was:

25174 pts/20   Ss     0:06  \_ /bin/zsh
 9886 pts/20   S+     0:00  |   \_ strace -f -o /tmp/test-03-13-14.log ./x86_64-unknown-linux-gnu/stage2/test/stdtest-x86_64-unknown-linux-gnu
 9891 pts/20   Sl+    0:02  |       \_ ./x86_64-unknown-linux-gnu/stage2/test/stdtest-x86_64-unknown-linux-gnu
10895 pts/20   Z+     0:00  |           \_ [sleep] <defunct>
10921 pts/20   Z+     0:00  |           \_ [stdtest-x86_64-] <defunct>
10971 pts/20   Z+     0:00  |           \_ [stdtest-x86_64-] <defunct>
11001 pts/20   Z+     0:00  |           \_ [stdtest-x86_64-] <defunct>

(Those "defunct" should have been gone with my patch.)

Among the last 10 lines of the above strace log,

11535 _exit(0)                          = ?
11535 +++ exited with 0 +++
9891  <... futex resumed> )             = ? ERESTARTSYS (To be restarted if SA_RESTART is set)
9891  --- SIGINT {si_signo=SIGINT, si_code=SI_KERNEL, si_value={int=1, ptr=0x1}} ---
11031 +++ killed by SIGINT +++
11007 +++ killed by SIGINT +++
10800 +++ killed by SIGINT +++
10793 +++ killed by SIGINT +++
10792 +++ killed by SIGINT +++
9891  +++ killed by SIGINT +++

only the first two had been output before I hit Ctrl-C to terminate the test. So we need to find out from the log why the five threads 11031, 11007, 10800, 10793, and 10792 of the test process 9891 got stuck.

In fact, 10792, 10793, and 10800 are all related. Grepping for them shows:

28336:9891  <... clone resumed> child_stack=0x7f71ebdfdfb0, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, parent_tidptr=0x7f71ebdfe9d0, tls=0x7f71ebdfe700, child_tidptr=0x7f71ebdfe9d0) = 10792
28350:10792 set_robust_list(0x7f71ebdfe9e0, 24 <unfinished ...>
28358:10792 <... set_robust_list resumed> )   = 0
28359:10792 mmap(NULL, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f722c6cc000
28360:10792 sigaltstack({ss_sp=0x7f722c6cc000, ss_flags=0, ss_size=8192}, NULL) = 0
28361:10792 sched_getaffinity(10792, 32, {ff, 0, 0, 0}) = 32
28362:10792 clone(child_stack=0x7f71eafeafb0, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, parent_tidptr=0x7f71eafeb9d0, tls=0x7f71eafeb700, child_tidptr=0x7f71eafeb9d0) = 10793
28363:10792 futex(0x7f722ac2da8c, FUTEX_WAIT_PRIVATE, 1, NULL <unfinished ...>
28366:10793 set_robust_list(0x7f71eafeb9e0, 24 <unfinished ...>
28538:10793 <... set_robust_list resumed> )   = 0
28543:10793 mmap(NULL, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0 <unfinished ...>
28676:10793 <... mmap resumed> )              = 0x7f722c6be000
28679:10793 sigaltstack({ss_sp=0x7f722c6be000, ss_flags=0, ss_size=8192} <unfinished ...>
28683:10793 <... sigaltstack resumed> , NULL) = 0
28686:10793 sched_getaffinity(10793, 32,  <unfinished ...>
28690:10793 <... sched_getaffinity resumed> {ff, 0, 0, 0}) = 32
28696:10793 getcwd( <unfinished ...>
28699:10793 <... getcwd resumed> "/home/nodakai/src/rust-HEAD", 2048) = 28
28703:10793 getcwd( <unfinished ...>
28706:10793 <... getcwd resumed> "/home/nodakai/src/rust-HEAD", 2048) = 28
28710:10793 socket(PF_INET, SOCK_DGRAM, IPPROTO_IP <unfinished ...>
28714:10793 <... socket resumed> )            = 12
28717:10793 bind(12, {sa_family=AF_INET, sin_port=htons(9640), sin_addr=inet_addr("127.0.0.1")}, 16 <unfinished ...>
28720:10793 <... bind resumed> )              = 0
28724:10793 clone( <unfinished ...>
28729:10793 <... clone resumed> child_stack=0x7f71ebbfcfb0, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, parent_tidptr=0x7f71ebbfd9d0, tls=0x7f71ebbfd700, child_tidptr=0x7f71ebbfd9d0) = 10800
28731:10793 select(13, [12], NULL, NULL, {0, 20000} <unfinished ...>
29248:10800 set_robust_list(0x7f71ebbfd9e0, 24 <unfinished ...>
29463:10800 <... set_robust_list resumed> )   = 0
30101:10800 mmap(NULL, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0 <unfinished ...>
30744:10800 <... mmap resumed> )              = 0x7f722c6d2000
31039:10793 <... select resumed> )            = 0 (Timeout)
31041:10793 select(13, [12], NULL, NULL, {0, 0} <unfinished ...>
31043:10793 <... select resumed> )            = 0 (Timeout)
31045:10793 sendto(12, "\0", 1, 0, {sa_family=AF_INET, sin_port=htons(9641), sin_addr=inet_addr("127.0.0.1")}, 16 <unfinished ...>
31047:10793 <... sendto resumed> )            = 1
31049:10793 recvfrom(12,  <unfinished ...>
31291:10800 sigaltstack({ss_sp=0x7f722c6d2000, ss_flags=0, ss_size=8192} <unfinished ...>
31747:10800 <... sigaltstack resumed> , NULL) = 0
32023:10800 sched_getaffinity(10800, 32,  <unfinished ...>
32039:10800 <... sched_getaffinity resumed> {ff, 0, 0, 0}) = 32
32056:10800 socket(PF_INET, SOCK_DGRAM, IPPROTO_IP <unfinished ...>
32062:10800 <... socket resumed> )            = 8
32072:10800 bind(8, {sa_family=AF_INET, sin_port=htons(9641), sin_addr=inet_addr("127.0.0.1")}, 16 <unfinished ...>
32078:10800 <... bind resumed> )              = 0
32082:10800 recvfrom(8,  <unfinished ...>
52140:10792 <... futex resumed> )             = ? ERESTARTSYS (To be restarted if SA_RESTART is set)
52143:10792 futex(0x7f722ac2da8c, FUTEX_WAIT_PRIVATE, 1, NULL <unfinished ...>
63288:10792 <... futex resumed> )             = ? ERESTARTSYS (To be restarted if SA_RESTART is set)
63290:10792 futex(0x7f722ac2da8c, FUTEX_WAIT_PRIVATE, 1, NULL <unfinished ...>
75761:10792 <... futex resumed> )             = ? ERESTARTSYS (To be restarted if SA_RESTART is set)
75765:10792 futex(0x7f722ac2da8c, FUTEX_WAIT_PRIVATE, 1, NULL <unfinished ...>
75802:10792 <... futex resumed> )             = ? ERESTARTSYS (To be restarted if SA_RESTART is set)
75804:10792 futex(0x7f722ac2da8c, FUTEX_WAIT_PRIVATE, 1, NULL <unfinished ...>
88756:10793 <... recvfrom resumed> 0x7f71eafea27f, 1, 0, 0x7f71eafea0b8, 0x7f71eafea0ac) = ? ERESTARTSYS (To be restarted if SA_RESTART is set)
88758:10792 <... futex resumed> )             = ? ERESTARTSYS (To be restarted if SA_RESTART is set)
88761:10792 futex(0x7f722ac2da8c, FUTEX_WAIT_PRIVATE, 1, NULL <unfinished ...>
89145:10793 recvfrom(12,  <unfinished ...>
172341:10800 +++ killed by SIGINT +++
172342:10793 +++ killed by SIGINT +++
172343:10792 +++ killed by SIGINT +++

Here we see that 10793 got stuck at recvfrom(12, <unfinished ...> and 10800 at recvfrom(8, <unfinished ...>; both are SOCK_DGRAM sockets.

Since all tests in libstd/io/net/udp.rs use next_test_ip4() to choose a port to bind a socket to, it's hard to tell which test the ports 9640 and 9641 came from. But thanks to 10793 select(13, [12], NULL, NULL, {0, 20000} <unfinished ...>, we can conclude they came from #[test] fn recv_from_timeout().

I think the combination of rx.recv() followed by send_to(), and tx.send() followed by recv_from(), doesn't actually guarantee that recv_from() is called after send_to(). We need finer control over the timing of the UDP socket calls, perhaps by introducing more channel communication; see the sketch below.
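
A minimal sketch of that ordering fix in today's std API (names and the port-0 bindings are illustrative): binding the receiving socket before spawning the child means the kernel queues the datagram even if the child hasn't reached recv_from() yet, and the channel handshake tightens the ordering further.

use std::net::UdpSocket;
use std::sync::mpsc::channel;
use std::thread;

// Bind the receiver *before* spawning, so send_to() can never race
// ahead of the bind; the channel only narrows the remaining window.
fn recv_from_cannot_miss_the_send() {
    let receiver = UdpSocket::bind("127.0.0.1:0").unwrap(); // bound up front
    let recv_addr = receiver.local_addr().unwrap();
    let (tx, rx) = channel();

    let child = thread::spawn(move || {
        tx.send(()).unwrap(); // tell the parent we are about to block
        let mut buf = [0u8; 1];
        receiver.recv_from(&mut buf).unwrap();
    });

    rx.recv().unwrap(); // wait until the child thread is running
    let sender = UdpSocket::bind("127.0.0.1:0").unwrap();
    sender.send_to(&[0], recv_addr).unwrap();
    child.join().unwrap();
}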

I hope this shows how self-descriptive constants in unit tests help when debugging with strace (that's why I posted #19432).
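
For instance (the tag value here is hypothetical), a distinctive payload makes the sendto()/recvfrom() lines in an strace log trivially greppable, unlike the anonymous "\0" that every test sends:

use std::net::{SocketAddr, UdpSocket};

// Hypothetical tag: shows up verbatim in strace sendto()/recvfrom() lines.
const RECV_FROM_TIMEOUT_TAG: &[u8] = b"recv_from_timeout";

fn send_tagged(sock: &UdpSocket, addr: SocketAddr) {
    sock.send_to(RECV_FROM_TIMEOUT_TAG, addr).unwrap();
}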

@nodakai
Contributor

nodakai commented Dec 6, 2014

The other two are harder: 11007 was blocked indefinitely at 11007 select(31, [25 30], NULL, NULL, NULL <unfinished ...>. By following all the clone(2) calls related to 11007, we can construct the following Linux thread (process) tree:

The test driver 9891
  +- 10998
       +- 11000
            +- 11003
                 +- 11004 (fork) -> exec sleep 1000
                 +- 11007
            +- 11005
                 +- 11006 (fork) -> exec sleep 1000

These sleep 1000 must have come from fn sleeper() in libstd/io/process.rs. There is only one test in process.rs that calls sleeper() twice, so it must be #[test] fn wait_timeout2() that got stuck. Then 11007 must be fn waitpid_helper(), triggered by p.wait() following p.set_timeout(Some(10)).

As for 11031, the thread/process tree is like this:

The test driver 9891
  +- 10991
       +- 10992
            +- 11003
                 +- 11009 (fork) -> exec sleep 1000
                 +- 11031

and it was blocked indefinitely, again at 11031 select(12, [11], NULL, NULL, NULL <unfinished ...>. 11003 calls kill(11009, SIGKILL <unfinished ...> BEFORE kill(11009, SIG_0 <unfinished ...>, and this pattern is unique to #[test] fn test_zero(). So the select(2) call without a timeout parameter must be fn waitpid_helper() again, triggered by assert!(!p.wait().unwrap().success());.

I suspect there might be a bug or a race around HELPER.boot(register_sigchld, waitpid_helper) that made let min = active.iter() ... .min_by(...) evaluate to None.

@nodakai
Contributor

nodakai commented Dec 6, 2014

By the way, I'm against calling them "spurious" failures. We can't guarantee they are not, well, "genuine" failures.

nodakai added a commit to nodakai/rust that referenced this issue Dec 6, 2014
Reported as a part of rust-lang#19120

The logic of rust-lang/rust@74fb798 was flawed: when a CI tool runs the test in parallel with other tasks, they all belong to a single session family, and the test may pick up irrelevant zombie processes before the CI tool reaps them, depending on timing.

Also, calling panic! inside the loop over all children makes the logic simpler.

By not dropping the return values of Command::spawn() until find_zombies() finishes, I believe we can conduct a slightly stricter test.

Signed-off-by: NODA, Kai <nodakai@gmail.com>
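
A sketch of that shape in today's std API (find_zombies() is the test's own helper; here it is only a placeholder comment): keeping the Child handles alive until the scan finishes means the spawned processes are still our own un-reaped children while we look.

use std::process::Command;

// Hold the Child handles across the zombie scan; dropping (or waiting on)
// them earlier would let the children be reaped before we inspect /proc.
fn spawn_scan_then_reap() {
    let children: Vec<_> = (0..4)
        .map(|_| Command::new("true").spawn().unwrap())
        .collect();

    // find_zombies() would scan /proc here, while `children` is still live.

    for mut child in children {
        child.wait().unwrap(); // reap only after the scan
    }
}
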
@alexcrichton
Member Author

@nodakai thank you for quite the thorough investigation! I'll try to home in on these failures soon, but feel free to beat me to it!

alexcrichton added a commit to alexcrichton/rust that referenced this issue Dec 9, 2014
…inder

Reported as a part of rust-lang#19120

The logic of rust-lang/rust@74fb798 was flawed: when a CI tool runs the test in parallel with other tasks, they all belong to a single session family, and the test may pick up irrelevant zombie processes before the CI tool reaps them, depending on timing.
alexcrichton added a commit to alexcrichton/rust that referenced this issue Dec 15, 2014
This test would read with a timeout and then send a UDP message, expecting the
message to be received. The receiving port, however, was bound in the child
thread, so it could be the case that the timeout and send happen before the
child thread runs. To remedy this, we just bind the port before the child thread
runs, moving it into the child later on.

cc rust-lang#19120
brson added a commit that referenced this issue Dec 16, 2014
This test would read with a timeout and then send a UDP message, expecting the
message to be received. The receiving port, however, was bound in the child
thread, so it could be the case that the timeout and send happen before the
child thread runs. To remedy this, we just bind the port before the child thread
runs, moving it into the child later on.

cc #19120
alexcrichton added a commit to alexcrichton/rust that referenced this issue Dec 17, 2014
This test would read with a timeout and then send a UDP message, expecting the
message to be received. The receiving port, however, was bound in the child
thread, so it could be the case that the timeout and send happen before the
child thread runs. To remedy this, we just bind the port before the child thread
runs, moving it into the child later on.

cc rust-lang#19120
@klutzy
Contributor

klutzy commented Jan 7, 2015

#20677 auto-win-32-nopt-t test failed. The only difference between the output list and current src/test/run-pass is unwind-resource, so I assume it hung.

I collected some more "command timed out: 1200 seconds without output, attempting to kill" -> "collect2.exe: error: ld returned 1 exit status" logs. (The latter occurs because the hung process is still alive during the next build.)

All of the hung tests use Thread::spawn().

@alexcrichton
Member Author

@klutzy yes, I've reached the same conclusion, and that set of tests looks quite familiar! When I removed std::task I was hoping this would all "go away" because Thread::spawn implicitly joined, but then we changed it again!

I'll switch those tests over to scoped.
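
A minimal sketch of the scoped idea (shown here with std::thread::scope from today's std; the API of that era was Thread::scoped): a scoped thread is joined before the scope returns, so a test's child thread can never outlive the test.

use std::thread;

// A scoped thread cannot outlive the scope: it is joined before
// thread::scope returns, so no child is left running at process exit.
fn no_straggling_child() {
    thread::scope(|s| {
        s.spawn(|| {
            // test body that previously ran on a detached Thread::spawn
        });
    }); // every spawned thread has been joined by this point
}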

alexcrichton added a commit to alexcrichton/rust that referenced this issue Jan 7, 2015
These tests have all been failing spuriously on Windows from time to time, and
one suspicion is that the child thread outliving the main thread somehow causes
the problem. Switch all the tests over to using Thread::scoped instead of
Thread::spawn to see if it helps the issue.

cc rust-lang#19120
alexcrichton added a commit to alexcrichton/rust that referenced this issue Jan 8, 2015
…ss-spurious

These tests have all been failing spuriously on Windows from time to time, and
one suspicion is that the child thread outliving the main thread somehow causes
the problem. Switch all the tests over to using Thread::scoped instead of
Thread::spawn to see if it helps the issue.

cc rust-lang#19120
@Manishearth
Member

Filed a separate bug for the rather common Windows failure:

#22628

@alexcrichton
Member Author

A new Windows spurious failure I've seen:

  • stdtest deadlocks on Windows once a TCP test fails (logs) (test)

(added to list above)

@Manishearth
Member

From #24053

thread '<unnamed>' panicked at 'assertion failed: !p.wait().unwrap().success()', /Users/rustbuild/src/rust-buildbot/slave/auto-mac-64-nopt-t/build/src/libstd/old_io/process.rs:1139
test old_io::process::tests::test_kill ... FAILED

http://buildbot.rust-lang.org/builders/auto-mac-64-nopt-t/builds/4378

@alexcrichton
Member Author

I think that this bug has outlived its usefulness; any new spurious failures can have follow-up bugs of their own.
