
Spurious test failures #19120

Closed
9 of 13 tasks
alexcrichton opened this issue Nov 19, 2014 · 14 comments
Labels
metabug Issues about issues themselves ("bugs about bugs")

Comments

@alexcrichton
Member

Our builders are susceptible to a lot of spurious test failures, and I've been keeping this knowledge in my head for far too long, so I'd like to write it down. I'm writing this as a metabug, but I expect other bugs to be spun off from it as well. Sometimes it's just a lot easier to add a bullet than to file a whole new bug!

If you add to this, please don't link directly to buildbot logs, as they disappear over time. Instead, please link to a gist with the relevant information pasted in. I also typically peruse http://buildbot.rust-lang.org/grid?branch=auto&width=10 for build failures.

Spurious test failures

  • stdtest deadlocks on Windows once a TCP test fails (logs) (test)
  • stdtest spuriously segfaults on FreeBSD (logs). The segfaults decreased in frequency when FreeBSD was upgraded; it may just need to be upgraded again.
  • stdtest deadlocks (logs)
    • thread_local::dtors_in_dtors_in_dtors deadlocks on OSX (logs)
  • io::process::tests::test_kill spuriously fails (logs)
  • run-pass-fulldeps/issue-13560.rs (logs)
  • wait-forked-but-child-failed (logs)
  • src/test/run-pass/issue-12684 (logs)

Spurious failure behavior

@alexcrichton alexcrichton added the metabug Issues about issues themselves ("bugs about bugs") label Nov 19, 2014
@aturon
Member

aturon commented Nov 19, 2014

cc @jakub- since you've been doing rollups lately.

@aturon
Member

aturon commented Nov 26, 2014

Added run-pass-fulldeps/issue-13560.rs, which seems to be failing intermittently on various completely unrelated branches.

@alexcrichton
Member Author

cc @nodakai, looks like the new wait-forked-but-child-failed test failed recently (linked above). I have a feeling that the [rustc] defunct process is just a zombie from the system in general, but it may mean that the test is somewhat flaky as a result.

@nodakai
Contributor

nodakai commented Dec 6, 2014

@alexcrichton I'll change the logic so that the test examines only the direct children of the test executable. I should have filtered processes by Parent PID rather than Session ID.
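
A minimal sketch of that PPid-based filtering on Linux, written against today's std API (the helper name and the /proc-scanning details are illustrative, not the actual patch):

use std::fs;

// Return the PIDs of the direct children of `parent` by scanning /proc.
// Assumes a Linux-style /proc with a "PPid:" line in /proc/<pid>/status.
fn direct_children(parent: u32) -> Vec<u32> {
    let mut children = Vec::new();
    for entry in fs::read_dir("/proc").into_iter().flatten().flatten() {
        // Only numeric directory names under /proc are processes.
        let pid: u32 = match entry.file_name().to_string_lossy().parse() {
            Ok(pid) => pid,
            Err(_) => continue,
        };
        let status = match fs::read_to_string(format!("/proc/{}/status", pid)) {
            Ok(s) => s,
            Err(_) => continue, // the process may have exited; skip it
        };
        let ppid = status
            .lines()
            .find(|line| line.starts_with("PPid:"))
            .and_then(|line| line.split_whitespace().nth(1))
            .and_then(|s| s.parse().ok());
        if ppid == Some(parent) {
            children.push(pid);
        }
    }
    children
}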

@nodakai
Contributor

nodakai commented Dec 6, 2014

By the way, one reason for the "stdtest deadlocks" seems to be a socket recv(2) without a timeout. I could reproduce such a case under strace(1):

I ran while sleep 1; do strace -f -o /tmp/test-$(date +%H-%M-%S).log ./x86_64-unknown-linux-gnu/stage2/test/stdtest-x86_64-unknown-linux-gnu; done for about an hour until I caught one. (In retrospect, I might as well have set RUST_TEST_TASKS=1 to catch them one by one, but the bug might depend on multithreading.)

ps fx output at that time was:

25174 pts/20   Ss     0:06  \_ /bin/zsh
 9886 pts/20   S+     0:00  |   \_ strace -f -o /tmp/test-03-13-14.log ./x86_64-unknown-linux-gnu/stage2/test/stdtest-x86_64-unknown-linux-gnu
 9891 pts/20   Sl+    0:02  |       \_ ./x86_64-unknown-linux-gnu/stage2/test/stdtest-x86_64-unknown-linux-gnu
10895 pts/20   Z+     0:00  |           \_ [sleep] <defunct>
10921 pts/20   Z+     0:00  |           \_ [stdtest-x86_64-] <defunct>
10971 pts/20   Z+     0:00  |           \_ [stdtest-x86_64-] <defunct>
11001 pts/20   Z+     0:00  |           \_ [stdtest-x86_64-] <defunct>

(Those "defunct" should have been gone with my patch.)

Among the last 10 lines of the above strace log,

11535 _exit(0)                          = ?
11535 +++ exited with 0 +++
9891  <... futex resumed> )             = ? ERESTARTSYS (To be restarted if SA_RESTART is set)
9891  --- SIGINT {si_signo=SIGINT, si_code=SI_KERNEL, si_value={int=1, ptr=0x1}} ---
11031 +++ killed by SIGINT +++
11007 +++ killed by SIGINT +++
10800 +++ killed by SIGINT +++
10793 +++ killed by SIGINT +++
10792 +++ killed by SIGINT +++
9891  +++ killed by SIGINT +++

only the first two had been output before I hit Ctrl-C to terminate the test. So we need to find out from the log why the five threads 11031, 11007, 10800, 10793, and 10792 of the test process 9891 got stuck.

In fact, 10792, 10793, and 10800 are all related. Grepping for them shows:

28336:9891  <... clone resumed> child_stack=0x7f71ebdfdfb0, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, parent_tidptr=0x7f71ebdfe9d0, tls=0x7f71ebdfe700, child_tidptr=0x7f71ebdfe9d0) = 10792
28350:10792 set_robust_list(0x7f71ebdfe9e0, 24 <unfinished ...>
28358:10792 <... set_robust_list resumed> )   = 0
28359:10792 mmap(NULL, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f722c6cc000
28360:10792 sigaltstack({ss_sp=0x7f722c6cc000, ss_flags=0, ss_size=8192}, NULL) = 0
28361:10792 sched_getaffinity(10792, 32, {ff, 0, 0, 0}) = 32
28362:10792 clone(child_stack=0x7f71eafeafb0, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, parent_tidptr=0x7f71eafeb9d0, tls=0x7f71eafeb700, child_tidptr=0x7f71eafeb9d0) = 10793
28363:10792 futex(0x7f722ac2da8c, FUTEX_WAIT_PRIVATE, 1, NULL <unfinished ...>
28366:10793 set_robust_list(0x7f71eafeb9e0, 24 <unfinished ...>
28538:10793 <... set_robust_list resumed> )   = 0
28543:10793 mmap(NULL, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0 <unfinished ...>
28676:10793 <... mmap resumed> )              = 0x7f722c6be000
28679:10793 sigaltstack({ss_sp=0x7f722c6be000, ss_flags=0, ss_size=8192} <unfinished ...>
28683:10793 <... sigaltstack resumed> , NULL) = 0
28686:10793 sched_getaffinity(10793, 32,  <unfinished ...>
28690:10793 <... sched_getaffinity resumed> {ff, 0, 0, 0}) = 32
28696:10793 getcwd( <unfinished ...>
28699:10793 <... getcwd resumed> "/home/nodakai/src/rust-HEAD", 2048) = 28
28703:10793 getcwd( <unfinished ...>
28706:10793 <... getcwd resumed> "/home/nodakai/src/rust-HEAD", 2048) = 28
28710:10793 socket(PF_INET, SOCK_DGRAM, IPPROTO_IP <unfinished ...>
28714:10793 <... socket resumed> )            = 12
28717:10793 bind(12, {sa_family=AF_INET, sin_port=htons(9640), sin_addr=inet_addr("127.0.0.1")}, 16 <unfinished ...>
28720:10793 <... bind resumed> )              = 0
28724:10793 clone( <unfinished ...>
28729:10793 <... clone resumed> child_stack=0x7f71ebbfcfb0, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, parent_tidptr=0x7f71ebbfd9d0, tls=0x7f71ebbfd700, child_tidptr=0x7f71ebbfd9d0) = 10800
28731:10793 select(13, [12], NULL, NULL, {0, 20000} <unfinished ...>
29248:10800 set_robust_list(0x7f71ebbfd9e0, 24 <unfinished ...>
29463:10800 <... set_robust_list resumed> )   = 0
30101:10800 mmap(NULL, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0 <unfinished ...>
30744:10800 <... mmap resumed> )              = 0x7f722c6d2000
31039:10793 <... select resumed> )            = 0 (Timeout)
31041:10793 select(13, [12], NULL, NULL, {0, 0} <unfinished ...>
31043:10793 <... select resumed> )            = 0 (Timeout)
31045:10793 sendto(12, "\0", 1, 0, {sa_family=AF_INET, sin_port=htons(9641), sin_addr=inet_addr("127.0.0.1")}, 16 <unfinished ...>
31047:10793 <... sendto resumed> )            = 1
31049:10793 recvfrom(12,  <unfinished ...>
31291:10800 sigaltstack({ss_sp=0x7f722c6d2000, ss_flags=0, ss_size=8192} <unfinished ...>
31747:10800 <... sigaltstack resumed> , NULL) = 0
32023:10800 sched_getaffinity(10800, 32,  <unfinished ...>
32039:10800 <... sched_getaffinity resumed> {ff, 0, 0, 0}) = 32
32056:10800 socket(PF_INET, SOCK_DGRAM, IPPROTO_IP <unfinished ...>
32062:10800 <... socket resumed> )            = 8
32072:10800 bind(8, {sa_family=AF_INET, sin_port=htons(9641), sin_addr=inet_addr("127.0.0.1")}, 16 <unfinished ...>
32078:10800 <... bind resumed> )              = 0
32082:10800 recvfrom(8,  <unfinished ...>
52140:10792 <... futex resumed> )             = ? ERESTARTSYS (To be restarted if SA_RESTART is set)
52143:10792 futex(0x7f722ac2da8c, FUTEX_WAIT_PRIVATE, 1, NULL <unfinished ...>
63288:10792 <... futex resumed> )             = ? ERESTARTSYS (To be restarted if SA_RESTART is set)
63290:10792 futex(0x7f722ac2da8c, FUTEX_WAIT_PRIVATE, 1, NULL <unfinished ...>
75761:10792 <... futex resumed> )             = ? ERESTARTSYS (To be restarted if SA_RESTART is set)
75765:10792 futex(0x7f722ac2da8c, FUTEX_WAIT_PRIVATE, 1, NULL <unfinished ...>
75802:10792 <... futex resumed> )             = ? ERESTARTSYS (To be restarted if SA_RESTART is set)
75804:10792 futex(0x7f722ac2da8c, FUTEX_WAIT_PRIVATE, 1, NULL <unfinished ...>
88756:10793 <... recvfrom resumed> 0x7f71eafea27f, 1, 0, 0x7f71eafea0b8, 0x7f71eafea0ac) = ? ERESTARTSYS (To be restarted if SA_RESTART is set)
88758:10792 <... futex resumed> )             = ? ERESTARTSYS (To be restarted if SA_RESTART is set)
88761:10792 futex(0x7f722ac2da8c, FUTEX_WAIT_PRIVATE, 1, NULL <unfinished ...>
89145:10793 recvfrom(12,  <unfinished ...>
172341:10800 +++ killed by SIGINT +++
172342:10793 +++ killed by SIGINT +++
172343:10792 +++ killed by SIGINT +++

Here we see that 10793 got stuck at recvfrom(12, <unfinished ...> and 10800 at recvfrom(8, <unfinished ...>; both are SOCK_DGRAM sockets.

Since all tests in libstd/io/net/udp.rs use next_test_ip4() to choose a port to bind a socket to, it's hard to tell which test the ports 9640 and 9641 came from. But thanks to 10793 select(13, [12], NULL, NULL, {0, 20000} <unfinished ...>, we can conclude they came from #[test] fn recv_from_timeout().

I think the combination of rx.recv() followed by send_to(), and tx.send() followed by recv_from(), doesn't actually guarantee that recv_from() is called after send_to(). We need finer control over the timing of the UDP socket calls, perhaps by introducing more channel communication; see the sketch below.
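
A minimal sketch of that ordering fix in today's std API (names and the port-0 bindings are illustrative): binding the receiving socket before spawning the child means the kernel queues the datagram even if the child hasn't reached recv_from() yet, and the channel handshake tightens the ordering further.

use std::net::UdpSocket;
use std::sync::mpsc::channel;
use std::thread;

// Bind the receiver *before* spawning, so send_to() can never race
// ahead of the bind; the channel only narrows the remaining window.
fn recv_from_cannot_miss_the_send() {
    let receiver = UdpSocket::bind("127.0.0.1:0").unwrap(); // bound up front
    let recv_addr = receiver.local_addr().unwrap();
    let (tx, rx) = channel();

    let child = thread::spawn(move || {
        tx.send(()).unwrap(); // tell the parent we are about to block
        let mut buf = [0u8; 1];
        receiver.recv_from(&mut buf).unwrap();
    });

    rx.recv().unwrap(); // wait until the child thread is running
    let sender = UdpSocket::bind("127.0.0.1:0").unwrap();
    sender.send_to(&[0], recv_addr).unwrap();
    child.join().unwrap();
}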

I hope this shows how self-descriptive constants in unit tests help when debugging with strace (that's why I posted #19432).
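
For instance (the tag value here is hypothetical), a distinctive payload makes the sendto()/recvfrom() lines in an strace log trivially greppable, unlike the anonymous "\0" that every test sends:

use std::net::{SocketAddr, UdpSocket};

// Hypothetical tag: shows up verbatim in strace sendto()/recvfrom() lines.
const RECV_FROM_TIMEOUT_TAG: &[u8] = b"recv_from_timeout";

fn send_tagged(sock: &UdpSocket, addr: SocketAddr) {
    sock.send_to(RECV_FROM_TIMEOUT_TAG, addr).unwrap();
}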

@nodakai
Contributor

nodakai commented Dec 6, 2014

The other two are harder: 11007 was blocked indefinitely at 11007 select(31, [25 30], NULL, NULL, NULL <unfinished ...>. By following all the clone(2) calls related to 11007, we can construct the following Linux thread (process) tree:

The test driver 9891
  +- 10998
       +- 11000
            +- 11003
                 +- 11004 (fork) -> exec sleep 1000
                 +- 11007
            +- 11005
                 +- 11006 (fork) -> exec sleep 1000

These sleep 1000 must have come from fn sleeper() in libstd/io/process.rs. There is only one test in process.rs that calls sleeper() twice, so it must be #[test] fn wait_timeout2() that got stuck. Then 11007 must be fn waitpid_helper(), triggered by p.wait() following p.set_timeout(Some(10)).

As for 11031, the thread/process tree is like this:

The test driver 9891
  +- 10991
       +- 10992
            +- 11003
                 +- 11009 (fork) -> exec sleep 1000
                 +- 11031

and it was blocked indefinitely, again at 11031 select(12, [11], NULL, NULL, NULL <unfinished ...>. 11003 calls kill(11009, SIGKILL <unfinished ...> BEFORE kill(11009, SIG_0 <unfinished ...>, and this pattern is unique to #[test] fn test_zero(). So the select(2) call without a timeout parameter must be fn waitpid_helper() again, triggered by assert!(!p.wait().unwrap().success());.

I suspect there might be a bug or a race around HELPER.boot(register_sigchld, waitpid_helper) that made let min = active.iter() ... .min_by(...) evaluate to None.

@nodakai
Contributor

nodakai commented Dec 6, 2014

By the way, I'm against calling them "spurious" failures. We can't guarantee they are not, well, "genuine" failures.

nodakai added a commit to nodakai/rust that referenced this issue Dec 6, 2014
Reported as a part of rust-lang#19120

The logic of rust-lang/rust@74fb798 was flawed: when a CI tool runs the test in parallel with other tasks, they all belong to a single session family, and the test may pick up irrelevant zombie processes before the CI tool reaps them, depending on timing.

Also, calling panic! inside the loop over all children makes the logic simpler.

By not dropping the return values of Command::spawn() until find_zombies() finishes, I believe we can conduct a slightly stricter test.

Signed-off-by: NODA, Kai <nodakai@gmail.com>
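
A sketch of that shape in today's std API (find_zombies() is the test's own helper; here it is only a placeholder comment): keeping the Child handles alive until the scan finishes means the spawned processes are still our own un-reaped children while we look.

use std::process::Command;

// Hold the Child handles across the zombie scan; dropping (or waiting on)
// them earlier would let the children be reaped before we inspect /proc.
fn spawn_scan_then_reap() {
    let children: Vec<_> = (0..4)
        .map(|_| Command::new("true").spawn().unwrap())
        .collect();

    // find_zombies() would scan /proc here, while `children` is still live.

    for mut child in children {
        child.wait().unwrap(); // reap only after the scan
    }
}
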
@alexcrichton
Member Author

@nodakai thank you for quite the thorough investigation! I'll try to home in on these failures soon, but feel free to beat me to it!

alexcrichton added a commit to alexcrichton/rust that referenced this issue Dec 9, 2014
…inder

Reported as a part of rust-lang#19120

The logic of rust-lang/rust@74fb798 was flawed: when a CI tool runs the test in parallel with other tasks, they all belong to a single session family, and the test may pick up irrelevant zombie processes before the CI tool reaps them, depending on timing.
alexcrichton added a commit to alexcrichton/rust that referenced this issue Dec 15, 2014
This test would read with a timeout and then send a UDP message, expecting the
message to be received. The receiving port, however, was bound in the child
thread, so it could be the case that the timeout and send happen before the
child thread runs. To remedy this, we just bind the port before the child thread
runs, moving it into the child later on.

cc rust-lang#19120
brson added a commit that referenced this issue Dec 16, 2014
This test would read with a timeout and then send a UDP message, expecting the
message to be received. The receiving port, however, was bound in the child
thread, so it could be the case that the timeout and send happen before the
child thread runs. To remedy this, we just bind the port before the child thread
runs, moving it into the child later on.

cc #19120
alexcrichton added a commit to alexcrichton/rust that referenced this issue Dec 17, 2014
This test would read with a timeout and then send a UDP message, expecting the
message to be received. The receiving port, however, was bound in the child
thread, so it could be the case that the timeout and send happen before the
child thread runs. To remedy this, we just bind the port before the child thread
runs, moving it into the child later on.

cc rust-lang#19120
@klutzy
Contributor

klutzy commented Jan 7, 2015

#20677 auto-win-32-nopt-t test failed. The only difference between the output list and current src/test/run-pass is unwind-resource, so I assume it hung.

I collected some more "command timed out: 1200 seconds without output, attempting to kill" -> "collect2.exe: error: ld returned 1 exit status" logs. (The latter occurs because the hung process is still alive during the next build.)

All of the hung tests use Thread::spawn().

@alexcrichton
Member Author

@klutzy yes, I've reached the same conclusion, and that set of tests looks quite familiar! When I removed std::task I was hoping this would all "go away" because Thread::spawn implicitly joined, but then we changed it again!

I'll switch those tests over to scoped.
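
A minimal sketch of the scoped idea (shown here with std::thread::scope from today's std; the API of that era was Thread::scoped): a scoped thread is joined before the scope returns, so a test's child thread can never outlive the test.

use std::thread;

// A scoped thread cannot outlive the scope: it is joined before
// thread::scope returns, so no child is left running at process exit.
fn no_straggling_child() {
    thread::scope(|s| {
        s.spawn(|| {
            // test body that previously ran on a detached Thread::spawn
        });
    }); // every spawned thread has been joined by this point
}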

alexcrichton added a commit to alexcrichton/rust that referenced this issue Jan 7, 2015
These tests have all been failing spuriously on Windows from time to time, and
one suspicion is that the child thread outliving the main thread somehow causes
the problem. Switch all the tests over to using Thread::scoped instead of
Thread::spawn to see if it helps the issue.

cc rust-lang#19120
alexcrichton added a commit to alexcrichton/rust that referenced this issue Jan 8, 2015
…ss-spurious

These tests have all been failing spuriously on Windows from time to time, and
one suspicion is that the child thread outliving the main thread somehow causes
the problem. Switch all the tests over to using Thread::scoped instead of
Thread::spawn to see if it helps the issue.

cc rust-lang#19120
@Manishearth
Member

Filed a separate bug for the rather common Windows failure:

#22628

@alexcrichton
Member Author

A new Windows spurious failure I've seen:

  • stdtest deadlocks on Windows once a TCP test fails (logs) (test)

(added to list above)

@Manishearth
Member

From #24053

thread '<unnamed>' panicked at 'assertion failed: !p.wait().unwrap().success()', /Users/rustbuild/src/rust-buildbot/slave/auto-mac-64-nopt-t/build/src/libstd/old_io/process.rs:1139
test old_io::process::tests::test_kill ... FAILED

http://buildbot.rust-lang.org/builders/auto-mac-64-nopt-t/builds/4378

@alexcrichton
Member Author

I think that this bug has outlived its usefulness; any new spurious failures can have follow-up bugs of their own.
