Run ADIOS2 CI without test reruns #3825

Closed
wants to merge 1 commit into from

Conversation

eisenhauer
Member

No description provided.

@scottwittenburg
Collaborator

@eisenhauer Was this just a test to see if test retries are necessary? Looking through some of the failed jobs, it seems there were a handful of failures and timeouts on each that presumably would have passed on a subsequent try.

@scottwittenburg
Collaborator

scottwittenburg commented Sep 27, 2023

it seems there were a handful of failures and timeouts on each that presumably would have passed on a subsequent try.

Actually, that doesn't quite seem to be what happened. Looking at the raw output of one of the tests, at least one test passed several times and then finally failed due to a timeout. Search the log above for Engine.BP.*/BPStepsInSituGlobalArrayParameters.EveryOtherStep/*.BP4.MPI to see this. That makes sense given the description of the REPEAT UNTIL_FAIL argument here: https://cmake.org/cmake/help/latest/command/ctest_test.html

Still, this is a little unnerving, isn't it? If we just run the suite enough times (4, in the case above), some of the MPI tests will eventually fail.
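
For reference, here's a minimal ctest_test() sketch of the two REPEAT modes; the repeat counts, parallel level, and directory variables are just illustrative, not our actual CI configuration:

```cmake
# Minimal ctest_test() sketch of the two REPEAT modes (CMake >= 3.17).
# Illustrative values only; assumes an already-built tree.
cmake_minimum_required(VERSION 3.17)

set(CTEST_SOURCE_DIRECTORY "$ENV{CI_SOURCE_DIR}")  # assumed env vars
set(CTEST_BINARY_DIRECTORY "$ENV{CI_BUILD_DIR}")
set(CTEST_SITE "example-site")
set(CTEST_BUILD_NAME "no-rerun-demo")

ctest_start(Experimental)

# Retry mode: rerun only *failing* tests, report success if any attempt passes.
# This is the behavior that can hide flaky failures.
# ctest_test(PARALLEL_LEVEL 4 REPEAT UNTIL_PASS:3 RETURN_VALUE rv)

# Stress mode: rerun *passing* tests up to 4 times, report failure if any
# attempt fails, matching a test that passes several times and then times out.
ctest_test(PARALLEL_LEVEL 4 REPEAT UNTIL_FAIL:4 RETURN_VALUE rv)
```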

@eisenhauer
Member Author

It is a bit unnerving. I started running this job (submitted automatically every week) out of concern that the test retries might be hiding bugs or race conditions that we should be addressing. The InSitu engine seems to be particularly prone to failures, though it's not the only place we see them. Unfortunately, when I've had time to go looking for failure modes, I've had a hard time reproducing the failures...

@scottwittenburg
Collaborator

because of fear that the test retries might be hiding bugs or race conditions that we should be addressing.

I'm wondering about this too. When I try to run the adios2 tests locally, having built against mpich, I see a ton of tests timing out, but there's no evidence of that in CI. That may be because ctest doesn't print out the failed attempts when we use REPEAT with UNTIL_PASS. I'm currently looking for a way to know whether, and how often, this is happening.
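
One rough way to poke at this locally is to keep rerunning a suspect test until it breaks; a sketch (the build path and test regex are just examples, not anything already in the repo):

```cmake
# flaky_check.cmake; a local-reproduction sketch, not part of ADIOS2.
# Reruns a suspect test until it fails, approximating what hidden CI retries
# would be absorbing. Run with:
#   cmake -DBUILD_DIR=/path/to/adios2-build -P flaky_check.cmake
if(NOT DEFINED BUILD_DIR)
  message(FATAL_ERROR "Pass -DBUILD_DIR=<adios2 build tree>")
endif()

execute_process(
  COMMAND ctest
          -R "BPStepsInSituGlobalArrayParameters"  # regex taken from the log above
          --repeat until-fail:20                   # keep rerunning until it breaks
          --timeout 60                             # fail fast instead of hanging
          --output-on-failure
  WORKING_DIRECTORY "${BUILD_DIR}"
  RESULT_VARIABLE rc)

message(STATUS "ctest exited with ${rc} (non-zero means a repeat failed or timed out)")
```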

Quick note of context: I'm currently trying to speed up mpich builds in CI (see #3616), with one goal of that work being to eventually replace most OpenMPI builds with mpich (see #3617).

@eisenhauer
Member Author

Speeding up CI is a good goal. I don't know that I can offer a lot of insight into the failures, as my attempts to kill them haven't gotten anywhere. (I did kill a few testing bugs early on, but mostly those were multiple tests using the same output filename, which caused issues when they ran concurrently. I haven't been able to blame that for any of the regular no-rerun failures.)

I will, however, fess up to being responsible for some of the longest-running tests. There are some SST tests where we spawn multiple readers and randomly kill old ones or spawn new ones, to make sure the writer will survive such things. (That was the sort of situation where I was worried we might be hiding occasional failures, but I haven't seen evidence of that.) Those tests take minutes, simply because we want to make sure new readers have time to start up, connect, etc. So there are several tests with 300-second timeouts.
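
For anyone curious, the 300-second cap is just the ordinary CTest TIMEOUT property, something along these lines (made-up names, not the real staging-common macros):

```cmake
# Illustrative only: the real staging tests are added through macros in
# staging-common, and the test/target/script names below are made up.
add_test(NAME Staging.KillReaders.SST
         COMMAND ${CMAKE_COMMAND}
                 -DWRITER=$<TARGET_FILE:TestStagingWriter>
                 -DREADER=$<TARGET_FILE:TestStagingReader>
                 -P ${CMAKE_CURRENT_SOURCE_DIR}/run_kill_readers.cmake)

# Give readers time to start, connect, and be killed/respawned, but cap the
# test so a hang fails in CI instead of stalling the whole suite.
set_tests_properties(Staging.KillReaders.SST PROPERTIES TIMEOUT 300)
```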

@scottwittenburg
Collaborator

Thanks for sharing that info. Even though there may be some long-running tests, it seems something else is going on with mpich. The test run with mpich in the name regularly takes around an hour in CI, while the test runs with ompi in the name take roughly half that.

test suite with mpich: https://open.cdash.org/viewTest.php?onlypassed&buildid=9020462
test suite with ompi: https://open.cdash.org/viewTest.php?onlypassed&buildid=9020467

Same OS and compiler, and the ompi suite ran 1276 tests, which is 2 more than the mpich suite. That's odd, but probably explainable and unrelated. More to the point, the mpich tests took 58 minutes, while the ompi tests took 28 minutes.

@scottwittenburg
Collaborator

And I'm wondering if the difference in times could be explained by "invisible retries" in the case of mpich.

@eisenhauer
Member Author

Just to toss out two things that might also play a role. I think that Vicente's MPI data plane in SST is only enabled on MPICH, but when it is enabled it is used by default. If there are startup costs or other differences relative to the sockets-based data plane, that might cause systematic differences between the MPI implementations. Also, the SST tests almost always involve at least one reader and one writer, and those are separate executables (maybe each is an MPI job, maybe a single process). When the MPI implementation is capable of MPMD mode (that is, launching different MPI ranks with different executables), we try to use that because it speeds up the testing. (The alternative is to launch 2 MPI jobs, one for the reader and one for the writer.) The ability to use MPI in MPMD mode might also explain some of the speed differences between MPI implementations.
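
In CTest terms, the two launch styles look roughly like this (placeholder names, not the actual staging-common macros):

```cmake
# Sketch of the two launch styles; test and target names are placeholders.

# MPMD mode: one mpiexec launches writer and reader ranks together.
# Both MPICH and Open MPI accept the ':' separator for this.
add_test(NAME Staging.WriteRead.MPMD
         COMMAND ${MPIEXEC_EXECUTABLE}
                 ${MPIEXEC_NUMPROC_FLAG} 2 $<TARGET_FILE:TestWriter> :
                 ${MPIEXEC_NUMPROC_FLAG} 2 $<TARGET_FILE:TestReader>)

# Fallback: two separate MPI jobs driven by a (hypothetical) helper script
# that starts the writer, then the reader, and waits for both; startup costs
# are paid twice, which is part of why MPMD mode is faster.
add_test(NAME Staging.WriteRead.TwoJobs
         COMMAND ${CMAKE_COMMAND} -P ${CMAKE_CURRENT_SOURCE_DIR}/run_two_jobs.cmake)
```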

@scottwittenburg
Collaborator

Those are two great suggestions, giving me a couple of very welcome new lines of investigation. Thanks 😁

I can see where the variable controlling whether the MPI data plane is included depends on the MPI implementation being mpich, so I'll dig in from that angle a little. Regarding MPMD mode, a quick look suggests both mpich and openmpi support it, so I'll see whether it can be enabled for mpich if it isn't already.
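
For what it's worth, that kind of check is commonly written along these lines; this is only a sketch, not ADIOS2's actual option logic:

```cmake
# Sketch of an implementation check like the one described above. The
# ADIOS2_HAVE_MPI_DP option name is hypothetical, not ADIOS2's real variable.
find_package(MPI REQUIRED COMPONENTS C)

set(HAVE_MPICH FALSE)
if(MPI_C_LIBRARY_VERSION_STRING MATCHES "MPICH")  # set by FindMPI (CMake >= 3.10)
  set(HAVE_MPICH TRUE)
endif()

# Default the SST MPI data plane on only when the implementation looks like MPICH.
option(ADIOS2_HAVE_MPI_DP "Build the SST MPI data plane" ${HAVE_MPICH})
message(STATUS "MPI data plane enabled by default: ${ADIOS2_HAVE_MPI_DP}")
```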

Thanks again @eisenhauer!

@eisenhauer
Member Author

Well, I'm fully cognizant that SST tends to be a problem child and I will apologize for the complexity of the CMake in staging-common. There may be a better way to do what it does...
