Run ADIOS2 CI without test reruns #3825

Closed
wants to merge 1 commit into from

Conversation

eisenhauer
Member

No description provided.

@scottwittenburg
Collaborator

@eisenhauer Was this just a test to see if test retries are necessary? Looking through some of the failed jobs, it seems there were a handful of failures and timeouts on each that presumably would have passed on a subsequent try.

@scottwittenburg
Collaborator

scottwittenburg commented Sep 27, 2023

it seems there were a handful of failures and timeouts on each that presumably would have passed on a subsequent try.

Actually, that doesn't quite seem to be what happened. Looking at the raw output of one of the tests, at least one test passed several times and then finally failed due to a timeout. Search the log above for Engine.BP.*/BPStepsInSituGlobalArrayParameters.EveryOtherStep/*.BP4.MPI to see this. That makes sense given the description of the REPEAT UNTIL_FAIL argument here: https://cmake.org/cmake/help/latest/command/ctest_test.html

Still, this is a little unnerving, isn't it? If we just run the suite enough times (4, in the case above), some of the MPI tests will eventually fail.
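
For reference, here's a minimal ctest_test() sketch of the two REPEAT modes; the repeat counts, parallel level, and directory variables are just illustrative, not our actual CI configuration:

```cmake
# Minimal ctest_test() sketch of the two REPEAT modes (CMake >= 3.17).
# Illustrative values only; assumes an already-built tree.
cmake_minimum_required(VERSION 3.17)

set(CTEST_SOURCE_DIRECTORY "$ENV{CI_SOURCE_DIR}")  # assumed env vars
set(CTEST_BINARY_DIRECTORY "$ENV{CI_BUILD_DIR}")
set(CTEST_SITE "example-site")
set(CTEST_BUILD_NAME "no-rerun-demo")

ctest_start(Experimental)

# Retry mode: rerun only *failing* tests, report success if any attempt passes.
# This is the behavior that can hide flaky failures.
# ctest_test(PARALLEL_LEVEL 4 REPEAT UNTIL_PASS:3 RETURN_VALUE rv)

# Stress mode: rerun *passing* tests up to 4 times, report failure if any
# attempt fails, matching a test that passes several times and then times out.
ctest_test(PARALLEL_LEVEL 4 REPEAT UNTIL_FAIL:4 RETURN_VALUE rv)
```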

@eisenhauer
Member Author

It is a bit unnerving. I started running this job (submitted automatically every week) out of concern that the test retries might be hiding bugs or race conditions that we should be addressing. The InSitu engine seems to be particularly prone to failures, though it's not the only place we see them. Unfortunately, when I've had time to go looking for failure modes, I've had a hard time reproducing the failures...

@scottwittenburg
Collaborator

because of fear that the test retries might be hiding bugs or race conditions that we should be addressing.

I'm wondering about this too. When I try to run the adios2 tests locally, having built against mpich, I see a ton of tests timing out, but there's no evidence of that in CI. That may be because ctest doesn't print out the failed attempts when we use REPEAT with UNTIL_PASS. I'm currently looking for a way to know whether, and how often, this is happening.
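
One rough way to poke at this locally is to keep rerunning a suspect test until it breaks; a sketch (the build path and test regex are just examples, not anything already in the repo):

```cmake
# flaky_check.cmake; a local-reproduction sketch, not part of ADIOS2.
# Reruns a suspect test until it fails, approximating what hidden CI retries
# would be absorbing. Run with:
#   cmake -DBUILD_DIR=/path/to/adios2-build -P flaky_check.cmake
if(NOT DEFINED BUILD_DIR)
  message(FATAL_ERROR "Pass -DBUILD_DIR=<adios2 build tree>")
endif()

execute_process(
  COMMAND ctest
          -R "BPStepsInSituGlobalArrayParameters"  # regex taken from the log above
          --repeat until-fail:20                   # keep rerunning until it breaks
          --timeout 60                             # fail fast instead of hanging
          --output-on-failure
  WORKING_DIRECTORY "${BUILD_DIR}"
  RESULT_VARIABLE rc)

message(STATUS "ctest exited with ${rc} (non-zero means a repeat failed or timed out)")
```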

Quick note of context: I'm currently trying to speed up mpich builds in CI (see #3616), with one goal of that work being to eventually replace most OpenMPI builds with mpich (see #3617).

@eisenhauer
Member Author

Speeding up CI is a good goal. I don't know that I can offer a lot of insight into the failures, as my attempts to kill them haven't gotten anywhere. (I did kill a few testing bugs early on, but mostly those were multiple tests using the same output filename, which caused issues when they ran concurrently. I haven't been able to blame that for any of the regular no-rerun failures.)

I will, however, fess up to being responsible for some of the longest-running tests. There are some SST tests where we spawn multiple readers and randomly kill old ones or spawn new ones, to make sure the writer will survive such things. (That was the sort of situation where I was worried we might be hiding occasional failures, but I haven't seen evidence of that.) Those tests take minutes, simply because we want to make sure new readers have time to start up, connect, etc. So there are several tests with 300-second timeouts.
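
For anyone curious, the 300-second cap is just the ordinary CTest TIMEOUT property, something along these lines (made-up names, not the real staging-common macros):

```cmake
# Illustrative only: the real staging tests are added through macros in
# staging-common, and the test/target/script names below are made up.
add_test(NAME Staging.KillReaders.SST
         COMMAND ${CMAKE_COMMAND}
                 -DWRITER=$<TARGET_FILE:TestStagingWriter>
                 -DREADER=$<TARGET_FILE:TestStagingReader>
                 -P ${CMAKE_CURRENT_SOURCE_DIR}/run_kill_readers.cmake)

# Give readers time to start, connect, and be killed/respawned, but cap the
# test so a hang fails in CI instead of stalling the whole suite.
set_tests_properties(Staging.KillReaders.SST PROPERTIES TIMEOUT 300)
```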

@scottwittenburg
Collaborator

Thanks for sharing that info. Even though there may be some long-running tests, it seems something else is going on with mpich. The test run with mpich in the name regularly takes around an hour in CI, while the test runs with ompi in the name take roughly half that.

test suite with mpich: https://open.cdash.org/viewTest.php?onlypassed&buildid=9020462
test suite with ompi: https://open.cdash.org/viewTest.php?onlypassed&buildid=9020467

Same OS and compiler, and the ompi suite ran 1276 tests, which is 2 more than the mpich suite. That's odd, but probably explainable and unrelated. More to the point, the mpich tests took 58 minutes, while the ompi tests took 28 minutes.

@scottwittenburg
Collaborator

And I'm wondering if the difference in times could be explained by "invisible retries" in the case of mpich.

@eisenhauer
Member Author

Just to toss out two things that might also play a role. I think that Vicente's MPI data plane in SST is only enabled on MPICH, but when it is enabled it is used by default. If there are startup costs or other differences relative to the sockets-based data plane, that might cause systematic differences between the MPI implementations. Also, the SST tests almost always involve at least one reader and one writer, and those are separate executables (maybe each is an MPI job, maybe a single process). When the MPI implementation is capable of MPMD mode (that is, launching different MPI ranks with different executables), we try to use that because it speeds up the testing. (The alternative is to launch 2 MPI jobs, one for the reader and one for the writer.) The ability to use MPI in MPMD mode might also explain some of the speed differences between MPI implementations.
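
In CTest terms, the two launch styles look roughly like this (placeholder names, not the actual staging-common macros):

```cmake
# Sketch of the two launch styles; test and target names are placeholders.

# MPMD mode: one mpiexec launches writer and reader ranks together.
# Both MPICH and Open MPI accept the ':' separator for this.
add_test(NAME Staging.WriteRead.MPMD
         COMMAND ${MPIEXEC_EXECUTABLE}
                 ${MPIEXEC_NUMPROC_FLAG} 2 $<TARGET_FILE:TestWriter> :
                 ${MPIEXEC_NUMPROC_FLAG} 2 $<TARGET_FILE:TestReader>)

# Fallback: two separate MPI jobs driven by a (hypothetical) helper script
# that starts the writer, then the reader, and waits for both; startup costs
# are paid twice, which is part of why MPMD mode is faster.
add_test(NAME Staging.WriteRead.TwoJobs
         COMMAND ${CMAKE_COMMAND} -P ${CMAKE_CURRENT_SOURCE_DIR}/run_two_jobs.cmake)
```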

@scottwittenburg
Collaborator

Those are two great suggestions, giving me a couple of very welcome new lines of investigation. Thanks 😁

I can see where the variable controlling whether the MPI data plane is included depends on the MPI implementation being mpich, so I'll dig in from that angle a little. Regarding MPMD mode, a quick look suggests both mpich and openmpi support it, so I'll see whether it can be enabled for mpich if it isn't already.
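
For what it's worth, that kind of check is commonly written along these lines; this is only a sketch, not ADIOS2's actual option logic:

```cmake
# Sketch of an implementation check like the one described above. The
# ADIOS2_HAVE_MPI_DP option name is hypothetical, not ADIOS2's real variable.
find_package(MPI REQUIRED COMPONENTS C)

set(HAVE_MPICH FALSE)
if(MPI_C_LIBRARY_VERSION_STRING MATCHES "MPICH")  # set by FindMPI (CMake >= 3.10)
  set(HAVE_MPICH TRUE)
endif()

# Default the SST MPI data plane on only when the implementation looks like MPICH.
option(ADIOS2_HAVE_MPI_DP "Build the SST MPI data plane" ${HAVE_MPICH})
message(STATUS "MPI data plane enabled by default: ${ADIOS2_HAVE_MPI_DP}")
```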

Thanks again @eisenhauer!

@eisenhauer
Member Author

Well, I'm fully cognizant that SST tends to be a problem child and I will apologize for the complexity of the CMake in staging-common. There may be a better way to do what it does...
