Run ESPResSo tests in parallel #18517
Conversation
@boegelbot please test @ generoso
@casparvl: Request for testing this PR well received on login1
PR test command '
Test results coming soon (I hope)...
- notification for comment with ID 1671346565 processed
Message to humans: this is just bookkeeping information for me,
@boegelbot please test @ jsc-zen2
@casparvl: Request for testing this PR well received on jsczen2l1.int.jsc-zen2.easybuild-test.cluster
PR test command '
Test results coming soon (I hope)...
- notification for comment with ID 1671349638 processed
Message to humans: this is just bookkeeping information for me,
Test report by @boegelbot
Test report by @boegelbot
Test report by @casparvl
lgtm
Test report by @boegel
@casparvl As my failing test report shows, this is making the tests fail for me, for some reason... Without these changes,
@jngrad Any idea why some tests fail with a timeout when they're being run in parallel, while the tests pass when being run sequentially?
These are statistical tests. They are CPU-intensive and slow down every other test running concurrently. In our CI pipelines, we run
If compute time is not an issue, you could add the CMake option
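(The specific CMake option referred to above is cut off in this thread. Purely as a hedged illustration of the general idea, i.e. longer per-test timeouts and limited test concurrency, and assuming the easyconfig's test step ends up invoking `ctest` directly, something along these lines could be passed via the standard `testopts` easyconfig parameter; this is a sketch, not the change made in this PR:)

```python
# illustrative sketch only (assumes the test command is ctest):
# give each test more headroom and cap how many tests run at the same time
testopts = "--timeout 1200 --output-on-failure -j 2"
```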
Ok, figured it out (I think):
As a result of (2) and (3), all 4 MPI ranks get bound to the first core, meaning you're heavily oversubscribing. Worse, if you enable parallelism, you're launching multiple MPI-based tests in parallel, each of which results in 4 processes that get bound to the same core. I understand why
The most foolproof and reasonably performant way is probably to disable Open MPI binding and leave it up to the OS. I'm testing now with setting
Much more reasonable timing:
I'll upload some new test reports.
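(For reference, one way to relax Open MPI's default binding during the test step is sketched below; this illustrates the idea being tested, not necessarily the exact variable that ended up being set, since that part of the comment is cut off above:)

```python
# sketch: tell Open MPI not to bind ranks to cores during the test step,
# leaving process placement to the OS scheduler instead
# (hwloc_base_binding_policy is a standard Open MPI MCA parameter)
pretestopts = "export OMPI_MCA_hwloc_base_binding_policy=none && "
```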
@boegelbot please test @ generoso
@casparvl: Request for testing this PR well received on login1
PR test command '
Test results coming soon (I hope)...
- notification for comment with ID 1674626326 processed
Message to humans: this is just bookkeeping information for me,
The
Just completed the tests for the
What struck me was that one test (
Test report by @boegelbot
We set `pretestopts = "unset OMP_PROC_BIND && " + local_OMPI_test_vars`
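(For context, a hedged sketch of how that snippet might sit in the easyconfig; the actual definition of `local_OMPI_test_vars` is not shown in this thread, so the value below is an assumption:)

```python
# assumption: local_OMPI_test_vars holds Open MPI settings for the test step,
# e.g. allowing oversubscription; its real contents are not shown in this thread
local_OMPI_test_vars = "export OMPI_MCA_rmaps_base_oversubscribe=true && "
# unset OMP_PROC_BIND so OpenMP thread pinning does not interfere on top of it
pretestopts = "unset OMP_PROC_BIND && " + local_OMPI_test_vars
```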
Test report by @casparvl
Test report by @casparvl
Discussed with @boegel today. We'll close this for now, as it's not worth investing more time in: parallel tests are a nice-to-have, not a necessity. As a final note: the last test result shows that it can succeed. The failure for
Which could potentially be a race condition in creating that directory (multiple ranks trying to create the same dir?). I'm not sure, and as said, it's not really worth digging further into for now. I'll close this PR; if we want, we can pick it up later in a new PR, or re-open this one. Edit: I'm unable to close the PR. It tells me 'you can't comment at this time'. If someone can close it: please do :)
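(To illustrate the directory-creation race speculated about above: if several ranks try to create the same directory, a creation call that tolerates an already-existing directory sidesteps the failure. A minimal Python sketch; the path is hypothetical:)

```python
import os

# several ranks may reach this point simultaneously; exist_ok=True makes the
# call succeed even when another rank has already created the directory
os.makedirs("/tmp/espresso_test_output", exist_ok=True)  # hypothetical path
```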
From @jngrad:
I'll check if
@boegelbot please test @ generoso
@casparvl: Request for testing this PR well received on login1
PR test command '
Test results coming soon (I hope)...
- notification for comment with ID 1706989878 processed
Message to humans: this is just bookkeeping information for me,
Test report by @boegelbot
Test report by @casparvl
Test report by @casparvl
Test report by @casparvl
Most errors seem to be file permission errors (some conflict between different workers, maybe?), but there are some other errors that are not shown in the gist:
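(One plausible source of such conflicts, mentioned here purely as an assumption, is concurrently scheduled tests writing to a shared, fixed path; giving each test its own scratch directory avoids that. A minimal Python sketch with hypothetical names:)

```python
import os
import tempfile

# sketch: per-test scratch directory, so concurrently running tests cannot
# collide on the same files and trip over ownership/permission checks
scratch = tempfile.mkdtemp(prefix="espresso_test_")
with open(os.path.join(scratch, "output.dat"), "w") as fh:
    fh.write("per-test output goes here\n")
```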
Increasing the line count to a value larger than 500 would help. CTest will often schedule the
Test report by @casparvl
@casparvl Didn't we conclude here that running the tests in parallel isn't going to work out? |
Closing this. Running tests in parallel causes too many issues for now, not worth the effort. We might pick up from here later. |
Small update to #18486 and #18485
(created using `eb --new-pr`)