
Run ESPResSo tests in parallel #18517

Closed

Conversation

@casparvl (Contributor) commented Aug 9, 2023

Small update to #18486 and #18485
(created using eb --new-pr)

@casparvl (Contributor, Author) commented Aug 9, 2023

@boegelbot please test @ generoso

@boegelbot (Collaborator)

@casparvl: Request for testing this PR well received on login1

PR test command 'EB_PR=18517 EB_ARGS= EB_CONTAINER= /opt/software/slurm/bin/sbatch --job-name test_PR_18517 --ntasks=4 ~/boegelbot/eb_from_pr_upload_generoso.sh' executed!

  • exit code: 0
  • output:
Submitted batch job 11411

Test results coming soon (I hope)...

- notification for comment with ID 1671346565 processed

Message to humans: this is just bookkeeping information for me,
it is of no use to you (unless you think I have a bug, which I don't).

@casparvl (Contributor, Author) commented Aug 9, 2023

@boegelbot please test @ jsc-zen2

@casparvl added the change label Aug 9, 2023
@boegelbot (Collaborator)

@casparvl: Request for testing this PR well received on jsczen2l1.int.jsc-zen2.easybuild-test.cluster

PR test command 'EB_PR=18517 EB_ARGS= /opt/software/slurm/bin/sbatch --mem-per-cpu=4000M --job-name test_PR_18517 --ntasks=8 ~/boegelbot/eb_from_pr_upload_jsc-zen2.sh' executed!

  • exit code: 0
  • output:
Submitted batch job 3110

Test results coming soon (I hope)...

- notification for comment with ID 1671349638 processed

Message to humans: this is just bookkeeping information for me,
it is of no use to you (unless you think I have a bug, which I don't).

@boegelbot (Collaborator)

Test report by @boegelbot
SUCCESS
Build succeeded for 4 out of 4 (4 easyconfigs in total)
jsczen2c1.int.jsc-zen2.easybuild-test.cluster - Linux Rocky Linux 8.5, x86_64, AMD EPYC 7742 64-Core Processor (zen2), Python 3.6.8
See https://gist.github.com/boegelbot/ca2d4f36475d1db298497bec18a68ec9 for a full test report.

@boegelbot (Collaborator)

Test report by @boegelbot
SUCCESS
Build succeeded for 4 out of 4 (4 easyconfigs in total)
cns1 - Linux Rocky Linux 8.5, x86_64, Intel(R) Xeon(R) CPU E5-2667 v3 @ 3.20GHz (haswell), Python 3.6.8
See https://gist.github.com/boegelbot/c5bfcd1b2485e8489ffba5bf4e01e5de for a full test report.

@casparvl (Contributor, Author) commented Aug 9, 2023

Test report by @casparvl
SUCCESS
Build succeeded for 4 out of 4 (4 easyconfigs in total)
tcn1.local.snellius.surf.nl - Linux RHEL 8.6, x86_64, AMD EPYC 7H12 64-Core Processor, Python 3.6.8
See https://gist.github.com/casparvl/d9225724f63c60b5370488b75fc5b856 for a full test report.

@boegel added this to the next release (4.8.1?) milestone Aug 10, 2023
@boegel (Member) previously approved these changes Aug 10, 2023

lgtm

@boegel (Member) commented Aug 10, 2023

Test report by @boegel
FAILED
Build succeeded for 0 out of 2 (2 easyconfigs in total)
node3107.skitty.os - Linux RHEL 8.6, x86_64, Intel(R) Xeon(R) Gold 6140 CPU @ 2.30GHz (skylake_avx512), Python 3.6.8
See https://gist.github.com/boegel/df11bbf8993646ddcce2609ab37593ad for a full test report.

@boegel (Member) commented Aug 10, 2023

@casparvl As my failing test report shows, this is making the tests fail for me, for some reason...

Without these changes, ESPResSo-4.2.1-foss-2022a.eb and ESPResSo-4.2.1-foss-2021a.eb work just fine (and take ~27-29 min on 18 cores in total).

@boegel (Member) commented Aug 10, 2023

@jngrad Any idea why some tests fail with a timeout when they're being run in parallel, while the tests pass when being run sequentially?

98% tests passed, 4 tests failed out of 201

Label Time Summary:
gpu             = 2274.63 sec*proc (56 tests)
long            = 4324.15 sec*proc (13 tests)
parallel        = 1546.16 sec*proc (45 tests)
parallel_odd    =  18.18 sec*proc (1 test)

Total Test time (real) = 1030.51 sec

The following tests FAILED:
	 76 - constant_pH_stats (Timeout)
	101 - integrator_npt_stats (Timeout)
	116 - dpd_stats (Timeout)
	124 - collision_detection (Timeout)
Errors while running CTest

@jngrad commented Aug 10, 2023

These are statistical tests. They are CPU-intensive and slow down every other test running concurrently. In our CI pipelines, we run make check_python_skip_long in instrumentation builds (code coverage, sanitizers, etc.) to skip these statistical tests, since they would otherwise time out. We run make check_python in release builds only.

If compute time is not an issue, you could add the CMake option -DTEST_TIMEOUT=1200 to increase the wall-time limit.
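
A minimal sketch of how that suggestion could be passed through on the EasyBuild side, assuming the standard configopts parameter of a CMake-based easyconfig (this is not part of the PR itself):

# hedged sketch: raise the test timeout suggested above via the easyconfig
configopts = '-DTEST_TIMEOUT=1200 '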

@casparvl (Contributor, Author)

Ok, figured it out (I think):

  1. ESPResSo typically runs with 4 MPI ranks
  2. OpenMPI's default binding behaviour is binding to core for >= 2 processes
  3. ESPResSo runs with --oversubscribe

As a result of (2) and (3), all 4 MPI ranks get bound to the first core, meaning you're heavily oversubscribing. Worse, if you enable parallelism, you're launching multiple MPI-based tests in parallel, each of which results in 4 processes bound to that same core.

I understand why --oversubscribe was used, as otherwise tests might not run if MPI thinks there are too few slots available. I actually run into this occasionally with other software when I create an interactive allocation with 1 task (I only want one bash shell).

The most foolproof and still reasonably performant approach is probably to disable OpenMPI binding and leave placement up to the OS. I'm testing now with setting OMPI_MCA_hwloc_base_binding_policy=none before running the tests. I'll do this both for check_unit_tests and check_python, even though I think currently only the latter runs MPI tests. The environment variable doesn't hurt, and if an MPI test is ever added to check_unit_tests in the future, at least we won't hit this issue again.
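
A minimal sketch of what that could look like in the easyconfig, assuming the generic pretestopts parameter; the variable name and exact contents are illustrative, not the final change:

# hedged sketch: disable Open MPI core binding before running both test targets
local_OMPI_test_vars = 'OMPI_MCA_hwloc_base_binding_policy=none '
pretestopts = local_OMPI_test_vars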

@casparvl (Contributor, Author)

Much more reasonable timing:

== Temporary log file in case of crash /scratch-nvme/1/casparl/ebtmpdir/eb-hwdrjrsg/easybuild-ntdus400.log
== found valid index for /sw/noarch/RHEL8/2022/software/EasyBuild/4.8.0/easybuild/easyconfigs, so using it...
== processing EasyBuild easyconfig /gpfs/home4/casparl/easybuild/pr18517/ESPResSo-4.2.1-foss-2022a.eb
== building and installing ESPResSo/4.2.1-foss-2022a...
== fetching files...
== creating build dir, resetting environment...
== unpacking...
== patching...
== preparing...
== ... (took 5 secs)
== configuring...
== ... (took 16 secs)
== building...
== ... (took 4 mins 39 secs)
== testing...
== ... (took 5 mins 34 secs)
== installing...
== ... (took 6 secs)
== taking care of extensions...
== restore after iterating...
== postprocessing...
== sanity checking...
== ... (took 3 secs)
== cleaning up...
== ... (took 1 secs)
== creating module...
== ... (took 2 secs)
== permissions...
== packaging...
== COMPLETED: Installation ended successfully (took 10 mins 50 secs)
== Results of the build can be found in the log file(s) /scratch-nvme/1/casparl/generic/software/ESPResSo/4.2.1-foss-2022a/easybuild/easybuild-ESPResSo-4.2.1-20230811.135628.log
== Build succeeded for 1 out of 1
== Temporary log file(s) /scratch-nvme/1/casparl/ebtmpdir/eb-hwdrjrsg/easybuild-ntdus400.log* have been removed.
== Temporary directory /scratch-nvme/1/casparl/ebtmpdir/eb-hwdrjrsg has been removed.

I'll upload some new test reports.

@casparvl (Contributor, Author)

@boegelbot please test @ generoso

@boegelbot (Collaborator)

@casparvl: Request for testing this PR well received on login1

PR test command 'EB_PR=18517 EB_ARGS= EB_CONTAINER= /opt/software/slurm/bin/sbatch --job-name test_PR_18517 --ntasks=4 ~/boegelbot/eb_from_pr_upload_generoso.sh' executed!

  • exit code: 0
  • output:
Submitted batch job 11455

Test results coming soon (I hope)...

- notification for comment with ID 1674626326 processed

Message to humans: this is just bookkeeping information for me,
it is of no use to you (unless you think I have a bug, which I don't).

@jngrad commented Aug 11, 2023

I understand why --oversubscribe was used, as otherwise tests might not run if MPI thinks there are too few slots available. I actually run into this occasionally with other software when I create an interactive allocation with 1 task (I only want one bash shell).

The --oversubscribe flag is a band-aid we introduced in the ESPResSo CMake logic to work around issues with core pinning. For example, on AMD Ryzen Threadripper 1950X 16-core processors with hyperthreading enabled, running a process in parallel with more than 16 ranks fails in OpenMPI, since it counts physical cores instead of logical cores. It gets worse when the hardware information is partially masked by Docker containers, or when using deprecated Docker overlays together with QEMU. This led to a lot of convoluted and fragile code to properly support OpenMPI in our CI pipelines. While it works as intended on our side, it's not future-proof. The OMPI_MCA_hwloc_base_binding_policy=none you introduced in ce0960c is also the solution we adopted in ESPResSo 4.2.1.

@casparvl (Contributor, Author)

Just completed the tests for the -CUDA version. Those took 44 minutes on my system. I'm assuming the GPU build enables more tests.

What struck me was that one test (lb_pressure_tensor.py) did have a presence on the GPU (i.e. I saw the process with nvidia-smi), but didn't seem to be using it at all - it just seemed to do multithreading on the 72 CPU cores in this node. And that took quite a while (more than 3 or 4 minutes - I didn't watch it after that). Is that expected, @jngrad? Or would it have taken so long because it was supposed to use the GPU, but didn't?

@boegelbot (Collaborator)

Test report by @boegelbot
SUCCESS
Build succeeded for 4 out of 4 (4 easyconfigs in total)
cns3 - Linux Rocky Linux 8.5, x86_64, Intel(R) Xeon(R) CPU E5-2667 v3 @ 3.20GHz (haswell), Python 3.6.8
See https://gist.github.com/boegelbot/861616d7ed1c697fc9af6a3e30b0cc09 for a full test report.

@jngrad commented Aug 11, 2023

lb_pressure_tensor.py is also a statistical test. It should use only 1 CPU core and roughly 800 MiB of VRAM. If you run it on more than 1 core, it will be idle most of the time due to communication bottlenecks. At every time step, it has to gather particle information from all MPI ranks to MPI rank 0, where it is sent to the GPU. The GPU then sends the forces back to MPI rank 0, which scatters them to all MPI ranks. The test suite is composed of a CPU test case and a GPU test case that has 10 times more steps than the CPU case. The GPU test case runs 4 times, while the CPU case runs only once.
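
For illustration only, here is a minimal mpi4py sketch of the communication pattern described above (this is not ESPResSo code; the GPU force calculation is replaced by a placeholder on rank 0):

# hedged illustration of the per-timestep gather/scatter bottleneck, not ESPResSo code
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

# each rank owns a slab of particle positions
local_positions = np.random.rand(1000, 3)

for step in range(10):
    # gather all particle data on MPI rank 0; the other ranks sit idle
    all_positions = comm.gather(local_positions, root=0)
    if rank == 0:
        # placeholder for the force calculation that would run on the GPU
        all_forces = [-p for p in all_positions]
    else:
        all_forces = None
    # scatter the forces back to the owning ranks
    local_forces = comm.scatter(all_forces, root=0)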

@boegel (Member) commented Aug 11, 2023

We set $OMP_PROC_BIND to TRUE by default, and that results in everything being pinned to a single core while running the tests, so we should probably add an unset for that (which is harmless if $OMP_PROC_BIND is not defined):

pretestopts = "unset OMP_PROC_BIND && " + local_OMPI_test_vars

@casparvl (Contributor, Author)

Test report by @casparvl
FAILED
Build succeeded for 14 out of 17 (4 easyconfigs in total)
gcn3.local.snellius.surf.nl - Linux RHEL 8.6, x86_64, Intel(R) Xeon(R) Platinum 8360Y CPU @ 2.40GHz, 4 x NVIDIA NVIDIA A100-SXM4-40GB, 520.61.05, Python 3.6.8
See https://gist.github.com/casparvl/350b370dcfb68c8cce0930cb62eee0aa for a full test report.

@casparvl (Contributor, Author)

Test report by @casparvl
FAILED
Build succeeded for 3 out of 4 (4 easyconfigs in total)
gcn3.local.snellius.surf.nl - Linux RHEL 8.6, x86_64, Intel(R) Xeon(R) Platinum 8360Y CPU @ 2.40GHz, 4 x NVIDIA NVIDIA A100-SXM4-40GB, 520.61.05, Python 3.6.8
See https://gist.github.com/casparvl/47cdd9cd1671de100c156bca26ad6e3f for a full test report.

@casparvl (Contributor, Author) commented Sep 5, 2023

Discussed with @boegel today. We'll close this for now, as it's not worth investing more time in: parallel tests are a nice-to-have, not a necessity.

As a final note: the last test result shows that it can succeed. The failure for ESPResSo-4.2.1-foss-2021a.eb was

 1/117 Test  #48: gather_buffer_test ................***Failed    3.07 sec
--------------------------------------------------------------------------
A call to mkdir was unable to create the desired directory:

  Directory: /scratch-nvme/1/casparl/ebtmpdir/eb-4e5h8plu/ompi.gcn3.45397
  Error:     File exists

Please check to ensure you have adequate permissions to perform
the desired operation.
--------------------------------------------------------------------------
[gcn3.local.snellius.surf.nl:323877] [[47636,0],0] ORTE_ERROR_LOG: Error in file util/session_dir.c at line 107
[gcn3.local.snellius.surf.nl:323877] [[47636,0],0] ORTE_ERROR_LOG: Error in file util/session_dir.c at line 346
--------------------------------------------------------------------------
It looks like orte_init failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during orte_init; some of which are due to configuration or
environment problems.  This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):

  orte_session_dir failed
  --> Returned value Error (-1) instead of ORTE_SUCCESS
--------------------------------------------------------------------------

This could potentially be a race condition in creating that directory (multiple ranks trying to create the same dir?). I'm not sure, and as mentioned, it's not really worth digging further into for now.

I'll close this PR, if we want, we can pick it up later in a new PR, or re-open this.

Edit: I'm unable to close the PR. It tells me 'you can't comment at this time'. If someone can close it: please do :)

@casparvl (Contributor, Author) commented Sep 5, 2023

From @jngrad:

This is indeed a race condition. We wrote a workaround at the CMake level. During project configuration, we pass the CMake flag -D INSIDE_DOCKER=ON to configure OpenMPI to use a different temporary directory for each test. You may need to adapt it to your needs, since you don't use the default /tmp folder. More details here: https://github.com/espressomd/espresso/blob/4.2.1/CMakeLists.txt#L320-L335

I'll check if -DINSIDE_DOCKER helps. If so, I'll make a patch that adds an extra CMake flag to activate only that particular section (-DINSIDE_DOCKER also sets a -L that we don't want).
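
A minimal sketch of the quick check mentioned above, assuming the flag is passed via the easyconfig's configopts for a trial build (this is not the eventual patch-based solution):

# hedged sketch: try ESPResSo's existing Docker workaround in a test build
configopts = '-DINSIDE_DOCKER=ON '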

@easybuilders deleted a comment from boegelbot Sep 5, 2023
@easybuilders deleted a comment from boegelbot Sep 5, 2023
@casparvl (Contributor, Author) commented Sep 5, 2023

@boegelbot please test @ generoso

@boegelbot (Collaborator)

@casparvl: Request for testing this PR well received on login1

PR test command 'EB_PR=18517 EB_ARGS= EB_CONTAINER= /opt/software/slurm/bin/sbatch --job-name test_PR_18517 --ntasks=4 ~/boegelbot/eb_from_pr_upload_generoso.sh' executed!

  • exit code: 0
  • output:
Submitted batch job 11645

Test results coming soon (I hope)...

- notification for comment with ID 1706989878 processed

Message to humans: this is just bookkeeping information for me,
it is of no use to you (unless you think I have a bug, which I don't).

@boegelbot (Collaborator)

Test report by @boegelbot
SUCCESS
Build succeeded for 4 out of 4 (4 easyconfigs in total)
cns1 - Linux Rocky Linux 8.5, x86_64, Intel(R) Xeon(R) CPU E5-2667 v3 @ 3.20GHz (haswell), Python 3.6.8
See https://gist.github.com/boegelbot/127696445a1e37912c9944c4debb9de2 for a full test report.

@casparvl (Contributor, Author) commented Sep 5, 2023

Test report by @casparvl
FAILED
Build succeeded for 1 out of 5 (4 easyconfigs in total)
gcn1.local.snellius.surf.nl - Linux RHEL 8.6, x86_64, Intel(R) Xeon(R) Platinum 8360Y CPU @ 2.40GHz, 4 x NVIDIA NVIDIA A100-SXM4-40GB, 520.61.05, Python 3.6.8
See https://gist.github.com/casparvl/9be5fb63631820e9930dc1a0298f61f4 for a full test report.

@casparvl (Contributor, Author) commented Sep 5, 2023

Test report by @casparvl
FAILED
Build succeeded for 2 out of 5 (4 easyconfigs in total)
gcn1.local.snellius.surf.nl - Linux RHEL 8.6, x86_64, Intel(R) Xeon(R) Platinum 8360Y CPU @ 2.40GHz, 4 x NVIDIA NVIDIA A100-SXM4-40GB, 520.61.05, Python 3.6.8
See https://gist.github.com/casparvl/f83361e17fb3afe18a22df863327d0e6 for a full test report.

@casparvl (Contributor, Author) commented Sep 5, 2023

Test report by @casparvl
FAILED
Build succeeded for 2 out of 4 (4 easyconfigs in total)
gcn1.local.snellius.surf.nl - Linux RHEL 8.6, x86_64, Intel(R) Xeon(R) Platinum 8360Y CPU @ 2.40GHz, 4 x NVIDIA NVIDIA A100-SXM4-40GB, 520.61.05, Python 3.6.8
See https://gist.github.com/casparvl/e65aaefccfc3d5e8559ff87bb4803cf8 for a full test report.

@boegel modified the milestones: 4.8.1, release after 4.8.1 Sep 9, 2023
@casparvl (Contributor, Author)

Most errors seem to be file permission errors (some conflict between different workers, maybe?), but there are some other errors that are not shown in the gist:

  3/201 Test  #31: constraint_shape_based ........................................***Failed   35.14 sec
[1693950743.947384] [gcn1:719471:0]       mm_posix.c:194  UCX  ERROR open(file_name=/proc/719470/fd/46 flags=0x0) failed: Permission denied
[1693950743.947434] [gcn1:719471:0]          mm_ep.c:154  UCX  ERROR mm ep failed to connect to remote FIFO id 0xc000000b800afa6e: Shared memory error
[gcn1.local.snellius.surf.nl:719471] pml_ucx.c:419  Error: ucp_ep_create(proc=0) failed: Shared memory error
--------------------------------------------------------------------------
It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or environment
problems.  This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI
developer):

  PML add procs failed
  --> Returned "Error" (-1) instead of "Success" (0)
--------------------------------------------------------------------------
[gcn1:719471] *** An error occurred in MPI_Init
[gcn1:719471] *** reported by process [2958229505,1]
[gcn1:719471] *** on a NULL communicator
[gcn1:719471] *** Unknown error
[gcn1:719471] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[gcn1:719471] ***    and potentially your MPI job)
...
  4/201 Test  #27: tune_skin .....................................................***Failed   35.37 sec
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpiexec noticed that process rank 0 with PID 0 on node gcn1 exited on signal 9 (Killed).
--------------------------------------------------------------------------

        Start  38: dawaanr-and-bh-gpu
  5/201 Test  #30: cutoffs_1_core ................................................***Failed   35.35 sec
.
----------------------------------------------------------------------
Ran 1 test in 0.002s

OK
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpiexec noticed that process rank 0 with PID 0 on node gcn1 exited on signal 9 (Killed).
--------------------------------------------------------------------------
...
 28/201 Test   #8: test_checkpoint__therm_lb__elc_gpu__lj__lb_gpu_ascii ..........***Failed   42.81 sec
Traceback (most recent call last):
  File "/gpfs/nvme1/1/casparl/ebbuildpath/ESPResSo/4.2.1/foss-2021a/easybuild_obj/testsuite/python/test_checkpoint.py", line 52, in <module>
    class CheckpointTest(ut.TestCase):
  File "/gpfs/nvme1/1/casparl/ebbuildpath/ESPResSo/4.2.1/foss-2021a/easybuild_obj/testsuite/python/test_checkpoint.py", line 56, in CheckpointTest
    checkpoint.load(0)
  File "/gpfs/nvme1/1/casparl/ebbuildpath/ESPResSo/4.2.1/foss-2021a/easybuild_obj/src/python/espressomd/checkpointing.py", line 232, in load
    with open(filename, "rb") as f:
FileNotFoundError: [Errno 2] No such file or directory: '/gpfs/nvme1/1/casparl/ebbuildpath/ESPResSo/4.2.1/foss-2021a/easybuild_obj/testsuite/python/checkpoint_therm_lb__elc_gpu__lj__lb_gpu_ascii/0.checkpoint'
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
[gcn1:719984:0:719984] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x14dd8c182260)
[gcn1:719981:0:719981] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x14bf02b2e260)
[gcn1:719982:0:719982] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x15400eddb260)
--------------------------------------------------------------------------
mpiexec detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

  Process name: [[46465,1],0]
  Exit code:    1
--------------------------------------------------------------------------
...

@jngrad commented Sep 18, 2023

Most errors seem to be file permission errors (some conflict between different workers, maybe?), but there are some other errors that are not shown in the gist:

Increasing the line count to a value larger than 500 would help. CTest will often schedule the save_checkpoint_* tests to run first, because they generate file artifacts that are read by the test_checkpoint_* tests. If the save stage fails, the test stage will fail too. Maybe more recent versions of CMake allow skipping the test stage if the dependent save stage failed, but the older CMake version we chose to support didn't offer that possibility, iirc.

@casparvl (Contributor, Author)

Test report by @casparvl
FAILED
Build succeeded for 2 out of 4 (4 easyconfigs in total)
gcn1.local.snellius.surf.nl - Linux RHEL 8.6, x86_64, Intel(R) Xeon(R) Platinum 8360Y CPU @ 2.40GHz, 4 x NVIDIA NVIDIA A100-SXM4-40GB, 520.61.05, Python 3.6.8
See https://gist.github.com/casparvl/be80f80f7b8a7492d7f5cdfb17558239 for a full test report.

@boegel (Member) commented Oct 27, 2023

@casparvl Didn't we conclude here that running the tests in parallel isn't going to work out?

@casparvl (Contributor, Author)

Closing this. Running the tests in parallel causes too many issues for now and isn't worth the effort. We might pick up from here later.

@casparvl closed this Mar 19, 2024