Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

phiGRAPE MPI support is broken #1090

Open
LourensVeen opened this issue Nov 4, 2024 · 3 comments
Open

phiGRAPE MPI support is broken #1090

LourensVeen opened this issue Nov 4, 2024 · 3 comments
Labels

Comments

@LourensVeen
Copy link
Collaborator

LourensVeen commented Nov 4, 2024

Describe the bug
While converting phiGRAPE to the new build system, I tried removing the non-MPI worker and just putting in the MPI worker as the default, seeing as we're assuming that MPI is available everywhere anyway. This caused some of the tests to fail, in particular test11, which removes particles and then adds new particles representing merged pairs.

The original tests run only the non-MPI worker, except for for tests 15, 16 and 19, which are run with multiple workers but don't remove or add particles. So it looks like this was never tested. I haven't been able to find the exact cause, but it looks like adding and removing particles was simply never implemented correctly for the MPI-enabled case.

To Reproduce
Modify TestPhigrape::test11 to run with two workers rather than one.

Expected behavior
The code should give correct results rather than getting its internal state messed up, and thus pass the test.

Environment (please complete the following information):

  • OS and version: Kubuntu 22.04
  • Compiler: GFortran 11.4.0

Additional context
I'll build the non-MPI worker as well for now in the new build system and disable the MPI one in interface.py to avoid giving people incorrect results.

Another thing to take away from this is that the tests should really be set up so that every test is run for every worker, rather than running most tests only on the default worker and then adding one or two for the other workers. The latter greatly reduces test coverage for the non-default workers, while not saving any work. (Except that you'd then have to fix all the additional bugs you find, but that's better than incorrect results.)

@LourensVeen LourensVeen added the bug label Nov 4, 2024
@christianboily
Copy link

@LourensVeen, I did what you suggested (after modifying the Makefile of phi grape to allow for multiple data formats),
I modified test11 as you suggested (number_of_workers = 2) and performed the suite of tests. The outcome is this :

OSX 15.2 / gcc14 / python3.12.8 from Macports
====>
(Amuse_15.2) cmb@FudiProM2(/opt/work/cmb/Amuse/Amuse_15.2){94} > mpiexec -n 1 --oversubscribe pytest --pest_phigrape
[obas-rech-cmb.astro.unistra.fr:24569] pmix_mca_base_component_repository_open: unable to open /opt/locaress_zlib.so: File /opt/local/lib/openmpi-mp/pmix/pmix_mca_pcompress_zlib.so.sl not found (ignored)
....
============================================= test session starts ========================
platform darwin -- Python 3.12.8, pytest-8.3.4, pluggy-1.5.0
rootdir: /System/Volumes/Data/Work/cmb/Amuse/Amuse_15.2
plugins: anyio-4.8.0
collected 26 items

[obas-rech-cmb.astro.unistra.fr:24593] pmix_mca_base_component_repository_open: unable to open /opt/locaress_zlib.so: File /opt/local/lib/openmpi-mp/pmix/pmix_mca_pcompress_zlib.so.sl not found (ignored)
...
[obas-rech-cmb.astro.unistra.fr:24594] pmix_mca_base_component_repository_open: unable to open /opt/localurm.so: File /opt/local/lib/openmpi-mp/pmix/pmix_mca_prm_slurm.so.sl not found (ignored)
. ........... Set stopping condition 0
Set stopping condition 0
Set stopping condition 1
Set stopping condition 0
...............

===================================== warnings summary =========================
../../../../../../../opt/work/cmb/Amuse/Amuse_15.2/amuse-source/src/amuse/support/options.py:11
/opt/work/cmb/Amuse/Amuse_15.2/amuse-source/src/amuse/support/options.py:11: DeprecationWarning: pkg_re https://setuptools.pypa.io/en/latest/pkg_resources.html
import pkg_resources

-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
======================================= 26 passed, 1 warning in 2.67s ===================

So apart from addressing new file on the system (new OSX security issues ??) I don't have any test failure.

@LourensVeen
Copy link
Collaborator Author

Hm, interesting. It's been a while since I looked at this, and it may be that I misunderstood something. I'll have another go at it after the new release, see if I can either get it to work or get a better understanding of what the problem is. Thanks for testing!

@LourensVeen
Copy link
Collaborator Author

I ran into this again while testing the new installer more, and I think I've found a problem: the way I was injecting values into the phigrape Makefile didn't pass in the right flags, and that caused MPI and non-MPI code to get mixed up.

The behaviour of make changes a bit depending on whether variables were set in the environment or not, and I misunderstood how overriding works (make -C src VAR="something" turns out to be a lot more aggressive than I thought...).

So now I can run test11 with 2 workers, but on the other hand, if I set the MPI worker as the default it still fails some of the tests:

FAILED phigrape/tests/test_phigrape.py::TestMPIInterface::test7 - AssertionError: nan != 0.5
FAILED phigrape/tests/test_phigrape.py::TestPhigrape::test10 - AssertionError: False is not true
FAILED phigrape/tests/test_phigrape.py::TestPhigrape::test18 - AssertionError: 38.0 mass != 23 mass within 7 places
FAILED phigrape/tests/test_phigrape.py::TestPhigrape::test4 - AssertionError: 15.0 kg != 17.0 kg

Could you try any of these tests with more than one worker?

LourensVeen added a commit to LourensVeen/amuse that referenced this issue Jan 26, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

2 participants