
MPI and OpenMP memory allocation optimisations #262

Merged: 31 commits merged into danieljprice:master from memory-optimisation on Mar 23, 2022

Conversation

@conradtchan (Collaborator) commented on Mar 10, 2022

Type of PR:
modification to existing code

Description:
This PR combines several changes to memory allocation:

- Rename `mpi_stack.F90` to `mpi_memory.F90` to reflect the fact that the structure is not strictly a memory stack
- Move MPI memory allocation from `initial` into the main memory allocation subroutine
- Set the MPI "stacksize" dynamically, based on `npart` and `nprocs` (see the sketch after this list)
- Remove `remote_export` from the cell derived types, and adjust the send/recv subroutines to match; this allows the `maxprocs` parameter to be removed
- Remove the `maxprocs` parameter and allocate arrays dynamically using `nprocs`
- Automatically expand `stacksize` as required
- Add a check for whether global node refinement exceeds `ncellsmax` (an issue for large numbers of MPI tasks)
- Allocate a larger array for the global tree (using `ncellsmaxglobal` instead of `ncellsmax`)
- Move the dynamic memory size calculation inside the `allocate_memory` subroutine
- Tidy up some unused imports
- Fix the setdisc test failure caused by uninitialised (but unused) variables
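To make the stacksize items above concrete, here is a minimal Fortran sketch of dynamic sizing plus on-demand expansion. The sizing formula, the buffer shape, and the routine names `allocate_mpi_memory` and `expand_stack` are illustrative assumptions, not the actual Phantom code; only `stacksize`, `npart` and `nprocs` come from the description above.

```fortran
module mpi_memory_sketch
 implicit none
 integer :: stacksize = 0

contains

 !--size the stack from npart and nprocs instead of using a
 !  compile-time maximum (the formula here is illustrative only)
 subroutine allocate_mpi_memory(xbuf,npart,nprocs)
  real, allocatable, intent(out) :: xbuf(:,:)
  integer, intent(in) :: npart,nprocs

  stacksize = max(1, 2*npart/nprocs)
  allocate(xbuf(4,stacksize))

 end subroutine allocate_mpi_memory

 !--grow the stack on demand, preserving its current contents
 subroutine expand_stack(xbuf)
  real, allocatable, intent(inout) :: xbuf(:,:)
  real, allocatable :: tmp(:,:)

  stacksize = 2*stacksize
  allocate(tmp(size(xbuf,1),stacksize))
  tmp(:,1:size(xbuf,2)) = xbuf
  call move_alloc(tmp,xbuf)   ! tmp is deallocated, xbuf takes over

 end subroutine expand_stack

end module mpi_memory_sketch
```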

Resolve #260
The crash is caused by large threadprivate arrays being allocated for certain numbers of MPI tasks and OpenMP threads. The root cause has not been identified, and is potentially a compiler bug.
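For context, the pattern at issue looks roughly like this: a fixed-size array with one static copy per OpenMP thread. This is a sketch only; the module name and the exact shape of `xyzcache` are assumptions (the 50,000 length comes from the discussion below).

```fortran
module kdtree_sketch
 implicit none
 integer, parameter :: maxcache = 50000
 ! one static copy of this array exists per OpenMP thread, created by
 ! the runtime when the threads are spawned; the crash appears for
 ! certain MPI task/OpenMP thread combinations when these copies are large
 real, save :: xyzcache(maxcache,3)
 !$omp threadprivate(xyzcache)
end module kdtree_sketch
```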

Two possible solutions:

1. Dynamically allocate a separate `xyzcache` array for each thread (or one large array that is split between tasks). This solution essentially makes the memory allocation a manual/explicit process, since `threadprivate` _already_ performs a "dynamic" allocation at runtime based on `OMP_NUM_THREADS`. After discussion with @danieljprice, this solution was not preferred because:
   - The cache size is fixed per thread (though the total memory allocation still varies with the number of threads)
   - Performance may be affected. However, a test of the polar benchmark showed no change in performance for 1- and 2-node hybrid jobs. Interestingly, the new version produced slightly faster run times, but not with any significance.
   - It adds code complexity
2. Reduce the cache size to work around the bug. Performance is not expected to be affected, based on discussion with @danieljprice:

> _xyzcache does not have to be 50,000 long. Typically the number of “trial” neighbours is more like a few hundred, so something like 10,000 or even 5,000 for xyzcache length is already plenty_

Option 2 has been chosen, with the cache size reduced to 10,000 (see the sketch below). This works around the issue for 2 nodes, but the problem may still exist at other task+thread counts or on other systems. This can be revisited if a more robust solution is required.
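The adopted workaround then amounts to shrinking the size parameter while keeping the `threadprivate` declaration exactly as in the sketch above (module name and array shape again illustrative):

```fortran
module kdtree_sketch
 implicit none
 integer, parameter :: maxcache = 10000   ! reduced from 50000
 real, save :: xyzcache(maxcache,3)
 !$omp threadprivate(xyzcache)
end module kdtree_sketch
```

Since the number of trial neighbours is typically only a few hundred, the reduced cache retains a wide safety margin.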

Testing:

- Benchmark suite to assess performance impact
- Test suite for correctness

Did you run the bots? yes

@dliptai dliptai linked an issue Mar 10, 2022 that may be closed by this pull request
@danieljprice (Owner) commented:
The use of a `threadprivate` static array is VERY important for the code performance. I would be very reluctant to merge changes which attempt to make `xyzcache` dynamically allocatable in any way. It is also completely unnecessary since it's a small array with a size << `npart`, so I'm a bit confused on what this particular change is trying to achieve.

Commits pushed in response:
- amend previous commit
- bugfix: missing nprocs import
- amend previous commit
- Change back to using threadprivate for xyzcache
@conradtchan (Collaborator, Author) commented:
> The use of a `threadprivate` static array is VERY important for the code performance. I would be very reluctant to merge changes which attempt to make `xyzcache` dynamically allocatable in any way. It is also completely unnecessary since it's a small array with a size << `npart`, so I'm a bit confused on what this particular change is trying to achieve.

Updated the PR description with a bit of explanation. We will go with decreasing the `xyzcache` size.

@conradtchan conradtchan marked this pull request as ready for review March 23, 2022 23:23
@conradtchan conradtchan merged commit c80a937 into danieljprice:master Mar 23, 2022
@conradtchan conradtchan deleted the memory-optimisation branch March 23, 2022 23:24
s-neilson pushed a commit to s-neilson/phantom that referenced this pull request Mar 18, 2023
Linked issue closed by merging this pull request: segfault with 8 MPI tasks and OMP_NUM_THREADS=8