MPI and OpenMP memory allocation optimisations #262
Conversation
…tom into memory-optimisation

Conflicts:
- build/Makefile
- src/main/memory.F90
because maxprocs is needed to get the right size, which we don't want to set as a parameter; subroutines adjusted accordingly (replaced direction with remote_export)
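For illustration, a minimal sketch (not Phantom's actual derived types; the names and the value of `maxprocs` are hypothetical) of why a fixed-size `remote_export` component inside a cell type forces a compile-time `maxprocs`, whereas keeping the export flags in an allocatable array lets them be sized from the runtime `nprocs`:

```fortran
program remote_export_sketch
 implicit none
 ! compile-time bound on the number of MPI tasks (what the PR removes)
 integer, parameter :: maxprocs = 32

 ! with remote_export inside the cell type, its size must be known at
 ! compile time, which is why a maxprocs parameter has to exist
 type :: cell_with_export
    logical :: remote_export(maxprocs)
 end type cell_with_export

 type(cell_with_export) :: cell
 integer :: nprocs
 logical, allocatable :: remote_export(:)   ! kept outside the type instead

 cell%remote_export = .false.

 nprocs = 8                                 ! would come from MPI_COMM_SIZE
 allocate(remote_export(nprocs))            ! sized from the runtime nprocs
 remote_export = .false.
end program remote_export_sketch
```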
The use of a threadprivate static array is VERY important for the code performance. I would be very reluctant to merge changes which attempt to make xyzcache dynamically allocatable in any way. It is also completely unnecessary, since it's a small array with a size << npart, so I'm a bit confused about what this particular change is trying to achieve.
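For reference, a minimal sketch of the kind of `threadprivate` fixed-size cache being discussed; the module name, variable name and size are illustrative, not Phantom's actual declaration:

```fortran
module cache_sketch
 implicit none
 ! illustrative fixed-size, per-thread cache (name and size are hypothetical)
 integer, parameter :: maxcache = 10000
 real :: xyzcache(maxcache,3)
 !$omp threadprivate(xyzcache)
end module cache_sketch

program use_cache
 use cache_sketch, only:xyzcache
 implicit none
 !$omp parallel
 ! each thread writes to its own static copy; no runtime allocation needed
 xyzcache = 0.
 !$omp end parallel
end program use_cache
```

Each OpenMP thread gets its own copy of `xyzcache`, so there is no allocation inside the neighbour search and the total footprint is simply `OMP_NUM_THREADS` times the per-thread cache size.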
amend previous commit
bugfix: missing nprocs import
amend previous commit
force-pushed from 78725d4 to 25e4434
Updated the PR description with a bit of explanation. We will go with decreasing the cache size.
necessary for test suite
to fix setdisc test failure due to uninitialised (but unused) variables
Type of PR:
modification to existing code
Description:
This PR combines several changes to memory allocation:
- Rename `mpi_stack.F90` to `mpi_memory.F90` to reflect the fact that the structure is not strictly a memory stack
- Move MPI memory allocation from `initial` into the main memory allocation subroutine
- Set the MPI "stacksize" dynamically, based on `npart` and `nprocs`
- Remove `remote_export` from the cell derived types, and adjust the send/recv subroutines to match; this allows the `maxprocs` parameter to be removed
- Remove the `maxprocs` parameter and allocate arrays dynamically using `nprocs`
- Automatically expand `stacksize` as required (see the sketch after this list)
- Add a check for whether global node refinement exceeds `ncellsmax` (an issue for large numbers of MPI tasks)
- Allocate a larger array for the global tree (using `ncellsmaxglobal` instead of `ncellsmax`)
- Move the dynamic memory size calculation inside the `allocate_memory` subroutine
- Tidy up some unused imports
- Fix the setdisc test failure due to uninitialised (but unused) variables
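A minimal sketch of the "size the stack from `npart` and `nprocs`, then grow it on demand" idea referenced in the list above; the module and subroutine names, the sizing factor and the growth policy are assumptions for illustration, not the actual `mpi_memory.F90` interface:

```fortran
module mpi_stack_sketch
 implicit none
 integer :: stacksize = 0
 real, allocatable :: stack(:)

contains

 subroutine allocate_stack(npart,nprocs)
  integer, intent(in) :: npart,nprocs
  ! initial size proportional to particles per MPI task (the factor of 2
  ! and the floor of 1000 are guesses, not Phantom's actual values)
  stacksize = max(1000, 2*npart/nprocs)
  allocate(stack(stacksize))
 end subroutine allocate_stack

 subroutine expand_stack()
  real, allocatable :: tmp(:)
  ! grow geometrically, preserving the current contents
  allocate(tmp(2*stacksize))
  tmp(1:stacksize) = stack
  call move_alloc(tmp,stack)
  stacksize = 2*stacksize
 end subroutine expand_stack

end module mpi_stack_sketch
```

Geometric growth keeps the number of reallocations logarithmic in the final size, and `move_alloc` hands the enlarged buffer back without a second copy.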
Resolves danieljprice#260

The crash is caused by large `threadprivate` arrays being allocated for certain numbers of MPI tasks and OpenMP threads. The root cause has not been identified, and is potentially a compiler bug. Two possible solutions:

1. Dynamically allocate a separate `xyzcache` array for each thread (or one large array that is split between tasks). This solution essentially makes the memory allocation a manual/explicit process, since `threadprivate` _already_ performs a "dynamic" allocation at runtime based on `OMP_NUM_THREADS` (see the sketch after this list). After discussion with @danieljprice, this solution was not preferred because:
   - The cache size is fixed per thread (though the total memory allocation still varies with the number of threads)
   - Performance may be affected. However, a test of the polar benchmark showed no change in performance for 1 and 2 node hybrid jobs. Interestingly, the new version produced slightly faster run times, but not with any significance.
   - Additional code complexity
2. Reduce the cache size to work around the bug. Performance may not be affected, based on discussion with @danieljprice:

   > _xyzcache does not have to be 50,000 long. Typically the number of “trial” neighbours is more like a few hundred, so something like 10,000 or even 5,000 for xyzcache length is already plenty_
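For comparison, a minimal sketch of the per-thread allocation described in option 1 (the approach that was not adopted); the names and cache length are illustrative only:

```fortran
program per_thread_cache
 use omp_lib, only:omp_get_thread_num,omp_get_max_threads
 implicit none
 integer, parameter :: cachesize = 10000
 real, allocatable  :: xyzcache_all(:,:,:)   ! one slab per thread
 integer :: id

 ! one large array, split between threads and sized at runtime
 allocate(xyzcache_all(3,cachesize,0:omp_get_max_threads()-1))

 !$omp parallel private(id)
 id = omp_get_thread_num()
 ! each thread touches only its own slab, xyzcache_all(:,:,id)
 xyzcache_all(:,:,id) = 0.
 !$omp end parallel
end program per_thread_cache
```

Here the total cache is allocated once at runtime and split into one slab per thread, which is what makes the allocation explicit rather than relying on `threadprivate` storage.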
Option 2 has been chosen, with the cache size reduced to 10,000. This works around the issue for 2 nodes, but the problem may still exist at other task and thread counts or on other systems. This can be revisited if a more robust solution is required.
Testing:
- Benchmark suite to assess performance impact
- Test suite for correctness
Did you run the bots? yes