
MPI and OpenMP memory allocation optimisations #262

Merged: 31 commits merged into danieljprice:master from memory-optimisation on Mar 23, 2022

Conversation

@conradtchan (Collaborator) commented on Mar 10, 2022

Type of PR:
modification to existing code

Description:
This PR combines several changes to memory allocation:

- Rename `mpi_stack.F90` to `mpi_memory.F90` to reflect the fact that the structure is not strictly a memory stack
- Move MPI memory allocation from `initial` into the main memory allocation subroutine
- Set the MPI "stacksize" dynamically, based on `npart` and `nprocs` (see the sketch after this list)
- Remove `remote_export` from the cell derived types, and adjust the send/recv subroutines to match; this allows the `maxprocs` parameter to be removed
- Remove the `maxprocs` parameter and allocate arrays dynamically using `nprocs`
- Automatically expand `stacksize` as required
- Add a check for whether global node refinement exceeds `ncellsmax` (an issue for large numbers of MPI tasks)
- Allocate a larger array for the global tree (using `ncellsmaxglobal` instead of `ncellsmax`)
- Move the dynamic memory size calculation inside the `allocate_memory` subroutine
- Tidy up some unused imports
- Fix the setdisc test failure caused by uninitialised (but unused) variables
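To make the stacksize items above concrete, here is a minimal Fortran sketch of dynamic sizing plus on-demand expansion. The sizing formula, the buffer shape, and the routine names `allocate_mpi_memory` and `expand_stack` are illustrative assumptions, not the actual Phantom code; only `stacksize`, `npart` and `nprocs` come from the description above.

```fortran
module mpi_memory_sketch
 implicit none
 integer :: stacksize = 0

contains

 !--size the stack from npart and nprocs instead of using a
 !  compile-time maximum (the formula here is illustrative only)
 subroutine allocate_mpi_memory(xbuf,npart,nprocs)
  real, allocatable, intent(out) :: xbuf(:,:)
  integer, intent(in) :: npart,nprocs

  stacksize = max(1, 2*npart/nprocs)
  allocate(xbuf(4,stacksize))

 end subroutine allocate_mpi_memory

 !--grow the stack on demand, preserving its current contents
 subroutine expand_stack(xbuf)
  real, allocatable, intent(inout) :: xbuf(:,:)
  real, allocatable :: tmp(:,:)

  stacksize = 2*stacksize
  allocate(tmp(size(xbuf,1),stacksize))
  tmp(:,1:size(xbuf,2)) = xbuf
  call move_alloc(tmp,xbuf)   ! tmp is deallocated, xbuf takes over

 end subroutine expand_stack

end module mpi_memory_sketch
```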

Resolve #260
The crash is caused by large threadprivate arrays being allocated for certain numbers of MPI tasks and OpenMP threads. The root cause has not been identified, and is potentially a compiler bug.
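For context, the pattern at issue looks roughly like this: a fixed-size array with one static copy per OpenMP thread. This is a sketch only; the module name and the exact shape of `xyzcache` are assumptions (the 50,000 length comes from the discussion below).

```fortran
module kdtree_sketch
 implicit none
 integer, parameter :: maxcache = 50000
 ! one static copy of this array exists per OpenMP thread, created by
 ! the runtime when the threads are spawned; the crash appears for
 ! certain MPI task/OpenMP thread combinations when these copies are large
 real, save :: xyzcache(maxcache,3)
 !$omp threadprivate(xyzcache)
end module kdtree_sketch
```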

Two possible solutions:

1. Dynamically allocate a separate `xyzcache` array for each thread (or one large array that is split between tasks). This solution essentially makes the memory allocation a manual/explicit process, since `threadprivate` _already_ performs a "dynamic" allocation at runtime based on `OMP_NUM_THREADS`. After discussion with @danieljprice, this solution was not preferred because:
   - The cache size is fixed per thread (though the total memory allocation still varies with the number of threads)
   - Performance may be affected. However, a test of the polar benchmark showed no change in performance for 1- and 2-node hybrid jobs. Interestingly, the new version produced slightly faster run times, but not with any significance.
   - It adds code complexity
2. Reduce the cache size to work around the bug. Performance is not expected to be affected, based on discussion with @danieljprice:

> _xyzcache does not have to be 50,000 long. Typically the number of “trial” neighbours is more like a few hundred, so something like 10,000 or even 5,000 for xyzcache length is already plenty_

Option 2 has been chosen, with the cache size reduced to 10,000 (see the sketch below). This works around the issue for 2 nodes, but the problem may still exist at other task+thread counts or on other systems. This can be revisited if a more robust solution is required.
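The adopted workaround then amounts to shrinking the size parameter while keeping the `threadprivate` declaration exactly as in the sketch above (module name and array shape again illustrative):

```fortran
module kdtree_sketch
 implicit none
 integer, parameter :: maxcache = 10000   ! reduced from 50000
 real, save :: xyzcache(maxcache,3)
 !$omp threadprivate(xyzcache)
end module kdtree_sketch
```

Since the number of trial neighbours is typically only a few hundred, the reduced cache retains a wide safety margin.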

Testing:

- Benchmark suite to assess performance impact
- Test suite for correctness

Did you run the bots? yes

@dliptai dliptai linked an issue Mar 10, 2022 that may be closed by this pull request
@danieljprice (Owner) commented:
The use of a `threadprivate` static array is VERY important for the code performance. I would be very reluctant to merge changes which attempt to make `xyzcache` dynamically allocatable in any way. It is also completely unnecessary since it's a small array with a size << `npart`, so I'm a bit confused on what this particular change is trying to achieve.

Commits pushed in response:
- amend previous commit
- bugfix: missing nprocs import
- amend previous commit
- Change back to using threadprivate for xyzcache
@conradtchan (Collaborator, Author) commented:
> The use of a `threadprivate` static array is VERY important for the code performance. I would be very reluctant to merge changes which attempt to make `xyzcache` dynamically allocatable in any way. It is also completely unnecessary since it's a small array with a size << `npart`, so I'm a bit confused on what this particular change is trying to achieve.

Updated the PR description with a bit of explanation. We will go with decreasing the `xyzcache` size.

@conradtchan conradtchan marked this pull request as ready for review March 23, 2022 23:23
@conradtchan conradtchan merged commit c80a937 into danieljprice:master Mar 23, 2022
@conradtchan conradtchan deleted the memory-optimisation branch March 23, 2022 23:24
s-neilson pushed a commit to s-neilson/phantom that referenced this pull request Mar 18, 2023
Linked issue closed by merging this pull request: segfault with 8 MPI tasks and OMP_NUM_THREADS=8