
tp2.10 MPI_OMPH test not reproducing with different number of MPI/OMP threads #321

Open
JessicaMeixner-NOAA opened this issue Feb 23, 2021 · 6 comments

@JessicaMeixner-NOAA
Collaborator

When running:

./bin/run_test -b slurm -c hera -S -T -s MPI_OMPH -w work_MPI_OMPH -f -p mpirun -n 2 -t 6 ../model -o both ww3_tp2.10
with either a different number of MPI tasks or a different number of OMP threads, the results do not match when compared with matrix.comp.

@JessicaMeixner-NOAA JessicaMeixner-NOAA changed the title tp2.10 MPI_OMPH test not reproducing with MPI/OMP threads tp2.10 MPI_OMPH test not reproducing with different number of MPI/OMP threads Feb 23, 2021
@ukmo-ccbunney ukmo-ccbunney self-assigned this Feb 23, 2021
@JessicaMeixner-NOAA
Collaborator Author

@ukmo-ccbunney I remember you mentioning that you thought you had looked into this before. I just found a switch, B4B, which is a double switch. I wonder if these are the updates you made for this issue before?

@ukmo-ccbunney
Collaborator

@JessicaMeixner-NOAA - yes - B4B is a "bit 4 bit" switch. I added it to get around the b4b issues we originally had with that regtest (to do with a reduction in the SMC module). When enabled, the B4B switch scales and converts some real values to integer values to avoid rounding errors during the reduction.

It fixes the issue when running with the same number of PEs/threads (at least for us), but there may be other parts of the code that cause issues when running with a different combination of PEs/threads...
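(For anyone following along: the root cause is that floating-point addition is not associative, so a parallel reduction's result depends on the order in which partial sums are combined, and that order changes with the MPI/OMP decomposition. A minimal illustration in Python, not WW3 code:)

```python
# Floating-point addition is not associative: the same four values summed
# in two different orders give two different answers. A different number
# of ranks/threads changes the reduction order in exactly this way.
vals = [1e16, 1.0, -1e16, 1.0]

# One possible reduction order (strict left to right):
left_to_right = ((vals[0] + vals[1]) + vals[2]) + vals[3]

# Another order, as a different decomposition might produce:
pairwise = (vals[0] + vals[2]) + (vals[1] + vals[3])

print(left_to_right)  # 1.0  (the first +1.0 is absorbed by 1e16)
print(pairwise)       # 2.0  (the large terms cancel first)
```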

@aliabdolali
Contributor

After updating mpirun to srun on the NOAA HPC, ww3_tp2.16/work_MPI_OMPH shows similar behavior. It is no longer b4b.
./bin/run_test -b slurm -c hera.intel -S -T -s MPI_OMPH -w work_MPI_OMPH -f -p srun -n 4 -t 6 ../model -o both ww3_tp2.16

@ukmo-ccbunney
Collaborator

Thanks for the update @aliabdolali
I think these OMP B4B issues are due to reduction operations, which are not guaranteed to be bit-for-bit because the order of the reduction is not guaranteed. The way I got around it in ww3_tp2.10 (for our system at least) was to perform the reduction on arrays that had been scaled and converted to integer values. It might be that something similar needs to be done for ww3_tp2.16.

I will investigate when I get more time.

@JessicaMeixner-NOAA
Collaborator Author

@ukmo-ccbunney I noticed that we do have the "B4B" flag in both the tp2.10 and the tp2.16 OMPH cases. I had thought the tp2.16 failure was a case of something not getting recompiled correctly, but I used the same switches for both tests and it was okay. (I suspected this because if I ran the tp2.16 test from clean clones I could get b4b, but not if it was run after the tp2.10 test; using the tp2.10 switch did not help.)

@ukmo-ccbunney
Collaborator

@JessicaMeixner-NOAA I think the B4B implementation must no longer be working (or it is working and there is some other issue). It did originally fix the b4b issue with the SMC propagation a few years ago, but it seems to have come back and I am not sure why! Certainly, it seems to be implementation specific, as you have seen with your change of scheduler.
