Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

release/public-v1: fix grib2 read errors with large task counts (draft PR) #155

Merged

Conversation

climbfuji
Copy link
Contributor

@climbfuji climbfuji commented Sep 25, 2020

This PR is an attempt to fox the grib2 read errors encountered with large task counts randomly on several machines. The issue is described in length in ufs-community/ufs-mrweather-app#190, together with a suggestion from @GeorgeGayno-NOAA in ufs-community/ufs-mrweather-app#190 (comment) (implemented here).

Changes in this PR:

  • sorc/chgres_cube.fd/model_grid.F90: read grib2 inventory only with task 0, let all other processes wait for it to complete
  • correct syntax for MPI_ABORT calls in various files (from @DusanJovic-NOAA)

Testing:

  • Installed a new/test version of NCEPLIBS ufs-v1.1.0 that contains this bugfix on orion and ran the full MRW App regression tests, all of them passed.
  • Due to the random nature of the error, repeated runs of the regression test suite will be conducted on orion over night, and if successful also on Cheyenne tomorrow.

@ligiabernardet @panll FYI

…sk 0, let all other processes wait for it to complete
@climbfuji
Copy link
Contributor Author

@GeorgeGayno-NOAA Should we decide to proceed with this PR, I want to suggest to pull in your bugfix PR for develop that corrected the calls (arguments) to mpi_abort etc. What do you think?

@GeorgeGayno-NOAA
Copy link
Collaborator

@GeorgeGayno-NOAA Should we decide to proceed with this PR, I want to suggest to pull in your bugfix PR for develop that corrected the calls (arguments) to mpi_abort etc. What do you think?

I have no problem with that. I am sure @DusanJovic-NOAA would agree.

@climbfuji
Copy link
Contributor Author

@GeorgeGayno-NOAA Should we decide to proceed with this PR, I want to suggest to pull in your bugfix PR for develop that corrected the calls (arguments) to mpi_abort etc. What do you think?

I have no problem with that. I am sure @DusanJovic-NOAA would agree.

Thanks, @GeorgeGayno-NOAA. I cherry-picked the commits relevant for release/public-v1 from #148, resolved the conflicts and updated this PR. Would you please take a look? I'll test it with the MRW App on a few platforms so that we can make sure it is working as intended.

@GeorgeGayno-NOAA
Copy link
Collaborator

@GeorgeGayno-NOAA Should we decide to proceed with this PR, I want to suggest to pull in your bugfix PR for develop that corrected the calls (arguments) to mpi_abort etc. What do you think?

I have no problem with that. I am sure @DusanJovic-NOAA would agree.

Thanks, @GeorgeGayno-NOAA. I cherry-picked the commits relevant for release/public-v1 from #148, resolved the conflicts and updated this PR. Would you please take a look? I'll test it with the MRW App on a few platforms so that we can make sure it is working as intended.

@climbfuji Your updates look correct.

@climbfuji
Copy link
Contributor Author

@GeorgeGayno-NOAA Should we decide to proceed with this PR, I want to suggest to pull in your bugfix PR for develop that corrected the calls (arguments) to mpi_abort etc. What do you think?

I have no problem with that. I am sure @DusanJovic-NOAA would agree.

Thanks, @GeorgeGayno-NOAA. I cherry-picked the commits relevant for release/public-v1 from #148, resolved the conflicts and updated this PR. Would you please take a look? I'll test it with the MRW App on a few platforms so that we can make sure it is working as intended.

@climbfuji Your updates look correct.

@GeorgeGayno-NOAA this works as expected, tested on Cheyenne with both GNU and Intel and on Orion with Intel. Can you please approve and merge?

@climbfuji climbfuji merged commit 2837e2c into ufs-community:release/public-v1 Sep 28, 2020
@climbfuji
Copy link
Contributor Author

Thanks, @GeorgeGayno-NOAA - merged and retagged, pushed the tag to the repo.

@climbfuji climbfuji changed the title release/public-v1: attempt to fix grib2 read errors with large task counts (draft PR) release/public-v1: fix grib2 read errors with large task counts (draft PR) Oct 1, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants