Hera runs fail at run_fcst #389

Closed
danielabdi-noaa opened this issue Oct 1, 2022 · 4 comments
Labels
bug Something isn't working

Comments

@danielabdi-noaa
Collaborator

danielabdi-noaa commented Oct 1, 2022

Expected behavior

Runs on Hera should not fail.

Current behavior

The SRW App on Hera fails at run_fcst, most likely because of a module/hpc-stack update. I rebuilt the same clone of the SRW App: it fails with the new binaries but not with the old ones.

Machines affected

Hera

Steps To Reproduce

Run the CI tests on Hera.

Output

 file: module_write_netcdf.F90 line:          917
 NetCDF: Name contains illegal characters
Abort(1) on node 10 (rank 10 in comm 496): application called MPI_Abort(comm=0x84000002, 1) - process 10
srun: error: h4c03: tasks 0-11: Exited with exit code 1
srun: launch/slurm: _step_signal: Terminating StepId=36328810

danielabdi-noaa added the bug label Oct 1, 2022

@MichaelLueken
Collaborator

@danielabdi-noaa This is a very interesting issue. It looks like a new version of netcdf (netcdf/4.9.0) was added to the hpc stack for Intel 2022 on Friday. For some reason, rather than using netcdf/4.7.4 to build the app components and run jobs, the srw_common modulefile is loading the new netcdf/4.9.0 module. When I go into modulefiles/build_hera_intel and add:

module unload netcdf/4.9.0
module load netcdf/4.7.4

to the end, manually submitted WE2E tests run through to completion.

I don't understand why the srw_common modulefile isn't loading the netcdf/4.7.4 module. This module is still available and it is being explicitly referenced in the modulefile. Also, I'm unable to find any documentation suggesting that netcdf files generated with netcdf/4.7.4 aren't compatible with netcdf/4.9.0.
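
One way to confirm which netcdf version actually ends up loaded after sourcing the build modulefile is to filter the terse module listing. The following is a minimal sketch, not something posted in this thread: the helper name `loaded_version` is hypothetical, and it assumes the listing prints one `name/version` entry per line, as `module -t list` normally does.

```shell
# Print the version of a loaded module, reading a terse module listing on stdin.
# Assumption: each input line has the form "name/version".
loaded_version() {
    # Keep only lines that start with "<name>/" and strip that prefix.
    sed -n "s|^$1/||p"
}

# Example on a system with modules (the listing goes to stderr, hence 2>&1):
#   module -t list 2>&1 | loaded_version netcdf
```

If this prints 4.9.0 even though the modulefile asks for netcdf/4.7.4, something loaded later has swapped the version.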

@danielabdi-noaa
Collaborator Author

@MichaelLueken Good catch! As you pointed out, it looks like an hpc-stack update changed the default netcdf version to 4.9.0. It is indeed odd that srw_common is not loading the specific version it references. I will look into it.

@danielabdi-noaa
Collaborator Author

The default was changed to 4.9.0 three days ago

$ ls /scratch2/NCEPDEV/nwprod/hpc-stack/libs/hpc-stack/modulefiles/mpi/intel/2022.1.2/impi/2022.1.2/netcdf/ -l
total 8
-rw-r--r-- 1 Hang.Lei nwprod 1298 Mar  2  2022 4.7.4.lua
-rw-r--r-- 1 Hang.Lei nwprod 1289 Sep 30 15:34 4.9.0.lua
lrwxrwxrwx 1 Hang.Lei nwprod    9 Sep 30 15:34 default -> 4.9.0.lua
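
The same check can be scripted by resolving the `default` symlink directly. This is a small sketch under the assumption that the module directory follows the layout in the listing above (a `default` link pointing at a `<version>.lua` file); the function name `default_version` is hypothetical.

```shell
# Print the version that a module directory's "default" symlink points to.
# Assumption: the link target is named "<version>.lua".
default_version() {
    local dir="$1" target
    # readlink fails if "default" is missing or is not a symlink.
    target=$(readlink "$dir/default") || return 1
    # Drop any leading path and the .lua suffix.
    basename "$target" .lua
}

# Example, using the directory from the listing above:
#   default_version /scratch2/NCEPDEV/nwprod/hpc-stack/libs/hpc-stack/modulefiles/mpi/intel/2022.1.2/impi/2022.1.2/netcdf
```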

@danielabdi-noaa
Collaborator Author

@MichaelLueken I think what is happening is that we do load netcdf/4.7.4 in srw_common, but two other libraries are now built with netcdf/4.9.0, so loading them unloads 4.7.4 and loads 4.9.0. A temporary solution I have is to change the order of

module load nccmp/1.8.9.0
module load ncio/1.1.2

so that they come before netcdf/4.7.4 is loaded. I will test this change and open a PR if it works reasonably.
The best fix would be to update to netcdf/4.9.0, but because the version is set in srw_common and other systems may not have the update, that is risky.
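
The ordering constraint described here can be checked mechanically. The sketch below assumes the modulefile uses plain `module load name/version` lines like the ones quoted above; the function name `loads_before_netcdf` is hypothetical and not part of the SRW App.

```shell
# Return 0 if every "module load" of nccmp or ncio appears before the first
# "module load netcdf/..." line in the given modulefile, and 1 otherwise.
# Assumption: loads are written as "module load name/version".
loads_before_netcdf() {
    awk '
        /module load netcdf\//        { seen = 1 }            # netcdf has been loaded
        /module load (nccmp|ncio)\//  { if (seen) late = 1 }  # loaded after netcdf
        END { exit (late ? 1 : 0) }
    ' "$1"
}

# Example:
#   loads_before_netcdf modulefiles/srw_common && echo "order OK"
```

With this ordering, the explicit netcdf/4.7.4 load comes last and so determines the version left in the environment.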
