enable matrix_ncep on orion #441

Closed
JessicaMeixner-NOAA opened this issue Aug 4, 2021 · 16 comments · Fixed by #468

Comments

@JessicaMeixner-NOAA
Collaborator

We should be able to run matrix_ncep on orion. The first step was to change mpirun to srun, which is the launcher we should also be using on hera, since both machines use slurm. I'm still debugging issues on orion, including:
-- I think the parmetis library needs to be rebuilt now that we're using hpc-stack modules on orion (@aliabdolali can you help with this?)
-- the oasis tests fail (see the question in issue #440)
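
The mpirun-to-srun change mentioned above could look roughly like the sketch below. It is not the actual matrix_ncep code: the helper names `pick_launcher` and `launch_cmd` are hypothetical, and the sketch assumes a job is running under slurm whenever `SLURM_JOB_ID` is set.

```shell
# Hypothetical helper (not from the WW3 scripts): choose the MPI launcher.
# Assumption: slurm sets SLURM_JOB_ID inside a batch or interactive job.
pick_launcher() {
  if [ -n "${SLURM_JOB_ID:-}" ]; then
    echo "srun"
  else
    echo "mpirun"
  fi
}

# Build the launch command for a task count and an executable name.
launch_cmd() {
  ntasks="$1"
  exe="$2"
  case "$(pick_launcher)" in
    srun) echo "srun -n ${ntasks} ${exe}" ;;
    *)    echo "mpirun -np ${ntasks} ${exe}" ;;
  esac
}

# Example: print the command that would be used for 24 tasks of ww3_shel.
launch_cmd 24 ww3_shel
```

Keeping the launcher choice in one function means hera and orion can share the same test script as long as both run under slurm.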

This work is being done on: https://github.com/JessicaMeixner-NOAA/WW3/tree/orion

When completed, the hope is to be able to use the hpc-stack modules on orion and run the WW3 regression tests on orion as well as hera.

@JessicaMeixner-NOAA JessicaMeixner-NOAA added the enhancement New feature or request label Aug 4, 2021
@JessicaMeixner-NOAA JessicaMeixner-NOAA self-assigned this Aug 4, 2021
@aliabdolali
Contributor

@jessica, the path to metis/parmetis on orion points to the libraries compiled using hpc-stack:
/work/noaa/marine/ali.abdolali/Source/hpc-stack/parmetis-4.0.3/lib
I checked matrix_ncep and it refers to the path above. Do you get a failure when using it?

@JessicaMeixner-NOAA
Collaborator Author

They're built with hpc-intel/2019.5, but ufs-weather-model uses hpc-intel/2018.4 (see: https://github.com/ufs-community/ufs-weather-model/blob/develop/modulefiles/ufs_orion.intel#L16-L18), so I am switching to that intel version unless there's a reason we should deviate from it?

@aliabdolali
Contributor

@JessicaMeixner-NOAA I just removed the ones built with intel/2019 and recompiled them with the same version of hpc-stack:
module use /apps/contrib/NCEP/libs/hpc-stack/modulefiles/stack

module load hpc/1.1.0
module load hpc-intel/2018.4
module load hpc-impi/2018.4

the path did not change:
/work/noaa/marine/ali.abdolali/Source/hpc-stack/parmetis-4.0.3/lib
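
After a rebuild like this, a quick check that the expected libraries are present in the install directory can save a failed compile. The sketch below is only an illustration: `check_libs` is a hypothetical helper, and the library file name `libparmetis.a` in the example is an assumption about what the build produces, not something stated in this thread.

```shell
# Hypothetical helper: verify that each named library file exists in a directory.
# Usage: check_libs <libdir> <file> [<file> ...]
check_libs() {
  libdir="$1"
  shift
  for lib in "$@"; do
    if [ ! -f "${libdir}/${lib}" ]; then
      echo "missing ${lib} in ${libdir}"
      return 1
    fi
  done
  echo "ok ${libdir}"
}

# Example call (file name is an assumption):
# check_libs /work/noaa/marine/ali.abdolali/Source/hpc-stack/parmetis-4.0.3/lib libparmetis.a
```

The check only confirms the files exist. It cannot tell which compiler version built them, so the module list above still has to match the one used for WW3.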

@JessicaMeixner-NOAA
Collaborator Author

Thanks @aliabdolali, the PDLIB tests now seem to be passing.

Current issues are:

  • Several tests are ending in segmentation fault (including at least one of the ufs1.2 tests)
  • NetCDF error, similar to (or the same as) what Ricardo found:
    *** WAVEWATCH III ERROR IN OUNF :
    LINE NUMBER 2187
    NETCDF ERROR MESSAGE:
    NetCDF: Name contains illegal characters

@JessicaMeixner-NOAA
Collaborator Author

FYI @ricampos

@JessicaMeixner-NOAA
Collaborator Author

I can get past the segfaults I was having by adding:
ulimit -s unlimited
Now I have run into #442
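
In a job script, the stack-limit workaround above might be wrapped as in the sketch below. `raise_stack_limit` is a hypothetical helper, and the fallback to the hard limit is an assumption for systems where `unlimited` is not permitted; the thread itself only uses `ulimit -s unlimited`.

```shell
# Hypothetical helper: raise the soft stack limit before launching the model.
raise_stack_limit() {
  if ulimit -s unlimited 2>/dev/null; then
    :
  else
    # Assumption: if "unlimited" is refused, settle for the hard limit.
    ulimit -s "$(ulimit -H -s)" 2>/dev/null || true
  fi
  # Report the limit now in effect so it appears in the job log.
  ulimit -s
}

raise_stack_limit
```

Printing the resulting limit makes it easy to confirm from the log that the setting took effect in the shell that launches the executable.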

@ricampos
Collaborator

ricampos commented Aug 6, 2021

Thanks, Jessica. I will leave myself a note to remember to add this line.

@JessicaMeixner-NOAA
Collaborator Author

Okay, at this point I have a branch that runs everything on orion except for the netcdf output with partitions; those tests still fail.

@JessicaMeixner-NOAA
Collaborator Author

@aliabdolali @ricampos should I go ahead and make a PR with the updates as of now or wait until we have a fix for the netcdf issues on orion?

@aliabdolali
Contributor

@JessicaMeixner-NOAA Thanks, please go ahead and make the PR. If needed, please make an issue associated with this problem.

@ricampos
Collaborator

Hi Jessica, I found the problem on Orion. When ww3_ounf is compiled with netcdf/4.7.4, the program crashes during partition writing with the message "NetCDF: Name contains illegal characters", as you saw. It partially writes the file (without partitions) and then stops, but the problematic netcdf file is still created.
When I recompiled the model with netcdf/4.7.2, ww3_ounf worked nicely. All good. See results at:
/work/noaa/marine/ricardo.campos/models/WW3/regtests/ww3_ufs1.3/output
I compared the partition characters and text with those of the non-partition variables, and I tried to edit w3ounfmetamd, but I couldn't make it work with netcdf/4.7.4, only with netcdf/4.7.2.
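
Since the behavior here depends on the netcdf version the model was built against, it can help to record that version in the job log. The sketch below is an illustration only: `netcdf_version` and `check_netcdf_version` are hypothetical helpers, the version to flag is passed in by the caller, and the only evidence for flagging 4.7.4 is the Orion report in this comment.

```shell
# Hypothetical helper: print the netCDF-C version reported by nc-config
# (nc-config --version prints a line such as "netCDF 4.7.4").
netcdf_version() {
  nc-config --version 2>/dev/null | awk '{print $2}'
}

# Hypothetical helper: compare a version string against one the caller wants flagged.
# Usage: check_netcdf_version <found_version> <flagged_version>
check_netcdf_version() {
  found="$1"
  flagged="$2"
  if [ -z "${found}" ]; then
    echo "unknown: nc-config not found or gave no version"
    return 2
  elif [ "${found}" = "${flagged}" ]; then
    echo "warn: netcdf ${found} matches flagged version"
    return 1
  fi
  echo "ok: netcdf ${found}"
}

# Example call:
# check_netcdf_version "$(netcdf_version)" 4.7.4
```

A version match is only a hint. The same version string can come from different builds, so a warning here is a prompt to check the build, not proof of a netcdf bug.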

@ricampos
Collaborator

From now on I will always use module load netcdf/4.7.2 in my jobscripts.

@JessicaMeixner-NOAA
Collaborator Author

There was an issue when running the regtests on hera. I thought I had solved that problem, but I guess not, so no pull request yet for this branch.

@ricampos while it's great that netcdf/4.7.2 solves the problem, it's not an hpc-stack module, which is what we want to use. Let's make a new issue for just the netcdf problem on orion, using the hpc-stack modules instead. If needed, we can create a simple test case to post in an issue on hpc-stack itself.

@ricampos
Collaborator

Understood.
But what if this is a netcdf/4.7.4 issue instead of a WW3 issue?

@JessicaMeixner-NOAA
Collaborator Author

> Understood.
> But what if this is a netcdf/4.7.4 issue instead of a WW3 issue?

It works with netcdf/4.7.4 on hera. I'll make a new issue; let's continue this conversation there.

@ricampos
Collaborator

ok
