Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Set ACCNR in rt.sh, not detect_machine.sh. Add some comments explaini… #1016

Conversation

SamuelTrahanNOAA
Copy link
Collaborator

@SamuelTrahanNOAA SamuelTrahanNOAA commented Jan 20, 2022

PR Checklist

  • This PR is up-to-date with the top of all sub-component repositories except for those sub-components which are the subject of this PR. Please consult the ufs-weather-model wiki if you are unsure how to do this.

  • This PR has been tested using a branch which is up-to-date with the top of all sub-component repositories except for those sub-components which are the subject of this PR

  • An Issue describing the work contained in this PR has been created either in the subcomponent(s) or in the ufs-weather-model. The Issue should be created in the repository that is most relevant to the changes in contained in the PR. The Issue and the dependent sub-component PR
    are specified below.

  • Results for one or more of the regression tests change and the reasons for the changes are understood and explained below.

  • New or updated input data is required by this PR. If checked, please work with the code managers to update input data sets on all platforms.

Instructions: All subsequent sections of text should be filled in as appropriate.

The information provided below allows the code managers to understand the changes relevant to this PR, whether those changes are in the ufs-weather-model repository or in a subcomponent repository. Ufs-weather-model code managers will use the information provided to add any applicable labels, assign reviewers and place it in the Commit Queue. Once the PR is in the Commit Queue, it is the PR owner's responsiblity to keep the PR up-to-date with the develop branch of ufs-weather-model.

Description

The rt.sh ignores the ACCNR value set by the user in the environment because the big per-platform "if/elif/fi" hard-codes it to per-platform values (ACCNR=GFS-DEV) instead of allowing a user-specified default (ACCNR=${ACCNR:-GFS-DEV}). It has to do this because the detect_machine.sh sets an incorrect default value (nems) so ACCNR is always set by the time it reaches the rt.sh per-platform if/elif/fi block. The fix is to remove the setting from detect_machine.sh and allow a user-specified default in rt.sh. The final default of ACCNR=${ACCNR:-nems} happens after that block to ensure the original behavior of a "nems" default when appropriate.

Warning: If a user or automated test system was setting ACCNR to an invalid value, then rt.sh will use that value after this PR, and will fail. Make certain your scripts are setting ACCNR properly, or not setting it at all, before testing this!

Issue(s) addressed

The first attempt at fixing this (issue #979) is in PR #981, which caused a bug #1014 and didn't fix issues mentioned in #1015.

Resolves #1014 and resolves #1015

Testing

How were these changes tested? What compilers / HPCs was it tested with? Are the changes covered by regression tests? (If not, why? Do new tests need to be added?) Have regression tests and unit tests (utests) been run? On which platforms and with which compilers? (Note that unit tests can only be run on tier-1 platforms)

To avoid wasting CPU time, I have run minimal testing, sufficient to ensure the ACCNR is passed properly. To fully trust these changes, we need to run baseline generation and regression testing for all platforms. This PR should be combined with an answer-changing PR to avoid wasting CPU time. Also note that this may break the github-driven automated testing if it set ACCNR incorrectly.

  • hera.intel
  • hera.gnu
  • orion.intel
  • cheyenne.intel
  • cheyenne.gnu
  • gaea.intel
  • jet.intel
  • wcoss_cray
  • wcoss_dell_p3
  • opnReqTest for newly added/changed feature
  • CI

Dependencies

None.

…ng where the variable should not be set or used.
@SamuelTrahanNOAA
Copy link
Collaborator Author

SamuelTrahanNOAA commented Jan 20, 2022

To make sure nobody misses it, I'm going to repeat this warning from the PR description:

Warning: If a user or automated test system was setting ACCNR to an invalid value, then rt.sh will use that value after this PR, and will fail. Make certain your scripts are setting ACCNR properly, or not setting it at all, before testing this!

@SamuelTrahanNOAA
Copy link
Collaborator Author

Also, an answer-changing PR should be merged into this branch as soon as possible so the system can be fully tested. I only tested enough to ensure the ACCNR is passed correctly.

@MinsukJi-NOAA
Copy link
Contributor

Also, an answer-changing PR should be merged into this branch as soon as possible so the system can be fully tested. I only tested enough to ensure the ACCNR is passed correctly.

I will pull this PR into #1009, which is the current one in the commit queue.

@SamuelTrahanNOAA
Copy link
Collaborator Author

@MinsukJi-NOAA Your branch breaks Jet support, so I'll need something else to test against.

CMake Error at CMakeLists.txt:111 (find_package):
  By not providing "FindNetCDF.cmake" in CMAKE_MODULE_PATH this project has
  asked CMake to find a package configuration file provided by "NetCDF", but
  CMake did not find one.

  Could not find a package configuration file provided by "NetCDF" with any
  of the following names:

    NetCDFConfig.cmake
    netcdf-config.cmake

  Add the installation prefix of "NetCDF" to CMAKE_PREFIX_PATH or set
  "NetCDF_DIR" to a directory containing one of the above files.  If "NetCDF"
  provides a separate development package or SDK, be sure it has been
  installed.

@SamuelTrahanNOAA
Copy link
Collaborator Author

I cannot see anything in your branch that would break Jet support, but somehow it fails to compile for me, while the community/develop succeeds.

@SamuelTrahanNOAA
Copy link
Collaborator Author

No, cancel that, the develop just failed too. A slightly older version of develop succeeds. I haven't tracked down what broke jet support yet...

I think I need someone else to try, to see if this is something specific to my account.

@SamuelTrahanNOAA
Copy link
Collaborator Author

@MinsukJi-NOAA Never mind. Now everything is compiling properly. That may have been a temporary system issue.

@MinsukJi-NOAA
Copy link
Contributor

@MinsukJi-NOAA Never mind. Now everything is compiling properly. That may have been a temporary system issue.

I tried to run the develop on jet, and got this error:

+ cp fv3_conf/compile_slurm.IN_jet fv3_conf/compile_slurm.IN
+ mkdir -p /lfs4/HFIP/nems/Minsuk.Ji/RT_BASELINE/Minsuk.Ji
mkdir: cannot create directory '/lfs4/HFIP/nems': Permission denied
++ echo 'rt.sh error on line 396'
rt.sh error on line 396

@SamuelTrahanNOAA
Copy link
Collaborator Author

Are you setting ACCNR to anything?

@SamuelTrahanNOAA
Copy link
Collaborator Author

@MinsukJi-NOAA The develop branch does not run on Jet. You need to use my updates (that's why I did the PR).

@MinsukJi-NOAA
Copy link
Contributor

Are you setting ACCNR to anything?

No. I guess this error will be fixed by your code changes in this PR?

@SamuelTrahanNOAA
Copy link
Collaborator Author

Yes. My changes correct the code so it will use h-nems by default on Jet, or the value of ACCNR if you set one. On develop, you had to set ACCNR or it would break.

@MinsukJi-NOAA
Copy link
Contributor

It's good to merge this PR into #1009 ?

@SamuelTrahanNOAA
Copy link
Collaborator Author

SamuelTrahanNOAA commented Jan 21, 2022

@MinsukJi-NOAA I am testing #1009 and #1016 together right now on Jet. I will let you know when I trust the merge.

/lfs4/HFIP/hfv3gfs/s/pr-1009/tests
/lfs4/HFIP/hfv3gfs/s/rt/r/rt_158888

Edit: I had typed the wrong rt_ directory path.

@climbfuji
Copy link
Collaborator

climbfuji commented Jan 21, 2022 via email

@SamuelTrahanNOAA
Copy link
Collaborator Author

@climbfuji That should still work of course, but with this PR everything will work as originally intended.

@climbfuji
Copy link
Collaborator

climbfuji commented Jan 21, 2022 via email

@MinsukJi-NOAA
Copy link
Contributor

At least I can confirm that there is no problem with finding netcdf, i.e. I don’t have the issue that you reported initially.

I confirm that there is no issue finding netcdf on jet.

@SamuelTrahanNOAA
Copy link
Collaborator Author

I have no trouble finding NetCDF now either. All of the build jobs are succeeding.

@SamuelTrahanNOAA
Copy link
Collaborator Author

@MinsukJi-NOAA The Jet tests are going well, so I started Hera intel and gnu. Right now, all three workflows are generating baselines (rt.sh -c)

@SamuelTrahanNOAA
Copy link
Collaborator Author

Hera finished generating baselines. It is validating against them now.

/scratch1/NCEPDEV/stmp4/Samuel.Trahan/FV3_RT/REGRESSION_TEST_INTEL
/scratch1/NCEPDEV/stmp4/Samuel.Trahan/FV3_RT/REGRESSION_TEST_GNU

Jet is still generating baselines.

@SamuelTrahanNOAA
Copy link
Collaborator Author

Hera gnu and intel tests have passed. Jet has generated its baseline, and is verifying against that now.

/lfs4/HFIP/hfv3gfs/s/rt/t/Samuel.Trahan/FV3_RT/REGRESSION_TEST_INTEL

@SamuelTrahanNOAA
Copy link
Collaborator Author

@MinsukJi-NOAA I can confidently say the #1016 and #1009 work together now. However, this PR has not been reviewed yet, and is not scheduled for merge until January 25.

@MinsukJi-NOAA
Copy link
Contributor

@MinsukJi-NOAA I can confidently say the #1016 and #1009 work together now. However, this PR has not been reviewed yet, and is not scheduled for merge until January 25.

Thanks for testing @SamuelTrahanNOAA . Since the changes in this PR are small and help with Regression Testing, this PR can go in with #1009.

@SamuelTrahanNOAA
Copy link
Collaborator Author

@MinsukJi-NOAA Someone should review this before it is merged with anything else.

@SamuelTrahanNOAA
Copy link
Collaborator Author

The Jet tests passed. Though, it doesn't matter much at this point since @MinsukJi-NOAA is merging these changes in his PR #1009. That'll trigger a rerun of the regression test system on all platforms, including Jet.

@SamuelTrahanNOAA
Copy link
Collaborator Author

These changes were merged in #1009 so I'm closing this PR.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Many platforms override ACCNR with a hard-coded value Bug in rt.sh jet section when ACCNR is not set
3 participants