
Move to contrib spack-stack on Jet #787

Merged

Conversation

@InnocentSouopgui-NOAA (Contributor) commented Aug 28, 2024

Description

Following the failure of the lfs4 storage, spack-stack was moved to /contrib, and the Jet module files need to be updated.
This PR updates the Jet module files to point to the new spack-stack path.
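For illustration, loading the relocated stack interactively would look roughly like this (the spack-stack version and environment names below are placeholders, not the exact values changed by this PR):

  # Hypothetical sketch -- the actual spack-stack version/environment on Jet may differ.
  module use /contrib/spack-stack/spack-stack-<version>/envs/gsi-addon/install/modulefiles/Core
  module load stack-intel stack-intel-oneapi-mpi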

Fixes #786

Type of change

Please delete options that are not relevant.

  • Bug fix (non-breaking change which fixes an issue)

How Has This Been Tested?

  • Clone and build on Jet
  • Cycle test with Global Workflow at the following resolutions on Jet:
    • 96/48 on kjet
    • 192/96 on kjet
    • 384/192 on kjet

Checklist

  • My code follows the style guidelines of this project
  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • New and existing tests pass with my changes
  • Any dependent changes have been merged and published

@RussTreadon-NOAA (Contributor)

@KateFriedman-NOAA and @DavidHuber-NOAA, if you have time, would you both review this PR? This PR is in support of g-w issue #2841. GSI PRs need approval from two peer reviewers.

Tagging @hu5970 and @ShunLiu-NOAA for awareness since I believe regional DA uses Jet.

@KateFriedman-NOAA (Member) left a comment

Changes are good, thanks @InnocentSouopgui-NOAA !

@DavidHuber-NOAA (Collaborator) left a comment

The spack-stack path is correct. @InnocentSouopgui-NOAA Have you been able to run the regression tests on Jet?

@InnocentSouopgui-NOAA (Contributor, Author)

> The spack-stack path is correct. @InnocentSouopgui-NOAA Have you been able to run the regression tests on Jet?

No, I have not run the regression tests.
I looked at the regression folder and scripts.
It looks like I will need to provide a control run, but it is no longer possible to generate a new control since the old spack-stack installation is not available on the run nodes. At the same time, given that this PR does not change any code, it should not change any results.

So my plan is to run a couple of cycled experiments with the global workflow at the different resolutions available on Jet.

What do you think @DavidHuber-NOAA ?

@DavidHuber-NOAA (Collaborator)

@InnocentSouopgui-NOAA You can make the control the same version by cloning your branch into a directory named develop. cmake will need to be run a second time to update the RT directory:

git clone git@github.com:InnocentSouopgui-NOAA/gsi -b migration-jet-contrib develop
cd develop/
./ush/build.sh
module use modulefiles
module load gsi_jet.intel
cd build
cmake ..
make
make install
cd regression
ctest -j 7

@RussTreadon-NOAA (Contributor)

@InnocentSouopgui-NOAA , you don't need a control. You could do the following on Jet

  • mkdir develop
  • cd develop
  • git clone -b migration-jet-contrib --recursive https://github.com/InnocentSouopgui-NOAA/GSI.git .
  • cd ush
  • ./build.sh
  • ./build.sh - executing build.sh twice is intentional. The first build will not find executables and so it will turn off regression tests. The second build will detect executables and turn on regression tests. The regression tests will use develop for both the updat and contrl.
  • cd ../build
  • ctest -j 6

@InnocentSouopgui-NOAA (Contributor, Author)

I tried to run the regression tests.
I am getting a "No tests were found!!!" message.

This is the first time I am running the GSI regression tests.
I followed @DavidHuber-NOAA's and @RussTreadon-NOAA's instructions (especially running build.sh twice) and got the same message.
What else may be missing?

@DavidHuber-NOAA (Collaborator)

@InnocentSouopgui-NOAA May I take a look at your local clone?

@InnocentSouopgui-NOAA (Contributor, Author)

Yes @DavidHuber-NOAA, my local clone is at: /lfs5/NESDIS/nesdis-rdo2/Innocent.Souopgui/moving_lfs4/GSI

@RussTreadon-NOAA (Contributor)

@InnocentSouopgui-NOAA, do the following:

  1. cd /lfs5/NESDIS/nesdis-rdo2/Innocent.Souopgui/moving_lfs4
  2. mv GSI develop
  3. cd develop/ush
  4. ./build.sh
  5. cd ../build
  6. ctest -N

You should see 6 tests listed

  Test #1: global_4denvar
  Test #2: rtma
  Test #3: rrfs_3denvar_rdasens
  Test #4: hafs_4denvar_glbens
  Test #5: hafs_3denvar_hybens
  Test #6: global_enkf

@DavidHuber-NOAA (Collaborator)

Ah, I see the issue. For context, the regression tests work by running the GSI cases twice: once for the directory you are running from and once for the "develop" branch. The RTs assume that the develop clone is in a directory named "develop". So you can trick them by naming the directory you are running from "develop", and they will run the same version twice.
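A consolidated sketch of that trick, combining the steps above (the work directory and fork URL are placeholders):

  cd /path/to/work                          # any scratch location on Jet
  git clone -b migration-jet-contrib --recursive https://github.com/<your-fork>/GSI.git develop
  cd develop/ush
  ./build.sh    # first pass: no executables found yet, so regression tests stay off
  ./build.sh    # second pass: executables detected, regression tests turned on (updat and contrl both use this clone)
  cd ../build
  ctest -N      # list the six regression tests without running them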

@RussTreadon-NOAA (Contributor)

As a test, I did what I suggested above but copied to /lfs5/HFIP/hfv3gfs/Russ.Treadon/git/gsi/tmp/develop. ctests are now queued and/or running under my id.

Hera(fe7):/lfs5/HFIP/hfv3gfs/Russ.Treadon/git/gsi/tmp/develop/build$ q
     JOBID PARTITION  NAME                     USER             STATE        TIME TIME_LIMIT NODES NODELIST(REASON)
   9173759 kjet       global_4denvar_loproc_up Russ.Treadon     PENDING      0:00      10:00     8 (Priority)
   9173761 kjet       rtma_loproc_updat        Russ.Treadon     PENDING      0:00      30:00    12 (Priority)
   9173762 kjet       rrfs_3denvar_rdasens_lop Russ.Treadon     PENDING      0:00      15:00     4 (Priority)
   9173763 kjet       hafs_4denvar_glbens_lopr Russ.Treadon     PENDING      0:00      15:00     4 (Priority)
   9173760 kjet       hafs_3denvar_hybens_lopr Russ.Treadon     PENDING      0:00      15:00     4 (Priority)
   9173758 kjet       global_enkf_loproc_updat Russ.Treadon     RUNNING      2:04      10:00     3 k[67,89,320]

@RussTreadon-NOAA (Contributor)

Jet ctests
ctests ran to completion with the following results

Test project /lfs5/HFIP/hfv3gfs/Russ.Treadon/git/gsi/tmp/develop/build
    Start 1: global_4denvar
    Start 2: rtma
    Start 3: rrfs_3denvar_rdasens
    Start 4: hafs_4denvar_glbens
    Start 5: hafs_3denvar_hybens
    Start 6: global_enkf
1/6 Test #3: rrfs_3denvar_rdasens .............***Failed  2460.69 sec
2/6 Test #6: global_enkf ......................   Passed  3491.32 sec
3/6 Test #4: hafs_4denvar_glbens ..............***Failed  4998.34 sec
4/6 Test #1: global_4denvar ...................   Passed  5117.85 sec
5/6 Test #5: hafs_3denvar_hybens ..............   Passed  5418.96 sec
6/6 Test #2: rtma .............................   Passed  42380.16 sec

67% tests passed, 2 tests failed out of 6

Total Test time (real) = 42380.18 sec

The following tests FAILED:
          3 - rrfs_3denvar_rdasens (Failed)
          4 - hafs_4denvar_glbens (Failed)
Errors while running CTest

The hafs_4denvar_glbens failure is due to

The runtime for hafs_4denvar_glbens_loproc_updat is 385.888371 seconds.  This has exceeded maximum allowable threshold time of 376.086700 seconds,
resulting in Failure time-thresh of the regression test.

The timing threshold check is not robust, and this is not a fatal failure. A rerun of this test could return Passed.
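If needed, a single test can be rerun by name from the build directory with ctest's -R filter (a minimal sketch):

  cd develop/build
  ctest -R hafs_4denvar_glbens    # rerun only the test(s) whose name matches this pattern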

@RussTreadon-NOAA (Contributor)

Jet ctest follow-up
Reran hafs_4denvar_glbens on Jet. This time the test passed.

Test project /lfs5/HFIP/hfv3gfs/Russ.Treadon/git/gsi/tmp/develop/build
    Start 4: hafs_4denvar_glbens
1/1 Test #4: hafs_4denvar_glbens ..............   Passed  44537.48 sec

100% tests passed, 0 tests failed out of 1

Total Test time (real) = 44537.53 sec

Account Russ.Treadon only has access to accounting code hfv3gfs. Queue wait times are extremely long when running batch jobs under hfv3gfs as Russ.Treadon.

Test rrfs_3denvar_rdasens fails because file coupler.res does not exist

forrtl: severe (24): end-of-file during read, unit 24, file /mnt/lfs5/HFIP/hfv3gfs/Russ.Treadon/ptmp/tmpreg_rrfs_3denvar_rdasens/rrfs_3denvar_rdasens_loproc_updat/coupler.res
Image              PC                Routine            Line        Source
gsi.x              0000000005A04A08  Unknown               Unknown  Unknown
gsi.x              0000000005A3BB8A  Unknown               Unknown  Unknown
gsi.x              0000000000A7DEAA  gsi_rfv3io_mod_mp         323  gsi_rfv3io_mod.f90
gsi.x              00000000006CF2A3  convert_fv3_regio          37  fv3_regional_interface.f90
gsi.x              00000000004233E3  gsimod_mp_gsimain        2349  gsimod.F90
gsi.x              0000000000414C35  MAIN__                    620  gsimain.f90

From the job log file we see

+ cp /lfs5/NESDIS/nesdis-rdo2/David.Huber/save/CASES/regtest/regional/rrfs/2023061012/ges/fv3_coupler.res coupler.res
cp: cannot stat '/lfs5/NESDIS/nesdis-rdo2/David.Huber/save/CASES/regtest/regional/rrfs/2023061012/ges/fv3_coupler.res': No such file or directory

@DavidHuber-NOAA (Collaborator)

@RussTreadon-NOAA I had to delete the old CASES/regtest/regional/rrfs directory so that a symlink could be created. I have done that now.

@RussTreadon-NOAA (Contributor)

@DavidHuber-NOAA , I'm still seeing failures when rrfs_3denvar_rdasens_loproc_updat runs. Log file /lfs5/HFIP/hfv3gfs/Russ.Treadon/git/gsi/tmp/develop/regression/rrfs_3denvar_rdasens_loproc_updat.out contains several "No such file" messages

ls: cannot access '/lfs5/NESDIS/nesdis-rdo2/David.Huber/save/CASES/regtest/regional/rrfs/2023061012/ens/*gdas.t??z.atmf009.mem0??.nc': No such file or directory
cp: cannot stat '/lfs5/NESDIS/nesdis-rdo2/David.Huber/save/CASES/regtest/regional/rrfs/2023061012/ges/sfc_data.tile7.halo0.nc': No such file or directory
cp: cannot stat '/lfs5/NESDIS/nesdis-rdo2/David.Huber/save/CASES/regtest/regional/rrfs/2023061012/ges/gfs_data.tile7.halo0.nc': No such file or directory
cp: cannot stat '/lfs5/NESDIS/nesdis-rdo2/David.Huber/save/CASES/regtest/regional/rrfs/2023061012/ges/fv3_coupler.res': No such file or directory
sed: can't read coupler.res: No such file or directory
sed: can't read coupler.res: No such file or directory
sed: can't read coupler.res: No such file or directory
sed: can't read coupler.res: No such file or directory
cp: cannot stat '/lfs5/NESDIS/nesdis-rdo2/David.Huber/save/CASES/regtest/regional/rrfs/2023061012/obs/gsd_sfcobs_provider.txt': No such file or directory
cp: cannot stat '/lfs5/NESDIS/nesdis-rdo2/David.Huber/save/CASES/regtest/regional/rrfs/2023061012/obs/current_bad_aircraft': No such file or directory
cp: cannot stat '/lfs5/NESDIS/nesdis-rdo2/David.Huber/save/CASES/regtest/regional/rrfs/2023061012/obs/gsd_sfcobs_uselist.txt': No such file or directory
cp: cannot stat '/lfs5/NESDIS/nesdis-rdo2/David.Huber/save/CASES/regtest/regional/rrfs/2023061012/obs/rrfs.prod.2023061012_satbias_pc': No such file or directory
cp: cannot stat '/lfs5/NESDIS/nesdis-rdo2/David.Huber/save/CASES/regtest/regional/rrfs/2023061012/obs/rrfs.prod.2023061012_satbias': No such file or directory
cp: cannot stat '/lfs5/NESDIS/nesdis-rdo2/David.Huber/save/CASES/regtest/regional/rrfs/2023061012/obs/rrfs.prod.2023061012_radstat': No such file or directory

Directory /lfs5/NESDIS/nesdis-rdo2/David.Huber/save/CASES/regtest/regional/rrfs is a soft link pointing at /scratch1/BMC/wrfruc/mhu/code/data/regional/rrfs.
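As a quick sanity check (not part of the thread), the link and its resolved target can be inspected with:

  ls -ld /lfs5/NESDIS/nesdis-rdo2/David.Huber/save/CASES/regtest/regional/rrfs        # show the symlink and its target
  readlink -f /lfs5/NESDIS/nesdis-rdo2/David.Huber/save/CASES/regtest/regional/rrfs   # print the fully resolved path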

@DavidHuber-NOAA (Collaborator)

@RussTreadon-NOAA My apologies. I have modified my rsync command to follow symlinks. The data should all be available now.
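The actual rsync command is not shown in this thread, but a minimal sketch of staging the data while dereferencing symlinks (the source path is a placeholder) would be:

  # -a preserves permissions and timestamps; -L (--copy-links) copies the files the symlinks point to rather than the links themselves
  rsync -avL /path/to/staged/rrfs/ /lfs5/NESDIS/nesdis-rdo2/David.Huber/save/CASES/regtest/regional/rrfs/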

@RussTreadon-NOAA (Contributor)

@DavidHuber-NOAA , thank you. The rrfs_3denvar_rdasens test now runs on Jet. I have not, however, gotten the full rrfs_3denvar_rdasens ctest to complete. My queue wait time is very long on Jet.

@InnocentSouopgui-NOAA (Contributor, Author) commented Sep 6, 2024

Can we separate this PR from the ctest issue so that we can move forward with the Global Workflow PR?
Cycling at resolutions C96/C48, C192/C96, and C384/C192 ran smoothly.

@RussTreadon-NOAA (Contributor)

@InnocentSouopgui-NOAA , we can do as you suggest, but we need to do the following first

  1. open a new GSI issue documenting the Jet ctest problem
  2. assign someone to work on the newly opened issue

@RussTreadon-NOAA (Contributor) commented Sep 6, 2024

Finally got rrfs_3denvar_rdasens to pass on Jet.

Test project /lfs5/HFIP/hfv3gfs/Russ.Treadon/git/gsi/tmp/develop/build
    Start 3: rrfs_3denvar_rdasens
1/1 Test #3: rrfs_3denvar_rdasens .............   Passed  35471.37 sec

100% tests passed, 0 tests failed out of 1

Total Test time (real) = 35471.46 sec

Jet GSI ctest times are extremely high. My account can only use the hfv3gfs project code, which has zero available core hours.

=================================================================================================================
Report                           Project Report for:hfv3gfs
Report Run:                      Fri 06 Sep 2024 10:23:18 AM  UTC
Report Period Beginning:         Sun 01 Sep 2024 12:00:00 AM  UTC
Report Period Ending:            Tue 01 Oct 2024 12:00:00 AM  UTC
Percentage of Period Elapsed:    18.1%
Percentage of Period Remaining:  81.9%
=================================================================================================================
Machines:                                     jet
Initial Allocation in Hours:            1,676,712
Net Allocation Adjustments:            -1,656,986
                                 ----------------
Adjusted Allocation:                       19,726

Core Hours Used:                           49,908
Windfall Core Hours Used:                       0
                                 ----------------
Total Core Hours Used:                     49,908

Project Normalized Shares:               0.000693
Project Fair Share:                      0.000000
Project Rank:                               44/45

Percentage of Period Elapsed:               18.1%
Percentage of Period Remaining:             81.9%
Percentage of Allocation Used:             100.0%

User                             Cr-HrUsed    Windfall   TotalUsed       %Used      Jobs
------------------------------ ----------- ----------- ----------- ----------- ---------
Edward.Colon                             0           0           0       0.00%         1
Jim.Jung                            49,698           0      49,698     100.00%     1,289
Russ.Treadon                           210           0         210       1.06%        19
------------------------------ ----------- ----------- ----------- ----------- ---------
Total                               49,908           0      49,908     100.00%     1,309

Total Report Runtime: 1.44 seconds (ver. 23.07.06-FNJT)

This PR can be merged pending approval from two peer reviewers. Note: my approval does not count as a peer review.

@RussTreadon-NOAA (Contributor) left a comment

Approve.

@RussTreadon-NOAA (Contributor)

@InnocentSouopgui-NOAA , were you ever able to run the GSI ctests on Jet? If so, please post the results in this PR.

@RussTreadon-NOAA (Contributor)

> @InnocentSouopgui-NOAA , we can do as you suggest, but we need to do the following first
>
> 1. open a new GSI issue documenting the Jet ctest problem
> 2. assign someone to work on the newly opened issue

These actions are no longer necessary. I was able to run ctests on Jet. It would be good to get confirmation from other developers that they, too, can run ctests on Jet.

@DavidHuber-NOAA (Collaborator)

I will try running the ctests on Jet this morning.

@RussTreadon-NOAA (Contributor)

Thank you @DavidHuber-NOAA, but do not stop other work just to run GSI ctests on Jet. @InnocentSouopgui-NOAA said gsi.x and enkf.x work in cycled mode at various resolutions. This, along with my ctest results, is sufficient to move this PR forward. We just need another peer review approval.

@DavidHuber-NOAA (Collaborator) left a comment

Changes look good. Will run the ctests anyway, but I won't hold up the PR given the positive results.

@RussTreadon-NOAA (Contributor)

Thank you @DavidHuber-NOAA!

@RussTreadon-NOAA merged commit 9f44c87 into NOAA-EMC:develop on Sep 6, 2024. 4 checks passed.
DavidHuber-NOAA added a commit to DavidHuber-NOAA/GSI that referenced this pull request Sep 6, 2024
* origin/develop:
  Move to contrib spack-stack on Jet (NOAA-EMC#787)
  a quick workaround for increasing the mpi task numbers on orion for ctest :: rrfs_3denvar_rdasens  (NOAA-EMC#788)
  Recover the capability of handling model fields from operation gfs.v16.3 (NOAA-EMC#785)
  fix a bug in deter_sfc_gmi (NOAA-EMC#781)
  add safeguard to thompson_reff (NOAA-EMC#779)
  Fix incorrect usage of real(i_kind) in mg_input.f90  (NOAA-EMC#760)
  Transition to Thompson Microphysics for Microwave All-sky Assimilation (NOAA-EMC#743)
  Format changes for EUMETSAT metop-sg and CADS debug fix (NOAA-EMC#773)
  Update global_4denvar and global_enkf ctests to reflect GFS v17 (NOAA-EMC#774)
  fix for cris-fsr memory corruption (NOAA-EMC#767)
  Gnssrwnd1.0 (NOAA-EMC#747)