Fix HAFS GSI debug build and run issues #679

Merged
merged 9 commits into from
Mar 27, 2024

Conversation

XuLu-NOAA
Contributor

@XuLu-NOAA XuLu-NOAA commented Jan 9, 2024

DUE DATE for merger of this PR into develop is 2/19/2024 (six weeks after PR creation).
DUE DATE for this PR is extended to 3/19/2024 because @XuLu-NOAA is on leave.
Description

Xu Lu (xu.lu@noaa.gov) and Biju Thomas (biju.thomas@noaa.gov) fixed bugs in the HAFS GSI debug build and run. This corresponds to issue #661.

Fixes #661

    1. In read_radar.f90, the uninitialized toff places all ground-based radar observations at -3h instead of 0h, which creates wrong increments for FGAT and 4DEnVar.
    2. In read_radar.f90, the uninitialized zsges crashes the run in debug mode.
    3. In read_radar.f90, t4dvo should be used instead of t4dv in the read_radar_l2rw_novadqc subroutine.
    4. In radinfo.f90, maxscan should be increased to at least 252 to allow more scans; otherwise the run crashes in debug mode.
    5. In read_fl_hdob.f90, dlnpsob is replaced with 1000. because the SFMR does not sample surface pressure, and the uninitialized dlnpsob causes problems later in setupspd.f90 in debug mode.
    6. In mod_fv3_lola.f90, (i,j+1) should be used instead of (i+1,j) when searching for V edges.
    7. In stpcalc.f90, when the best stepsize is selected from outpen around L838-864, the minimum outstp(i) is stored in stp(ii), but istp_use is assigned i instead of ii. This creates an inconsistency when stp(istp_use) is assigned to stpinout at L872. istp_use=ii should be used instead.
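The index mismatch described in item 7 can be illustrated with a small sketch. This is a minimal Python illustration and not the stpcalc.f90 code: the function name pick_stepsize, the use_fix switch, and the sample values are assumptions made up for the example, and indices are zero-based here rather than one-based as in Fortran.

```python
def pick_stepsize(outstp, outpen, stp, ii, use_fix=True):
    """Store the lowest-penalty candidate step in stp[ii], then read it back.

    outstp / outpen: candidate stepsizes and their penalties (illustrative).
    stp: per-iteration stepsize slots; ii: slot for the current iteration.
    use_fix=False mimics remembering the candidate index i instead of ii.
    """
    best_pen = float("inf")
    istp_use = ii
    for i, (step, pen) in enumerate(zip(outstp, outpen)):
        if pen < best_pen:
            best_pen = pen
            stp[ii] = step                     # best candidate always lands in slot ii
            istp_use = ii if use_fix else i    # the buggy variant records i
    return stp[istp_use]                       # value that would go to stpinout


# Hypothetical sample values, chosen only to expose the mismatch.
outstp = [0.1, 0.4, 0.8]
outpen = [5.0, 3.0, 1.0]          # the last candidate has the lowest penalty
fixed = pick_stepsize(outstp, outpen, [0.0] * 4, ii=0, use_fix=True)
buggy = pick_stepsize(outstp, outpen, [0.0] * 4, ii=0, use_fix=False)
print(fixed, buggy)
```

With these sample values the fixed variant returns the selected step (0.8), while the buggy variant reads slot 2, which was never written, and returns the stale 0.0.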

Type of change

  • [Yes] Bug fix (non-breaking change which fixes an issue)

How Has This Been Tested?
Regression test on Orion:

Test project /work/noaa/hwrf/save/xulu/mergeversions/GSI/build
CMake Warning (dev) at CTestTestfile.cmake:9 (subdirs):  Syntax Warning in cmake code at /work/noaa/hwrf/save/xulu/mergeversions/GSI/build/regression/CTestTestfile.cmake:7:10
1/7 Test #4: [=[netcdf_fv3_regional]=] ........   Passed  365.11 sec
2/7 Test #7: [=[global_enkf]=] ................   Passed  430.29 sec
3/7 Test #3: [=[rrfs_3denvar_glbens]=] ........   Passed  605.35 sec
4/7 Test #2: [=[rtma]=] .......................   Passed  969.78 sec
5/7 Test #6: [=[hafs_3denvar_hybens]=] ........***Failed  1455.47 sec
6/7 Test #1: [=[global_4denvar]=] .............   Passed  1682.40 sec
7/7 Test #5: [=[hafs_4denvar_glbens]=] ........***Failed  1758.90 sec

The hafs_3denvar_hybens and hafs_4denvar_glbens failures are expected because of the toff fix. As the single-observation tests in the following figure show, the uninitialized toff can degrade the increments because observation times are assigned incorrectly:
image
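The arithmetic behind the -3h placement can be sketched as follows. This is only an illustration under stated assumptions, not the read_radar.f90 logic: it assumes a 6-hour window with the analysis time at its center, that the in-window time is computed as toff plus the observation time relative to the analysis, and that the uninitialized toff happened to hold zero (uninitialized memory could hold any value). The function names are hypothetical.

```python
WINDOW_HOURS = 6.0      # assumed assimilation window length
TOFF_EXPECTED = 3.0     # assumed offset from window start to analysis time


def time_in_window(obs_minus_analysis_h, toff):
    """Hours since the start of the window (assumed definition)."""
    return toff + obs_minus_analysis_h


def relative_to_analysis(t_in_window):
    """Convert an in-window time back to hours relative to the analysis."""
    return t_in_window - WINDOW_HOURS / 2.0


# An observation taken exactly at the analysis time:
good = relative_to_analysis(time_in_window(0.0, TOFF_EXPECTED))
bad = relative_to_analysis(time_in_window(0.0, 0.0))   # toff stuck at zero
print(good, bad)
```

Under these assumptions the correctly initialized offset keeps the observation at 0h, while a zero offset shifts it to the start of the window, that is, -3h.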

…bugs regarding uninitialized variables (e.g. toff) and dual_res in GSI.

        1. In read_radar.f90, the uninitialized toff places all ground-based radar observations at -3h instead of 0h, which creates wrong increments for FGAT and 4DEnVar.
        2. In read_radar.f90, the uninitialized zsges crashes the run in debug mode.
        3. In read_radar.f90, t4dvo should be used instead of t4dv in the read_radar_l2rw_novadqc subroutine.
        4. In radinfo.f90, maxscan should be increased to at least 252 to allow more scans.
        5. In read_fl_hdob.f90, dlnpsob is replaced with 1000. because the SFMR does not sample surface pressure, and the uninitialized dlnpsob causes problems later in setupspd.f90.
        6. In mod_fv3_lola.f90, (i,j+1) should be used instead of (i+1,j) when searching for V edges.
@ShunLiu-NOAA
Contributor

@yonghuiweng Could you please review this PR? Thanks.

@ShunLiu-NOAA
Contributor

@XuLu-NOAA Could you please run regression test on WCOSS and HERA?

@XuLu-NOAA
Contributor Author

> @XuLu-NOAA Could you please run regression test on WCOSS and HERA?

@ShunLiu-NOAA I can do it on Hera, but no WCOSS2 account yet. @yonghuiweng Can you help run it on WCOSS2? Thanks!

@yonghuiweng

5 tests failed on WCOSS2. I will re-test later.
Test project /lfs/h2/emc/hur/noscrub/yonghui.weng/regression/GSI/build
Start 1: global_4denvar
Start 2: rtma
Start 3: rrfs_3denvar_glbens
Start 4: netcdf_fv3_regional
Start 5: hafs_4denvar_glbens
Start 6: hafs_3denvar_hybens
Start 7: global_enkf
1/7 Test #4: netcdf_fv3_regional ..............***Failed 483.53 sec
2/7 Test #3: rrfs_3denvar_glbens .............. Passed 725.68 sec
3/7 Test #7: global_enkf ...................... Passed 1076.69 sec
4/7 Test #2: rtma .............................***Failed 1150.46 sec
5/7 Test #6: hafs_3denvar_hybens ..............***Failed 1213.19 sec
6/7 Test #5: hafs_4denvar_glbens ..............***Failed 1453.97 sec
7/7 Test #1: global_4denvar ...................***Failed 1554.23 sec

29% tests passed, 5 tests failed out of 7

Total Test time (real) = 1554.23 sec

The following tests FAILED:
1 - global_4denvar (Failed)
2 - rtma (Failed)
4 - netcdf_fv3_regional (Failed)
5 - hafs_4denvar_glbens (Failed)
6 - hafs_3denvar_hybens (Failed)
Errors while running CTest

@XuLu-NOAA
Contributor Author

Thanks, Yonghui! Here are my tests on Hera:

Test project /scratch1/NCEPDEV/hwrf/scrub/Xu.Lu/regression/GSI/build
    Start 1: [=[global_4denvar]=]
    Start 2: [=[rtma]=]
    Start 3: [=[rrfs_3denvar_glbens]=]
    Start 4: [=[netcdf_fv3_regional]=]
    Start 5: [=[hafs_4denvar_glbens]=]
    Start 6: [=[hafs_3denvar_hybens]=]
    Start 7: [=[global_enkf]=]
1/7 Test #7: [=[global_enkf]=] ................   Passed  1359.20 sec
2/7 Test #4: [=[netcdf_fv3_regional]=] ........   Passed  2545.41 sec
3/7 Test #3: [=[rrfs_3denvar_glbens]=] ........   Passed  2545.47 sec
4/7 Test #5: [=[hafs_4denvar_glbens]=] ........***Failed  4761.09 sec
5/7 Test #6: [=[hafs_3denvar_hybens]=] ........***Failed  4763.34 sec
6/7 Test #1: [=[global_4denvar]=] .............   Passed  5586.05 sec
7/7 Test #2: [=[rtma]=] .......................   Passed  22761.94 sec

71% tests passed, 2 tests failed out of 7

Total Test time (real) = 22761.99 sec

The following tests FAILED:
          5 - [=[hafs_4denvar_glbens]=] (Failed)
          6 - [=[hafs_3denvar_hybens]=] (Failed)
Errors while running CTest
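
The failure sets on the two machines can be compared mechanically. The sketch below parses ctest summary lines copied from the WCOSS2 and Hera logs above and reports the tests that fail only on WCOSS2. The regular expression and the helper name failed_tests are assumptions written for this example; they are not part of the GSI regression suite.

```python
import re

# Matches summary lines such as
#   "1/7 Test #4: netcdf_fv3_regional ..............***Failed 483.53 sec"
# and the bracketed form "[=[global_enkf]=]".
LINE_RE = re.compile(
    r"^\s*\d+/\d+\s+Test\s+#\d+:\s+(?:\[=\[)?(\w[\w-]*)(?:\]=\])?"
    r"\s+\.+\s*(\*\*\*Failed|Passed)\s+[\d.]+\s+sec"
)


def failed_tests(ctest_output):
    """Return the set of test names reported as failed."""
    failed = set()
    for line in ctest_output.splitlines():
        match = LINE_RE.match(line)
        if match and match.group(2) == "***Failed":
            failed.add(match.group(1))
    return failed


WCOSS2_LOG = """
1/7 Test #4: netcdf_fv3_regional ..............***Failed 483.53 sec
2/7 Test #3: rrfs_3denvar_glbens .............. Passed 725.68 sec
3/7 Test #7: global_enkf ...................... Passed 1076.69 sec
4/7 Test #2: rtma .............................***Failed 1150.46 sec
5/7 Test #6: hafs_3denvar_hybens ..............***Failed 1213.19 sec
6/7 Test #5: hafs_4denvar_glbens ..............***Failed 1453.97 sec
7/7 Test #1: global_4denvar ...................***Failed 1554.23 sec
"""

HERA_LOG = """
1/7 Test #7: [=[global_enkf]=] ................   Passed  1359.20 sec
2/7 Test #4: [=[netcdf_fv3_regional]=] ........   Passed  2545.41 sec
3/7 Test #3: [=[rrfs_3denvar_glbens]=] ........   Passed  2545.47 sec
4/7 Test #5: [=[hafs_4denvar_glbens]=] ........***Failed  4761.09 sec
5/7 Test #6: [=[hafs_3denvar_hybens]=] ........***Failed  4763.34 sec
6/7 Test #1: [=[global_4denvar]=] .............   Passed  5586.05 sec
7/7 Test #2: [=[rtma]=] .......................   Passed  22761.94 sec
"""

only_wcoss2 = failed_tests(WCOSS2_LOG) - failed_tests(HERA_LOG)
print(sorted(only_wcoss2))
```

The set difference isolates the failures that need a machine-specific explanation (here global_4denvar, netcdf_fv3_regional and rtma), as opposed to the two HAFS tests that fail everywhere.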

@yonghuiweng

After rebuilding the code and trying several times, only 3 tests passed (one more than on the 1st try):

Test project /lfs/h2/emc/hur/noscrub/yonghui.weng/regression/GSI/build
Start 1: global_4denvar
Start 2: rtma
Start 3: rrfs_3denvar_glbens
Start 4: netcdf_fv3_regional
Start 5: hafs_4denvar_glbens
Start 6: hafs_3denvar_hybens
Start 7: global_enkf
1/7 Test #4: netcdf_fv3_regional ..............***Failed 664.86 sec
2/7 Test #3: rrfs_3denvar_glbens .............. Passed 727.66 sec
3/7 Test #7: global_enkf ...................... Passed 1106.25 sec
4/7 Test #2: rtma ............................. Passed 1276.90 sec
5/7 Test #6: hafs_3denvar_hybens ..............***Failed 1277.65 sec
6/7 Test #5: hafs_4denvar_glbens ..............***Failed 1463.64 sec
7/7 Test #1: global_4denvar ...................***Failed 1615.82 sec

43% tests passed, 4 tests failed out of 7

Total Test time (real) = 1615.83 sec

The following tests FAILED:
1 - global_4denvar (Failed)
4 - netcdf_fv3_regional (Failed)
5 - hafs_4denvar_glbens (Failed)
6 - hafs_3denvar_hybens (Failed)
Errors while running CTest

@XuLu-NOAA
Contributor Author

Hi, @yonghuiweng , can you run with --rerun-failed --output-on-failure, and put the log on Orion? Then I can check if 1 & 4 failed due to run time or something else.

@yonghuiweng

> Hi, @yonghuiweng , can you run with --rerun-failed --output-on-failure, and put the log on Orion? Then I can check if 1 & 4 failed due to run time or something else.

Yes, this test result is from the run with --rerun-failed.
I copied the whole folder as: orion:/work2/noaa/hwrf/noscrub/yweng/regression/GSI.

@XuLu-NOAA
Contributor Author

Thanks, Yonghui!
After checking the log files for the global 4DEnVar, I found that the initial grady and cost function are the same, but the b of the first iteration is different.
Since this is global 4D, it should have nothing to do with the dual_res fix.
And rw & spd in the convinfo are -1 in this configuration, so it should have nothing to do with the read_radar and read_fl_hdob fixes.
The only possible change is the increased maxscan, which expands the satellite channels. I suspect that the original configuration was using some random values, but I am not sure how to isolate it, as we do not know which sat obs exceed the default of 250. Also, I'm not sure why it does not show up on Orion/Hera. It will be difficult for me to identify the issue. Does anyone have clues or suggestions?
Thanks!

@TingLei-NOAA
Contributor

TingLei-NOAA commented Jan 14, 2024 via email

@TingLei-NOAA
Contributor

Guys, WCOSS2 is still unavailable to me; I will give an update later.

@TingLei-NOAA
Contributor

@yonghuiweng @XuLu-NOAA I just finished the global_4denvar test and it passed (see the output c.out in /lfs/h2/emc/da/noscrub/Ting.Lei/dr-xu/GSI/build). Yonghui, would you please have a look to see what the differences are between your run and mine? Thanks.

@XuLu-NOAA
Contributor Author

> @yonghuiweng @XuLu-NOAA I just finished the global_4denvar test and it passed (see the output c.out in /lfs/h2/emc/da/noscrub/Ting.Lei/dr-xu/GSI/build). Yonghui, would you please have a look to see what the differences are between your run and mine? Thanks.

Hi, @yonghuiweng, would you mind rerunning the regression test to see if you can reproduce the error on WCOSS2, since Ting does not appear to be able to reproduce it?

@yonghuiweng

@XuLu-NOAA and @TingLei-NOAA, I did a start-over test: 5 out of 7 tests passed and only 2 failed (Start 5: hafs_4denvar_glbens and Start 6: hafs_3denvar_hybens). The result is saved at /lfs/h2/emc/hur/noscrub/yonghui.weng/noscrub/regression.
I realize the error I made in the last test was that I did not check out https://github.com/NOAA-EMC/GSI; I used the previous version from when I did the dualres tests.
Thanks.

@yonghuiweng

> @XuLu-NOAA and @TingLei-NOAA, I did a start-over test: 5 out of 7 tests passed and only 2 failed (Start 5: hafs_4denvar_glbens and Start 6: hafs_3denvar_hybens). The result is saved at /lfs/h2/emc/hur/noscrub/yonghui.weng/noscrub/regression. I realize the error I made in the last test was that I did not check out https://github.com/NOAA-EMC/GSI; I used the previous version from when I did the dualres tests. Thanks.

The 2nd test shows:
Test project /lfs/h2/emc/hur/noscrub/yonghui.weng/regression/GSI/build
Start 5: hafs_4denvar_glbens
Start 6: hafs_3denvar_hybens
Start 1: global_4denvar
Start 7: global_enkf
Start 2: rtma
Start 3: rrfs_3denvar_glbens
Start 4: netcdf_fv3_regional
1/7 Test #4: netcdf_fv3_regional ..............***Failed 484.02 sec
2/7 Test #3: rrfs_3denvar_glbens .............. Passed 665.67 sec
3/7 Test #2: rtma ............................. Passed 1029.06 sec
4/7 Test #7: global_enkf ...................... Passed 1078.96 sec
5/7 Test #6: hafs_3denvar_hybens ..............***Failed 1210.73 sec
6/7 Test #5: hafs_4denvar_glbens ..............***Failed 1329.76 sec
7/7 Test #1: global_4denvar ................... Passed 1489.98 sec

57% tests passed, 3 tests failed out of 7

Total Test time (real) = 1489.99 sec

The following tests FAILED:
4 - netcdf_fv3_regional (Failed)
5 - hafs_4denvar_glbens (Failed)
6 - hafs_3denvar_hybens (Failed)
Errors while running CTest

@ShunLiu-NOAA
Contributor

@yonghuiweng could you give the path of your test on WCOSS2? @TingLei-NOAA will take a look into the details of the failure.

@yonghuiweng

Test project /lfs/h2/emc/hur/noscrub/yonghui.weng/regression/GSI/build

@TingLei-NOAA
Contributor

TingLei-NOAA commented Mar 5, 2024 via email

@XuLu-NOAA
Contributor Author

Here are the RT_test results on Hera:

Test project /scratch1/NCEPDEV/hwrf/scrub/Xu.Lu/regression/GSI/build
......
    Start 1: [=[global_4denvar]=]
    Start 2: [=[rtma]=]
    Start 3: [=[rrfs_3denvar_glbens]=]
    Start 4: [=[netcdf_fv3_regional]=]
    Start 5: [=[hafs_4denvar_glbens]=]
    Start 6: [=[hafs_3denvar_hybens]=]
    Start 7: [=[global_enkf]=]
1/7 Test #4: [=[netcdf_fv3_regional]=] ........   Passed  484.31 sec
2/7 Test #3: [=[rrfs_3denvar_glbens]=] ........   Passed  487.22 sec
3/7 Test #7: [=[global_enkf]=] ................   Passed  497.44 sec
4/7 Test #6: [=[hafs_3denvar_hybens]=] ........***Failed  1098.63 sec
5/7 Test #2: [=[rtma]=] .......................   Passed  1148.34 sec
6/7 Test #5: [=[hafs_4denvar_glbens]=] ........***Failed  1335.69 sec
7/7 Test #1: [=[global_4denvar]=] .............   Passed  1625.53 sec

71% tests passed, 2 tests failed out of 7

Total Test time (real) = 1625.56 sec

The following tests FAILED:
          5 - [=[hafs_4denvar_glbens]=] (Failed)
          6 - [=[hafs_3denvar_hybens]=] (Failed)
Errors while running CTest

@XuLu-NOAA
Contributor Author

Hi, @TingLei-NOAA or @yonghuiweng , could any of you help run the ctest on WCOSS2 machines? I would imagine that's the last step of this PR. Thanks!

@yonghuiweng

> Hi, @TingLei-NOAA or @yonghuiweng , could any of you help run the ctest on WCOSS2 machines? I would imagine that's the last step of this PR. Thanks!

Here is the test on wcoss2:

Test project /lfs/h2/emc/hur/noscrub/yonghui.weng/regression/toff_fix/build
Start 1: global_4denvar
Start 2: rtma
Start 3: rrfs_3denvar_glbens
Start 4: netcdf_fv3_regional
Start 5: hafs_4denvar_glbens
Start 6: hafs_3denvar_hybens
Start 7: global_enkf
1/7 Test #4: netcdf_fv3_regional ..............***Failed 487.23 sec
2/7 Test #3: rrfs_3denvar_glbens .............. Passed 671.29 sec
3/7 Test #2: rtma ............................. Passed 1031.69 sec
4/7 Test #7: global_enkf ...................... Passed 1129.97 sec
5/7 Test #6: hafs_3denvar_hybens ..............***Failed 1215.04 sec
6/7 Test #5: hafs_4denvar_glbens ..............***Failed 1460.97 sec
7/7 Test #1: global_4denvar ...................***Failed 1504.06 sec

43% tests passed, 4 tests failed out of 7

Total Test time (real) = 1504.07 sec

The following tests FAILED:
1 - global_4denvar (Failed)
4 - netcdf_fv3_regional (Failed)
5 - hafs_4denvar_glbens (Failed)
6 - hafs_3denvar_hybens (Failed)
Errors while running CTest

@XuLu-NOAA
Contributor Author

Hi, @yonghuiweng, by any chance, can you take a look at the results and see if the failures in 1 & 4 are due to the time limit? If not, would you mind copying the error messages over to Orion? Much appreciated!

@yonghuiweng

@XuLu-NOAA Both of them failed due to a non-reproducibility issue between loproc_updat and loproc_control. The files are copied to: /work2/noaa/hwrf/noscrub/yweng/regression/toff_fix_wcoss2.

@XuLu-NOAA
Contributor Author

> @XuLu-NOAA Both of them failed due to a non-reproducibility issue between loproc_updat and loproc_control. The files are copied to: /work2/noaa/hwrf/noscrub/yweng/regression/toff_fix_wcoss2.

Thanks, @yonghuiweng! I checked the global 4denvar dir. The error is the same as what we saw around Jan 13, when your run failed but Ting could not reproduce it. The initial gradient & cost functions are the same, but the first b is somehow slightly different. My guess is still that maxscan in the original configuration goes out of bounds on some special occasions. @TingLei-NOAA, would you mind trying to reproduce the error as before when you have time? Thanks for the help from both of you!

@TingLei-NOAA
Contributor

@XuLu-NOAA there are some problems with the global_4denvar test on WCOSS2. You can find comments from @RussTreadon-NOAA across several issues/PRs. I also have an update on this: #712.

@XuLu-NOAA
Contributor Author

> @XuLu-NOAA there are some problems with the global_4denvar test on WCOSS2. You can find comments from @RussTreadon-NOAA across several issues/PRs. I also have an update on this: #712.

Thanks, @TingLei-NOAA. How about the netcdf_fv3? The loproc & hiproc stdout for each of updat & contrl are consistent, but the initial gradients are different between updat & contrl. Can you reproduce those errors on WCOSS2 as well?

@TingLei-NOAA
Contributor

@XuLu-NOAA I will look at netcdf_fv3 right away.

@TingLei-NOAA
Contributor

@XuLu-NOAA using debug mode GSI, netcdf_fv3 regression test passed on wcoss2. Shall we dig more to see what the differences are between mine and @yonghuiweng 's runs?

@XuLu-NOAA
Contributor Author

> @XuLu-NOAA using debug mode GSI, netcdf_fv3 regression test passed on wcoss2. Shall we dig more to see what the differences are between mine and @yonghuiweng 's runs?

A quick question, in the old rerun of Jan 13 case, did you run in debug mode as well? Can you repeat his failure in non-debug mode?

@TingLei-NOAA
Contributor

> > @XuLu-NOAA using debug mode GSI, netcdf_fv3 regression test passed on wcoss2. Shall we dig more to see what the differences are between mine and @yonghuiweng 's runs?
>
> A quick question, in the old rerun of Jan 13 case, did you run in debug mode as well? Can you repeat his failure in non-debug mode?

No, the previous re-run used the optimized GSI. When I have a chance, I will try the optimized build this time as well.

@ShunLiu-NOAA
Contributor

@XuLu-NOAA, With @TingLei-NOAA's PR#698 merged into develop, is it a good time to revisit this PR?

@XuLu-NOAA
Contributor Author

> @XuLu-NOAA, With @TingLei-NOAA's PR#698 merged into develop, is it a good time to revisit this PR?

Hi, @ShunLiu-NOAA, I've synced the latest develop branch and tried the regression tests on Orion & Hera:
Hera:

Test project /scratch1/NCEPDEV/hwrf/save/Xu.Lu/regression/GSI/build
    Start 1: global_4denvar
    Start 2: rtma
    Start 3: rrfs_3denvar_glbens
    Start 4: netcdf_fv3_regional
    Start 5: hafs_4denvar_glbens
    Start 6: hafs_3denvar_hybens
    Start 7: global_enkf
1/7 Test #3: rrfs_3denvar_glbens ..............   Passed  428.21 sec
2/7 Test #4: netcdf_fv3_regional ..............   Passed  485.33 sec
3/7 Test #7: global_enkf ......................   Passed  745.08 sec
4/7 Test #2: rtma .............................   Passed  969.41 sec
5/7 Test #5: hafs_4denvar_glbens ..............***Failed  1218.28 sec
6/7 Test #6: hafs_3denvar_hybens ..............***Failed  1279.63 sec
7/7 Test #1: global_4denvar ...................   Passed  1867.09 sec

71% tests passed, 2 tests failed out of 7

Total Test time (real) = 1867.11 sec

The following tests FAILED:
          5 - hafs_4denvar_glbens (Failed)
          6 - hafs_3denvar_hybens (Failed)
Errors while running CTest
Output from these tests are in: /scratch1/NCEPDEV/hwrf/save/Xu.Lu/regression/GSI/build/Testing/Temporary/LastTest.log
Use "--rerun-failed --output-on-failure" to re-run the failed cases verbosely.

Orion:

Test project /work/noaa/hwrf/save/xulu/mergeversions/GSI/build
...
    Start 1: [=[global_4denvar]=]
    Start 5: [=[hafs_4denvar_glbens]=]
    Start 6: [=[hafs_3denvar_hybens]=]
    Start 2: [=[rtma]=]
    Start 3: [=[rrfs_3denvar_glbens]=]
    Start 4: [=[netcdf_fv3_regional]=]
    Start 7: [=[global_enkf]=]
1/7 Test #1: [=[global_4denvar]=] .............***Failed  420.28 sec
2/7 Test #7: [=[global_enkf]=] ................   Passed  1267.86 sec
3/7 Test #4: [=[netcdf_fv3_regional]=] ........   Passed  1382.99 sec
4/7 Test #3: [=[rrfs_3denvar_glbens]=] ........   Passed  1446.05 sec
5/7 Test #2: [=[rtma]=] .......................   Passed  1867.99 sec
6/7 Test #6: [=[hafs_3denvar_hybens]=] ........***Failed  1936.05 sec
7/7 Test #5: [=[hafs_4denvar_glbens]=] ........***Failed  2237.03 sec

57% tests passed, 3 tests failed out of 7

Total Test time (real) = 2237.15 sec

The following tests FAILED:
          1 - [=[global_4denvar]=] (Failed)
          5 - [=[hafs_4denvar_glbens]=] (Failed)
          6 - [=[hafs_3denvar_hybens]=] (Failed)
Errors while running CTest

The HAFS 4d & 3d failed as expected. The global 4denvar failed on Orion due to the maximum time issue:
/work/noaa/hwrf/save/xulu/mergeversions/test/noscrub/regression/global_4denvar_regression_results.txt

So everything looks fine on Orion & Hera.

@yonghuiweng or @TingLei-NOAA Would you mind having a try with this latest version on WCOSS2 and see if the previous failure still persists?

Thanks,
Xu

@ShunLiu-NOAA
Contributor

@XuLu-NOAA and @TingLei-NOAA Thank you for the regression tests. With Sho's PR#700 merged, do you mind rerunning the regression tests on Orion and WCOSS2? Sorry for the inconvenience.

@XuLu-NOAA
Contributor Author

> @XuLu-NOAA and @TingLei-NOAA Thank you for the regression tests. With Sho's PR#700 merged, do you mind rerunning the regression tests on Orion and WCOSS2? Sorry for the inconvenience.

Hi, @ShunLiu-NOAA, here are my ctest results on Hera:

Test project /scratch1/NCEPDEV/hwrf/save/Xu.Lu/regression/GSI/build
    Start 1: global_4denvar
    Start 2: rtma
    Start 3: rrfs_3denvar_glbens
    Start 4: netcdf_fv3_regional
    Start 5: hafs_4denvar_glbens
    Start 6: hafs_3denvar_hybens
    Start 7: global_enkf
1/7 Test #3: rrfs_3denvar_glbens ..............   Passed  427.49 sec
2/7 Test #4: netcdf_fv3_regional ..............   Passed  544.45 sec
3/7 Test #7: global_enkf ......................   Passed  805.65 sec
4/7 Test #2: rtma .............................   Passed  968.66 sec
5/7 Test #6: hafs_3denvar_hybens ..............***Failed  1042.71 sec
6/7 Test #5: hafs_4denvar_glbens ..............***Failed  1282.48 sec
7/7 Test #1: global_4denvar ...................   Passed  1925.60 sec

71% tests passed, 2 tests failed out of 7

Total Test time (real) = 1925.63 sec

The following tests FAILED:
          5 - hafs_4denvar_glbens (Failed)
          6 - hafs_3denvar_hybens (Failed)
Errors while running CTest
Output from these tests are in: /scratch1/NCEPDEV/hwrf/save/Xu.Lu/regression/GSI/build/Testing/Temporary/LastTest.log

And on Orion:

Test project /work/noaa/hwrf/save/xulu/mergeversions/GSI/build
    Start 1: global_4denvar
    Start 2: rtma
    Start 3: rrfs_3denvar_glbens
    Start 4: netcdf_fv3_regional
    Start 5: hafs_4denvar_glbens
    Start 6: hafs_3denvar_hybens
    Start 7: global_enkf
1/7 Test #1: global_4denvar ...................***Failed  120.82 sec
2/7 Test #4: netcdf_fv3_regional ..............   Passed  1687.09 sec
3/7 Test #7: global_enkf ......................   Passed  1852.08 sec
4/7 Test #3: rrfs_3denvar_glbens ..............   Passed  1930.60 sec
5/7 Test #6: hafs_3denvar_hybens ..............***Failed  2247.16 sec
6/7 Test #2: rtma .............................   Passed  2538.01 sec
7/7 Test #5: hafs_4denvar_glbens ..............***Failed  2726.58 sec

57% tests passed, 3 tests failed out of 7

Total Test time (real) = 2727.01 sec

The following tests FAILED:
          1 - global_4denvar (Failed)
          5 - hafs_4denvar_glbens (Failed)
          6 - hafs_3denvar_hybens (Failed)
Errors while running CTest
Output from these tests are in: /work/noaa/hwrf/save/xulu/mergeversions/GSI/build/Testing/Temporary/LastTest.log
Use "--rerun-failed --output-on-failure" to re-run the failed cases verbosely.

The failures in hafs 4d/3d are expected. The failure in global 4d on Orion was due to the data access permission issue. Let's see what Ting found out with his test on WCOSS2.

Best,
Xu

@RussTreadon-NOAA
Contributor

@XuLu-NOAA , if you routinely use Orion or Hercules you should request rstprod access. A similar suggestion applies to @JingCheng-NOAA who faced the same problem in GSI PR #698.

Here's a blurb about Restricted Data access on Orion

Restricted data (rstprod) is allowed on the MSU-HPC system. Be sure to follow all of NOAA's restricted data policies when using MSU-HPC.

Request access via AIM (https://aim.rdhpcs.noaa.gov/) > Request access to a new project > rstprod.

Provide the following information in your justification:

The machine(s) where you will need rstprod access on (i.e. Hercules, Orion).
The project(s) you will be using rstprod data for.

We can not loosen restrictions on rstprod data in the global_4denvar case. I'm surprised that the regional GSI cases do not use rstprod data. I assume the operational regional systems used rstprod data.

@ShunLiu-NOAA and @hu5970 , would you please check the files in CASES/regtest to ensure all files are properly labeled. All rstprod data must belong to the rstprod group with appropriate permission restrictions. Thanks.
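
A check of this kind could be scripted. The sketch below is one possible approach and not an existing GSI or RDHPCS tool: it walks a directory and lists files that are not owned by a given group or that grant any access to other users. Treating "no access for others" as the required restriction is an assumption; the actual policy should come from NOAA's restricted data rules. The function names are hypothetical.

```python
import os
import stat


def is_restricted(path, required_gid):
    """True if path belongs to required_gid and grants nothing to 'other'."""
    info = os.stat(path)
    no_world_access = (stat.S_IMODE(info.st_mode) & 0o007) == 0
    return info.st_gid == required_gid and no_world_access


def unrestricted_files(root, required_gid):
    """Return a sorted list of files under root that fail the check."""
    bad = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            if not is_restricted(full, required_gid):
                bad.append(full)
    return sorted(bad)


# Example call (commented out because it needs a system with an rstprod group):
#   import grp
#   gid = grp.getgrnam("rstprod").gr_gid
#   print(unrestricted_files("CASES/regtest", gid))
```

Passing the numeric group id keeps the helper independent of the group database, so the same function works for any group once its id has been looked up.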

@TingLei-NOAA
Contributor

On wcoss2, hafs 3d/4d regression tests failed as expected while all other tests passed.

@XuLu-NOAA
Contributor Author

Hi, @RussTreadon-NOAA, we already have rstprod access in the AIM system for Hera/Jet etc., so we cannot select it again. Do you have any idea who we should contact in this case?

@RussTreadon-NOAA
Contributor

I would contact your task lead(s) and have them contact your federal oversight manager(s).

@RussTreadon-NOAA
Contributor

According to groups, you both belong to the hurricane project.

Orion-login-3:~$ groups xulu
xulu : noaa-hpc stmp aoml-hafs1 aoml-hafsda hurricane hwrf
Orion-login-3:~$ groups jcheng
jcheng : noaa-hpc hurricane hwrf da-cpu da

According to the AIM View a list of projects link, Vijay Tallapragada is the portfolio manager and PI for the hurricane project.
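
The same membership check can be done programmatically on the groups output shown above. This is a small sketch, with the helper name parse_groups chosen for the example; the two input lines are copied from the Orion session in this comment.

```python
def parse_groups(line):
    """Split a 'user : group1 group2 ...' line into (user, [groups])."""
    user, _sep, rest = line.partition(":")
    return user.strip(), rest.split()


GROUPS_OUTPUT = [
    "xulu : noaa-hpc stmp aoml-hafs1 aoml-hafsda hurricane hwrf",
    "jcheng : noaa-hpc hurricane hwrf da-cpu da",
]

membership = dict(parse_groups(line) for line in GROUPS_OUTPUT)
has_rstprod = {user: "rstprod" in groups for user, groups in membership.items()}
in_hurricane = {user: "hurricane" in groups for user, groups in membership.items()}
print(has_rstprod, in_hurricane)
```

For these two accounts the result shows membership in hurricane but not in rstprod on Orion, which is consistent with the data access failure reported for global_4denvar.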

@JingCheng-NOAA
Contributor

JingCheng-NOAA commented Mar 26, 2024 via email

@ShunLiu-NOAA ShunLiu-NOAA merged commit b53740a into NOAA-EMC:develop Mar 27, 2024
4 checks passed
Development

Successfully merging this pull request may close these issues.

Address debug build and run issues for HAFS GSI
7 participants