Fix HAFS GSI debug build and run issues #679
…bugs regarding uninitialized variables (e.g. toff) and dual_res in GSI.
1. In read_radar.f90, the uninitialized toff causes all ground-based radar observations to be placed at -3h instead of 0h, which creates wrong increments for FGAT and 4DEnVar.
2. In read_radar.f90, the uninitialized zsges crashes debug mode.
3. In read_radar.f90, t4dvo should be used instead of t4dv in the read_radar_l2rw_novadqc subroutine.
4. In radinfo.f90, maxscan should be increased to at least 252 to allow more scans.
5. In read_fl_hdob.f90, dlnpsob is replaced with 1000. since the SFMR does not sample surface pressure, and the uninitialized dlnpsob creates issues later in setupspd.f90.
6. In mod_fv3_lola.f90, (i,j+1) should be used instead of (i+1,j) when searching for V edges.
… missing in case of future use.
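The toff issue described above (observations landing at -3h instead of 0h) can be illustrated with a small arithmetic sketch. This is Python for illustration only, not the Fortran in read_radar.f90. It assumes a 6 h assimilation window centred on the analysis time, that toff is the offset from the window start to the analysis time, and that the uninitialized toff happened to be zero. The helper names are hypothetical.

```python
# Illustrative only: a sketch of the timing arithmetic, not the GSI code.
WINDOW_HOURS = 6.0                    # assumption: 6 h assimilation window
ANALYSIS_OFFSET = WINDOW_HOURS / 2.0  # analysis time sits 3 h after window start


def time_in_window(obs_minus_analysis_hours, toff_hours):
    """Observation time measured from the start of the window (hypothetical helper)."""
    return toff_hours + obs_minus_analysis_hours


def relative_to_analysis(t_window_hours):
    """Convert a window-relative time back to hours from the analysis time."""
    return t_window_hours - ANALYSIS_OFFSET


# An observation taken exactly at the analysis time.
good = time_in_window(0.0, toff_hours=ANALYSIS_OFFSET)
# Assumption: the uninitialized toff happens to hold zero.
bad = time_in_window(0.0, toff_hours=0.0)

print(relative_to_analysis(good))  # 0.0
print(relative_to_analysis(bad))   # -3.0
```

Under these assumptions, an observation valid at the analysis time is treated as if it were valid at the start of the window, which is consistent with the shifted increments reported for FGAT and 4DEnVar.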
@yonghuiweng Could you please review this PR? Thanks. |
@XuLu-NOAA Could you please run regression test on WCOSS and HERA? |
@ShunLiu-NOAA I can do it on Hera, but no WCOSS2 account yet. @yonghuiweng Can you help run it on WCOSS2? Thanks! |
5 tests failed on WCOSS2. I will re-test later. 29% tests passed, 5 tests failed out of 7 Total Test time (real) = 1554.23 sec The following tests FAILED: |
Thanks, Yonghui! Here are my tests on Hera:
|
After rebuilding the code and trying several times, I still passed only 3 tests (one more than on the first try): Test project /lfs/h2/emc/hur/noscrub/yonghui.weng/regression/GSI/build 43% tests passed, 4 tests failed out of 7 Total Test time (real) = 1615.83 sec The following tests FAILED: |
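The ctest summary lines quoted in this thread all follow the same pattern, so results from different machines can be compared mechanically. The sketch below is a hypothetical helper (not part of the GSI regression suite) that pulls the counts out of such a line with a regular expression.

```python
import re

# Matches summary lines such as "43% tests passed, 4 tests failed out of 7".
SUMMARY_RE = re.compile(r"(\d+)% tests passed, (\d+) tests failed out of (\d+)")


def parse_ctest_summary(text):
    """Return pass/fail counts from a ctest summary line, or None if absent."""
    match = SUMMARY_RE.search(text)
    if match is None:
        return None
    percent, failed, total = (int(group) for group in match.groups())
    return {
        "percent_passed": percent,
        "failed": failed,
        "total": total,
        "passed": total - failed,
    }


line = "43% tests passed, 4 tests failed out of 7 Total Test time (real) = 1615.83 sec"
print(parse_ctest_summary(line)["passed"])  # 3
```

Run over the saved output of each machine, this gives a quick view of whether the set of passing tests changed between attempts.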
Hi, @yonghuiweng , can you run with --rerun-failed --output-on-failure, and put the log on Orion? Then I can check if 1 & 4 failed due to run time or something else. |
Yes, this test result is from the run with --rerun-failed. |
Thanks, Yonghui! |
Xu
Thanks a lot for the informative analysis of the possible cause of the
failure of the global ens4dvar test.
I will take a look on WCOSS2 next week and will first check whether it
comes from the changed maxscan, as you suspected.
Regards
Ting
…______________________________
Ting Lei
Physical Scientist, Contractor with Lynker in support of
EMC/NCEP/NWS/NOAA
5830 University Research Ct., Cubicle 2765
College Park, MD 20740
***@***.***
301-683-3624
On Sat, Jan 13, 2024 at 6:11 PM Xu Lu ***@***.***> wrote:
Thanks, Yonghui!
I checked the log files for the global 4DEnVar. The initial gradient and
cost function are the same, but the b of the first iteration is different.
Since this is global 4D, it should have nothing to do with the dual_res
fix.
And the rw & spd in the convinfo are -1 in this configuration, so it
should have nothing to do with the read_radar and read_fl_hdob fixes.
The only possible change is the increased maxscan that expands the
satellite channels. I suspect that the original configuration was using
some random values, but I am not sure how to isolate it, as we do not know
which sat obs exceed the default of 250. Also, I'm not sure why it does
not show up on Orion/Hera. It will be difficult for me to identify the
issue. Does anyone have clues or suggestions?
Thanks!
|
Guys, WCOSS2 is still unavailable to me; I will give an update later. |
@yonghuiweng @XuLu-NOAA I just finished the global_4denvar test and it passed (see output c.out in /lfs/h2/emc/da/noscrub/Ting.Lei/dr-xu/GSI/build). Yonghui, would you please take a look to see what the differences are between your run and mine? Thanks. |
Hi, @yonghuiweng , would you mind rerunning the regression test to see if you can reproduce the error on WCOSS2, since Ting does not appear able to reproduce it? |
@XuLu-NOAA and @TingLei-NOAA, I did a start-over test, passed 5 out of 7 tasks and only 2 tests failed (Start 5: hafs_4denvar_glbens and Start 6: hafs_3denvar_hybens). The result is saved at /lfs/h2/emc/hur/noscrub/yonghui.weng/noscrub/regression. |
The 2nd test shows: 57% tests passed, 3 tests failed out of 7 Total Test time (real) = 1489.99 sec The following tests FAILED: |
@yonghuiweng could you give the path of your test on WCOSS2? @TingLei-NOAA will look into the details of the failure. |
Test project /lfs/h2/emc/hur/noscrub/yonghui.weng/regression/GSI/build |
@XuLu-NOAA Great, thanks for this
note/clarification.
Ting
On Tue, Mar 5, 2024 at 2:12 PM Xu Lu ***@***.***> wrote:
@XuLu-NOAA <https://github.com/XuLu-NOAA> , so what is the cause of the
failed tests? Are they ignorable, or is further investigation needed?
Hi, Ting, the failures in HAFS 3d/4d are expected due to the fixes in toff
& mod_fv3_lola. This is the old pull request that has been frozen due to my
absence in the merging with the newest develop branch.
|
Here's the RT_test results on Hera:
|
Hi, @TingLei-NOAA or @yonghuiweng , could either of you help run the ctest on WCOSS2 machines? I would imagine that's the last step of this PR. Thanks! |
Here is the test on wcoss2: Test project /lfs/h2/emc/hur/noscrub/yonghui.weng/regression/toff_fix/build 43% tests passed, 4 tests failed out of 7 Total Test time (real) = 1504.07 sec The following tests FAILED: |
Hi, @yonghuiweng , by any chance, can you take a look at the results and see if the failures in 1&4 are due to the time limit? If not, would you mind copying the error messages over to Orion? Appreciated! |
@XuLu-NOAA Both of them failed due to a nonreproducible issue between loproc_updat and loproc_control. The files are copied to: /work2/noaa/hwrf/noscrub/yweng/regression/toff_fix_wcoss2. |
Thanks, @yonghuiweng! I checked the global 4denvar dir. The error is back to what we saw around Jan 13, where your run failed but Ting could not reproduce it. The initial gradient & cost functions are the same, but the first b is somehow slightly different. My guess is still that maxscan in the original configuration goes out of bounds in some special cases. But @TingLei-NOAA would you mind trying to reproduce the error as before when you have time? Thanks for the help from both of you! |
@XuLu-NOAA there are some problems with the global_4denvar test on WCOSS2. You can find comments from @RussTreadon-NOAA across several issues/PRs. I also have an update on this: #712. |
Thanks, @TingLei-NOAA , How about the netcdf_fv3? The loproc & hiproc stdout for each updat & contrl are consistent. But the initial gradients are different between updat & contrl. Can you reproduce those errors on WCOSS2 as well? |
@XuLu-NOAA I will look at netcdf_fv3 right away. |
@XuLu-NOAA using debug mode GSI, netcdf_fv3 regression test passed on wcoss2. Shall we dig more to see what the differences are between mine and @yonghuiweng 's runs? |
A quick question, in the old rerun of Jan 13 case, did you run in debug mode as well? Can you repeat his failure in non-debug mode? |
No, the previous re-run used the optimized GSI. When I have a chance, I will try the optimized one this time. |
@XuLu-NOAA, With @TingLei-NOAA's PR#698 merged into develop, is it a good time to revisit this PR? |
Hi, @ShunLiu-NOAA, I've synced the latest develop branch and tried the regression tests on Orion & Hera:
Orion:
The HAFS 4d & 3d failed as expected. The global 4denvar failed on Orion due to the maximum time issue: So everything looks fine on Orion & Hera. @yonghuiweng or @TingLei-NOAA Would you mind having a try with this latest version on WCOSS2 and see if the previous failure still persists? Thanks, |
@XuLu-NOAA and @TingLei-NOAA Thank you for the regression tests. With Sho's PR#700 merged, would you mind rerunning the regression tests on Orion and WCOSS2? Sorry for the inconvenience. |
Hi, @ShunLiu-NOAA , here are my ctest results on Hera:
And on Orion:
The failures in hafs 4d/3d are expected. The failure in global 4d on Orion was due to the data access permission issue. Let's see what Ting found out with his test on WCOSS2. Best, |
@XuLu-NOAA , if you routinely use Orion or Hercules you should request Here's a blurb about Restricted Data access on Orion
We can not loosen restrictions on @ShunLiu-NOAA and @hu5970 , would you please check the files in CASES/regtest |
On wcoss2, hafs 3d/4d regression tests failed as expected while all other tests passed. |
Hi, @RussTreadon-NOAA , we already have the rstprod access in the AIM system for Hera/Jet etc, so we cannot choose it again. Do you have any idea whom we should contact in this case? |
I would contact your task lead(s) and have them contact your federal oversight manager(s). |
According to groups, you both belong to the hurricane project.
According to the AIM View a list of projects link, Vijay Tallapragada is the portfolio manager and PI for the hurricane project. |
Thank you Russ!
…On Tue, Mar 26, 2024 at 3:38 PM RussTreadon-NOAA ***@***.***> wrote:
According to groups you both belong to the hurricane project.
Orion-login-3:~$ groups xulu
xulu : noaa-hpc stmp aoml-hafs1 aoml-hafsda hurricane hwrf
Orion-login-3:~$ groups jcheng
jcheng : noaa-hpc hurricane hwrf da-cpu da
According to the AIM View a list of projects
<https://aim.rdhpcs.noaa.gov/cgi-bin/project.pl> link, Vijay Tallapragada
is the portfolio manager and PI for the hurricane project.
|
DUE DATE for merger of this PR into develop is 2/19/2024 (six weeks after PR creation).
DUE DATE for this PR is extended to 3/19/2024 because @XuLu-NOAA is on leave.
Description
Xu Lu (xu.lu@noaa.gov) and Biju Thomas (biju.thomas@noaa.gov) fixed bugs regarding HAFS GSI debug build and run issues. This corresponds to issue #661.
Fixes #661
Type of change
How Has This Been Tested?
Regression test on Orion:
The failed hafs_3denvar and 4denvar are within expectation due to the fix for toff. As demonstrated in the single observation tests in the following figure, the uninitialized toff can result in increment degradations due to wrongly assigned observation times: