
Optimize the reading of ensembles and setup for global multiscale runs #594

Merged — 51 commits from jderber-NOAA:optimize3 merged into NOAA-EMC:develop on Sep 22, 2023

Conversation

jderber-NOAA
Contributor

@jderber-NOAA jderber-NOAA commented Jul 26, 2023

This update improves the efficiency of the GSI, especially for multiscale runs. Details can be found in issue #585.

The runs produce identical results except when ensembles are used. Identical results can also be produced with ensembles by changing 3 lines of code. These lines zero out negative moistures before creating virtual temperatures, and use the original sensible temperatures rather than temperatures recreated from the virtual temperatures (which were themselves created from the original sensible temperatures).
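
A minimal sketch of why those 3 lines matter, written in Python as a stand-in for the Fortran (the 0.61 constant and all names here are illustrative textbook meteorology, not taken from the GSI source): clipping negative moisture before forming virtual temperature, and round-tripping sensible temperature through virtual temperature, each perturb values only in the last digits.

```python
# Illustrative sketch: virtual temperature from sensible temperature T and
# specific humidity q, Tv = T * (1 + 0.61 * q). The constant and names are
# standard textbook values, not taken from the GSI source.
FV = 0.61

def tv_from_t(t, q):
    return t * (1.0 + FV * q)

def t_from_tv(tv, q):
    return tv / (1.0 + FV * q)

t, q = 287.15, -1.0e-7        # slightly negative moisture, e.g. from interpolation
q_clipped = max(q, 0.0)       # zeroing negative moisture first ...
tv = tv_from_t(t, q_clipped)  # ... changes Tv in the last digits
t_roundtrip = t_from_tv(tv_from_t(t, q), q)  # T -> Tv -> T round trip

# The round trip reproduces T only to machine precision; reusing the original
# sensible temperature skips the extra conversion entirely.
print(tv, tv_from_t(t, q), abs(t_roundtrip - t))
```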

All regression tests passed except those affected by the reason above. When those 3 lines were reverted, all regression tests passed. The differences due to those 3 lines were very minor.

All testing was performed by me on Hera.

See the issue for examples of the speed-ups that resulted from this change.

Fixes #585

Checklist

  • [x] My code follows the style guidelines of this project
  • [x] I have performed a self-review of my own code
  • [x] I have commented my code, particularly in hard-to-understand areas
  • [x] New and existing tests pass with my changes
  • Any dependent changes have been merged and published

DUE DATE for this PR is 9/6/2023. If this PR is not merged into develop by this date, the PR will be closed and returned to the developer.

@jderber-NOAA
Contributor Author

jderber-NOAA commented Sep 3, 2023

Updated to the head of the trunk and removed a commented-out line from build.sh.

@jderber-NOAA
Contributor Author

jderber-NOAA commented Sep 4, 2023

Regression tests were rerun with this updated version of the code. All regression tests passed except 4denvar. The last update to the trunk appears to have introduced a very small change into 4denvar; the difference is at the scale of round-off, with no difference in the initial penalty. The first 3 iterations of the control and update are given below.

< cost,grad,step,b,step? = 1 0 6.592319865747931181E+05 1.700824233239468640E+03 1.057506532852620307E+00 0.000000000000000000E+00 good
< cost,grad,step,b,step? = 1 1 6.555948920937654329E+05 2.114103169411071576E+03 1.927472355822770655E+00 1.302893266057407962E+00 good
< cost,grad,step,b,step? = 1 2 6.469553380174754420E+05 1.349404725181090953E+03 2.810796397253191525E+00 1.212777155871386014E+00 good

> cost,grad,step,b,step? = 1 0 6.592319865747931181E+05 1.700824233239468640E+03 1.057506532852620751E+00 0.000000000000000000E+00 good
> cost,grad,step,b,step? = 1 1 6.555948920937654329E+05 2.114103169411072486E+03 1.927472355822761774E+00 1.302893266057409294E+00 good
> cost,grad,step,b,step? = 1 2 6.469553380174754420E+05 1.349404725181082085E+03 2.810796397253162660E+00 1.212777155871367807E+00 good

Trying to find the reason for this small difference.
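
Differences of this size are the expected signature of reordered floating-point arithmetic, since addition is not associative in finite precision. A tiny self-contained illustration (values chosen for effect, unrelated to the GSI numbers):

```python
# Floating-point addition is not associative: regrouping a sum (as a loop
# rewrite or a different reduction order can do) changes the result at the
# round-off level -- or dramatically, in this contrived case.
a, b, c = 1.0e16, -1.0e16, 1.0
left = (a + b) + c   # 0.0 + 1.0 -> 1.0
right = a + (b + c)  # b + c rounds back to b at this magnitude -> 0.0
print(left, right)
```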

@RussTreadon-NOAA
Contributor

WCOSS2 ctests
Installed jderber-NOAA:optimize3 at 3e918e1 on Dogwood and ran ctests with the following results:

russ.treadon@dlogin08:/lfs/h2/emc/da/noscrub/russ.treadon/git/gsi/pr594/build> ctest -j 9
Test project /lfs/h2/emc/da/noscrub/russ.treadon/git/gsi/pr594/build
    Start 1: global_3dvar
    Start 2: global_4dvar
    Start 3: global_4denvar
    Start 4: hwrf_nmm_d2
    Start 5: hwrf_nmm_d3
    Start 6: rtma
    Start 7: rrfs_3denvar_glbens
    Start 8: netcdf_fv3_regional
    Start 9: global_enkf
1/9 Test #8: netcdf_fv3_regional ..............***Failed  483.81 sec
2/9 Test #7: rrfs_3denvar_glbens ..............   Passed  605.92 sec
3/9 Test #9: global_enkf ......................   Passed  678.08 sec
4/9 Test #5: hwrf_nmm_d3 ......................   Passed  797.90 sec
5/9 Test #4: hwrf_nmm_d2 ......................   Passed  1026.16 sec
6/9 Test #6: rtma .............................   Passed  1272.35 sec
7/9 Test #3: global_4denvar ...................***Failed  1503.70 sec
8/9 Test #2: global_4dvar .....................   Passed  1744.93 sec
9/9 Test #1: global_3dvar .....................   Passed  1983.04 sec

78% tests passed, 2 tests failed out of 9

Total Test time (real) = 1983.04 sec

The following tests FAILED:
          3 - global_4denvar (Failed)
          8 - netcdf_fv3_regional (Failed)
Errors while running CTest

The netcdf_fv3_regional failure is due to

The memory for netcdf_fv3_regional_loproc_updat is 352488 KBs.  This has exceeded maximum allowable memory of 238774 KBs, resulting in Failure memthresh of the regression test.

A check of the task 0 maximum resident set sizes for the updat (jderber-NOAA:optimize3) and contrl (develop) confirms that the loproc_updat uses more memory than loproc_contrl

netcdf_fv3_regional_hiproc_contrl/stdout:The maximum resident set size (KB)                   = 364344
netcdf_fv3_regional_hiproc_updat/stdout:The maximum resident set size (KB)                   = 364280
netcdf_fv3_regional_loproc_contrl/stdout:The maximum resident set size (KB)                   = 217068
netcdf_fv3_regional_loproc_updat/stdout:The maximum resident set size (KB)                   = 352488

The loproc_updat maximum resident set size is more consistent with the loproc_contrl for other ctests. It's not clear why the difference is larger for netcdf_fv3_regional. This failure, however, is not viewed as a fatal fail.

The global_4denvar failure is due to

The results (penalty) between the two runs are nonreproducible,
thus the regression test has Failed on cost for global_4denvar_loproc_updat and global_4denvar_loproc_contrl analyses.

The case has Failed the scalability test.
The slope for the update (54.378945 seconds per node) is less than that for the control (59.441120 seconds per node).

A check of the wall times shows that the updat code runs faster than the contrl

global_4denvar_hiproc_contrl/stdout:The total amount of wall time                        = 278.396694
global_4denvar_hiproc_updat/stdout:The total amount of wall time                        = 262.273809
global_4denvar_loproc_contrl/stdout:The total amount of wall time                        = 337.837814
global_4denvar_loproc_updat/stdout:The total amount of wall time                        = 305.776965

This is consistent with the optimization purpose of this PR. This is not a fatal fail.
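
As an aside on how the scalability check may arrive at its slopes (an assumption on my part, not taken from the regression scripts): the control slope quoted above is exactly the loproc-minus-hiproc wall-time difference, which suggests a seconds-per-added-node calculation. The quoted updat slope does not match these particular wall times, so the actual test may fit over other runs.

```python
# Hypothetical reconstruction of the "seconds per node" slope: the wall-time
# decrease from the loproc to the hiproc configuration, per node added.
# Wall times are the global_4denvar numbers quoted above.
def slope(t_loproc, t_hiproc, extra_nodes=1):
    return (t_loproc - t_hiproc) / extra_nodes

contrl = slope(337.837814, 278.396694)  # 59.441120, matching the quoted control slope
updat = slope(305.776965, 262.273809)   # 43.503156, not the quoted 54.378945
print(contrl, updat)
```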

The non-reproducible results between the updat and contrl are more puzzling. The initial total penalty and gradient are identical between the two codes. Differences show up in the step size for the second iteration of the first outer loop: 15 of the 19 printed digits are identical, and the differences in the last four digits are at the level of real(8) numerical round-off.

updat

Initial cost function =  6.592319865747931181E+05
Initial gradient norm =  1.700824233239470232E+03
cost,grad,step,b,step? =   1   0  6.592319865747931181E+05  1.700824233239470232E+03  1.057506532852622527E+00  0.000000000000000000E+00  good
cost,grad,step,b,step? =   1   1  6.555948920937654329E+05  2.114103169411077488E+03  1.927472355822775318E+00  1.302893266057414179E+00  good

contrl

Initial cost function =  6.592319865747931181E+05
Initial gradient norm =  1.700824233239470232E+03
cost,grad,step,b,step? =   1   0  6.592319865747931181E+05  1.700824233239470232E+03  1.057506532852622527E+00  0.000000000000000000E+00  good
cost,grad,step,b,step? =   1   1  6.555948920937654329E+05  2.114103169411077488E+03  1.927472355822779093E+00  1.302893266057414179E+00  good

John found similar behavior in his tests.

@RussTreadon-NOAA RussTreadon-NOAA mentioned this pull request Sep 7, 2023
@jderber-NOAA
Contributor Author

jderber-NOAA commented Sep 7, 2023 via email

@RussTreadon-NOAA
Contributor

@jderber-NOAA , when you have time would you please update jderber-NOAA:optimize3 with the current head of the authoritative GSI develop? We may need to do this a few more times before this PR is merged into develop.

@jderber-NOAA
Contributor Author

jderber-NOAA commented Sep 21, 2023

The reason for the reproducibility issue has been found; it was documented earlier in this development. With the change of 3 lines in get_gefs_ensperts_dualres.f90 (around line 190), all regression tests passed.

While looking for the reproducibility problem a few changes were made.

  1. A major error was found in control_vectors.f90: the results from partsum were not saved from one value of nsubwin to the next. Saving the values of partsum eliminates this problem.
  2. A very minor change was made in general_spectral_transforms.f90. The lines were already in place but commented out; real(grd%nlon,r_kind) is now used rather than float(grd%nlon).
  3. In hybrid_ensemble_isotropic.F90, subroutine bkerror_a_en, the indices of the alphacvarsclgrpmat array were reversed. Since the matrix is symmetric this is not a real issue, but it should be made right.
  4. In read_prepbufr.f90, subroutine read_prepbufr, the initialization of uob, vob, and oelev used 0.0. This was changed to use the constant zero.
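
A hypothetical Python analog of item 1 (the names partsum and nsubwin come from the description above; the surrounding logic is illustrative, not the actual control_vectors.f90 code): the partial sums must be carried across subwindows instead of being reinitialized for each one.

```python
# Hypothetical analog of the partsum fix: a squared norm accumulated over
# subwindows (nsubwin). The buggy form reinitializes the partial sum inside
# the subwindow loop, discarding contributions from earlier subwindows.
def norm_buggy(subwins):
    for win in subwins:
        partsum = 0.0            # bug: reset for every subwindow
        for v in win:
            partsum += v * v
    return partsum               # only the last subwindow survives

def norm_fixed(subwins):
    partsum = 0.0                # fix: value saved across subwindows
    for win in subwins:
        for v in win:
            partsum += v * v
    return partsum

wins = [[1.0, 2.0], [3.0]]
print(norm_buggy(wins), norm_fixed(wins))  # 9.0 vs 14.0
```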

Regression tests were rerun. All passed except rrfs_3denvar_glbens, and it was not immediately clear why it did not pass, since the results were the same.

Run times were faster:
rrfs_3denvar_glbens_hiproc_contrl/stdout:The total amount of wall time = 93.271814
rrfs_3denvar_glbens_hiproc_updat/stdout:The total amount of wall time = 81.411227
rrfs_3denvar_glbens_loproc_contrl/stdout:The total amount of wall time = 135.722101
rrfs_3denvar_glbens_loproc_updat/stdout:The total amount of wall time = 111.039343

Memory use was slightly higher, which must have been the reason for the failure:
rrfs_3denvar_glbens_hiproc_contrl/stdout:The maximum resident set size (KB) = 1136508
rrfs_3denvar_glbens_hiproc_updat/stdout:The maximum resident set size (KB) = 1136728
rrfs_3denvar_glbens_loproc_contrl/stdout:The maximum resident set size (KB) = 1793072
rrfs_3denvar_glbens_loproc_updat/stdout:The maximum resident set size (KB) = 1793692

But the results are reasonable.

Updating to head of trunk.

@RussTreadon-NOAA
Contributor

WCOSS2 ctests
Installed a fresh clone of jderber-NOAA:optimize3 at 0b6bde9 on Cactus and ran ctests with the following results:

russ.treadon@clogin04:/lfs/h2/emc/da/noscrub/russ.treadon/git/gsi/pr594/build> ctest -j 9
Test project /lfs/h2/emc/da/noscrub/russ.treadon/git/gsi/pr594/build
    Start 1: global_3dvar
    Start 2: global_4dvar
    Start 3: global_4denvar
    Start 4: hwrf_nmm_d2
    Start 5: hwrf_nmm_d3
    Start 6: rtma
    Start 7: rrfs_3denvar_glbens
    Start 8: netcdf_fv3_regional
    Start 9: global_enkf
1/9 Test #8: netcdf_fv3_regional ..............***Failed  483.92 sec
2/9 Test #5: hwrf_nmm_d3 ......................   Passed  493.34 sec
3/9 Test #7: rrfs_3denvar_glbens ..............   Passed  605.97 sec
4/9 Test #4: hwrf_nmm_d2 ......................   Passed  607.00 sec
5/9 Test #9: global_enkf ......................   Passed  610.84 sec
6/9 Test #6: rtma .............................   Passed  1210.05 sec
7/9 Test #3: global_4denvar ...................   Passed  1442.23 sec
8/9 Test #1: global_3dvar .....................   Passed  1502.22 sec
9/9 Test #2: global_4dvar .....................   Passed  1563.07 sec

89% tests passed, 1 tests failed out of 9

Total Test time (real) = 1563.08 sec

The following tests FAILED:
          8 - netcdf_fv3_regional (Failed)

The netcdf_fv3_regional failure is due to the timing scalability check.

The case has Failed the scalability test.
The slope for the update (.601893 seconds per node) is less than that for the control (1.270841 seconds per node).

Examination of the updat and contrl wall times does not show any anomalous behavior

russ.treadon@clogin04:/lfs/h2/emc/ptmp/russ.treadon/pr594/tmpreg_netcdf_fv3_regional> grep wall */stdout
netcdf_fv3_regional_hiproc_contrl/stdout:The total amount of wall time                        = 63.280665
netcdf_fv3_regional_hiproc_updat/stdout:The total amount of wall time                        = 63.970015
netcdf_fv3_regional_loproc_contrl/stdout:The total amount of wall time                        = 64.551506
netcdf_fv3_regional_loproc_updat/stdout:The total amount of wall time                        = 64.451530

This is not a fatal fail.

@jderber-NOAA
Contributor Author

jderber-NOAA commented Sep 21, 2023

After updating to the head of the trunk (both the control and updat), all regression tests passed on Hera.

Test project /scratch1/NCEPDEV/da/John.Derber/converge4/GSI/build
Start 1: global_3dvar
Start 2: global_4dvar
Start 3: global_4denvar
Start 4: hwrf_nmm_d2
Start 5: hwrf_nmm_d3
Start 6: rtma
Start 7: rrfs_3denvar_glbens
Start 8: netcdf_fv3_regional
Start 9: global_enkf
1/9 Test #5: hwrf_nmm_d3 ...................... Passed 560.00 sec
2/9 Test #9: global_enkf ...................... Passed 560.73 sec
3/9 Test #8: netcdf_fv3_regional .............. Passed 606.70 sec
4/9 Test #7: rrfs_3denvar_glbens .............. Passed 610.09 sec
5/9 Test #4: hwrf_nmm_d2 ...................... Passed 610.56 sec
6/9 Test #6: rtma ............................. Passed 1456.16 sec
7/9 Test #3: global_4denvar ................... Passed 1636.21 sec
8/9 Test #2: global_4dvar ..................... Passed 1811.06 sec
9/9 Test #1: global_3dvar ..................... Passed 1813.70 sec

100% tests passed, 0 tests failed out of 9

Total Test time (real) = 1813.71 sec

I will put back into the code the 3 lines that do give a minor difference. These lines result in fewer conversions between variables and should produce slightly more consistent results.

Contributor

@RussTreadon-NOAA RussTreadon-NOAA left a comment


Approve pending peer reviews

Contributor

@TingLei-NOAA TingLei-NOAA left a comment


Finished another test using a 3 km CONUS domain RRFS case. This PR gives identical results (final cost and gradients) compared with the EMC GSI trunk.
Thanks for this continual improvement to the GSI!

@TingLei-NOAA
Contributor

TingLei-NOAA commented Sep 21, 2023

A note: in my RRFS run, I found small differences between the GSI of this PR built with and without debug mode.
For example (after a 1 hr 30 min run), the final cost, gradient, and so on for the GSI built in Debug mode are:

 cost,grad,step,b,step? =   2  31  5.108161372585962818E+04  4.728171079078897776E+01  2.902662923838958076E+00  1.425154427382457012E+00  good

For the GSI built in Release mode, they are:

 cost,grad,step,b,step? =   2  31  5.108161597988699941E+04  4.728289846049271006E+01  2.902840431607158767E+00  1.425206942448039582E+00  good

The EMC GSI trunk shows the same behavior: for a given build type (Release or Debug), this PR and the EMC GSI trunk produce identical results, and the tiny differences above between Release and Debug modes exist for both branches.
I didn't notice this behavior before. If it is an issue to be investigated further, it is not specific to this PR.

Collaborator

@CatherineThomas-NOAA CatherineThomas-NOAA left a comment


Looks great. Thanks @jderber-NOAA!

@TingLei-NOAA Thanks for looking into that reproducibility issue. If this is also an issue in the develop branch, I don't see a need to hold up this PR.

@TingLei-NOAA
Contributor

@CatherineThomas-NOAA Agree!

@jderber-NOAA
Contributor Author

jderber-NOAA commented Sep 22, 2023

After the latest update to the head of the trunk, regression tests were rerun. Results were as expected; the update did not affect the changes.

@RussTreadon-NOAA
Contributor

Given the following

  • jderber-NOAA:optimize3 at 982425d is up to date with authoritative develop
  • approvals from two peer reviews
  • ctests from current head of jderber-NOAA:optimize3 pass
  • GSI Handling Review notified. OK received

proceed to merge jderber-NOAA:optimize3 into authoritative develop

@RussTreadon-NOAA RussTreadon-NOAA merged commit 2f4e7fe into NOAA-EMC:develop Sep 22, 2023
4 checks passed
Successfully merging this pull request may close these issues.

GSI optimization with focus on ensemble input and multiscale setup.