[develop] Update ufs-weather-model hash and UPP hash and use upp-addon-env spack-stack environment #1136

Draft
wants to merge 12 commits into
base: develop

MichaelLueken
Collaborator

@MichaelLueken MichaelLueken commented Oct 8, 2024

DESCRIPTION OF CHANGES:

  • Update ufs-weather-model hash to 38a29a6 (September 19)
  • Update UPP hash to 81b38a8 (August 13)
  • All Tier-1 modulefiles/build_* files have been updated to use the upp-addon-env spack-stack environment
  • srw_common.lua was updated to use g2/3.5.1 and g2tmpl/1.13.0 (these are required for UPP)
  • .cicd/Jenkinsfile was updated to replace cheyenne entries with derecho and to comment out derecho until EPIC allocation issues have been addressed
  • The doc/tables/Tests.csv table had nco-mode WE2E tests removed
  • The .github/CODEOWNERS file was updated to add Bruce Kropp to the list of reviewers
  • The exregional_plot_allvars.py and exregional_plot_allvars_diff.py scripts were updated to address changes made to the postxconfig-NT-fv3lam.txt file.

Type of change

  • New feature (non-breaking change which adds functionality)

TESTS CONDUCTED:

  • derecho.intel
  • gaea.intel - Fundamental and Comprehensive WE2E tests were successfully run
  • hera.gnu - Fundamental and Comprehensive WE2E tests were successfully run
  • hera.intel - Fundamental, Comprehensive, AQM WE2E, and AQM sample configuration tests were successfully run
  • hercules.intel - Fundamental, Comprehensive, AQM WE2E, and AQM sample configuration tests were successfully run
  • jet.intel - Fundamental WE2E tests were successfully run
  • orion.intel - Fundamental, Comprehensive, AQM WE2E, and AQM sample configuration tests were successfully run
  • wcoss2.intel
  • NOAA Cloud (indicate which platform)
  • Jenkins
  • fundamental test suite
  • comprehensive tests (specify which if a subset was used)

DOCUMENTATION:

Updated the documentation table that defines the WE2E tests currently available in the SRW App. The nco-mode WE2E tests had already been removed from the repository but were still listed in the Tests.csv file; those entries have now been removed.

CHECKLIST

  • My code follows the style guidelines in the Contributor's Guide
  • I have performed a self-review of my own code using the Code Reviewer's Guide
  • My changes generate no new warnings
  • New and existing tests pass with my changes

MichaelLueken and others added 11 commits September 30, 2024 19:27
…19) and UPP hash to 81b38a8 (Aug 13). Point to upp-addon-env spack-stack environment on Hera. Update srw_common.lua to use g2/3.5.1 and g2tmpl/1.13.0. Updated exregional_plot_allvars.py to handle updates made to postxconfig-NT-fv3lam.txt.
 * .cicd/Jenkinsfile - Replaced cheyenne with derecho in commented
   sections and commented out Derecho.
 * doc/tables/Tests.csv - Removed nco-mode WE2E tests since these have
   been removed from the repository.
 * modulefiles/build_derecho_intel.lua - Update spack-stack environment
   to upp-addon-env.
 * .github/CODEOWNERS - Added Bruce Kropp as a reviewer from the Platform team.
 * modulefiles/build_orion_intel.lua - Updated spack-stack environment to upp-addon-env.
 * scripts/exregional_plot_allvars.py - Found method to successfully plot REFC using seek() and readline() pygrib commands
 * scripts/exregional_plot_allvars_diff.py - Same
@MichaelLueken
Collaborator Author

I was able to successfully find a method to read the composite radar reflectivity from the post GRIB2 file. The issue was with the pygrib indexing of the GRIB2 file. Using wgrib2, the 37th entry in the post GRIB2 file was the REFC entry:

wgrib2 srw.t00z.prslev.f000.rrfs_conus_25km.grib2 -match REF
37:640531:d=2019070100:REFC:entire atmosphere (considered as a single layer):anl:
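
For reference, a wgrib2 inventory line like the one above can be split into its fields, which gives the 1-based message number. This is only a minimal sketch: the names `InventoryEntry` and `parse_inventory_line` are hypothetical, and it assumes the colon-separated short inventory layout shown above, with no colons inside the level description.

```python
from typing import NamedTuple


class InventoryEntry(NamedTuple):
    """One line of a wgrib2 short inventory (hypothetical helper type)."""
    number: int    # 1-based message number reported by wgrib2
    offset: int    # byte offset of the message within the file
    date: str      # reference time field, e.g. "d=2019070100"
    name: str      # variable name, e.g. "REFC"
    level: str     # level description
    forecast: str  # forecast time field, e.g. "anl"


def parse_inventory_line(line: str) -> InventoryEntry:
    """Split a colon-separated wgrib2 inventory line into named fields."""
    fields = line.strip().split(":")
    if len(fields) < 6:
        raise ValueError(f"not a wgrib2 inventory line: {line!r}")
    number, offset, date, name, level, forecast = fields[:6]
    return InventoryEntry(int(number), int(offset), date, name, level, forecast)


line = ("37:640531:d=2019070100:REFC:"
        "entire atmosphere (considered as a single layer):anl:")
entry = parse_inventory_line(line)
print(entry.number, entry.name)
```

Because wgrib2 numbers messages from 1, the number of messages that precede this entry is `entry.number - 1`.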

Using pygrib to generate an index of the GRIB2 file, we see the following for the 37th entry:

37:5:5 (instant):lambert:atmosphereSingleLayer:level 0 considered as a single layer:fcst time 0 hrs:from 201907010000

pygrib does not identify the 37th message as composite reflectivity; its index shows 5:5 in place of a recognizable name. To get around this issue, I used the rewind() function to move to the start of the file, followed by seek(36), which skips the first 36 messages in the GRIB2 file. Finally, using readline().values allowed me to read in the data for the 37th (Maximum/Composite radar reflectivity) entry and successfully plot it.
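
The rewind(), seek(), and readline() sequence described above can be sketched as follows. This is a simplified illustration, not the code in exregional_plot_allvars.py: the function names `read_values_by_wgrib2_number` and `read_refc` are hypothetical, and the default of 37 is only the message number from the wgrib2 output quoted in this comment, so it may differ for other files.

```python
def read_values_by_wgrib2_number(grbs, wgrib2_number):
    """Return the data values of the message wgrib2 lists as `wgrib2_number`.

    `grbs` is an open pygrib file object (or anything offering the same
    rewind/seek/readline methods). wgrib2 counts messages from 1, while
    seek() takes the number of messages to skip from the start.
    """
    if wgrib2_number < 1:
        raise ValueError("wgrib2 message numbers start at 1")
    grbs.rewind()                  # back to the first message
    grbs.seek(wgrib2_number - 1)   # skip the preceding messages
    message = grbs.readline()      # read exactly one message
    if message is None:            # defensive check for reading past the end
        raise IndexError(f"no message number {wgrib2_number} in file")
    return message.values


def read_refc(path, wgrib2_number=37):
    """Open a GRIB2 file with pygrib and read one message by position."""
    import pygrib  # imported here so the helper above has no hard dependency

    grbs = pygrib.open(path)
    try:
        return read_values_by_wgrib2_number(grbs, wgrib2_number)
    finally:
        grbs.close()
```

Reading by position avoids relying on the name pygrib assigns to the message, at the cost of depending on the message order in the post output.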

The following image was generated from the grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v17_p8_plot WE2E test using the current develop branch:
FV3_GFS_v17_p8_refc_baseline

This is the image generated using my feature/hash_update branch with the changes to exregional_plot_allvars.py:
FV3_GFS_v17_p8_refc_hash_update

The issue has been corrected and the composite radar reflectivity has successfully been plotted.

…ther than the deprecated atmos_nthreads, to correct issue with threading in the weather model
@MichaelLueken
Collaborator Author

@mkavulich -

While all 6 fundamental WE2E tests successfully pass following the latest updates (example given was run on Hercules):

----------------------------------------------------------------------------------------------------
Experiment name                                                  | Status    | Core hours used 
----------------------------------------------------------------------------------------------------
grid_RRFS_CONUScompact_25km_ics_HRRR_lbcs_RAP_suite_RRFS_v1beta_2  COMPLETE              21.74
grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v15p2_20241  COMPLETE               8.50
grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v17_p8_plot  COMPLETE              25.67
grid_RRFS_CONUScompact_25km_ics_HRRR_lbcs_HRRR_suite_HRRR_2024101  COMPLETE              46.18
grid_SUBCONUS_Ind_3km_ics_HRRR_lbcs_RAP_suite_WoFS_v0_20241015114  COMPLETE              88.54
grid_RRFS_CONUS_25km_ics_NAM_lbcs_NAM_suite_GFS_v16_2024101511493  COMPLETE              62.54
----------------------------------------------------------------------------------------------------
Total                                                              COMPLETE             253.17

the comprehensive tests are failing with the following error:

FATAL from PE 15: mpp_domains_define.inc: At least one pe in pelist is not used by any tile in the mosaic

It isn't clear what the issue is. Since the indicated include file is part of FV3, I'll reach out to the FV3 developers to see what might be causing this error in the tests that are now failing.

@mkavulich
Collaborator

@MichaelLueken Thanks for your work on the OMP problem. I came here to report a similar problem: I applied your changes and noticed that at least one fundamental test is still failing on Hera: grid_RRFS_CONUScompact_25km_ics_HRRR_lbcs_HRRR_suite_HRRR. It fails with the "not all the pe_end are in the pelist" error, even though ufs.configure and the other settings all appear correct. I can't figure out why only this test fails while the rest of the fundamental tests succeed, and it's especially worrisome that the failures seem to differ between platforms. The only thought that immediately comes to mind is that there may be an off-by-one or similar edge case based on node size.

@MichaelLueken
Collaborator Author

@mkavulich That could certainly be the issue. Another potential issue is the layout and io_layout settings in the input.nml file. While layout is set to LAYOUT_X, LAYOUT_Y, io_layout is set to 1,1. Should IO_LAYOUT_X and IO_LAYOUT_Y both be set to 1, or should IO_LAYOUT_X be set to WRTCMP_write_groups and IO_LAYOUT_Y to WRTCMP_write_tasks_per_group? Looking through issues from others who have encountered this error message, it looks like they hadn't properly set up one or more of the layout and io_layout entries in input.nml, which is why I'm focusing on this direction.
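
To rule out a simple task-count mismatch, the arithmetic behind these settings can be checked with a short sketch. It rests on assumptions that are not confirmed in this thread: that the compute tasks number LAYOUT_X * LAYOUT_Y, that the write component adds WRTCMP_write_groups * WRTCMP_write_tasks_per_group tasks when quilting is on, and that each io_layout value must evenly divide the matching layout value. The function name `check_task_layout` and the example numbers are hypothetical and not taken from any SRW configuration.

```python
def check_task_layout(layout, io_layout, write_groups,
                      write_tasks_per_group, quilting=True):
    """Summarize task counts and flag layout/io_layout inconsistencies.

    Assumed relationships (see the lead-in above):
      compute tasks = layout_x * layout_y
      write tasks   = write_groups * write_tasks_per_group (quilting only)
      each io_layout value divides the corresponding layout value
    """
    layout_x, layout_y = layout
    io_x, io_y = io_layout
    problems = []
    if io_x < 1 or layout_x % io_x != 0:
        problems.append(f"io_layout x={io_x} does not divide layout x={layout_x}")
    if io_y < 1 or layout_y % io_y != 0:
        problems.append(f"io_layout y={io_y} does not divide layout y={layout_y}")

    compute_tasks = layout_x * layout_y
    write_tasks = write_groups * write_tasks_per_group if quilting else 0
    return {
        "compute_tasks": compute_tasks,
        "write_tasks": write_tasks,
        "total_tasks": compute_tasks + write_tasks,
        "problems": problems,
    }


# Hypothetical values, for illustration only.
summary = check_task_layout(layout=(5, 2), io_layout=(1, 1),
                            write_groups=1, write_tasks_per_group=2)
print(summary)
```

Comparing `total_tasks` with the number of tasks actually requested for the forecast job is one way to spot PEs that no tile or write group would use. It does not settle which io_layout values are correct.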

@MichaelLueken
Collaborator Author

Issue #362 was opened in NOAA-GFDL/GFDL_atmos_cubed_sphere asking about the strange FATAL from PE *: mpp_domains_define.inc: At least one pe in pelist is not used by any tile in the mosaic error message.

@mkavulich mkavulich mentioned this pull request Oct 16, 2024
22 tasks
Development

Successfully merging this pull request may close these issues.

SRW forecast runs do not use threading even if OMP_NUM_THREADS_RUN_FCST is set > 1