[RELEASE] Fixes for gaea, noaacloud, and miniconda updates #444

Merged
merged 53 commits into from
Nov 1, 2022

Conversation

mark-a-potts
Collaborator

DESCRIPTION OF CHANGES:

This PR combines fixes for the noaacloud, @natalie-perlin's miniconda updates, and fixes for the conda activate/deactivate/reactivate issue observed on various platforms.

Type of change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • This change requires a documentation update

TESTS CONDUCTED:

  • hera.intel
  • orion.intel
  • cheyenne.intel
  • cheyenne.gnu
  • gaea.intel
  • jet.intel
  • wcoss2.intel
  • NOAA Cloud (indicate which platform)
  • Jenkins
  • fundamental test suite
  • comprehensive tests (specify which if a subset was used)

DOCUMENTATION:

This PR should not require any documentation changes.

ISSUE:

CHECKLIST

  • My code follows the style guidelines in the Contributor's Guide
  • I have performed a self-review of my own code using the Code Reviewer's Guide
  • I have commented my code, particularly in hard-to-understand areas
  • My changes need updates to the documentation. I have made corresponding changes to the documentation
  • My changes do not require updates to the documentation (explain).
  • My changes generate no new warnings
  • New and existing tests pass with my changes
  • Any dependent changes have been merged and published

LABELS (optional):

A Code Manager needs to add the following labels to this PR:

  • Work In Progress
  • bug
  • enhancement
  • documentation
  • release
  • high priority
  • run_ci
  • run_we2e_fundamental_tests
  • run_we2e_comprehensive_tests
  • Needs Cheyenne test
  • Needs Jet test
  • Needs Hera test
  • Needs Orion test
  • help wanted

CONTRIBUTORS (optional):

@natalie-perlin @ulmonian

natalie-perlin and others added 30 commits October 25, 2022 16:03
remove loading of system python3
remove loading system python module
Load updated miniconda3 and ask to activate regional_workflow enviroment
Load an updated miniconda3 and ask to activate regional_workflow environment
Update miniconda3 module location and ask to activate regional_workflow
Update miniconda3/4.12.0 module location and ask to activate regional_workflow environment
Run an additional cycle of "conda deactivate" and "conda activate regional_workflow". It ensures that _python3_ binary path from the *regional_workflow* environment  becomes prepended to the search $PATH, and is found first, before the _python3_ from miniconda3/4.12.0 from the *base* environment.
"conda activate regional_workfow"
…orkflow.lua

use new miniconda3/4.12.0 with regional_workflow environment
all the requested packages for the python3 are found in regional_workflow environment
load updated miniconda3/4.12.0 with regional_workflow environment
Load an updated miniconda3/4.12.0 with the regional_workflow environment
need to have miniconda3 loaded in build module
need to have miniconda3 loaded in the build module
need to have miniconda3 loaded in build module
need to have miniconda3 loaded in build module
need to have miniconda3 loaded in the build modulefile
need to have miniconda3 loaded in build modulefile
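
The commit note above about running an extra cycle of "conda deactivate" and "conda activate regional_workflow" comes down to search-path order: whichever bin directory appears first in $PATH supplies python3. The following is a minimal sketch of that effect, not the PR's code. It does not call conda at all; the two temporary directories and the helper name first_in_path are hypothetical stand-ins for the base and regional_workflow bin directories.

```shell
#!/usr/bin/env bash
# Sketch: show why PATH order decides which python3 is picked up.
# The directories below are temporary, hypothetical stand-ins for the conda
# "base" and "regional_workflow" bin directories. Conda itself is not used.
set -eu

# Print the first directory in a colon-separated search path that holds an
# executable with the given name (mirrors how the shell resolves commands).
first_in_path() {
  local name="$1" search_path="$2" dir
  local IFS=':'
  for dir in $search_path; do
    if [ -x "$dir/$name" ]; then
      printf '%s\n' "$dir"
      return 0
    fi
  done
  return 1
}

tmp="$(mktemp -d)"
trap 'rm -rf "$tmp"' EXIT
mkdir -p "$tmp/base/bin" "$tmp/envs/regional_workflow/bin"
for d in "$tmp/base/bin" "$tmp/envs/regional_workflow/bin"; do
  printf '#!/bin/sh\necho stub\n' > "$d/python3"
  chmod +x "$d/python3"
done

# Base directory first: the environment's python3 is shadowed.
before="$(first_in_path python3 "$tmp/base/bin:$tmp/envs/regional_workflow/bin")"
# Environment directory prepended: the state the extra cycle aims for.
after="$(first_in_path python3 "$tmp/envs/regional_workflow/bin:$tmp/base/bin")"

echo "before: $before"
echo "after:  $after"
```

On a real system the equivalent check after activation would be `command -v python3`, which should point into the regional_workflow environment.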
@MichaelLueken MichaelLueken added release This PR/issue is related to a release branch run_we2e_coverage_tests Run the coverage set of SRW end-to-end tests labels Oct 31, 2022
@natalie-perlin
Collaborator

@danielabdi-noaa , @mark-a-potts -
If a different python module is loaded, Lmod/Lua requires it to be unloaded first, before miniconda3 is loaded (otherwise it throws an error when you attempt "load wflow_cheyenne").

Also, the script load_modules_run_task.sh loads the build_<machine>_<compiler>.lua module at runtime for each task.
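
As a rough illustration of the unload-before-load order described above, the sketch below checks whether a python module is already loaded before loading miniconda3. It assumes that LOADEDMODULES holds the colon-separated list of loaded modules, as Lmod and Environment Modules maintain it. The helper names module_loaded and load_miniconda3 and the sample module list are hypothetical and are not taken from the SRW scripts.

```shell
#!/usr/bin/env bash
# Sketch: detect a loaded python module before loading miniconda3.
# Assumption: LOADEDMODULES is the colon-separated list of loaded modules.

# Return 0 if a module with the given base name (any version) is loaded.
# An optional second argument overrides LOADEDMODULES, which eases testing.
module_loaded() {
  local want="$1" loaded="${2-${LOADEDMODULES-}}" entry
  local IFS=':'
  for entry in $loaded; do
    # Strip everything after the first "/" so "python/3.7.9" matches "python".
    if [ "${entry%%/*}" = "$want" ]; then
      return 0
    fi
  done
  return 1
}

# Hypothetical helper: unload a conflicting python module, then load miniconda3.
load_miniconda3() {
  if ! command -v module >/dev/null 2>&1; then
    echo "module command not available; skipping" >&2
    return 1
  fi
  if module_loaded python; then
    module unload python
  fi
  module load miniconda3/4.12.0
}

# Demonstration on a made-up module list (no module system needed).
sample="cmake/3.22.0:python/3.7.9:rocoto/1.3.3"
if module_loaded python "$sample"; then
  echo "python is loaded; unload it before miniconda3"
fi
```

Inside a modulefile the same ordering is expressed with unload("python") ahead of the miniconda3 load, which is the change discussed later in this conversation.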

@MichaelLueken
Collaborator

@mark-a-potts Both manual and Jenkins CI tests are failing on Orion and Cheyenne. The Jenkins CI pipeline failed to connect, causing the failure on Jenkins. For the manual run on Orion, I see the following:

201907010000               make_grid                     7518309           SUCCEEDED                   0         1          52.0
201907010000               make_orog                     7518367                DEAD                 256         2          29.0
201907010000          make_sfc_climo                           -                   -                   -         -             -
201907010000           get_extrn_ics                     7518316           SUCCEEDED                   0         1          78.0
201907010000          get_extrn_lbcs                     7518319           SUCCEEDED                   0         1         140.0

The message in log/make_orog.log:

/work/noaa/epic-ps/mlueken/ufs-srweather-app/scripts/exregional_make_orog.sh: line 562: python3: command not found
FATAL ERROR:
ERROR:
  From script:  ""
  Full path to script:  ""
Call to function to create links to orography files failed.
Exiting with nonzero status.
End exregional_make_orog.sh at Mon Oct 31 15:25:50 UTC 2022 with error code 1 (time elapsed: 00:00:24)
FATAL ERROR:
ERROR:
  From script:  "JREGIONAL_MAKE_OROG"
  Full path to script:  "/work/noaa/epic-ps/mlueken/ufs-srweather-app/jobs/JREGIONAL_MAKE_OROG"
Call to ex-script corresponding to J-job "JREGIONAL_MAKE_OROG" failed.
Exiting with nonzero status.
End JREGIONAL_MAKE_OROG at Mon Oct 31 15:25:50 UTC 2022 with error code 1 (time elapsed: 00:00:24)

For Cheyenne:

201907010000               make_grid                     7031076                DEAD                   1         2           7.0
201907010000               make_orog                           -                   -                   -         -             -
201907010000          make_sfc_climo                           -                   -                   -         -             -
201907010000           get_extrn_ics                     7031053                DEAD                   1         1           7.0
201907010000          get_extrn_lbcs                     7031055                DEAD                   1         1           7.0

Lmod has detected the following error: Cannot load module "miniconda3/4.12.0"
because these module(s) are loaded:
   python

While processing the following module(s):
    Module fullname              Module Filename
    ---------------              ---------------
    miniconda3/4.12.0            /glade/work/epicufsrt/contrib/miniconda3/modulefiles/miniconda3/4.12.0.lua
    miniconda_regional_workflow  /glade/scratch/mlueken/ufs-srweather-app/modulefiles/tasks/cheyenne/miniconda_regional_workflow.lua
    make_grid.local              /glade/scratch/mlueken/ufs-srweather-app/modulefiles/tasks/cheyenne/make_grid.local.lua

/glade/scratch/mlueken/ufs-srweather-app/ush/bash_utils/print_msg.sh: line 216: BASH_SOURCE[1]: unbound variable
FATAL ERROR:
ERROR:
  From script:  ""
  Full path to script:  ""
  Loading .local module file (in directory specified by mod-
  ules_dir) for the specified task (task_name) failed:
    task_name = "make_grid"
    modulefile_local = "make_grid.local"
    modules_dir = "/glade/scratch/mlueken/ufs-srweather-app/modulefiles/tasks/cheyenne"
Exiting with nonzero status.

Please see https://jenkins-epic.woc.noaa.gov/blue/organizations/jenkins/ufs-srweather-app%2Fpipeline/detail/PR-444/2/pipeline for the Jenkins CI pipeline associated with this work. All tests on Hera, Jet, and Gaea have successfully passed, but there are issues on both Cheyenne and Orion.
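
The "python3: command not found" failure in the Orion log surfaces only at line 562 of the ex-script. A guard like the one sketched below, run right after the modules are loaded, would report the same problem earlier and with the search path attached. This is only an illustration: the helper name require_cmd is hypothetical and is not part of the SRW scripts.

```shell
#!/usr/bin/env bash
# Sketch: fail early with a clear message when a required command is missing.
# require_cmd is a hypothetical helper, not an existing SRW function.
require_cmd() {
  local cmd="$1"
  if ! command -v "$cmd" >/dev/null 2>&1; then
    printf 'ERROR: required command "%s" not found in PATH=%s\n' "$cmd" "$PATH" >&2
    return 1
  fi
}

# Demonstration with a command that exists on any POSIX system.
if require_cmd sh; then
  echo "sh is available"
fi

# In a task script the intended use would be:
#   require_cmd python3 || exit 1
```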

@danielabdi-noaa
Collaborator

@mark-a-potts The Orion build modulefile is also missing the python3 load.

@natalie-perlin
Collaborator

natalie-perlin commented Oct 31, 2022

A new issue submitted: #446

It discusses the option to retire the use of *local.lua files. If miniconda3 is loaded in the build_<machine>_<compiler>.lua file, there is no need for the *local.lua files, which add up to 10 additional files per machine.

@natalie-perlin
Collaborator

@MichaelLueken - the make_orog.local.lua file is missing on Orion, Hera, Jet... The SRW_ENV variable is not set, and therefore the regional_workflow environment is not activated.
Issue #446 addresses exactly this excessive and seemingly unnecessary use of modulefiles to start local tasks.

It used to be needed when different conda environments were needed by different tasks (e.g., the pygraf environment for plotting). Now all of the tasks use the regional_workflow environment, which contains all the python3/conda packages.

@MichaelLueken
Collaborator

@MichaelLueken - the make_orog.local.lua file is missing on Orion, Hera, Jet... The SRW_ENV variable is not set, and therefore the regional_workflow environment is not activated. Issue #446 addresses exactly this excessive and seemingly unnecessary use of modulefiles to start local tasks.

It used to be needed when different conda environments were needed by different tasks (e.g., the pygraf environment for plotting). Now all of the tasks use the regional_workflow environment, which contains all the python3/conda packages.

@natalie-perlin Do you know why the WE2E tests are behaving correctly on Hera and Jet? Is it the case that the environment from the conda activate regional_workflow run when the tests are generated is still active on those machines, while this isn't the case on Orion?

@MichaelLueken
Collaborator

@mark-a-potts On Cheyenne, when I updated to aa0bb86, the WE2E tests are running properly. Further, I made two changes on Orion:

  1. modulefiles/build_orion_intel.lua:
    Added load(pathJoin("python", os.getenv("python_ver") or "3.9.2")) below loading of cmake.

  2. modulefiles/tasks/orion/miniconda_regional_workflow.lua
    Added unload("python") to the top of the modulefile.

Following these two modifications, the WE2E tests are now running there as well.

I'll let you know if I encounter any failures with these modifications in my manual runs.
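
For readers more used to shell than Lua, the version fallback in the first change, os.getenv("python_ver") or "3.9.2", can be sketched as the parameter expansion below. This is an analogy only, and the function name python_module_name is hypothetical. Note that `${python_ver-3.9.2}` (without the colon) falls back only when the variable is unset, which matches the Lua expression, since os.getenv returns nil for an unset variable while an empty string counts as true in Lua.

```shell
#!/usr/bin/env bash
# Sketch: shell analogue of the Lua expression
#   pathJoin("python", os.getenv("python_ver") or "3.9.2")
# python_module_name is a hypothetical helper used only for illustration.
python_module_name() {
  # The "-" form (no colon) substitutes the default only when python_ver is unset.
  printf 'python/%s\n' "${python_ver-3.9.2}"
}

unset python_ver
python_module_name                            # prints python/3.9.2

( python_ver=3.10.4; python_module_name )     # prints python/3.10.4
```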

@natalie-perlin
Collaborator

@MichaelLueken -
https://jenkins-epic.woc.noaa.gov/blue/organizations/jenkins/ufs-srweather-app%2Fpipeline/detail/PR-444/2/pipeline/
If this is the workflow, the tests on Hera and Jet appear to be still running. The first three tasks (make_grid, get_lbcs, get_ics) should be fine; it's the next task that is expected to fail.

@MichaelLueken
Collaborator

@MichaelLueken - https://jenkins-epic.woc.noaa.gov/blue/organizations/jenkins/ufs-srweather-app%2Fpipeline/detail/PR-444/2/pipeline/ If this is the workflow, the tests on Hera and Jet appear to be still running. The first three tasks (make_grid, get_lbcs, get_ics) should be fine; it's the next task that is expected to fail.

@natalie-perlin The Jenkins CI tests for that pipeline on Jet have passed, and Hera has a SUCCESS for each test, with the final pass coming within the next ten minutes. The issue now is only with Orion, but a manual run with the changes I've outlined above is currently running without issue.

@mark-a-potts
Collaborator Author

I just pushed changes with the added files for Orion. My manual test there ran to completion.

@mark-a-potts
Collaborator Author

I don't think we need to add the python module for Orion. The default system python3 is 3.7.5, which should be new enough for the build step of the WM.

Collaborator

@MichaelLueken MichaelLueken left a comment

The manual WE2E tests are currently running on Orion with the modifications I have laid out in this review. I'll rerun the Orion and Cheyenne tests in Jenkins, then this work can be merged.

Collaborator

@MichaelLueken MichaelLueken left a comment

Since @mark-a-potts has successfully run the modified files on Orion, I will go ahead and approve these changes for now. The Jenkins CI pipeline has been resubmitted for Orion. Once the tests pass, these changes will be ready to be merged.

@@ -8,8 +8,8 @@ whatis([===[Loads libraries needed for running SRW on Orion ]===])
load("contrib")
load("rocoto")

Collaborator Author

I did not end up putting a python module into the Orion module files, so there is no python to unload.

@natalie-perlin
Collaborator

natalie-perlin commented Oct 31, 2022

Successfully ran a single test with the PR-444 on Cheyenne, Gaea, Hera, Jet.
SRW builds on Orion and the job was submitted OK (it is queued, but no problems were seen earlier at this stage).

Update: the Orion test completed successfully.

@panll
Collaborator

panll commented Oct 31, 2022

Yes, it works now!

Labels
release This PR/issue is related to a release branch run_we2e_coverage_tests Run the coverage set of SRW end-to-end tests

5 participants