Builds failing on WCOSS2 after LMOD_TMOD_FIND_FIRST setting added #3225

Closed
KateFriedman-NOAA opened this issue Jan 14, 2025 · 8 comments · Fixed by #3229

@KateFriedman-NOAA
Member

What is wrong?

Three builds are failing on WCOSS2 now after export LMOD_TMOD_FIND_FIRST=yes was added to ush/module-setup.sh:

kate.friedman@clogin05> ./build_all.sh gfs gsi
Resetting modules to system default. Reseting $MODULEPATH back to system default. All extra directories will be removed from $MODULEPATH.
Building 
Starting build_gsi_enkf.sh
Starting build_ufs_gfs.sh
Starting build_gfs_utils.sh
Starting build_ww3_gfs.sh
Starting build_ufs_utils.sh
Starting build_gsi_utils.sh
Starting build_gsi_monitor.sh
Starting build_upp.sh
build_gsi_enkf.sh failed with status 1!
build_ufs_utils.sh failed with status 1!
build_gsi_utils.sh failed with status 1!
build_gfs_utils.sh completed successfully!
build_gsi_monitor.sh completed successfully!
build_ww3prepost.sh completed successfully!
build_upp.sh completed successfully!
build_ufs.sh completed successfully!
BUILD ERROR: One or more components failed to build
  Check the associated build log(s) for details.

Error in the build logs (same error for all three):

Lmod has detected the following error: Unable to load module because of error when evaluating modulefile:
     /apps/ops/prod/libs/modulefiles/compiler/intel/19.1.3.304/sigio/2.3.2.lua: [string "help([[..."]:14: too many C levels (limit is 200) in main function near '"OPT"'
     Please check the modulefile and especially if there is a the line number specified in the above message
While processing the following module(s):
    Module fullname     Module Filename
    ---------------     ---------------
    sigio/2.3.2         /apps/ops/prod/libs/modulefiles/compiler/intel/19.1.3.304/sigio/2.3.2.lua
    w3emc/2.7.3         /apps/ops/prod/libs/modulefiles/mpi/intel/19.1.3.304/cray-mpich/8.1.4/w3emc/2.7.3.lua
    nemsio/2.5.4        /apps/ops/prod/libs/modulefiles/mpi/intel/19.1.3.304/cray-mpich/8.1.4/nemsio/2.5.4.lua
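Since the same Lmod error shows up in all three failing logs, a quick scan of the log directory can confirm which builds hit it. The following is only a sketch: `find_lmod_failures` is a hypothetical helper name, and the log directory you pass to it is an assumption about where your build logs live.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of global-workflow): print the build logs
# under a directory that contain the Lmod "too many C levels" error.
find_lmod_failures() {
  local log_dir="${1:?usage: find_lmod_failures <log_dir>}"
  # -r: recurse into the directory, -l: print only matching file names,
  # -F: treat the pattern as a fixed string rather than a regex.
  # "|| true" keeps the function from returning non-zero when nothing matches.
  grep -rlF "too many C levels" "${log_dir}" 2>/dev/null || true
}
```

Calling `find_lmod_failures <log_dir>` prints one file name per affected log and prints nothing when no log contains the error.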

What should have happened?

The builds should have completed successfully.

What machines are impacted?

WCOSS2

What global-workflow hash are you using?

53ed76e

Steps to reproduce

Clone develop and run build: ./build_all.sh gfs gsi gdas

Additional information

Started after 26fb850 commit pushed yesterday.

Do you have a proposed solution?

Temporarily comment out line 55 in ush/module-setup.sh:

 52 elif [[ ${MACHINE_ID} = wcoss2 ]]; then
 53     # We are on WCOSS2
 54     # Ignore default modules of the same version lower in the search path (req'd by spack-stack)
 55     export LMOD_TMOD_FIND_FIRST=yes
 56     module reset

Put in a TODO to uncomment when we move to spack-stack on WCOSS2.

Also, temporarily add export LMOD_TMOD_FIND_FIRST=yes to the GDASApp build and to ush/load_ufsda_modules.sh. Test whether other locations need it. Add TODO comments where needed so the setting can be removed later.
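For context on why one line in ush/module-setup.sh affects so many builds: a variable exported by a sourced script stays set in the calling shell and is inherited by every child process. The sketch below illustrates that with a stand-in file; `setup_stub.sh` and the temporary directory are hypothetical and only mimic the export, they are not the real module-setup.sh.

```shell
#!/usr/bin/env bash
# Sketch: show that an export made inside a sourced script is visible to the
# caller and to child processes. setup_stub.sh is a hypothetical stand-in.
demo_dir="$(mktemp -d)"
cat > "${demo_dir}/setup_stub.sh" <<'EOF'
export LMOD_TMOD_FIND_FIRST=yes
EOF

unset LMOD_TMOD_FIND_FIRST
# Sourcing runs the stub in the current shell, so the export persists here.
source "${demo_dir}/setup_stub.sh"
caller_value="${LMOD_TMOD_FIND_FIRST:-unset}"
# A child shell inherits exported variables from its parent.
child_value="$(bash -c 'echo "${LMOD_TMOD_FIND_FIRST:-unset}"')"
echo "caller: ${caller_value}, child: ${child_value}"
rm -rf "${demo_dir}"
```

This is why moving the export into a script that only some jobs source narrows its effect to those jobs.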

KateFriedman-NOAA added the bug label Jan 14, 2025
KateFriedman-NOAA self-assigned this Jan 14, 2025

@RussTreadon-NOAA
Contributor

Thank you @KateFriedman-NOAA for documenting this problem. Sorry that I did not catch this in testing PR #3220.

I can review the PR for this issue when it's ready.

@RussTreadon-NOAA
Contributor

@KateFriedman-NOAA, this issue explains the failure I am seeing when I run GSI ctests from branch feature/wcoss2_ss on Cactus. The control gsi.x script fails with the messages you document in this issue. Thanks!

@KateFriedman-NOAA
Member Author

All good @RussTreadon-NOAA! I also didn't catch it in my PR that followed yours, since I had already built mine in a CI test before yours went in. Thanks for confirming that the error matches what you're seeing in your testing. That helps!

@RussTreadon-NOAA
Contributor

@KateFriedman-NOAA, the proposed changes work for the g-w build on Cactus.

I executed ./build_all.sh all with the following changes in place:

ush/load_ufsda_modules.sh (this change does not affect the build; it affects running jobs)

@@ -29,6 +29,8 @@ ulimit_s=$( ulimit -S -s )
 # Find module command and purge:
 source "${HOMEgfs}/ush/detect_machine.sh"
 source "${HOMEgfs}/ush/module-setup.sh"
+#TODO  Remove the line below when g-w is updated to spack-stack
+export LMOD_TMOD_FIND_FIRST=yes
 
 # Load our modules:
 module use "${HOMEgfs}/sorc/gdas.cd/modulefiles"

ush/module-setup.sh

@@ -51,8 +51,8 @@ elif [[ ${MACHINE_ID} = s4* ]] ; then
 
 elif [[ ${MACHINE_ID} = wcoss2 ]]; then
     # We are on WCOSS2
-    # Ignore default modules of the same version lower in the search path (req'd by spack-stack)
-    export LMOD_TMOD_FIND_FIRST=yes
+    #TODO Ignore default modules of the same version lower in the search path (req'd by spack-stack)
+    #TODO export LMOD_TMOD_FIND_FIRST=yes
     module reset
 
 elif [[ ${MACHINE_ID} = cheyenne* ]] ; then

Next up: running the GDASApp-based g-w CI on Cactus.

@KateFriedman-NOAA
Member Author

Thanks @RussTreadon-NOAA! FYI, I made similar changes in a clone and just kicked off manual CI tests on WCOSS2. My changes:

kate.friedman@clogin06:/lfs/h2/emc/global/noscrub/kate.friedman/git/bugfix-wcoss2_build> git diff --ignore-submodules
diff --git a/sorc/build_gdas.sh b/sorc/build_gdas.sh
index 43c503ab..8c85b2d8 100755
--- a/sorc/build_gdas.sh
+++ b/sorc/build_gdas.sh
@@ -20,6 +20,15 @@ while getopts ":j:dv" option; do
 done
 shift $((OPTIND-1))
 
+###################################################
+#TODO: Remove this block when spack-stack on WCOSS2
+readonly HOMEgfs=$(cd "$(dirname "$(readlink -f -n "${BASH_SOURCE[0]}" )" )/.." && pwd -P)
+source "${HOMEgfs}/ush/detect_machine.sh"
+if [[ "${MACHINE_ID}" == "wcoss2" ]]; then
+  export LMOD_TMOD_FIND_FIRST=yes
+fi
+###################################################
+
 # double quoting opts will not work since it is a string of options
 # shellcheck disable=SC2086
 BUILD_JOBS="${BUILD_JOBS:-8}" \
diff --git a/ush/load_ufsda_modules.sh b/ush/load_ufsda_modules.sh
index 8117d3f3..1c15484d 100755
--- a/ush/load_ufsda_modules.sh
+++ b/ush/load_ufsda_modules.sh
@@ -35,6 +35,10 @@ module use "${HOMEgfs}/sorc/gdas.cd/modulefiles"
 
 case "${MACHINE_ID}" in
   ("hera" | "orion" | "hercules" | "wcoss2")
+    #TODO: Remove LMOD_TMOD_FIND_FIRST line when spack-stack on WCOSS2
+    if [[ "${MACHINE_ID}" == "wcoss2" ]]; then
+      export LMOD_TMOD_FIND_FIRST=yes
+    fi
     module load "${MODS}/${MACHINE_ID}"
     ncdump=$( command -v ncdump )
     NETCDF=$( echo "${ncdump}" | cut -d " " -f 3 )
diff --git a/ush/module-setup.sh b/ush/module-setup.sh
index 366286d1..2429963d 100755
--- a/ush/module-setup.sh
+++ b/ush/module-setup.sh
@@ -52,7 +52,7 @@ elif [[ ${MACHINE_ID} = s4* ]] ; then
 elif [[ ${MACHINE_ID} = wcoss2 ]]; then
     # We are on WCOSS2
     # Ignore default modules of the same version lower in the search path (req'd by spack-stack)
-    export LMOD_TMOD_FIND_FIRST=yes
+    #export LMOD_TMOD_FIND_FIRST=yes #TODO: Uncomment this when using spack-stack
     module reset
 
 elif [[ ${MACHINE_ID} = cheyenne* ]] ; then

@KateFriedman-NOAA
Member Author

My CI tests are running here on Cactus: /lfs/h2/emc/ptmp/kate.friedman/comrot/RUNTESTS/EXPDIR

@RussTreadon-NOAA
Contributor

@KateFriedman-NOAA: The change to sorc/build_gdas.sh is not necessary.

build_gdas.sh executes sorc/gdas.cd/build.sh. The GDASApp build.sh contains:

case ${BUILD_TARGET} in
  hera | orion | hercules | wcoss2 | noaacloud | gaeac5 | gaeac6 )
    echo "`date` Building GDASApp on $BUILD_TARGET"
    source $dir_root/ush/module-setup.sh
    module use $dir_root/modulefiles
    module load GDAS/$BUILD_TARGET.$COMPILER

The GDASApp build sources its own copy of module-setup.sh, so it still gets the setting without any change to build_gdas.sh. The WCOSS2 section of the GDASApp ush/module-setup.sh contains:

elif [[ $MACHINE_ID = wcoss2 ]]; then
    # We are on WCOSS2
    # Ignore default modules of the same version lower in the search path (req'd by spack-stack)
    export LMOD_TMOD_FIND_FIRST=yes
    module reset
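Because g-w and GDASApp each carry their own module-setup.sh, it can help to list every script in a checkout that touches the setting before deciding where changes are needed. This is a minimal sketch: `find_tmod_settings` is a hypothetical helper name, and limiting the search to `*.sh` files is an assumption about where the setting would appear.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of global-workflow): list each shell script
# under a checkout that mentions LMOD_TMOD_FIND_FIRST, with line numbers.
find_tmod_settings() {
  local root="${1:?usage: find_tmod_settings <checkout_root>}"
  # -r: recurse, -n: show line numbers, --include: restrict to *.sh files.
  # Commented-out occurrences are reported too, which helps track TODOs.
  grep -rn --include='*.sh' "LMOD_TMOD_FIND_FIRST" "${root}" 2>/dev/null || true
}
```

Run from the top of a clone, `find_tmod_settings .` would report both the g-w copy and any submodule copies that are checked out.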

@KateFriedman-NOAA
Member Author

> The change to sorc/build_gdas.sh is not necessary.

OK, good. Thanks for letting me know! I added it to be safe; I'll remove it from my changes now.

KateFriedman-NOAA added a commit to KateFriedman-NOAA/global-workflow that referenced this issue Jan 15, 2025
Temporarily comment out LMOD_TMOD_FIND_FIRST=yes in
module-setup.sh. Move it to ush/load_ufsda_modules.sh
for runtime usage. Left note to undo these changes
when WCOSS2 is using spack-stack.

Refs NOAA-EMC#3225
KateFriedman-NOAA added a commit that referenced this issue Jan 15, 2025
…S2 (#3229)

Temporarily comment out the `LMOD_TMOD_FIND_FIRST=yes` setting in `ush/module-setup.sh`.
Move it to `ush/load_ufsda_modules.sh` for runtime usage for now.
Left note to undo these changes when WCOSS2 is using spack-stack.

Also found and corrected a spelling mistake.

Refs #3225