APP=S2SW does not build on Gaea #511

Closed
DeniseWorthen opened this issue Apr 6, 2021 · 16 comments

@DeniseWorthen
Collaborator

Description

APP=S2SW fails to build on Gaea.

To Reproduce:

Check out the develop branch of ufs-weather-model. Edit rt.conf so that it contains a single S2SW compile and run, and save it as rt.test:

COMPILE | APP=S2SW SUITES=FV3_GFS_2017_coupled,FV3_GFS_2017_satmedmf_coupled,FV3_GFS_v15p2_coupled,FV3_GFS_v16_coupled            | - wcoss_cray  jet.intel       | fv3 |
RUN     | cpld_control_wave                                                                                                       | - wcoss_cray  jet.intel       | fv3 |

Be sure to remove gaea.intel from the next-to-last column; otherwise rt.sh will just exit.

Run the test:

./rt.sh -l rt.test >output 2>&1 &

Look in the RT directory that the job is compiling in, at compile_001/build_fv3_001/ww3_make.log:

gmake[3]: Entering directory '/lustre/f2/pdata/ncep/Denise.Worthen/ufs-weather-model/WW3/model/esmf'
gmake[3]: warning: jobserver unavailable: using -j1.  Add '+' to parent make rule.

                *****************************
              ***   WAVEWATCH III setup     ***
                *****************************


[INFO] local env file wwatch3.env found in /lustre/f2/pdata/ncep/Denise.Worthen/ufs-weather-model/WW3/model/bin/wwatch3.env
   Setup file /lustre/f2/pdata/ncep/Denise.Worthen/ufs-weather-model/WW3/model/bin/wwatch3.env found
      Printer (listings)          :
      auxiliary FORTRAN compiler  : gfortran
      auxiliary C compiler        : gcc
      Source directory            : /lustre/f2/pdata/ncep/Denise.Worthen/ufs-weather-model/WW3/model
      Scratch directory           : /lustre/f2/pdata/ncep/Denise.Worthen/ufs-weather-model/WW3/model/tmp
      Save source code            : yes
      Save listings               : yes

   Setup makefile for auxiliary programs


   Compile auxiliary programs
make[4]: Entering directory '/lustre/f2/pdata/ncep/Denise.Worthen/ufs-weather-model/WW3/model/aux'
gfortran -o /lustre/f2/pdata/ncep/Denise.Worthen/ufs-weather-model/WW3/model/bin/w3adc w3adc.f
make[4]: gfortran: Command not found
make[4]: *** [makefile:10: /lustre/f2/pdata/ncep/Denise.Worthen/ufs-weather-model/WW3/model/bin/w3adc] Error 127
make[4]: Leaving directory '/lustre/f2/pdata/ncep/Denise.Worthen/ufs-weather-model/WW3/model/aux'

ERROR: Error occured during compile of auxiliary programs

gmake[3]: *** [Makefile:152: setup] Error 1
gmake[3]: Leaving directory '/lustre/f2/pdata/ncep/Denise.Worthen/ufs-weather-model/WW3/model/esmf'
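
The log above shows the WW3 setup using gfortran and gcc as auxiliary compilers (as listed from wwatch3.env), and make stopping with "gfortran: Command not found". A minimal sketch of how one might confirm which of those tools are missing from PATH in the build environment follows. `missing_tools` is a hypothetical helper name used only for illustration; it is not part of ufs-weather-model or WW3.

```shell
#!/bin/bash
# Hypothetical helper: print each named tool that cannot be found on PATH.
# Returns non-zero if at least one tool is missing.
missing_tools() {
  local tool status=0
  for tool in "$@"; do
    if ! command -v "$tool" >/dev/null 2>&1; then
      echo "$tool"
      status=1
    fi
  done
  return "$status"
}

# The auxiliary compilers named in the log above.
if ! missing=$(missing_tools gfortran gcc); then
  echo "not on PATH: $missing" >&2
fi
```

Running this inside the same environment as the build (after the modules are loaded) should show whether the failure is a PATH or module problem rather than a WW3 problem.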
@DeniseWorthen DeniseWorthen added the bug Something isn't working label Apr 6, 2021
@JessicaMeixner-NOAA
Collaborator

So, when I submit a build job on Gaea I'm getting this error:

+ set +x
Lmod has detected the following error: The following module(s) are unknown:
"eproxy/2.0.24-7.0.2.1_2.20__g8e04b33.ari"

Please check the spelling or version number. Also try "module spider ..."
It is also possible your cache file is out-of-date; it may help to try:
  $ module --ignore-cache load "eproxy/2.0.24-7.0.2.1_2.20__g8e04b33.ari"

Also make sure that all modulefiles written in TCL start with the string
#%Module

This appears to come from this line in compile.sh (https://github.com/ufs-community/ufs-weather-model/blob/develop/tests/compile.sh#L66):
source /lustre/f2/pdata/esrl/gsd/contrib/lua-5.1.4.9/init/init_lmod.sh

Has anyone else ever gotten this error? I can't get to the error @DeniseWorthen mentioned because of it right now. I've tried from a fresh clone to make sure I didn't change anything, and I don't have much in my .cshrc file.

@climbfuji
Collaborator

The last person who reported this error was using tcsh. I recommended switching to bash and never heard back. Either it worked or that person gave up.

@JessicaMeixner-NOAA
Collaborator

@climbfuji I am using tcsh, so that's at least consistent.

@climbfuji
Collaborator

Let me see if I can get this to work (remind me tomorrow, please) ... I use bash and the (t)csh version is not as well tested, obviously.

@JessicaMeixner-NOAA
Collaborator

@climbfuji I made a separate issue, #536, so this issue can get back to being about S2SW not building.

@DeniseWorthen I'll try from the command line again, but I might need you to help test until I can get the other sorted out.

@climbfuji
Collaborator

> Let me see if I can get this to work (remind me tomorrow, please) ... I use bash and the (t)csh version is not as well tested, obviously.

The trouble is that I cannot reproduce the problem, because the following works:

Dom.Heinzeller@gaea14:~> export | grep SHELL
declare -x SHELL="/bin/bash"
Dom.Heinzeller@gaea14:~> tcsh
Directory: /ncrc/home2/Dom.Heinzeller
home2/Dom.Heinzeller> env | grep SHELL
SHELL=/bin/bash
home2/Dom.Heinzeller> source /lustre/f2/pdata/esrl/gsd/contrib/lua-5.1.4.9/init/init_lmod.sh
Illegal variable name.
home2/Dom.Heinzeller> source /lustre/f2/pdata/esrl/gsd/contrib/lua-5.1.4.9/init/init_lmod.csh
Activating lua module environment
Reloading modules ... (sit back and relax)
home2/Dom.Heinzeller>

Note that the environment variable SHELL still says bash, even though I am in a tcsh shell. It is inherited from my original bash login shell and is not updated when tcsh is started from it.
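
Since SHELL can name a different shell than the one actually running, the sketch below shows one way to check the running shell and choose between init_lmod.sh and init_lmod.csh accordingly. The function name `lmod_init_for` is hypothetical, and the use of `ps -p $$ -o comm=` assumes a procps-style `ps`; where `ps` is missing or rejects these options, the snippet reports "unknown".

```shell
#!/bin/bash
# Hypothetical helper: pick the Lmod init script that matches a shell name.
# Accepts a bare name ("tcsh"), a path ("/bin/tcsh"), or a login-shell
# name with a leading dash ("-tcsh").
lmod_init_for() {
  case "${1##*/}" in
    csh|tcsh|-csh|-tcsh) echo "init_lmod.csh" ;;
    *)                   echo "init_lmod.sh" ;;
  esac
}

# SHELL is inherited from the login session, so it may be stale.
echo "SHELL says:    ${SHELL:-unset}"

# Ask the process table for the name of the shell running this code.
if command -v ps >/dev/null 2>&1; then
  running=$(ps -p $$ -o comm= 2>/dev/null || true)
  echo "running shell: ${running:-unknown}"
  echo "init script:   $(lmod_init_for "${running:-sh}")"
fi
```

This matches the transcript above: sourcing the .sh script from tcsh fails with "Illegal variable name.", while the .csh script loads.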

@JessicaMeixner-NOAA
Collaborator

@climbfuji I also can load on the login node:

> source /lustre/f2/pdata/esrl/gsd/contrib/lua-5.1.4.9/init/init_lmod.csh
Activating lua module environment
Reloading modules ... (sit back and relax)

but when the job is submitted, the modules do not load, which makes this hard to debug. I'd be happy to do what I can to help test and reproduce the issue. I made another issue for this (#536); should I close it?

@DeniseWorthen
Collaborator Author

I was able to build @JessicaMeixner-NOAA's gaea_ww3 branch using:

source /lustre/f2/pdata/esrl/gsd/contrib/lua-5.1.4.9/init/init_lmod.sh
module use modulefiles/
module load ufs_gaea.intel
CMAKE_FLAGS="-DAPP=S2SW" CCPP_SUITES="FV3_GFS_2017_coupled,FV3_GFS_2017_satmedmf_coupled,FV3_GFS_v15p2_coupled" BUILD_VERBOSE=1 BUILD_JOBS=1 ./build.sh > output 2>&1 &

@climbfuji
Collaborator

@JessicaMeixner-NOAA I told init_lmod.sh (and init_lmod.csh) to ignore errors while loading modules. With that I could switch to tcsh and submit a job card from the ufs-weather-model, which uses something like this:

#!/bin/bash -l
#SBATCH -e err.bash
#SBATCH -o out.bash
#SBATCH --job-name="init_lmod_bash_test"
#SBATCH --account=esrl_bmcs
#SBATCH --qos=normal
#SBATCH --clusters=c4
#SBATCH --ntasks=1
#SBATCH --time=5

set -eux

source ./module-setup.sh
source /lustre/f2/pdata/esrl/gsd/contrib/lua-5.1.4.9/init/init_lmod.sh
module use $( pwd -P )
module load modules.fv3
module list

echo "Model started:  " `date`

sync && sleep 1
# here would be the call to srun ... fv3.exe

echo "Model ended:    " `date`

Can you check if this works for you? It's not an ideal solution: if something changes in the module environment that breaks the init_lmod scripts, we'll only find out when we compile the model or run the tests. But it's better than nothing (if it works). Thanks!
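
The change made inside init_lmod.sh is not shown in this thread. As a rough illustration of the same idea applied from the calling side, the sketch below sources a file with `set -e` temporarily disabled, so that a failing command inside it (such as a module that cannot be loaded) does not abort the job script. `source_tolerant` is a hypothetical name, and this is an assumption about one possible approach, not the actual fix.

```shell
#!/bin/bash
# Hypothetical helper: source a file without letting a failing command
# inside it abort a script that runs under `set -e`.
# It does not protect against `set -u` errors or an explicit `exit`
# in the sourced file.
source_tolerant() {
  local had_errexit=0 rc
  case $- in *e*) had_errexit=1 ;; esac

  if [ ! -r "$1" ]; then
    echo "warning: $1 is not readable; skipping" >&2
    return 0
  fi

  set +e
  . "$1"
  rc=$?
  # Restore errexit only if the caller had it enabled.
  if [ "$had_errexit" -eq 1 ]; then set -e; fi

  if [ "$rc" -ne 0 ]; then
    echo "warning: $1 returned $rc; continuing" >&2
  fi
  return 0
}

# Path taken from the job card above; skipped with a warning where it does not exist.
source_tolerant /lustre/f2/pdata/esrl/gsd/contrib/lua-5.1.4.9/init/init_lmod.sh
```

The trade-off is the one already noted: real breakage in the init script is downgraded to a warning, so it only surfaces later when a module turns out to be missing.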

@JessicaMeixner-NOAA
Collaborator

@climbfuji Where can I find the module-setup.sh file? I copied the /modulefiles/ufs_gaea.intel to modules.fv3, but it fails because there is no module-setup.sh.

@climbfuji
Collaborator

Can you copy it from here for now? It's some file under NEMS with a different name.

/lustre/f2/scratch/Dom.Heinzeller/FV3_RT/init_lmod_test

@JessicaMeixner-NOAA
Collaborator

I think it worked. My directory is here: /lustre/f2/scratch/ncep/Jessica.Meixner/init_lmod_test
I ran tryfix.sub

@climbfuji
Collaborator

Yes, looks good. Can you try building the APP S2SW?

@JessicaMeixner-NOAA
Collaborator

I submitted rt.sh -e and that still failed... I guess I'll have to switch from tcsh to bash?

I think Denise can now build with the fix I suggested, so we can hopefully at least have that fixed, even if I can't do it myself.

@DusanJovic-NOAA
Collaborator

> I guess I'll have to switch from tcsh to bash?

Good idea. Regardless of this issue.

@DeniseWorthen
Collaborator Author

The build on Gaea was added in PR #533.

pjpegion pushed a commit to NOAA-PSL/ufs-weather-model that referenced this issue Apr 4, 2023
* Reset to zero coupling arrays for accumulated snow, large scale rain, and convective rain at the end of each coupling step if coupling with chemistry model.
* Properly set kind type of literal constants defining zero and one.
* Initialize to zero canopy resistance output variable in noah/osu land-surface model subdriver.
* Re-implement radiation diagnostic output involving spectral band layer cloud optical depths (0.55 and 10 mu channels) to prevent floating invalid errors due to uninitialized optical depth arrays.
* Temporarily disable filling export fields during the NUOPC Realize phase since it breaks coupling with aerosol component.
* Increase maximum number of input aerosol scavenging factors to accommodate AQM/CMAQ 5.2.1 chemical tracers.
* Remove inst_pres_height_surface from chemistryFieldNames as it's imported already elsewhere.

Co-authored-by: Raffaele Montuoro <raffaele.montuoro@noaa.gov>