bug: WRF tutorial : gen_retro_icbc.csh has set paramfile twice #295

Closed
hkershaw-brown opened this issue Oct 1, 2021 · 14 comments

@hkershaw-brown
Member

hkershaw-brown commented Oct 1, 2021


Quick note on gen_retro_icbc.csh
Will fill in details when I run this.

   set paramfile = /glade2/scratch2/USERNAME/WORK_DIR/scripts/param.csh   # set this appropriately #%%%#
   set paramfile = /glade/work/thoar/DART/clean_rma_trunk/models/wrf/tutorial/scripts/param.csh
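
Because the two assignments run in order, the second value is the one in effect by the time the script uses paramfile, so editing only the first line changes nothing. A rough way to spot this kind of duplicate is to count `set NAME = ...` lines per variable. The sketch below is plain sh rather than csh, and `find_duplicate_sets` is a hypothetical helper name, not something in the DART scripts:

```shell
#!/bin/sh
# Hypothetical helper: list csh variables that are assigned with `set` more
# than once in a script, followed by the number of assignments.
find_duplicate_sets() {
    # Keep only the variable name from lines of the form "set NAME = value",
    # then count how often each name occurs and print the repeated ones.
    sed -n -E 's/^[[:space:]]*set[[:space:]]+([A-Za-z_][A-Za-z0-9_]*)[[:space:]]*=.*/\1/p' "$1" \
        | sort | uniq -c | awk '$1 > 1 { print $2, $1 }'
}

# Demo on a throwaway file that mimics the two lines quoted above.
demo=$(mktemp)
cat > "$demo" << 'EOF'
set paramfile = /glade2/scratch2/USERNAME/WORK_DIR/scripts/param.csh
set paramfile = /glade/work/thoar/DART/clean_rma_trunk/models/wrf/tutorial/scripts/param.csh
set datea = 2017042700
EOF
find_duplicate_sets "$demo"   # prints: paramfile 2
```

This only catches simple one-line `set` statements; it does not understand `setenv` or several assignments on one line.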

Describe the bug

Last week a couple of users hit problems in the WRF tutorial where the
input.nml templates were not found.

  1. List the steps someone needs to take to reproduce the bug.
  2. What was the expected outcome?
  3. What actually happened?

Error Message

Please provide any error messages.

gen_retro_icbc.csh is running in /scratch/xxxxxx/DART/models/wrf/work_example/scripts
Entering gen_retro_icbc.csh for 2017042700
rm: No match.
FATAL ERROR in find_namelist_in_file Namelist input file: input.nml must exist. utilities_mod.f90 stopping.
RUNNING REAL, STEP 1

Which model(s) are you working with?

WRF

Version of DART

Which version of DART are you using?
v9.11.11

Have you modified the DART code?

No

Build information

I think this affects any machine.

@hkershaw-brown hkershaw-brown self-assigned this Oct 1, 2021
@hkershaw-brown hkershaw-brown added the Bug Something isn't working label Oct 4, 2021
@hkershaw-brown
Member Author

hkershaw-brown commented Oct 4, 2021

Another note on this:
Following the instructions, BASE_DIR needs to be set as an environment variable outside the scripts and is then set again in param.csh. The setup is quite a mix of copying files by hand and running setup scripts.

note: shell_scripts/init_ensemble_var.csh

   echo "  QUEUEING ENSEMBLE MEMBER $n at `date`"

   mkdir -p ${RUN_DIR}/advance_temp${n}

   # TJH why does the run_dir/*/input.nml come from the template_dir and not the rundir?
   # TJH furthermore, template_dir/input.nml.template and rundir/input.nml are identical. SIMPLIFY.

   ${LINK} ${RUN_DIR}/WRF_RUN/* ${RUN_DIR}/advance_temp${n}/.
   ${LINK} ${TEMPLATE_DIR}/input.nml.template ${RUN_DIR}/advance_temp${n}/input.nml

   ${COPY} ${OUTPUT_DIR}/${initial_date}/wrfinput_d01_${gdate[1]}_${gdate[2]}_mean \
           ${RUN_DIR}/advance_temp${n}/wrfvar_output.nc
   sleep 3
   ${COPY} ${RUN_DIR}/add_bank_perts.ncl ${RUN_DIR}/advance_temp${n}/.

   set cmd3 = "ncl 'MEM_NUM=${n}' 'PERTS_DIR="\""${PERTS_DIR}"\""' ${RUN_DIR}/advance_temp${n}/add_bank_perts.ncl"
   ${REMOVE} ${RUN_DIR}/advance_temp${n}/nclrun3.out
          cat >!    ${RUN_DIR}/advance_temp${n}/nclrun3.out << EOF
          $cmd3
EOF
   echo $cmd3 >! ${RUN_DIR}/advance_temp${n}/nclrun3.out.tim   # TJH replace cat above

   cat >! ${RUN_DIR}/rt_assim_init_${n}.csh << EOF

@hkershaw-brown hkershaw-brown changed the title bug: gen_retro_icbc.csh has set paramfile twice bug: WRF tutorial : gen_retro_icbc.csh has set paramfile twice Oct 5, 2021
@hkershaw-brown
Member Author

hkershaw-brown commented Oct 8, 2021

note 2:

There are a couple of places where everything in the WRF_RUN directory gets linked:

init_ensemble_var.csh

   ${LINK} ${RUN_DIR}/WRF_RUN/* ${RUN_DIR}/advance_temp${n}/.

new_advance_model.csh

   # link WRF-runtime files (required) and be.dat (if using WRF-Var)
     ${LN} ${CENTRALDIR}/WRF_RUN/*       .

A user asked about rsl.out.0000 and rsl.error.0000 in WRF_RUN getting linked into the directory of every ensemble member, which means all ensemble members would be writing to WRF_RUN/rsl.out.0000.

Is the script expecting that you never run wrf.exe in WRF_RUN? I think the scripts expect WRF_RUN to contain only the files needed to run wrf, not any output files.
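
If WRF_RUN may contain leftover logs, one option is to skip them when linking. This is only a sketch in plain sh (the tutorial scripts are csh), and `link_wrf_run` is a hypothetical helper, not part of DART. It assumes the per-run logs are the only files that should not be shared:

```shell
#!/bin/sh
# Hypothetical helper: link the contents of a WRF_RUN-style directory into a
# member directory, skipping rsl.out.* and rsl.error.* left by earlier runs.
link_wrf_run() {
    src_dir="$1"
    dest_dir="$2"
    mkdir -p "$dest_dir"
    for f in "$src_dir"/*; do
        [ -e "$f" ] || continue                  # empty directory: glob did not match
        case "$(basename "$f")" in
            rsl.out.*|rsl.error.*) continue ;;   # per-run logs, do not share
        esac
        ln -sf "$f" "$dest_dir"/
    done
}

# Demo with throwaway directories standing in for RUN_DIR.
run_dir=$(mktemp -d)
mkdir -p "$run_dir/WRF_RUN"
touch "$run_dir/WRF_RUN/wrf.exe" "$run_dir/WRF_RUN/LANDUSE.TBL" \
      "$run_dir/WRF_RUN/rsl.out.0000" "$run_dir/WRF_RUN/rsl.error.0000"
link_wrf_run "$run_dir/WRF_RUN" "$run_dir/advance_temp1"
ls "$run_dir/advance_temp1"
```

The source directory should be given as an absolute path so the symbolic links resolve from inside the member directory.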

@hkershaw-brown
Member Author

note 3: prep_ic.csh

if ( $#argv > 0 ) then
   set n     = ${1}   # pass in the ensemble member number
   set datep = ${2}   # needed for correct path to file
   set dn    = ${3}
   set paramfile = ${4}
else # values come from environment variables   #TJH If these are not set ....
   set n     = $mem_num
   set datep = $date
   set dn    = $domain
   set paramfile = $paramf
endif
source $paramfile

-echo "prep_ic.csh using n=$n datep=$datep dn=$dn paramfile=$paramf" ! paramfile might be ${4} not $paramf
+echo "prep_ic.csh using n=$n datep=$datep dn=$dn paramfile=$paramfile"
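
The "#TJH If these are not set ...." comment points at a second gap: when no arguments are passed and the environment variables are missing, the script fails with an unhelpful message. A sketch of the same argument handling with explicit checks follows. It is written in plain sh rather than csh, `resolve_prep_ic_args` is a hypothetical name, and `/path/to/param.csh` is a placeholder:

```shell
#!/bin/sh
# Hypothetical sh rendering of the prep_ic.csh argument handling: use the
# positional arguments when given, otherwise fall back to environment
# variables, and stop with a message if a needed value is missing.
resolve_prep_ic_args() {
    if [ "$#" -gt 0 ]; then
        if [ "$#" -lt 4 ]; then
            echo "usage: prep_ic n datep dn paramfile" >&2
            return 1
        fi
        n="$1"; datep="$2"; dn="$3"; paramfile="$4"
    else
        # mem_num, date, domain and paramf are the environment variable
        # names used in the csh snippet quoted above.
        for name in mem_num date domain paramf; do
            eval "value=\${$name:-}"
            if [ -z "$value" ]; then
                echo "prep_ic: environment variable $name is not set" >&2
                return 1
            fi
        done
        n="$mem_num"; datep="$date"; dn="$domain"; paramfile="$paramf"
    fi
    # Report the value actually in use (paramfile), whichever branch set it.
    echo "prep_ic using n=$n datep=$datep dn=$dn paramfile=$paramfile"
}

resolve_prep_ic_args 3 2017042700 1 /path/to/param.csh
```

Echoing `$paramfile` rather than `$paramf` matches the corrected line in the diff above, since `$paramf` is only defined on the environment-variable path.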

@braczka
Contributor

braczka commented Feb 8, 2023

@hkershaw-brown Just curious about the status of this bug --- seems the paramfile being set twice is the source of the bug, and the other comments are related to general improvements of the WRF-DART tutorial scripting?

@hkershaw-brown
Member Author

@braczka I have not worked on the wrf tutorial scripts. I've helped several users work through the tutorial, and I would recommend that users who are familiar with scripting write their own scripts. The tutorial states "You will need to edit these scripts, perhaps extensively, to run them within your particular computing environment." This is an understatement.

I'd like to rewrite the wrf tutorial, but this issue has been hanging out there because we haven't had the manpower/resource to commit to it. I would use a smaller wrf case (the run takes an hour on Cheyenne, which is too long to be debugging scripts efficiently).

@hkershaw-brown
Member Author

@braczka pull #454 removes the second set paramfile
I'll bundle this in with the release for pull #450

@braczka
Contributor

braczka commented Feb 8, 2023

Thanks @hkershaw-brown , it also makes sense to keep this issue open for reference for now with your additional notes as I become more familiar with WRF.

@hkershaw-brown
Member Author

will do, I'll leave this issue open.
Let me know when you are up for a wrf tutorial re-write!

@braczka
Contributor

braczka commented Feb 27, 2023

I forgot to note another linking error in the WRF-DART tutorial. It is independent of the duplicate set paramfile, which was already fixed. Executing gen_retro_icbc.csh in Step 2 of the tutorial gives the following:

Entering gen_retro_icbc.csh for 2017042700
FATAL ERROR in find_namelist_in_file Namelist input file: input.nml must exist. utilities_mod.f90 stopping.
RUNNING REAL, STEP 1
 
set: Variable name must contain alphanumeric characters.

The existing link command fails:

   ${LINK} ${TEMPLATE_DIR}/input.nml.template input.nml

because input.nml.template does not exist at that path. It should be either:

   ${LINK} ${DART_DIR}/models/wrf/tutorial/template/input.nml.template input.nml

or

   ${LINK} ${DART_DIR}/models/wrf/work/input.nml input.nml
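
Whichever path is used, checking that the template exists before linking would turn the later "input.nml must exist" failure into an immediate, specific message. A minimal sketch in plain sh follows. It assumes ${LINK} is a symbolic link command such as `ln -sf`, which can create a dangling link without complaint, and `link_input_nml` is a hypothetical helper name:

```shell
#!/bin/sh
# Hypothetical helper: link a template to input.nml, but fail loudly when the
# template is missing instead of leaving a dangling link behind.
link_input_nml() {
    template="$1"
    target="${2:-input.nml}"
    if [ ! -f "$template" ]; then
        echo "ERROR: template not found: $template" >&2
        return 1
    fi
    ln -sf "$template" "$target"
}

# Demo: the first call fails because the template does not exist yet.
work=$(mktemp -d)
mkdir -p "$work/template"
link_input_nml "$work/template/input.nml.template" "$work/input.nml" \
    || echo "link skipped"
touch "$work/template/input.nml.template"
link_input_nml "$work/template/input.nml.template" "$work/input.nml" \
    && echo "linked"
```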

Will keep this open just a bit longer to see if current users uncover any other easy fixes.

@braczka
Contributor

braczka commented Mar 9, 2023

When I run the basic WRF tutorial and run gen_retro_icbc.csh, the script throws an MPT error when it submits the pbs script real.csh, which has the command:

mpiexec_mpt dplace -s 1 ${RUN_DIR}/WRF_RUN/real.exe

An example of the error is the following:

starting wrf task  0  of 1
MPT: Launcher network accept (MPI_LAUNCH_TIMEOUT) timed out
MPT: Launcher on r13i5n4 failed to receive connection(s) from: 
r13i5n4.ib0.cheyenne.ucar.edu r13i5n0.ib0.cheyenne.ucar.edu 
r7i1n5.ib0.cheyenne.ucar.edu
MPT: MPT ERROR: Check network connectivity between hosts.        
                  Retry after increasing value of MPI_LAUNCH_TIMEOUT.
                  See MPI(1) for details.
MPT ERROR: could not launch executable        
                  (HPE MPT 2.25  08/14/21 03:06:24)
Killed

My module environment while running the job is:

Currently Loaded Modules:
  1) ncarenv/1.3    3) ncarcompilers/0.5.0   5) ncl/6.6.2   7) diffuse/0.4.8
  2) intel/19.0.5   4) netcdf/4.7.4          6) nco/5.0.3   8) mpt/2.22

However, when the script executes, the mpt version is updated automatically:
The following have been reloaded with a version change:  1) mpt/2.22 => mpt/2.25

I searched for a similar bug related to MPI_LAUNCH_TIMEOUT on the WRF forum and found something similar here

They recommended running the job in serial and not in parallel, thus I removed the MPI command altogether in favor of:
${RUN_DIR}/WRF_RUN/real.exe

This worked, but I am unsure if this is something worth correcting in the WRF Tutorial scripting or just a result of how the real.exe file was compiled for my particular case. I am just keeping notes at this point. FYI -- I am using Moha's WRF executables for real.exe at /glade/work/gharamti/WRF, using version V3.9.1.1.

@mgharamti
Contributor

@braczka, you are running into this issue because all my WRF executables are compiled with openmpi. I am not a fan of mpt. So, I'd replace mpiexec_mpt with mpirun. Hopefully, this will fix your issue.

@braczka
Contributor

braczka commented Mar 9, 2023

Gotcha. I tried the mpirun command with openmpi before and it was failing, but I see now that this was because the modules were automatically replacing openmpi with mpt while running gen_retro_icbc.csh, and I didn't catch it.

@braczka
Contributor

braczka commented Mar 9, 2023

Thanks for the feedback @mgharamti and @hkershaw-brown. When the real.csh script is submitted, param.csh is sourced before real.exe is executed. I hadn't updated my param.csh to reflect the use of openmpi, so the environment was changed by mistake before job submission. So this is really not an issue with the scripting at all.

@braczka
Contributor

braczka commented Mar 10, 2023

The input.nml.template issue occurs again during the ./init_ensemble_var.csh step. It is the same issue as mentioned above for the gen_retro_icbc.csh step.

The easiest fix is to add one line to the WRF tutorial instructions. After the existing copies:

cp $DART_DIR/models/wrf/tutorial/template/namelist.input.meso $BASE_DIR/template/.
cp $DART_DIR/models/wrf/tutorial/template/namelist.wps.template $BASE_DIR/template/.

add:

cp $DART_DIR/models/wrf/tutorial/template/input.nml.template $BASE_DIR/template/.

I will update.
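
For anyone scripting this step, the three copies can be sketched as a loop that reports a missing file instead of failing later. This is plain sh, not csh. `copy_wrf_templates` is a hypothetical helper, and the demo uses throwaway directories in place of a real DART_DIR and BASE_DIR:

```shell
#!/bin/sh
# Hypothetical helper: copy the three tutorial templates named above from
# DART_DIR into BASE_DIR/template, reporting any that are missing.
copy_wrf_templates() {
    src="$1/models/wrf/tutorial/template"
    dest="$2/template"
    status=0
    mkdir -p "$dest"
    for f in namelist.input.meso namelist.wps.template input.nml.template; do
        if [ -f "$src/$f" ]; then
            cp "$src/$f" "$dest/"
        else
            echo "missing template: $src/$f" >&2
            status=1
        fi
    done
    return "$status"
}

# Demo with throwaway directories standing in for DART_DIR and BASE_DIR.
dart_dir=$(mktemp -d)
base_dir=$(mktemp -d)
mkdir -p "$dart_dir/models/wrf/tutorial/template"
touch "$dart_dir/models/wrf/tutorial/template/namelist.input.meso" \
      "$dart_dir/models/wrf/tutorial/template/namelist.wps.template" \
      "$dart_dir/models/wrf/tutorial/template/input.nml.template"
copy_wrf_templates "$dart_dir" "$base_dir" && ls "$base_dir/template"
```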
