-
Notifications
You must be signed in to change notification settings - Fork 121
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[develop] Fixes for Issue #608, #610, and #616 #609
[develop] Fixes for Issue #608, #610, and #616 #609
Conversation
… for Hera and modify run_WE2E_tests.py so that this location is obtained in a platform-independent way.
…owing up in SRW clone); Make creation of symlinks to pregenerated files depend on whether downstream tasks need those symlinks.
…emoved for another PR.
…t been launched yet, use the launch script to launch it.
@@ -427,7 +427,8 @@ The last ${num_log_lines} lines of the workflow launch log file | |||
print_info_msg "$msg" >> "${expts_status_fp}" | |||
tail -n ${num_log_lines} ${launch_wflow_log_fn} >> "${expts_status_fp}" | |||
else | |||
wflow_status="Workflow status: NOT LAUNCHED YET" | |||
wflow_status="Workflow status: NOT LAUNCHED YET | |||
Launching workflow using script \"${launch_wflow_fn}\"..." |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think this is a good idea especially for jenkins tests. This used to be like this before (i.e. launch job after checking status) but I have had issues with jenkins tests, which call this script every minute to check for status of experiments, so I changed it. Moreover, I believe the behavior of this script should be consistent with its name, should simply check status of experiments without launching them. That way it will also respect the launch time assigned in each cronjob.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@danielabdi-noaa The original purpose of this script was to get the updated status of each experiment in a given directory from the command line without having to go into each directory to call launch_FV3LAM_wflow.sh
. I did not realize it had been modified and being used in the Jenkins tests since I've been working on a different branch the past few months. I'm finding now that I can now no longer use it for its original purpose since the new run_WE2E_tests.py
script does not call launch_FV3LAM_wflow.sh
to launch the experiments (so there is no log.launch_FV3LAM_wflow
log file even if the experiments are running). I suppose it will once the crontab capability is added back in, but still it would be nice to be able to relaunch a bunch of jobs manually to get their updated workflow statuses. I'm thinking an easy solution is to add an argument that determines this behavior, say launch_wflows=[true|false]
, with a default value of false
so it normally behaves as you expect. Does that work for you?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@gsketefian I disagree on this one. I think it is best to keep modularity and have a script/function that just checks experiment status, and another one that launches them etc. Having been burned with the surprise existence of "launch_wflow" in this script, which was launching the jobs on cheyene after a minute, I am hesitant to bringing back that functionality especially without changing the name of the script. Also the way it is now, you have no control over when to launch the job (it is immediately launched if doesn't exist), so I doubt the run_WE2E_tests.py
would use it once it matures. It probably wants to control when to launch jobs and how frequently to check for status using either CRON or something else.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@danielabdi-noaa Ok, we can rename it to something like launch_expts_get_status.sh
and put in that argument (this time with default of "true" makes more sense). I can also add another argument for a wait period before launching tasks. run_WE2E_tests.py
doesn't have to use it. Will that work?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@gsketefian I still don't think it is a good idea but I will defer to other reviewers. Although the naming change helps, passing flags to make it do everything is not modular programming. I mean it would be odd if I pass a flag to launch_workflow.sh
to make it check status of an experiment too, wouldn't you agree? In an ideal scenario, there should be one script/function that launches a single workflow, another one that just checks status of that single experiment, another one to iterateve over multiple workflows and launch them, and another to check the status of multiple workflows. Especially with launching workflows that has more logic (time interval of launch, is it using cron etc), I believe it should be its own function.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@danielabdi-noaa, @gsketefian, is there a partial solution to this problem that could be introduced in this PR and then fully implemented through a subsequent PR? In other words, @gsketefian, is it possible to make a few changes to get your feature working again, but also begin what @danielabdi-noaa has requested? @danielabdi-noaa, would you be OK with an initial change here, and then a future PR to fully introduce a set of scripts to completely modularize the behavior being discussed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@JeffBeck-NOAA I was waiting to hear back on slack from @mkavulich on whether he has plans to modify this script as part of his work on the WE2E tests, in which case it would not really make sense for me to do a whole rewrite and modularize it. But I think Mike is out of the office. It would certainly be ok with me to make the changes I suggested above and make an issue to do the bigger changes (modularizing it) at a later time.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@danielabdi-noaa I added the flag launch_wflows
as mentioned above. Adding a flag for a wait time before launching is more difficult; that is really determined by the CRON_RELAUNCH_INTVL_MNTS
variable in the workflow and/or argument to run_WE2E_tests.sh
(and soon its python version), and trying to work it into this script would take too much work that is probably not justified as @mkavulich pointed out. Please let me know if this is a satisfactory solution for the time being. @mkavulich any thoughts?
ush/config_defaults.yaml
Outdated
@@ -653,7 +653,7 @@ workflow: | |||
# | |||
#----------------------------------------------------------------------- | |||
# | |||
FIXdir: '{{ EXPTDIR if workflow_switches.RUN_TASK_MAKE_GRID else [user.HOMEdir, "fix"]|path_join }}' | |||
FIXdir: '{{ EXPTDIR }}' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
NCO requires that fix files be placed under $HOMEdir/fix
which is why we place the fix files there, and in EXPTDIR in "community" mode. The logic above is not straightforward since we are testing "RUN_TASK_MAKE_GRID", which is always set to false in NCO mode.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@danielabdi-noaa Ok, I didn't realize NCO requires the fix files to be in the directory of the clone. In that case, why not test the value of RUN_ENVIR
instead of RUN_TASK_MAKE_GRID
, since many researchers (i.e. non-NCO users) will be running in community mode but will turn off the make_grid
task once their grid is all set up. So the line would be changed to:
FIXdir: '{{ EXPTDIR if user.RUN_ENVIR == "community" else [user.HOMEdir, "fix"]|path_join }}'
I tested this with both RUN_ENVIR = "nco"
and RUN_ENVIR = "community"
and it works as expected in each case. Are you ok with that solution?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@gsketefian Sounds good. Note that this logic came about when de-tangling NCO mode from the "grid being pre-generated" so that we can run all WE2E tests with either nco or community modes. It was easier to base many of the logic based on RUN_TASK_MAKE_GRID, but some logic like symlinking/copying fix files have since became their own options, while NCO mode always symlinked fix files and community mode copied. It is now actually symlink by default to save space on experiment directory size.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@danielabdi-noaa Ok this one is updated now to check RUN_ENVIR
instead of RUN_TASK_MAKE_GRID
.
…NCO requires fix files to be in the SRW clone's home directory, but in community mode we want them to be in the experiment directory).
…ipt launches an unlaunched workflow when it encounters one.
@MichaelLueken I merged the latest develop into my branch and reran all the fundamental tests listed above on Hera. They all passed. Now just waiting to get some approves! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for addressing my suggestions!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@gsketefian Thanks for working with @danielabdi-noaa to address his concerns! I have run tests on Jet and the modifications to fix are working as expected. I will now approve these changes and submit the Jenkins tests.
Machine: hera |
|
@MichaelLueken Thanks for shepherding this through :) |
Machine: hera |
@gsketefian The first runs from the ci-hera-intel-WE, two tests failed. The rerun showed that the two tests that failed previously successfully passed. The Jenkins tests have all successfully completed without issue. I will now move forward with merging this work. Thanks! |
@MichaelLueken Thanks for merging and closing the accompanying issues. Next vx PR coming soon! |
The way the fix directory is set changed in PR #609, specifically item number 3. I think this has been causing an issue with forced run_envir=nco mode. I don't have a full explanation for this at the moment, but reverting back to old way of setting FIXdir seems to atleast partially fix the issue. I was able to run fundamental tests successfully on Hera using run_envir=nco.
DESCRIPTION OF CHANGES:
MET_verification_only_vx
(Issue [develop] Fix WE2E yaml configuration file forMET_verification_only_vx
#608).FIXdir
toHOMEdir/fix
only whenRUN_ENVIR="nco"
, not whenRUN_TASK_MAKE_GRID=False
; otherwise, setFIXdir
toEXPTDIR
(Issue [develop] Decide on location of symlinks to fix files depending onRUN_ENVIR
, notRUN_TASK_MAKE_GRID
#616).get_expts_status.sh
so that if an experiment hasn't been launched yet, it calls the launch scriptlaunch_FV3LAM_wflow.sh
to launch it instead of only outputting a message that it's not yet launched.Type of change
TESTS CONDUCTED:
The following fundamental tests were conducted on Hera with Intel and passed:
MET_verification
MET_verification_only_vx
community_ensemble_2mems
grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_2017_gfdlmp
grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_2017_gfdlmp_regional_plot
grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v15p2
grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v16
grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_RAP_suite_HRRR
grid_RRFS_CONUS_25km_ics_GSMGFS_lbcs_GSMGFS_suite_GFS_v15p2
grid_RRFS_CONUScompact_25km_ics_HRRR_lbcs_RAP_suite_HRRR
grid_RRFS_CONUScompact_25km_ics_HRRR_lbcs_RAP_suite_RRFS_v1beta
grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_2017_gfdlmp_regional_plot
MET_ensemble_verification
community_ensemble_2mems_stoch
pregen_grid_orog_sfc_climo
In addition, other tests were conducted that run some pre-processing tasks but not others (e.g. run
make_orog
but notmake_grid
andmake_sfc_climo
), and they also passed. Those are not included as new WE2E tests to keep the number of tests low.DOCUMENTATION:
No changes to documentation are necessary as far as I can tell.
ISSUE:
This resolves Issues #608 and #610.
CHECKLIST
NA
LABELS (optional):
A Code Manager needs to add the following labels to this PR: