[develop] First implementation of run_WE2E_tests.py #558
Conversation
- Write new config file and print as test
…ge what gets printed to screen, and have all messages printed to log file
…ic functions, not ready to use yet
Implement the rest of the command line options, then finally implement non-cron submission and tracking of jobs. Changes in this commit:
- Ensure all dictionary entries within each section are uppercase
- Implement functions for updating settings in task_get_extrn_ics and task_get_extrn_lbcs sections
- Import generate_FV3LAM_wflow function directly and call it
- Implement debugging arguments in generate_FV3LAM_wflow.py and setup.py
- Tweak setup_logging to avoid double-printing of log messages when calling generate_FV3LAM_wflow from another function with logging
…ment. Much cleaner implementation
- Remove workdir dependencies from setup function
- Return "EXPT_DIR" from workflow generation function, move "completion" message outside of the generation function to avoid confusing/incorrect "success" messages
monitor_jobs.py (currently just starts the first rocotorun of each experiment)
- Add function for writing monitor file (this will be overwritten each time)
- Add function to query rocoto database file for each experiment (using sqlite3), and extract the relevant information for each job
- Add a skeleton function for updating the status of each experiment (based on each job's individual status within that experiment)
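The sqlite3 query described above can be sketched as follows. The `jobs` table and its column names (`taskname`, `cycle`, `state`, `exit_status`) are assumptions about Rocoto's database layout and may differ between Rocoto versions; treat this as an illustrative sketch, not the PR's exact code.

```python
import sqlite3

def get_job_statuses(db_path):
    """Query a Rocoto SQLite database file for per-job status.

    NOTE: the table/column names here are assumptions about Rocoto's
    schema and may need adjusting for a given Rocoto version.
    """
    with sqlite3.connect(db_path) as conn:
        cur = conn.execute(
            "SELECT taskname, cycle, state, exit_status FROM jobs"
        )
        return [
            dict(zip(("taskname", "cycle", "state", "exit_status"), row))
            for row in cur.fetchall()
        ]
```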
…o not fail if the respective input files don't exist (since that task will not be run anyway)
monitor_jobs.py:
- Remove verbose flags from rocotorun calls, since we don't use the output anyway
- Remove some debug prints, add some others
- We need to loop over a copy of running_expts rather than the original so we can remove entries within the loop
- Move rocotorun to the end of the status check loop to give time for the rocoto database to be fully updated before it is checked again
- Add a short delay between loops over running_expts for further safety against slowly updating databases

run_WE2E_tests.py:
- If RUN_TASK_GET_EXTRN_ICS or RUN_TASK_GET_EXTRN_LBCS are False, skip the setup function for that respective task
- Fix staged paths for ICS and LBCS
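The "loop over a copy" fix can be illustrated with a minimal sketch (the experiment names and status values below are hypothetical):

```python
import copy

running_expts = {
    "grid_RRFS_CONUS_25km": {"status": "COMPLETE"},
    "nco_grid_RRFS_CONUS_25km": {"status": "RUNNING"},
}

# Iterate over a deep copy so entries can be removed from the original
# dict mid-loop; mutating a dict while iterating it directly raises
# RuntimeError.
for name, expt in copy.deepcopy(running_expts).items():
    if expt["status"] == "COMPLETE":
        running_expts.pop(name)
```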
… setup.py but not in test config file
…false, do not fail if the respective input files don't exist (since that task will not be run anyway)" I forgot that the default for USE_USER_STAGED_EXTRN_FILES is false, so if True then we should expect failure. This reverts commit ef072f9.
…onfig.deactivate_tasks.yaml; since we are not running these tasks, we should not set any of those settings
…operator simply references the original). Also add some more debugging/timing info.
…figure out what to do with specify_template_filenames...
particularly handy for debugging the functionality, and seems to work flawlessly as implemented :) Additionally, adding multiple calls to rocotorun in a row to get around a potential bug with rocotorun leaving hung background processes. In correspondence with Chris to try to solve this in a cleaner way.
…; seems to be specific to Hera head nodes (but could appear elsewhere)
- Remove tests that are symlinks to tests already included
- Fix bug in capability to include list of tests as a file
- Remove some unnecessary prints
- Handle blank/empty lines in a test file without failure
- Omit duplicate tests similarly to symlink duplicates
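A minimal sketch of that test-file handling (a hypothetical helper illustrating the behavior, not the PR's exact code): blank lines are skipped, and duplicate names are silently dropped while preserving order.

```python
def read_test_list(lines):
    """Return test names from a test file's lines, skipping blank lines
    and omitting duplicates while preserving first-seen order."""
    seen = set()
    tests = []
    for line in lines:
        name = line.strip()
        if not name or name in seen:
            continue  # blank line or duplicate test: skip, don't fail
        seen.add(name)
        tests.append(name)
    return tests
```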
```python
if not match:
    raise Exception(f"Could not find test {test}")
# Because some test files are symlinks to other tests, check that we don't
# include the same test twice
```
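One way to implement the symlink check discussed here is to compare resolved paths; this is an illustrative sketch using `os.path.realpath`, not necessarily the PR's approach.

```python
import os

def dedup_by_realpath(paths):
    """Keep only the first path that resolves to each real file, so a test
    listed both directly and via a symlink is counted once."""
    seen = set()
    unique = []
    for p in paths:
        real = os.path.realpath(p)  # resolves symlinks to the target file
        if real not in seen:
            seen.add(real)
            unique.append(p)
    return unique
```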
@mkavulich I thought you were getting rid of the symlinks feature. Seems like you're still checking for that here.
Sorry if I was confusing in our previous conversation; I had no plans to change the existence of symlinks; I only changed it so that this does not result in a failure, it simply omits duplicate tests. Whether or not that's the best approach, I'm all ears on opinions there.
@mkavulich I tried testing what happens with duplicate file names: I copied … When I launched …
@gsketefian The case of duplicate file names is not one I had accounted for; I had only accounted for the user specifying duplicate tests. I just pushed a new commit with a check that there are no config.TESTNAME.yaml files with the same TESTNAME.
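The duplicate-TESTNAME check described here can be sketched as follows; the helper name and the exact filename-matching rules are assumptions for illustration.

```python
import os
from collections import Counter

def duplicate_testnames(config_paths):
    """Extract TESTNAME from paths of the form .../config.TESTNAME.yaml
    and return any names that occur more than once."""
    names = []
    for path in config_paths:
        base = os.path.basename(path)
        if base.startswith("config.") and base.endswith(".yaml"):
            # strip the "config." prefix and ".yaml" suffix
            names.append(base[len("config."):-len(".yaml")])
    return sorted(n for n, count in Counter(names).items() if count > 1)
```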
…issue_462_new_WE2E_script_rebased
- Use os.path.join for platform agnosticism
- Rearrange logic to avoid non-failure use of "try" block
- Fix another bug if blank lines are included in test file
- Put un-closed calls to open() in "with" block
- Expand update_expt_status() to reduce duplicate code
- Update docstrings
@gsketefian @christinaholtNOAA Thank you for your reviews, I believe I have addressed all your concerns. Let me know if I missed something or you have more questions/comments.
@mkavulich I was testing redundancy with symlinks (I had a symlink to a test, and I listed both in my list of tests), and the script caught the redundancy with an appropriate warning. Just the output to screen is not well-formatted (seems to be a lot of extra spaces where there should be just one) and thus a bit hard to read. Do you mind fixing if possible? Here's what the output looked like:
@mkavulich Two questions came to mind as I was doing further tests (these are just for discussion; I'm approving the PR):
As I mentioned last week, I'm totally fine with the PR as-is, but left just a couple of follow up comments just to circle back to the review.
@@ -0,0 +1,54 @@
# This is an example yaml file showing the various entries that can be created for tracking jobs by monitor_jobs.py
It could be nice just to drop in a comment on how this appeared here so that future you is not completely baffled. ;)
```python
# If RUN_TASK_GET_EXTRN_ICS is false, do nothing and return
if 'workflow_switches' in cfg:
    if 'RUN_TASK_GET_EXTRN_ICS' in cfg['workflow_switches']:
        if cfg['workflow_switches']['RUN_TASK_GET_EXTRN_ICS'] is False:
```
Oh. Sure. What about:

```python
if cfg.get('workflow_switches', {}).get('RUN_TASK_GET_EXTRN_ICS', True) is False:
    return cfg_ics
```

The logic should return if `workflow_switches` is set and `RUN_TASK_GET_EXTRN_ICS` is set to something that evaluates to False. If `workflow_switches` or `RUN_TASK_GET_EXTRN_ICS` isn't set, or `RUN_TASK_GET_EXTRN_ICS` evaluates to something that is not False, it does not return.
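The suggested one-liner can be exercised directly. One subtlety worth noting: `is False` matches only the boolean `False`, not other falsy values such as `0` or an empty string. The function name below is just for illustration.

```python
def skip_ics_setup(cfg):
    # True only when the switch exists and is literally the boolean False;
    # missing keys fall back to the default True and do not trigger the skip.
    return cfg.get('workflow_switches', {}).get('RUN_TASK_GET_EXTRN_ICS', True) is False
```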
Thanks @christinaholtNOAA; for expediency and to avoid unnecessary testing, I will roll these suggestions into my working branch for the next update to these scripts.
PR #566 changed the variable "MODEL" to a more descriptive name, but failed to make this change in config.community.yaml. The unit tests for generate_FV3LAM_wflow.py make use of this file as an input config.yaml, so they are now failing due to this incorrect variable name. This wasn't caught because prior to #558 the unit tests were broken for a different reason. This change simply makes the appropriate rename, which should fix the failing unit test. Also fixed an f-string that was missed in a setup.py error message.
This PR improves on the new ./run_WE2E_tests.py script (introduced in #558), implementing all the features present in the previous shell-based workflow. Some new files are also introduced for better organization and additional functionality:

* tests/WE2E/utils.py: A collection of functions used by other scripts, contained here to avoid circular dependencies.
* tests/WE2E/WE2E_summary.py: Given an experiment directory or .yaml file, outputs a summary to screen of each experiment, its status, and the number of core hours used. It also prints a summary file with detailed information about each task for each experiment.
* tests/WE2E/print_test_info.py: Prints a file WE2E_test_info.txt, very similar to the legacy WE2E_test_info.csv with just a few minor format differences.

Any of these scripts can be run with the -h argument to print information about all available options (not including utils.py, which is not designed to be run stand-alone). With this PR, the old superseded shell-based tools are removed.
DESCRIPTION OF CHANGES:
This PR introduces two new scripts to the repository: `run_WE2E_tests.py` and `monitor_jobs.py`. The purpose of these scripts is to eventually provide a pythonic replacement for the current workflow end-to-end test submission script. Additionally, the monitor_jobs function gives the capability to monitor and submit jobs automatically via the command line or a batch job, rather than relying on crontab entries.

run_WE2E_tests.py

This new script is roughly analogous to the legacy bash version, `run_WE2E_tests.sh`. It will set up and run a set of workflow end-to-end tests as specified by the user. This script is similar in behavior to the original script, but introduces several improvements and/or simplifications:

- The separate arguments (`tests_file`, `test_type`, and `test_name`) are collapsed into a single argument, `--tests`.
- Detailed output is written to a log file (`log.run_WE2E_tests`). This can prevent important messages from being drowned out by the flood of information printed to screen.
- Unless `use_cron_to_relaunch` is set to true, the experiment data will be automatically fed into the `monitor_jobs()` function, which will launch and track experiments until all are complete. If the script is interrupted for some reason, it can be re-run and the experiments continued (see below).

Example usage:

```
./run_WE2E_tests.py -m=hera -a=fv3lam --tests=testlist
./run_WE2E_tests.py -t=nco_inline_post -m=orion -a=gsd-fv3-test -q
./run_WE2E_tests.py -m=jet -a=gsd-fv3-dev --tests=comprehensive -d
./run_WE2E_tests.py -t=fundamental -c=gnu -m=cheyenne -a=P48500053 --expt_basedir=/glade/scratch/kavulich
```
Example output:

The `--help` flag gives some usage information. For a real example of its usage, here I use the `-q` flag to suppress output from `generate_FV3LAM_workflow()`, showing just the output being made by this script. Note that the message about Inline post is a "warning"-level message from generate_FV3LAM_workflow, and so is still printed to screen.
monitor_jobs.py

This new script, designed to be called automatically by run_WE2E_tests.py or run stand-alone, will read a dictionary (either provided directly to the function or read from a YAML file) that specifies the location of a number of experiments that need to be monitored. The main function, monitor_jobs(), will keep track of these experiments, advance the workflow with calls to `rocotorun` in each experiment directory, and monitor successes and errors as they occur, reporting a summary at the end.

In addition, while these jobs are being monitored and run, a YAML file tracking the status of all jobs will be written to disk. This file can be read directly to see the details of how each job is coming along, but most importantly, this job file can be read back into monitor_jobs.py as a command line argument. Therefore, if the `./run_WE2E_tests.py` script fails or is quit at any point after the experiments have been generated, the script can be re-started and will continue to monitor jobs where it left off.

Example usage:

In this case, I ran `./run_WE2E_tests.py` but killed it before all experiments completed. Here I just look in my test directory for the latest "monitor_jobs" yaml file and feed that back to the script. Querying the monitor file afterwards shows that all experiments are indeed complete.
Additional changes
In addition to the new scripts, a few changes have been made to the rest of the workflow that should have no significant impact on existing tests:
- generate_FV3LAM_wflow.py
- setup.py
- tests/WE2E/test_configs/wflow_features/config.deactivate_tasks.yaml
Missing features vs. run_WE2E_tests.sh
This initial version is incomplete, and a few more capabilities must be added before a wholesale replacement of run_WE2E_tests.sh:

- The options `cron_relaunch_intvl_mnts`, `generate_csv_file`, and `opsroot` are not yet implemented
- The use of `date` in DATE_FIRST_CYCL and DATE_LAST_CYCL is not yet implemented

Type of change
TESTS CONDUCTED:
DEPENDENCIES:
None
DOCUMENTATION:
Will be contributed with later PR deprecating old system. This is a preliminary implementation for wider exposure and feedback.
ISSUE:
CHECKLIST