[develop] First implementation of run_WE2E_tests.py #558
Conversation
- Write new config file and print as test
…ge what gets printed to screen, and have all messages printed to log file
…ic functions, not ready to use yet
Implement the rest of the command line options, then finally implement non-cron submission and tracking of jobs. Changes in this commit:
- Ensure all dictionary entries within each section are uppercase
- Implement functions for updating settings in task_get_extrn_ics and task_get_extrn_lbcs sections
- Import generate_FV3LAM_wflow function directly and call it
- Implement debugging arguments in generate_FV3LAM_wflow.py and setup.py
- Tweak setup_logging to avoid double-printing of log messages when calling generate_FV3LAM_wflow from another function with logging
…ment. Much cleaner implementation
- Remove workdir dependencies from setup function
- Return "EXPT_DIR" from workflow generation function, move "completion" message outside of the generation function to avoid confusing/incorrect "success" messages
monitor_jobs.py (currently just starts the first rocotorun of each experiment)
- Add function for writing monitor file (this will be overwritten each time)
- Add function to query rocoto database file for each experiment (using sqlite3), and extract the relevant information for each job
- Add a skeleton function for updating the status of each experiment (based on each job's individual status within that experiment)
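The sqlite3 query described above can be sketched as follows. The `jobs` table and its column names (`taskname`, `cycle`, `state`, `exit_status`) are assumptions about Rocoto's database layout and may differ between Rocoto versions; treat this as an illustrative sketch, not the PR's exact code.

```python
import sqlite3

def get_job_statuses(db_path):
    """Query a Rocoto SQLite database file for per-job status.

    NOTE: the table/column names here are assumptions about Rocoto's
    schema and may need adjusting for a given Rocoto version.
    """
    with sqlite3.connect(db_path) as conn:
        cur = conn.execute(
            "SELECT taskname, cycle, state, exit_status FROM jobs"
        )
        return [
            dict(zip(("taskname", "cycle", "state", "exit_status"), row))
            for row in cur.fetchall()
        ]
```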
…o not fail if the respective input files don't exist (since that task will not be run anyway)
monitor_jobs.py:
- Remove verbose flags from rocotorun calls, since we don't use the output anyway
- Remove some debug prints, add some others
- We need to loop over a copy of running_expts rather than the original so we can remove entries within the loop
- Move rocotorun to the end of the status check loop to give time for the rocoto database to be fully updated before it is checked again
- Add a short delay between loops over running_expts for further safety against slowly updating databases

run_WE2E_tests.py:
- If RUN_TASK_GET_EXTRN_ICS or RUN_TASK_GET_EXTRN_LBCS are False, skip the setup function for that respective task
- Fix staged paths for ICS and LBCS
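The "loop over a copy" fix can be illustrated with a minimal sketch (the experiment names and status values below are hypothetical):

```python
import copy

running_expts = {
    "grid_RRFS_CONUS_25km": {"status": "COMPLETE"},
    "nco_grid_RRFS_CONUS_25km": {"status": "RUNNING"},
}

# Iterate over a deep copy so entries can be removed from the original
# dict mid-loop; mutating a dict while iterating it directly raises
# RuntimeError.
for name, expt in copy.deepcopy(running_expts).items():
    if expt["status"] == "COMPLETE":
        running_expts.pop(name)
```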
… setup.py but not in test config file
…false, do not fail if the respective input files don't exist (since that task will not be run anyway)" I forgot that the default for USE_USER_STAGED_EXTRN_FILES is false, so if True then we should expect failure. This reverts commit ef072f9.
…onfig.deactivate_tasks.yaml; since we are not running these tasks, we should not set any of those settings
…operator simply references the original). Also add some more debugging/timing info.
…figure out what to do with specify_template_filenames...
particularly handy for debugging the functionality, and seems to work flawlessly as implemented :) Additionally, adding multiple calls to rocotorun in a row to get around a potential bug with rocotorun leaving hung background processes. In correspondence with Chris to try to solve this in a cleaner way.
…; seems to be specific to Hera head nodes (but could appear elsewhere)
- Remove tests that are symlinks to tests already included
- Fix bug in capability to include list of tests as a file
- Remove some unnecessary prints
- Handle blank/empty lines in a test file without failure
- Omit duplicate tests similarly to symlink duplicates
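A minimal sketch of that test-file handling (a hypothetical helper illustrating the behavior, not the PR's exact code): blank lines are skipped, and duplicate names are silently dropped while preserving order.

```python
def read_test_list(lines):
    """Return test names from a test file's lines, skipping blank lines
    and omitting duplicates while preserving first-seen order."""
    seen = set()
    tests = []
    for line in lines:
        name = line.strip()
        if not name or name in seen:
            continue  # blank line or duplicate test: skip, don't fail
        seen.add(name)
        tests.append(name)
    return tests
```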
```python
if not match:
    raise Exception(f"Could not find test {test}")
# Because some test files are symlinks to other tests, check that we don't
# include the same test twice
```
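One way to implement the symlink check discussed here is to compare resolved paths; this is an illustrative sketch using `os.path.realpath`, not necessarily the PR's approach.

```python
import os

def dedup_by_realpath(paths):
    """Keep only the first path that resolves to each real file, so a test
    listed both directly and via a symlink is counted once."""
    seen = set()
    unique = []
    for p in paths:
        real = os.path.realpath(p)  # resolves symlinks to the target file
        if real not in seen:
            seen.add(real)
            unique.append(p)
    return unique
```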
@mkavulich I thought you were getting rid of the symlinks feature. Seems like you're still checking for that here.
Sorry if I was confusing in our previous conversation; I had no plans to change the existence of symlinks; I only changed it so that this does not result in a failure, it simply omits duplicate tests. Whether or not that's the best approach, I'm all ears on opinions there.
@mkavulich I tried testing what happens with duplicate file names: I copied … When I launched …
@gsketefian The case of duplicate file names is not one I had accounted for; I had only accounted for the user specifying duplicate tests. I just pushed a new commit with a check that there are no config.TESTNAME.yaml files with the same TESTNAME.
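The duplicate-TESTNAME check described here can be sketched as follows; the helper name and the exact filename-matching rules are assumptions for illustration.

```python
import os
from collections import Counter

def duplicate_testnames(config_paths):
    """Extract TESTNAME from paths of the form .../config.TESTNAME.yaml
    and return any names that occur more than once."""
    names = []
    for path in config_paths:
        base = os.path.basename(path)
        if base.startswith("config.") and base.endswith(".yaml"):
            # strip the "config." prefix and ".yaml" suffix
            names.append(base[len("config."):-len(".yaml")])
    return sorted(n for n, count in Counter(names).items() if count > 1)
```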
…issue_462_new_WE2E_script_rebased
- Use os.path.join for platform agnosticism
- Rearrange logic to avoid non-failure use of "try" block
- Fix another bug if blank lines are included in test file
- Put un-closed calls to open() in "with" block
- Expand update_expt_status() to reduce duplicate code
- Update docstrings
@gsketefian @christinaholtNOAA Thank you for your reviews, I believe I have addressed all your concerns. Let me know if I missed something or you have more questions/comments.
@mkavulich I was testing redundancy with symlinks (I had a symlink to a test, and I listed both in my list of tests), and the script caught the redundancy with an appropriate warning. Just the output to screen is not well-formatted (seems to be a lot of extra spaces where there should be just one) and thus a bit hard to read. Do you mind fixing if possible? Here's what the output looked like:
@mkavulich Two questions came to mind as I was doing further tests (these are just for discussion; I'm approving the PR):
As I mentioned last week, I'm totally fine with the PR as-is, but left just a couple of follow up comments just to circle back to the review.
@@ -0,0 +1,54 @@
# This is an example yaml file showing the various entries that can be created for tracking jobs by monitor_jobs.py
It could be nice just to drop in a comment on how this appeared here so that future you is not completely baffled. ;)
```python
# If RUN_TASK_GET_EXTRN_ICS is false, do nothing and return
if 'workflow_switches' in cfg:
    if 'RUN_TASK_GET_EXTRN_ICS' in cfg['workflow_switches']:
        if cfg['workflow_switches']['RUN_TASK_GET_EXTRN_ICS'] is False:
```
Oh. Sure. What about:

```python
if cfg.get('workflow_switches', {}).get('RUN_TASK_GET_EXTRN_ICS', True) is False:
    return cfg_ics
```

The logic should return if `workflow_switches` is set and `RUN_TASK_GET_EXTRN_ICS` is set to something that evaluates to False. If `workflow_switches` or `RUN_TASK_GET_EXTRN_ICS` isn't set, or `RUN_TASK_GET_EXTRN_ICS` evaluates to something that is not False, it does not return.
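The suggested one-liner can be exercised directly. One subtlety worth noting: `is False` matches only the boolean `False`, not other falsy values such as `0` or an empty string. The function name below is just for illustration.

```python
def skip_ics_setup(cfg):
    # True only when the switch exists and is literally the boolean False;
    # missing keys fall back to the default True and do not trigger the skip.
    return cfg.get('workflow_switches', {}).get('RUN_TASK_GET_EXTRN_ICS', True) is False
```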
Thanks @christinaholtNOAA; for expediency and to avoid unnecessary testing, I will roll these suggestions into my working branch for the next update to these scripts.
PR #566 changed the variable "MODEL" to a more descriptive name, but failed to make this change in config.community.yaml. The unit tests for generate_FV3LAM_wflow.py make use of this file as an input config.yaml, so they are now failing due to this incorrect variable name. This wasn't caught because prior to #558 the unit tests were broken for a different reason. This change simply makes the appropriate rename, which should fix the failing unit test. Also fixed an f-string that was missed in a setup.py error message.
This PR improves on the new ./run_WE2E_tests.py script (introduced in #558), implementing all the features present in the previous shell-based workflow. Some new files are also introduced for better organization and additional functionality:

* tests/WE2E/utils.py: A collection of functions used by other scripts, contained here to avoid circular dependencies.
* tests/WE2E/WE2E_summary.py: Given an experiment directory or .yaml file, outputs a summary to screen of each experiment, its status, and the number of core hours used. It also prints a summary file with detailed information about each task for each experiment.
* tests/WE2E/print_test_info.py: Prints a file WE2E_test_info.txt, very similar to the legacy WE2E_test_info.csv with just a few minor format differences.

Any of these scripts can be run with the -h argument to print information about all available options (not including utils.py, which is not designed to be run stand-alone). With this PR, the old superseded shell-based tools are removed.
DESCRIPTION OF CHANGES:
This PR introduces two new scripts to the repository: `run_WE2E_tests.py` and `monitor_jobs.py`. The purpose of these scripts is to eventually provide a pythonic replacement for the current workflow end-to-end test submission script. Additionally, the monitor_jobs function gives the capability to monitor and submit jobs automatically via the command line or a batch job, rather than relying on crontab entries.

run_WE2E_tests.py

This new script is roughly analogous to the legacy bash version, `run_WE2E_tests.sh`. It will set up and run a set of workflow end-to-end tests as specified by the user. This script is similar in behavior to the original script, but introduces several improvements and/or simplifications:

- The separate arguments (`tests_file`, `test_type`, and `test_name`) are collapsed into a single argument, `--tests`.
- Detailed output is written to a log file (`log.run_WE2E_tests`). This can prevent important messages from being drowned out by the flood of information printed to screen.
- Unless `use_cron_to_relaunch` is set to true, the experiment data will be automatically fed into the `monitor_jobs()` function, which will launch and track experiments until all are complete. If the script is interrupted for some reason, it can be re-run and the experiments continued (see below).

Example usage:

```
./run_WE2E_tests.py -m=hera -a=fv3lam --tests=testlist
./run_WE2E_tests.py -t=nco_inline_post -m=orion -a=gsd-fv3-test -q
./run_WE2E_tests.py -m=jet -a=gsd-fv3-dev --tests=comprehensive -d
./run_WE2E_tests.py -t=fundamental -c=gnu -m=cheyenne -a=P48500053 --expt_basedir=/glade/scratch/kavulich
```
Example output:

The `--help` flag gives some usage information. For a real example of its usage, here I use the `-q` flag to suppress output from `generate_FV3LAM_workflow()`, showing just the output being made by this script. Note that the message about Inline post is a "warning"-level message from generate_FV3LAM_workflow, and so is still printed to screen.
monitor_jobs.py

This new script, designed to be called automatically by run_WE2E_tests.py or run stand-alone, will read a dictionary (either provided directly to the function or read from a YAML file) that specifies the location of a number of experiments that need to be monitored. The main function, monitor_jobs(), will keep track of these experiments, advance the workflow with calls to `rocotorun` in each experiment directory, and monitor successes and errors as they occur, reporting a summary at the end.

In addition, while these jobs are being monitored and run, a YAML file tracking the status of all jobs will be written to disk. This file can be read directly to see the details of how each job is coming along, but most importantly, this job file can be read back into monitor_jobs.py as a command line argument. Therefore, if the `./run_WE2E_tests.py` script fails or is quit at any point after the experiments have been generated, the script can be re-started and will continue to monitor jobs where it left off.

Example usage:

In this case, I ran `./run_WE2E_tests.py` but killed it before all experiments completed. Here I just look in my test directory for the latest "monitor_jobs" yaml file and feed that back to the script. Querying the monitor file afterwards shows that all experiments are indeed complete.
Additional changes
In addition to the new scripts, a few changes have been made to the rest of the workflow that should have no significant impact on existing tests:
- generate_FV3LAM_wflow.py
- setup.py
- tests/WE2E/test_configs/wflow_features/config.deactivate_tasks.yaml
Missing features vs. run_WE2E_tests.sh
This initial version is incomplete, and a few more capabilities must be added before a wholesale replacement of run_WE2E_tests.sh:

- The options `cron_relaunch_intvl_mnts`, `generate_csv_file`, and `opsroot` are not yet implemented
- The use of `date` in DATE_FIRST_CYCL and DATE_LAST_CYCL is not yet implemented

Type of change
TESTS CONDUCTED:
DEPENDENCIES:
None
DOCUMENTATION:
Will be contributed with later PR deprecating old system. This is a preliminary implementation for wider exposure and feedback.
ISSUE:
CHECKLIST