
[develop] Replace shell-based WE2E scripts with python versions #637

Merged

Conversation

mkavulich
Collaborator

@mkavulich mkavulich commented Mar 6, 2023

DESCRIPTION OF CHANGES:

This PR improves on the new ./run_WE2E_tests.py script (introduced in #558), implementing all the features present in the previous shell-based workflow. Some new files are also introduced for better organization and additional functionality:

  • tests/WE2E/utils.py A collection of functions used by the other scripts, kept here to avoid circular dependencies.
  • tests/WE2E/WE2E_summary.py Given an experiment directory or .yaml file, prints a summary of each experiment to the screen, including its status and the number of core hours used. It also writes a summary file with detailed information about each task for each experiment.
  • tests/WE2E/print_test_info.py Prints a file WE2E_test_info.txt, very similar to the legacy WE2E_test_info.csv with just a few minor format differences.
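For a back-of-the-envelope view of the core-hour accounting that WE2E_summary.py is described as reporting, here is a minimal sketch; the (cores, walltime) pair representation is purely illustrative and is not the script's actual data model:

```python
def core_hours(tasks):
    """Sum core hours over (cores, walltime_seconds) pairs.

    A stand-in for the per-experiment accounting described above; the
    real script derives its values from the rocoto database files.
    """
    return sum(cores * secs / 3600.0 for cores, secs in tasks)
```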

All of these scripts can be run with the -h argument to print information about their available options (except utils.py, which is not designed to be run stand-alone).

With this PR, the old superseded shell-based tools are removed. This will require action by anyone whose workflow relies on those tools. I will reach out to those that I know have automated testing based on these tools, but please let me know if any other capabilities or modifications are needed to keep existing workflows working.

New options

  • --procs=## For monitoring and submitting jobs, this will call rocotorun and read rocoto database files in parallel with the specified number of parallel tasks. This can greatly speed up the submission of large test suites, such as the comprehensive tests. Only use this option if you are sure you have access to the specified number of parallel cores; Python will behave badly if you oversubscribe your available resources.
  • --opsroot= If the test is for NCO mode, sets OPSROOT
  • --print_test_info Create a WE2E_test_info.txt file summarizing each test prior to starting the experiments (False by default)

Other updates

  • The -d flag (debug mode) now captures output from rocotorun in log files, such as job cards and job submission messages. This is not enabled by default because it is very slow.
  • Job monitoring can now be paused with ctrl-c, and a message on how to resume testing is printed to the screen
  • All auto-generated experiment monitoring files now use the naming convention WE2E_tests_DATETIME.yaml
  • functions check_task_get_extrn_ics() and check_task_get_extrn_lbcs() are consolidated into a single check_task_get_extrn_bcs() function
  • If TEST_VX_FCST_INPUT_BASEDIR is not set for this platform, do not exit with an error; instead, set it to an empty string
  • Added more detailed checks to ensure that an experiment is actually finished before script exits
  • Fixed broken capabilities (calculating "relative cost" and number of forecasts) in WE2E_test_info.txt file
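The monitoring-file naming convention above can be sketched as follows; the exact timestamp format string is an assumption for illustration, since the PR only specifies the WE2E_tests_DATETIME.yaml pattern:

```python
from datetime import datetime

def monitor_filename(now=None):
    # Hypothetical format string: the PR describes the pattern
    # WE2E_tests_DATETIME.yaml, not the exact timestamp layout.
    now = now or datetime.now()
    return f"WE2E_tests_{now.strftime('%Y%m%d%H%M%S')}.yaml"
```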

Type of change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • This change requires a documentation update

TESTS CONDUCTED:

I have tested on all platforms I have access to; I would appreciate help testing on other platforms.

  • hera.intel
    • Ran three identical suites (from tests/WE2E/machine_suites/fundamental) with the old run_WE2E_tests.sh, and with run_WE2E_tests.py in the default configuration, as well as using the --use_cron_to_relaunch option. All tests produced identical output.
    • Ran all test suite options (fundamental, comprehensive, and all), as well as individual tests and input files with test names, to ensure expected behavior for all -t options
    • Quit script using ctrl-c at various points during the execution to ensure expected behavior
    • Ran WE2E_summary.py on several previous experiment directories to ensure expected output
  • orion.intel
    • Ran fundamental tests
  • cheyenne.intel
    • Ran fundamental tests
  • cheyenne.gnu
    • Ran fundamental tests
  • gaea.intel
  • jet.intel
    • Ran all tests
  • wcoss2.intel
  • NOAA Cloud (indicate which platform)
  • Jenkins

DOCUMENTATION:

Documentation is updated for relevant WE2E capabilities that have been changed/added. A lot of documentation related to individual tests and suites is still outdated, but this will be updated at a future date with changes related to #587.

This version of the docs can be viewed here; changes were made to chapters 10 and 12: https://ufs-srweather-app-mkavulich.readthedocs.io/en/latest/index.html

ISSUE:

CHECKLIST

  • My code follows the style guidelines in the Contributor's Guide
  • I have performed a self-review of my own code using the Code Reviewer's Guide
  • I have commented my code, particularly in hard-to-understand areas
  • My changes need updates to the documentation. I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • New and existing tests pass with my changes

CONTRIBUTORS:

@christinaholtNOAA Contributed a better way of setting relative dates (as used in test get_from_NOMADS_ics_FV3GFS_lbcs_FV3GFS)

mkavulich added 30 commits March 6, 2023 05:35
…to a single function: check_task_get_extrn_bcs
…e hour calculation for missing NNODES variables, clean up formatting
 - Remove another duplicate routine
 - Rename "job_summary" to "WE2E_summary"
 - Rename various auto-created yaml files to WE2E_tests_{TIME}.yaml for
consistency
 - Write "summary_{TIME}.txt" containing full report on WE2E tests
 --opsroot, allows user to set NCO OPSROOT variable
 --print_test_details, allows user to print a pipe-delimited text file
(test_details.txt) analogous to previous WE2E_test_info.csv file
 - Remove incorrectly left-in commented code
 - Add documentation for get_or_print_blank()
 - Some suggested changes from pylint
 - Pass "refresh" flag correctly in parallel mode: only for first pass
through of tests list
 - Move second "rocotorun" call to immediately after for better chance
of creating rocoto db file prior to attempting to read
 - Only mark experiment in "error" state if it was not created after the
second pass through
 - Print warning message for the case where jobs are continuously not
submitted, giving users info in case they mis-configured their
experiment
…jobs, fix incorrect variable in error message
…, UNKNOWN jobs will be retried, so we don't want to mark these as ERROR
 - Set VX_FCST_INPUT_BASEDIR to null string if not set on platform
   (rather than failing)
 - Print message about updating experiment on first go around; this
   allows user to see progress with parallel option
 - Only call compare_rocotostat() if job may be finished or stuck
 - Add stand-alone script for calling function without running full
test suite
 - Fix "calculate_cost" function and use it to calculate relative
test cost as in previous implementation
 - Add entry for number of forecasts in output
filenames. Also, fix link detection problem, and use default filename
when called from run script
will greatly increase the speed of runs on most platforms, but will have
the downside of not capturing all rocotorun output in log files,
including job cards and other messages.
@danielabdi-noaa
Collaborator

@mkavulich If I recall correctly, this line was added so that the setup_WE2E_tests.sh script gets a clean environment similar to the one you would get with slurm's --export=NONE. Without it, the setup script failed when you executed it from a shell where the conda environment was activated, i.e.

(regional_workflow) $ ./setup_WE2E_tests.sh

vs

$ ./setup_WE2E_tests.sh

The first one was not working because conda was activated twice, inside and outside the setup script; paths were mixed up and it could not find jinja2 or yaml, IIRC. I cannot find the chat we had with @MichaelLueken, but there was definitely an issue back then. Things may have changed since, with upgrades to modulefiles, so you can try to remove it if you can't find another solution. Personally I like it, since it makes sure the script behaves the same whether or not conda is activated outside the script.

…n the user's environment, so we have to pass it as an argument to setup_WE2E_tests.sh to be exported prior to running the WE2E test script
@mkavulich
Collaborator Author

@danielabdi-noaa Thanks for the info! With that in mind, I was able to implement a fix that still keeps a clean environment, just passing $HOME as an argument to be exported later. This has allowed the Jenkins script to run successfully on Jet!
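The fix described here can be sketched like this; the echo body is a placeholder standing in for the setup script, illustrating only the pattern of passing HOME as an argument through a clean (env -i) environment:

```shell
# env -i strips the caller's environment (similar to slurm's --export=NONE),
# so values like HOME must be passed in explicitly as arguments and
# re-exported inside the clean shell.
demo_home="$HOME"
env -i bash -c 'export HOME="$1"; echo "HOME inside clean env: $HOME"' _ "$demo_home"
```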

@MichaelLueken You can try the script again now and it should complete successfully. Just note that if you run on Hera you may see random failures; the reason for this is due to an NCO mode problem described in an issue I just opened (#652)

@MichaelLueken
Collaborator

@mkavulich and @danielabdi-noaa Thank you both very much for identifying the cause of the issue and correcting it! The manual run of the Jenkins test script on Orion now completes successfully.

Collaborator

@MichaelLueken MichaelLueken left a comment

@mkavulich Manually running run_WE2E_tests.py, manually running the Jenkins test script - .cicd/scripts/srw_test.sh, and relaunching tests using monitor_jobs.py were all successful. Approving this work now.

@MichaelLueken MichaelLueken added the run_we2e_coverage_tests Run the coverage set of SRW end-to-end tests label Mar 8, 2023
@MichaelLueken
Collaborator

Please note that the Jenkins failure is due to issue #635. The Jenkins test is failing for Orion. I have manually submitted the tests on Orion and they have successfully passed, as noted above.

@MichaelLueken
Collaborator

@mkavulich Additional issues have been encountered with the Jenkins tests.

During the tarring of the log files at the end of the tests, the following is being encountered:

*/log.launch_FV3LAM_wflow: Cannot stat: No such file or directory

This is causing the testing phase to fail, since this file is required in the tarring step.

The Jenkinsfile includes the following line:

sh 'cd "${SRW_WE2E_EXPERIMENT_BASE_DIR}" && tar --create --gzip --verbose --dereference --file "${WORKSPACE}/we2e_test_logs-${SRW_PLATFORM}-${SRW_COMPILER}.tgz" */log.generate_FV3LAM_wflow */log.launch_FV3LAM_wflow */log/*'

Since log.launch_FV3LAM_wflow is no longer part of the workflow, this entry on line 180 of the .cicd/Jenkinsfile will need to be removed.
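Dropping that entry, the tar command inside the Jenkinsfile's sh step would look something like the fragment below; this is a sketch of the proposed change (the variables are supplied by the CI environment), not necessarily the final Jenkinsfile content:

```shell
cd "${SRW_WE2E_EXPERIMENT_BASE_DIR}" && \
tar --create --gzip --verbose --dereference \
    --file "${WORKSPACE}/we2e_test_logs-${SRW_PLATFORM}-${SRW_COMPILER}.tgz" \
    */log.generate_FV3LAM_wflow */log/*
```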

@mkavulich
Collaborator Author

@MichaelLueken thanks for the info, I removed the now-non-existent log file from the archive, and also added new log and status files to the archive. Let me know if this still does not work.

fi
./setup_WE2E_tests.sh ${platform} ${SRW_PROJECT} ${SRW_COMPILER} ${test_type} \
--expt_basedir=${we2e_experiment_base_dir} \
--opsroot=${nco_dir} | tee ${progress_file}

Collaborator

@danielabdi-noaa danielabdi-noaa Mar 9, 2023

I don't know how the new run_WE2E_tests.py works, but I just wanted to make sure the various delays are taken care of:

  • initial delay of 300 sec
  • polling frequency of 60 sec
  • after-completion delay of 600 sec. This one is to make sure Jenkins does not delete the experiment directory prematurely.

Collaborator Author

Unless I'm missing something, none of those delays should be relevant with the new script.

  1. The ./run_WE2E_tests.py script takes care of both the setting up and monitoring of tests, so no initial delay is needed
  2. The new script serially cycles through active experiments with no delay
  3. The new script will not finish until all tasks have been confirmed complete.

One thing that may need to change: for larger test suites (for example, if the comprehensive tests are re-enabled), it may be worth using the new -p parallel option. Because the Jenkins tests call the script with -d, calls to rocotorun are fairly slow, taking ~30 seconds each (and there are two calls to rocotorun per test check). But this should be done cautiously depending on the platform: too many parallel tasks may overload the login node if that's where this script is running.
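Illustratively, that parallel checking could look like the sketch below; this is not the PR's actual code, and update_experiment is a stand-in for the per-experiment rocotorun call and database read:

```python
from concurrent.futures import ThreadPoolExecutor

def update_experiment(expt_dir):
    # Stand-in for calling rocotorun and reading the rocoto database.
    return f"checked {expt_dir}"

def poll_all(expt_dirs, procs=4):
    # Threads are a reasonable fit here since the work is dominated by
    # waiting on the external rocotorun process, not Python computation.
    with ThreadPoolExecutor(max_workers=procs) as pool:
        return list(pool.map(update_experiment, expt_dirs))
```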

Collaborator

@danielabdi-noaa danielabdi-noaa Mar 10, 2023

For the third one, note that the existing ./run_WE2E_tests.sh also does not return until all tasks are complete, so if the delay was needed for the shell script it may also be needed for the python script. I vaguely recall that the server-side Jenkins code by @jessemcfarland deletes experiment directories as soon as srw_tests.sh finishes. This could happen before all cron jobs are removed, since they need to be invoked one last time before they are removed from the crontab.

The first one, the initial delay, is required to give enough time for at least one workflow to be launched. As long as run_WE2E_tests.py does not exit thinking all jobs have completed when in reality none were launched after, say, 4 minutes, it should be fine.

The second one is OK, I think, since it looks like the behavior of checking experiment status has changed. The old get_expt_status.sh was time-consuming, so it checked the status of experiments every minute and printed to the screen.

Collaborator Author

note that the existing ./run_WE2E_tests.sh also does not return until all tasks are complete

The launch_FV3LAM_wflow.sh script tried to implement this but it was done in quite a hacky way, so there were instances where the script would assume the experiment was done when there were still jobs running (for example, if there was a failed task but other tasks were still running). The new implementation does not use that script (does not use cron jobs either) and so does not suffer from that problem. It will only exit if it has verified that all jobs are done, via reading the rocoto database file directly.

As long as run_WE2E_tests.py does not exit thinking all jobs have completed when in reality none were launched after, say, 4 minutes, it should be fine.

This is correct: run_WE2E_tests.py both submits and monitors jobs, so unless something goes horribly wrong there is no possibility of a "desync" there.

I won't promise that I have anticipated all potential problems, but if any of the problems you mention (or similar ones) happen in the future, I can fix them in the run_WE2E_tests.py script itself; we shouldn't rely on hand-wavy delays to avoid problems anyway.

Collaborator

OK, I did not realize cron is no longer being used. I agree the initial/final delays are not required in this case.

Collaborator

@christinaholtNOAA christinaholtNOAA left a comment

Sorry for the delayed review. I left some comments/questions below.

@gsketefian
Collaborator

@mkavulich I looked through the changes and they seem fine. I have a few questions that I could find the answers to by carefully looking at the code, but I figure just asking is much faster:

  1. With the current version of run_WE2E_tests.py and accompanying scripts, I've noticed that while the script is re-submitting jobs, if I go into an experiment directory and call rocotorun or launch_FV3LAM_wflow.sh, I get the following error:
$ ./launch_FV3LAM_wflow.sh 

Modules based on Lua: Version 8.5.2  2021-05-12 12:44 -05:00
    by Robert McLay mclay@tacc.utexas.edu

Currently Loaded Modules:
  1) rocoto/1.3.3   2) miniconda3/4.12.0   3) wflow_hera

FATAL ERROR: 
ERROR:
  From script:  "launch_FV3LAM_wflow.sh"
  Full path to script:  "/scratch2/BMC/det/Gerard.Ketefian/UFS_CAM/DTC_ensemble_task_pythonize/ufs-srweather-app/ush/launch_FV3LAM_wflow.sh"
Call to "rocotorun" failed with return code 1.
Exiting with nonzero status.

This doesn't always happen, and it may depend on the task the experiment is currently on. Have you encountered this, and do you know if the changes in this PR get around this?
  2. Is the launch_FV3LAM_wflow.sh script being kept, and if not, is there a replacement for it that can be called from within an experiment directory (maybe via a symlink) that will relaunch the workflow? This is handy when I'm debugging a single task and want to relaunch it over and over again and get its status.
  3. Is WE2E_summary.py the replacement for get_expts_status.sh?
Thanks.

@mkavulich
Collaborator Author

@gsketefian Thanks for your comments and questions.

  1. This is a limitation of the launch_FV3LAM_wflow.sh script and how it interacts with rocoto. Rocoto locks the .db file while rocotorun is being run, so if you try to run two rocotorun processes simultaneously, the second will fail with the message WARNING: Workflow is locked by pid xxxxxx on host xx.xxx.xx.xx and exit code 1. So if the ./run_WE2E_tests.py or ./monitor_jobs.py scripts are using rocotorun while you run that script, it will fail. The same thing would happen if you tried to run launch_FV3LAM_wflow twice simultaneously. Is there a use case I'm missing that would require running launch_FV3LAM_wflow while the monitor_jobs function is actively monitoring/submitting jobs?
  2. launch_FV3LAM_wflow.sh is not being removed with this PR (I was only addressing the WE2E tests in this PR, and I think this script has uses outside of that testing framework), but if you want to get the status of a job WE2E_summary.py will give you the summary of the currently running jobs (you can give it a yaml file or a top-level experiment directory as an argument).
    If you need the ability to see the status of an active experiment while you're in that directory, is there a reason that simply calling rocotostat doesn't give you what you need?
  3. WE2E_summary.py is not a 1-to-1 replacement for get_expts_status.sh, but it should give the same relevant information. It does not have the num_log_lines argument (because it reads rocoto .db files directly rather than relying on parsing a log file), and it does not have the launch_wflows argument (WE2E_summary.py does not advance the workflow with rocotorun; it only reads and reports the current status).
    If the ability to manually advance the workflow one step at a time is needed, it can be added. But I was not anticipating that use being necessary.

@gsketefian
Collaborator

@mkavulich:

  1. A use case is when you launch a bunch of tests with run_WE2E_tests.py, and say one of them starts to fail (some tasks report back as failed), and you want to stop resubmitting tasks for that experiment and go to its directory and run one task at a time to debug (while allowing the other experiment to continue). Do run_WE2E_tests.py and monitor_jobs.py lock the .db files of all experiments while they're running? Or do they read it, free it for a bit, then read it again, etc? I have a feeling it's the latter since I don't always get that error.
  2. As I step through the tasks, I like to get the most updated status of the experiments, so I like to run launch_FV3LAM_wflow.sh because it calls rocotorun before rocotostat and stores the results in a log file, which is helpful if I want to see the progress (e.g. what tasks were called from one re-launch to the next). I realize these can all be done from the command line, but it's nice to have one script do it all. As long as launch_FV3LAM_wflow.sh is not being removed, it should be fine.
  3. Sounds like WE2E_summary.py will work for my purposes.

Currently, when I want to step through an experiment one task at a time, I first create it with run_WE2E_tests.py, then Ctrl-C out of it once it's done creating the experiment directories, then go to the experiment directory and use launch_FV3LAM_wflow.sh to go one or two tasks at a time while I debug the XML, the ex-script, the J-job, the METplus conf file, etc. I'll just keep using that approach.
Thanks!

@mkavulich
Collaborator Author

@gsketefian To answer your first question, the new scripts cycle through each experiment serially, running rocotorun on one experiment at a time, with a 5 second delay between loops. It is only while rocotorun is being called that the rocoto database is locked.

As a side note, if one experiment fails, then the monitor_jobs function will not continue running rocotorun in that directory. So if that is your use case you shouldn't see conflicts when using the launch_FV3LAM_wflow.sh script unless something weird is happening.
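The serial cycle described above can be sketched as a simple loop; the function names and the drop-on-finish behavior here are stand-ins for illustration, not the PR's actual code:

```python
import time

def monitor(experiments, update, delay=5):
    # `update` is a stand-in for the per-experiment rocotorun call plus
    # database read; it returns True once that experiment is finished
    # (or has failed), after which the experiment is dropped from
    # subsequent passes, mirroring the behavior described above.
    while experiments:
        experiments = [e for e in experiments if not update(e)]
        if experiments:
            time.sleep(delay)
```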

 - Correct import location for print_WE2E_summary
 - Use os.path.join() for path strings
 - Correct script name
 - Set global variables for column width in job summary

Also a bug fix for cases where variable definitions file doesn't exist
(can occur if experiment is moved or re-created after yaml generation)
@mkavulich
Collaborator Author

@christinaholtNOAA I believe I have addressed all your comments. Let me know if you have anything further.

Collaborator

@christinaholtNOAA christinaholtNOAA left a comment

LGTM. Just one comment below.

Collaborator

@christinaholtNOAA christinaholtNOAA left a comment

LGTM.

@MichaelLueken
Collaborator

@mkavulich All Jenkins tests passed successfully. The Orion tests sat in the queue for over eleven hours, leading to a time-out for the Build and Test check. Will now move forward with merging this PR.

@MichaelLueken MichaelLueken merged commit d917ace into ufs-community:develop Mar 20, 2023