[develop] Replace shell-based WE2E scripts with python versions #637
…to a single function: check_task_get_extrn_bcs
…monitor to report core hours
…e hour calculation for missing NNODES variables, clean up formatting
- Remove another duplicate routine - Rename "job_summary" to "WE2E_summary" - Rename various auto-created yaml files to WE2E_tests_{TIME}.yaml for consistency - Write "summary_{TIME}.txt" containing full report on WE2E tests
--opsroot, allows user to set NCO OPSROOT variable --print_test_details, allows user to print a pipe-delimited text file (test_details.txt) analogous to previous WE2E_test_info.csv file
- Remove incorrectly left-in commented code - Add documentation for get_or_print_blank() - Some suggested changes from pylint
…te`, courtesy of Christina Holt
- Pass "refresh" flag correctly in parallel mode: only for first pass through of tests list - Move second "rocotorun" call to immediately after for better chance of creating rocoto db file prior to attempting to read - Only mark experiment in "error" state if it was not created after the second pass through - Print warning message for the case where jobs are continuously not submitted, giving users info in case they mis-configured their experiment
…mitted tasks remaining
…jobs, fix incorrect variable in error message
…, UNKNOWN jobs will be retried, so we don't want to mark these as ERROR
- Set VX_FCST_INPUT_BASEDIR to null string if not set on platform (rather than failing) - Print message about updating experiment on first go around; this allows user to see progress with parallel option - Only call compare_rocotostat() if job may be finished or stuck
- Add stand-alone script for calling function without running full test suite - Fix "calculate_cost" function and use it to calculate relative test cost as in previous implementation - Add entry for number of forecasts in output
filenames. Also, fix link detection problem, and use default filename when called from run script
will greatly increase the speed of runs on most platforms, but will have the downside of not capturing all rocotorun output in log files, including job cards and other messages.
@mkavulich If I recall correctly, this line is added so that the
vs
The first one was not working because you have conda activated twice, within and outside the setup script and paths were mixed up and it was not able to find |
…n the user's environment, so we have to pass it as an argument to setup_WE2E_tests.sh to be exported prior to running the WE2E test script
@danielabdi-noaa Thanks for the info! With that in mind, I was able to implement a fix that still keeps a clean environment, just passing $HOME as an argument to be exported later. This has allowed the Jenkins script to run successfully on Jet! @MichaelLueken You can try the script again now and it should complete successfully. Just note that if you run on Hera you may see random failures; the reason for this is due to an NCO mode problem described in an issue I just opened (#652) |
@mkavulich and @danielabdi-noaa Thank you both very much for identifying the cause of the issue and correcting it! The manual run of the Jenkins test script I ran on Orion successfully runs now. |
@mkavulich Manually running run_WE2E_tests.py, manually running the Jenkins test script (.cicd/scripts/srw_test.sh), and relaunching tests using monitor_jobs.py were all successful. Approving this work now.
Please note that the Jenkins failure is due to issue #635. The Jenkins test is failing for Orion. I have manually submitted the tests on Orion and they have successfully passed, as noted above. |
@mkavulich Additional issues have been encountered with the Jenkins tests. During the tarring of the log files at the end of the tests, the following is being encountered:
This is causing the testing phase to fail, since this file is required in the tarring step. The Jenkinsfile includes the following line: ufs-srweather-app/.cicd/Jenkinsfile Line 180 in ff6f103
Since |
@MichaelLueken thanks for the info, I removed the now-non-existent log file from the archive, and also added new log and status files to the archive. Let me know if this still does not work. |
fi
./setup_WE2E_tests.sh ${platform} ${SRW_PROJECT} ${SRW_COMPILER} ${test_type} \
    --expt_basedir=${we2e_experiment_base_dir} \
    --opsroot=${nco_dir} | tee ${progress_file}
I don't know how the new run_WE2E works, but I just wanted to make sure the various delays are taken care of:
- initial delay of 300 sec
- polling frequency of 60 sec
- after completion delay of 600 sec. This one is to make sure jenkins does not delete experiment directory prematurely.
Unless I'm missing something, none of those delays should be relevant with the new script.
- The ./run_WE2E_tests.py script takes care of both setting up and monitoring tests, so no initial delay is needed
- The new script serially cycles through active experiments with no delay
- The new script will not finish until all tasks have been confirmed complete.
One thing that may need to change is that for larger tests (for example, if the comprehensive tests are re-enabled), it may be worth using the new -p parallel option. Because the Jenkins tests call the script with -d, calls to rocotorun are fairly slow, taking ~30 seconds each (and there are two calls to rocotorun per test check). But this should be done cautiously depending on the platform: too many parallel tasks may overload the login node if that's where this script is running.
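The trade-off described above (parallel status checks versus load on the login node) can be sketched as follows. This is only an illustration, not the code in run_WE2E_tests.py: check_experiment and update_all are hypothetical names, the per-experiment work (calling rocotorun and reading the database) is replaced by a stub, and threads are used here purely as an assumption about how the work could be spread out.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def check_experiment(expt_dir):
    """Hypothetical stand-in for one status check (rocotorun + database read)."""
    return expt_dir, "CHECKED"


def update_all(expt_dirs, procs=1):
    """Check every experiment, using at most `procs` workers.

    The worker count is capped at the number of CPUs visible to Python and at
    the number of experiments, so an overly large request cannot
    over-subscribe the node.
    """
    workers = max(1, min(procs, os.cpu_count() or 1, len(expt_dirs) or 1))
    if workers == 1:
        # Serial path: identical results, no pool overhead.
        return dict(check_experiment(d) for d in expt_dirs)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(pool.map(check_experiment, expt_dirs))
```

Capping the worker count inside the function is one way to make a mis-set parallel option harmless; whether that is appropriate depends on the platform's login-node policy.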
For the third one, note that the existing ./run_WE2E_Tests.sh also does not return until all tasks are complete, so if it was needed for the shell script it may also be needed for the python script. I vaguely recall that server-side Jenkins code by @jessemcfarland deletes experiment directories as soon as srw_test.sh finishes. This could happen before all cron jobs are removed, since they need to be called one last time for them to be removed from the crontab.
The first one, the initial delay, is required to give enough time for at least one workflow to be launched. If run_WE2E_tests.py does not exit thinking all jobs completed, when in reality no jobs were launched after say 4 mins, then it should be fine.
The second one is OK, I think, since it looks like the behaviour of checking experiment status changed. The old get_expt_status.sh was time-consuming, so the status of experiments was checked every minute and printed to the screen.
"note that the existing ./run_WE2E_Tests.sh also does not return until all tasks are complete"
The launch_FV3LAM_wflow.sh script tried to implement this but it was done in quite a hacky way, so there were instances where the script would assume the experiment was done when there were still jobs running (for example, if there was a failed task but other tasks were still running). The new implementation does not use that script (does not use cron jobs either) and so does not suffer from that problem. It will only exit if it has verified that all jobs are done, by reading the rocoto database file directly.
"If run_WE2E_tests.py does not exit thinking all jobs completed, when in reality no jobs were launched after say 4 mins, then it should be fine."
This is correct: run_WE2E_tests.py both submits and monitors jobs, so unless something goes horribly wrong there is no possibility of "desync" there.
I won't promise that I have anticipated all potential problems, but if any of the problems you mention (or similar ones) happen in the future, I can fix them in the run_WE2E_tests.py script itself; we shouldn't rely on hand-wavy delays to avoid problems anyway.
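The completion check described in this reply (exit only once the rocoto database shows every job finished) might look roughly like the sketch below. It assumes the rocoto database is an SQLite file with a jobs table holding taskname, cycle, and state columns, and that SUCCEEDED and DEAD are the only terminal states. The schema, the state names, and the function names are assumptions for illustration and should be checked against a real database file.

```python
import sqlite3

# Assumed terminal states; anything else is treated as still in progress.
TERMINAL_STATES = {"SUCCEEDED", "DEAD"}


def all_jobs_done(conn):
    """Return True only if at least one job exists and all are terminal."""
    rows = conn.execute("SELECT taskname, cycle, state FROM jobs").fetchall()
    if not rows:
        # An empty table means nothing has been submitted yet, not "finished".
        return False
    return all(state in TERMINAL_STATES for _, _, state in rows)


def demo_connection(states):
    """Build an in-memory database with the assumed schema, for demonstration."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE jobs (taskname TEXT, cycle INTEGER, state TEXT)")
    conn.executemany(
        "INSERT INTO jobs VALUES (?, ?, ?)",
        [(f"task{i}", 0, s) for i, s in enumerate(states)],
    )
    return conn
```

With this logic, an experiment that has one dead task and one running task is reported as not done, and an experiment with no submitted jobs is never mistaken for a finished one. A real check would also need to confirm that every task defined in the workflow has an entry, since tasks that have not been submitted yet would not appear in such a table.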
Ok I did not realize cron is no longer being used. I agree the initial/final delays are not required in this case.
Sorry for the delayed review. I left some comments/questions below.
@mkavulich I looked through the changes and they seem fine. I have a few questions that I could find the answers to by carefully looking at the code, but I figure just asking is much faster:
This doesn't always happen, and it may depend on the task the experiment is currently on. Have you encountered this, and do you know if the changes in this PR get around this? |
@gsketefian Thanks for your comments and questions.
Currently, if I want to step one task at a time with an experiment is to first create it with |
@gsketefian To answer your first question, the new scripts cycle through each experiment serially, running rocotorun on one experiment at a time, with a 5 second delay between loops. It is only while rocotorun is being called that the rocoto database is locked. As a side note, if one experiment fails, then the monitor_jobs function will not continue running rocotorun in that directory. So if that is your use case you shouldn't see conflicts when using the launch_FV3LAM_wflow.sh script unless something weird is happening. |
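The serial loop described in this answer can be sketched as below. It is a simplified model, not the monitor_jobs code itself: the status names (CREATED, RUNNING, COMPLETE, ERROR), the monitor function, and the update callback are hypothetical, and update stands in for whatever runs rocotorun and reads the result for one experiment.

```python
import time


def monitor(experiments, update, delay=5.0, sleep=time.sleep):
    """Poll experiments one at a time until none are still active.

    experiments: dict mapping experiment name -> status string
    update: callable(name) -> new status (hypothetical; would wrap rocotorun)
    delay: pause in seconds between passes through the active experiments
    """
    finished = {"COMPLETE", "ERROR"}
    active = [n for n, s in experiments.items() if s not in finished]
    while active:
        for name in active:
            # Only one experiment is updated at a time, so only its database
            # is locked while the update runs.
            experiments[name] = update(name)
        # Experiments that completed or failed are dropped from later passes.
        active = [n for n in active if experiments[n] not in finished]
        if active:
            sleep(delay)
    return experiments
```

Because a failed experiment leaves the active list, nothing in this loop touches its directory again, which is why a manual relaunch there should not collide with the monitor.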
- Correct import location for print_WE2E_summary - Use os.path.join() for path strings - Correct script name - Set global variables for column width in job summary Also a bug fix for cases where variable definitions file doesn't exist (can occur if experiment is moved or re-created after yaml generation)
This reverts commit b88c4a7.
@christinaholtNOAA I believe I have addressed all your comments. Let me know if you have anything further. |
LGTM. Just one comment below.
This reverts commit 8f5525d.
LGTM.
@mkavulich All Jenkins tests successfully passed. The Orion tests ultimately sat in queue for several hours (over eleven hours), leading to a time-out for the Build and Test check. Will now move forward with merging this PR. |
DESCRIPTION OF CHANGES:
This PR improves on the new ./run_WE2E_tests.py script (introduced in #558), implementing all the features present in the previous shell-based workflow. Some new files are also introduced for better organization and additional functionality:
- tests/WE2E/utils.py: This is a collection of functions used by other scripts, contained here to avoid circular dependencies.
- tests/WE2E/WE2E_summary.py: Given an experiment directory or .yaml file, outputs a summary to screen of each experiment, its status, and the number of core hours used. It also prints a summary file with detailed information about each task for each experiment.
- tests/WE2E/print_test_info.py: Will print a file WE2E_test_info.txt, very similar to the legacy WE2E_test_info.csv with just a few minor format differences.
Any scripts can be run with the -h argument to print information about all available options (not including utils.py, which is not designed to be run stand-alone).
With this PR, the old superseded shell-based tools are removed. This will require action by anyone whose workflow relies on those tools. I will reach out to those that I know have automated testing based on these tools, but please let me know if any other capabilities or modifications are needed to keep existing workflows working.
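As a rough illustration of the core-hour figure that WE2E_summary.py is described as reporting, the sketch below sums cores multiplied by wall-clock duration over a list of tasks. The record layout and the name core_hours are hypothetical, and it assumes durations are given in seconds; the actual script may compute the value differently.

```python
def core_hours(tasks):
    """Sum core hours over tasks given as (cores, duration_seconds) pairs."""
    return sum(cores * seconds / 3600.0 for cores, seconds in tasks)


# Example: 24 cores for 30 minutes plus 4 cores for 15 minutes.
example = [(24, 1800), (4, 900)]
print(f"{core_hours(example):.2f} core hours")  # prints "13.00 core hours"
```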
New options
- --procs=##: For monitoring and submitting jobs, this will call rocotorun and read rocoto database files in parallel for the specified number of parallel tasks. This can greatly speed up the submission of large test suites, such as the comprehensive tests. Only use this option if you are sure you have access to the specified number of parallel cores; python will behave badly if you try to over-subscribe your available tasks.
- --opsroot=: If test is for NCO mode, sets OPSROOT
- --print_test_info: Create a WE2E_test_info.txt file summarizing each test prior to starting experiment (False by default)
Other updates
- The -d flag (debug mode) now captures output from rocotorun in log files, such as job cards and job submission messages. This is not enabled by default because it is very slow.
- …ctrl-c, and a message on how to resume testing is printed to screen
- …WE2E_tests_DATETIME.yaml
- check_task_get_extrn_ics() and check_task_get_extrn_lbcs() are consolidated into a single check_task_get_extrn_bcs() function
- If TEST_VX_FCST_INPUT_BASEDIR is not set for this platform, do not exit with an error, instead set to empty string
- …WE2E_test_info.txt file
Type of change
TESTS CONDUCTED:
I have tested on all platforms I have access to; I would appreciate help testing on other platforms.
- …tests/WE2E/machine_suites/fundamental) with the old run_WE2E_tests.sh, and with run_WE2E_tests.py in the default configuration, as well as using the --use_cron_to_relaunch option. All tests produced identical output.
- …fundamental, comprehensive, and all), as well as individual tests and input files with test names to ensure expected behavior for all -t options
- …ctrl-c at various points during the execution to ensure expected behavior
- …WE2E_summary.py on several previous experiment directories to ensure expected output
DOCUMENTATION:
Documentation is updated for relevant WE2E capabilities that have been changed/added. A lot of documentation related to individual tests and suites is still outdated, but this will be updated at a future date with changes related to #587.
This version of the docs can be viewed here; changes were made to chapters 10 and 12: https://ufs-srweather-app-mkavulich.readthedocs.io/en/latest/index.html
ISSUE:
- …run_srw_tests.py script with run_WE2E_tests.py and monitor_jobs.py #586
- … ./run_WE2E_tests.py to avoid use of crontab; to run faster it can be run on a compute node with the -p option.
- run_WE2E_tests.sh does not run the intended tests on Hera if COMPILER is not explicitly set #571
- run_WE2E_tests.sh silently discards invalid test options #568
CHECKLIST
CONTRIBUTORS:
@christinaholtNOAA Contributed a better way of setting relative dates (as used in test get_from_NOMADS_ics_FV3GFS_lbcs_FV3GFS)