[develop] Replace shell-based WE2E scripts with python versions #637

Merged
52 commits
a51c83d
Fix error introduced to pythonized WE2E script
mkavulich Feb 21, 2023
7e393a0
Fix failing "specify_template_filenames" test
mkavulich Feb 22, 2023
a63286d
Add logic and instructions for neatly interrupting and resuming the m…
mkavulich Feb 22, 2023
5330ecf
Consolidate check_task_get_extrn_ics and check_task_get_extrn_lbcs in…
mkavulich Feb 23, 2023
e27dacc
Remove some unnecessary debug prints
mkavulich Feb 23, 2023
6dcc7c2
Initial version of job summary script
mkavulich Feb 23, 2023
b5fb16c
Update experiment yamls to track task cores and walltime, update job_…
mkavulich Feb 23, 2023
e653076
Print summary once all jobs are finished monitoring, add failsafe cor…
mkavulich Feb 23, 2023
da15a91
Move some functions into a new utils.py file to avoid circular depend…
mkavulich Feb 23, 2023
3799562
Send rocotorun output to logging.debug, append logs rather than overw…
mkavulich Feb 24, 2023
17bf55d
Add ability to update experiments in parallel
mkavulich Feb 24, 2023
cabea58
Remove duplicate routine, add totals row to job summary
mkavulich Feb 24, 2023
523c349
More improvements
mkavulich Feb 25, 2023
170a732
Add final missing options:
mkavulich Feb 27, 2023
90ab3b9
- Remove incorrectly left-in exit call
mkavulich Mar 1, 2023
e14bc5b
New way of specifying relative time tests without system calls to `da…
mkavulich Mar 1, 2023
fe2d763
Some needed changes to address problems with parallel mode
mkavulich Mar 1, 2023
d5abe53
Add a final check using rocotostat to ensure that there are no un-sub…
mkavulich Mar 2, 2023
6281c40
Fix logic for jobs in DEAD or UNKNOWN status, add logic for "FAILED" …
mkavulich Mar 2, 2023
dd02d0a
Remove UNKNOWN from the status check; according to rocoto source code…
mkavulich Mar 2, 2023
5185b04
Some more fixes from final testing
mkavulich Mar 2, 2023
dcc8c58
Start updating documentation
mkavulich Mar 2, 2023
ceff2d0
Finish implementation of "print_test_details()"
mkavulich Mar 6, 2023
735f202
For test details, rename everything to be more consistent with old
mkavulich Mar 6, 2023
970e74a
Continue updating documentation through first few sections of WE2E ch…
mkavulich Mar 6, 2023
e75a78b
Some more documentation updates
mkavulich Mar 6, 2023
7ccafde
Update gitignore for new/updated filenames
mkavulich Mar 6, 2023
c98f4fd
Documentation for WE2E_summary.py script
mkavulich Mar 6, 2023
b208a4b
Revert behavior of rocotorun to only capture output if debug=True. This
mkavulich Mar 6, 2023
baa8fcb
Fixes suggested by pylint
mkavulich Mar 6, 2023
b26623f
The big moment: ditching the old shell version.
mkavulich Mar 6, 2023
5621385
Updates to Jenkins test script for new python workflow
mkavulich Mar 6, 2023
7e5436d
Fix unit test for new behavior of calculate_cost.py
mkavulich Mar 6, 2023
1dc1229
Fix unit test for real this time?
mkavulich Mar 6, 2023
399b10d
Don't call rocotorun for WE2E_summary
mkavulich Mar 6, 2023
0bd9c50
Add directory name to test summary
mkavulich Mar 7, 2023
903d881
- More general cleanup, including suggestions from pylint
mkavulich Mar 7, 2023
e7380ca
Fix missed "expt_dict" rename, widen test name column in summary file
mkavulich Mar 7, 2023
18389ca
Add missing column header for test info file
mkavulich Mar 7, 2023
5255a82
Fixes to Jenkins testing scripts from Mike Lueken
mkavulich Mar 7, 2023
35da5be
If database is not loaded, need to return every time, even if a toler…
mkavulich Mar 7, 2023
d484d14
Rocoto requires /home/Michael.Kavulich to be set to a writable path i…
mkavulich Mar 8, 2023
b4f7319
Update archiving of relevant log files in Jenkinsfile
mkavulich Mar 8, 2023
dca8c25
Fixes suggested by Daniel
mkavulich Mar 10, 2023
e2e7a6c
Fix usage instructions for more flexible "tests" argument
mkavulich Mar 10, 2023
effbc01
Address PR comments
mkavulich Mar 15, 2023
099a2d2
A couple bug fixes from latest changes
mkavulich Mar 15, 2023
b88c4a7
Addressing more review comments: Update dicts in place
mkavulich Mar 15, 2023
40e5fb5
Revert "Addressing more review comments: Update dicts in place"
mkavulich Mar 15, 2023
b321371
Final set of review comments
mkavulich Mar 15, 2023
8f5525d
Un-revert intended change to flow of WE2E_summary.py
mkavulich Mar 15, 2023
4ec7dfc
Revert "Un-revert intended change to flow of WE2E_summary.py"
mkavulich Mar 15, 2023
2 changes: 1 addition & 1 deletion .cicd/Jenkinsfile
@@ -177,7 +177,7 @@ pipeline {
post {
always {
// Archive the test log files
-sh 'cd "${SRW_WE2E_EXPERIMENT_BASE_DIR}" && tar --create --gzip --verbose --dereference --file "${WORKSPACE}/we2e_test_logs-${SRW_PLATFORM}-${SRW_COMPILER}.tgz" */log.generate_FV3LAM_wflow */log.launch_FV3LAM_wflow */log/*'
+sh 'cd "${SRW_WE2E_EXPERIMENT_BASE_DIR}" && tar --create --gzip --verbose --dereference --file "${WORKSPACE}/we2e_test_logs-${SRW_PLATFORM}-${SRW_COMPILER}.tgz" */log.generate_FV3LAM_wflow */log/* ${WORKSPACE}/tests/WE2E/WE2E_tests_*yaml ${WORKSPACE}/tests/WE2E/WE2E_summary*txt ${WORKSPACE}/tests/WE2E/log.*'
// Remove the data sets from the experiments directory to conserve disk space
sh 'find "${SRW_WE2E_EXPERIMENT_BASE_DIR}" -regextype posix-extended -regex "^.*(orog|[0-9]{10})$" -type d | xargs rm -rf'
s3Upload consoleLogLevel: 'INFO', dontSetBuildResultOnFailure: false, dontWaitForConcurrentBuildCompletion: false, entries: [[bucket: 'woc-epic-jenkins-artifacts', excludedFile: '', flatten: false, gzipFiles: false, keepForever: false, managedArtifacts: true, noUploadOnFailure: false, selectedRegion: 'us-east-1', showDirectlyInBrowser: false, sourceFile: 'we2e_test_results-*-*.txt', storageClass: 'STANDARD', uploadFromSlave: false, useServerSideEncryption: false], [bucket: 'woc-epic-jenkins-artifacts', excludedFile: '', flatten: false, gzipFiles: false, keepForever: false, managedArtifacts: true, noUploadOnFailure: false, selectedRegion: 'us-east-1', showDirectlyInBrowser: false, sourceFile: 'we2e_test_logs-*-*.tgz', storageClass: 'STANDARD', uploadFromSlave: false, useServerSideEncryption: false]], pluginFailureResultConstraint: 'FAILURE', profileName: 'main', userMetadata: []
51 changes: 4 additions & 47 deletions .cicd/scripts/srw_test.sh
@@ -38,58 +38,15 @@ else
fi

cd ${we2e_test_dir}
./setup_WE2E_tests.sh ${platform} ${SRW_PROJECT} ${SRW_COMPILER} ${test_type} \
expt_basedir=${we2e_experiment_base_dir} \
opsroot=${nco_dir}

# Run the new run_srw_tests script if the machine is Cheyenne.
if [[ "${platform}" = "cheyenne" ]]; then
cd ${workspace}/ush
./run_srw_tests.py -e=${we2e_experiment_base_dir}
cd ${we2e_test_dir}
fi

# Progress file
progress_file="${workspace}/we2e_test_results-${platform}-${SRW_COMPILER}.txt"

# Allow the tests to start before checking for status.
# TODO: Create a parameter that sets the initial start delay.
if [[ "${platform}" != "cheyenne" ]]; then
sleep 300
fi

# Wait for all tests to complete.
while true; do

# Check status of all experiments
./get_expts_status.sh expts_basedir="${we2e_experiment_base_dir}" \
verbose="FALSE" | tee ${progress_file}

# Exit loop only if there are not tests in progress
set +e
grep -q "Workflow status: IN PROGRESS" ${progress_file}
exit_code=$?
set -e

if [[ $exit_code -ne 0 ]]; then
break
fi

# TODO: Create a paremeter that sets the poll frequency.
sleep 60
done

# Allow we2e cron jobs time to complete and clean up themselves
# TODO: Create parameter that sets the interval for the we2e cron jobs; this
# value should be some factor of that interval to ensure the cron jobs execute
# before the workspace is cleaned up.
if [[ "${platform}" != "cheyenne" ]]; then
sleep 600
fi
./setup_WE2E_tests.sh ${platform} ${SRW_PROJECT} ${SRW_COMPILER} ${test_type} \
--expt_basedir=${we2e_experiment_base_dir} \
--opsroot=${nco_dir} | tee ${progress_file}

Collaborator

@danielabdi-noaa danielabdi-noaa Mar 9, 2023

I don't know how the new run_WE2E works, but I just wanted to make sure the various delays are taken care of:

  • an initial delay of 300 sec
  • a polling frequency of 60 sec
  • a post-completion delay of 600 sec, to make sure Jenkins does not delete the experiment directory prematurely

Collaborator Author

Unless I'm missing something, none of those delays should be relevant with the new script.

  1. The ./run_WE2E_tests.py script takes care of both the setting up and monitoring of tests, so no initial delay is needed
  2. The new script serially cycles through active experiments with no delay
  3. The new script will not finish until all tasks have been confirmed complete.

One thing that may need to change: for larger tests (for example, if the comprehensive tests are re-enabled), it may be worth using the new -p parallel option. Because the Jenkins tests call the script with -d, calls to rocotorun are fairly slow, taking ~30 seconds each (and there are two calls to rocotorun per test check). But this should be done cautiously depending on the platform: too many parallel tasks may overload the login node if that's where this script is running.
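
The serial-versus-parallel trade-off described in this comment can be illustrated with a minimal sketch. It is not the code in run_WE2E_tests.py: the names `monitor`, `update`, `procs`, and `DONE_STATES` are hypothetical, and `update` merely stands in for one status check of one experiment (for instance a rocotorun call followed by a status read). With `procs=1` the active experiments are checked one after another with no sleep in between; with `procs>1` a bounded pool limits how many checks run at once, which is the knob to keep small on a shared login node.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict

# Hypothetical terminal states; the real script's state names may differ.
DONE_STATES = {"COMPLETE", "DEAD", "ERROR"}


def monitor(expts: Dict[str, str],
            update: Callable[[str], str],
            procs: int = 1) -> Dict[str, str]:
    """Poll experiments until every one of them reaches a terminal state.

    expts maps experiment name -> last known status.
    update(name) performs one status check and returns the new status.
    """
    status = dict(expts)
    while True:
        active = [name for name, st in status.items() if st not in DONE_STATES]
        if not active:
            return status
        if procs > 1:
            # Bounded pool: at most `procs` status checks run concurrently.
            with ThreadPoolExecutor(max_workers=procs) as pool:
                results = list(pool.map(update, active))
        else:
            # Serial mode: cycle through active experiments with no delay.
            results = [update(name) for name in active]
        status.update(zip(active, results))


if __name__ == "__main__":
    # Fake status source: each experiment completes after a fixed number of checks.
    remaining = {"test_a": 2, "test_b": 1}

    def fake_update(name: str) -> str:
        remaining[name] -= 1
        return "COMPLETE" if remaining[name] <= 0 else "RUNNING"

    print(monitor({"test_a": "CREATED", "test_b": "CREATED"}, fake_update, procs=2))
```

Because the loop only returns once no experiment is active, no initial or final sleep is needed in a design like this; the cost is that each pass takes as long as the slowest status check.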

Collaborator

@danielabdi-noaa danielabdi-noaa Mar 10, 2023

For the third one, note that the existing ./run_WE2E_Tests.sh also does not return until all tasks are complete, so if the delay was needed for the shell script, it may also be needed for the Python script. I vaguely recall that the server-side Jenkins code by @jessemcfarland deletes experiment directories as soon as srw_test.sh finishes. This could happen before all cron jobs are removed, since they need to be called one last time to be removed from the crontab.

The first one, the initial delay, is required to give enough time for at least one workflow to be launched. If run_WE2E_tests.py does not exit thinking all jobs completed, when in reality no jobs were launched after say 4 mins, then it should be fine.

The second one is OK, I think, since it looks like the behaviour of checking experiment status has changed. The old get_expts_status.sh was time-consuming, so it checked the status of experiments every minute and printed it to the screen.

Collaborator Author

> note that the existing ./run_WE2E_Tests.sh also does not return until all tasks are complete

The launch_FV3LAM_wflow.sh script tried to implement this but it was done in quite a hacky way, so there were instances where the script would assume the experiment was done when there were still jobs running (for example, if there was a failed task but other tasks were still running). The new implementation does not use that script (does not use cron jobs either) and so does not suffer from that problem. It will only exit if it has verified that all jobs are done, via reading the rocoto database file directly.

> If run_WE2E_tests.py does not exit thinking all jobs completed, when in reality no jobs were launched after say 4 mins, then it should be fine.

This is correct: run_WE2E_tests.py both submits and monitors jobs, so unless something goes horribly wrong there is no possibility of "desync" there.

I won't promise that I have anticipated all potential problems, but if any of the problems you mention (or similar ones) happen in the future, I can fix them in the run_WE2E_tests.py script itself; we shouldn't rely on hand-wavy delays to avoid problems anyway.
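
To make the "read the database directly" idea in this comment concrete, here is a small sketch of such a completion check. It rests on stated assumptions: the table name `jobs`, its columns, the task names, and the set of terminal states are placeholders chosen for illustration rather than a description of Rocoto's actual schema or of the PR's code. The point it shows is the failure mode mentioned above: one failed job must not count as "experiment done" while another job is still running.

```python
import sqlite3

# Assumed set of terminal job states; placeholders for illustration.
TERMINAL_STATES = {"SUCCEEDED", "DEAD"}


def all_jobs_done(conn: sqlite3.Connection) -> bool:
    """True only if the jobs table is non-empty and every job is terminal."""
    rows = conn.execute("SELECT taskname, cycle, state FROM jobs").fetchall()
    return bool(rows) and all(state in TERMINAL_STATES for _, _, state in rows)


def make_demo_db() -> sqlite3.Connection:
    """Build an in-memory stand-in for a workflow database (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE jobs (taskname TEXT, cycle INTEGER, state TEXT)")
    conn.executemany(
        "INSERT INTO jobs VALUES (?, ?, ?)",
        [("task_a", 2019061500, "DEAD"), ("task_b", 2019061500, "RUNNING")],
    )
    return conn


if __name__ == "__main__":
    conn = make_demo_db()
    # One job has failed for good, but another is still running: not done yet.
    print(all_jobs_done(conn))
    conn.execute("UPDATE jobs SET state = 'SUCCEEDED' WHERE taskname = 'task_b'")
    # Now every recorded job is in a terminal state.
    print(all_jobs_done(conn))
```

A table of recorded jobs says nothing about tasks that have not been recorded yet, so a check of this shape would presumably need to be paired with some other confirmation that nothing is left to run.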

Collaborator

OK, I did not realize cron is no longer being used. I agree the initial/final delays are not required in this case.

# Set exit code to number of failures
set +e
-failures=$(grep "Workflow status: FAILURE" ${progress_file} | wc -l)
+failures=$(grep " DEAD " ${progress_file} | wc -l)
if [[ $failures -ne 0 ]]; then
failures=1
fi
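
For readers more comfortable with Python, the exit-code logic in this hunk (count the progress-file lines containing ` DEAD `, then clamp any non-zero count to 1) can be sketched as follows. The function name and the sample lines are made up for illustration; the real progress file's layout is not shown in this diff.

```python
from typing import Iterable


def exit_code_from_summary(lines: Iterable[str]) -> int:
    """Return 1 if any line contains ' DEAD ' (space-delimited), otherwise 0."""
    failures = sum(1 for line in lines if " DEAD " in line)
    return 1 if failures else 0


if __name__ == "__main__":
    # Made-up summary lines, for illustration only.
    sample = [
        "expt_one   COMPLETE   12.5",
        "expt_two   DEAD   3.0",
    ]
    print(exit_code_from_summary(sample))
```

Matching on the space-padded token mirrors the `grep " DEAD "` pattern, so a word that merely contains those letters does not count as a failure.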
4 changes: 3 additions & 1 deletion .gitignore
@@ -7,12 +7,14 @@ lib/
share/
modulefiles/extrn_comp_build/
sorc/*/
tests/WE2E/WE2E_test_info.csv
tests/WE2E/WE2E_tests_*.yaml
tests/WE2E/*.txt
tests/WE2E/*.log
tests/WE2E/log.*
ush/__pycache__/
ush/config.yaml
ush/python_utils/__pycache__/
ush/*.swp

*.swp
__pycache__
2 changes: 1 addition & 1 deletion docs/UsersGuide/source/ConfigWorkflow.rst
@@ -174,7 +174,7 @@ METplus Parameters
Test Directories
----------------------

-These directories are used only by the ``run_WE2E_tests.sh`` script, so they are not used unless the user runs a Workflow End-to-End (WE2E) test. Their function corresponds to the same variables without the ``TEST_`` prefix. Users typically should not modify these variables. For any alterations, the logic in the ``run_WE2E_tests.sh`` script would need to be adjusted accordingly.
+These directories are used only by the ``run_WE2E_tests.py`` script, so they are not used unless the user runs a Workflow End-to-End (WE2E) test (see :numref:`Chapter %s <WE2E_tests>`). Their function corresponds to the same variables without the ``TEST_`` prefix. Users typically should not modify these variables. For any alterations, the logic in the ``run_WE2E_tests.py`` script would need to be adjusted accordingly.

``TEST_EXTRN_MDL_SOURCE_BASEDIR``: (Default: "")
This parameter allows testing of user-staged files in a known location on a given platform. This path contains a limited dataset and likely will not be useful for most user experiments.