-
Notifications
You must be signed in to change notification settings - Fork 120
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[develop] Replace shell-based WE2E scripts with python versions #637
Merged
MichaelLueken
merged 52 commits into
ufs-community:develop
from
mkavulich:feature/replace_sh_WE2E_script
Mar 20, 2023
Merged
Changes from all commits
Commits
Show all changes
52 commits
Select commit
Hold shift + click to select a range
a51c83d
Fix error introduced to pythonized WE2E script
mkavulich 7e393a0
Fix failing "specify_template_filenames" test
mkavulich a63286d
Add logic and instructions for neatly interrupting and resuming the m…
mkavulich 5330ecf
Consolidate check_task_get_extrn_ics and check_task_get_extrn_lbcs in…
mkavulich e27dacc
Remove some unnecessary debug prints
mkavulich 6dcc7c2
Initial version of job summary script
mkavulich b5fb16c
Update experiment yamls to track task cores and walltime, update job_…
mkavulich e653076
Print summary once all jobs are finished monitoring, add failsafe cor…
mkavulich da15a91
Move some functions into a new utils.py file to avoid circular depend…
mkavulich 3799562
Send rocotorun output to logging.debug, append logs rather than overw…
mkavulich 17bf55d
Add ability to update experiments in parallel
mkavulich cabea58
Remove duplicate routine, add totals row to job summary
mkavulich 523c349
More improvements
mkavulich 170a732
Add final missing options:
mkavulich 90ab3b9
- Remove incorrectly left-in exit call
mkavulich e14bc5b
New way of specifying relative time tests without system calls to `da…
mkavulich fe2d763
Some needed changes to address problems with parallel mode
mkavulich d5abe53
Add a final check using rocotostat to ensure that there are no un-sub…
mkavulich 6281c40
Fix logic for jobs in DEAD or UNKNOWN status, add logic for "FAILED" …
mkavulich dd02d0a
Remove UNKNOWN from the status check; according to rocoto source code…
mkavulich 5185b04
Some more fixes from final testing
mkavulich dcc8c58
Start updating documentation
mkavulich ceff2d0
Finish implementation of "print_test_details()"
mkavulich 735f202
For test details, rename everything to be more consistent with old
mkavulich 970e74a
Continue updating documentation through first few sections of WE2E ch…
mkavulich e75a78b
Some more documentation updates
mkavulich 7ccafde
Update gitignore for new/updated filenames
mkavulich c98f4fd
Documentation for WE2E_summary.py script
mkavulich b208a4b
Revert behavior of rocotorun to only capture output if debug=True. This
mkavulich baa8fcb
Fixes suggested by pylint
mkavulich b26623f
The big moment: ditching the old shell version.
mkavulich 5621385
Updates to Jenkins test script for new python workflow
mkavulich 7e5436d
Fix unit test for new behavior of calculate_cost.py
mkavulich 1dc1229
Fix unit test for real this time?
mkavulich 399b10d
Don't call rocotorun for WE2E_summary
mkavulich 0bd9c50
Add directory name to test summary
mkavulich 903d881
- More general cleanup, including suggestions from pylint
mkavulich e7380ca
Fix missed "expt_dict" rename, widen test name column in summary file
mkavulich 18389ca
Add missing column header for test info file
mkavulich 5255a82
Fixes to Jenkins testing scripts from Mike Lueken
mkavulich 35da5be
If database is not loaded, need to return every time, even if a toler…
mkavulich d484d14
Rocoto requires /home/Michael.Kavulich to be set to a writable path i…
mkavulich b4f7319
Update archiving of relevant log files in Jenkinsfile
mkavulich dca8c25
Fixes suggested by Daniel
mkavulich e2e7a6c
Fix usage instructions for more flexible "tests" argument
mkavulich effbc01
Address PR comments
mkavulich 099a2d2
A couple bug fixes from latest changes
mkavulich b88c4a7
Addressing more review comments: Update dicts in place
mkavulich 40e5fb5
Revert "Addressing more review comments: Update dicts in place"
mkavulich b321371
Final set of review comments
mkavulich 8f5525d
Un-revert intended change to flow of WE2E_summary.py
mkavulich 4ec7dfc
Revert "Un-revert intended change to flow of WE2E_summary.py"
mkavulich File filter
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Oops, something went wrong.
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't know how the new
run_WE2E
works but just wanted to make sure the various delays are taken care ofThere was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unless I'm missing something, none of those delays should be relevant with the new script.
./run_WE2E_tests.py
script takes care of both the setting up and monitoring of tests, so no initial delay is neededOne thing that may need to change is that for larger tests (such as, if the comprehensive tests are re-enabled), it may be worth using the new
-p
parallel option. Because the Jenkins tests call the script with-d
, calls to rocotorun are fairly slow, taking ~30 seconds each (and there are two calls to rocotorun per test check). But this should be done cautiously depending on the platform: too many parallel tasks may overload the login node if that's where this script is running.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For the third one, note that the existing
./run_WE2E_Tests.sh
also does not return until all tasks are complete, so if it was needed for the shell script it may also be needed for the python script. I vaguely recall that server side jenkins code by @jessemcfarland deletes experiment directories as soon assrw_tests.sh
finishes. This could happen before all cronjobs are removed, since they need to be called one last time for them to be removed from the crontable.The first one, initial delay, is required to given enough time for atleast one workflow to be launched. If
run_WE2E_tests.py
does not exit thinking all jobs completed, when in reality no jobs were launched after say 4 mins, then it should be fine.The second one is ok i think since it looks like the behaviour of checking experiment status changed. The old
get_expt_status.sh
was time consuming so it checks status of experiments every minute and print to the screen.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The
launch_FV3LAM_wflow.sh
script tried to implement this but it was done in quite a hacky way, so there were instances where the script would assume the experiment was done when there were still jobs running (for example, if there was a failed task but other tasks were still running). The new implementation does not use that script (does not use cron jobs either) and so does not suffer from that problem. It will only exit if it has verified that all jobs are done, via reading the rocoto database file directly.This is correct, the
run_WE2E_tests.py
both submits and monitors jobs, so unless something goes horribly wrong there is no possibility of "desync" there.I won't promise that I have anticipated all potential problems, but if any of these problems you mention (or similar ones) happen in the future, I can fix it in the
run_WE2E_tests.py
script itself, we shouldn't rely on hand-wavy delays to avoid problems anyway.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok I did not realize cron is no longer being used. I agree the initial/final delays are not required in this case.