The large number of cron tasks that must execute simultaneously (57 for the current comprehensive suite), together with the large number of submitted jobs (several hundred), causes problems on certain platforms. The known issues are detailed in the next section.
Machines affected
Cheyenne
The large number of cron tasks needing to be executed simultaneously (57 for the current comprehensive suite) strains the capability of the node(s) handling cron jobs on the Cheyenne machine. CISL has asked us specifically to limit our use of cron in this way.
Hera
The large number of individual tests means that a per-user job submission limit (300 on Hera) can be surpassed on platforms that enforce one. Exceeding the limit does not itself create errors, but any further submissions are rejected, which rocoto counts as a failure and does not handle well.
Steps To Reproduce
Run the comprehensive suite of WE2E tests on the platforms described above.
Detailed Description of Fix
Each of these issues has potential fixes of its own, but the best fix would be a Python script that sets up and generates every test in the comprehensive suite, then tracks and manages the active jobs using rocoto rather than creating a crontab entry for each test. The script would run continuously until all WE2E tasks are complete, and could be run from the command line or submitted to a compute node. Working this way bypasses cron entirely, which avoids the Cheyenne issue, and makes it relatively easy to implement any throttling needed to stay under job submission limits.
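A minimal sketch of the control loop described above, under stated assumptions: `run_suite`, `advance`, `is_done`, and `max_active` are hypothetical names, and capping the number of concurrently active tests is only a rough proxy for capping submitted jobs. The `rocotorun -w <xml> -d <db>` invocation is the standard way to advance a rocoto workflow one step; the XML and database paths passed to it would come from each generated experiment.

```python
import subprocess
import time
from typing import Callable, Iterable, List


def advance_with_rocoto(xml_path: str, db_path: str) -> None:
    """Advance one experiment's workflow by a single rocotorun call."""
    subprocess.run(["rocotorun", "-w", xml_path, "-d", db_path], check=True)


def run_suite(
    tests: Iterable[str],
    advance: Callable[[str], None],
    is_done: Callable[[str], bool],
    max_active: int = 10,
    poll_seconds: float = 60.0,
    sleep: Callable[[float], None] = time.sleep,
) -> List[str]:
    """Drive every test to completion without cron.

    advance(name) pushes one test forward (e.g. via advance_with_rocoto);
    is_done(name) reports whether that test has finished (hypothetical hook).
    At most max_active tests are advanced in any pass (crude throttling).
    Returns test names in the order they finished.
    """
    pending = list(tests)
    active: List[str] = []
    finished: List[str] = []
    while pending or active:
        # Top up the active set without exceeding the throttle.
        while pending and len(active) < max_active:
            active.append(pending.pop(0))
        # Advance each active test once, then retire the finished ones.
        for name in list(active):
            advance(name)
            if is_done(name):
                active.remove(name)
                finished.append(name)
        if pending or active:
            sleep(poll_seconds)
    return finished
```

Because `advance`, `is_done`, and `sleep` are passed in, the loop can be exercised without rocoto or a batch system, and the same function works whether it is launched from a login node or inside a batch job.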
Possible Implementation
My proposal for this implementation would be to create a script that initially has just the basic functionality described above, which could be created quickly and used to run the comprehensive test suite in the short term. Once implemented, the script could be fleshed out to include all the necessary functionality of run_WE2E_tests.sh, eventually superseding it completely.
Benefits of a rewrite in Python include:
Working well with the existing efforts to python-ize the app
Taking advantage of existing Python tools easily
Making it much easier to add useful functionality, such as a test summary file detailing the status of each test
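As an illustration of the test summary idea, here is a small sketch that turns a mapping of test name to status into a plain-text report. The function name `format_summary`, the test names, and the status strings are all hypothetical; the real script would decide what statuses exist and where the file is written.

```python
from collections import Counter
from typing import Dict


def format_summary(statuses: Dict[str, str]) -> str:
    """Render one line per test plus a totals line."""
    width = max((len(name) for name in statuses), default=4)
    lines = [f"{'test'.ljust(width)}  status"]
    for name in sorted(statuses):
        lines.append(f"{name.ljust(width)}  {statuses[name]}")
    # Count how many tests ended in each status.
    counts = Counter(statuses.values())
    totals = ", ".join(f"{status}={counts[status]}" for status in sorted(counts))
    lines.append(f"total: {len(statuses)} ({totals})")
    return "\n".join(lines)
```

Writing the returned string to a file after each polling pass would give a running status report for the whole suite.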
Specific requirements:
Should be able to run from login node or batch queue
Must be able to stop specific tests without killing all tests
Should be able to resume an interrupted test suite (desired but not required)
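One way the stop and resume requirements could be met is to keep per-test status in a small state file that the script re-reads on every pass. The sketch below assumes a JSON file and the hypothetical status values `pending`, `running`, `stopped`, and `done`; none of these names come from the existing scripts.

```python
import json
from pathlib import Path
from typing import Dict, Iterable, List


def load_state(path: str, tests: Iterable[str]) -> Dict[str, str]:
    """Load saved statuses if the file exists; new tests start as 'pending'."""
    p = Path(path)
    state = json.loads(p.read_text()) if p.exists() else {}
    for name in tests:
        state.setdefault(name, "pending")
    return state


def save_state(path: str, state: Dict[str, str]) -> None:
    """Persist statuses so an interrupted run can be resumed later."""
    Path(path).write_text(json.dumps(state, indent=2, sort_keys=True))


def stop_test(path: str, name: str) -> None:
    """Mark a single test as stopped without touching any other test."""
    state = json.loads(Path(path).read_text())
    state[name] = "stopped"
    save_state(path, state)


def runnable(state: Dict[str, str]) -> List[str]:
    """Tests the main loop should still advance."""
    return sorted(n for n, s in state.items() if s in ("pending", "running"))
```

With this layout, stopping one test is an edit to the state file rather than a signal to the whole process, and restarting the script with the same file picks up where the previous run left off.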
#466 represents a good interim solution for the Cheyenne crontab issue. I will be working on a fuller rewrite of run_WE2E_tests.sh in Python in the coming days to resolve these issues more completely.
Expected behavior
All comprehensive tests should run