There are several flaws in the Rocoto implementation in rt_utils.sh that cause frequent failures. PR #1195 corrects these flaws, and those updates have been tested extensively by several developers on Hera and Jet.
rocotorun is run too frequently: every 20 seconds. Batch system queries can take longer than that to register that a job was submitted, which can cause Rocoto to submit the same job twice during times of extreme load on the batch system. Rocoto developers and RDHPCS admins strongly suggest a minimum delay of 60 seconds between executions of rocotorun.
If a Rocoto command exits with non-zero status, rt.sh fails. Rocoto commands, by design, do not retry failed operations. Instead, they report a failure with a non-zero exit status and expect the caller to rerun the command in a few minutes. Because rt_utils.sh uses "set -e", it aborts on the first non-zero exit status instead of rerunning the command.
During times of extreme Intel license contention, hera.intel build jobs can take longer than 30 minutes to compile. If this contention lasts several hours, the workflow fails because the job hits its wallclock limit 3 times (the maximum number of tries for a compile_* job). This can be fixed easily by changing the wallclock limit for build jobs on Hera to 1 hour, as is done on some other platforms that share this problem.
Although the compile_* jobs have maxtries=3, the test jobs only have maxtries=1. This is a problem on hera.intel, which intentionally misconfigures MPI to support GOCART, a model that cannot run with MPI configurations deemed safe on Hera. That misconfiguration occasionally causes the model to freeze on startup. Beyond that problem, all jobs on all platforms can fail due to system issues, so correcting GOCART's flaws will not fully resolve this. The ecFlow workflow retries test jobs, but the Rocoto workflow does not. The two should behave the same way.
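The first two flaws above (polling too often, and aborting on any non-zero exit status) can be illustrated with a minimal bash sketch. This is not the code from #1195. The function name run_with_retry and the variables POLL_INTERVAL and MAX_ATTEMPTS are hypothetical names chosen for this example. The point it shows is that a command tested inside an "if" does not trigger "set -e", so the caller can wait and rerun it.

```shell
#!/bin/bash
set -e

# Hypothetical settings for this sketch (not taken from rt_utils.sh).
# 60 seconds matches the minimum delay suggested in the issue text.
POLL_INTERVAL="${POLL_INTERVAL:-60}"
MAX_ATTEMPTS="${MAX_ATTEMPTS:-5}"

# Hypothetical helper: run the given command until it succeeds,
# waiting POLL_INTERVAL seconds between attempts.
run_with_retry() {
  local attempt=1
  local status=0
  while (( attempt <= MAX_ATTEMPTS )); do
    # A command used as an "if" condition is exempt from "set -e",
    # so a non-zero exit status does not abort the calling script.
    if "$@"; then
      return 0
    else
      status=$?
    fi
    echo "attempt ${attempt} of ${MAX_ATTEMPTS} failed with status ${status}: $*" >&2
    if (( attempt < MAX_ATTEMPTS )); then
      sleep "${POLL_INTERVAL}"
    fi
    attempt=$(( attempt + 1 ))
  done
  # Give up and report the last failure to the caller.
  return "${status}"
}
```

A caller might then write something like "run_with_retry rocotorun -w rocoto_workflow.xml -d rocoto_workflow.db", where the database file name is an assumption made for illustration.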
To Reproduce:
What compilers/machines are you seeing this with?
Hera and Jet, especially hera.intel.
Give explicit steps to reproduce the behavior.
Run rt.sh -r many times and see occasional failures, as described above.
Notice that maxtries=1 for all test jobs (but not compile_* jobs) in rocoto_workflow.xml.
Run rocotorewind on the rocoto_workflow. This will cause rt.sh to abort if rocotorun and rocotorewind ran at the same time. You may have to try this several times to get them to run simultaneously.
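One way to check the maxtries setting without reading the XML by hand is sketched below. The function name list_single_try_tasks is hypothetical, and the sketch assumes that each task's opening tag sits on a single line with a name attribute and a literal maxtries="1" value; a workflow that sets maxtries through an XML entity, or wraps the tag across lines, would need a real XML parser instead.

```shell
#!/bin/bash

# Hypothetical helper: print the name of every <task> whose opening tag
# contains a literal maxtries="1".
# Assumption: the opening tag and its attributes are on one line.
list_single_try_tasks() {
  local xml="$1"
  grep -E '<task[[:space:]][^>]*maxtries="1"' "${xml}" \
    | sed -E 's/.*name="([^"]*)".*/\1/'
}
```

Running "list_single_try_tasks rocoto_workflow.xml" would then print one task name per line for each task that is tried only once.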
Additional context
Fixed in #1195 which was tested many times by several developers on Jet and Hera.
Output
N/A