Archive jobs fail for fourth full cycle when starting with a half-cycle #1003
Comments
@WalterKolczynski-NOAA Would you expect this bug in cycles after that fourth full cycle?
I'm seeing this error in almost every cycle in new cycled tests on Orion and WCOSS2. Sometimes it's a race condition unrelated to the half-cycle and other times happens when the prior cycle has a DEAD job (e.g. gdasvrfy) or otherwise hasn't completed.
I've encountered the race condition as well. The problem is compounded by rocotocompleting the dead archiving tasks leaving no string to be grepped in the rocoto log, so each cycle winds up requiring intervention to complete.
Are there any plans for this? Any cycled run longer than a day requires intervention on each cycle to keep running.
I'm going to start tackling this today, I hope.
Great - once it's working I'll merge it into the EFSOI branch, get the PR in, and get that out of everybody's hair.
Splits the existing rocoto cycle definitions up to offer better job control. This means that only the jobs that are due to run will appear in a cycle's job list from rocotostat/rocotoviewer. It also allows for the removal of some of the cycleexist dependencies that were there solely to prevent the job from running in the half-cycle. A side effect of this change is that the half-cycle will be recognized as a completed cycle, fixing the bug with archive jobs starting in the fourth cycle (NOAA-EMC#1004).

The gdas cycledef has been split into a `gdas_half` for the first half-cycle and `gdas` for the other GDAS cycles. Tasks that run during that first half-cycle therefore run on two cycledefs. For gfs, instead of slicing perpendicular to time, a new cycledef `gfs_cont` (continuity) was created in parallel to the existing gfs cycledef that omits the first cycle. This was done since only one job (`aerosol_init`) currently skips the first cycle, and this prevents the need to provide two cycledefs for every gfs task but one.

Since some time math is now being done on sdate in workflow_xml.py, we now keep those as datetime objects and only convert to string when writing the cycledef strings.

In order to access the pygw utilities in the workflow directory, a symlink is created in `workflow` pointing to the pygw location in `ush`. A better solution may be found in the future.

Fixes NOAA-EMC#1003
Splits the existing rocoto cycle definitions up to offer better job control. This means that only the jobs that are due to run will appear in a cycle's job list from rocotostat/rocotoviewer. It also allows for the removal of some of the cycleexist dependencies that were there solely to prevent the job from running in the half-cycle. A side effect of this change is that the half-cycle will be recognized as a completed cycle, fixing the bug with archive jobs starting in the fourth cycle (#1003).

The gdas cycledef has been split into a `gdas_half` for the first half-cycle and `gdas` for the other GDAS cycles. Tasks that run during that first half-cycle therefore run on two cycledefs. For gfs, instead of slicing perpendicular to time, a new cycledef `gfs_cont` (continuity) was created in parallel to the existing gfs cycledef that omits the first cycle. This was done since only one job (`aerosol_init`) currently skips the first cycle, and this prevents the need to provide two cycledefs for every gfs task but one.

Since some time math is now being done on sdate in workflow_xml.py, we now keep those as datetime objects and only convert to string when writing the cycledef strings.

In order to access the pygw utilities in the workflow directory, a symlink is created in `workflow` pointing to the pygw location in `ush`. A better solution may be found in the future.

Fixes #1003
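For readers who have not seen the change itself, the cycledef split described above might look roughly like the sketch below. This is not the actual code in workflow_xml.py: the function names, the `sdate_gfs` parameter, the default intervals, and the example dates are all assumptions made for illustration. Only the group names (`gdas_half`, `gdas`, `gfs`, `gfs_cont`) and the idea of keeping dates as datetime objects until the strings are written come from the description.

```python
from datetime import datetime, timedelta

ROCOTO_FMT = "%Y%m%d%H%M"  # rocoto cycledef timestamps: YYYYMMDDHHMM


def cycledef(group, start, stop, step_hours):
    """Render one <cycledef> entry; datetimes become strings only here."""
    return (
        f'<cycledef group="{group}">'
        f"{start.strftime(ROCOTO_FMT)} {stop.strftime(ROCOTO_FMT)} "
        f"{step_hours:02d}:00:00</cycledef>"
    )


def build_cycledefs(sdate, edate, sdate_gfs, gdas_step=6, gfs_step=24):
    """Hypothetical helper: split GDAS by time, add a parallel gfs_cont."""
    # The half-cycle gets its own single-cycle definition.
    first_full = sdate + timedelta(hours=gdas_step)
    defs = [
        cycledef("gdas_half", sdate, sdate, gdas_step),
        cycledef("gdas", first_full, edate, gdas_step),
        cycledef("gfs", sdate_gfs, edate, gfs_step),
    ]
    # gfs_cont mirrors gfs but omits the first gfs cycle.
    gfs_second = sdate_gfs + timedelta(hours=gfs_step)
    if gfs_second <= edate:
        defs.append(cycledef("gfs_cont", gfs_second, edate, gfs_step))
    return defs


# Example dates are made up for illustration.
cycledefs = build_cycledefs(
    sdate=datetime(2021, 1, 1, 0),
    edate=datetime(2021, 1, 3, 0),
    sdate_gfs=datetime(2021, 1, 1, 6),
)
print("\n".join(cycledefs))
```

Because the date arithmetic happens on datetime objects, the offset for the first full cycle and the second gfs cycle is a plain `timedelta` addition rather than string manipulation.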
Expected behavior
Archive jobs should complete normally
Current behavior
When running cold start, the archive job fails for the fourth full cycle (24 h after the half-cycle).
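The timing can be checked with a little date arithmetic. The sketch below assumes 6-hourly cycling and a 24 h look-back, as implied by the description; the function name and the start date are hypothetical.

```python
from datetime import datetime, timedelta


def cycles_checking_half_cycle(half_cycle, n_full, step_hours=6, lookback_hours=24):
    """Return the 1-based indices of full cycles whose look-back lands on the half-cycle."""
    step = timedelta(hours=step_hours)
    lookback = timedelta(hours=lookback_hours)
    hits = []
    for n in range(1, n_full + 1):
        full_cycle = half_cycle + n * step
        if full_cycle - lookback == half_cycle:
            hits.append(n)
    return hits


# Hypothetical cold-start date; only the offsets matter.
affected = cycles_checking_half_cycle(datetime(2021, 1, 1, 0), n_full=8)
print(affected)
```

With these assumptions, only the fourth full cycle looks back exactly onto the half-cycle, which matches the reported failure.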
Machines affected
All
To Reproduce
Set up a cycled cold-start experiment that runs for more than a day.
Detailed Description
The failure occurs near the end of the archive script, when it attempts to delete data that is no longer needed. The script checks the rocoto log(!) to see whether the cycle 24 h earlier completed successfully. However, the key string being searched for, `This cycle is complete: Success`, is not written for the half-cycle. This is likely because not all jobs run in the half-cycle and no task is assigned as the "final" task, so the cycle is never "complete" from a rocoto perspective.
Possible Implementation
There are several things here that can be fixed. First, there are prerequisites that can be used so the job only runs if the cycle is complete (if it exists), rather than parsing the rocoto log file. Even if that approach is used, it will be unsuccessful unless the half-cycle is marked as complete. This can be done by setting `final="true"` in the task definition, or preferably by specifying a new cycledef for the half-cycle that includes only the jobs to be run (or both for good measure).
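As a rough illustration of the two options, the sketch below builds a rocoto `<task>` element that is both marked `final="true"` and attached to a separate half-cycle cycledef. The helper name and the task name `last_task` are hypothetical, and the element is deliberately incomplete (a real task also needs a command, resources, and so on). The cycledef group names follow the `gdas_half`/`gdas` split described in the commit message above.

```python
import xml.etree.ElementTree as ET


def make_task(name, cycledefs, final=False):
    """Hypothetical helper: build a bare <task> element for a rocoto XML."""
    attrs = {"name": name, "cycledefs": ",".join(cycledefs)}
    if final:
        # Marks this task as the one whose completion completes the cycle.
        attrs["final"] = "true"
    return ET.Element("task", attrs)


# "last_task" is a placeholder, not a real global-workflow job name.
task = make_task("last_task", ["gdas_half", "gdas"], final=True)
task_xml = ET.tostring(task, encoding="unicode")
print(task_xml)
```

Listing the half-cycle group in `cycledefs` only for the tasks that actually run then means the half-cycle contains nothing that can never finish.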