
Archive jobs fail for fourth full cycle when starting with a half-cycle #1003

Closed
WalterKolczynski-NOAA opened this issue Aug 28, 2022 · 6 comments · Fixed by #1078
Labels
bug Something isn't working

Comments

@WalterKolczynski-NOAA
Contributor

Expected behavior
Archive jobs should complete normally

Current behavior
When running from a cold start, the archive job fails for the fourth full cycle (24 h after the half-cycle).

Machines affected
All

To Reproduce
Set up a cycled cold-start experiment that runs for more than a day.

Detailed Description

The failure occurs near the end of the archive script, when it attempts to delete data that is no longer needed. The script checks the rocoto log(!) to see whether the cycle 24 h previous completed successfully. However, the key string it searches for, `This cycle is complete: Success`, is never written for the half-cycle. This is likely because not all jobs run in the half-cycle and no task is designated as the "final" task, so the cycle is never "complete" from a rocoto perspective.
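For illustration, here is a minimal Python sketch of the check as described above (the real check is a shell `tail`/`grep` in `earc.sh`, visible in the log excerpt in the comment below); the marker string is the real one, the function and path handling are hypothetical:

```python
from pathlib import Path

COMPLETE_MARKER = "This cycle is complete: Success"

def previous_cycle_succeeded(rocoto_log: Path) -> bool:
    """Return True only if the last line of the given rocoto log reports a
    completed cycle, mirroring the grep in the archive script."""
    lines = rocoto_log.read_text().splitlines()
    return bool(lines) and COMPLETE_MARKER in lines[-1]

# The half-cycle never gets this line written, so the check comes up empty
# for the fourth full cycle even though the half-cycle ran everything it
# was supposed to.
```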

Possible Implementation
There are a lot of things going on here that can be fixed. First, there are dependencies that can be used so the job only runs once the earlier cycle is complete (when that cycle exists), rather than parsing the rocoto log file. Even with that approach, it will be unsuccessful unless the half-cycle is marked as complete. That can be done by setting `final="true"` in the task definition, or preferably by specifying a new cycledef for the half-cycle that includes only the jobs to be run (or both, for good measure).
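As a rough sketch of what those two options would look like in the generated XML (the group name `half`, task name `earc`, and dates are placeholders, not the workflow's actual values):

```python
# A one-cycle cycledef whose start and stop are both the half-cycle, so only
# the jobs that actually run then need to be assigned to it:
HALF_CYCLEDEF = '<cycledef group="half">202106090000 202106090000 06:00:00</cycledef>'

# ...and/or mark the last half-cycle task as final so rocoto records the
# cycle as complete and writes the success string to its log:
FINAL_TASK_TAG = '<task name="earc" cycledefs="half" final="true">'
```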

@WalterKolczynski-NOAA WalterKolczynski-NOAA added the bug Something isn't working label Aug 28, 2022
@AndrewEichmann-NOAA
Contributor

@WalterKolczynski-NOAA Would you expect this bug in cycles after that fourth full cycle?

Orion-login-2[37] aeichman$ tail -6 logs/2021061006/gdasearc00.log
++ earc.sh[237]: tail -n 1 /work/noaa/da/aeichman/para//xylose/logs/2021060900.log
++ earc.sh[237]: grep 'This cycle is complete: Success'
+ earc.sh[237]: testend=
+ earc.sh[1]: postamble earc.sh 1662778819 1
+ preamble.sh[67]: set +x
End earc.sh at 03:00:31 with error code 1 (time elapsed: 00:00:12)
Orion-login-2[38] aeichman$ 

@KateFriedman-NOAA
Member

I'm seeing this error in almost every cycle in new cycled tests on Orion and WCOSS2. Sometimes it's a race condition unrelated to the half-cycle, and other times it happens when the prior cycle has a DEAD job (e.g. gdasvrfy) or otherwise hasn't completed.

@AndrewEichmann-NOAA
Contributor

I've encountered the race condition as well. The problem is compounded by `rocotocomplete`-ing the dead archiving tasks, which leaves no string to be grepped in the rocoto log, so each cycle winds up requiring intervention to complete.

@AndrewEichmann-NOAA
Contributor

Are there any plans for this? Any cycled run longer than a day requires intervention on each cycle to keep running.

@WalterKolczynski-NOAA
Contributor Author

I'm going to start tackling this today, I hope.

@AndrewEichmann-NOAA
Contributor

Great - once it's working I'll merge it into the EFSOI branch, get the PR in, and get that out of everybody's hair.

WalterKolczynski-NOAA added a commit to WalterKolczynski-NOAA/global-workflow that referenced this issue Oct 17, 2022
Splits the existing rocoto cycle definitions up to offer better job control. This means that only the jobs that are due to run will appear in a cycle's job list from rocotostat/rocotoviewer. It also allows the removal of some of the cycleexist dependencies that were there solely to prevent jobs from running in the half-cycle. A side effect of this change is that the half-cycle will be recognized as a completed cycle, fixing the bug with archive jobs starting in the fourth cycle (NOAA-EMC#1003).

The gdas cycledef has been split into a `gdas_half` for the first half-cycle and `gdas` for the other GDAS cycles. Tasks that run during that first half-cycle therefore run on two cycledefs.

For gfs, instead of slicing perpendicular to time, a new cycledef `gfs_cont` (continuity) was created in parallel to the existing gfs cycledef that omits the first cycle. This was done since only one job (`aerosol_init`) currently skips the first cycle, and it avoids the need to provide two cycledefs for every gfs task but one.

Since some time math is now being done on `sdate` in workflow_xml.py, we now keep the dates as datetime objects and only convert them to strings when writing the cycledef strings.

In order to access the pygw utilities in the workflow directory, a
symlink is created in `workflow` pointing to the pygw location in `ush`.
A better solution may be found in the future.

Fixes NOAA-EMC#1003
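To make the cycledef split described in the commit message above concrete, here is a minimal sketch, assuming a uniform 6-hour interval and placeholder dates (the real workflow's gfs frequency and variable names may differ); dates are kept as `datetime` objects and only rendered as strings when the cycledef lines are written:

```python
from datetime import datetime, timedelta

def cycledefs(sdate: datetime, edate: datetime,
              interval: timedelta = timedelta(hours=6)) -> list[str]:
    """Render the split cycledefs: gdas_half covers only the half-cycle,
    gdas the remaining GDAS cycles, and gfs_cont mirrors gfs but skips the
    first cycle (for tasks like aerosol_init)."""
    fmt = "%Y%m%d%H%M"
    step = "06:00:00"
    first_full = sdate + interval  # the half-cycle occupies sdate itself
    return [
        f'<cycledef group="gdas_half">{sdate.strftime(fmt)} {sdate.strftime(fmt)} {step}</cycledef>',
        f'<cycledef group="gdas">{first_full.strftime(fmt)} {edate.strftime(fmt)} {step}</cycledef>',
        f'<cycledef group="gfs">{sdate.strftime(fmt)} {edate.strftime(fmt)} {step}</cycledef>',
        f'<cycledef group="gfs_cont">{first_full.strftime(fmt)} {edate.strftime(fmt)} {step}</cycledef>',
    ]

print("\n".join(cycledefs(datetime(2021, 6, 9, 0), datetime(2021, 6, 12, 18))))
```

A task that runs in the half-cycle would then list both GDAS groups (e.g. cycledefs="gdas_half,gdas"), which is what the commit means by tasks running on two cycledefs.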
WalterKolczynski-NOAA added a commit that referenced this issue Oct 20, 2022