[Design] Workflows steps replays on error
This design document is in the accepted status.
When running workflows, especially long ones, a failure in one of the final steps can be very frustrating.
To mitigate this, Yorc currently ships a feature that lets a user manually fix a failed step and use the REST API to mark this step as DONE. Another endpoint then allows resuming the failed workflow. This way the workflow can run the successor steps of the failed one and finish successfully.
This is useful when a manual fix is possible, but sometimes we will want Yorc to replay the failed step itself, for instance after an intermittent network error. This is currently not supported by Yorc, and the purpose of this document is to define how we should implement it.
One important point that should be taken into account in this design spec is that we do not want to introduce a breaking change into the Yorc API.
Currently, to mark a step as manually fixed and resume the workflow, the following API endpoints should be used:
Only a step in ERROR state can be updated, and the API sets it to the DONE state. Otherwise an HTTP 400 (Bad request) error is returned.
PUT /deployments/<deployment_id>/tasks/<taskId>/steps/<stepId>
Response:
HTTP/1.1 200 OK
Content-Length: 0
Resume a task for a given deployment. The task should be in status "FAILED" to be resumed. Otherwise an HTTP 400 (Bad request) error is returned.
PUT /deployments/<deployment_id>/tasks/<taskId>
Response:
HTTP/1.1 202 Accepted
Content-Length: 0
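The two calls above can be sketched from a client's point of view. This is only a minimal illustration in Go: the function names and the base URL passed by the caller are hypothetical, and only the paths, the PUT method and the expected status codes come from the endpoint descriptions above.

```go
package main

import (
	"fmt"
	"net/http"
	"net/url"
)

// newMarkStepDoneRequest builds the PUT request that marks a step in ERROR
// state as DONE. baseURL is a hypothetical Yorc server address supplied by
// the caller, for example "http://localhost:8800".
func newMarkStepDoneRequest(baseURL, deploymentID, taskID, stepID string) (*http.Request, error) {
	target := fmt.Sprintf("%s/deployments/%s/tasks/%s/steps/%s",
		baseURL, url.PathEscape(deploymentID), url.PathEscape(taskID), url.PathEscape(stepID))
	return http.NewRequest(http.MethodPut, target, nil)
}

// newResumeTaskRequest builds the PUT request that resumes a FAILED task.
func newResumeTaskRequest(baseURL, deploymentID, taskID string) (*http.Request, error) {
	target := fmt.Sprintf("%s/deployments/%s/tasks/%s",
		baseURL, url.PathEscape(deploymentID), url.PathEscape(taskID))
	return http.NewRequest(http.MethodPut, target, nil)
}

// expectStatus sends req and returns an error unless the response carries
// the wanted status code (200 for the step update, 202 for the resume).
func expectStatus(client *http.Client, req *http.Request, want int) error {
	resp, err := client.Do(req)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode != want {
		return fmt.Errorf("%s %s: got HTTP %d, want %d",
			req.Method, req.URL.Path, resp.StatusCode, want)
	}
	return nil
}
```

A caller would send the first request with `expectStatus(client, req, http.StatusOK)` for each manually fixed step, then send the resume request and expect `http.StatusAccepted`.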
The API endpoint that updates the step state does not allow setting a state other than DONE.
To be replayed, the step should be put back in the INITIAL state.
All failed steps should be updated one by one. However, it does not seem to be enforced at any point that no step remains in a failed state before the task is resumed. (Investigation needed)
The update step state endpoint should now support a payload indicating the state that should be used.
PUT /deployments/<deployment_id>/tasks/<taskId>/steps/<stepId>
Request body:
{
"state": "INITIAL"
}
- API should enforce that state is either DONE or INITIAL
- API should enforce that the current step state is ERROR
- Having an empty request body should still be allowed and considered as a request to set the step state to DONE, for backward compatibility
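The three rules above could be checked along the following lines. This is a sketch under stated assumptions, not Yorc's actual handler code: the type and function names are hypothetical, and rejecting a non-empty body whose state is missing (such as `{}`) is an assumption that the rules above do not settle.

```go
package main

import (
	"bytes"
	"encoding/json"
	"errors"
	"fmt"
)

// stepUpdate mirrors the proposed request body: {"state": "INITIAL"}.
type stepUpdate struct {
	State string `json:"state"`
}

// errBadRequest marks failures that the API would report as HTTP 400.
var errBadRequest = errors.New("bad request")

// resolveTargetState returns the state a step should be set to, given the
// raw request body and the current state of the step.
func resolveTargetState(body []byte, currentState string) (string, error) {
	// An empty body keeps the existing behavior: the step is marked DONE.
	target := "DONE"
	if len(bytes.TrimSpace(body)) > 0 {
		var update stepUpdate
		if err := json.Unmarshal(body, &update); err != nil {
			return "", fmt.Errorf("%w: invalid JSON payload: %v", errBadRequest, err)
		}
		// Assumption: a body without a "state" value is rejected below.
		target = update.State
	}
	if target != "DONE" && target != "INITIAL" {
		return "", fmt.Errorf("%w: state must be DONE or INITIAL, got %q", errBadRequest, target)
	}
	if currentState != "ERROR" {
		return "", fmt.Errorf("%w: only a step in ERROR state can be updated, current state is %q",
			errBadRequest, currentState)
	}
	return target, nil
}
```

Keeping the default in one place means existing clients that send no body go through the same validation path as new clients.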
This option is easy to implement and keeps the API consistent and backward compatible. However, it does not allow updating several steps at once.
The API should now support a new endpoint for updating the states of several steps. The current endpoint for marking steps as manually fixed remains as is.
PUT /deployments/<deployment_id>/tasks/<taskId>/steps
Request body:
{
"mystep1": {"state": "INITIAL"},
"mystep2": {"state": "DONE"}
}
- API should enforce that state is either DONE or INITIAL
- API should enforce that the current step state is ERROR
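For the batch payload, the same two checks would apply to every entry. The sketch below is illustrative only: the names are hypothetical, and validating the whole batch before applying any change, as well as rejecting unknown step names, are assumptions that this document does not specify.

```go
package main

import (
	"encoding/json"
	"fmt"
	"sort"
)

// batchStepUpdate mirrors one entry of the proposed payload,
// for example "mystep1": {"state": "INITIAL"}.
type batchStepUpdate struct {
	State string `json:"state"`
}

// validateBatch parses the request body and returns the target state of each
// step, or an error if any entry breaks one of the rules. currentStates maps
// step names to their current state.
func validateBatch(body []byte, currentStates map[string]string) (map[string]string, error) {
	var updates map[string]batchStepUpdate
	if err := json.Unmarshal(body, &updates); err != nil {
		return nil, fmt.Errorf("invalid JSON payload: %v", err)
	}

	// Sort step names so that the first reported error is deterministic.
	names := make([]string, 0, len(updates))
	for name := range updates {
		names = append(names, name)
	}
	sort.Strings(names)

	targets := make(map[string]string, len(updates))
	for _, name := range names {
		target := updates[name].State
		if target != "DONE" && target != "INITIAL" {
			return nil, fmt.Errorf("step %q: state must be DONE or INITIAL, got %q", name, target)
		}
		current, known := currentStates[name]
		if !known {
			return nil, fmt.Errorf("step %q: unknown step", name)
		}
		if current != "ERROR" {
			return nil, fmt.Errorf("step %q: current state is %q, expected ERROR", name, current)
		}
		targets[name] = target
	}
	return targets, nil
}
```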
This option allows setting the states of several steps in a single API call. However, there would then be two different ways to update step states, which could be a bit confusing.
In this option the current endpoint for marking steps as manually fixed remains as is.
We then change the behavior of the resume task endpoint so that it automatically sets all steps in ERROR state to INITIAL before running the task again.
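The reset performed by the resume endpoint in this option can be pictured as follows. This is a minimal sketch with a hypothetical function name and an in-memory map standing in for wherever step states are actually stored.

```go
package main

import "sort"

// resetFailedSteps returns a copy of the step states in which every step in
// ERROR is set to INITIAL, together with the sorted names of the steps that
// were reset. The input map is left untouched.
func resetFailedSteps(states map[string]string) (map[string]string, []string) {
	updated := make(map[string]string, len(states))
	var reset []string
	for name, state := range states {
		if state == "ERROR" {
			updated[name] = "INITIAL"
			reset = append(reset, name)
		} else {
			updated[name] = state
		}
	}
	sort.Strings(reset)
	return updated, reset
}
```

Under this behavior, a step that was fixed manually has to be marked as DONE before the task is resumed, otherwise it is replayed along with the other failed steps.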
This option keeps the API signature unchanged, and several failed steps can be retried in a single API call. But it introduces two different ways to manage step states, one for manually fixed steps and the other for retried steps, which makes the API confusing and inconsistent.
After collecting feedback from Yorc users, especially the Alien4Cloud team, we decided to implement option 1 (the state payload on the existing step update endpoint).
The main reasons for this decision are the lower complexity of this implementation and keeping a consistent API.
Currently there is no need to batch updates of several step states.