[Design] Workflows steps replays on error

Loïc Albertin edited this page Jul 6, 2021 · 6 revisions

Workflows steps replays on error

Status

This design document is in the accepted status.

Context

When running workflows, especially long ones, a failure in one of the final steps can be very frustrating. To mitigate this, Yorc currently ships a feature that lets a user manually fix a failed step and then use the REST API to mark this step as DONE. Another endpoint then allows resuming the failed workflow. This way the workflow can run the successors of the failed step and finish successfully.

This is useful when a manual fix is possible, but sometimes we want Yorc itself to replay the failed step, for instance after an intermittent network error. This is not currently supported by Yorc, and the purpose of this document is to define how we should implement it.

One important point that should be taken into account in this design spec is that we do not want to introduce a breaking change into the Yorc API.

Pre-change API description

Currently, to mark a step as manually fixed and resume the workflow, the following API endpoints are used:

Update step state

Only a step in the ERROR state can be updated, and the API sets it to the DONE state. Otherwise an HTTP 400 (Bad Request) error is returned.

PUT /deployments/<deployment_id>/tasks/<taskId>/steps/<stepId>

Response:

HTTP/1.1 200 OK
Content-Length: 0

Resume a task

Resume a task for a given deployment. The task should be in status "FAILED" to be resumed. Otherwise an HTTP 400 (Bad request) error is returned.

PUT /deployments/<deployment_id>/tasks/<taskId>

Response:

HTTP/1.1 202 Accepted
Content-Length: 0
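
As an illustration of how a client could drive these two endpoints, here is a minimal Go sketch that only builds the corresponding requests. The helper names `newMarkStepDoneRequest` and `newResumeTaskRequest` are hypothetical, and the base URL is a placeholder supplied by the caller; only the paths and the PUT method come from the description above.

```go
package main

import (
	"fmt"
	"net/http"
	"net/url"
)

// newMarkStepDoneRequest (hypothetical helper) builds the "update step state"
// call. It sends no body: the pre-change API always moves the step from
// ERROR to DONE.
func newMarkStepDoneRequest(baseURL, deploymentID, taskID, stepID string) (*http.Request, error) {
	target := fmt.Sprintf("%s/deployments/%s/tasks/%s/steps/%s", baseURL,
		url.PathEscape(deploymentID), url.PathEscape(taskID), url.PathEscape(stepID))
	return http.NewRequest(http.MethodPut, target, nil)
}

// newResumeTaskRequest (hypothetical helper) builds the "resume a task" call,
// which is only accepted for a task in the FAILED status.
func newResumeTaskRequest(baseURL, deploymentID, taskID string) (*http.Request, error) {
	target := fmt.Sprintf("%s/deployments/%s/tasks/%s", baseURL,
		url.PathEscape(deploymentID), url.PathEscape(taskID))
	return http.NewRequest(http.MethodPut, target, nil)
}
```

Each request can then be sent with `http.DefaultClient.Do(req)`. Per the descriptions above, an HTTP 400 response indicates that the step was not in the ERROR state or that the task was not in the FAILED status.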

Pre-change analysis

The API endpoint that updates the step state does not allow setting a state other than DONE.

To be replayed, a step must be put back in the INITIAL state.

All failed steps have to be updated one by one. However, it does not seem to be enforced anywhere that no step remains in a failed state before the task is resumed. (Investigation needed)

Change options

Option 1

The update step state endpoint should now accept a payload indicating the target state.

PUT /deployments/<deployment_id>/tasks/<taskId>/steps/<stepId>

Request body:

{
  "state": "INITIAL"
}
  • API should enforce that state is either DONE or INITIAL
  • API should enforce that the current step state is ERROR
  • An empty request body should still be allowed and treated as a request to set the step state to DONE, for backward compatibility
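
The three rules above can be sketched in Go as a single validation function. This is only an illustration under stated assumptions, not Yorc's actual code: `resolveTargetState`, `stepUpdateRequest` and `errBadRequest` are hypothetical names, and rejecting a JSON object without a `state` field is a choice made for this sketch.

```go
package main

import (
	"bytes"
	"encoding/json"
	"errors"
	"fmt"
)

// Step states named in this document.
const (
	stateInitial = "INITIAL"
	stateDone    = "DONE"
	stateError   = "ERROR"
)

// errBadRequest (hypothetical) marks failures a handler would map to HTTP 400.
var errBadRequest = errors.New("bad request")

// stepUpdateRequest mirrors the proposed body: {"state": "INITIAL"}.
type stepUpdateRequest struct {
	State string `json:"state"`
}

// resolveTargetState (hypothetical helper) returns the state the step should
// be moved to, or an error wrapping errBadRequest.
func resolveTargetState(currentState string, body []byte) (string, error) {
	// Only a step currently in ERROR may be updated.
	if currentState != stateError {
		return "", fmt.Errorf("%w: step is in state %q, expected %q",
			errBadRequest, currentState, stateError)
	}
	// Backward compatibility: an empty body means "mark as DONE".
	if len(bytes.TrimSpace(body)) == 0 {
		return stateDone, nil
	}
	var req stepUpdateRequest
	if err := json.Unmarshal(body, &req); err != nil {
		return "", fmt.Errorf("%w: invalid JSON body: %v", errBadRequest, err)
	}
	// Only DONE and INITIAL are valid targets; a missing "state" field is
	// rejected here (an assumption of this sketch).
	switch req.State {
	case stateDone, stateInitial:
		return req.State, nil
	default:
		return "", fmt.Errorf("%w: unsupported target state %q", errBadRequest, req.State)
	}
}
```

A handler could call `resolveTargetState` with the step's stored state and the raw request body, answer HTTP 400 when `errors.Is(err, errBadRequest)` holds, and otherwise persist the returned state.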

Advantages and drawbacks

This option is easy to implement and keeps the API consistent and backward compatible. However, it does not allow updating several steps at once.

Option 2

The API should expose a new endpoint for updating the states of several steps. The current endpoint for marking steps as manually fixed remains as is.

PUT /deployments/<deployment_id>/tasks/<taskId>/steps

Request body:

{
  "mystep1": {"state": "INITIAL"},
  "mystep2": {"state": "DONE"}
}
  • API should enforce that state is either DONE or INITIAL
  • API should enforce that the current step state is ERROR
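
A possible validation for this batch payload is sketched below in Go. The names `validateBatch` and `stepUpdate` are hypothetical. Rejecting the whole request as soon as one entry is invalid is an assumption of this sketch, since the option does not say whether partial updates would be allowed.

```go
package main

import (
	"encoding/json"
	"fmt"
	"sort"
)

// stepUpdate mirrors one entry of the proposed body: {"state": "INITIAL"}.
type stepUpdate struct {
	State string `json:"state"`
}

// validateBatch (hypothetical helper) checks every requested update against
// the current step states and returns a map of step name to target state.
// Nothing is returned if any entry is invalid (all-or-nothing is an
// assumption of this sketch).
func validateBatch(currentStates map[string]string, body []byte) (map[string]string, error) {
	var updates map[string]stepUpdate
	if err := json.Unmarshal(body, &updates); err != nil {
		return nil, fmt.Errorf("invalid JSON body: %w", err)
	}
	// Sort the step names so that the first reported error is deterministic.
	names := make([]string, 0, len(updates))
	for name := range updates {
		names = append(names, name)
	}
	sort.Strings(names)

	targets := make(map[string]string, len(updates))
	for _, name := range names {
		target := updates[name].State
		if target != "DONE" && target != "INITIAL" {
			return nil, fmt.Errorf("step %q: unsupported target state %q", name, target)
		}
		current, ok := currentStates[name]
		if !ok {
			return nil, fmt.Errorf("step %q: unknown step", name)
		}
		if current != "ERROR" {
			return nil, fmt.Errorf("step %q: current state is %q, expected \"ERROR\"", name, current)
		}
		targets[name] = target
	}
	return targets, nil
}
```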

Advantages and drawbacks

This option allows setting the states of several steps in a single API call. However, it results in two different ways to update step states, which could be a bit confusing.

Option 3

In this option, the current endpoint for marking steps as manually fixed remains as is. The behavior of the resume task endpoint changes: it automatically sets all steps in the ERROR state to INITIAL before running the task again.
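
The reset that the resume endpoint would perform under this option can be sketched in Go as follows. The function name `resetFailedSteps` and the in-memory map of step names to states are assumptions made for illustration; the document does not describe how Yorc stores step states.

```go
package main

import "sort"

// resetFailedSteps (hypothetical helper) moves every step in the ERROR state
// back to INITIAL, in place, and returns the sorted names of the steps it
// reset. Steps in any other state are left untouched.
func resetFailedSteps(stepStates map[string]string) []string {
	reset := make([]string, 0)
	for name, state := range stepStates {
		if state == "ERROR" {
			// Updating the value of an existing key while ranging is safe in Go.
			stepStates[name] = "INITIAL"
			reset = append(reset, name)
		}
	}
	sort.Strings(reset)
	return reset
}
```

Steps already marked as DONE through the manual-fix endpoint are not reset, so they would not be replayed when the task resumes.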

Advantages and drawbacks

This option keeps the API signature unchanged, and several failed steps can be retried in a single API call. But it introduces two different ways to manage step states, one for manually fixed steps and another for retried steps, which makes the API confusing and inconsistent.

Decision

After collecting feedback from Yorc users, especially the Alien4Cloud team, we decided to implement option 1.

The main reasons for this decision are the lower implementation complexity and a consistent API.

Currently there is no need to batch updates of several step states.