Skip to content

Commit

Permalink
Proposal timeouts and typical error page (#823)
Browse files Browse the repository at this point in the history
* timeout proposal
  • Loading branch information
achimweigel authored Sep 25, 2023
1 parent e58f832 commit 25b8e85
Show file tree
Hide file tree
Showing 3 changed files with 127 additions and 0 deletions.
2 changes: 2 additions & 0 deletions docs/guided-tour/README.md
Original file line number Diff line number Diff line change
Expand Up @@ -39,6 +39,8 @@ In this tour, you will learn about the different Landscaper features by simple e

[7. Handling a Delete Error](./error-handling/delete-error)

You can find a list of error messages and corresponding solutions [here](./error-handling/problem_analysis.md).

## Blueprints and Components

[8. An Installation with an Externally Stored Blueprint](./blueprints/external-blueprint)
Expand Down
6 changes: 6 additions & 0 deletions docs/guided-tour/error-handling/problem_analysis.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,6 @@
## Common Issues

This document lists issues and how to fix them.

- **Last error in DeployItem: Op: TemplateChart - Reason: RenderHelmValues**
- Helm templating failed with the provided values
119 changes: 119 additions & 0 deletions docs/proposals/timeouts.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,119 @@
# Refactoring timeouts

This document describes the current state of the timeout implementation, its problems and a potential solution
how these might be resolved.

## Current state

### Which timeouts currently exists?

This is just a short list of timeouts which currently exists:

Globale:
- LandscaperConfiguration.DeployItemTimeouts.Pickup
- LandscaperConfiguration.DeployItemTimeouts.ProgressingDefault

DeployItem:
- DeployItem.DeployItemSpec.Timeout (set on LandscaperConfiguration.DeployItemTimeouts.ProgressingDefault if nothing
is specified)

Helm Deployer
- Configuration.ExportConfiguration.DefaultTimeout
- DeployItem...ProviderConfiguration.DeleteTimeout

- For real helm deployer
- ...HelmInstallConfiguration.Timeout
- ...HelmUpgradeConfiguration.Timeout
- ...HelmDeleteConfiguration.Timeout


Manifest Deployer
- Configuration.ExportConfiguration.DefaultTimeout
- DeployItem...ProviderConfiguration.DeleteTimeout


Readiness-Checks as part of the helm and manifest deployer provider config
- ProviderConfiguration...ReadinessCheckConfiguration.Timeout
- ProviderConfiguration...ReadinessCheckConfiguration.CustomReadinessCheckConfiguration.Timeout

### Current strategy

Currently, either the deployer runs into one of its different and very unintuitive timeouts and sets its state on failure
or the central timeout controller finds out that a DeployItem is processed longer than DeployItem.DeployItemSpec.Timeout
and this controller sets the DeployItem on failure.

### Problems with the current strategy

- If the deployer does not finish in time, the timeout controller sets a DeployItem on failure. This results in two problems:
- If there is some error in the deployer execution, this is not stored in the DeployItem because of a write conflict.
- Even if the timeout controller sets the DeployItem on failure, the deployer might continue processing on it, e.g. if it is
current deploying the helm chart via helm 3 api. Retriggering a new reconcilation will not start working on that
DeployItem before the deployer has finished its work. Currently, it is completely unclear how long this work
might endure.

- Too many possible settings with sometimes strange semantics/implementations
- Understanding the timeouts and their exact semantics is very complicated.
- For example Steampunk: They started setting DeployItem.DeployItemSpec.Timeout=15m but where completely surprised
that some of their readiness checks failed because these have a default timeout of 3 minutes. Therefore, they also set
ProviderConfiguration...ReadinessCheckConfiguration.Timeout=15min for every DeployItem. Similar problems might occur
for exports.
- Even if the overall timeout for a DeployItem is set, no specific timeout is set for the helm chart deployment with
helm 3. What does this mean? How long might the deployer hang in this step and thereby preventing another
reconciliation even after the timeout controller has set the state on failure? Or will the helm 3 deployment fail,
due to some other default settings. There are similar questions for the manifest and helm-manifest deployer, e.g.
what is the default timeout for the http client etc.?
- There are some really strange or even inconsistent implementations e.g.
- Timeout in customreadinesscheck. Checks are executed one after the other in an undefined order. So it depends on
that order how much time was already spend before which might result in a failure of a readiness check timeout.
- deleteTimeout is used in the context of manifest and helm manifest deployer when deleting orphaned resources during
an upgrade. This setting has nothing to do with the deletion of a DeployItem which is very confusing.

- Pickup timeout: If there is a high load on the system, the pickup of particular DeployItems by a deployer is deferred
because the worker threads might be exhausted and therefore their state is set on failure by the timeout controller.
Just waiting some additional time would have been resulted in their proper processing. This resulted in problems for
the Steampunk team.


## Solution proposal

- Remove pickup timeout, set it on a very large value or add some DeployItem specific setting:
- The main reason for the pickup timeout was to resolve the problem, that if someone wrongly configures a DeployItem
such that no deployer would be responsible, that this DeployItem is finished. This use case is mainly relevant during
development. Comparing this with the problem of failed DeployItems in productive systems under heavy load, seems to
be a bad deal. Furthermore, for the development scenario, it is already possible to find out that a DeployItem
is not picked up, by its empty status and such DeployItems could be stopped by the interrupt annotation.
Therefore, the removal of this timeout or setting it on a high value might be a good idea.
- If we want to be more defencive, we could add a pickup timeout to the DeployItem specification itself. Then the
customer could decide by itself if he wants this feature e.g. in his development scenario.

- Reduce number of processing timeout settings
- Motivation: It is less complicated for a user to only set one timeout for the deployment or deletion of a DeployItem.
Even finding here the right value is already complicated but to take all the other current settings into account
is nearly impossible.
- Only one processing timeout on deploy item level which determines how long the deployment is allowed to require.
- Only one delete timeout on deploy item level which determines how long the deletion is allowed to require.
(if not set, use some default or use same as for progressing timeout)
- Timeout controller:
- Exclude the manifest and helm deployer from the timeout controller
- When we have also found a similar solution for the container deployer we could perhaps completely remove the central
processing timeout controller?
- Deployment/Upgrade/Deletion
- Check the timeout between the main steps like templating, deployment, readiness checks, export generation. A
particular step gets an allowed duration computed by (timeout - required duration of steps before). When the timout
is exceeded the processing is interrupted and the DeployItem set on failed.
- Running requests like helm 3 deployments or manifest deployments are canceled if they require too much time
- The helm 3 timeout is recomputed with respect to the remaining duration - no matter if it is set or not.
- Deletions are handled the same way.

## Open Question

- When does the processing timeout start? When a DeployItem is triggered of when the deployer starts working on it?
- Alternative two has the advantage, that a system under load, where it might need some time until a deployer could
pick up a DeployItem, is not running out of time. It seems also to be much easier for a customer to know how long a
deployment for a DeployItems needs, compared to how long a deployer requires to pick up a DeployItem.

- In case of an error and an immediate retry of the reconciliation of the DeployItem, should the required time of the
erroneous reconciliations from before be taken into account and removed for the allowed processing time?
Probably yes as it prevents infinite loops.


0 comments on commit 25b8e85

Please sign in to comment.