More resilient error handling for migrations #26143

Closed
spalger opened this issue Nov 24, 2018 · 5 comments
Labels
Feature:Saved Objects · Team:Core (Core services & architecture: plugins, logging, config, saved objects, http, ES client, i18n, etc)

Comments

@spalger
Contributor

spalger commented Nov 24, 2018

We need to improve the error handling situation around migrations. I think we should start by making it very clear when a migration has failed and why, and by trying to recover automatically. After that we can try automatically discarding saved objects that fail to migrate for one reason or another and offer the user some way to recover them with manual intervention.

@spalger added the Team:Operations label on Nov 24, 2018
@elasticmachine
Contributor

Pinging @elastic/kibana-operations

@chrisdavies
Contributor

There are a handful of problems:

A. Migrations repeatedly run if the mappings have an unexpected value (e.g. from Elasticsearch or index templates), solved by this PR: #28252
B. There is no visibility for the end user into whether migrations are running or stuck in an error state
C. Recovering from failed migrations should be automatic in many circumstances

Improve UX

Users have waited hours before realizing that Kibana is failing to start. They just think it's migrating.

Show a migration progress screen (issue: #23489), and have it indicate that something is wrong if no progress has been made in ~5 minutes or if migrations are in an error state.

When migrations fail due to a failure in the migration process itself, cache the error in Elasticsearch so that all Kibana instances can read and log it (issue: #26144).

Where should we cache this error? In _meta of the new .kibana_x index? In a migration-specific index .migration_status?
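
If we did go with _meta on the new index, reading and writing the cached status could look roughly like the sketch below. This is a sketch only: the migrationStatus shape, index name, and client wiring are assumptions, not the actual Kibana implementation (it assumes the @elastic/elasticsearch client with 7.x-style responses, where results carry a body property).

```ts
import { Client } from '@elastic/elasticsearch';

const client = new Client({ node: 'http://localhost:9200' });

interface MigrationStatusMeta {
  state: 'running' | 'failed';
  error?: string;    // message of the failure that stopped the migration
  failedAt?: string; // ISO timestamp, useful for deciding when to retry
}

// Record a failure on the target index so every Kibana instance can see it.
async function recordMigrationFailure(targetIndex: string, err: Error): Promise<void> {
  const migrationStatus: MigrationStatusMeta = {
    state: 'failed',
    error: err.message,
    failedAt: new Date().toISOString(),
  };
  // The put-mapping API can update _meta without touching field mappings.
  await client.indices.putMapping({
    index: targetIndex,
    body: { _meta: { migrationStatus } },
  });
}

// Any Kibana instance can read the cached error and log it on startup.
async function readMigrationFailure(targetIndex: string): Promise<MigrationStatusMeta | undefined> {
  const { body } = await client.indices.getMapping({ index: targetIndex });
  const status: MigrationStatusMeta | undefined =
    body[targetIndex]?.mappings?._meta?.migrationStatus;
  if (status?.state === 'failed') {
    console.error(`Migration into ${targetIndex} failed at ${status.failedAt}: ${status.error}`);
  }
  return status;
}
```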

Automatic Recovery

Recovering from failed migrations should be automatic, if possible

Migrations fail if:

  • An error occurs while running migrations (e.g. a doc transform fails)
  • Kibana bombs while running migrations (e.g. some orthogonal failure, in which case we can re-attempt)
  • Elasticsearch or network failure (in which case we can reattempt)

We can probably implement some automated recovery by doing something like this:

  • If no progress has been made in {threshold} time, Kibana can delete the latest target index (if it exists), and re-attempt the migration
  • It will never delete the index that .kibana points to
  • It will never delete any index lower than the one .kibana points to
  • If migrations fail more than {N} times, we no longer automatically attempt to recover
  • We provide some simple mechanism for ops folks to tell Kibana to reattempt (e.g. a tool / API that clears the stored migration status or something)
  • Maybe we have an increasing window of time between recovery attempts?
    • e.g. 1 minute, 2 minutes, 4 minutes, 8 minutes, 16 minutes, fail

I'm a bit nervous about deleting indices, though, as a bug here risks data-loss.
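
To make the guard rails and the back-off window concrete, here's a rough TypeScript sketch of the policy above. Everything in it is hypothetical: runMigration, deleteIndex, and resolveCurrentIndex are placeholder helpers, the thresholds are just the example values from this comment, and the "no progress within {threshold}" detection is left out for brevity.

```ts
// Hypothetical sketch of the retry policy described above; not Kibana code.
const MAX_ATTEMPTS = 6;       // the {N} from the list above
const BASE_DELAY_MS = 60_000; // 1 minute, doubled after each failure

async function migrateWithRecovery(
  targetIndex: string,                        // e.g. '.kibana_3'
  resolveCurrentIndex: () => Promise<string>, // index the .kibana alias points to
  runMigration: () => Promise<void>,
  deleteIndex: (index: string) => Promise<void>
): Promise<void> {
  for (let attempt = 1; attempt <= MAX_ATTEMPTS; attempt++) {
    try {
      await runMigration();
      return; // success
    } catch (err) {
      const current = await resolveCurrentIndex();
      // Guard rails: never delete the index .kibana points to, nor any
      // index at or below it; only the newer, half-built target is deleted.
      if (indexVersion(targetIndex) > indexVersion(current)) {
        await deleteIndex(targetIndex);
      }
      if (attempt === MAX_ATTEMPTS) {
        // Stop automatic recovery; an operator has to tell Kibana to retry.
        throw new Error(`Migration failed after ${MAX_ATTEMPTS} attempts: ${err}`);
      }
      // Increasing window between attempts: 1, 2, 4, 8, 16 minutes, then fail.
      await sleep(BASE_DELAY_MS * 2 ** (attempt - 1));
    }
  }
}

// Extracts N from '.kibana_N'; the bare '.kibana' index counts as version 0.
function indexVersion(index: string): number {
  const match = /_(\d+)$/.exec(index);
  return match ? Number(match[1]) : 0;
}

const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));
```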

Thoughts?

@spalger
Contributor Author

spalger commented Feb 7, 2019

Yeah, those goals sound right to me. I tried to plot out how this might work in #26144, which is pretty complex, but I think the complexity is justified to make sure that two Kibanas don't try to delete the old index and restart at the same time.

This is prevented in the current implementation because index creation is the "lock" that we're using to make sure only one Kibana runs the migration, but that doesn't work once Kibana is allowed to delete "old" migration target indexes in order to retry.

If one Kibana deletes a failed migration target index to retry and a second Kibana does the same at the same time, the second might actually be deleting the new target index just created by the first, and the first Kibana wouldn't have a way to know that.

That said, there are certainly other options that I haven't thought through, like maybe incrementing the .kibana_{n} on each retry, which might allow us to make things simpler.
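
For context, the "index creation as a lock" behaviour could be sketched like this (simplified error handling, assuming the @elastic/elasticsearch 7.x client; not the actual migration code):

```ts
// Sketch of using index creation as the migration "lock"; simplified and
// hypothetical, not the actual Kibana implementation.
import { Client } from '@elastic/elasticsearch';

const client = new Client({ node: 'http://localhost:9200' });

// Returns true if this Kibana instance won the race and should run the
// migration; false if another instance already created the target index.
async function acquireMigrationLock(targetIndex: string): Promise<boolean> {
  try {
    await client.indices.create({ index: targetIndex });
    return true;
  } catch (err: any) {
    // Only one create can succeed; everyone else sees this error and waits
    // for the winner to finish the migration and switch the .kibana alias.
    if (err?.body?.error?.type === 'resource_already_exists_exception') {
      return false;
    }
    throw err;
  }
}
```

Once instances are also allowed to delete a stalled target index, a create-only lock like this no longer guarantees mutual exclusion, which is exactly the race described above.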

@tylersmalley added the Feature:Saved Objects and Team:Core labels on Mar 26, 2020
@elasticmachine
Contributor

Pinging @elastic/kibana-platform (Team:Platform)

@tylersmalley removed the Team:Operations label on Mar 26, 2020
@rudolf
Contributor

rudolf commented Jun 15, 2020

Closing in favour of #66056

@rudolf closed this as completed on Jun 15, 2020