More resilient error handling for migrations #26143

Closed
spalger opened this issue Nov 24, 2018 · 5 comments
Labels
Feature:Saved Objects · Team:Core (Core services & architecture: plugins, logging, config, saved objects, http, ES client, i18n, etc)

Comments

@spalger
Contributor

spalger commented Nov 24, 2018

We need to improve the error handling situation around migrations. I think we should start by making it very clear when a migration has failed and why, and by trying to recover automatically. After that we can try automatically discarding saved objects that fail to migrate for one reason or another and offer the user some way to recover them with manual intervention.

@spalger added the Team:Operations label on Nov 24, 2018
@elasticmachine
Contributor

Pinging @elastic/kibana-operations

@chrisdavies
Contributor

There are a handful of problems:

A. Migrations repeatedly run if the mappings have an unexpected value (e.g. from Elasticsearch or index templates), solved by this PR: #28252
B. There is no visibility for the end user into whether migrations are running or stuck in an error state
C. Recovering from failed migrations should be automatic in many circumstances

Improve UX

Users have waited hours before realizing that Kibana is failing to start. They just think it's migrating.

Show a migration progress screen (issue: #23489), and have it indicate that something is wrong if no progress has been made in ~5 minutes or if migrations are in an error state.

When migrations fail due to a failure in the migration process itself, cache the error in Elasticsearch so that all Kibana instances can read and log it (issue: #26144).

Where should we cache this error? In _meta of the new .kibana_x index? In a migration-specific index .migration_status?
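
If we did go with _meta on the new index, reading and writing the cached status could look roughly like the sketch below. This is a sketch only: the migrationStatus shape, index name, and client wiring are assumptions, not the actual Kibana implementation (it assumes the @elastic/elasticsearch client with 7.x-style responses, where results carry a body property).

```ts
import { Client } from '@elastic/elasticsearch';

const client = new Client({ node: 'http://localhost:9200' });

interface MigrationStatusMeta {
  state: 'running' | 'failed';
  error?: string;    // message of the failure that stopped the migration
  failedAt?: string; // ISO timestamp, useful for deciding when to retry
}

// Record a failure on the target index so every Kibana instance can see it.
async function recordMigrationFailure(targetIndex: string, err: Error): Promise<void> {
  const migrationStatus: MigrationStatusMeta = {
    state: 'failed',
    error: err.message,
    failedAt: new Date().toISOString(),
  };
  // The put-mapping API can update _meta without touching field mappings.
  await client.indices.putMapping({
    index: targetIndex,
    body: { _meta: { migrationStatus } },
  });
}

// Any Kibana instance can read the cached error and log it on startup.
async function readMigrationFailure(targetIndex: string): Promise<MigrationStatusMeta | undefined> {
  const { body } = await client.indices.getMapping({ index: targetIndex });
  const status: MigrationStatusMeta | undefined =
    body[targetIndex]?.mappings?._meta?.migrationStatus;
  if (status?.state === 'failed') {
    console.error(`Migration into ${targetIndex} failed at ${status.failedAt}: ${status.error}`);
  }
  return status;
}
```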

Automatic Recovery

Recovering from failed migrations should be automatic, if possible

Migrations fail if:

  • An error occurs while running migrations (e.g. a doc transform fails)
  • Kibana bombs while running migrations (e.g. some orthogonal failure, in which case we can re-attempt)
  • Elasticsearch or network failure (in which case we can reattempt)

We can probably implement some automated recovery by doing something like this:

  • If no progress has been made in {threshold} time, Kibana can delete the latest target index (if it exists), and re-attempt the migration
  • It will never delete the index that .kibana points to
  • It will never delete any index lower than the one .kibana points to
  • If migrations fail more than {N} times, we no longer automatically attempt to recover
  • We provide some simple mechanism for ops folks to tell Kibana to reattempt (e.g. a tool / API that clears the stored migration status or something)
  • Maybe we have an increasing window of time between recovery attempts?
    • e.g. 1 minute, 2 minutes, 4 minutes, 8 minutes, 16 minutes, fail

I'm a bit nervous about deleting indices, though, as a bug here risks data-loss.
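
To make the guard rails and the back-off window concrete, here's a rough TypeScript sketch of the policy above. Everything in it is hypothetical: runMigration, deleteIndex, and resolveCurrentIndex are placeholder helpers, the thresholds are just the example values from this comment, and the "no progress within {threshold}" detection is left out for brevity.

```ts
// Hypothetical sketch of the retry policy described above; not Kibana code.
const MAX_ATTEMPTS = 6;       // the {N} from the list above
const BASE_DELAY_MS = 60_000; // 1 minute, doubled after each failure

async function migrateWithRecovery(
  targetIndex: string,                        // e.g. '.kibana_3'
  resolveCurrentIndex: () => Promise<string>, // index the .kibana alias points to
  runMigration: () => Promise<void>,
  deleteIndex: (index: string) => Promise<void>
): Promise<void> {
  for (let attempt = 1; attempt <= MAX_ATTEMPTS; attempt++) {
    try {
      await runMigration();
      return; // success
    } catch (err) {
      const current = await resolveCurrentIndex();
      // Guard rails: never delete the index .kibana points to, nor any
      // index at or below it; only the newer, half-built target is deleted.
      if (indexVersion(targetIndex) > indexVersion(current)) {
        await deleteIndex(targetIndex);
      }
      if (attempt === MAX_ATTEMPTS) {
        // Stop automatic recovery; an operator has to tell Kibana to retry.
        throw new Error(`Migration failed after ${MAX_ATTEMPTS} attempts: ${err}`);
      }
      // Increasing window between attempts: 1, 2, 4, 8, 16 minutes, then fail.
      await sleep(BASE_DELAY_MS * 2 ** (attempt - 1));
    }
  }
}

// Extracts N from '.kibana_N'; the bare '.kibana' index counts as version 0.
function indexVersion(index: string): number {
  const match = /_(\d+)$/.exec(index);
  return match ? Number(match[1]) : 0;
}

const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));
```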

Thoughts?

@spalger
Contributor Author

spalger commented Feb 7, 2019

Yeah, those goals sound right to me. I tried to plot out how this might work in #26144, which is pretty complex, but I think the complexity is justified to make sure that two Kibanas don't try to delete the old index and restart at the same time.

This is prevented in the current implementation because index creation is the "lock" that we're using to make sure only one Kibana runs the migration, but that doesn't work once Kibana is allowed to delete "old" migration target indexes in order to retry.

If one Kibana deletes a failed migration target index to retry and a second Kibana does the same at the same time, the second might actually be deleting the new target index just created by the first, and the first Kibana wouldn't have a way to know that.

That said, there are certainly other options that I haven't thought through, like maybe incrementing the .kibana_{n} on each retry, which might allow us to make things simpler.
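
For context, the "index creation as a lock" behaviour could be sketched like this (simplified error handling, assuming the @elastic/elasticsearch 7.x client; not the actual migration code):

```ts
// Sketch of using index creation as the migration "lock"; simplified and
// hypothetical, not the actual Kibana implementation.
import { Client } from '@elastic/elasticsearch';

const client = new Client({ node: 'http://localhost:9200' });

// Returns true if this Kibana instance won the race and should run the
// migration; false if another instance already created the target index.
async function acquireMigrationLock(targetIndex: string): Promise<boolean> {
  try {
    await client.indices.create({ index: targetIndex });
    return true;
  } catch (err: any) {
    // Only one create can succeed; everyone else sees this error and waits
    // for the winner to finish the migration and switch the .kibana alias.
    if (err?.body?.error?.type === 'resource_already_exists_exception') {
      return false;
    }
    throw err;
  }
}
```

Once instances are also allowed to delete a stalled target index, a create-only lock like this no longer guarantees mutual exclusion, which is exactly the race described above.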

@tylersmalley added the Feature:Saved Objects and Team:Core labels on Mar 26, 2020
@elasticmachine
Contributor

Pinging @elastic/kibana-platform (Team:Platform)

@tylersmalley removed the Team:Operations label on Mar 26, 2020
@rudolf
Contributor

rudolf commented Jun 15, 2020

Closing in favour of #66056

@rudolf closed this as completed on Jun 15, 2020