More resilient error handling for migrations #26143
Comments
Pinging @elastic/kibana-operations
There are a handful of problems:
A. Migrations repeatedly run if the mappings have an unexpected value (e.g. from Elasticsearch or index templates), solved by this PR: #28252
Improve UX
Users have waited hours before realizing that Kibana is failing to start. They just think it's migrating.
Show a migration progress screen (issue: #23489), and have it indicate that something is wrong if no progress in ~5 mins or if migrations are in an error state.
When migrations fail due to a failure in the migration process itself, cache the error in Elasticsearch so that all Kibana instances can read that and log it (issue: #26144).
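To make the "no progress in ~5 mins" idea concrete, here is a minimal sketch of how a progress screen could classify what it shows. The `MigrationState` shape, the `classifyMigration` name, and the threshold constant are all hypothetical, chosen for illustration; nothing here is taken from Kibana's code.

```typescript
// Hypothetical shape of the state a progress screen might poll; not a Kibana type.
type MigrationState = {
  status: "running" | "error";
  lastProgressAt: number; // epoch milliseconds of the last observed progress
};

type Display = "migrating" | "stalled" | "failed";

// Assumed threshold: roughly five minutes without progress counts as stalled.
const STALL_THRESHOLD_MS = 5 * 60 * 1000;

// An explicit error always wins; otherwise compare idle time to the threshold.
function classifyMigration(
  state: MigrationState,
  now: number,
  thresholdMs: number = STALL_THRESHOLD_MS
): Display {
  if (state.status === "error") {
    return "failed";
  }
  return now - state.lastProgressAt >= thresholdMs ? "stalled" : "migrating";
}
```

Passing `now` as a parameter rather than reading the clock inside the function keeps the check deterministic and easy to test.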
Automatic Recovery
Recovering from failed migrations should be automatic, if possible.
Migrations fail if:
We can probably implement some automated recovery by doing something like this:
I'm a bit nervous about deleting indices, though, as a bug here risks data-loss. Thoughts?
Yeah, those goals sound right to me. I tried to plot out how this might work in #26144, which is pretty complex, but I think the complexity is justified to make sure that two Kibanas don't try to delete the old index and restart at the same time. This is prevented in the current implementation because index creation is the "lock" that we're using to make sure only one Kibana runs the migration, but that doesn't work when Kibana is allowed to delete "old" migration target indexes in order to retry. If a Kibana deletes a failed migration target index and a second does the same at the same time, it might actually be deleting the new target index created by the first Kibana, and the first Kibana wouldn't have a way to know that. That said, there are certainly other options that I haven't thought through, like maybe incrementing the
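The race described above can be sketched with an in-memory stand-in for the cluster. Everything in this block is an assumption made for illustration: `FakeCluster`, `tryAcquireLock`, `naiveRetry`, the instance ids, and the index name `.kibana_2` are hypothetical and are not Elasticsearch or Kibana APIs. The interleaving is stepped by hand to show how a delete-then-create retry lets two instances both believe they hold the lock.

```typescript
// Hypothetical in-memory stand-in for a cluster; not a real client API.
class FakeCluster {
  private indices = new Map<string, string>(); // index name -> creator id

  // Mimics atomic index creation: succeeds for exactly one caller per name.
  createIndex(name: string, creator: string): boolean {
    if (this.indices.has(name)) {
      return false;
    }
    this.indices.set(name, creator);
    return true;
  }

  deleteIndex(name: string): void {
    this.indices.delete(name);
  }

  ownerOf(name: string): string | undefined {
    return this.indices.get(name);
  }
}

// Index creation as the lock: whoever creates the target runs the migration.
function tryAcquireLock(cluster: FakeCluster, target: string, id: string): boolean {
  return cluster.createIndex(target, id);
}

// Naive retry: delete the failed target, then create it again.
// The delete cannot tell a failed index from a fresh one made by a peer.
function naiveRetry(cluster: FakeCluster, target: string, id: string): boolean {
  cluster.deleteIndex(target);
  return cluster.createIndex(target, id);
}

const cluster = new FakeCluster();
const target = ".kibana_2"; // illustrative index name

// Without retries, only one instance wins the lock.
const firstWins = tryAcquireLock(cluster, target, "kibana-a");
const secondWins = tryAcquireLock(cluster, target, "kibana-b");

// Suppose the migration failed and both instances decide to retry.
const aRetry = naiveRetry(cluster, target, "kibana-a");
const bRetry = naiveRetry(cluster, target, "kibana-b"); // deletes A's fresh index
const finalOwner = cluster.ownerOf(target);
```

After this sequence both `aRetry` and `bRetry` are `true`, yet the surviving index belongs to `kibana-b`, and `kibana-a` has no signal that its target was replaced. Any retry scheme would need some extra coordination beyond plain delete-and-create to rule this out.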
Pinging @elastic/kibana-platform (Team:Platform)
Closing in favour of #66056
We need to improve the error handling situation around migrations. I think we should start by making it very clear when a migration failed, why it failed, and trying to automatically recover. After that we can try automatically discarding saved objects that fail to migrate for one reason or another and offer the user some way to recover them with manual intervention.