[Alerting] Gracefully restore failed rules from pre-7.11 #117593

Closed
ymao1 opened this issue Nov 4, 2021 · 3 comments · Fixed by #124304
Comments

@ymao1
Contributor

ymao1 commented Nov 4, 2021

In this issue we have determined that for rules created prior to 7.11, the associated task manager document does not contain the schedule field:

Example pre-7.11 rule task doc
{
	"migrationVersion": {
		"task": "7.6.0"
	},
	"task": {
		"taskType": "[[elided - pmuellr]]",
		"retryAt": "2021-09-12T21:30:05.288Z",
		"runAt": "2021-09-12T20:44:02.033Z",
		"scope": [
			"alerting"
		],
		"startedAt": "2021-09-12T21:10:05.287Z",
		"state": """{"
		alertInstances ":{},"
		previousStartedAt ":"
		2021 - 09 - 12 T20: 43: 02.033 Z "}""",
		"params": "[[elided - pmuellr]]",
		"ownerId": "kibana:1ef3f039-cb28-486b-9046-01a41adb06e8",
		"scheduledAt": "2021-07-01T11:27:40.684Z",
		"attempts": 3,
		"status": "failed"
	},
	"references": [],
	"updated_at": "2021-09-12T21:10:05.339Z",
	"coreMigrationVersion": "7.14.1",
	"type": "task"
}

When rules are running normally and Kibana is upgraded to 7.11+, after the next normal execution, the task manager doc will be updated with the schedule field.

However, if a rule has already reached its maxAttempts value of 3, then when Kibana is upgraded to 7.11+ the task manager updateByQuery script will mark the rule as failed, because its task has no schedule and the number of attempts has reached the limit. We want to make sure these rules continue running, so we propose two mitigations:

  1. Add a task manager migration to look for alerting tasks where the schedule field is missing. Reset attempts to 0 and status to idle. This should ensure that task manager can start claiming these tasks again. (A sketch of such a migration follows this list.)
  2. Add a step to the alerting task runner that checks to see if the associated task document contains the schedule field. If it does not, update the task document to include it. This should ensure that the alerting rule task will not reach this state again.
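For the first mitigation, a minimal sketch of what such a migration could look like, assuming a saved-object migration hook that receives each task document. The `TaskDoc` type and `isAlertingTask` helper are illustrative placeholders, not Kibana's actual task manager types or registration API:

```ts
// Hypothetical shapes for illustration only; the real migration would be
// registered with the task manager plugin's saved-object migrations.
interface TaskDoc {
  attributes: {
    taskType: string;
    schedule?: { interval: string };
    attempts: number;
    status: string;
  };
}

// Alerting task types are namespaced as `alerting:<ruleTypeId>`.
const isAlertingTask = (doc: TaskDoc): boolean =>
  doc.attributes.taskType.startsWith('alerting:');

// Reset failed pre-7.11 alerting tasks that never received a `schedule` field
// so task manager can start claiming and running them again.
export const resetPre711AlertingTasks = (doc: TaskDoc): TaskDoc => {
  if (!isAlertingTask(doc) || doc.attributes.schedule != null) {
    return doc; // not an alerting task, or already has a schedule
  }
  return {
    ...doc,
    attributes: {
      ...doc.attributes,
      attempts: 0,
      status: 'idle',
    },
  };
};
```

The `schedule` check means tasks that already picked up a schedule from a normal post-upgrade execution are left untouched; only the stuck pre-7.11 tasks are reset.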
@ymao1 ymao1 added Feature:Alerting/RulesFramework Issues related to the Alerting Rules Framework Team:ResponseOps Label for the ResponseOps team (formerly the Cases and Alerting teams) labels Nov 4, 2021
@elasticmachine
Contributor

Pinging @elastic/kibana-alerting-services (Team:Alerting Services)

@mikecote
Contributor

  2. Add a step to the alerting task runner that checks to see if the associated task document contains the schedule field. If it does not, update the task document to include it. This should ensure that the alerting rule task will not reach this state again.

This currently happens at the end of the rule execution here. By returning a schedule to the task manager at the end, it will override the schedule (not runAt) associated with the task. This way, if we failed to update the task when the user changed the schedule, the update will happen after the next execution.
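As a rough sketch of that pattern (placeholder types and helpers, not the actual alerting task runner code), the run result returned to task manager carries the rule's current schedule, so it gets written onto the task document after every execution, even if the document predates 7.11 and lacks the field:

```ts
// Illustrative sketch only; not the real alerting task runner implementation.
interface RuleRunResult {
  state: Record<string, unknown>;
  schedule: { interval: string };
}

// Placeholder for the actual rule execution logic.
async function executeRule(): Promise<RuleRunResult> {
  return { state: {}, schedule: { interval: '1m' } };
}

// The `schedule` in the returned result is what task manager persists on the
// task document, overriding whatever (if anything) was stored there before.
export async function run(): Promise<RuleRunResult> {
  const { state, schedule } = await executeRule();
  return { state, schedule };
}
```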

  1. Add a task manager migration to look for alerting tasks where the schedule field is missing. Reset attempts to 0 and status to idle. This should ensure that task manager can start claiming these tasks again.

++ I think this will be sufficient to make the rule run again and update the task's schedule after the first run.

@mikecote
Contributor

mikecote commented Jan 24, 2022

After speaking with @kobelb, it seems better to fix this if the change turns out to be small. It is fairly common for pre-7.11 rules to fail continuously, and people have gotten into the habit of disabling and re-enabling them as a workaround.

Adding to 8.1/8.2 plans.

@ersin-erdal ersin-erdal self-assigned this Jan 28, 2022
@kobelb kobelb added the needs-team Issues missing a team label label Jan 31, 2022
@botelastic botelastic bot removed the needs-team Issues missing a team label label Jan 31, 2022
@ersin-erdal ersin-erdal moved this from In Progress to In Review in AppEx: ResponseOps - Execution & Connectors Feb 3, 2022
Repository owner moved this from In Review to Done in AppEx: ResponseOps - Execution & Connectors Feb 9, 2022