[Alerting] Gracefully restore failed rules from pre-7.11 #117593

ymao1 · 2021-11-04T20:19:39Z

In this issue we have determined that for rules created prior to 7.11, the associated task manager document does not contain the schedule field:

Example pre-7.11 rule task doc

{
	"migrationVersion": {
		"task": "7.6.0"
	},
	"task": {
		"taskType": "[[elided - pmuellr]]",
		"retryAt": "2021-09-12T21:30:05.288Z",
		"runAt": "2021-09-12T20:44:02.033Z",
		"scope": [
			"alerting"
		],
		"startedAt": "2021-09-12T21:10:05.287Z",
		"state": """{"alertInstances":{},"previousStartedAt":"2021-09-12T20:43:02.033Z"}""",
		"params": "[[elided - pmuellr]]",
		"ownerId": "kibana:1ef3f039-cb28-486b-9046-01a41adb06e8",
		"scheduledAt": "2021-07-01T11:27:40.684Z",
		"attempts": 3,
		"status": "failed"
	},
	"references": [],
	"updated_at": "2021-09-12T21:10:05.339Z",
	"coreMigrationVersion": "7.14.1",
	"type": "task"
}

When rules are running normally and Kibana is upgraded to 7.11+, after the next normal execution, the task manager doc will be updated with the schedule field.

However, if a rule has already reached its maxAttempts value of 3 when Kibana is upgraded to 7.11+, the task manager updateByQuery script marks the rule's task as failed, because the task has no schedule and its number of attempts has reached the limit. We want to make sure these rules continue running, so we propose two mitigations:

Add a task manager migration to look for alerting tasks where the schedule field is missing. Reset attempts to 0 and status to idle. This should ensure that task manager can start claiming these tasks again.
Add a step to the alerting task runner that checks to see if the associated task document contains the schedule field. If it does not, update the task document to include it. This should ensure that the alerting rule task will not reach this state again.
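
A minimal sketch of the first proposal, assuming a simplified task document shape. The `TaskDoc` interface and the function name `resetTasksWithoutSchedule` are hypothetical and are not Kibana's actual migration API. The check on `scope` mirrors the example document above; a real migration may identify alerting tasks differently.

```typescript
// Hypothetical, simplified shape of a task manager saved object.
interface TaskDoc {
  type: 'task';
  task: {
    taskType: string;
    scope?: string[];
    status: string;
    attempts: number;
    schedule?: { interval: string };
  };
}

// Hypothetical migration step: for alerting tasks that have no schedule,
// reset attempts to 0 and status to idle so the task can be claimed again.
function resetTasksWithoutSchedule(doc: TaskDoc): TaskDoc {
  const { task } = doc;
  // Assumption: alerting tasks are recognized by their scope.
  const isAlertingTask = (task.scope || []).indexOf('alerting') !== -1;
  if (!isAlertingTask || task.schedule !== undefined) {
    // Not an affected task: leave the document untouched.
    return doc;
  }
  return { ...doc, task: { ...task, attempts: 0, status: 'idle' } };
}
```

Tasks that already carry a schedule, and tasks outside the alerting scope, pass through unchanged, so the step can be applied to every task document safely.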

elasticmachine · 2021-11-04T20:19:53Z

Pinging @elastic/kibana-alerting-services (Team:Alerting Services)

mikecote · 2022-01-24T13:52:06Z

Add a step to the alerting task runner that checks to see if the associated task document contains the schedule field. If it does not, update the task document to include it. This should ensure that the alerting rule task will not reach this state again.

This currently happens at the end of the rule execution. By returning a schedule to the task manager at the end of each run, the task runner overrides the schedule (not runAt) associated with the task. This way, if we fail to update the task when the user changes the rule's schedule, the update happens after the next execution.
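
A rough sketch of that behavior, assuming simplified types. `RunResult`, `TaskInstance`, `buildRunResult`, and `applyRunResult` are hypothetical names for illustration and do not reflect the actual task manager or task runner code.

```typescript
interface IntervalSchedule {
  interval: string;
}

// Hypothetical result returned by the rule task runner after an execution.
interface RunResult {
  state: Record<string, unknown>;
  schedule?: IntervalSchedule;
}

// Hypothetical, simplified task instance as stored by task manager.
interface TaskInstance {
  state: Record<string, unknown>;
  schedule?: IntervalSchedule;
}

// The runner always reports the rule's current schedule with its result.
function buildRunResult(
  ruleSchedule: IntervalSchedule,
  state: Record<string, unknown>
): RunResult {
  return { state, schedule: ruleSchedule };
}

// Task manager side: a returned schedule overrides the stored one, so a
// task document without a schedule gains one after a successful run.
function applyRunResult(task: TaskInstance, result: RunResult): TaskInstance {
  const schedule = result.schedule !== undefined ? result.schedule : task.schedule;
  return { ...task, state: result.state, schedule };
}
```

Under these assumptions, a pre-7.11 task with no schedule picks one up the first time the rule runs to completion.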

Add a task manager migration to look for alerting tasks where the schedule field is missing. Reset attempts to 0 and status to idle. This should ensure that task manager can start claiming these tasks again.

++ I think this will be sufficient to make the rule run again and update the task's schedule after the first run.

mikecote · 2022-01-24T14:52:28Z

After speaking with @kobelb, it feels better to fix this if the change turns out to be small. It can be common for pre-7.11 rules to fail continuously, and people have gotten into the habit of disabling and re-enabling them as a fix.

Adding to 8.1/8.2 plans.

ymao1 added the Feature:Alerting/RulesFramework and Team:ResponseOps labels Nov 4, 2021

ymao1 mentioned this issue Nov 4, 2021

[alerting] rule tasks that fail 3 times are never run again, with no indication in the rule #116321

Closed

mikecote added this to AppEx: ResponseOps - Execution & Connectors Jan 4, 2022

ymao1 mentioned this issue Jan 13, 2022

Alerting rules can end up in a state where they stop running indefinitely until a user intervenes to fix the problem #119650

Closed

mikecote moved this to Todo in AppEx: ResponseOps - Execution & Connectors Jan 24, 2022

ersin-erdal self-assigned this Jan 28, 2022

ersin-erdal moved this from Todo to In Progress in AppEx: ResponseOps - Execution & Connectors Jan 28, 2022

kobelb added the needs-team label Jan 31, 2022

botelastic bot removed the needs-team label Jan 31, 2022

ersin-erdal mentioned this issue Feb 1, 2022

Migrate the pre-7.11 tasks that has no schedule field. #124304

Merged

ersin-erdal moved this from In Progress to In Review in AppEx: ResponseOps - Execution & Connectors Feb 3, 2022

ersin-erdal closed this as completed in #124304 Feb 9, 2022

Repository owner moved this from In Review to Done in AppEx: ResponseOps - Execution & Connectors Feb 9, 2022
