
[Alerting] Smarter retry interval for ES Connectivity errors #122390

Closed
ymao1 opened this issue Jan 5, 2022 · 2 comments · Fixed by #123642
Labels
estimate:small (Small Estimated Level of Effort), Feature:Alerting/RulesFramework (Issues related to the Alerting Rules Framework), Team:ResponseOps (Label for the ResponseOps team, formerly the Cases and Alerting teams)

Comments

@ymao1
Contributor

ymao1 commented Jan 5, 2022

When an alerting rule execution encounters an error, it reschedules itself for the next schedule interval. With this PR, Kibana Core introduced a specific error type, EsUnavailableError, for transient ES connectivity errors. We should look for these errors in the alerting task runner and consider handling them differently from other errors, based on the rule's configured schedule interval. If a rule is scheduled to run every 24 hours and it encounters a transient connectivity issue, it would make more sense to schedule its retry for a shorter interval, like 5 minutes, instead of waiting another 24 hours for the next execution.

Towards #119650
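
A minimal sketch of the idea, assuming a hypothetical 5-minute cap and a predicate for recognizing the new error type; the names `getRetryDelayMs`, `isEsUnavailableError`, and `CONNECTIVITY_RETRY_MS` are illustrative, not the implementation that shipped in the linked PR:

```ts
// Illustrative sketch only: the helper names and the 5-minute cap are assumptions.
const CONNECTIVITY_RETRY_MS = 5 * 60 * 1000; // assumed 5-minute cap

interface RetryArgs {
  error: Error;
  scheduleIntervalMs: number; // the rule's configured schedule interval, in ms
  isEsUnavailableError: (e: Error) => boolean; // assumed predicate for the core error type
}

function getRetryDelayMs({ error, scheduleIntervalMs, isEsUnavailableError }: RetryArgs): number {
  // Transient ES connectivity errors: retry sooner than a long schedule interval
  // would allow; rules with short intervals keep their normal schedule.
  if (isEsUnavailableError(error)) {
    return Math.min(scheduleIntervalMs, CONNECTIVITY_RETRY_MS);
  }
  // All other errors keep the existing behavior: retry at the configured interval.
  return scheduleIntervalMs;
}

// Example: a rule scheduled every 24h that hits a connectivity error retries in ~5m;
// a rule scheduled every 1m keeps its 1m schedule.
```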

@ymao1 added the Team:ResponseOps, Feature:Alerting/RulesFramework, and estimate:small labels on Jan 5, 2022
@elasticmachine
Contributor

Pinging @elastic/kibana-alerting-services (Team:Alerting Services)

@pmuellr
Member

pmuellr commented Jan 12, 2022

So we really just need to pick a number, right? 5 minutes feels about right to me. If the interval is < 5m, let it run on its existing schedule. Otherwise retry in 5m.

It probably doesn't make sense to bake this kind of logic into task manager (yet anyway), right? Especially since rules do kind of wacky scheduling anyway ... and not sure TM could recognize ES connectivity errors.

Just thinking about eventual "cron" scheduling. I think this will work there as well. We'll have to come up with a way of taking whatever "cron" schedule we have, and the current time, and have it tell us the "next" time. So we'll have the next interval, and can make the same calculation there ...
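
A rough sketch of how the same cap could apply once something can compute the "next" fire time for a cron schedule; this takes that next time as an input rather than assuming any particular cron library, and the names and 5-minute cap are illustrative assumptions:

```ts
// Illustrative only: once a cron parser tells us the next scheduled run,
// the same "retry in 5m unless the next run is sooner" calculation applies.
const CONNECTIVITY_RETRY_MS = 5 * 60 * 1000; // assumed 5-minute cap

function getConnectivityRetryDate(nextScheduledRun: Date, now: Date = new Date()): Date {
  const msUntilNext = nextScheduledRun.getTime() - now.getTime();

  // If the next scheduled run is less than 5 minutes away, keep the existing schedule;
  // otherwise retry in 5 minutes instead of waiting out the full interval.
  return msUntilNext <= CONNECTIVITY_RETRY_MS
    ? nextScheduledRun
    : new Date(now.getTime() + CONNECTIVITY_RETRY_MS);
}

// Example: a daily cron schedule ('0 0 * * *') whose rule fails at 01:00 would
// retry at 01:05 rather than waiting until the next midnight.
```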
