
[Alerting] Smarter retry interval for ES Connectivity errors #122390

Closed
ymao1 opened this issue Jan 5, 2022 · 2 comments · Fixed by #123642
Labels
estimate:small (Small Estimated Level of Effort), Feature:Alerting/RulesFramework (Issues related to the Alerting Rules Framework), Team:ResponseOps (Label for the ResponseOps team, formerly the Cases and Alerting teams)

Comments

@ymao1
Contributor

ymao1 commented Jan 5, 2022

When an alerting rule execution encounters an error, it reschedules itself for the next schedule interval. With this PR, Kibana Core introduced a specific error type, EsUnavailableError, for transient ES connectivity errors. We should look for these errors in the alerting task runner and consider handling them differently from other errors, based on the rule's configured schedule interval. If a rule is scheduled to run every 24 hours and it encounters a transient connectivity issue, it would make more sense to schedule its retry for a shorter interval, like 5 minutes, instead of waiting another 24 hours for the next execution.

Towards #119650
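
A minimal sketch of the idea, assuming a hypothetical 5-minute cap and a predicate for recognizing the new error type; the names `getRetryDelayMs`, `isEsUnavailableError`, and `CONNECTIVITY_RETRY_MS` are illustrative, not the implementation that shipped in the linked PR:

```ts
// Illustrative sketch only: the helper names and the 5-minute cap are assumptions.
const CONNECTIVITY_RETRY_MS = 5 * 60 * 1000; // assumed 5-minute cap

interface RetryArgs {
  error: Error;
  scheduleIntervalMs: number; // the rule's configured schedule interval, in ms
  isEsUnavailableError: (e: Error) => boolean; // assumed predicate for the core error type
}

function getRetryDelayMs({ error, scheduleIntervalMs, isEsUnavailableError }: RetryArgs): number {
  // Transient ES connectivity errors: retry sooner than a long schedule interval
  // would allow; rules with short intervals keep their normal schedule.
  if (isEsUnavailableError(error)) {
    return Math.min(scheduleIntervalMs, CONNECTIVITY_RETRY_MS);
  }
  // All other errors keep the existing behavior: retry at the configured interval.
  return scheduleIntervalMs;
}

// Example: a rule scheduled every 24h that hits a connectivity error retries in ~5m;
// a rule scheduled every 1m keeps its 1m schedule.
```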

@ymao1 added the Team:ResponseOps, Feature:Alerting/RulesFramework, and estimate:small labels on Jan 5, 2022
@elasticmachine
Contributor

Pinging @elastic/kibana-alerting-services (Team:Alerting Services)

@pmuellr
Member

pmuellr commented Jan 12, 2022

So we really just need to pick a number, right? 5 minutes feels about right to me. If the interval is < 5m, let it run on its existing schedule. Otherwise retry in 5m.

It probably doesn't make sense to bake this kind of logic into task manager (yet anyway), right? Especially since rules do kind of wacky scheduling anyway ... and not sure TM could recognize ES connectivity errors.

Just thinking about eventual "cron" scheduling. I think this will work there as well. We'll have to come up with a way of taking whatever "cron" schedule we have, and the current time, and have it tell us the "next" time. So we'll have the next interval, and can make the same calculation there ...
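
A rough sketch of how the same cap could apply once something can compute the "next" fire time for a cron schedule; this takes that next time as an input rather than assuming any particular cron library, and the names and 5-minute cap are illustrative assumptions:

```ts
// Illustrative only: once a cron parser tells us the next scheduled run,
// the same "retry in 5m unless the next run is sooner" calculation applies.
const CONNECTIVITY_RETRY_MS = 5 * 60 * 1000; // assumed 5-minute cap

function getConnectivityRetryDate(nextScheduledRun: Date, now: Date = new Date()): Date {
  const msUntilNext = nextScheduledRun.getTime() - now.getTime();

  // If the next scheduled run is less than 5 minutes away, keep the existing schedule;
  // otherwise retry in 5 minutes instead of waiting out the full interval.
  return msUntilNext <= CONNECTIVITY_RETRY_MS
    ? nextScheduledRun
    : new Date(now.getTime() + CONNECTIVITY_RETRY_MS);
}

// Example: a daily cron schedule ('0 0 * * *') whose rule fails at 01:00 would
// retry at 01:05 rather than waiting until the next midnight.
```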
