[Task Manager] Expose Task Manager metrics to task executors #98634
Comments
Pinging @elastic/kibana-alerting-services (Team:Alerting Services)
We should timebox the effort in combination with leveraging this in the actions event logs.
I think the first - task drift - is likely more interesting than the average, but both do seem interesting. When looking at the existing event log schema (these are the extensions to ECS we're using), I think we'll need a new object property under {
  scheduled_date: Date,
  schedule_delay: (task drift, in ms or nanoseconds?),
  average_duration: (from collected stats, in nanoseconds?)
}
Does that seem right?
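To make the proposal above concrete, here is a minimal sketch of how the three fields could be derived for a single task run. The names `TaskTiming` and `buildTaskTiming` are hypothetical, and milliseconds are assumed as the unit only for illustration, since the thread leaves the unit question open.

```typescript
// Hypothetical shape mirroring the fields proposed in the comment above.
// Units are an assumption (milliseconds); the thread had not settled on one.
interface TaskTiming {
  scheduled_date: Date;
  schedule_delay: number;
  average_duration: number | undefined;
}

// Derives the timing fields for one run from the time the task was supposed
// to start, the time it actually started, and durations of earlier runs.
function buildTaskTiming(
  scheduledDate: Date,
  startedAt: Date,
  pastDurationsMs: number[]
): TaskTiming {
  // Drift is "actual start minus scheduled start". It is clamped at zero here
  // so a task that starts early does not report a negative delay.
  const scheduleDelay = Math.max(0, startedAt.getTime() - scheduledDate.getTime());

  // With no history there is no meaningful average, so leave it undefined.
  const averageDuration =
    pastDurationsMs.length === 0
      ? undefined
      : pastDurationsMs.reduce((sum, d) => sum + d, 0) / pastDurationsMs.length;

  return {
    scheduled_date: scheduledDate,
    schedule_delay: scheduleDelay,
    average_duration: averageDuration,
  };
}
```

Clamping at zero is a design choice in this sketch, not something the thread decided; keeping negative values would instead make early starts visible.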
Not sure how difficult it is to tap into the current task manager stats, but if it is difficult, we could achieve the same thing if we had just the calculated drift (
IIUC that would actually be the drift aggregated over every server, where the current task manager stats are per server. I'm curious how we'll identify when these are wildly different, due to some machines behaving differently than others (presumably some much slower than others, so their drift would be higher). :-) It's also nice to actually have this value for the specific task, since we know when it was supposed to run - it's in the
@pmuellr I think we are saying the same thing :) I think what you're calling
But we don't want to do that for every connector / rule execution - and so I was thinking ... do we even need this in the actual event docs, since we CAN do aggs over the event docs ad hoc? And I believe we print these out in the TM health metrics anyway (if not, we should!). I haven't looked to see if it's hard to get the metrics out, guessing it might not be too difficult, but I'd probably have to thread a TM reference into some places we're currently not passing it, which is always the hard part :-)
Yes! I meant ad hoc aggregations! Sorry for the confusion
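As an illustration of the ad hoc aggregation idea, the sketch below builds an Elasticsearch search body that averages the schedule delay per Kibana server. It assumes the delay is stored in the event documents as `kibana.task.schedule_delay` (matching the schema that was later added) and that a per-server field such as `kibana.server_uuid` is present; the function name `buildScheduleDelayQuery` is hypothetical.

```typescript
// Builds a search body for an ad hoc aggregation over event log documents.
// Field names are assumptions based on the schema discussed in this thread.
function buildScheduleDelayQuery(since: string) {
  return {
    size: 0, // only aggregation results are needed, not the matching docs
    query: {
      bool: {
        filter: [
          { range: { "@timestamp": { gte: since } } },
          { exists: { field: "kibana.task.schedule_delay" } },
        ],
      },
    },
    aggs: {
      // One bucket per server, so a slow machine stands out from the rest.
      by_server: {
        terms: { field: "kibana.server_uuid" },
        aggs: {
          avg_delay: { avg: { field: "kibana.task.schedule_delay" } },
          max_delay: { max: { field: "kibana.task.schedule_delay" } },
        },
      },
      // Overall average across every server, for comparison.
      overall_avg_delay: { avg: { field: "kibana.task.schedule_delay" } },
    },
  };
}
```

Calling `buildScheduleDelayQuery("now-1h")` would give a body suitable for a search over the event log indices. Comparing each bucket's `avg_delay` with `overall_avg_delay` is one way to spot servers whose drift differs widely from the rest.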
Looked into accessing the task manager stats:
In order to keep PR #102252 small, I think I'll defer on providing those metrics for now ...
I was looking at the task manager docs of a running system:
It occurs to me - at least based on this - that it's not clear how we would do "average duration" for rules and connectors. Given the raw fields, we could do this for rules via
Ideally we'd like to have "average durations" for each rule and each connector, and it's not clear how best to do that with just task manager. Connectors are especially interesting, as presumably the connector id is not in the task doc at all, but in the action task params doc, which is itself a different saved object. Another good reason to defer "average durations" for now ...
resolves elastic#98634

This adds a new object property to the event log kibana object named task, with two properties to track the time the task was scheduled to run, and the delay between when it was supposed to run and when it actually started. This task property is only added to the appropriate events.

    task: schema.maybe(
      schema.object({
        scheduled: ecsDate(),
        schedule_delay: ecsNumber(),
      })
    ),
#103172) resolves #98634

This adds a new object property to the event log kibana object named task, with two properties to track the time the task was scheduled to run, and the delay between when it was supposed to run and when it actually started. This task property is only added to the appropriate events.

    task: schema.maybe(
      schema.object({
        scheduled: ecsDate(),
        schedule_delay: ecsNumber(),
      })
    ),

Note that these changes were previously merged to master in #102252, which had to be reverted. This PR contains the same commits, plus some additional ones to resolve the tests that were broken during the bad merge.
We'd like to allow Alerting tasks (and other tasks) to know the metrics around their execution.
We should pass the following into a task executor:
It would be useful for Alerting, as we could include these metrics in the Event Log, which will help with monitoring and SDH support.
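One way to picture "passing metrics into a task executor" is sketched below. This is not Task Manager's actual API: `TaskMetrics`, `RunContext`, `TaskExecutor`, and `runWithMetrics` are hypothetical names, and the only metric shown is the schedule delay discussed in the comments.

```typescript
// Hypothetical metrics handed to an executor; not Task Manager's real types.
interface TaskMetrics {
  scheduleDelayMs: number; // how late the task started, in milliseconds
}

interface RunContext {
  taskId: string;
  scheduledAt: Date;
  metrics: TaskMetrics;
}

type TaskExecutor<T> = (context: RunContext) => T;

// Computes the delay at the moment the task starts and passes it to the
// executor, which can then record it (for example, in an event log entry).
// The clock is injectable so the behaviour can be checked deterministically.
function runWithMetrics<T>(
  taskId: string,
  scheduledAt: Date,
  executor: TaskExecutor<T>,
  now: () => Date = () => new Date()
): T {
  const startedAt = now();
  const metrics: TaskMetrics = {
    scheduleDelayMs: startedAt.getTime() - scheduledAt.getTime(),
  };
  return executor({ taskId, scheduledAt, metrics });
}

// Usage: an executor that simply reports the delay it was given.
const message = runWithMetrics(
  "example-task",
  new Date("2021-06-01T00:00:00.000Z"),
  (ctx) => `${ctx.taskId} started ${ctx.metrics.scheduleDelayMs} ms late`,
  () => new Date("2021-06-01T00:00:02.000Z")
);
```

With the fixed clock above, `message` is `"example-task started 2000 ms late"`.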