
[Task Manager] Expose Task Manager metrics to task executors #98634

Closed · gmmorris opened this issue Apr 28, 2021 · 10 comments · Fixed by #102252 or #103172
Assignees: pmuellr
Labels: Feature:Alerting, Feature:Task Manager, Team:ResponseOps (label for the ResponseOps team, formerly the Cases and Alerting teams)

Comments

@gmmorris (Contributor)

We'd like to allow Alerting tasks (and other tasks) to access metrics about their own execution.

We should pass the following into a task executor:

  1. Task Drift of the specific task
  2. Perhaps current average duration?

It would be useful for Alerting, as we could include these metrics in the Event Log, which will help with monitoring and SDH support.
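As a rough illustration of passing the above into a task executor (the interface and field names below are hypothetical, not the actual Task Manager API):

	// Hypothetical sketch of the proposal: a metrics object handed to the task
	// executor alongside the task instance. All names here are illustrative.
	interface TaskExecutionMetrics {
	  drift: number;            // ms between the scheduled runAt and the actual start
	  averageDuration?: number; // optional rolling average duration of recent runs, in ms
	}

	interface TaskInstanceSummary {
	  id: string;
	  taskType: string;
	  runAt: Date;
	}

	interface RunContext {
	  taskInstance: TaskInstanceSummary;
	  metrics: TaskExecutionMetrics;
	}

	// An executor could then forward these values, e.g. into the Event Log.
	function createTaskRunner({ taskInstance, metrics }: RunContext) {
	  return {
	    async run() {
	      console.log(
	        `task ${taskInstance.id} started ${metrics.drift}ms after its scheduled runAt`
	      );
	    },
	  };
	}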

@botelastic botelastic bot added the needs-team Issues missing a team label label Apr 28, 2021
@gmmorris gmmorris added Feature:Alerting Feature:Task Manager Team:ResponseOps Label for the ResponseOps team (formerly the Cases and Alerting teams) and removed needs-team Issues missing a team label labels Apr 28, 2021
@elasticmachine (Contributor)

Pinging @elastic/kibana-alerting-services (Team:Alerting Services)

@mikecote (Contributor) commented Jun 9, 2021

We should timebox the effort in combination with leveraging this in the actions event logs.

@pmuellr pmuellr self-assigned this Jun 15, 2021
@pmuellr (Member) commented Jun 15, 2021

  • task drift would be runAt - now()?
  • current average duration would be pulled from the current stats being collected?

I think the first - task drift - is likely more interesting than the average, but both do seem interesting.

When looking at the existing event log schema (these are the extensions to ECS we're using), I think we'll need a new object property under kibana (peer of alerting) - I'll call it task for now. And it could have the following fields:

{
  scheduled_date: Date,
  schedule_delay: (task drift in ms or nanoseconds?)
  average_duration: (from collected stats - nanoseconds?)
}

Does that seem right?
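As a minimal sketch of how the first two fields could be derived at the point the task actually starts (helper name and the millisecond unit are my assumptions, not settled):

	// Sketch only: derive the proposed fields from the task's runAt and the
	// actual start time. Unit (ms vs nanoseconds) is still an open question above.
	interface TaskEventFields {
	  scheduled_date: Date;
	  schedule_delay: number; // "task drift" in milliseconds here
	}

	function getTaskEventFields(runAt: Date, startedAt: Date = new Date()): TaskEventFields {
	  return {
	    scheduled_date: runAt,
	    schedule_delay: startedAt.getTime() - runAt.getTime(),
	  };
	}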

@ymao1 (Contributor) commented Jun 15, 2021

Not sure how difficult it is to tap into the current task manager stats, but if it is difficult, we could achieve the same thing if we had just the calculated drift (schedule_delay) value in the event log, using ES aggs, right?

@pmuellr (Member) commented Jun 15, 2021

> if we had just the calculated drift (schedule_delay) value in the event log, using ES aggs, right?

IIUC that would actually be the drift aggregated over every server, where the current task manager stats are per server. I'm curious how we'll identify when these are wildly different, due to some machines behaving differently than others (presumably some much slower than others, so their drift would be higher). :-)

It's also nice to actually have this value for the specific task, since we know when it was supposed to run - it's in the runAt field of the task object, and we know the current time, so we'll see the actual "drift" for each execution IN the event doc for the execution. Assuming we can thread that runAt value to the point where we write the event log docs. Which I think we can (see current PR #102252).

@ymao1 (Contributor) commented Jun 15, 2021

@pmuellr I think we are saying the same thing :)

I think what you're calling schedule_delay is easy to calculate and useful (and you've already added it in your draft PR!). As for what you're calling average_duration: if it's difficult to pass from the task manager, we could get it using ES aggs on the event log. We could group by specific rule id, rule type id, and even by kibana instance using the server_uuid value that's already written out in the event log.
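For reference, such an ad hoc aggregation could look roughly like this; the index pattern and field names (event.duration, kibana.server_uuid, event.provider/action values) are assumptions to be checked against the actual event log mapping:

	// Rough sketch of an ad hoc ES aggregation over the event log; field names
	// and index pattern are assumptions, not guaranteed to match the shipped mapping.
	import { Client } from '@elastic/elasticsearch';

	const client = new Client({ node: 'http://localhost:9200' });

	async function averageExecutionDurationByServer() {
	  const result = await client.search({
	    index: '.kibana-event-log-*',
	    body: {
	      size: 0,
	      query: {
	        bool: {
	          filter: [
	            { term: { 'event.provider': 'alerting' } },
	            { term: { 'event.action': 'execute' } },
	          ],
	        },
	      },
	      aggs: {
	        by_server: {
	          terms: { field: 'kibana.server_uuid' },
	          aggs: {
	            // event.duration is in nanoseconds per ECS
	            avg_duration_ns: { avg: { field: 'event.duration' } },
	          },
	        },
	      },
	    },
	  });
	  return result.body.aggregations;
	}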

@pmuellr (Member) commented Jun 15, 2021

> we could get it using ES aggs on the event log

But we don't want to do that for every connector / rule execution - and so I was thinking ... do we even need this in the actual event docs, since we CAN do aggs over the event docs ad hoc? And I believe we print these out in the TM health metrics anyway (if not, we should!).

I haven't looked to see if it's hard to get the metrics out, guessing it might not be too difficult, but I'd probably have to thread a TM reference into some places we're currently not passing it, which is always the hard part :-)

@ymao1 (Contributor) commented Jun 15, 2021

Yes! I meant ad hoc aggregations! Sorry for the confusion

@pmuellr (Member) commented Jun 16, 2021

Looked into accessing the task manager stats:

In order to keep PR #102252 small, I think I'll defer on providing those metrics for now ...

@pmuellr (Member) commented Jun 16, 2021

I was looking at the task manager docs of a running system:

[screenshot: task manager documents from a running system]

Occurs to me - at least based on this - that it's not clear how we would do "average duration" for rules and connectors. Given the raw fields, we could do this for rules via _id - but of course the _id of a rule's task will change if it's disabled and then re-enabled. For connectors, we could use task.taskType, but then all tasks of the same type will get lumped together - that seems problematic / not useful for the index connector, as an example.

Ideally we'd like to have "average durations" for each rule and each connector, and it's not clear how best to do that with just task manager. Connectors are especially interesting as presumably the connector id is not in the task doc at all, but in the action task params doc, which is itself a different saved object.

Another good reason to defer "average durations" for now ...

pmuellr added a commit to pmuellr/kibana that referenced this issue Jun 20, 2021
resolves elastic#98634

This adds a new object property to the event log kibana object named
task, with two properties to track the time the task was scheduled to
run, and the delay between when it was supposed to run and when it
actually started. This task property is only added to the appropriate
events.

	task: schema.maybe(
	  schema.object({
	    scheduled: ecsDate(),
	    schedule_delay: ecsNumber(),
	  })
	),
pmuellr added a commit that referenced this issue Jun 23, 2021
…102252)

pmuellr added a commit to pmuellr/kibana that referenced this issue Jun 23, 2021
…lastic#102252)

ymao1 pushed a commit to ymao1/kibana that referenced this issue Jun 23, 2021
…lastic#102252)

pmuellr added a commit that referenced this issue Jun 24, 2021
#103172)

Note that these changes were previously merged to master in #102252 which had to be reverted - this PR contains the same commits, plus some additional ones to resolve the tests that were broken during the bad merge.
pmuellr added a commit to pmuellr/kibana that referenced this issue Jun 24, 2021
elastic#103172)

pmuellr added a commit that referenced this issue Jun 24, 2021
#103172) (#103296)

@kobelb kobelb added the needs-team Issues missing a team label label Jan 31, 2022
@botelastic botelastic bot removed the needs-team Issues missing a team label label Jan 31, 2022