
[Task Manager] Expose Task Manager metrics to task executors #98634

Closed · gmmorris opened this issue Apr 28, 2021 · 10 comments · Fixed by #102252 or #103172
Assignees: pmuellr
Labels: Feature:Alerting, Feature:Task Manager, Team:ResponseOps (label for the ResponseOps team, formerly the Cases and Alerting teams)

Comments

@gmmorris (Contributor)

We'd like to allow Alerting tasks (and other tasks) to access metrics about their own execution.

We should pass the following into a task executor:

  1. Task Drift of the specific task
  2. Perhaps current average duration?

It would be useful for Alerting, as we could include these metrics in the Event Log, which will help with monitoring and SDH support.
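As a rough illustration of passing the above into a task executor (the interface and field names below are hypothetical, not the actual Task Manager API):

	// Hypothetical sketch of the proposal: a metrics object handed to the task
	// executor alongside the task instance. All names here are illustrative.
	interface TaskExecutionMetrics {
	  drift: number;            // ms between the scheduled runAt and the actual start
	  averageDuration?: number; // optional rolling average duration of recent runs, in ms
	}

	interface TaskInstanceSummary {
	  id: string;
	  taskType: string;
	  runAt: Date;
	}

	interface RunContext {
	  taskInstance: TaskInstanceSummary;
	  metrics: TaskExecutionMetrics;
	}

	// An executor could then forward these values, e.g. into the Event Log.
	function createTaskRunner({ taskInstance, metrics }: RunContext) {
	  return {
	    async run() {
	      console.log(
	        `task ${taskInstance.id} started ${metrics.drift}ms after its scheduled runAt`
	      );
	    },
	  };
	}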

@botelastic botelastic bot added the needs-team Issues missing a team label label Apr 28, 2021
@gmmorris gmmorris added Feature:Alerting Feature:Task Manager Team:ResponseOps Label for the ResponseOps team (formerly the Cases and Alerting teams) and removed needs-team Issues missing a team label labels Apr 28, 2021
@elasticmachine (Contributor)

Pinging @elastic/kibana-alerting-services (Team:Alerting Services)

@mikecote (Contributor) commented Jun 9, 2021

We should timebox the effort in combination with leveraging this in the actions event logs.

@pmuellr pmuellr self-assigned this Jun 15, 2021
@pmuellr (Member) commented Jun 15, 2021

  • task drift would be runAt - now()?
  • current average duration would be pulled from the current stats being collected?

I think the first - task drift - is likely more interesting than the average, but both do seem interesting.

When looking at the existing event log schema (these are the extensions to ECS we're using), I think we'll need a new object property under kibana (peer of alerting) - I'll call it task for now. And it could have the following fields:

{
  scheduled_date: Date,
  schedule_delay: (task drift in ms or nanoseconds?)
  average_duration: (from collected stats - nanoseconds?)
}

Does that seem right?
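As a minimal sketch of how the first two fields could be derived at the point the task actually starts (helper name and the millisecond unit are my assumptions, not settled):

	// Sketch only: derive the proposed fields from the task's runAt and the
	// actual start time. Unit (ms vs nanoseconds) is still an open question above.
	interface TaskEventFields {
	  scheduled_date: Date;
	  schedule_delay: number; // "task drift" in milliseconds here
	}

	function getTaskEventFields(runAt: Date, startedAt: Date = new Date()): TaskEventFields {
	  return {
	    scheduled_date: runAt,
	    schedule_delay: startedAt.getTime() - runAt.getTime(),
	  };
	}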

@ymao1 (Contributor) commented Jun 15, 2021

Not sure how difficult it is to tap into the current task manager stats, but if it is difficult, we could achieve the same thing if we had just the calculated drift (schedule_delay) value in the event log, using ES aggs, right?

@pmuellr (Member) commented Jun 15, 2021

> if we had just the calculated drift (schedule_delay) value in the event log, using ES aggs, right?

IIUC that would actually be the drift aggregated over every server, where the current task manager stats are per server. I'm curious how we'll identify when these are wildly different, due to some machines behaving differently than others (presumably some much slower than others, so their drift would be higher). :-)

It's also nice to actually have this value for the specific task, since we know when it was supposed to run - it's in the runAt field of the task object, and we know the current time, so we'll see the actual "drift" for each execution IN the event doc for the execution. Assuming we can thread that runAt value to the point where we write the event log docs. Which I think we can (see current PR #102252).

@ymao1 (Contributor) commented Jun 15, 2021

@pmuellr I think we are saying the same thing :)

I think what you're calling schedule_delay is easy to calculate and useful (and you've already added it in your draft PR!). As for what you're calling average_duration: if it's difficult to pass from the task manager, we could get it using ES aggs on the event log. We could group by specific rule id, rule type id, and even by kibana instance using the server_uuid value that's already written out in the event log.
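For reference, such an ad hoc aggregation could look roughly like this; the index pattern and field names (event.duration, kibana.server_uuid, event.provider/action values) are assumptions to be checked against the actual event log mapping:

	// Rough sketch of an ad hoc ES aggregation over the event log; field names
	// and index pattern are assumptions, not guaranteed to match the shipped mapping.
	import { Client } from '@elastic/elasticsearch';

	const client = new Client({ node: 'http://localhost:9200' });

	async function averageExecutionDurationByServer() {
	  const result = await client.search({
	    index: '.kibana-event-log-*',
	    body: {
	      size: 0,
	      query: {
	        bool: {
	          filter: [
	            { term: { 'event.provider': 'alerting' } },
	            { term: { 'event.action': 'execute' } },
	          ],
	        },
	      },
	      aggs: {
	        by_server: {
	          terms: { field: 'kibana.server_uuid' },
	          aggs: {
	            // event.duration is in nanoseconds per ECS
	            avg_duration_ns: { avg: { field: 'event.duration' } },
	          },
	        },
	      },
	    },
	  });
	  return result.body.aggregations;
	}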

@pmuellr (Member) commented Jun 15, 2021

> we could get it using ES aggs on the event log

But we don't want to do that for every connector / rule execution - and so I was thinking ... do we even need this in the actual event docs, since we CAN do aggs over the event docs ad hoc? And I believe we print these out in the TM health metrics anyway (if not, we should!).

I haven't looked to see if it's hard to get the metrics out, guessing it might not be too difficult, but I'd probably have to thread a TM reference into some places we're currently not passing it, which is always the hard part :-)

@ymao1 (Contributor) commented Jun 15, 2021

Yes! I meant ad hoc aggregations! Sorry for the confusion

@pmuellr (Member) commented Jun 16, 2021

Looked into accessing the task manager stats:

In order to keep PR #102252 small, I think I'll defer on providing those metrics for now ...

@pmuellr (Member) commented Jun 16, 2021

I was looking at the task manager docs of a running system:

[screenshot: task manager documents from a running system]

Occurs to me - at least based on this - that it's not clear how we would do "average duration" for rules and connectors. Given the raw fields, we could do this for rules via _id - but of course the _id of a rule's task will change if it's disabled and then re-enabled. For connectors, we could use task.taskType, but then all tasks of the same type will get lumped together - that seems problematic / not useful for the index connector, as an example.

Ideally we'd like to have "average durations" for each rule and each connector, and it's not clear how best to do that with just task manager. Connectors are especially interesting as presumably the connector id is not in the task doc at all, but in the action task params doc, which is itself a different saved object.

Another good reason to defer "average durations" for now ...

pmuellr added a commit to pmuellr/kibana that referenced this issue Jun 20, 2021
resolves elastic#98634

This adds a new object property to the event log kibana object named
task, with two properties to track the time the task was scheduled to
run, and the delay between when it was supposed to run and when it
actually started. This task property is only added to the appropriate
events.

	task: schema.maybe(
	  schema.object({
	    scheduled: ecsDate(),
	    schedule_delay: ecsNumber(),
	  })
	),
pmuellr added a commit that referenced this issue Jun 23, 2021
…102252)

pmuellr added a commit to pmuellr/kibana that referenced this issue Jun 23, 2021
…lastic#102252)

ymao1 pushed a commit to ymao1/kibana that referenced this issue Jun 23, 2021
…lastic#102252)

pmuellr added a commit that referenced this issue Jun 24, 2021
#103172)

Note that these changes were previously merged to master in #102252 which had to be reverted - this PR contains the same commits, plus some additional ones to resolve the tests that were broken during the bad merge.
pmuellr added a commit to pmuellr/kibana that referenced this issue Jun 24, 2021
elastic#103172)

pmuellr added a commit that referenced this issue Jun 24, 2021
#103172) (#103296)

@kobelb kobelb added the needs-team Issues missing a team label label Jan 31, 2022
@botelastic botelastic bot removed the needs-team Issues missing a team label label Jan 31, 2022