Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve Task Manager instrumentation #99160

Merged
merged 16 commits into from
Jun 4, 2021
Merged

Conversation

dgieselaar
Copy link
Member

@dgieselaar dgieselaar commented May 4, 2021

In #90955, I dialed down the cardinality of Task Manager transaction names due to the performance impact of breakdown metrics. Now that some performance issues are fixed in the APM Agent, and the need for better observability of the Task Manager is evident, it makes sense to revert those changes and add additional instrumentation.

The following changes were made:

  • Disparate transaction types for run, markAvailableTasksAsClaimed, markTaskAsRunning
  • Use the task type as the transaction name for run transaction types
  • Wrap task manager/action operations in custom spans, to group ES calls together
  • Store the traceparent in the task object, to enable distributed tracing for tasks (ie, if a task schedules another task, it will show up in the same trace)

Additionally, I've set refresh to false in some situations, as by default the SO client will use wait_for, and wait on a shard refresh. That could unnecessarily delay task completion, and consume resources that could otherwise be freed up. moved to #99919

For the screenshots, I disabled http/https instrumentation to prevent the http spans from showing up, and I locally worked around an instrumentation issue in the Node.js agent that messes up span relationships (see elastic/apm-agent-nodejs#1964 for a description of the problem).

Before:

image

After:

image

image

image

@dgieselaar dgieselaar requested a review from gmmorris May 4, 2021 08:33
@dgieselaar dgieselaar marked this pull request as ready for review May 4, 2021 10:00
@dgieselaar dgieselaar requested a review from a team as a code owner May 4, 2021 10:00
@dgieselaar dgieselaar added release_note:skip Skip the PR/issue when compiling release notes v7.14.0 labels May 4, 2021
Copy link
Contributor

@ymao1 ymao1 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good! Left a comment about adding some tests. Do you think you can provide instructions for generating the charts you have in your PR locally? Is that possible? I'd love to see it all working.

@gmmorris
Copy link
Contributor

Hi @elastic/kibana-alerting-services

Did anyone discuss the state of this PR with @dgieselaar ?
Will Dario be continuing this when he's back from leave, or are we going to do something with it ourselves? 🤔

@ymao1
Copy link
Contributor

ymao1 commented May 24, 2021

Hi @elastic/kibana-alerting-services

Did anyone discuss the state of this PR with @dgieselaar ?
Will Dario be continuing this when he's back from leave, or are we going to do something with it ourselves? 🤔

Sorry, I dropped the ball on that. I'm happy to add unit tests for the apm calls but I still don't know how to get APM running to see the traces. I can look through the docs

@ymao1
Copy link
Contributor

ymao1 commented May 24, 2021

@elasticmachine merge upstream

Copy link
Contributor

@ymao1 ymao1 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@smith
Copy link
Contributor

smith commented May 27, 2021

@elasticmachine merge upstream

@smith smith added the auto-backport Deprecated - use backport:version if exact versions are needed label May 27, 2021
Comment on lines +260 to +264
/**
* The serialized traceparent string of the current APM transaction or span.
*/
traceparent?: string;

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should add a note in the Readme.md about using this field to trace the execution of tasks in apm.
Other plugins might want to utilise this.

return await this.store.schedule(modifiedTask);
return await this.store.schedule({
...modifiedTask,
traceparent: agent.currentTraceparent ?? '',
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does an empty string tell APM that there is no trace here? Or is this going to cause all tasks that don't have a traceparent to appear in APM as related because they all get grouped under an "empty string" transaction? 😬

@gmmorris
Copy link
Contributor

gmmorris commented Jun 2, 2021

@elasticmachine merge upstream

Copy link
Contributor

@gmmorris gmmorris left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM but added a couple of questions.
I'd appreciate if another member of @elastic/kibana-alerting-services took a look as @ymao1 as not just reviewer here but also contributer. :)

@chrisronline chrisronline self-requested a review June 2, 2021 11:41
@ymao1 ymao1 added the Team:ResponseOps Label for the ResponseOps team (formerly the Cases and Alerting teams) label Jun 2, 2021
try {
this.task = this.definition.createTaskRunner(modifiedContext);
const result = await this.task.run();
const result = await withSpan({ name: 'run', type: 'task manager' }, () => this.task!.run());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just curious, why is this.task! necessary in this PR, but not in master? (on the above line)

Copy link
Contributor

@chrisronline chrisronline left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

Pretty cool to see how useful this integration is and will continue to be for us and users!

@ymao1
Copy link
Contributor

ymao1 commented Jun 4, 2021

@elasticmachine merge upstream

@kibanamachine
Copy link
Contributor

💚 Build Succeeded

Metrics [docs]

Unknown metric groups

API count

id before after diff
taskManager 43 44 +1

History

To update your PR or re-run it, just comment with:
@elasticmachine merge upstream

@ymao1 ymao1 merged commit c79a29a into elastic:master Jun 4, 2021
kibanamachine added a commit to kibanamachine/kibana that referenced this pull request Jun 4, 2021
* Instrument task manager

* Don't refresh after SO updates

* Update snapshot test, remove beforeRun instrumentation

* Revert "Don't refresh after SO updates"

This reverts commit 54f848d.

* Fix task_store unit test

* Adding tests and updating ConcreteTaskInstance interface with traceparent

* Reverting unnecessary changes

Co-authored-by: Kibana Machine <42973632+kibanamachine@users.noreply.github.com>
Co-authored-by: Ying Mao <ying.mao@elastic.co>
@kibanamachine
Copy link
Contributor

💚 Backport successful

Status Branch Result
7.x

This backport PR will be merged automatically after passing CI.

kibanamachine added a commit that referenced this pull request Jun 4, 2021
* Instrument task manager

* Don't refresh after SO updates

* Update snapshot test, remove beforeRun instrumentation

* Revert "Don't refresh after SO updates"

This reverts commit 54f848d.

* Fix task_store unit test

* Adding tests and updating ConcreteTaskInstance interface with traceparent

* Reverting unnecessary changes

Co-authored-by: Kibana Machine <42973632+kibanamachine@users.noreply.github.com>
Co-authored-by: Ying Mao <ying.mao@elastic.co>

Co-authored-by: Dario Gieselaar <dario.gieselaar@elastic.co>
Co-authored-by: Ying Mao <ying.mao@elastic.co>
gmmorris added a commit to gmmorris/kibana that referenced this pull request Jun 7, 2021
* master: (90 commits)
  Fix UI breaks on providing long search keyword in 'Search Box' (elastic#101385)
  Adds css class to EuiDescriptionListDescription in order to break word on exception details card (elastic#101481)
  [Lens] Increase timings for drag and drop tests (elastic#101380)
  [Lens] Fix editor react error on configuration panel (elastic#101367)
  [Fleet] Move integrations to a separate app (elastic#99848)
  Fix incorrect message displayed on importing Timeline Templates (elastic#101288)
  [Cases] RBAC (elastic#95058)
  [APM] Visual improvements for new APM layout with left navigation (elastic#101360)
  [master] More precise alerts matching (elastic#99820)
  [Lens] Value in legend (elastic#101353)
  Revert "[Reporting] ILM policy for managing reporting indices (elastic#100130)" (elastic#101358)
  [Discover] Fix header row of data grid in Firefox (elastic#101374)
  Add link to advanced setting in Discover (elastic#101154)
  Url service locators (elastic#101045)
  [Timelion] Update the removal message to mention the exact version (elastic#100994)
  [Security Solution][Detection Engine] Test cases for alias failure test cases where we don't copy aliases correctly (elastic#101437)
  [Event Log] Adding `type_id` to saved object array in event log (elastic#100939)
  [Reporting] Add `location.url` info to console message logs (elastic#101427)
  [Security Solutions][Detection Engine] Fixes timestamp bugs within source indexes when the formats are not ISO8601 format (elastic#101349)
  Improve Task Manager instrumentation (elastic#99160)
  ...
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
auto-backport Deprecated - use backport:version if exact versions are needed Feature:Actions Feature:Task Manager release_note:skip Skip the PR/issue when compiling release notes Team:ResponseOps Label for the ResponseOps team (formerly the Cases and Alerting teams) v7.14.0 v8.0.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants