Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve operator tracing and make it E2E work #11360

Closed
wants to merge 1 commit into from

Conversation

xiaoxmeng
Copy link
Contributor

@xiaoxmeng xiaoxmeng commented Oct 28, 2024

Differential Revision: D64946367

This PR improves the operator trace framework and replay tool implementation:
(1) Let driver execution framework to handle the trace input collection and summary file generation.
The finish trace is either called when trace limit exceeds or on operator close instead of no
more input. So it can capture more information in summary like peak memory usage. By removing
the trace input collection into the driver framework it eases the trace input collection for a spilling
operator. We add to capture the peak memory and input rows in summary so it helps identify the
hot operator or task on replay debugging.
(2) Change the trace storage layout to have root with query id for traces from different tasks
(3) Abstract the replay tool function into TraceReplayRunner class and derive in Meta internal code repo
to handle the Meta internal env setup and keep most common part in TraceReplayRunner for OSS
(4) A couple of fixes in trace replay tool to make it E2E function in Meta for table writer use case
(5) A couple of file/class renaming to make the file/class name to be more specific as currently we only
support operator level trace collection and replay

@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Oct 28, 2024
Copy link

netlify bot commented Oct 28, 2024

Deploy Preview for meta-velox canceled.

Name Link
🔨 Latest commit 7994bda
🔍 Latest deploy log https://app.netlify.com/sites/meta-velox/deploys/6721cb87ca64f400080a2524

@facebook-github-bot
Copy link
Contributor

This pull request was exported from Phabricator. Differential Revision: D64946367

@facebook-github-bot
Copy link
Contributor

This pull request was exported from Phabricator. Differential Revision: D64946367

xiaoxmeng added a commit to xiaoxmeng/velox that referenced this pull request Oct 28, 2024
@facebook-github-bot
Copy link
Contributor

This pull request was exported from Phabricator. Differential Revision: D64946367

xiaoxmeng added a commit to xiaoxmeng/velox that referenced this pull request Oct 28, 2024
@duanmeng duanmeng self-requested a review October 28, 2024 06:29
Copy link
Collaborator

@duanmeng duanmeng left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@xiaoxmeng LGTM, thanks for your refactor.

@xiaoxmeng xiaoxmeng changed the title Operator trace refactor and support more trace summary Improve operator tracing and make it E2E work Oct 29, 2024
@facebook-github-bot
Copy link
Contributor

This pull request was exported from Phabricator. Differential Revision: D64946367

xiaoxmeng added a commit to xiaoxmeng/velox that referenced this pull request Oct 29, 2024
Summary:
Pull Request resolved: facebookincubator#11360

This PR improves the operator trace framework and replay tool implementation:
(1) Let driver execution framework to handle the trace input collection and summary file generation.
The finish trace is either called when trace limit exceeds or on operator close instead of no
more input. So it can capture more information in summary like peak memory usage. By removing
the trace input collection into the driver framework it eases the trace input collection for a spilling
operator. We add to capture the peak memory and input rows in summary so it helps identify the
hot operator or task on replay debugging.
(2) Change the trace storage layout to have root with query id for traces from different tasks
(3) Abstract the replay tool function into TraceReplayRunner class and derive in Meta internal code repo
to handle the Meta internal env setup and keep most common part in  TraceReplayRunner for OSS
(4) A couple of fixes in trace replay tool to make it E2E function in Meta for table writer use case
(5) A couple of file/class renaming to make the file/class name to be more specific as currently we only
support operator level trace collection and replay

Differential Revision: D64946367
@facebook-github-bot
Copy link
Contributor

This pull request was exported from Phabricator. Differential Revision: D64946367

xiaoxmeng added a commit to xiaoxmeng/velox that referenced this pull request Oct 29, 2024
Summary:
Pull Request resolved: facebookincubator#11360

This PR improves the operator trace framework and replay tool implementation:
(1) Let driver execution framework to handle the trace input collection and summary file generation.
The finish trace is either called when trace limit exceeds or on operator close instead of no
more input. So it can capture more information in summary like peak memory usage. By removing
the trace input collection into the driver framework it eases the trace input collection for a spilling
operator. We add to capture the peak memory and input rows in summary so it helps identify the
hot operator or task on replay debugging.
(2) Change the trace storage layout to have root with query id for traces from different tasks
(3) Abstract the replay tool function into TraceReplayRunner class and derive in Meta internal code repo
to handle the Meta internal env setup and keep most common part in  TraceReplayRunner for OSS
(4) A couple of fixes in trace replay tool to make it E2E function in Meta for table writer use case
(5) A couple of file/class renaming to make the file/class name to be more specific as currently we only
support operator level trace collection and replay

Differential Revision: D64946367
@facebook-github-bot
Copy link
Contributor

This pull request was exported from Phabricator. Differential Revision: D64946367

xiaoxmeng added a commit to xiaoxmeng/velox that referenced this pull request Oct 29, 2024
Summary:
Pull Request resolved: facebookincubator#11360

This PR improves the operator trace framework and replay tool implementation:
(1) Let driver execution framework to handle the trace input collection and summary file generation.
The finish trace is either called when trace limit exceeds or on operator close instead of no
more input. So it can capture more information in summary like peak memory usage. By removing
the trace input collection into the driver framework it eases the trace input collection for a spilling
operator. We add to capture the peak memory and input rows in summary so it helps identify the
hot operator or task on replay debugging.
(2) Change the trace storage layout to have root with query id for traces from different tasks
(3) Abstract the replay tool function into TraceReplayRunner class and derive in Meta internal code repo
to handle the Meta internal env setup and keep most common part in  TraceReplayRunner for OSS
(4) A couple of fixes in trace replay tool to make it E2E function in Meta for table writer use case
(5) A couple of file/class renaming to make the file/class name to be more specific as currently we only
support operator level trace collection and replay

Differential Revision: D64946367
velox/exec/Driver.cpp Outdated Show resolved Hide resolved
velox/tool/trace/TraceReplayer.cpp Outdated Show resolved Hide resolved
velox/tool/trace/TraceReplayRunner.cpp Show resolved Hide resolved
@@ -29,6 +30,52 @@ namespace facebook::velox::exec::trace {
/// Creates a directory to store the query trace metdata and data.
void createTraceDirectory(const std::string& traceDir);

/// Returns the trace directory for a given query.
std::string getQueryTraceDirectory(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

s/get/compose/

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will keep it as get

const std::string& queryId);

/// Returns the trace directory for a given query task.
std::string getTaskTraceDirectory(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ditto, and below

velox/exec/TraceUtil.cpp Outdated Show resolved Hide resolved
velox/exec/PartitionedOutput.cpp Show resolved Hide resolved
Copy link
Contributor

@tanjialiang tanjialiang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the refactoring, LGTM

xiaoxmeng added a commit to xiaoxmeng/velox that referenced this pull request Oct 29, 2024
Summary:
Pull Request resolved: facebookincubator#11360

This PR improves the operator trace framework and replay tool implementation:
(1) Let driver execution framework to handle the trace input collection and summary file generation.
The finish trace is either called when trace limit exceeds or on operator close instead of no
more input. So it can capture more information in summary like peak memory usage. By removing
the trace input collection into the driver framework it eases the trace input collection for a spilling
operator. We add to capture the peak memory and input rows in summary so it helps identify the
hot operator or task on replay debugging.
(2) Change the trace storage layout to have root with query id for traces from different tasks
(3) Abstract the replay tool function into TraceReplayRunner class and derive in Meta internal code repo
to handle the Meta internal env setup and keep most common part in  TraceReplayRunner for OSS
(4) A couple of fixes in trace replay tool to make it E2E function in Meta for table writer use case
(5) A couple of file/class renaming to make the file/class name to be more specific as currently we only
support operator level trace collection and replay

Reviewed By: tanjialiang

Differential Revision: D64946367
@facebook-github-bot
Copy link
Contributor

This pull request was exported from Phabricator. Differential Revision: D64946367

xiaoxmeng added a commit to xiaoxmeng/velox that referenced this pull request Oct 29, 2024
Summary:
Pull Request resolved: facebookincubator#11360

This PR improves the operator trace framework and replay tool implementation:
(1) Let driver execution framework to handle the trace input collection and summary file generation.
The finish trace is either called when trace limit exceeds or on operator close instead of no
more input. So it can capture more information in summary like peak memory usage. By removing
the trace input collection into the driver framework it eases the trace input collection for a spilling
operator. We add to capture the peak memory and input rows in summary so it helps identify the
hot operator or task on replay debugging.
(2) Change the trace storage layout to have root with query id for traces from different tasks
(3) Abstract the replay tool function into TraceReplayRunner class and derive in Meta internal code repo
to handle the Meta internal env setup and keep most common part in  TraceReplayRunner for OSS
(4) A couple of fixes in trace replay tool to make it E2E function in Meta for table writer use case
(5) A couple of file/class renaming to make the file/class name to be more specific as currently we only
support operator level trace collection and replay

Reviewed By: tanjialiang

Differential Revision: D64946367
@facebook-github-bot
Copy link
Contributor

This pull request was exported from Phabricator. Differential Revision: D64946367

xiaoxmeng added a commit to xiaoxmeng/velox that referenced this pull request Oct 30, 2024
Summary:
Pull Request resolved: facebookincubator#11360

This PR improves the operator trace framework and replay tool implementation:
(1) Let driver execution framework to handle the trace input collection and summary file generation.
The finish trace is either called when trace limit exceeds or on operator close instead of no
more input. So it can capture more information in summary like peak memory usage. By removing
the trace input collection into the driver framework it eases the trace input collection for a spilling
operator. We add to capture the peak memory and input rows in summary so it helps identify the
hot operator or task on replay debugging.
(2) Change the trace storage layout to have root with query id for traces from different tasks
(3) Abstract the replay tool function into TraceReplayRunner class and derive in Meta internal code repo
to handle the Meta internal env setup and keep most common part in  TraceReplayRunner for OSS
(4) A couple of fixes in trace replay tool to make it E2E function in Meta for table writer use case
(5) A couple of file/class renaming to make the file/class name to be more specific as currently we only
support operator level trace collection and replay

Reviewed By: tanjialiang

Differential Revision: D64946367
@facebook-github-bot
Copy link
Contributor

This pull request was exported from Phabricator. Differential Revision: D64946367

xiaoxmeng added a commit to xiaoxmeng/velox that referenced this pull request Oct 30, 2024
Summary:
Pull Request resolved: facebookincubator#11360

This PR improves the operator trace framework and replay tool implementation:
(1) Let driver execution framework to handle the trace input collection and summary file generation.
The finish trace is either called when trace limit exceeds or on operator close instead of no
more input. So it can capture more information in summary like peak memory usage. By removing
the trace input collection into the driver framework it eases the trace input collection for a spilling
operator. We add to capture the peak memory and input rows in summary so it helps identify the
hot operator or task on replay debugging.
(2) Change the trace storage layout to have root with query id for traces from different tasks
(3) Abstract the replay tool function into TraceReplayRunner class and derive in Meta internal code repo
to handle the Meta internal env setup and keep most common part in  TraceReplayRunner for OSS
(4) A couple of fixes in trace replay tool to make it E2E function in Meta for table writer use case
(5) A couple of file/class renaming to make the file/class name to be more specific as currently we only
support operator level trace collection and replay

Reviewed By: tanjialiang

Differential Revision: D64946367
xiaoxmeng added a commit to xiaoxmeng/velox that referenced this pull request Oct 30, 2024
Summary:
Pull Request resolved: facebookincubator#11360

This PR improves the operator trace framework and replay tool implementation:
(1) Let driver execution framework to handle the trace input collection and summary file generation.
The finish trace is either called when trace limit exceeds or on operator close instead of no
more input. So it can capture more information in summary like peak memory usage. By removing
the trace input collection into the driver framework it eases the trace input collection for a spilling
operator. We add to capture the peak memory and input rows in summary so it helps identify the
hot operator or task on replay debugging.
(2) Change the trace storage layout to have root with query id for traces from different tasks
(3) Abstract the replay tool function into TraceReplayRunner class and derive in Meta internal code repo
to handle the Meta internal env setup and keep most common part in  TraceReplayRunner for OSS
(4) A couple of fixes in trace replay tool to make it E2E function in Meta for table writer use case
(5) Simplify the trace control logic by throwing if hit trace limit instead of logging in trace summary.
(6) Strict the check if trace plan not is not specified or has specified the wrong node as we are
only supposed to use trace for debugging purpose instead of production query running.
(7) A couple of file/class renaming to make the file/class name to be more specific as currently we only
support operator level trace collection and replay

Reviewed By: tanjialiang

Differential Revision: D64946367
@facebook-github-bot
Copy link
Contributor

This pull request was exported from Phabricator. Differential Revision: D64946367

@facebook-github-bot
Copy link
Contributor

This pull request was exported from Phabricator. Differential Revision: D64946367

xiaoxmeng added a commit to xiaoxmeng/velox that referenced this pull request Oct 30, 2024
Summary:
Pull Request resolved: facebookincubator#11360

This PR improves the operator trace framework and replay tool implementation:
(1) Let driver execution framework to handle the trace input collection and summary file generation.
The finish trace is either called when trace limit exceeds or on operator close instead of no
more input. So it can capture more information in summary like peak memory usage. By removing
the trace input collection into the driver framework it eases the trace input collection for a spilling
operator. We add to capture the peak memory and input rows in summary so it helps identify the
hot operator or task on replay debugging.
(2) Change the trace storage layout to have root with query id for traces from different tasks
(3) Abstract the replay tool function into TraceReplayRunner class and derive in Meta internal code repo
to handle the Meta internal env setup and keep most common part in  TraceReplayRunner for OSS
(4) A couple of fixes in trace replay tool to make it E2E function in Meta for table writer use case
(5) Simplify the trace control logic by throwing if hit trace limit instead of logging in trace summary.
(6) Strict the check if trace plan not is not specified or has specified the wrong node as we are
only supposed to use trace for debugging purpose instead of production query running.
(7) A couple of file/class renaming to make the file/class name to be more specific as currently we only
support operator level trace collection and replay

Reviewed By: tanjialiang

Differential Revision: D64946367
@facebook-github-bot
Copy link
Contributor

This pull request was exported from Phabricator. Differential Revision: D64946367

xiaoxmeng added a commit to xiaoxmeng/velox that referenced this pull request Oct 30, 2024
Summary:
Pull Request resolved: facebookincubator#11360

This PR improves the operator trace framework and replay tool implementation:
(1) Let driver execution framework to handle the trace input collection and summary file generation.
The finish trace is either called when trace limit exceeds or on operator close instead of no
more input. So it can capture more information in summary like peak memory usage. By removing
the trace input collection into the driver framework it eases the trace input collection for a spilling
operator. We add to capture the peak memory and input rows in summary so it helps identify the
hot operator or task on replay debugging.
(2) Change the trace storage layout to have root with query id for traces from different tasks
(3) Abstract the replay tool function into TraceReplayRunner class and derive in Meta internal code repo
to handle the Meta internal env setup and keep most common part in  TraceReplayRunner for OSS
(4) A couple of fixes in trace replay tool to make it E2E function in Meta for table writer use case
(5) Simplify the trace control logic by throwing if hit trace limit instead of logging in trace summary.
(6) Strict the check if trace plan not is not specified or has specified the wrong node as we are
only supposed to use trace for debugging purpose instead of production query running.
(7) A couple of file/class renaming to make the file/class name to be more specific as currently we only
support operator level trace collection and replay

Reviewed By: tanjialiang

Differential Revision: D64946367
xiaoxmeng added a commit to xiaoxmeng/velox that referenced this pull request Oct 30, 2024
Summary:
Pull Request resolved: facebookincubator#11360

This PR improves the operator trace framework and replay tool implementation:
(1) Let driver execution framework to handle the trace input collection and summary file generation.
The finish trace is either called when trace limit exceeds or on operator close instead of no
more input. So it can capture more information in summary like peak memory usage. By removing
the trace input collection into the driver framework it eases the trace input collection for a spilling
operator. We add to capture the peak memory and input rows in summary so it helps identify the
hot operator or task on replay debugging.
(2) Change the trace storage layout to have root with query id for traces from different tasks
(3) Abstract the replay tool function into TraceReplayRunner class and derive in Meta internal code repo
to handle the Meta internal env setup and keep most common part in  TraceReplayRunner for OSS
(4) A couple of fixes in trace replay tool to make it E2E function in Meta for table writer use case
(5) Simplify the trace control logic by throwing if hit trace limit instead of logging in trace summary.
(6) Strict the check if trace plan not is not specified or has specified the wrong node as we are
only supposed to use trace for debugging purpose instead of production query running.
(7) A couple of file/class renaming to make the file/class name to be more specific as currently we only
support operator level trace collection and replay

Reviewed By: tanjialiang

Differential Revision: D64946367
@facebook-github-bot
Copy link
Contributor

This pull request was exported from Phabricator. Differential Revision: D64946367

Summary:
Pull Request resolved: facebookincubator#11360

This PR improves the operator trace framework and replay tool implementation:
(1) Let driver execution framework to handle the trace input collection and summary file generation.
The finish trace is either called when trace limit exceeds or on operator close instead of no
more input. So it can capture more information in summary like peak memory usage. By removing
the trace input collection into the driver framework it eases the trace input collection for a spilling
operator. We add to capture the peak memory and input rows in summary so it helps identify the
hot operator or task on replay debugging.
(2) Change the trace storage layout to have root with query id for traces from different tasks
(3) Abstract the replay tool function into TraceReplayRunner class and derive in Meta internal code repo
to handle the Meta internal env setup and keep most common part in  TraceReplayRunner for OSS
(4) A couple of fixes in trace replay tool to make it E2E function in Meta for table writer use case
(5) Simplify the trace control logic by throwing if hit trace limit instead of logging in trace summary.
(6) Strict the check if trace plan not is not specified or has specified the wrong node as we are
only supposed to use trace for debugging purpose instead of production query running.
(7) A couple of file/class renaming to make the file/class name to be more specific as currently we only
support operator level trace collection and replay

Reviewed By: tanjialiang

Differential Revision: D64946367
@facebook-github-bot
Copy link
Contributor

This pull request was exported from Phabricator. Differential Revision: D64946367

@facebook-github-bot
Copy link
Contributor

This pull request has been merged in df5cff5.

Copy link

Conbench analyzed the 1 benchmark run on commit df5cff50.

There were no benchmark performance regressions. 🎉

The full Conbench report has more details.

facebook-github-bot pushed a commit that referenced this pull request Oct 30, 2024
Summary:
A follow-up fix PR for #11360 to address the linking error:

```cmake
[1432/1435] Linking CXX executable velox/tool/trace/velox_query_replayer
FAILED: velox/tool/trace/velox_query_replayer
...
Undefined symbols for architecture arm64:
  "facebook::velox::tool::trace::TraceReplayRunner::run()", referenced from:
      _main in TraceReplayerMain.cpp.o
  "facebook::velox::tool::trace::TraceReplayRunner::init()", referenced from:
      _main in TraceReplayerMain.cpp.o
  "facebook::velox::tool::trace::TraceReplayRunner::TraceReplayRunner()", referenced from:
      _main in TraceReplayerMain.cpp.o
  "vtable for facebook::velox::tool::trace::TraceReplayRunner", referenced from:

Reviewed By: xiaoxmeng

Differential Revision: D65217021

Pulled By: kgpai

fbshipit-source-id: 2e0e44d916a27853366d77d883f0add70993fdc0
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. fb-exported Merged
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants