Add tooling to explain why a graph execution happens #5723

Merged
merged 8 commits into from
Oct 31, 2023


Collaborator

@JackCaoG JackCaoG commented Oct 23, 2023

FYI @Liyang90 @AlexWertheim

example output

Execution Analysis: ================================================================================
Execution Analysis: Execution Cause
Execution Analysis:   mark_step in parallel loader at step end
Execution Analysis: Python Frame Triggered Execution: 
Execution Analysis:   mark_step (/src/pytorch/xla/torch_xla/core/xla_model.py:818)
Execution Analysis:   next (/src/pytorch/xla/torch_xla/distributed/parallel_loader.py:44)
Execution Analysis:   __next__ (/src/pytorch/xla/torch_xla/distributed/parallel_loader.py:32)
Execution Analysis:   train_loop_fn (/src/pytorch/xla/test/test_train_mp_imagenet.py:296)
Execution Analysis:   train_imagenet (/src/pytorch/xla/test/test_train_mp_imagenet.py:345)
Execution Analysis:   _mp_fn (/src/pytorch/xla/test/test_train_mp_imagenet.py:369)
Execution Analysis:   __call__ (/src/pytorch/xla/torch_xla/_internal/pjrt.py:189)
Execution Analysis:   _thread_fn (/src/pytorch/xla/torch_xla/_internal/pjrt.py:70)
Execution Analysis:   run (/usr/local/lib/python3.8/concurrent/futures/thread.py:57)
Execution Analysis:   _worker (/usr/local/lib/python3.8/concurrent/futures/thread.py:80)
Execution Analysis:   run (/usr/local/lib/python3.8/threading.py:870)
Execution Analysis:   _bootstrap_inner (/usr/local/lib/python3.8/threading.py:932)
Execution Analysis:   _bootstrap (/usr/local/lib/python3.8/threading.py:890)
Execution Analysis: --------------------------------------------------------------------------------
Execution Analysis: ================================================================================
Execution Analysis: ================================================================================
Execution Analysis: Execution Cause
Execution Analysis:   mark_step when existing a profiler StepTrace region
Execution Analysis: Python Frame Triggered Execution: 
Execution Analysis:   mark_step (/src/pytorch/xla/torch_xla/core/xla_model.py:818)
Execution Analysis:   __exit__ (/src/pytorch/xla/torch_xla/debug/profiler.py:160)
Execution Analysis:   train_loop_fn (/src/pytorch/xla/test/test_train_mp_imagenet.py:311)
Execution Analysis:   train_imagenet (/src/pytorch/xla/test/test_train_mp_imagenet.py:345)
Execution Analysis:   _mp_fn (/src/pytorch/xla/test/test_train_mp_imagenet.py:369)
Execution Analysis:   __call__ (/src/pytorch/xla/torch_xla/_internal/pjrt.py:189)
Execution Analysis:   _thread_fn (/src/pytorch/xla/torch_xla/_internal/pjrt.py:70)
Execution Analysis:   run (/usr/local/lib/python3.8/concurrent/futures/thread.py:57)
Execution Analysis:   _worker (/usr/local/lib/python3.8/concurrent/futures/thread.py:80)
Execution Analysis:   run (/usr/local/lib/python3.8/threading.py:870)
Execution Analysis:   _bootstrap_inner (/usr/local/lib/python3.8/threading.py:932)
Execution Analysis:   _bootstrap (/usr/local/lib/python3.8/threading.py:890)
Execution Analysis: --------------------------------------------------------------------------------
Execution Analysis: ================================================================================
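For readers following along, here is a minimal Python sketch of the kind of frame-based classification the two reports above suggest: look at the innermost frame, and if it is `mark_step`, look at its caller to guess the cause. This is only an illustration under assumptions. The real check in this PR is written in C++ and may differ, and `Frame`, `explain_execution`, and the returned labels are made-up names, not part of torch_xla.

```python
from typing import List, NamedTuple


class Frame(NamedTuple):
    """One Python stack frame (hypothetical type, innermost frame first)."""
    function: str
    file: str
    line: int


def explain_execution(frames: List[Frame]) -> str:
    """Guess why a graph execution happened from the Python stack."""
    if not frames:
        return "unknown"
    if frames[0].function != "mark_step":
        # Assumed fallback: just report the innermost function name.
        return f"other: {frames[0].function}"
    caller = frames[1] if len(frames) > 1 else None
    if caller and caller.function == "next" and caller.file.endswith("parallel_loader.py"):
        return "parallel loader step end"
    if caller and caller.function == "__exit__" and caller.file.endswith("profiler.py"):
        return "profiler StepTrace exit"
    # mark_step called from anywhere else, e.g. directly from user code.
    return "direct mark_step call"


if __name__ == "__main__":
    stack = [
        Frame("mark_step", "/src/pytorch/xla/torch_xla/core/xla_model.py", 818),
        Frame("next", "/src/pytorch/xla/torch_xla/distributed/parallel_loader.py", 44),
    ]
    print(explain_execution(stack))
```

Matching on the caller's function name and file suffix keeps the check cheap, at the cost of being sensitive to renames in those files.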

<< "======================================================================"
"=========="
<< "\n";
if (frames[0].function == "mark_step") {
Collaborator

Wondering if it's worth logging frames[0].function for the cases other than mark_step.

Collaborator Author

Yeah, I want to expand this a bit later to cover most of the common execution cases.

@JackCaoG
Collaborator Author

I need to add some tests; I will try to do that this week.

@JackCaoG JackCaoG changed the title [WIP] Add tooling to explain why a graph execution happens Add tooling to explain why a graph execution happens Oct 28, 2023
@JackCaoG
Collaborator Author

This should be ready for review.

@JackCaoG
Collaborator Author

I will update the troubleshooting doc in a separate PR.

@JackCaoG
Collaborator Author

@will-cromar Can I get a review for this one?

Collaborator

@alanwaketan alanwaketan left a comment

I think this information is already in the dumped graphs?

Will the UX be better with this change where the information is printed to the console? It may not be the case as the information will be interleaved with the user code.

@JackCaoG
Collaborator Author

The dumped graph is in a separate file, and it always dumps the HLO or IR, which can be huge. The idea is that we should dump something more concise and provide some explanation of what is happening. This output can be interleaved with the debugging messages users add to their model code.
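Since every line of the report carries the `Execution Analysis:` prefix shown in the example output, it is easy to pull the report back out of a console log that also contains user prints. A rough Python sketch, assuming the prefix stays as shown; `split_log` is a hypothetical helper, not something this PR adds:

```python
from typing import Iterable, List, Tuple

# Prefix taken from the example output in this PR.
PREFIX = "Execution Analysis:"


def split_log(lines: Iterable[str]) -> Tuple[List[str], List[str]]:
    """Separate analysis lines (prefix and indentation stripped) from all other lines."""
    analysis: List[str] = []
    other: List[str] = []
    for line in lines:
        if line.startswith(PREFIX):
            analysis.append(line[len(PREFIX):].strip())
        else:
            other.append(line.rstrip("\n"))
    return analysis, other


if __name__ == "__main__":
    log = [
        "step 0 loss 1.5",
        "Execution Analysis: Execution Cause",
        "Execution Analysis:   mark_step in parallel loader at step end",
    ]
    analysis, other = split_log(log)
    print(analysis)
    print(other)
```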

Collaborator

@alanwaketan alanwaketan left a comment

LGTM.

@JackCaoG JackCaoG merged commit b6a03d9 into master Oct 31, 2023
18 checks passed
@JackCaoG
Collaborator Author

@carmocca FYI, this is a WIP; I will add more to this tooling.

mbzomowski pushed a commit to mbzomowski-test-org/xla that referenced this pull request Nov 16, 2023
* Initial commit for debugging tool

* minor format tweak

* Only master process should print the execution frame info

* add execution cause

* handle dynamo and everything else

* add test

* linter

* add test to the script
ManfeiBai pushed a commit that referenced this pull request Nov 29, 2023
chunnienc pushed a commit to chunnienc/xla that referenced this pull request Dec 14, 2023
golechwierowicz pushed a commit that referenced this pull request Jan 12, 2024
bhavya01 pushed a commit that referenced this pull request Apr 22, 2024