Add No Support Label for ROCm GPU in pytorch profiler tutorial #2674
Conversation
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/tutorials/2674
Note: Links to docs will display an error until the docs builds have been completed. ✅ No Failures as of commit 3aa5d06 with merge base dc448c2. This comment was automatically generated by Dr. CI and updates every 15 minutes.
@pytorchbot label "docathon-h2-2023"
Force-pushed from 532b321 to 4c3dd3e
@@ -7,7 +7,7 @@
 Introduction
 ------------
 PyTorch 1.8 includes an updated profiler API capable of
-recording the CPU side operations as well as the CUDA kernel launches on the GPU side.
+recording the CPU side operations as well as the CUDA kernel launches on the GPU side (``AMD ROCm™`` GPUs are not supported).
@jeffdaily is this indeed the case?
Also, why are you adding backticks here?
ROCm has full upstream support for both modern PyTorch profiling via Kineto + ROCm's libroctracer, as well as the older autograd profiler via ROCm's roctx.
Run any application, PyTorch or otherwise, under ROCm's rocprof. The PyTorch GPU traces will be collected as part of your rocprof output.
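Editor's note: if ROCm profiling flows through the same Kineto path, the standard `torch.profiler` usage from the tutorial should apply unchanged. A minimal sketch (assuming a recent PyTorch build; the matmul workload and table sorting are purely illustrative):

```python
import torch
from torch.profiler import profile, ProfilerActivity

# ProfilerActivity.CUDA covers GPU kernel traces on both CUDA and, per the
# discussion above, ROCm builds; on a CPU-only build we profile CPU ops only.
activities = [ProfilerActivity.CPU]
if torch.cuda.is_available():  # also reports True on ROCm builds of PyTorch
    activities.append(ProfilerActivity.CUDA)

with profile(activities=activities) as prof:
    # Illustrative workload; any model or op sequence works here.
    torch.matmul(torch.randn(256, 256), torch.randn(256, 256))

# Print a per-operator summary of the recorded trace.
print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=5))
```

On a ROCm machine this would be the quickest way to confirm whether GPU-side events actually appear in the table, which is the claim under dispute in this PR.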
@malfet I quoted the function supported_activities
from the official PyTorch API, which does not mention ROCm GPUs at all. I suppose it should say something about ROCm profiling if that's the case.
Also, the backticks are due to an error in the PySpelling test: ROCm
is not in the dictionary, so the test flagged it as a misspelled word. I suppose I can either do this or add it to a custom dictionary list.
We should update that comment. The ROCm profiler uses the roctracer library to trace on-device HIP kernels. Passing ProfilerActivity.CUDA
is the correct activity for both CUDA and ROCm, due to ROCm PyTorch's strategy of "masquerading" as CUDA so that users do not have to make any changes to their PyTorch models when running on ROCm. PyTorch chose to expose "CUDA" in its public APIs, and we chose to reuse them to make the transition from CUDA to ROCm easier on users.
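Editor's note: a quick way to inspect what a given build reports. This is a hedged sketch; per the comment above, a ROCm build is expected to include `ProfilerActivity.CUDA` in this set, since ROCm reuses the CUDA-named public APIs:

```python
import torch.profiler
from torch.profiler import ProfilerActivity

# supported_activities() reflects the current build: CPU is always present,
# and ProfilerActivity.CUDA appears on CUDA builds -- and, per the comment
# above, on ROCm builds as well, since ROCm masquerades as CUDA.
activities = torch.profiler.supported_activities()
print(activities)
print(ProfilerActivity.CUDA in activities)  # expected True on CUDA/ROCm builds
```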
To further substantiate the claim, can you please specify in the PR description why you are making it (i.e., you tried to run the tutorial on such-and-such GPU using PyTorch X.Y.Z and, instead of a certain output, you got something else)? Right now you are just adding a blanket statement, which would be very hard to verify.
@malfet I was testing issue #2014, but since I did not have the specific hardware (i.e., an AMD GPU) to test the trace, I assumed that the person who raised the issue provided a correct trace image, which seems to show wrong output. Since the trace is fine for CPU and CUDA (on testing), and there was no mention in the profiler API of ROCm being supported, I raised this PR. Can you verify this tutorial and https://pytorch.org/tutorials/intermediate/tensorboard_profiler_tutorial.html on ROCm and check whether the output trace is correct? It seems to be off in time by about 250 ms, which is not the case for CPU and CUDA devices. Thanks @jeffdaily
I'm assigning this to @hongxiayang to verify.
OK, seems like this issue is resolved, so I am closing this PR.
Fixes #2014
Description
This PR relates to the issue of ROCm support in the PyTorch Profiler, which I have seen can produce some bugs when testing with the PyTorch API. The official files in
PyTorch/Profiler
do not have an option for its support yet, as shown here: https://github.com/pytorch/pytorch/blob/7f1cbc8b5a7de8794c36833baa47bb1343833589/torch/profiler/profiler.py#L35. ROCm usage can give some unexpected results, so right now it's best to mark the usage as unsupported. I plan to raise this issue with the PyTorch team, as it will be important to support ROCm GPUs as their usage grows.
Checklist
cc @aaronenyeshi @chaekit @jcarreiro @sekyondaMeta @svekars @carljparker @NicolasHug @kit1980 @subramen