Register nms and roi_align Autocast policy for PyTorch Intel GPU backend #8541

fengyuan14 · 2024-07-23T03:29:34Z

The Intel GPU backend has been introduced to PyTorch (https://github.com/pytorch/pytorch/blob/main/third_party/xpu.txt). Some Torchvision models fail on torchvision::nms and torchvision::roi_align when AMP execution is enabled. This PR adds an Autocast policy for the Intel GPU backend that aligns with the CUDA and CPU policies and computes these two operators in float.
For details about the PyTorch Intel GPU backend, see [RFC] Intel GPU Upstreaming · Issue #114723 · pytorch/pytorch.
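
To illustrate what "calculating on float" means for an Autocast policy, here is a minimal, torch-free sketch. It is a toy model, not the code in this PR: `ToyTensor`, `fp32_only_kernel`, and `with_fp32_autocast_policy` are hypothetical names, and casting the result back to the input dtype is an assumption of the sketch (an op such as nms that returns indices would not need that step).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToyTensor:
    """Hypothetical stand-in for a tensor: a dtype tag plus flat values."""
    dtype: str
    values: tuple

    def to(self, dtype: str) -> "ToyTensor":
        # A real cast would change precision; the toy only retags the dtype.
        return ToyTensor(dtype, self.values)


def fp32_only_kernel(x: ToyTensor, rois: ToyTensor) -> ToyTensor:
    """Hypothetical kernel that only accepts float32 inputs."""
    if x.dtype != "float32" or rois.dtype != "float32":
        raise RuntimeError(f"kernel not implemented for {x.dtype}/{rois.dtype}")
    return ToyTensor("float32", x.values)


def with_fp32_autocast_policy(kernel):
    """Wrap a kernel so it computes in float32 and returns the input dtype."""
    def wrapper(x: ToyTensor, rois: ToyTensor) -> ToyTensor:
        # Cast every floating-point input up to float32 before the call.
        out = kernel(x.to("float32"), rois.to("float32"))
        # Assumption of this sketch: hand back the caller's original dtype.
        return out.to(x.dtype)
    return wrapper


roi_align_like = with_fp32_autocast_policy(fp32_only_kernel)
x = ToyTensor("bfloat16", (1.0, 2.0))
rois = ToyTensor("float16", (0.0, 0.0, 1.0, 1.0))
result = roi_align_like(x, rois)
print(result.dtype)  # bfloat16
```

Without the wrapper, calling `fp32_only_kernel(x, rois)` directly raises a `RuntimeError`, which is the kind of failure a missing policy would surface under AMP.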

pytorch-bot · 2024-07-23T03:29:36Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/vision/8541

📄 Preview Python docs built from this PR

Note: Links to docs will display an error until the docs builds have been completed.

❌ 7 New Failures, 1 Unrelated Failure

As of commit 9ef9377 with merge base 61bd547:

NEW FAILURES - The following jobs have failed:

CMake / windows (windows.4xlarge, cpu) / windows-job (gh)
The process 'C:\Program Files\Git\cmd\git.exe' failed with exit code 128
Tests / unittests-linux (3.12, linux.12xlarge, cpu) / linux-job (gh)
test/test_ops.py::TestRoIAlign::test_autocast_cpu[rois_dtype1-x_dtype1-False-True]
Tests / unittests-windows (3.10, windows.4xlarge, cpu) / windows-job (gh)
The process 'C:\Program Files\Git\cmd\git.exe' failed with exit code 128
Tests / unittests-windows (3.11, windows.4xlarge, cpu) / windows-job (gh)
The process 'C:\Program Files\Git\cmd\git.exe' failed with exit code 128
Tests / unittests-windows (3.12, windows.4xlarge, cpu) / windows-job (gh)
The process 'C:\Program Files\Git\cmd\git.exe' failed with exit code 128
Tests / unittests-windows (3.8, windows.4xlarge, cpu) / windows-job (gh)
The process 'C:\Program Files\Git\cmd\git.exe' failed with exit code 128
Tests / unittests-windows (3.9, windows.4xlarge, cpu) / windows-job (gh)
The process 'C:\Program Files\Git\cmd\git.exe' failed with exit code 128

BROKEN TRUNK - The following job failed but was also failing on the merge base:

👉 Rebase onto the `viable/strict` branch to avoid these failures

Tests / unittests-macos (3.11, macos-m1-stable) / macos-job (gh) (trunk failure)
RuntimeError: operator torchvision::nms does not exist

This comment was automatically generated by Dr. CI and updates every 15 minutes.

EikanWang · 2024-07-23T04:38:14Z

Please refine the PR title and description.

fengyuan14 · 2024-07-24T00:59:29Z

The lint error is unrelated to this PR.

fengyuan14 · 2024-07-24T01:04:37Z

> Please refine the PR title and description.

Updated.

atalman · 2024-07-24T13:53:23Z

Hi @fengyuan14, it looks like there are a couple of CI failures (RuntimeError: operator torchvision::nms does not exist); please fix them. These do not look like existing errors, since the current state of torchvision is green: https://hud2.pytorch.org/hud/pytorch/vision/main

fengyuan14 · 2024-07-25T00:48:16Z

> Hi @fengyuan14, it looks like there are a couple of CI failures (RuntimeError: operator torchvision::nms does not exist); please fix them. These do not look like existing errors, since the current state of torchvision is green: https://hud2.pytorch.org/hud/pytorch/vision/main

Sure. I am checking what differs here, since it works in my local environment.

fengyuan14 · 2024-07-25T01:11:04Z

> Hi @fengyuan14, it looks like there are a couple of CI failures (RuntimeError: operator torchvision::nms does not exist); please fix them. These do not look like existing errors, since the current state of torchvision is green: https://hud2.pytorch.org/hud/pytorch/vision/main

I checked the failure. It occurs in the "Install testing utilities" step, and only on macOS + Python 3.11. That step runs before Torchvision is built with my changes. In addition, the other jobs (macOS/Linux + Python 3.xx) passed this step.

  Installing collected packages: urllib3, pluggy, iniconfig, idna, expecttest, coverage, charset-normalizer, certifi, requests, pytest, pytest-mock, pytest-cov
  Successfully installed certifi-2024.7.4 charset-normalizer-3.3.2 coverage-7.6.0 expecttest-0.2.1 idna-3.7 iniconfig-2.0.0 pluggy-1.5.0 pytest-7.4.4 pytest-cov-5.0.0 pytest-mock-3.14.0 requests-2.32.3 urllib3-2.2.2
Traceback (most recent call last):
  File "/Users/ec2-user/runner/_work/vision/vision/pytorch/vision/test/smoke_test.py", line 7, in <module>
    import torchvision
  File "/Users/ec2-user/.local/lib/python3.11/site-packages/torchvision/__init__.py", line 6, in <module>
    from torchvision import _meta_registrations, datasets, io, models, ops, transforms, utils
  File "/Users/ec2-user/.local/lib/python3.11/site-packages/torchvision/_meta_registrations.py", line 163, in <module>
    @torch._custom_ops.impl_abstract("torchvision::nms")
     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/ec2-user/runner/_work/_temp/miniconda/envs/ci/lib/python3.11/site-packages/torch/library.py", line 744, in register
    use_lib._register_fake(op_name, func, _stacklevel=stacklevel + 1)
  File "/Users/ec2-user/runner/_work/_temp/miniconda/envs/ci/lib/python3.11/site-packages/torch/library.py", line 183, in _register_fake
    handle = entry.fake_impl.register(func_to_register, source)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/ec2-user/runner/_work/_temp/miniconda/envs/ci/lib/python3.11/site-packages/torch/_library/fake_impl.py", line 31, in register
    if torch._C._dispatch_has_kernel_for_dispatch_key(self.qualname, "Meta"):
       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: operator torchvision::nms does not exist
ERROR conda.cli.main_run:execute(47): `conda run ./.github/scripts/unittest.sh` failed. (See above for error)
Error: Process completed with exit code 1.

Could you help rerun the failing job?
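
For triaging logs like the one above, the key line is the final exception of the traceback, not the later generic "exit code 1" lines. The following is a small sketch of a helper that pulls that line out of a CI log; `final_exception` is a hypothetical name, and the sample text is abridged from the log in this comment.

```python
import re
from typing import Optional

# Unindented lines such as "RuntimeError: message" that end a traceback.
_EXC_LINE = re.compile(r"^[A-Za-z_][\w.]*(Error|Exception): ")


def final_exception(log_text: str) -> Optional[str]:
    """Return the last exception line that closes a Python traceback."""
    in_traceback = False
    last = None
    for line in log_text.splitlines():
        if line.startswith("Traceback (most recent call last):"):
            in_traceback = True
            continue
        if in_traceback and _EXC_LINE.match(line):
            last = line.strip()
            in_traceback = False  # ignore later non-traceback "Error:" lines
    return last


sample = """Traceback (most recent call last):
  File "/Users/ec2-user/runner/_work/vision/vision/pytorch/vision/test/smoke_test.py", line 7, in <module>
    import torchvision
RuntimeError: operator torchvision::nms does not exist
ERROR conda.cli.main_run:execute(47): `conda run ./.github/scripts/unittest.sh` failed. (See above for error)
Error: Process completed with exit code 1.
"""

print(final_exception(sample))
# RuntimeError: operator torchvision::nms does not exist
```

Here the exception is raised by `import torchvision` in the smoke test, which is consistent with the failure happening before the PR's build is exercised.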

EikanWang · 2024-07-25T07:24:46Z

@fengyuan14, you can rebase this PR to trigger the CI again.

EikanWang · 2024-07-25T15:26:35Z

@pytorchbot rebase -b main

pytorch-bot · 2024-07-25T15:26:37Z

You don't have permissions to rebase this PR since you are a first time contributor. If you think this is a mistake, please contact PyTorch Dev Infra.

fengyuan14 · 2024-07-30T06:35:58Z

@NicolasHug, could you help review this PR?

NicolasHug

Thank you

… GPU backend (#8541) Summary: Signed-off-by: Feng Yuan <feng1.yuan@intel.com> Differential Revision: D60903715 fbshipit-source-id: c0919c6b0473432b7ef7a30599c3dc143e3da1cf Co-authored-by: Nicolas Hug <nh.nicolas.hug@gmail.com>

facebook-github-bot added the cla signed label Jul 23, 2024

EikanWang approved these changes Jul 23, 2024

Register Autocast policy (Align with CUDA and CPU) of nms and roi_ali…

5892e65

…gn for XPU Signed-off-by: Feng Yuan <feng1.yuan@intel.com>

fengyuan14 changed the title ~~Register Autocast policy (Align with CUDA and CPU) of nms and roi_ali…~~ Register nms and roi_align Autocast policy for XPU Jul 24, 2024

fengyuan14 changed the title ~~Register nms and roi_align Autocast policy for XPU~~ Register nms and roi_align Autocast policy for PyTorch Intel GPU backend Jul 24, 2024

atalman requested a review from NicolasHug July 24, 2024 13:54

Merge branch 'main' into fy/autocast-xpu

dfb645f

NicolasHug mentioned this pull request Jul 29, 2024

Add xpu linux wheel build into torchvision build matrix #8542

Merged

Merge branch 'main' into fy/autocast-xpu

df4cedd

Merge branch 'main' into fy/autocast-xpu

9ef9377

NicolasHug approved these changes Aug 6, 2024

NicolasHug merged commit c8c496d into pytorch:main Aug 6, 2024
57 of 65 checks passed

NicolasHug added enhancement module: ops labels Aug 6, 2024

This was referenced Aug 8, 2024

Vision_maskrcnn RuntimeError got diff tensor dtype intel/torch-xpu-ops#496

Open

[E2E] Torchbench accuracy "roi_align_forward_kernel" not implemented for 'BFloat16' intel/torch-xpu-ops#713

Open
