
Vision_maskrcnn RuntimeError got diff tensor dtype #496

Open

mengfei25 opened this issue Jun 27, 2024 · 5 comments

@mengfei25
Contributor

🐛 Describe the bug

torchbench_amp_fp16_training
xpu train vision_maskrcnn
Traceback (most recent call last):
File "/home/sdp/actions-runner/_work/torch-xpu-ops/pytorch/benchmarks/dynamo/common.py", line 2294, in validate_model
self.model_iter_fn(model, example_inputs)
File "/home/sdp/actions-runner/_work/torch-xpu-ops/pytorch/benchmarks/dynamo/torchbench.py", line 456, in forward_and_backward_pass
pred = mod(*cloned_inputs)
File "/home/sdp/miniforge3/envs/e2e_ci/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1566, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/sdp/miniforge3/envs/e2e_ci/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1575, in _call_impl
return forward_call(*args, **kwargs)
File "/home/sdp/miniforge3/envs/e2e_ci/lib/python3.10/site-packages/torchvision/models/detection/generalized_rcnn.py", line 105, in forward
detections, detector_losses = self.roi_heads(features, proposals, images.image_sizes, targets)
File "/home/sdp/miniforge3/envs/e2e_ci/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1566, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/sdp/miniforge3/envs/e2e_ci/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1575, in _call_impl
return forward_call(*args, **kwargs)
File "/home/sdp/miniforge3/envs/e2e_ci/lib/python3.10/site-packages/torchvision/models/detection/roi_heads.py", line 761, in forward
box_features = self.box_roi_pool(features, proposals, image_shapes)
File "/home/sdp/miniforge3/envs/e2e_ci/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1566, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/sdp/miniforge3/envs/e2e_ci/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1575, in _call_impl
return forward_call(*args, **kwargs)
File "/home/sdp/miniforge3/envs/e2e_ci/lib/python3.10/site-packages/torchvision/ops/poolers.py", line 314, in forward
return _multiscale_roi_align(
File "/home/sdp/miniforge3/envs/e2e_ci/lib/python3.10/site-packages/torchvision/ops/poolers.py", line 204, in _multiscale_roi_align
result_idx_in_level = roi_align(
File "/home/sdp/miniforge3/envs/e2e_ci/lib/python3.10/site-packages/torchvision/ops/roi_align.py", line 238, in roi_align
return torch.ops.torchvision.roi_align(
File "/home/sdp/miniforge3/envs/e2e_ci/lib/python3.10/site-packages/torch/ops.py", line 1064, in call
return self
._op(*args, **(kwargs or {}))
RuntimeError: Expected tensor for argument #1 'input' to have the same type as tensor for argument #2 'rois'; but type torch.HalfTensor does not equal torch.FloatTensor (while checking arguments for roi_align_forward_kernel)

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
File "/home/sdp/actions-runner/_work/torch-xpu-ops/pytorch/benchmarks/dynamo/common.py", line 4177, in run
) = runner.load_model(
File "/home/sdp/actions-runner/_work/torch-xpu-ops/pytorch/benchmarks/dynamo/torchbench.py", line 380, in load_model
self.validate_model(model, example_inputs)
File "/home/sdp/actions-runner/_work/torch-xpu-ops/pytorch/benchmarks/dynamo/common.py", line 2296, in validate_model
raise RuntimeError("Eager run failed") from e
RuntimeError: Eager run failed

eager_fail_to_run
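
A minimal sketch of what appears to go wrong (assuming an XPU device and a torchvision build without an autocast wrapper for roi_align): under torch.autocast, the feature map produced by a convolution comes out as fp16, while rois created outside the autocast region stay fp32, which trips the dtype check in roi_align_forward_kernel.

```python
import torch
from torchvision.ops import roi_align

# Hypothetical minimal repro, assuming an XPU device is available and the
# installed torchvision has no autocast wrapper registered for roi_align.
device = "xpu"
conv = torch.nn.Conv2d(3, 8, kernel_size=3, padding=1).to(device)
image = torch.randn(1, 3, 64, 64, device=device)
# rois format: (batch_index, x1, y1, x2, y2); created outside autocast -> fp32
rois = torch.tensor([[0.0, 4.0, 4.0, 32.0, 32.0]], device=device)

with torch.autocast(device_type="xpu", dtype=torch.float16):
    fmap = conv(image)  # autocast produces an fp16 feature map
    # Without an autocast wrapper for torchvision::roi_align, fmap (Half) and
    # rois (Float) reach the kernel with mismatched dtypes and raise the
    # "Expected tensor for argument #1 'input' ..." error above.
    out = roi_align(fmap, rois, output_size=(7, 7), spatial_scale=1.0)
```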

Versions

torch-xpu-ops: 31c4001
pytorch: 0f81473d7b4a1bf09246410712df22541be7caf3 + PRs: 127277,129120
device: PVC 1100, 803.61, 0.5.1

@retonym
Contributor

retonym commented Jul 15, 2024

The issue occurs exclusively in AMP mode and does not happen in pure BF16/FP16 modes. I suspect the crash is caused by the absence of an AutocastXPU backend registration for the torchvision roi_align operator.
@fengyuan14 could you provide your insights on this?
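
If that is the cause, the shape of the fix would be an autocast wrapper registered for the XPU autocast dispatch key, analogous to the existing CUDA one. A rough Python-side sketch of the idea follows; the real fix would live in torchvision's C++ autocast kernels, and the "AutocastXPU" dispatch-key name and the roi_align schema used below are assumptions, not confirmed details of the eventual patch.

```python
import torch
import torchvision  # makes the torchvision::roi_align op visible to the dispatcher

# Illustration only, not the actual torchvision change: register a wrapper on
# the (assumed) AutocastXPU dispatch key that casts both tensors to float32
# before the kernel runs, then casts the result back to the input dtype.
_lib = torch.library.Library("torchvision", "IMPL")

def _roi_align_autocast_xpu(input, rois, spatial_scale, pooled_height,
                            pooled_width, sampling_ratio, aligned):
    # Disable XPU autocast while re-dispatching so this wrapper is not re-entered.
    with torch.autocast(device_type="xpu", enabled=False):
        out = torch.ops.torchvision.roi_align(
            input.float(), rois.float(), spatial_scale,
            pooled_height, pooled_width, sampling_ratio, aligned)
    return out.to(input.dtype)

_lib.impl("roi_align", _roi_align_autocast_xpu, "AutocastXPU")
```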

@chuanqi129
Contributor

chuanqi129 commented Aug 8, 2024

@fengyuan14 has landed a PR that fixes this issue (pytorch/vision#8541); we are waiting for PyTorch to update its pinned torchvision commit.

@chuanqi129
Contributor

@mengfei25 please help check whether the torchvision commit pinned by the latest PyTorch includes Feng's fix.
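
One way to check from the environment itself (a small sketch; it assumes the installed torchvision build populates version.git_version, which can then be compared against the commit that merged pytorch/vision#8541):

```python
import torchvision

# Print the commit the installed torchvision was built from, to compare
# against the fix commit from pytorch/vision#8541.
print("torchvision version:", torchvision.__version__)
print("built from commit:", torchvision.version.git_version)
```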

@chuanqi129 chuanqi129 modified the milestones: PT2.5, PT2.6 Oct 14, 2024
@fengyuan14 fengyuan14 assigned mengfei25 and unassigned fengyuan14 Oct 15, 2024
@mengfei25
Contributor Author

Not included; the pinned torchvision commit is still the May 7, 2024 one: d23a6e1664d20707c11781299611436e1f0c104f
