
Vision_maskrcnn RuntimeError got diff tensor dtype #496

Open

mengfei25 opened this issue Jun 27, 2024 · 5 comments

@mengfei25
Contributor

🐛 Describe the bug

torchbench_amp_fp16_training
xpu train vision_maskrcnn
Traceback (most recent call last):
File "/home/sdp/actions-runner/_work/torch-xpu-ops/pytorch/benchmarks/dynamo/common.py", line 2294, in validate_model
self.model_iter_fn(model, example_inputs)
File "/home/sdp/actions-runner/_work/torch-xpu-ops/pytorch/benchmarks/dynamo/torchbench.py", line 456, in forward_and_backward_pass
pred = mod(*cloned_inputs)
File "/home/sdp/miniforge3/envs/e2e_ci/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1566, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/sdp/miniforge3/envs/e2e_ci/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1575, in _call_impl
return forward_call(*args, **kwargs)
File "/home/sdp/miniforge3/envs/e2e_ci/lib/python3.10/site-packages/torchvision/models/detection/generalized_rcnn.py", line 105, in forward
detections, detector_losses = self.roi_heads(features, proposals, images.image_sizes, targets)
File "/home/sdp/miniforge3/envs/e2e_ci/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1566, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/sdp/miniforge3/envs/e2e_ci/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1575, in _call_impl
return forward_call(*args, **kwargs)
File "/home/sdp/miniforge3/envs/e2e_ci/lib/python3.10/site-packages/torchvision/models/detection/roi_heads.py", line 761, in forward
box_features = self.box_roi_pool(features, proposals, image_shapes)
File "/home/sdp/miniforge3/envs/e2e_ci/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1566, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/sdp/miniforge3/envs/e2e_ci/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1575, in _call_impl
return forward_call(*args, **kwargs)
File "/home/sdp/miniforge3/envs/e2e_ci/lib/python3.10/site-packages/torchvision/ops/poolers.py", line 314, in forward
return _multiscale_roi_align(
File "/home/sdp/miniforge3/envs/e2e_ci/lib/python3.10/site-packages/torchvision/ops/poolers.py", line 204, in _multiscale_roi_align
result_idx_in_level = roi_align(
File "/home/sdp/miniforge3/envs/e2e_ci/lib/python3.10/site-packages/torchvision/ops/roi_align.py", line 238, in roi_align
return torch.ops.torchvision.roi_align(
File "/home/sdp/miniforge3/envs/e2e_ci/lib/python3.10/site-packages/torch/ops.py", line 1064, in call
return self
._op(*args, **(kwargs or {}))
RuntimeError: Expected tensor for argument #1 'input' to have the same type as tensor for argument #2 'rois'; but type torch.HalfTensor does not equal torch.FloatTensor (while checking arguments for roi_align_forward_kernel)

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
File "/home/sdp/actions-runner/_work/torch-xpu-ops/pytorch/benchmarks/dynamo/common.py", line 4177, in run
) = runner.load_model(
File "/home/sdp/actions-runner/_work/torch-xpu-ops/pytorch/benchmarks/dynamo/torchbench.py", line 380, in load_model
self.validate_model(model, example_inputs)
File "/home/sdp/actions-runner/_work/torch-xpu-ops/pytorch/benchmarks/dynamo/common.py", line 2296, in validate_model
raise RuntimeError("Eager run failed") from e
RuntimeError: Eager run failed

eager_fail_to_run
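
A minimal sketch of what appears to go wrong (assuming an XPU device and a torchvision build without an autocast wrapper for roi_align): under torch.autocast, the feature map produced by a convolution comes out as fp16, while rois created outside the autocast region stay fp32, which trips the dtype check in roi_align_forward_kernel.

```python
import torch
from torchvision.ops import roi_align

# Hypothetical minimal repro, assuming an XPU device is available and the
# installed torchvision has no autocast wrapper registered for roi_align.
device = "xpu"
conv = torch.nn.Conv2d(3, 8, kernel_size=3, padding=1).to(device)
image = torch.randn(1, 3, 64, 64, device=device)
# rois format: (batch_index, x1, y1, x2, y2); created outside autocast -> fp32
rois = torch.tensor([[0.0, 4.0, 4.0, 32.0, 32.0]], device=device)

with torch.autocast(device_type="xpu", dtype=torch.float16):
    fmap = conv(image)  # autocast produces an fp16 feature map
    # Without an autocast wrapper for torchvision::roi_align, fmap (Half) and
    # rois (Float) reach the kernel with mismatched dtypes and raise the
    # "Expected tensor for argument #1 'input' ..." error above.
    out = roi_align(fmap, rois, output_size=(7, 7), spatial_scale=1.0)
```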

Versions

torch-xpu-ops: 31c4001
pytorch: 0f81473d7b4a1bf09246410712df22541be7caf3 + PRs: 127277,129120
device: PVC 1100, 803.61, 0.5.1

@retonym
Contributor

retonym commented Jul 15, 2024

The issue occurs exclusively in AMP mode and does not happen in pure BF16/FP16 modes. I suspect the crash is caused by the absence of an AutocastXPU backend registration for the torchvision roi_align operator.
@fengyuan14 could you provide your insights on this?
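
If that is the cause, the shape of the fix would be an autocast wrapper registered for the XPU autocast dispatch key, analogous to the existing CUDA one. A rough Python-side sketch of the idea follows; the real fix would live in torchvision's C++ autocast kernels, and the "AutocastXPU" dispatch-key name and the roi_align schema used below are assumptions, not confirmed details of the eventual patch.

```python
import torch
import torchvision  # makes the torchvision::roi_align op visible to the dispatcher

# Illustration only, not the actual torchvision change: register a wrapper on
# the (assumed) AutocastXPU dispatch key that casts both tensors to float32
# before the kernel runs, then casts the result back to the input dtype.
_lib = torch.library.Library("torchvision", "IMPL")

def _roi_align_autocast_xpu(input, rois, spatial_scale, pooled_height,
                            pooled_width, sampling_ratio, aligned):
    # Disable XPU autocast while re-dispatching so this wrapper is not re-entered.
    with torch.autocast(device_type="xpu", enabled=False):
        out = torch.ops.torchvision.roi_align(
            input.float(), rois.float(), spatial_scale,
            pooled_height, pooled_width, sampling_ratio, aligned)
    return out.to(input.dtype)

_lib.impl("roi_align", _roi_align_autocast_xpu, "AutocastXPU")
```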

@chuanqi129
Contributor

chuanqi129 commented Aug 8, 2024

@fengyuan14 has landed a PR that fixes this issue (pytorch/vision#8541); we are waiting for PyTorch to update its pinned torchvision commit.

@chuanqi129
Contributor

@mengfei25 please help check whether the torchvision commit pinned by the latest PyTorch includes Feng's fix.
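
One way to check from the environment itself (a small sketch; it assumes the installed torchvision build populates version.git_version, which can then be compared against the commit that merged pytorch/vision#8541):

```python
import torchvision

# Print the commit the installed torchvision was built from, to compare
# against the fix commit from pytorch/vision#8541.
print("torchvision version:", torchvision.__version__)
print("built from commit:", torchvision.version.git_version)
```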

@chuanqi129 chuanqi129 modified the milestones: PT2.5, PT2.6 Oct 14, 2024
@fengyuan14 fengyuan14 assigned mengfei25 and unassigned fengyuan14 Oct 15, 2024
@mengfei25
Contributor Author

Not included; the pinned torchvision commit is still the May 7, 2024 one: d23a6e1664d20707c11781299611436e1f0c104f
