[torchbench] Moving models from CUDA to XLA raise segmentation fault. #6010

Closed
ysiraichi opened this issue Dec 4, 2023 · 1 comment · Fixed by #6060

ysiraichi commented Dec 4, 2023

🐛 Bug

Trying to move an nn.Module from a CUDA device to an XLA device causes a segmentation fault. This is probably related to #3466. Models that hit #6011 will also hit this issue.

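import torch.nn as nn
import torch_xla.core.xla_model as xm
device = xm.xla_device()
# Create the module directly on CUDA, then move it to XLA; the .to() call is where the crash occurs.
model = nn.Linear(5, 5, device="cuda")
model = model.to(device)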
Fatal Python error: Segmentation fault

Thread 0x00007ff283043700 (most recent call first):
  File "/usr/local/lib/python3.8/selectors.py", line 415 in select
  File "/usr/local/lib/python3.8/multiprocessing/connection.py", line 931 in wait
  File "/usr/local/lib/python3.8/concurrent/futures/process.py", line 362 in _queue_management_worker
  File "/usr/local/lib/python3.8/threading.py", line 870 in run
  File "/usr/local/lib/python3.8/threading.py", line 932 in _bootstrap_inner
  File "/usr/local/lib/python3.8/threading.py", line 890 in _bootstrap

Current thread 0x00007ff4fb086740 (most recent call first):
  File "torch/nn/modules/module.py", line 1150 in convert
  File "torch/nn/modules/module.py", line 825 in _apply
  File "torch/nn/modules/module.py", line 1152 in to
  File "example.py", line 9 in <module>
Segmentation fault (core dumped)
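
The trace above is in the format written by Python's standard `faulthandler` module. When checking many benchmarks for this crash, it may help to run each reproducer in a child process so that a segmentation fault does not take down the runner. The following is only a minimal sketch of that idea: `run_isolated` and `CHILD_SOURCE` are hypothetical names, and the child sends SIGSEGV to itself as a stand-in for the crashing `model.to(device)` call, so it runs without CUDA or torch_xla. It assumes a POSIX system.

```python
import signal
import subprocess
import sys

# Child program: enable faulthandler, then deliver SIGSEGV to itself.
# The self-signal is a stand-in (assumption) for the crashing model.to(device) call.
CHILD_SOURCE = """
import faulthandler, os, signal
faulthandler.enable()
os.kill(os.getpid(), signal.SIGSEGV)
"""


def run_isolated(source):
    """Run `source` in a fresh interpreter; return (crashed, stderr)."""
    proc = subprocess.run(
        [sys.executable, "-c", source],
        capture_output=True,
        text=True,
    )
    # On POSIX, a negative return code means the child was killed by that signal.
    crashed = proc.returncode == -signal.SIGSEGV
    return crashed, proc.stderr


crashed, stderr = run_isolated(CHILD_SOURCE)
print("segfault detected:", crashed)
print(stderr)
```

To use this against the real failure, the child source would be replaced with the reproducer from this report; the parent then sees the signal in the return code and the `faulthandler` dump on stderr.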

Affected benchmarks

  • DALLE2_pytorch
  • moco
  • simple_gpt
  • simple_gpt_tp_manual
  • tacotron2
  • timm_efficientdet
  • (train) yolov3

Environment

cc @JackCaoG @miladm

ysiraichi commented

In an internal discussion, I suggested that we could simply modify the benchmark code to fix the error for DALLE2_pytorch. However, that model is implemented in an external library. Given that, I think we should fix the underlying error instead: the segmentation fault.
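
For context, a benchmark-side change of the kind mentioned above might look like the sketch below, which stages the module through host memory before moving it to XLA. This is an illustration only, not the fix that was adopted: `move_via_cpu` and `demo` are hypothetical names, and the sketch assumes the crash is specific to the direct CUDA-to-XLA copy, which this issue does not confirm.

```python
def move_via_cpu(module, target_device):
    """Move `module` to `target_device` with an explicit stop in host memory.

    Hypothetical helper, not part of torch or torch_xla.
    """
    # nn.Module.to() moves parameters in place and returns the module itself.
    module = module.to("cpu")
    return module.to(target_device)


def demo():
    # Imports are local so the helper above can be loaded without CUDA or torch_xla.
    import torch.nn as nn
    import torch_xla.core.xla_model as xm

    # Same setup as the reproducer in this report, but routed through the CPU.
    model = nn.Linear(5, 5, device="cuda")
    return move_via_cpu(model, xm.xla_device())
```

A change like this only helps where the benchmark owns the `.to(device)` call. As noted above, DALLE2_pytorch comes from an external library, which is the argument for fixing the segmentation fault itself.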
