Undefined behavior of torch extensions with Pytorch >1.7 and CUDA 11 [with workaround information] #3324

benjaminum · 2021-04-22T13:24:37Z

Describe the bug
Torch extensions, such as the ops in the open3d.ml.torch.ops namespace, have undefined behavior when built against torch 1.7 or later with CUDA 11. This may result in segmentation faults or wrong results.

To Reproduce

The attached zip contains a minimal CMake project for reproducing the problem.
Cuda11Debug.zip

Expected behavior
The code is an example taken from the docs of the cub library. Because of the bug, cub functions that cache return values may return unexpected values, which causes the temporary memory allocation in the example to fail.
If the problem is present, the test script prints "temp_storage_bytes should not be 0!".
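
A run can also be checked automatically. The following is a minimal sketch, assuming the script prints `name = value` lines and the failure message quoted above. The helper name `shows_cub_cache_bug` is hypothetical and not part of the attached project.

```python
FAILURE_MESSAGE = "temp_storage_bytes should not be 0!"


def shows_cub_cache_bug(output: str) -> bool:
    """Return True if captured test-script output indicates the problem."""
    if FAILURE_MESSAGE in output:
        return True
    for line in output.splitlines():
        name, sep, value = line.partition(" = ")
        if sep and name.strip() == "temp_storage_bytes":
            # A size of zero means the cub size query returned a bad value.
            return int(value.strip()) == 0
    # Assumption: no size line and no failure message means nothing to report.
    return False
```

The captured text could come from running the test script with `subprocess.run(..., capture_output=True, text=True)` and passing its `stdout` to the helper.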

Environment (please complete the following information):

Pytorch 1.7.1
CUDA 11.0

Additional context
The problem is related to pytorch/pytorch#52663

Workaround
The problem can be avoided by compiling torch from source with the flags -Xcompiler -fno-gnu-unique, as mentioned in pytorch/pytorch#52663.

Wheels built with this compile flag are available at https://github.com/intel-isl/open3d_downloads/releases/tag/torch1.7.1
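
To decide whether an installation needs the workaround, the version combination can be inspected first. This is a rough sketch that follows the description in this issue (torch 1.7 or later with CUDA 11) and should not be read as a definitive list of affected builds. The function names are hypothetical; the inputs are assumed to be the strings reported by `torch.__version__` and `torch.version.cuda`.

```python
import re
from typing import Optional, Tuple


def parse_torch_version(version: str) -> Tuple[Tuple[int, ...], Optional[str]]:
    """Split a string like '1.12.0+cu116' into ((1, 12, 0), 'cu116')."""
    base, _, local = version.partition("+")
    numbers = tuple(int(part) for part in re.findall(r"\d+", base)[:3])
    return numbers, (local or None)


def may_be_affected(torch_version: str, cuda_version: Optional[str]) -> bool:
    """Flag the combination described in this issue.

    Assumption: "affected" means torch >= 1.7 built against CUDA 11.x.
    cuda_version is None for CPU-only builds.
    """
    if not cuda_version:
        return False
    torch_numbers, _ = parse_torch_version(torch_version)
    cuda_major = int(cuda_version.split(".")[0])
    return torch_numbers >= (1, 7) and cuda_major == 11


# Typical use (requires torch to be installed):
#   import torch
#   may_be_affected(torch.__version__, torch.version.cuda)
```

A positive result only means the combination matches the description above; running the test script from the attached project remains the actual check.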

ssheorey · 2022-07-28T03:57:15Z

With PyTorch 1.12 and CUDA 11.6, the output of test_script.py is:

$ python -c "import torch; print(torch.__version__)"
1.12.0+cu116
$ python test_script.py 
tensor([[1., 1., 1.]], device='cuda:0')
stream = stream 0 on device cuda:0
cuda_device_props = 0x55b3e4bba6e0
texture_alignment = 512
d_keys_in = 0x7f6b87c00000
d_keys_out = 0x7f6b87c78a00
d_values_in = 0x7f6b87cf1400
d_values_out = 0x7f6b87d69e00
d_temp_storage = 0
temp_storage_bytes = 1008639
(None, None, None)

ssheorey · 2022-07-28T04:10:07Z

PyTorch 1.9 with CUDA 11.1 works as well:

$ python -c "import torch; print(torch.__version__)"
1.9.0+cu111
$ python test_script.py 
tensor([[1., 1., 1.]], device='cuda:0')
stream = stream 0 on device cuda:0
cuda_device_props = 0x5565813033e0
texture_alignment = 512
d_keys_in = 0x7f160e200000
d_keys_out = 0x7f160e278a00
d_values_in = 0x7f160e2f1400
d_values_out = 0x7f160e369e00
d_temp_storage = 0
temp_storage_bytes = 1003519
(None, None, None)

benjaminum mentioned this issue Apr 22, 2021

print warning if compiling pytorch ops with torch 1.7 and cuda 11 #3325

Closed

yucedagonurcan mentioned this issue May 25, 2021

data.to(self.device) AttributeError: 'dict' object has no attribute 'to' isl-org/Open3D-ML#283

Closed

evelkey mentioned this issue Jun 1, 2021

parallel_for's throw_on_error results in terminate NVIDIA/thrust#1448

Closed

sanskar107 mentioned this issue Sep 28, 2021

RuntimeError: radix_sort: failed on 1st step: cudaErrorInvalidDevice: invalid device ordinal isl-org/Open3D-ML#299

Closed

theNded added build/install Build or installation issue ml labels Nov 21, 2021

conby mentioned this issue Apr 27, 2022

Summarize the bug "Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu!" isl-org/Open3D-ML#510

Open

3 tasks

ssheorey mentioned this issue Jul 26, 2022

Python 3.10 #5320

Merged

yxlao closed this as completed in #5320 Jul 29, 2022

eddyhkchiu mentioned this issue Sep 17, 2022

Errors when running ./scripts/quick_run.sh UT-Austin-RPL/Coopernaut#4

Closed
