
Floating point exception (core dumped) #725

Open
Kitsunetic opened this issue Oct 30, 2024 · 12 comments

Comments


Kitsunetic commented Oct 30, 2024

I always get a floating point exception when using SubMConv3d.

Here is my test code:

import torch as th
from spconv.pytorch import SubMConv3d, SparseConvTensor

xyz = th.randint(0, 32, (1000, 4), dtype=th.int64, device='cuda')
xyz[:, 0] = 0
feat = th.randn(1000, 32, device='cuda', dtype=th.float32)
sp = SparseConvTensor(feat, xyz, (32, 32, 32), 1, 1, 1)

conv = SubMConv3d(32, 64, 3).cuda()
conv(sp)

>>> Floating point exception (core dumped)

I'm using PyTorch 2.3.0 with CUDA 11.8, and spconv-cu118==2.3.6.
Is there something wrong in my code, or does someone have a clue?

I have tested with A5000 and RTX 2080 Ti GPUs, but the result was always the same.


shim94kr commented Nov 6, 2024

I'm experiencing the exact same issue.

I've found that it works fine with kernel_size=1, but consistently crashes with kernel_size=3 or any other size.

@Kitsunetic Have you fixed this issue?

@Kitsunetic
Author

> I'm experiencing the exact same issue.
>
> I've found that it works fine with kernel_size=1, but consistently crashes with kernel_size=3 or any other size.
>
> @Kitsunetic Have you fixed this issue?

No, I'm still figuring out the solution.

@shim94kr

shim94kr commented Nov 7, 2024

I found that downgrading PyTorch to version 2.2.2 resolves the issue.

@Kitsunetic
Author

Which CUDA version did you use?


shim94kr commented Nov 7, 2024

I use CUDA 12.1, and I installed spconv-cu120.

@Kitsunetic
Author

Unfortunately, I'm still getting the same issue after retrying on the nvidia/cuda:12.1.0-cudnn8-devel-ubuntu22.04 Docker image with PyTorch 2.2.2 and CUDA 12.1. I have tested with both Ubuntu 22.04 and 20.04. Could you give me more detail about your environment?


shim94kr commented Nov 11, 2024

I set up the environment using the following .yaml file with conda env create -f ***.yaml. This is a different .yaml file from the one referenced in Issue #317, particularly in the torch and torchvision configurations.

name: pointcept
channels:
  - pyg
  - pytorch
  - nvidia/label/cuda-12.1.1
  - nvidia
  - bioconda
  - conda-forge
  - defaults
dependencies:
  - python=3.9
  - pip
  - cuda
  - conda-forge::cudnn
  - gcc=12.1
  - gxx=12.1
  - pytorch=2.2.2
  - torchvision=0.17.2
  - pytorch-cuda=12.1
  - ninja
  - google-sparsehash
  - h5py
  - pyyaml
  - tensorboard
  - tensorboardx
  - yapf
  - addict
  - einops
  - scipy
  - plyfile
  - termcolor
  - timm
  - ftfy
  - regex
  - tqdm
  - matplotlib
  - black
  - open3d
  - pytorch-cluster
  - pytorch-scatter
  - pytorch-sparse
  - pip:
    - torch_geometric
#    - spconv-cu120
    - git+https://github.com/octree-nn/ocnn-pytorch.git
    - git+https://github.com/openai/CLIP.git
    - git+https://github.com/Dao-AILab/flash-attention.git
    - ./libs/pointops
    - ./libs/pointgroup_ops

After this setup, I installed the following additional components:

cd libs/pointops
python setup.py install 
cd ../..

pip install spconv-cu120
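When comparing setups like this across machines, it can help to dump the exact versions on each side. A minimal sketch (env_report is a hypothetical helper, not part of spconv; torch/numpy/spconv are only reported when installed):

```python
import platform
import sys

def env_report() -> dict:
    """Collect version info worth comparing across environments.

    A sketch: optional packages are reported only if importable.
    """
    info = {
        "python": sys.version.split()[0],
        "platform": platform.platform(),
    }
    # Optional extras; skip anything that isn't installed.
    for mod in ("numpy", "torch", "spconv"):
        try:
            info[mod] = __import__(mod).__version__
        except Exception:
            info[mod] = "not installed"
    return info

print(env_report())
```

Running this on both machines and diffing the output narrows down which component actually differs.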

@Kitsunetic
Author

Thank you for sharing.
However, I'm still getting the same error even with an environment based on the provided yaml file.
I suspect this is not only a dependency problem; the broader environment, such as the OS, may also be involved. So I'm still figuring out the reason.
Anyway, thank you again for sharing! If you find another clue, please share it with me!

@JunseoMin

JunseoMin commented Nov 21, 2024

Hi,

I have the same issue...
Please share it with me if you solve this issue.

In my case, the exception occurs when kernel_size = 3.

Thanks!

@Ecalpal

Ecalpal commented Nov 24, 2024

Same as you.
Python 3.11, PyTorch 2.5.0, CUDA 12.1.

> I found that downgrading PyTorch to version 2.2.2 resolves the issue.

and this works
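For anyone else landing here, the downgrade reported in this thread can be sketched roughly as follows. The cu121 index URL and the torchvision pin are assumptions for a CUDA 12.1 setup; adjust them for your CUDA version:

```shell
# Sketch: pin the versions reported to avoid the crash (assumes CUDA 12.1)
pip install "torch==2.2.2" "torchvision==0.17.2" \
    --index-url https://download.pytorch.org/whl/cu121
pip install spconv-cu120
```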

@GCChen97

It seems to be related to numpy versions >= 2.0.0. I installed numpy 1.26.4, and spconv.pytorch.ops.implicit_gemm can now be called without raising Floating point exception (core dumped). Something is wrong with the masks argument of implicit_gemm, which is also a numpy array.
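Building on that finding, a small guard before importing spconv can catch the incompatible numpy early. A sketch: numpy_version_ok is a hypothetical helper name, and the numpy < 2 threshold is based on the report above, not on spconv's documentation:

```python
def numpy_version_ok(version: str) -> bool:
    """True for numpy < 2.0, which reportedly avoids the
    'Floating point exception' in spconv's implicit_gemm path."""
    major = int(version.split(".", 1)[0])
    return major < 2

# Usage sketch: check the installed numpy before touching spconv, e.g.
#   import numpy as np
#   assert numpy_version_ok(np.__version__), \
#       "pin numpy < 2, e.g. pip install 'numpy==1.26.4'"
```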

@JunseoMin

> It seems to be related to numpy versions >= 2.0.0. I installed numpy 1.26.4, and spconv.pytorch.ops.implicit_gemm can now be called without raising Floating point exception (core dumped). Something is wrong with the masks argument of implicit_gemm, which is also a numpy array.

This worked for me! Thanks!
