model output is NAN for toy dataset #9

Closed
jonbakerfish opened this issue Dec 1, 2022 · 8 comments
Labels
bug Something isn't working

Comments

@jonbakerfish

Hi, I ran the infer.py script on the toy dataset. For some frames, the outputs of self.model (proj_output, last_feature) are arrays of NaN. Why?
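
A minimal sketch for narrowing down which frames are affected, assuming each frame's output has already been flattened to a plain list of floats (in PyTorch, for example, via `tensor.flatten().tolist()`). The helper name `find_nan_frames` and the sample values are hypothetical and not part of infer.py.

```python
import math


def find_nan_frames(frame_outputs):
    """Return the indices of frames whose output contains at least one NaN.

    frame_outputs: iterable of flat float sequences, one per frame
    (hypothetical layout; adapt to however the outputs are stored).
    """
    bad = []
    for idx, values in enumerate(frame_outputs):
        if any(math.isnan(v) for v in values):
            bad.append(idx)
    return bad


# Made-up outputs for three frames; only the frame at index 1 contains a NaN.
outputs = [[0.1, 0.2], [float("nan"), 0.3], [0.4, 0.5]]
print(find_nan_frames(outputs))
```

Knowing the exact frame indices makes it easier to share a small reproducible subset of the data with the maintainers.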

@MaxChanger
Member

Hi, thank you for your interest in our work.
A similar issue is #6. However, I'm not entirely sure where the problem is; it may be related to SoftPool. Could you provide more environment information, such as the GPU, CUDA version, and conda env YAML file?
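
One way to gather the Python-side part of that environment information is sketched below, using only the standard library. The helper name `collect_versions` and the package list are assumptions for illustration; GPU model and CUDA toolkit details are not covered by this sketch and have to be reported separately.

```python
import platform
from importlib import metadata


def collect_versions(packages):
    """Map each package name to its installed version, or None if not installed."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


# Assumed list of packages that are likely relevant to this kind of report.
packages = ["torch", "torchvision", "torchaudio", "torchsparse"]

print("python", platform.python_version())
for name, version in collect_versions(packages).items():
    print(name, version if version is not None else "not installed")
```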

@jonbakerfish
Author

jonbakerfish commented Dec 2, 2022

It seems that the NaN is caused by SoftPool: if I disable it, things run fine.

| Tool | Version |
| --- | --- |
| GPU | RTX 3070 Laptop |
| CUDA | 11.6 |
| torch | 1.12.1+cu116 |
| torchaudio | 0.12.1+cu116 |
| torchsparse | 1.4.0 |
| torchvision | 0.13.1+cu116 |

@MaxChanger
Member

Hi @jonbakerfish, if you are sure that it is a softpool problem, you can refer to alexandrosstergiou/SoftPool#12 and alexandrosstergiou/adaPool#2.

I tried to reproduce this NaN issue on 3090, 2080Ti, V100, and 2070 Super GPU servers, but could not. 😅
If you have time, I hope you can debug it in depth to help others.

@MaxChanger added the bug label on Dec 9, 2022
@MaxChanger
Member

Hi @jonbakerfish, I finally found the cause of this problem!
I had overlooked the SoftPool version. Its author updated the repo on 2022/04/07, and our project was implemented before that, using an earlier version (commit 2d2ec6d).

When I use the newer SoftPool version (commit d056ab8) for inference, NaN also appears, just like what you encountered during inference or training.

Finding the specific reason may require carefully checking the SoftPool code and discussing it with its author, but rolling back the SoftPool version is a quick solution to this problem.

git clone https://github.com/alexandrosstergiou/SoftPool.git
cd SoftPool
git checkout 2d2ec6d # rollback to 2d2ec6dca10b7683ffd41061a27910d67816bfa5
cd pytorch
make install
# (optional) run the tests
make test

I hope you can help verify this. If you have any questions, please contact me again.
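
To double-check that the checkout really sits on the rolled-back commit before running `make install`, a small sketch like the following may help. The helper names `commit_matches` and `current_commit` are hypothetical; `current_commit` assumes `git` is on the PATH and that the given directory is the cloned SoftPool repository. Note that this only inspects the source checkout, not the build that is already installed in the Python environment.

```python
import subprocess

EXPECTED_PREFIX = "2d2ec6d"  # short commit id given in the rollback steps


def commit_matches(full_hash, prefix=EXPECTED_PREFIX):
    """Return True if the commit hash starts with the expected short id."""
    return full_hash.strip().lower().startswith(prefix.lower())


def current_commit(repo_dir):
    """Return the hash of the commit checked out in the Git repo at repo_dir."""
    result = subprocess.run(
        ["git", "rev-parse", "HEAD"],
        cwd=repo_dir,
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()


# Check against the full hash quoted in the rollback steps (no Git call needed).
print(commit_matches("2d2ec6dca10b7683ffd41061a27910d67816bfa5"))
```

With a real checkout, `commit_matches(current_commit("SoftPool"))` would perform the same check against the working copy, where `"SoftPool"` is the assumed clone directory.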

@L-Reichardt

Just for reference:
The model worked fine on the toy dataset but produced NaN on data from my own sensor. The cause was SoftPool, even with the older version 2d2ec6d*. Disabling it fixed the issue.

@MaxChanger
Member

> Just for reference: Model worked fine on the toy dataset, but got NAN on data from my own sensor. Reason was SoftPool, even with older Version 2d2ec6d*. Disabling fixed the issue.

Hi @L-Reichardt, thank you for your feedback. The temporary workaround of rolling back to 2d2ec6d has been verified by several developers, and there should be no problem with the dataset used in this project.

Disabling SoftPool may cause a slight decrease in performance, as demonstrated by the ablation experiments in the paper.

Could you confirm two things? First, that after rolling back, the installed SoftPool was actually replaced by the rolled-back version. Second, that your own data is clean and does not contain NaN.
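
For the second check, a minimal sketch of screening input points for non-finite values is shown below. It assumes the points are available as plain tuples of floats in an (x, y, z, intensity) layout; that layout, the helper name `split_finite_points`, and the sample values are all hypothetical and should be adapted to the actual sensor data.

```python
import math


def split_finite_points(points):
    """Split points into (clean, rejected) by whether every value is finite."""
    clean, rejected = [], []
    for p in points:
        if all(math.isfinite(v) for v in p):
            clean.append(p)
        else:
            rejected.append(p)
    return clean, rejected


# Made-up points: one valid, one containing NaN, one containing inf.
points = [
    (1.0, 2.0, 3.0, 0.5),
    (float("nan"), 0.0, 0.0, 0.1),
    (4.0, float("inf"), 1.0, 0.2),
]
clean, rejected = split_finite_points(points)
print(len(clean), "clean,", len(rejected), "rejected")
```

If `rejected` is non-empty for a frame, the NaN may originate in the input data rather than in the pooling layer.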

@L-Reichardt

@MaxChanger My bad, you are correct. I recently updated to a newer SoftPool version in order to use PyTorch 1.13.1 and forgot about it.
I used your "const inf" suggestion and it works fine now.

@MaxChanger
Member

@L-Reichardt Great to hear that 🙃
