The HOI loss is NaN for rank 0 #35
I used a TITAN Xp with torch 1.9.1 to train this model. I installed the package and tested that it works. The dataset is V-COCO, downloaded with the provided script. Thank you very much.
Hi @OBVIOUSDAWN, Thanks for taking an interest in our work. The NaN loss problem was quite a pain. I ran into the issue a long time ago and managed to resolve it by using larger batch sizes. The problem was that the spatial encodings have bad scales, which made the training very unstable. I see that you are using only one GPU to train, so the batch size is most likely insufficient. Here are a few things you can try:
Hope that resolves the issue. Cheers, Fred.
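For readers who hit the same NaN on a single GPU, one generic workaround is gradient accumulation, which emulates a larger effective batch size without needing more memory. The sketch below is not taken from the UPT training engine; `model`, `loss_fn` and `data_loader` are placeholders for a standard PyTorch loop.

```python
import torch

def train_with_accumulation(model, optimizer, data_loader, loss_fn,
                            accumulation_steps=8, device="cuda"):
    """Emulate a larger effective batch size on a single GPU.

    With a per-iteration batch size of 2 and accumulation_steps=8,
    gradients are averaged over an effective batch of 16 before each
    optimizer step.
    """
    model.train()
    optimizer.zero_grad()
    for step, (inputs, targets) in enumerate(data_loader):
        inputs, targets = inputs.to(device), targets.to(device)
        loss = loss_fn(model(inputs), targets) / accumulation_steps
        # Skip the step entirely if the loss has already diverged.
        if torch.isnan(loss) or torch.isinf(loss):
            optimizer.zero_grad()
            continue
        loss.backward()
        if (step + 1) % accumulation_steps == 0:
            optimizer.step()
            optimizer.zero_grad()
```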
Dear sir,
Yes, it is implemented. Fred.
Yes, the effective batch size is 16, and it shows the same error. I also tried to change
That's odd. If the batch size is 16, it should be working. Can you try some different seeds? Fred.
Hi @fredzzhang. I encountered the same error using the same command on a 3090.
Hi @leijue222, That should be an issue related to the batch size. I trained the model on 8 GPUs with a batch size of 2 per GPU, which is effectively a batch size of 16. So since you are training with one GPU, you need to set the batch size to 16. Let me know if that works. Fred.
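To make the arithmetic explicit, here is a small sketch; the variable names are illustrative, and it assumes (as described above) that the effective batch size is the per-GPU batch size multiplied by the number of training processes.

```python
# Reference setup: 8 GPUs with a per-GPU batch size of 2.
per_gpu_batch_size = 2
world_size = 8
effective_batch_size = per_gpu_batch_size * world_size   # 16

# Single-GPU setup: to keep the same effective batch size of 16,
# the per-GPU batch size itself has to be raised to 16.
world_size = 1
per_gpu_batch_size = effective_batch_size // world_size  # 16
```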
Towards the end of the Model Zoo section, I added some stats for 8 TITAN X GPUs, which in the case of V-COCO would be 40 minutes. I don't know how long it will take one 3090 to train it. It shouldn't be too long. Fred.
Thanks again, I love this work.
I met the same error using the command on a 3090. Could you help me solve the problem? Thanks.
Hi @yuchen2199, Sometimes the training can be unstable even with a batch size of 16. If possible, further increasing the batch size should make it happen less often. Fred.
Thanks for your prompt reply. I solved the problem after increasing the batch size. This is really interesting work.
Hi, I am getting the "HOI loss is NaN" issue when training on a different dataset. The code used to work fine earlier, but when I tried to train on images where there is only one human bbox and one object bbox, I started facing this issue. I have tried:
But I am still getting this issue. Do you have any suggestions on how I can solve it?
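One way to narrow a data-dependent NaN down is to check every loss term before backpropagating and report which images were in the offending batch. This is a generic debugging sketch, not part of the UPT code; `loss_dict` and `batch_filenames` are hypothetical names for whatever the training loop has at hand.

```python
import math

def check_losses(loss_dict, batch_filenames):
    """Raise with context when any loss term is NaN or Inf.

    loss_dict:        mapping from loss name to a scalar tensor
    batch_filenames:  identifiers of the images in the current batch
    """
    bad = {name: value.item() for name, value in loss_dict.items()
           if not math.isfinite(value.item())}
    if bad:
        raise ValueError(
            f"Non-finite losses {bad} produced by batch {batch_filenames}"
        )
```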
Dear sir,
I followed the readme to build this UPT network, but when I used the instruction
python main.py --world-size 1 --dataset vcoco --data-root ./v-coco --partitions trainval test --pretrained ../detr-r50-vcoco.pth --output-dir ./upt-r50-vcoco.pt
I got an error:
```
Traceback (most recent call last):
  File "main.py", line 208, in <module>
    mp.spawn(main, nprocs=args.world_size, args=(args,))
  File "/root/miniconda3/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 230, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/root/miniconda3/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 188, in start_processes
    while not context.join():
  File "/root/miniconda3/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 150, in join
    raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException:

-- Process 0 terminated with the following error:
Traceback (most recent call last):
  File "/root/miniconda3/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 59, in _wrap
    fn(i, *args)
  File "/root/autodl-tmp/upload/main.py", line 125, in main
    engine(args.epochs)
  File "/root/pocket/pocket/pocket/core/distributed.py", line 139, in __call__
    self._on_each_iteration()
  File "/root/autodl-tmp/upload/utils.py", line 138, in _on_each_iteration
    raise ValueError(f"The HOI loss is NaN for rank {self._rank}")
ValueError: The HOI loss is NaN for rank 0
```
I tried to train without the pretrained model and it gives the same error. I tried to print the loss, but it shows an empty tensor. As a beginner, I have no idea what happened. I would appreciate any help you could give me.
I look forward to receiving your reply. Thank you very much.
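As a side note on the "empty tensor" observation above: in PyTorch, reducing an empty tensor with mean() yields NaN, so if a batch produces no valid human-object pairs, an aggregated loss can turn into NaN even though nothing diverged numerically. Whether that is the cause here is only a guess; the snippet below just illustrates the mechanism and a defensive fallback, and is not taken from the repository.

```python
import torch

# Reducing an empty tensor produces NaN, which then propagates
# through the rest of the loss computation.
empty = torch.tensor([])
print(empty.mean())   # tensor(nan)

# A defensive pattern: fall back to a zero loss that still carries
# a gradient when there is nothing to supervise in the batch.
logits = torch.zeros(0, requires_grad=True)
loss = logits.sum() if logits.numel() == 0 else logits.mean()
print(loss)           # tensor(0., grad_fn=...)
```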