The HOI loss is NaN for rank 0 #35

Closed
OBVIOUSDAWN opened this issue Feb 27, 2022 · 16 comments

@OBVIOUSDAWN

Dear sir,
I followed the README to build the UPT network, but when I ran the command

python main.py --world-size 1 --dataset vcoco --data-root ./v-coco --partitions trainval test --pretrained ../detr-r50-vcoco.pth --output-dir ./upt-r50-vcoco.pt

I got the following error:

```
Traceback (most recent call last):
  File "main.py", line 208, in <module>
    mp.spawn(main, nprocs=args.world_size, args=(args,))
  File "/root/miniconda3/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 230, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/root/miniconda3/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 188, in start_processes
    while not context.join():
  File "/root/miniconda3/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 150, in join
    raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException:

-- Process 0 terminated with the following error:
Traceback (most recent call last):
  File "/root/miniconda3/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 59, in _wrap
    fn(i, *args)
  File "/root/autodl-tmp/upload/main.py", line 125, in main
    engine(args.epochs)
  File "/root/pocket/pocket/pocket/core/distributed.py", line 139, in __call__
    self._on_each_iteration()
  File "/root/autodl-tmp/upload/utils.py", line 138, in _on_each_iteration
    raise ValueError(f"The HOI loss is NaN for rank {self._rank}")
ValueError: The HOI loss is NaN for rank 0
```

I tried training without the pretrained model and got the same error. I tried to print the loss, but it showed an empty tensor. As a beginner, I have no idea what happened. I would appreciate any help you could give me.
I look forward to receiving your reply. Thank you very much.

@OBVIOUSDAWN
Author

I am using a TITAN Xp with torch 1.9.1 to train the model. I installed the package and tested that it works. The dataset is V-COCO, downloaded with the provided script. Thank you very much.

@fredzzhang
Owner

Hi @OBVIOUSDAWN,

Thanks for taking an interest in our work.

The NaN loss problem was quite a pain. I ran into the issue a long time ago and managed to resolve it by using larger batch sizes. The problem was that the spatial encodings had bad scales, which made training very unstable. I see that you are using only one GPU to train, so the batch size is most likely insufficient.

Here are a few things you can try (see the sketch after this list):

  1. For the log terms in the pairwise positional encodings, use log(1+x) instead of log(x+epsilon).
  2. Add batch norm to the spatial head that computes the pairwise positional encodings.
  3. Increase the batch size (probably the easiest option).
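A minimal sketch of the first two options, assuming the feature tensor built inside compute_spatial_encodings in ops.py and a spatial head that maps the 36-dimensional pairwise encodings; the hidden sizes and representation_size below are illustrative, not the repo's exact values:

```python
import torch
from torch import nn

# Option 1: swap log(x + eps) for log(1 + x) when building the pairwise
# spatial features (inside compute_spatial_encodings in ops.py).
# features.append(torch.cat([f, torch.log(f + eps)], 1))   # original
# features.append(torch.cat([f, torch.log1p(f)], 1))       # log(1 + x)

# Option 2: add batch norm to the spatial head that processes the
# 36-dimensional pairwise positional encodings.
representation_size = 512  # illustrative; use the model's actual value
spatial_head = nn.Sequential(
    nn.Linear(36, 128),
    nn.BatchNorm1d(128),
    nn.ReLU(),
    nn.Linear(128, 256),
    nn.BatchNorm1d(256),
    nn.ReLU(),
    nn.Linear(256, representation_size),
    nn.BatchNorm1d(representation_size),
    nn.ReLU(),
)

# Quick shape check with a dummy batch of pairwise encodings.
out = spatial_head(torch.randn(16, 36))
print(out.shape)  # torch.Size([16, 512])
```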

Hope that resolves the issue.

Cheers,
Fred.

@OBVIOUSDAWN
Author

Dear sir,
I tried the model on a new server with 4x 3090 GPUs and batch size 4, which shows the same error on rank 3. In your second suggestion, do you mean the "Pairwise Box Positional Encodings" in the paper? I found a "PositionEmbeddingSine" in /detr/model/position_encoding.py, and changing its eps gives the same error; I also tried changing the eps of "binary_focal_loss_with_logits" and "compute_spatial_encodings" in ops.py. I printed out the whole network, but I don't know which part corresponds to the pairwise box positional encodings. I look forward to receiving your reply. Thank you very much.

@fredzzhang
Owner

> ...do you mean the "Pairwise Box Positional Encodings" in the paper

Yes, it is implemented in ops.py. If you are running on 4 GPUs with a batch size of 4 per GPU, you should have an effective batch size of 16, which I think is sufficiently large. Are you still getting the error?

Fred.

@OBVIOUSDAWN
Author

Yes, the effective batch size is 16 and it shows the same error. I also tried changing
features.append(torch.cat([f, torch.log(f + eps)], 1))
to use log(1+x) instead of log(x+epsilon) in "compute_spatial_encodings", and I got the same error. I look forward to receiving your reply. Thank you very much.

@fredzzhang
Owner

That's odd. If the batch size is 16, it should work now. Can you try some different seeds?

Fred.

@leijue222

leijue222 commented Oct 5, 2022

Hi @fredzzhang,
Thank you for your contribution. I am very interested in your work and want to deepen my understanding of the paper by running the code, but I can't get it to run.

I encountered the same error using the same command on a 3090:
python main.py --world-size 1 --dataset vcoco --data-root vcoco/ --partitions trainval test --pretrained checkpoints/detr-r50-vcoco.pth --output-dir checkpoints/upt-r50-vcoco2
I haven't changed any code; I just downloaded the code and the checkpoint according to the README.
Then I tried to run the training command, but it failed with this error.
Could you give me some help to solve it?

@fredzzhang
Owner

fredzzhang commented Oct 5, 2022

Hi @leijue222,

That should be an issue related to the batch size. I trained the model on 8 GPUs with a batch size of 2 per GPU—effectively a batch size of 16. So since you are training with one GPU, you need to set the batch size to 16.
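For example, on one GPU something like the following should give the intended effective batch size of 16 (the --batch-size flag and paths are the ones used elsewhere in this thread):

python main.py --world-size 1 --batch-size 16 --dataset vcoco --data-root vcoco/ --partitions trainval test --pretrained checkpoints/detr-r50-vcoco.pth --output-dir checkpoints/upt-r50-vcoco2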

Let me know if that works.

Fred.

@leijue222

Wow, thanks Fred! It worked!
It was indeed a batch size problem.

At present, GPU memory usage has gone from 12 GB to 23 GB; it remains to be seen whether a single 3090 with bs=16 will run out of memory later on.
By the way, how much time did you spend training on V-COCO?

@fredzzhang
Owner

Towards the end of the Model Zoo section, I added some stats for 8 TITAN X GPUs; in the case of V-COCO, training takes about 40 minutes. I don't know how long it will take on a single 3090, but it shouldn't be too long.

Fred.

@leijue222

Thanks again, I love this work.

@yuchen2199

I met the same error using this command on a 3090:

python main.py --world-size 1 --batch-size 16 --dataset vcoco --data-root vcoco/ --partitions trainval test --pretrained checkpoints/detr-r50-vcoco.pth --output-dir checkpoints/upt-r50-vcoco

Could you help me to solve the problem? Thanks.

@fredzzhang
Owner

Hi @yuchen2199,

Sometimes the training can be unstable even with a batch size of 16. If possible, increasing the batch size further should make it happen less often.

Fred.

@yuchen2199

Thanks for your quick reply. I solved the problem by increasing the batch size. This is really interesting work.

@anjugopinath

Hi,

I am getting the "HOI loss is NaN" issue when training on a different dataset. The code used to work fine earlier, but when I tried to train on images where there is only one human bbox and one object bbox, I started facing this issue.

I have tried:

  1. Setting the batch size to 16 and 32.
  2. Using log(1+x) in compute_spatial_encodings:

     features.append(
         torch.cat([f, torch.log(f + 1)], 1)
     )

  3. Adding batch norm to the spatial head:

     self.spatial_head = nn.Sequential(
         nn.Linear(36, 128),
         nn.BatchNorm1d(128),  # batch normalization after the first linear layer
         nn.ReLU(),
         nn.Linear(128, 256),
         nn.BatchNorm1d(256),  # batch normalization after the second linear layer
         nn.ReLU(),
         nn.Linear(256, representation_size),
         nn.BatchNorm1d(representation_size),  # batch normalization after the third linear layer
         nn.ReLU(),
     )

But I am still getting the issue.

Do you have any suggestions on how I can solve it?

@anjugopinath

anjugopinath commented Feb 21, 2024

To show an example,

I modified the output of the compute_spatial_encodings() function like this (adding 5000 so that it is easy to visualize):
[screenshot: modified compute_spatial_encodings() output]

So the input to the spatial head is:
[screenshot: spatial head input]

The output is:
[screenshot: spatial head output]

The spatial head is:
[screenshot: spatial head definition]

What is a meaningful fix for this? Scaling, or replacing problematic values in the input, etc.?
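For reference, a minimal sketch of the kind of check and input sanitization the last question refers to, assuming the (N, 36) pairwise features produced by compute_spatial_encodings before they enter the spatial head; torch.isfinite and torch.nan_to_num are standard PyTorch ops, and the replacement values are illustrative rather than a confirmed fix:

```python
import torch

def sanitize_spatial_features(f: torch.Tensor) -> torch.Tensor:
    """Report and replace non-finite entries in the pairwise spatial features.

    f is assumed to be the (N, 36) tensor produced by compute_spatial_encodings
    before it enters the spatial head; the replacement values are illustrative.
    """
    bad = ~torch.isfinite(f)
    if bad.any():
        cols = bad.any(dim=0).nonzero(as_tuple=True)[0].tolist()
        # Report which feature dimensions contain NaN/Inf to help locate the cause.
        print(f"{int(bad.sum())} non-finite entries in columns {cols}")
        # Replace NaN/Inf so a single bad pair does not turn the whole loss into NaN.
        f = torch.nan_to_num(f, nan=0.0, posinf=1e4, neginf=-1e4)
    return f

# Example with a dummy batch of 36-dimensional pairwise encodings.
x = torch.randn(8, 36)
x[3, 5] = float("inf")
x = sanitize_spatial_features(x)
assert torch.isfinite(x).all()
```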
