Error: !!!! NAN value tensor #1

Open
YYDSDD opened this issue Dec 4, 2024 · 2 comments

Comments

@YYDSDD
YYDSDD commented Dec 4, 2024

Great paper, I hope it gets accepted at a good journal/conference!
Would you mind answering a few questions? Thank you very much!

  1. Running the program keeps reporting the following error:
    !!!! NAN value tensor(8192, device='cuda:0') torch.Size([1, 128, 8, 8])
    !!!! NAN value tensor(32768, device='cuda:0') torch.Size([1, 128, 16, 16])
    !!!! NAN value tensor(81920, device='cuda:0') torch.Size([1, 80, 32, 32])
    !!!! NAN value tensor(131072, device='cuda:0') torch.Size([1, 32, 64, 64])
    As a result, every printed loss is nan:
    2024-12-04 17:38:23,887 - mmseg - INFO - Iter [50/40000] lr: 1.958e-06, eta: 4 days, 16:29:22, time: 10.137, data_time: 0.011, memory: 16063, consistency_loss: nan, decode_w_seg.loss_seg: nan, decode_w_seg.acc_seg: 41.7826, decode_w_seg.logits: nan, decode_wo_seg.loss_seg: nan, decode_wo_seg.acc_seg: 41.7826, decode_wo_seg.logits: nan
    2024-12-04 17:46:29,023 - mmseg - INFO - Iter [100/40000] lr: 3.950e-06, eta: 4 days, 13:56:37, time: 9.703, data_time: 0.007, memory: 16063, consistency_loss: nan, decode_w_seg.loss_seg: nan, decode_w_seg.acc_seg: 38.8875, decode_w_seg.logits: nan, decode_wo_seg.loss_seg: nan, decode_wo_seg.acc_seg: 38.8875, decode_wo_seg.logits: nan
    2024-12-04 17:54:35,560 - mmseg - INFO - Iter [150/40000] lr: 5.938e-06, eta: 4 days, 13:06:30, time: 9.731, data_time: 0.008, memory: 16063, consistency_loss: nan, decode_w_seg.loss_seg: nan, decode_w_seg.acc_seg: 42.2343, decode_w_seg.logits: nan, decode_wo_seg.loss_seg: nan, decode_wo_seg.acc_seg: 42.2343, decode_wo_seg.logits: nan
    2024-12-04 18:02:46,074 - mmseg - INFO - Iter [200/40000] lr: 7.920e-06, eta: 4 days, 12:50:34, time: 9.810, data_time: 0.008, memory: 16063, consistency_loss: nan, decode_w_seg.loss_seg: nan, decode_w_seg.acc_seg: 40.0415, decode_w_seg.logits: nan, decode_wo_seg.loss_seg: nan, decode_wo_seg.acc_seg: 40.0415, decode_wo_seg.logits: nan
    Do you know what might be causing this? (A short note on what the printed counts mean follows this comment.)
  2. I would also like to ask how long the full 40000 iterations took you. For me the run looks like it will take about four days, which seems far too long. Could something be wrong with my setup?
    I would be very grateful if you could clear up my confusion!
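
For reference, each printed count exactly matches the element count of the corresponding tensor shape (8192 = 1x128x8x8, 32768 = 1x128x16x16, 81920 = 1x80x32x32, 131072 = 1x32x64x64), so the feature maps appear to be entirely NaN rather than partially corrupted. Below is a minimal sketch of the kind of check such a message could come from; the helper name and exact print format are illustrative assumptions, not necessarily the repository's code:

```python
import torch

def report_nan(feat: torch.Tensor) -> None:
    """Hypothetical NaN check: count NaN elements in a feature map and warn."""
    nan_count = torch.isnan(feat).sum()
    if nan_count > 0:
        # e.g. "!!!! NAN value tensor(8192, device='cuda:0') torch.Size([1, 128, 8, 8])"
        print("!!!! NAN value", nan_count, feat.shape)

# A fully-NaN feature map reproduces the first log line's count:
feat = torch.full((1, 128, 8, 8), float("nan"))
report_nan(feat)  # nan_count == 8192 == feat.numel(), i.e. every element is NaN
```
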
@Yux1angJi
Owner

Thanks for your interest.

  1. We have run into this problem before. It is most likely caused by a version mismatch between diffusers, transformers, torch, and CUDA; it can even come from certain GPUs not handling float16 well (the code uses float16 for the SD inference part). You could also search for "diffusers stable diffusion pipeline nan", there seem to be quite a few related reports. Our experiments ran without issue on A6000 and 3090 cards. (A quick float16-vs-float32 sanity check is sketched after this reply.) For reference, our environment:
    cuda 11.8
    diffusers==0.15.0
    torch==2.0.1+cu118
    transformers==4.26.1

  2. Yes, running the full multi-step diffusion takes a long time (especially when training with the prompt condition). You can try reducing the number of steps; in our experiments five steps already gives results quite close to the full schedule. Concretely, change the corresponding values in mmseg/models/backbones/diff/configs/diff_config.yaml to
    scheduler_timesteps: [80, 60, 40, 20, 1]
    save_timestep: [4, 3, 2, 1, 0]
    num_timesteps: 5

You can also try training without the prompt condition, i.e. set do_mask_steps: False, which also speeds things up considerably.
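
A quick way to check whether the float16 inference path is the problem on a given GPU is to run the plain diffusers pipeline once in float16 and once in float32 and compare the outputs for NaNs. This is only a standalone sanity check under assumed settings (the checkpoint ID, prompt, and step count below are placeholders), not how this repository builds its pipeline:

```python
import numpy as np
import torch
from diffusers import StableDiffusionPipeline

# Placeholder checkpoint; substitute whatever SD weights the project actually loads.
model_id = "runwayml/stable-diffusion-v1-5"

for dtype in (torch.float16, torch.float32):
    pipe = StableDiffusionPipeline.from_pretrained(
        model_id, torch_dtype=dtype, safety_checker=None
    ).to("cuda")
    out = pipe("a photo of a street", num_inference_steps=5, output_type="np")
    image = out.images[0]
    # If float16 yields NaNs (or an all-black image) while float32 does not,
    # the GPU's float16 path is the likely source of the training NaNs.
    print(dtype, "NaN in output:", bool(np.isnan(image).any()))
```

If float32 turns out to be the fix, switching the SD inference part to float32 should make the nan entries in consistency_loss and the decode losses disappear as well, at the cost of higher memory use.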

@YYDSDD
Author

YYDSDD commented Dec 5, 2024

Thank you very much for your reply! I will try what you suggested; thanks again!
