Great paper, I hope it gets accepted at a good journal/conference! Would you mind answering a few questions? Thanks a lot!
Thanks for your interest.
We ran into this problem before as well. It is most likely caused by mismatched versions of diffusers, transformers, torch, and CUDA, and possibly by some GPUs not handling float16 well (the code runs the SD inference part in float16). You can also search for "diffusers stable diffusion pipeline nan"; there seem to be quite a few related reports. Our experiments ran fine on A6000 and 3090 GPUs. For reference, our environment: CUDA 11.8, diffusers==0.15.0, torch==2.0.1+cu118, transformers==4.26.1
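To reproduce that environment, one option is to pin exactly those versions. A sketch of the install commands, assuming a CUDA 11.8 machine and pip (the `--index-url` is PyTorch's official CUDA 11.8 wheel index):

```shell
# Pin the versions the maintainers report working (CUDA 11.8 wheels)
pip install torch==2.0.1+cu118 --index-url https://download.pytorch.org/whl/cu118
pip install diffusers==0.15.0 transformers==4.26.1

# Sanity-check what actually got installed
python -c "import torch; print(torch.__version__, torch.version.cuda)"
```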
Yes, running the full multi-step diffusion takes a long time (especially after training with the prompt condition). You can try reducing the number of steps; in our experiments, five steps already gives fairly close results. Concretely, in mmseg/models/backbones/diff/configs/diff_config.yaml change the corresponding values to: scheduler_timesteps: [80, 60, 40, 20, 1], save_timestep: [4, 3, 2, 1, 0], num_timesteps: 5
You can also try training without the prompt condition, i.e. set do_mask_steps: False, which is also much faster.
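Collected in one place, the suggested edit to mmseg/models/backbones/diff/configs/diff_config.yaml would look roughly like the excerpt below (this assumes the keys sit at the level quoted above; all other keys stay unchanged, and do_mask_steps is the optional extra speedup):

```yaml
# diff_config.yaml (excerpt) - reduced 5-step schedule
scheduler_timesteps: [80, 60, 40, 20, 1]
save_timestep: [4, 3, 2, 1, 0]
num_timesteps: 5

# Optional: skip prompt-conditioned training for a further speedup
do_mask_steps: False
```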
Thank you very much for your answer! I'll try what you suggested. Thanks again!
Great paper, I hope it gets accepted at a good journal/conference!
Would you mind answering a few questions? Thanks a lot!
!!!! NAN value tensor(8192, device='cuda:0') torch.Size([1, 128, 8, 8])
!!!! NAN value tensor(32768, device='cuda:0') torch.Size([1, 128, 16, 16])
!!!! NAN value tensor(81920, device='cuda:0') torch.Size([1, 80, 32, 32])
!!!! NAN value tensor(131072, device='cuda:0') torch.Size([1, 32, 64, 64])
This makes all the printed losses nan:
2024-12-04 17:38:23,887 - mmseg - INFO - Iter [50/40000] lr: 1.958e-06, eta: 4 days, 16:29:22, time: 10.137, data_time: 0.011, memory: 16063, consistency_loss: nan, decode_w_seg.loss_seg: nan, decode_w_seg.acc_seg: 41.7826, decode_w_seg.logits: nan, decode_wo_seg.loss_seg: nan, decode_wo_seg.acc_seg: 41.7826, decode_wo_seg.logits: nan
2024-12-04 17:46:29,023 - mmseg - INFO - Iter [100/40000] lr: 3.950e-06, eta: 4 days, 13:56:37, time: 9.703, data_time: 0.007, memory: 16063, consistency_loss: nan, decode_w_seg.loss_seg: nan, decode_w_seg.acc_seg: 38.8875, decode_w_seg.logits: nan, decode_wo_seg.loss_seg: nan, decode_wo_seg.acc_seg: 38.8875, decode_wo_seg.logits: nan
2024-12-04 17:54:35,560 - mmseg - INFO - Iter [150/40000] lr: 5.938e-06, eta: 4 days, 13:06:30, time: 9.731, data_time: 0.008, memory: 16063, consistency_loss: nan, decode_w_seg.loss_seg: nan, decode_w_seg.acc_seg: 42.2343, decode_w_seg.logits: nan, decode_wo_seg.loss_seg: nan, decode_wo_seg.acc_seg: 42.2343, decode_wo_seg.logits: nan
2024-12-04 18:02:46,074 - mmseg - INFO - Iter [200/40000] lr: 7.920e-06, eta: 4 days, 12:50:34, time: 9.810, data_time: 0.008, memory: 16063, consistency_loss: nan, decode_w_seg.loss_seg: nan, decode_w_seg.acc_seg: 40.0415, decode_w_seg.logits: nan, decode_wo_seg.loss_seg: nan, decode_wo_seg.acc_seg: 40.0415, decode_wo_seg.logits: nan
Do you know what might be causing this?
I would really appreciate any help!
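As background on the float16 point raised above: half precision tops out at 65504, so large intermediate activations overflow, and once an inf appears, arithmetic such as normalization can turn it into NaN, which then propagates to every loss term. A minimal standard-library sketch of the mechanism (Python's `'e'` struct format is IEEE-754 half precision; this is an illustration, not the repo's code):

```python
import math
import struct

def to_fp16(x: float) -> float:
    """Round-trip a Python float through IEEE-754 half precision
    (the 'e' struct format), the dtype used for SD inference here."""
    return struct.unpack('e', struct.pack('e', x))[0]

FP16_MAX = 65504.0  # largest finite half-precision value

# Values in range survive the round trip exactly.
assert to_fp16(FP16_MAX) == FP16_MAX

# Values beyond the range cannot be represented; struct raises,
# while DL frameworks typically saturate to inf instead.
try:
    to_fp16(70000.0)
except OverflowError:
    pass

# Once an activation is inf, common operations produce NaN,
# which then propagates through the network and the loss.
inf = float('inf')
assert math.isnan(inf - inf)
```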