loss=nan on 1660 SUPER 6GB #293

Closed
martianbit opened this issue Mar 14, 2023 · 10 comments

Comments

@martianbit

martianbit commented Mar 14, 2023

Hey,
I have a NVIDIA GeForce 1660 SUPER 6GB card, and I wanted to train LoRA models with it.
This is my configuration:

accelerate launch --num_cpu_threads_per_process 4 train_network.py --network_module="networks.lora" --pretrained_model_name_or_path=/mnt/models/animefull-final-pruned.ckpt --vae=/mnt/models/animevae.pt --train_data_dir=/mnt/datasets/character --output_dir=/mnt/out --output_name=character --caption_extension=.txt --shuffle_caption --prior_loss_weight=1 --network_alpha=128 --resolution=512 --enable_bucket --min_bucket_reso=320 --max_bucket_reso=768 --train_batch_size=1 --gradient_accumulation_steps=1 --learning_rate=0.0001 --text_encoder_lr=0.00005 --max_train_epochs=20 --mixed_precision=fp16 --save_precision=fp16 --use_8bit_adam --xformers --save_every_n_epochs=1 --save_model_as=safetensors --clip_skip=2 --flip_aug --color_aug --face_crop_aug_range="2.0,4.0" --network_dim=128 --max_token_length=225 --lr_scheduler=constant

The train directory's name is 3_Concept1, so 3 repetitions are used.
The script does not throw any errors, but loss=nan and corrupted unets are produced.
I've tried setting mixed_precision to no, but then I ran out of VRAM.
I've also tried disabling xformers, but again I ran out of VRAM.
I've compiled xformers myself, using pip install ninja && MAX_JOBS=4 pip install -v .
Also tried several other xformers versions, like 0.0.16 and the one suggested in the README.
Tried both CUDA 11.6 and 11.7.

Python version: 3.10.6
PyTorch version: torch==1.12.1+cu116 torchvision==0.13.1+cu116

Any help is much appreciated!
Thank you!
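A minimal standalone check, not from this thread, that can help isolate this kind of failure: it runs a single fp16 convolution on the GPU and reports whether the output already contains NaN or Inf. It assumes PyTorch with CUDA is installed; the layer shapes are arbitrary, and it may or may not reproduce the issue, since the behaviour depends on which cuDNN kernels get selected.

import torch

# Arbitrary fp16 convolution on the GPU, similar in spirit to the conv1 call
# that a later comment in this thread traces the NaN to.
device = torch.device("cuda")
conv = torch.nn.Conv2d(128, 128, kernel_size=3, stride=1, padding=1).to(device).half()
x = torch.randn(1, 128, 64, 64, device=device, dtype=torch.float16)

with torch.no_grad():
    out = conv(x)

print("any NaN:", torch.isnan(out).any().item())
print("any Inf:", torch.isinf(out).any().item())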

@TingTingin
Contributor

have you tried mixed precision fp32?

@martianbit
Author

Thanks for your response.
No, I haven't tried it yet, but it doesn't seem to be a valid option:
train_network.py: error: argument --mixed_precision: invalid choice: 'fp32' (choose from 'no', 'fp16', 'bf16')

@KhSTM

KhSTM commented Mar 22, 2023

Just choose "no" instead of fp16 for mixed_precision.

@martianbit closed this as not planned on Apr 8, 2023
@jamszh

jamszh commented Apr 13, 2023

Hey I just stumbled on this thread with the same problem. I have a regular GTX 1660 6GB.

I also run into VRAM issues if I don't use fp16 precision or if I disable xformers, like you've described.

I see this thread was closed 5 days ago. Was there a resolution? I suppose "not planned" suggests that there wasn't. I just wanted to confirm.

In the meantime I've managed to just train some LoRAs on colab.

@luyijun

luyijun commented Apr 14, 2023

I got the same NaN problem on my GTX 1660 6GB.
I traced this problem and found that the source of the NaN is the process of caching the image latents.
It's in library/train_util.py's BaseDataset.cache_latents.
The return value of latents = vae.encode(img_tensors).latent_dist.sample().to("cpu") contains NaN.
The same problem occurs when I disable the latents cache and compute the latents directly.

I traced it further and found it is caused by the ResnetBlock2D in venv/Lib/site-packages/diffusers/models/resnet.py.
The return value of hidden_states = self.conv1(hidden_states) contains NaN.
(self.conv1 = torch.nn.Conv2d(in_channels, out_channels, kernel_size=3, stride=1, padding=1))

I ran it on an A5000 and the results are correct. I think the problem is in CUDA or the GTX 1660 itself.
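For anyone who wants to repeat this kind of tracing, here is a rough sketch, not code from the thread, that registers forward hooks on every VAE submodule and prints each one whose output contains NaN; the first name printed is where the NaN originates. The vae and img_tensors names are assumed to refer to the same objects used in cache_latents.

import torch

def register_nan_hooks(model):
    # Attach a forward hook to every submodule; each hook reports when the
    # module's tensor output contains NaN.
    handles = []

    def make_hook(name):
        def hook(module, inputs, output):
            if isinstance(output, torch.Tensor) and torch.isnan(output).any():
                print(f"NaN in output of {name} ({module.__class__.__name__})")
        return hook

    for name, module in model.named_modules():
        handles.append(module.register_forward_hook(make_hook(name)))
    return handles

# Usage (assuming vae and img_tensors as in BaseDataset.cache_latents):
# handles = register_nan_hooks(vae)
# latents = vae.encode(img_tensors).latent_dist.sample()
# for h in handles:
#     h.remove()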

@PaperOrb

PaperOrb commented May 5, 2023

@kohya-ss
Just adding my voice to the mix to say I have this issue as well. I spoke to @bmaltais in bmaltais/kohya_ss#722 about this. The link will lead you back eventually to the a1111 repo #4407 where a potential fix was mentioned involving setting torch.backends.cudnn.benchmark = True inside a devices.py file. No clue if this works with kohya_ss since I don't know where to find this file or what the equivalent fix for kohya_ss would be. If anyone figures out how to use this for a local fix at least, let me know!

@kohya-ss
Owner

kohya-ss commented May 7, 2023

Sorry for the late reply. I think you can add it immediately after the start of the train method in train_network.py, like this:

https://github.com/kohya-ss/sd-scripts/blob/e6ad3cbc66130fdc3bf9ecd1e0272969b1d613f7/train_network.py#LL64C6-L64C6

def train(args):
    torch.backends.cudnn.benchmark = True
    session_id = random.randint(0, 2**32)

If this fix works, I will add an option to enable this. Please let me know the result!
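A hypothetical sketch of what such an opt-in flag could look like; the --cudnn_benchmark name and the helper functions below are illustrative only, not necessarily what was actually added to sd-scripts.

import argparse
import torch

def add_cudnn_benchmark_argument(parser: argparse.ArgumentParser):
    # Illustrative flag name; the real option in sd-scripts may differ.
    parser.add_argument(
        "--cudnn_benchmark",
        action="store_true",
        help="enable torch.backends.cudnn.benchmark (workaround for NaN loss on GTX 16xx cards)",
    )

def apply_cudnn_benchmark(args):
    if args.cudnn_benchmark:
        torch.backends.cudnn.benchmark = True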

@kohya-ss reopened this May 7, 2023
@martianbit
Author

Yes, this works perfectly, thank you very much for your help!
Have a great day!

@kohya-ss
Owner

kohya-ss commented May 8, 2023

That's good! I will add an option to enable it.

@yoinked-h

Original author of the webui PR here: it causes some noticeable slowdown on non-Turing cards. Also, holy cow, you can train a LoRA on 6GB VRAM, or is it a modded card?
