What the hell is wrong with the repo? Getting weird images no matter the step count 😣 #230

Open
ZeroCool22 opened this issue Apr 19, 2023 · 2 comments
Labels
bug Something isn't working

Comments

@ZeroCool22

ZeroCool22 commented Apr 19, 2023

Describe the bug

I installed everything following this guide: https://pastebin.com/uE1WcSxD

The only steps I did differently are:

pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
sudo apt-get install cuda=11.8.0-1
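
For what it's worth, a quick sanity check that the cu118 wheel actually sees the GPU (a generic check, not a step from the guide):

python -c "import torch; print(torch.__version__, torch.version.cuda, torch.cuda.is_available())"
nvidia-smi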

Everything seems to work fine while training, but when I create images with AUTO's GUI (I tried InvokeAI too, same issue) all the images look weird.

In the next image, you can see the training data used and the resulting images when using AUTO's:

[screenshot: training images alongside the generated results]

💡 And this is not a matter of step count; I tried different step counts (1000–4500) and always get the same weird images.

I used this script to do the conversion to ckpt: https://pastebin.com/ct6mTzAA
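
For reference, a typical invocation of the Diffusers-to-ckpt conversion script that ships in the diffusers repo looks roughly like this (paths are placeholders, and I'm not assuming the pastebin script is identical to it):

python scripts/convert_diffusers_to_original_stable_diffusion.py \
  --model_path /path/to/dreambooth_output \
  --checkpoint_path /path/to/model.ckpt \
  --half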

But as I said before, it's not a conversion-script problem, because I used the model in Diffusers format in InvokeAI and got the same weird results.

If someone could tell me what is wrong with the repo, that would be great.

Reproduction

Training process console:

(diffusers) zerocool@DESKTOP-MMG43AJ:~/github/diffusers/examples/dreambooth$ ./my_training.sh
/home/zerocool/anaconda3/envs/diffusers/lib/python3.9/site-packages/accelerate/accelerator.py:249: FutureWarning: `logging_dir` is deprecated and will be removed in version 0.18.0 of 🤗 Accelerate. Use `project_dir` instead.
  warnings.warn(
/home/zerocool/anaconda3/envs/diffusers/lib/python3.9/site-packages/transformers/modeling_utils.py:429: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly.  To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
  with safe_open(checkpoint_file, framework="pt") as f:
/home/zerocool/anaconda3/envs/diffusers/lib/python3.9/site-packages/torch/_utils.py:776: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly.  To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
  return self.fget.__get__(instance, owner)()
/home/zerocool/anaconda3/envs/diffusers/lib/python3.9/site-packages/torch/storage.py:899: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly.  To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
  storage = cls(wrap_storage=untyped_storage)
/home/zerocool/anaconda3/envs/diffusers/lib/python3.9/site-packages/safetensors/torch.py:99: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly.  To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
  with safe_open(filename, framework="pt", device=device) as f:

===================================BUG REPORT===================================
Welcome to bitsandbytes. For bug reports, please run

python -m bitsandbytes

 and submit this information together with your error trace to: https://github.com/TimDettmers/bitsandbytes/issues
================================================================================
bin /home/zerocool/anaconda3/envs/diffusers/lib/python3.9/site-packages/bitsandbytes/libbitsandbytes_cuda118_nocublaslt.so
/home/zerocool/anaconda3/envs/diffusers/lib/python3.9/site-packages/bitsandbytes/cuda_setup/main.py:145: UserWarning: /home/zerocool/anaconda3/envs/diffusers did not contain ['libcudart.so', 'libcudart.so.11.0', 'libcudart.so.12.0'] as expected! Searching further paths...
  warn(msg)
/home/zerocool/anaconda3/envs/diffusers/lib/python3.9/site-packages/bitsandbytes/cuda_setup/main.py:145: UserWarning: /usr/lib/wsl/lib: did not contain ['libcudart.so', 'libcudart.so.11.0', 'libcudart.so.12.0'] as expected! Searching further paths...
  warn(msg)
/home/zerocool/anaconda3/envs/diffusers/lib/python3.9/site-packages/bitsandbytes/cuda_setup/main.py:145: UserWarning: WARNING: The following directories listed in your path were found to be non-existent: {PosixPath('runwayml/stable-diffusion-v1-5')}
  warn(msg)
CUDA_SETUP: WARNING! libcudart.so not found in any environmental path. Searching in backup paths...
/home/zerocool/anaconda3/envs/diffusers/lib/python3.9/site-packages/bitsandbytes/cuda_setup/main.py:145: UserWarning: Found duplicate ['libcudart.so', 'libcudart.so.11.0', 'libcudart.so.12.0'] files: {PosixPath('/usr/local/cuda/lib64/libcudart.so'), PosixPath('/usr/local/cuda/lib64/libcudart.so.11.0')}.. We'll flip a coin and try one of these, in order to fail forward.
Either way, this might cause trouble in the future:
If you get `CUDA error: invalid device function` errors, the above might be the cause and the solution is to make sure only one ['libcudart.so', 'libcudart.so.11.0', 'libcudart.so.12.0'] in the paths that we search based on your env.
  warn(msg)
CUDA SETUP: CUDA runtime path found: /usr/local/cuda/lib64/libcudart.so
CUDA SETUP: Highest compute capability among GPUs detected: 6.1
CUDA SETUP: Detected CUDA version 118
/home/zerocool/anaconda3/envs/diffusers/lib/python3.9/site-packages/bitsandbytes/cuda_setup/main.py:145: UserWarning: WARNING: Compute capability < 7.5 detected! Only slow 8-bit matmul is supported for your GPU!
  warn(msg)
CUDA SETUP: Loading binary /home/zerocool/anaconda3/envs/diffusers/lib/python3.9/site-packages/bitsandbytes/libbitsandbytes_cuda118_nocublaslt.so...
/home/zerocool/anaconda3/envs/diffusers/lib/python3.9/site-packages/diffusers/configuration_utils.py:203: FutureWarning: It is deprecated to pass a pretrained model name or path to `from_config`.If you were trying to load a scheduler, please use <class 'diffusers.schedulers.scheduling_ddpm.DDPMScheduler'>.from_pretrained(...) instead. Otherwise, please make sure to pass a configuration dictionary instead. This functionality will be removed in v1.0.0.
  deprecate("config-passed-as-path", "1.0.0", deprecation_message, standard_warn=False)
Caching latents: 100%|██████████████████████████████████████████████████████████████████| 35/35 [00:05<00:00,  6.10it/s]
04/19/2023 19:46:10 - INFO - __main__ - ***** Running training *****
04/19/2023 19:46:10 - INFO - __main__ -   Num examples = 35
04/19/2023 19:46:10 - INFO - __main__ -   Num batches each epoch = 35
04/19/2023 19:46:10 - INFO - __main__ -   Num Epochs = 58
04/19/2023 19:46:10 - INFO - __main__ -   Instantaneous batch size per device = 1
04/19/2023 19:46:10 - INFO - __main__ -   Total train batch size (w. parallel, distributed & accumulation) = 1
04/19/2023 19:46:10 - INFO - __main__ -   Gradient Accumulation steps = 1
04/19/2023 19:46:10 - INFO - __main__ -   Total optimization steps = 2000
Steps:   1%|▌                                                    | 19/2000 [00:33<45:12,  1.37s/it, loss=0.167, lr=1e-6]
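
The contents of my_training.sh aren't shown above. As a rough, hypothetical reconstruction only, a diffusers DreamBooth launch matching the logged settings (batch size 1, lr 1e-6, 2000 steps, 8-bit Adam via bitsandbytes) would look something like this; the instance prompt and paths are made-up placeholders:

accelerate launch train_dreambooth.py \
  --pretrained_model_name_or_path="runwayml/stable-diffusion-v1-5" \
  --instance_data_dir="./instance_images" \
  --output_dir="./dreambooth_output" \
  --instance_prompt="a photo of sks person" \
  --resolution=512 \
  --train_batch_size=1 \
  --gradient_accumulation_steps=1 \
  --learning_rate=1e-6 \
  --lr_scheduler="constant" \
  --lr_warmup_steps=0 \
  --max_train_steps=2000 \
  --use_8bit_adam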

Logs

No response

System Info

  • diffusers version: 0.15.0.dev0
  • Platform: Linux-5.10.16.3-microsoft-standard-WSL2-x86_64-with-glibc2.35
  • Python version: 3.9.16
  • PyTorch version (GPU?): 2.0.0+cu118 (True)
  • Huggingface_hub version: 0.13.4
  • Transformers version: 4.28.1
  • Accelerate version: 0.18.0
  • xFormers version: 0.0.18
  • Using GPU in script?: 1080 TI
  • Using distributed or parallel set-up in script?:
@ZeroCool22 added the bug (Something isn't working) label Apr 19, 2023
@ViddleShtix

Same here, but only when using 8-bit Adam. Without it, it works perfectly fine. It seems like 8-bit Adam causes the model to overfit very quickly. I've been trying to find a workaround for days but haven't gotten anywhere, so now I'm just playing around with steps and learning rate until something works.

@2kpr

2kpr commented Aug 1, 2023

> Same here, but only when using 8-bit Adam. Without it, it works perfectly fine. It seems like 8-bit Adam causes the model to overfit very quickly. I've been trying to find a workaround for days but haven't gotten anywhere, so now I'm just playing around with steps and learning rate until something works.

Check your bitsandbytes version; there is a good chance it is above 0.35.0, and if so, downgrade to 0.35.0 and train again. There have been known issues with any bitsandbytes version above 0.35.0 since the end of 2022 when using AdamW8bit, etc.

kohya-ss/sd-scripts#523
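
A minimal way to apply that downgrade in a pip-managed environment, then re-run the bitsandbytes self-check mentioned in the log above:

pip install bitsandbytes==0.35.0
python -m bitsandbytes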
