SDXL training cannot continue from latest #851

Closed
zhuliyi0 opened this issue Aug 22, 2024 · 8 comments
Labels
bug (Something isn't working)
regression (This bug has regressed behaviour that previously worked.)

Comments

@zhuliyi0

2024-08-23 07:26:53,615 [INFO] (main) Resuming from checkpoint checkpoint-1000
Could not load model: 'Namespace' object has no attribute 'unet', traceback: Traceback (most recent call last):
  File "/root/autodl-tmp/SimpleTuner/helpers/training/save_hooks.py", line 429, in _load_full_model
    if self.args.controlnet or self.args.unet:
AttributeError: 'Namespace' object has no attribute 'unet'

Traceback (most recent call last):
  File "/root/autodl-tmp/SimpleTuner/train.py", line 2490, in <module>
    main()
  File "/root/autodl-tmp/SimpleTuner/train.py", line 1225, in main
    accelerator.load_state(os.path.join(args.output_dir, path))
  File "/root/miniconda3/lib/python3.10/site-packages/accelerate/accelerator.py", line 3131, in load_state
    hook(models, input_dir)
  File "/root/autodl-tmp/SimpleTuner/helpers/training/save_hooks.py", line 469, in load_model_hook
    self._load_full_model(models=models, input_dir=input_dir)
  File "/root/autodl-tmp/SimpleTuner/helpers/training/save_hooks.py", line 452, in _load_full_model
    raise Exception(return_exception)
Exception: Could not load model: 'Namespace' object has no attribute 'unet', traceback: Traceback (most recent call last):
  File "/root/autodl-tmp/SimpleTuner/helpers/training/save_hooks.py", line 429, in _load_full_model
    if self.args.controlnet or self.args.unet:
AttributeError: 'Namespace' object has no attribute 'unet'
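For context, a minimal sketch (generic argparse only; not the actual fix, which instead removed the shard-merge code from the load hook in the commit referenced below) of how the failing attribute check could be made tolerant of a Namespace that never defined `unet`:

```python
# Sketch only, not SimpleTuner's code. `args` stands in for the parsed
# argparse Namespace that save_hooks.py reads from.
import argparse

args = argparse.Namespace(controlnet=False)  # note: no `unet` attribute

# The original `self.args.controlnet or self.args.unet` raises AttributeError
# when the running build's argument parser never defined `unet`. A defaulted
# getattr avoids the crash:
if args.controlnet or getattr(args, "unet", False):
    print("taking the ControlNet / full U-Net branch")
else:
    print("taking the fallback branch")
```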

@bghira added labels on Aug 22, 2024: bug (Something isn't working), work-in-progress (This issue relates to some currently in-progress work), regression (This bug has regressed behaviour that previously worked), pending (This issue has a fix that is awaiting test results)
@bghira
Owner

bghira commented Aug 22, 2024

can you check main?

bghira added a commit that referenced this issue Aug 22, 2024
(#851) remove shard merge code on load hook
@zhuliyi0
Author

yes main is working fine. I was on release.

@zhuliyi0
Author

well, somehow I keep getting OOM at step 1002 after resuming from step 1000. I was hitting 2000 steps before, but I was on release, so something in main must have increased vram usage. I can't change batch size when resuming, correct?

@bghira
Owner

bghira commented Aug 23, 2024

you can change batch size at any time
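For illustration, a generic PyTorch sketch (not SimpleTuner-specific; dataset and sizes are placeholders) of why the batch size is safe to change across a resume: it only parameterises the DataLoader rebuilt at startup, while the checkpoint holds model, optimizer, and scheduler state.

```python
# Generic PyTorch sketch, not SimpleTuner code; dataset and sizes are made up.
import torch
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.randn(64, 4))

# The first run might have used batch_size=4; rebuilding the loader with
# batch_size=2 after a resume is fine because the DataLoader itself is never
# checkpointed, only model/optimizer/scheduler state is.
loader = DataLoader(dataset, batch_size=2, shuffle=True)
print(len(loader))  # steps per epoch change, the restored weights do not
```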

@bghira
Owner

bghira commented Aug 23, 2024

did you have quanto enabled before?

@bghira
Owner

bghira commented Aug 23, 2024

the only thing changed since the last stable release is this one, which disabled quanto for base model training. i just didn't want people leaving it enabled by accident. but if anyone needs this, let me know

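For reference, a minimal sketch of what "quanto enabled" means for a base model, using the public optimum-quanto API rather than SimpleTuner's own integration: weights are quantised to int8 and frozen, which is where the VRAM saving (and any difference after disabling it) would come from.

```python
# Sketch using the public optimum-quanto API (pip install optimum-quanto),
# not SimpleTuner's wrapper. The tiny model is a stand-in for the U-Net.
import torch
from optimum.quanto import quantize, freeze, qint8

model = torch.nn.Sequential(torch.nn.Linear(16, 16), torch.nn.GELU())

quantize(model, weights=qint8)  # swap Linear weights for int8 quantised tensors
freeze(model)                   # materialise the quantised weights in place

print(model(torch.randn(1, 16)).shape)
```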

@bghira removed labels on Aug 23, 2024: work-in-progress (This issue relates to some currently in-progress work), pending (This issue has a fix that is awaiting test results)
@bghira closed this as completed on Aug 23, 2024
@zhuliyi0
Author

no quanto, I was doing full finetune.

@zhuliyi0
Author

"you can change batch size at any time"

Good to know. I remember you said the learning rate and schedule are not changeable? What if the learning rate schedule is set to linear or cosine?
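A minimal sketch (diffusers' get_scheduler plus plain PyTorch, not SimpleTuner's training loop; the step counts are placeholders) of why a linear or cosine schedule normally continues rather than restarts on resume: the scheduler's step position is part of the saved state that accelerate restores.

```python
# Illustrative only; model, optimizer and step counts are placeholders.
import torch
from diffusers.optimization import get_scheduler

model = torch.nn.Linear(4, 4)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

# "linear" decays the LR to zero over num_training_steps; "cosine" follows a
# half-cosine curve. Both are indexed by the global step count.
lr_scheduler = get_scheduler(
    "cosine",
    optimizer=optimizer,
    num_warmup_steps=100,
    num_training_steps=2000,
)

# accelerate saves and restores this state_dict at checkpoint time, so after a
# resume the curve picks up at the saved step instead of starting over.
state = lr_scheduler.state_dict()
lr_scheduler.load_state_dict(state)
print(lr_scheduler.get_last_lr())
```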
