SDXL training cannot continue from latest #851

Closed
zhuliyi0 opened this issue Aug 22, 2024 · 8 comments
Labels
bug (Something isn't working)
regression (This bug has regressed behaviour that previously worked.)

Comments

@zhuliyi0

2024-08-23 07:26:53,615 [INFO] (main) Resuming from checkpoint checkpoint-1000
Could not load model: 'Namespace' object has no attribute 'unet', traceback: Traceback (most recent call last):
  File "/root/autodl-tmp/SimpleTuner/helpers/training/save_hooks.py", line 429, in _load_full_model
    if self.args.controlnet or self.args.unet:
AttributeError: 'Namespace' object has no attribute 'unet'

Traceback (most recent call last):
  File "/root/autodl-tmp/SimpleTuner/train.py", line 2490, in <module>
    main()
  File "/root/autodl-tmp/SimpleTuner/train.py", line 1225, in main
    accelerator.load_state(os.path.join(args.output_dir, path))
  File "/root/miniconda3/lib/python3.10/site-packages/accelerate/accelerator.py", line 3131, in load_state
    hook(models, input_dir)
  File "/root/autodl-tmp/SimpleTuner/helpers/training/save_hooks.py", line 469, in load_model_hook
    self._load_full_model(models=models, input_dir=input_dir)
  File "/root/autodl-tmp/SimpleTuner/helpers/training/save_hooks.py", line 452, in _load_full_model
    raise Exception(return_exception)
Exception: Could not load model: 'Namespace' object has no attribute 'unet', traceback: Traceback (most recent call last):
  File "/root/autodl-tmp/SimpleTuner/helpers/training/save_hooks.py", line 429, in _load_full_model
    if self.args.controlnet or self.args.unet:
AttributeError: 'Namespace' object has no attribute 'unet'
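For context, a minimal sketch (generic argparse only; not the actual fix, which instead removed the shard-merge code from the load hook in the commit referenced below) of how the failing attribute check could be made tolerant of a Namespace that never defined `unet`:

```python
# Sketch only, not SimpleTuner's code. `args` stands in for the parsed
# argparse Namespace that save_hooks.py reads from.
import argparse

args = argparse.Namespace(controlnet=False)  # note: no `unet` attribute

# The original `self.args.controlnet or self.args.unet` raises AttributeError
# when the running build's argument parser never defined `unet`. A defaulted
# getattr avoids the crash:
if args.controlnet or getattr(args, "unet", False):
    print("taking the ControlNet / full U-Net branch")
else:
    print("taking the fallback branch")
```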

@bghira added labels on Aug 22, 2024: bug (Something isn't working), work-in-progress (This issue relates to some currently in-progress work), regression (This bug has regressed behaviour that previously worked), pending (This issue has a fix that is awaiting test results)
@bghira
Owner

bghira commented Aug 22, 2024

can you check main?

bghira added a commit that referenced this issue Aug 22, 2024
(#851) remove shard merge code on load hook
@zhuliyi0
Author

yes main is working fine. I was on release.

@zhuliyi0
Author

well, somehow I keep getting OOM at step 1002 after resuming from step 1000. I was hitting 2000 steps before, but I was on release, so something in main must have increased vram usage. I can't change batch size when resuming, correct?

@bghira
Owner

bghira commented Aug 23, 2024

you can change batch size at any time
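For illustration, a generic PyTorch sketch (not SimpleTuner-specific; dataset and sizes are placeholders) of why the batch size is safe to change across a resume: it only parameterises the DataLoader rebuilt at startup, while the checkpoint holds model, optimizer, and scheduler state.

```python
# Generic PyTorch sketch, not SimpleTuner code; dataset and sizes are made up.
import torch
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.randn(64, 4))

# The first run might have used batch_size=4; rebuilding the loader with
# batch_size=2 after a resume is fine because the DataLoader itself is never
# checkpointed, only model/optimizer/scheduler state is.
loader = DataLoader(dataset, batch_size=2, shuffle=True)
print(len(loader))  # steps per epoch change, the restored weights do not
```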

@bghira
Owner

bghira commented Aug 23, 2024

did you have quanto enabled before?

@bghira
Owner

bghira commented Aug 23, 2024

the only thing changed since the last stable release is this one, which disabled quanto for base model training. i just didn't want people leaving it enabled by accident. but if anyone needs this, let me know

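For reference, a minimal sketch of what "quanto enabled" means for a base model, using the public optimum-quanto API rather than SimpleTuner's own integration: weights are quantised to int8 and frozen, which is where the VRAM saving (and any difference after disabling it) would come from.

```python
# Sketch using the public optimum-quanto API (pip install optimum-quanto),
# not SimpleTuner's wrapper. The tiny model is a stand-in for the U-Net.
import torch
from optimum.quanto import quantize, freeze, qint8

model = torch.nn.Sequential(torch.nn.Linear(16, 16), torch.nn.GELU())

quantize(model, weights=qint8)  # swap Linear weights for int8 quantised tensors
freeze(model)                   # materialise the quantised weights in place

print(model(torch.randn(1, 16)).shape)
```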

@bghira removed labels on Aug 23, 2024: work-in-progress (This issue relates to some currently in-progress work), pending (This issue has a fix that is awaiting test results)
@bghira closed this as completed on Aug 23, 2024
@zhuliyi0
Author

no quanto, I was doing full finetune.

@zhuliyi0
Author

"you can change batch size at any time"

Good to know. I remember you said the learning rate and schedule are not changeable? What if the learning rate schedule is set to linear or cosine?
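A minimal sketch (diffusers' get_scheduler plus plain PyTorch, not SimpleTuner's training loop; the step counts are placeholders) of why a linear or cosine schedule normally continues rather than restarts on resume: the scheduler's step position is part of the saved state that accelerate restores.

```python
# Illustrative only; model, optimizer and step counts are placeholders.
import torch
from diffusers.optimization import get_scheduler

model = torch.nn.Linear(4, 4)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

# "linear" decays the LR to zero over num_training_steps; "cosine" follows a
# half-cosine curve. Both are indexed by the global step count.
lr_scheduler = get_scheduler(
    "cosine",
    optimizer=optimizer,
    num_warmup_steps=100,
    num_training_steps=2000,
)

# accelerate saves and restores this state_dict at checkpoint time, so after a
# resume the curve picks up at the saved step instead of starting over.
state = lr_scheduler.state_dict()
lr_scheduler.load_state_dict(state)
print(lr_scheduler.get_last_lr())
```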
