--save_state doesn't produce anything? #1921

Open
alexgilseg opened this issue Feb 5, 2025 · 3 comments

Comments

@alexgilseg

When I train LoRAs with Kohya, I want to be able to resume training in case I need to pause it for some reason. I've been experimenting with the --save_state option, but it doesn't seem to do anything: nothing gets created in the --output_dir I set.

Am I missing something? I thought a folder with the relevant files would be created alongside each safetensors file, no?

Also, once I get this to work, do I use the --resume option like this: --resume /folder/with/resume/files ?

@Gtonero

Gtonero commented Feb 6, 2025

Is there a log save_state while training?
Image

@alexgilseg
Author

> Is there a log save_state while training?

Interesting. No, nothing like that. It just shows the first part: "saving checkpoint C:.............."

Do I have to set anything specific after --save_state? I thought that with --save_state it would save the state after every epoch.

I'm going to test --save_state_on_train_end now, which I just read about in the README, but it would still be nice if the state were saved with every checkpoint.

@DKnight54
Contributor

Double-checking the code for saving states suggests that a state is only saved when a checkpoint is also being saved. That is, if you have set it to save every N steps or every N epochs and save_state is set to true, then it will save a state along with the checkpoint, which you can then resume training from.

Without knowing your exact settings, I can only assume that you are missing the save-every-N-steps or save-every-N-epochs option.

sd-scripts/train_network.py

Lines 1032 to 1044 in 6e3c1d0

    if args.save_every_n_steps is not None and global_step % args.save_every_n_steps == 0:
        accelerator.wait_for_everyone()
        if accelerator.is_main_process:
            ckpt_name = train_util.get_step_ckpt_name(args, "." + args.save_model_as, global_step)
            save_model(ckpt_name, accelerator.unwrap_model(network), global_step, epoch)
            if args.save_state:
                train_util.save_and_remove_state_stepwise(args, accelerator, global_step)
            remove_step_no = train_util.get_remove_step_no(args, global_step)
            if remove_step_no is not None:
                remove_ckpt_name = train_util.get_step_ckpt_name(args, "." + args.save_model_as, remove_step_no)
                remove_model(remove_ckpt_name)
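
To make the gating in the snippet above easier to see, here is a minimal, self-contained sketch of the step-wise condition only. The function name `saves_state_at_step` is hypothetical and not part of sd-scripts. It leaves out the main-process check and the separate per-epoch path, so treat it as a simplified model of the quoted lines rather than the actual implementation.

```python
from typing import Optional


def saves_state_at_step(save_state: bool, save_every_n_steps: Optional[int], global_step: int) -> bool:
    """Simplified model of the quoted snippet (hypothetical helper, not in sd-scripts)."""
    # A step checkpoint is only due when an interval is set and divides the step.
    checkpoint_due = save_every_n_steps is not None and global_step % save_every_n_steps == 0
    # The state is written alongside the checkpoint, never on its own.
    return checkpoint_due and save_state


# save_state enabled but no step interval: no step-wise state is written.
print([s for s in range(1, 11) if saves_state_at_step(True, None, s)])  # prints []
# save_state enabled with an interval of 5 steps.
print([s for s in range(1, 11) if saves_state_at_step(True, 5, s)])  # prints [5, 10]
```

Under this model, enabling save_state by itself never triggers a step-wise state save; it only takes effect on steps where a checkpoint is also being written.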
