--save_state doesn't produce anything? #1921

alexgilseg · 2025-02-05T23:03:53Z

When I train LoRAs with Kohya, I want to be able to resume training in case I need to pause it for some reason. I've been experimenting with the --save_state option, but it doesn't seem to do anything: nothing gets created in the directory I set with --output_dir.

Am I missing something? I thought a folder with the relevant files would be created alongside each safetensors file, no?

Also, once I get this to work, do I use the --resume option like this: --resume /folder/with/resume/files ?

Gtonero · 2025-02-06T03:04:21Z

Is there a log save_state while training?

alexgilseg · 2025-02-06T06:55:10Z

> Is there a log save_state while training?

Interesting. No, nothing like that. It only prints the first part: "saving checkpoint C:.............."

Does one have to set anything specific after --save_state? I thought that with --save_state it would save the state after every epoch.

I'm going to test --save_state_on_train_end now, which I just read about in the README, but it would still be nice if the state were saved with every checkpoint generated.

DKnight54 · 2025-02-10T09:16:11Z

Double-checking the code for saving states suggests that a state is only saved when a checkpoint is also saved. That is, if you set it to save every N steps or every N epochs and have save_state set to true, it will save a state along with each checkpoint, which you can then resume training from.

Without knowing your exact settings, I can only assume that you are probably missing the save every N steps or every N epochs option.

sd-scripts/train_network.py

Lines 1032 to 1044 in 6e3c1d0

    
    if args.save_every_n_steps is not None and global_step % args.save_every_n_steps == 0:
        accelerator.wait_for_everyone()
        if accelerator.is_main_process:
            ckpt_name = train_util.get_step_ckpt_name(args, "." + args.save_model_as, global_step)
            save_model(ckpt_name, accelerator.unwrap_model(network), global_step, epoch)
            if args.save_state:
                train_util.save_and_remove_state_stepwise(args, accelerator, global_step)
            remove_step_no = train_util.get_remove_step_no(args, global_step)
            if remove_step_no is not None:
                remove_ckpt_name = train_util.get_step_ckpt_name(args, "." + args.save_model_as, remove_step_no)
                remove_model(remove_ckpt_name)
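
To make the condition in the snippet above easier to see, here is a minimal sketch that models only the step-wise branch. The helper `state_save_steps` is hypothetical and not part of sd-scripts. It assumes `global_step` counts up from 1, and it ignores the per-epoch branch and the end-of-training save.

```python
from typing import List, Optional


def state_save_steps(
    total_steps: int,
    save_every_n_steps: Optional[int],
    save_state: bool,
) -> List[int]:
    """Return the global steps at which a resumable state would be written.

    Hypothetical helper mirroring only the step-wise condition in the
    quoted snippet; it is not part of sd-scripts.
    """
    saved = []
    for global_step in range(1, total_steps + 1):
        # Same gate as the snippet: without an interval, the checkpoint
        # branch is never entered, so save_state is never consulted.
        checkpoint_due = (
            save_every_n_steps is not None
            and global_step % save_every_n_steps == 0
        )
        if checkpoint_due and save_state:
            saved.append(global_step)
    return saved


print(state_save_steps(10, None, True))   # save_state set, but no interval
print(state_save_steps(10, 4, True))      # interval and save_state both set
print(state_save_steps(10, 4, False))     # interval only: checkpoints, no state
```

Under these assumptions, the first and third calls print an empty list and the second prints `[4, 8]`, which matches the symptom in the original post: with no save interval set, the save_state check inside this branch is never reached.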
