resuming from checkpoint #72
Comments
Did you find an answer to this? |
Looks like a recent commit added the training script as a text file; I haven't tried it, but you can probably run the script manually and supply the option to resume from a checkpoint there. |
I think I've found how, but I'm facing some issues myself, so I was going to open a new issue when I found yours. What I found out is that you can't resume if you launch the training script with fluxgym.
What you can do instead is press Ctrl+C right after you've clicked on "Start training". Wait for the script to save all the latent files first: you can Ctrl+C once it reaches "running training / 学習開始" or after that. If you Ctrl+C before then, the files will be corrupt and you will have to go through the UI again to set everything up and do it over.

Then go into fluxgym\outputs\[yourLoraName] and open the train.bat file in Notepad++. Remove all the "^" characters and make it a one-liner (Ctrl+J in Notepad++). What this does: every time the training saves a .safetensors file in fluxgym\outputs\[YourLoraFolder], it will also create a folder [YourLoraName]-[epochNumber]-state. Maybe save your new command line somewhere if you want to keep the original fluxgym one intact.

Now open PowerShell, go into the fluxgym directory, activate env\Scripts\activate and paste the updated command.
Find your command line again and add this parameter: --resume "c:\fluxgym\outputs\[YourLoraName]\[folder of the state you want to resume from]". For example, if your training's last save was epoch 9, it would be: --resume "c:\fluxgym\outputs\MyGreatLora\MyGreatLora-000009-state"

Now, this could be enough; it will work, I've done it. But that's where my own issue begins :-( To resume a LoRA that stopped at epoch 22 and step 12650, I added the following parameter to the command line, along with --resume: --initial_epoch 23

I had to read the code and ask ChatGPT many questions to try to understand how this works, but it always comes back to resuming from step 0, and it saves the subsequent train_state.json files with steps counting up from 0 again from the point where the resume starts. The LoRA is still improving, so it's still worth doing, but I'd be relieved to see the correct number of steps so I can be sure it's not messing with the optimizer or anything else.

I've tried editing train_state.json to set the correct step, but long story short I had to force it with --initial_step and --skip_until_initial_step. The result is a strange backwards counting of steps, from the number I give it down to 0, subtracting steps at each epoch. I've seen a discussion on the accelerate GitHub (the library used to launch the training) and they say something about checkpointing.py saving the step, etc. For now I really don't get it.

Getting this far has taken me a lot of time already, believe me. I hope this shortcut helps you, and that somebody can also help me :( Any insights? |
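Putting the procedure above together, here is a minimal sketch of the manual resume run. The fluxgym install path, the LoRA name and the state-folder name are placeholders, the script path assumes a typical fluxgym layout, and only the resume-related flags are spelled out; the real flattened command keeps every other flag that was already in train.bat.

```powershell
# Hedged sketch, not the exact fluxgym command: all paths and the LoRA name are placeholders.
cd C:\fluxgym
.\env\Scripts\activate

# Paste the flattened one-line command from outputs\MyGreatLora\train.bat here and append
# --resume pointing at the state folder of the epoch you want to continue from. Only the
# resume-related flags are shown; keep every other flag exactly as it was.
accelerate launch sd-scripts\flux_train_network.py --dataset_config "C:\fluxgym\outputs\MyGreatLora\dataset.toml" --output_dir "C:\fluxgym\outputs\MyGreatLora" --output_name MyGreatLora --save_state --save_every_n_epochs 1 --resume "C:\fluxgym\outputs\MyGreatLora\MyGreatLora-000009-state"
```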
Resuming from a checkpoint works if you use the Kohya training script directly, with the --resume option pointing to the desired checkpoint. |
@quarterturn can you give example command lines please? It's not as easy as it seems. For starters, the --resume argument expects a folder containing the training state; it doesn't work with a checkpoint like you say. I've just tried it again to be 100% sure: --resume "[path]\[lora].safetensors" definitely doesn't work.

Have you actually done this (I mean successfully resumed training a Flux LoRA)? If not, please read what I wrote earlier before posting a one-liner like this. If yes, what were the results? Are you sure it resumed the training taking into account all the previous steps? Because it generates samples and training states with steps resetting from zero, so I'm not sure it's working properly at the moment. I've spent hours today reading the code of this script (train_network.py) to figure this out. |
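To illustrate the distinction being made here (names are made up), this is roughly what the output folder looks like after training with --save_state, and which entry --resume actually expects:

```powershell
# Hedged illustration with placeholder names.
dir C:\fluxgym\outputs\MyGreatLora
#   MyGreatLora-000009.safetensors    <- LoRA weights checkpoint: NOT accepted by --resume
#   MyGreatLora-000009-state\         <- training-state folder: pass this path to --resume
```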
Additional proof of what I'm saying, from the official readme.md, translated from Japanese. The bad news is that these instructions are insufficient:
|
Still researching the --resume feature in accelerate; I'm now convinced it can't be trusted. Refer to kohya-ss/sd-scripts#772 - I got resuming step = 0 when using --resume, after it accesses the saved training state. I do hope I'm wrong, but I don't see how. |
Definitely need a resume option that works. |
Actually, cheer up, because since I last wrote that I've spent a lot of time in the code and adding loggers. Resume actually works fine: you just have to use --save_state from the start, then use --resume "path\yourlora-0000xx-state", and it will work. This is backed up by watching the loss curve in tensorboard; it does not start from scratch. I've been using it for maybe a week now and have nothing special to report.

The progress bar is very misleading, starting from zero again, and there is also a bug in the train_state.json saving so the current_step value is incorrect, meaning subsequent states are saved with steps starting again from zero at every resume. But don't worry like I did... it doesn't change the internal trainer step, which is reset to zero at the end of every epoch anyway. The only case where there might be a problem is if you're using a parameter to save at a given step; then I'm not sure the trainer's internal step is properly saved in the binary state file.

I've done a local fix for the progress bar and the train_state file that I need to clean up and suggest to kohya, but I have lots of other things to hack on. |
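A small sketch of that workflow, assuming --save_state and --save_every_n_epochs 1 were on the original command so that *-state folders exist (the LoRA name and paths are placeholders):

```powershell
# Hedged sketch: find the most recent state folder and print the --resume argument to append.
$loraDir   = "C:\fluxgym\outputs\MyGreatLora"
$lastState = Get-ChildItem $loraDir -Directory -Filter "*-state" |
             Sort-Object Name | Select-Object -Last 1
Write-Host "Append this to the flattened train command: --resume `"$($lastState.FullName)`""
```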
Awesome work Tablaski. If I didn't misunderstand, we should open the advanced options in Fluxgym (by Pinokio) and pick --save_state from the get-go, otherwise --resume only works from the last epoch and not from intermediate steps within an epoch? How much time passes between epochs for you?

Update from my side: when using the train.bat directly (just starting it from the CMD line; then there is no need to remove all the "^" and make it one line, just run the train.bat), it actually shows that it's starting from step X, i.e. whatever you specified, at least that's what it prints in the CMD. I haven't finished a training from an advanced state yet to see if it works properly. Where can we learn about using tensorboard though? |
Yes, you absolutely need --save_state during training if you want to resume later; it's impossible otherwise. Then you will need --save_every_n_epochs n, where n = the number of epochs between saves. I use 1, so it saves every epoch.

You can also use --save_state_on_train_end if you just want to save at the end (last epoch), but then you have to make sure nothing crashes your training, and that setting is poor because you won't be able to resume from an earlier epoch if your model starts overfitting. If you are paranoid you can use --save_every_n_steps n, where n is the number of steps between saves, so you could save every half epoch for instance. I don't use that; once per epoch is enough for me.

The time between epochs really depends on how many repeats you are using and whether you are using regularization pictures (which doubles the steps).

Tensorboard is really easy to set up, a 5-minute job: basically "pip install tensorboard", activate env\Scripts\activate in the fluxgym folder, then run "tensorboard --logdir [log path]". You also have to add --log_with tensorboard and --logging_dir [log path] to the training command so it writes the logs tensorboard will open. I've found tensorboard REALLY useful for monitoring what you're doing with your training, anticipating problems, and learning about training.

Btw, the best course is asking ChatGPT about training; you have no idea how much I've learned about training by asking it about every option and concept. I knew more in a few weeks than a friend who has been training for a year or more. Do compare what ChatGPT says with forums and articles from time to time; for a few concepts, people's experience was more accurate, but overall it gives good advice. |
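For reference, a minimal sketch of that tensorboard setup; the fluxgym path and log directory are placeholders, and --log_with / --logging_dir are the kohya sd-scripts options mentioned above that go on the training command itself:

```powershell
# Hedged sketch of the tensorboard setup described above (placeholder paths).
cd C:\fluxgym
.\env\Scripts\activate
pip install tensorboard

# On the training command, add the options that make the script write tensorboard logs:
#   --log_with tensorboard --logging_dir "C:\fluxgym\outputs\MyGreatLora\logs"

# Then point tensorboard at that directory and open http://localhost:6006 in a browser.
tensorboard --logdir "C:\fluxgym\outputs\MyGreatLora\logs"
```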
On slow systems saving every epoch takes quite some time though. Idk what I broke right now, but after increasing the number of workers to 40 it's generating at lightning speed compared to before :O even the image generation is going at 2 iterations per second (even weirder, it didn't work 3 times with exactly the same settings, then all of a sudden it goes at berserk speed). Using the 8GB option now, 40 workers, in fluxgym by Pinokio. On the other hand, I barely see improvements in the face, i.e. it doesn't look closer to me, only some of the clothes... Do we have to crop and focus on the face? Or did you get better results by changing the prompt? |
What setup do you have? What's the batch size? What's the resolution? I did not gain anything significant in terms of speed by playing with dataloaders, workers, etc. When training faces I crop everything but the faces. |
EDIT: lol, because I was too lazy to close the CMD window I actually still had everything (saved it to a notepad now): saving checkpoint: D:\Users\Eddy\Desktop\AiTools\Pinokio\api\fluxgym.git\outputs\kh-2e\kh-2e.safetensors - that's how long it took from start to finish, so that was with 16GB picked in the FluxGym GUI, running on a laptop RTX 4080 (12GB VRAM), 64GB RAM, i9-12900. Resolution was 512x512, batch size 1. What does batch size cost in terms of resources? (The logs seem to indicate that threads per process doesn't do anything and is turned back to 1; changing it didn't affect anything either.) In any case, do you want me to paste the whole txt file or should I look for something specific?
|
There are several of them though: train_batch_size, train_encoder_batch_size, the regular batch_size, and also vae_batch_size. Which ones did you modify for the performance increase? |
This is getting away from the main subject so I will try to keep it to the point. The things that made the most difference for me (RTX 4090, 16 GB VRAM) are:

--fp8_base: if I remove that, terrible performance (it needs more VRAM).

Resolution in dataset.toml: if I go to 1024 it's much slower.

Batch size in dataset.toml: you have to monitor your VRAM usage and see what is the optimum you can reach. => This is the most important setting for improving speed; going from 1 to 6 reduced my total training time by 2.5x.

Other settings like mixed precision, loaders and CPU workers: I have played with them but saw no big difference that I would remember. Just use a sufficient number of loaders and CPU workers, but not too many, otherwise it actually slows you down because the script has to synchronize many threads.

Also, I cannot emphasize this enough: the default learning-rate scheduler, constant with warmup, IS COMPLETE GARBAGE, to the point I'd like to write to @cocktailpeanut to remove it and replace it with cosine with warmup, which is amazing. Why?

Constant with warmup: starts the learning rate extremely low and increases it with a linear function. That makes learning extremely slow at the start, takes many epochs to show results, then overfits the model when it reaches the later epochs.

Cosine + warmup (warmup 20% + decay 80%): increases the learning rate linearly for the first 20% of your steps, then stays at the maximum for some time, then decreases it slowly, then a bit faster, then even faster during the second half of the steps, eventually landing very low as you reach the final epoch. This lets you train carefully at the beginning before the trainer has seen your dataset fully, then learn properly, then slow down before risking overfitting, and then refine details without risking much in the later epochs.

THIS IS EVEN MORE IMPORTANT THAN SPEED. We're talking about a setting that will save you many resumes or complete re-trainings because you've otherwise just wasted your training time. This new setting is very solid and made me very confident, whereas before, with constant_with_warmup, I would anxiously read the average loss curve and resume multiple times to manually adjust learning rates because things would go very wrong at some point. I am NEVER using the default setting again. |
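For anyone who wants to try the scheduler change described above, a sketch of the relevant flags; the option names are standard kohya sd-scripts arguments, and the total step count is a placeholder you would replace with your own run's length:

```powershell
# Hedged sketch: the scheduler swap described above, as flags to edit in the flattened command.
#   default criticised above:      --lr_scheduler constant_with_warmup
#   suggested cosine with warmup:  --lr_scheduler cosine --lr_warmup_steps <20% of total steps>
$totalSteps  = 3000                       # placeholder: your run's total number of steps
$warmupSteps = [int]($totalSteps * 0.2)   # ~20% warmup, as described above
Write-Host "--lr_scheduler cosine --lr_warmup_steps $warmupSteps"
```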
That's it, I've just opened a new issue about cosine with warmup; we need to get rid of constant_with_warmup ASAP. |
hi everybody, can someone help me?
Resume file path
Train script
Complete log attached
I believe you need to have included "--save_state" in your train.bat file
Is there a way to resume training from a checkpoint? Let's say you can't run it non-stop due to demand-based electricity pricing.