
Experimental Redux conditioning for Flux Lora training #1838

Draft
wants to merge 1 commit into base: sd3
Conversation


recris commented Dec 15, 2024

This PR adds support for training Flux.1 LoRA using conditioning from the Redux image encoder.

Instead of relying on text captions to condition the model, why not use the image itself to provide a "perfect" caption instead?

Redux+SigLIP provides a T5-compatible embedding that generates images very close to the target. I thought this could be used instead of relying on text descriptions that may or may not match the concepts as understood by the base model.

To use this I've added the following new parameters:

  • redux_model_path: Safetensors file for the Redux model (downloadable from here)
    • Note: the code will also pull the SigLIP model from HuggingFace (google/siglip-so400m-patch14-384)
  • vision_cond_ratio: this controls an interpolation between the text-based embedding and the Redux embedding. 0.0 is pure text conditioning (same as before), 1.0 is pure Redux vision conditioning. The effect is similar to the "Conditioning Average" node in ComfyUI.
  • vision_cond_dropout: probability of drop-out for the vision conditioning. On each training step this will randomly choose to ignore the vision conditioning and use the text conditioning instead. For example, 0.2 means it will use Redux 80% of the time and regular captions the other 20% (see the sketch after this list).
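
As a rough illustration of how these two parameters interact, here is a minimal sketch of the mixing step described above (not the actual PR code); the names mix_t5_conditioning, t5_embed and redux_embed are placeholders:

    import torch

    def mix_t5_conditioning(t5_embed: torch.Tensor,
                            redux_embed: torch.Tensor,
                            vision_cond_ratio: float,
                            vision_cond_dropout: float) -> torch.Tensor:
        # With probability vision_cond_dropout, skip the vision conditioning
        # entirely and train this step on the plain text embedding.
        if torch.rand(()).item() < vision_cond_dropout:
            return t5_embed
        # Otherwise linearly interpolate between text and Redux conditioning,
        # similar to ComfyUI's "Conditioning Average" node:
        # 0.0 = pure text conditioning, 1.0 = pure Redux conditioning.
        # Assumes both tensors have already been brought to the same shape.
        return (1.0 - vision_cond_ratio) * t5_embed + vision_cond_ratio * redux_embed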

Experimental Notes:

  • Redux is extremely good at describing a target image, to the point where a LoRA trained solely with it becomes very weak when used without Redux. Because the conditioning is so good, it lowers the average loss significantly and the resulting LoRA learns a lot less - it essentially learns the "difference" between base model + Redux and the training data. To mitigate this I added the dropout parameter so that during training the model also sees normal text prompts and avoids becoming dependent on Redux at inference time.
  • The conditioning from the vision encoder is very strong; when using vision_cond_ratio I usually have to set it to 0.2 or lower before I start seeing meaningful differences in what gets learned.
  • Using vision_cond_dropout = 0.5 seems to work well enough; I noticed an improvement in the end result, with fewer "broken" images (bad anatomy, etc.) during inference.
  • This might be a good option for training styles, given that use case tends to require higher-quality, more complete captions.
  • Using this with full fine-tuning is not supported, but there should be no technical restriction to supporting it; I just don't have the hardware to test it.
  • This is not a replacement for text captions: the changes only affect T5 conditioning, and CLIP still needs text captions as before.
  • The interpolation method behind vision_cond_ratio feels very crude and unsound to me; maybe there is a better approach?

I don't expect this PR to be merged anytime soon; I had to make some sub-optimal code changes to make this work. I am just posting it for visibility, so that people can play with it and gather feedback.

recris marked this pull request as draft December 15, 2024 21:32
@FurkanGozukara

@recris amazing work

did you notice whether this solves the issue of training multiple concepts of the same class?

like 2 men at the same time

or when you train a man, it makes all other men turn into you.

does this solve that problem?

moreover, after training, you don't need to use Redux, right, with vision_cond_dropout = 0.5 + vision_cond_ratio = 0.2?

recris (Author) commented Dec 15, 2024

> @recris amazing work
>
> did you notice whether this solves the issue of training multiple concepts of the same class?
>
> like 2 men at the same time
>
> or when you train a man, it makes all other men turn into you.
>
> does this solve that problem?
>
> moreover, after training, you don't need to use Redux, right, with vision_cond_dropout = 0.5 + vision_cond_ratio = 0.2?

This has nothing to do with either of those issues. For multiple concepts you would need something like pivotal tuning, which is currently not supported either.

This PR is only an attempt to improve overall quality in the presence of poorly captioned training data.

@FurkanGozukara

@recris thanks, but you still recommend vision_cond_dropout = 0.5 + vision_cond_ratio = 0.2, and then we can use the trained LoRA without Flux Redux, right?

recris (Author) commented Dec 15, 2024

Please read the notes fully before posting. These are not "recommendations"; this has hardly been tested in a comprehensive way and is probably not ready for widespread use.

That said, you can probably start with vision_cond_dropout = 0.5, vision_cond_ratio = 1.0. Beware that this could also require changes to the learning rate or total number of training steps to achieve the same results as before.


dxqbYD commented Dec 19, 2024

Interesting concept!
About the LoRA only learning the difference between (base model + image conditioning) and training data:

if you consider this a downside and want a stand-alone LoRA as output, you could try to (gradually?) remove the image conditioning from the model prediction, but still expect the model to make the same prediction as if it were still conditioned. Similar to this concept:

Nerogar/OneTrainer#505

but using the base model conditioned by Redux as the teacher, rather than the plain base model.
The result could be a LoRA that replicates the Redux conditioning - but without Redux - which could then be improved by regular training on data.
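
For illustration only, a rough PyTorch sketch of that teacher/student idea; none of this is part of the PR, the model call signature is invented, and base_model, lora_model and the conditioning tensors are hypothetical placeholders supplied by the caller:

    import torch
    import torch.nn.functional as F

    def redux_distillation_loss(base_model, lora_model,
                                noisy_latents, timesteps,
                                text_cond, redux_cond):
        # Teacher: the frozen base model conditioned by Redux (no gradients).
        with torch.no_grad():
            teacher_pred = base_model(noisy_latents, timesteps, cond=redux_cond)
        # Student: the LoRA-augmented model, conditioned on text only.
        student_pred = lora_model(noisy_latents, timesteps, cond=text_cond)
        # Push the LoRA to reproduce the Redux-conditioned behaviour without Redux;
        # this could be blended with, or gradually replaced by, the regular data loss.
        return F.mse_loss(student_pred, teacher_pred)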

recris (Author) commented Dec 20, 2024

> if you consider this a downside and want a stand-alone LoRA as output, you could try to (gradually?) remove the image conditioning from the model prediction, but still expect the model to make the same prediction as if it were still conditioned

This is what vision_cond_dropout can be used for - you feed the model a mix of caption-conditioned and Redux-conditioned samples so it learns not to become dependent on Redux. A value of at least 0.5 seems to do the trick, but maybe you can even go lower.
