
Experimental Redux conditioning for Flux Lora training #1838

Draft
wants to merge 1 commit into base: sd3
Conversation


recris commented Dec 15, 2024

This PR adds support for training Flux.1 LoRA using conditioning from the Redux image encoder.

Instead of relying on text captions to condition the model, why not use the image itself to provide a "perfect" caption instead?

Redux+SigLIP provides a T5-compatible embedding that generates images very close to the target. I thought this could be used instead of relying on text descriptions that may or may not match the concepts as understood by the base model.

To use this I've added the following new parameters:

  • redux_model_path: Safetensors file for the Redux model (downloadable from here)
    • Note: the code will also pull the SigLIP model from HuggingFace (google/siglip-so400m-patch14-384)
  • vision_cond_ratio: this controls an interpolation between the text-based embedding and the Redux embedding. 0.0 is pure text conditioning (same as before), 1.0 is pure Redux vision conditioning. The effect is similar to the "Conditioning Average" node in ComfyUI.
  • vision_cond_dropout: probability of drop-out for the vision conditioning. On each training step this will randomly choose to ignore the vision conditioning and use the text conditioning instead. For example, 0.2 means it will use Redux 80% of the time and regular captions the other 20% (see the sketch after this list).
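
As a rough illustration of how these two parameters interact, here is a minimal sketch of the mixing step described above (not the actual PR code); the names mix_t5_conditioning, t5_embed and redux_embed are placeholders:

    import torch

    def mix_t5_conditioning(t5_embed: torch.Tensor,
                            redux_embed: torch.Tensor,
                            vision_cond_ratio: float,
                            vision_cond_dropout: float) -> torch.Tensor:
        # With probability vision_cond_dropout, skip the vision conditioning
        # entirely and train this step on the plain text embedding.
        if torch.rand(()).item() < vision_cond_dropout:
            return t5_embed
        # Otherwise linearly interpolate between text and Redux conditioning,
        # similar to ComfyUI's "Conditioning Average" node:
        # 0.0 = pure text conditioning, 1.0 = pure Redux conditioning.
        # Assumes both tensors have already been brought to the same shape.
        return (1.0 - vision_cond_ratio) * t5_embed + vision_cond_ratio * redux_embed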

Experimental Notes:

  • Redux is extremely good at describing a target image, to the point where a LoRA trained solely with it becomes very weak when used without Redux. Because the conditioning is so good, it lowers the average loss significantly and the resulting LoRA learns a lot less - it essentially learns the "difference" between base model + Redux and the training data. To mitigate this I added the dropout parameter so that during training the model also sees normal text prompts and avoids becoming dependent on Redux at inference time.
  • The conditioning from the vision encoder is very strong; when using vision_cond_ratio I usually have to set it to 0.2 or lower before I start seeing meaningful differences in what gets learned.
  • Using vision_cond_dropout = 0.5 seems to work well enough; I noticed an improvement in the end result, with fewer "broken" images (bad anatomy, etc.) during inference.
  • This might be a good option for training styles, given that use case tends to require higher-quality, more complete captions.
  • Using this with full fine-tuning is not supported, but there should be no technical restriction to supporting it; I just don't have the hardware to test it.
  • This is not a replacement for text captions: the changes only affect T5 conditioning, and CLIP still needs text captions as before.
  • The interpolation method behind vision_cond_ratio feels very crude and unsound to me; maybe there is a better approach?

I don't expect this PR to be merged anytime soon; I had to make some sub-optimal code changes to make this work. I am just posting it for visibility, so that people can play with it and gather feedback.

recris marked this pull request as draft December 15, 2024 21:32
@FurkanGozukara

@recris amazing work

did you notice whether this solves the issue of training multiple concepts of the same class?

like 2 men at the same time

or when you train a man, it makes all other men turn into you.

does this solve that problem?

moreover, after training, you don't need to use Redux, right, with vision_cond_dropout = 0.5 + vision_cond_ratio = 0.2?

recris (Author) commented Dec 15, 2024

> @recris amazing work
>
> did you notice whether this solves the issue of training multiple concepts of the same class?
>
> like 2 men at the same time
>
> or when you train a man, it makes all other men turn into you.
>
> does this solve that problem?
>
> moreover, after training, you don't need to use Redux, right, with vision_cond_dropout = 0.5 + vision_cond_ratio = 0.2?

This has nothing to do with either of those issues. For multiple concepts you would need something like pivotal tuning, which is currently not supported either.

This PR is only an attempt to improve overall quality in the presence of poorly captioned training data.

@FurkanGozukara

@recris thanks, but you still recommend vision_cond_dropout = 0.5 + vision_cond_ratio = 0.2, and then we can use the trained LoRA without Flux Redux, right?

recris (Author) commented Dec 15, 2024

Please read the notes fully before posting. These are not "recommendations"; this has hardly been tested in a comprehensive way and is probably not ready for widespread use.

That said, you can probably start with vision_cond_dropout = 0.5, vision_cond_ratio = 1.0. Beware that this could also require changes to the learning rate or total number of training steps to achieve the same results as before.


dxqbYD commented Dec 19, 2024

Interesting concept!
About the LoRA only learning the difference between (base model + image conditioning) and training data:

if you consider this a downside and want a stand-alone LoRA as output, you could try to (gradually?) remove the image conditioning from the model prediction, but still expect the model to make the same prediction as if it were still conditioned. Similar to this concept:

Nerogar/OneTrainer#505

but using the base model conditioned by Redux as the teacher, rather than the plain base model.
The result could be a LoRA that replicates the Redux conditioning - but without Redux - which could then be improved by regular training on data.
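
For illustration only, a rough PyTorch sketch of that teacher/student idea; none of this is part of the PR, the model call signature is invented, and base_model, lora_model and the conditioning tensors are hypothetical placeholders supplied by the caller:

    import torch
    import torch.nn.functional as F

    def redux_distillation_loss(base_model, lora_model,
                                noisy_latents, timesteps,
                                text_cond, redux_cond):
        # Teacher: the frozen base model conditioned by Redux (no gradients).
        with torch.no_grad():
            teacher_pred = base_model(noisy_latents, timesteps, cond=redux_cond)
        # Student: the LoRA-augmented model, conditioned on text only.
        student_pred = lora_model(noisy_latents, timesteps, cond=text_cond)
        # Push the LoRA to reproduce the Redux-conditioned behaviour without Redux;
        # this could be blended with, or gradually replaced by, the regular data loss.
        return F.mse_loss(student_pred, teacher_pred)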

recris (Author) commented Dec 20, 2024

> if you consider this a downside and want a stand-alone LoRA as output, you could try to (gradually?) remove the image conditioning from the model prediction, but still expect the model to make the same prediction as if it were still conditioned

This is what vision_cond_dropout can be used for - you feed the model a mix of caption-conditioned and Redux-conditioned samples so it learns not to become dependent on Redux. A value of at least 0.5 seems to do the trick, but maybe you can even go lower.
