Experimental Redux conditioning for Flux Lora training #1838
base: sd3
Conversation
@recris Amazing work! Did you notice whether this solves the problem of training multiple concepts of the same class, e.g. two men at the same time, or the case where training on one man makes every other man turn into him? Also, after training we don't need to use Redux at inference with vision_cond_dropout = 0.5 + vision_cond_ratio = 0.2, right?
This has nothing to do with either of those issues. For multiple concepts you would need something like pivotal tuning, which is not currently supported either. This PR is only an attempt to improve overall quality in the presence of poorly captioned training data.
@recris Thanks, but you still recommend vision_cond_dropout = 0.5 + vision_cond_ratio = 0.2, and then we can use the trained LoRA without Flux Redux, right?
Please read the notes fully before posting - these are not "recommendations"; this has hardly been tested in a comprehensive way and it is probably not ready for widespread use. That said, you can probably start with
Interesting concept! If you consider this a downside and want a stand-alone LoRA as output, you could try to (gradually?) remove the image conditioning from the model prediction, but still expect the model to have learned to make the same prediction as if it were still conditioned. Similar to this concept, but using not the plain base model as teacher, but the base model conditioned by Redux.
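A rough sketch of that teacher/student idea (my reading of the suggestion above, not code from this PR; the model call signature and the LoRA-toggling helper are placeholders):

```python
import torch
import torch.nn.functional as F

def distillation_step(model, latents, timesteps, text_cond, redux_cond, set_lora_scale):
    # Teacher: the frozen base model (LoRA disabled), conditioned by the Redux embedding.
    with torch.no_grad():
        set_lora_scale(0.0)  # hypothetical helper that disables the LoRA weights
        teacher_pred = model(latents, timesteps, cond=redux_cond)
    # Student: the LoRA-augmented model, conditioned only by the text caption.
    set_lora_scale(1.0)
    student_pred = model(latents, timesteps, cond=text_cond)
    # Train the LoRA to reproduce, from text alone, what the Redux-conditioned teacher predicts.
    return F.mse_loss(student_pred, teacher_pred)
```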
This is what the
This PR adds support for training Flux.1 LoRA using conditioning from the Redux image encoder.
Instead of relying on text captions to condition the model, why not use the image itself to provide a "perfect" caption?
Redux+SigLIP provide a T5-compatible embedding that generates images very close to the target. I thought this could be used instead of relying on text descriptions that may or may not match the concepts as understood by the base model.
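As an illustration of the idea (not the code in this PR): the image is run through the SigLIP vision encoder and then through the small Redux projection to obtain tokens in the T5 embedding space. The `ReduxProjection` layer sizes and the way its weights map onto `redux_model_path` are assumptions for the sketch.

```python
import torch
from transformers import SiglipImageProcessor, SiglipVisionModel

class ReduxProjection(torch.nn.Module):
    """Small MLP mapping SigLIP image tokens into T5 embedding space (layer sizes assumed)."""
    def __init__(self, siglip_dim: int = 1152, t5_dim: int = 4096):
        super().__init__()
        self.up = torch.nn.Linear(siglip_dim, t5_dim * 3)
        self.down = torch.nn.Linear(t5_dim * 3, t5_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.down(torch.nn.functional.silu(self.up(x)))

def encode_image_as_caption(image, device: str = "cuda") -> torch.Tensor:
    """Produce a 'perfect caption': a T5-compatible token sequence derived from the image."""
    processor = SiglipImageProcessor.from_pretrained("google/siglip-so400m-patch14-384")
    encoder = SiglipVisionModel.from_pretrained("google/siglip-so400m-patch14-384").to(device)
    redux = ReduxProjection().to(device)  # in practice, weights would come from redux_model_path
    pixels = processor(images=image, return_tensors="pt").pixel_values.to(device)
    with torch.no_grad():
        image_tokens = encoder(pixel_values=pixels).last_hidden_state  # (1, 729, 1152)
        return redux(image_tokens)  # (1, 729, 4096), usable in place of T5 caption embeddings
```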
To use this I've added the following new parameters:
- `redux_model_path`: Safetensors file for the Redux model (downloadable from here); the SigLIP image encoder used with it is `google/siglip-so400m-patch14-384`.
- `vision_cond_ratio`: controls an interpolation between the text-based embedding and the Redux embedding. `0.0` is pure text conditioning (same as before), `1.0` is pure Redux vision conditioning. The effect is similar to the "Conditioning Average" node in ComfyUI.
- `vision_cond_dropout`: probability of drop-out for the vision conditioning. During a training step this will randomly choose to ignore the vision conditioning and use the text conditioning instead. For example, `0.2` means it will use Redux 80% of the time and regular captions the other 20% (see the sketch after this list).
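A minimal sketch of how the two parameters could combine per training step, assuming the text and Redux embeddings have already been brought to the same sequence length (the actual handling in the PR may differ):

```python
import torch

def build_conditioning(t5_embeds: torch.Tensor,
                       redux_embeds: torch.Tensor,
                       vision_cond_ratio: float,
                       vision_cond_dropout: float) -> torch.Tensor:
    # With probability `vision_cond_dropout`, ignore the vision conditioning for this
    # step and fall back to the plain caption embedding.
    if torch.rand(()).item() < vision_cond_dropout:
        return t5_embeds
    # Otherwise blend the two: 0.0 = pure text conditioning, 1.0 = pure Redux conditioning,
    # analogous to ComfyUI's "Conditioning Average" node.
    return (1.0 - vision_cond_ratio) * t5_embeds + vision_cond_ratio * redux_embeds
```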
Experimental Notes:
- `vision_cond_ratio`: I usually have to set it to `0.2` or lower before I start seeing meaningful differences in what gets learned.
- `vision_cond_dropout = 0.5` seems to work well enough; I noticed an improvement in the end result, with fewer "broken" images (bad anatomy, etc.) during inference.
- `vision_cond_ratio` feels very crude and unsound to me; maybe there is a better approach?

I don't expect this PR to be merged anytime soon; I had to make some sub-optimal code changes to make this work. I am just posting it for visibility, so that people can play with it and give feedback.