I'm not particularly knowledgeable on this, but I read here that it's relatively bad to train with a guidance scale: during distillation the model was trained to expect the teacher model's CFG'd noise predictions, so fine-tuning with a guidance scale destroys that functionality and now you have to use CFG at inference. (I could very much be subtly or majorly wrong in my understanding here.) Training against `guidance_scale=1`, while probably outside the range used during distillation, at least has the effect of removing CFG, and (I think) the teacher model, from the equation, so it seemingly makes more semantic sense for our training setup without Flux Pro. I do find it interesting that the author of that Medium article claims that training against an "undistilled-via-training" Flux dev produces better results for LoRAs. I wonder if anyone could corroborate; he does provide the model he trained against.
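To make the "embedded guidance" point concrete, here is roughly how `guidance_scale` reaches Flux dev at inference via diffusers' `FluxPipeline` (model id and values are just illustrative; this is my reading of the pipeline, not anything authoritative):

```python
# For the distilled dev model, guidance_scale is NOT classifier-free guidance:
# there is no second, unconditional forward pass. The value is embedded as a
# conditioning input the model learned to respond to during distillation,
# imitating the teacher's CFG'd predictions at that scale.
import torch
from diffusers import FluxPipeline

pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev", torch_dtype=torch.bfloat16
).to("cuda")

image = pipe(
    "a photo of a cat",
    guidance_scale=3.5,       # embedded (distilled) guidance, not true CFG
    num_inference_steps=28,
).images[0]
image.save("cat.png")
```

So "training with a guidance scale" here means feeding some embedded guidance value during fine-tuning, which is presumably how it can drift the model away from its distilled behavior.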
I suspect that Flux dev probably underwent a process similar to the distillation in latent consistency models, so it should have been trained within a certain guidance scale range. This is likely the original intention behind SimpleTuner providing the `random-range` mode. However, when using the `constant` mode, a specific `flux_guidance_value` needs to be chosen. The author suggests in the explanation: "Using a value of 1.0 seems to preserve the CFG distillation for the Dev model." Since Flux dev is a distilled version, it actually only has conditional generation capabilities within a certain range, so what does "preserve the CFG distillation" mean here? Maybe I got something wrong with the guidance distillation stuff. Anyone care to ELI5?
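For what it's worth, here is a rough sketch of what I understand the two modes to mean for a training batch. This is my guess at the semantics, not SimpleTuner's actual code, and the option names (`flux_guidance_value`, the min/max bounds) are taken from this discussion or assumed:

```python
# Sketch (assumptions, not SimpleTuner source): how per-sample guidance
# values might be produced for the transformer's guidance embedding.
import torch

def sample_guidance(mode: str, batch_size: int,
                    flux_guidance_value: float = 1.0,
                    flux_guidance_min: float = 1.0,
                    flux_guidance_max: float = 4.0) -> torch.Tensor:
    """Return the guidance values fed to the model for one training batch."""
    if mode == "constant":
        # Every sample trains at one fixed embedded-guidance value;
        # 1.0 is the value the docs say preserves the distillation.
        return torch.full((batch_size,), flux_guidance_value)
    elif mode == "random-range":
        # Each sample draws a value from the range the distilled model
        # was presumably trained on, covering more of that range.
        return torch.empty(batch_size).uniform_(flux_guidance_min,
                                                flux_guidance_max)
    raise ValueError(f"unknown mode: {mode}")

guidance = sample_guidance("constant", batch_size=4, flux_guidance_value=1.0)
# model_pred = transformer(..., guidance=guidance, ...)
```

Under that reading, `constant` with `flux_guidance_value=1.0` pins training to the "no guidance" embedding, while `random-range` spreads training across the guidance range the distilled model presumably saw.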