Dango/timesteps fix #1768
Conversation
Dango233 commented Nov 7, 2024
- Remove diffusers dependency in ts & sigma calc
- Support Shift Setting
- Support timesteps range setting
- Add uniform distribution
- Default to Uniform distribution and shift 1
With the default settings, training should pick up patterns/details much quicker and reduce overfitting on early/mid timesteps.
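As a rough illustration (a minimal sketch, not the exact PR code), uniform timestep sampling with a timestep range and a shift applied to the flow-matching sigmas could look like the following; the parameter names `t_min`, `t_max`, `shift`, and `num_train_timesteps` are assumptions for the example:

```python
import torch

def sample_uniform_timesteps(batch_size, num_train_timesteps=1000,
                             t_min=0, t_max=1000, shift=1.0,
                             device="cpu", dtype=torch.float32):
    # Draw u ~ U(0, 1) and map it onto the requested timestep range.
    u = torch.rand(batch_size, device=device)
    indices = (u * (t_max - t_min) + t_min).long()
    timesteps = indices.to(device=device, dtype=dtype)

    # Flow-matching sigma in [0, 1], then the shift mapping
    # sigma' = shift * sigma / (1 + (shift - 1) * sigma).
    sigmas = timesteps / num_train_timesteps
    sigmas = shift * sigmas / (1.0 + (shift - 1.0) * sigmas)
    return timesteps, sigmas
```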
Thank you for this!
library/sd3_train_utils.py
Outdated
indices = (u * (t_max-t_min) + t_min).long()
timesteps = indices.to(device=device, dtype=dtype)

# sigmas according to dlowmatching
flowmatching*
It seems to be fixed in bafd10d :)
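For context, here is a sketch of how sigmas computed this way are typically consumed in flow matching (an assumption based on the rectified-flow formulation used by SD3-style models, not necessarily the repository's exact code):

```python
import torch

def flow_matching_noisy_sample(latents, noise, sigmas):
    # sigmas: shape (batch,); broadcast over the latent dimensions.
    sigmas = sigmas.view(-1, 1, 1, 1).to(latents.dtype)
    # Rectified-flow interpolation between data and noise.
    noisy = (1.0 - sigmas) * latents + sigmas * noise
    # Velocity target commonly used with this formulation.
    target = noise - latents
    return noisy, target
```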
@Dango233 you guys are seeing better results than the normal flux schedule of sigmoid sampling?
It's dataset- and training-purpose dependent. Sigmoid/logit-normal works really well for the initial construction of denoising capabilities, but it isn't necessarily best for small-scale finetuning; the uniform distribution is more universal for downstream use cases.
If a training run focuses heavily on learning overall structures and can ignore details (details/objects/patterns already in the model's base weights), logit-normal (with a shift > 1) still performs great;
but for datasets that need a lot of attention to details (like anime characters), the uniform distribution should work better.
Some extreme cases even need a shift < 1 to emphasize details (like if you are training a detailed pattern).
So it really depends on what you want to achieve.
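To make the trade-off concrete, here is a small sketch contrasting the two sampling strategies discussed above; the parameter names (`mode`, `logit_mean`, `logit_std`) are hypothetical and not necessarily the ones used in this repository:

```python
import torch

def sample_sigmas(batch_size, mode="uniform", shift=1.0,
                  logit_mean=0.0, logit_std=1.0):
    if mode == "uniform":
        # Every noise level gets equal attention; tends to help when the
        # dataset demands a lot of fine detail.
        sigmas = torch.rand(batch_size)
    else:
        # Logit-normal: sigmoid of a Gaussian, concentrating samples around
        # the mid timesteps, which favors learning overall structure.
        sigmas = torch.sigmoid(torch.randn(batch_size) * logit_std + logit_mean)
    # shift > 1 pushes sampling toward higher noise levels (structure);
    # shift < 1 pushes it toward lower noise levels (details).
    return shift * sigmas / (1.0 + (shift - 1.0) * sigmas)
```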
Does this scale with batch size, such that around 2048 we really want weighted sampling, or is uniform alright then as well? The explanation about early structure during pretraining does make sense.
I would have to say that's task-dependent, but in general, if your dataset is very large and the samples share similar details, logit_normal still makes sense sometimes.
I'm having problems setting the LR for the TE and the unet independently; it trains the TE with the same LR I set for the unet. This is in my config file: "learning_rate": 1e-06. Additional parameters: --fused_backward_pass --use_t5xxl_cache_only --train_text_encoder. I think the problem started after the last update, but I'm not sure.
This PR is not related to the learning rates. I will check it soon.