Add option to use Scheduled Huber Loss in all training pipelines to improve resilience to data corruption #1228
Conversation
Thank you for submitting this pull request! Although I don't fully understand the contents of the discussion in #294, it seems that a very interesting discussion is taking place. This PR appears to have the objective of learning while balancing the large features and fine details of the image. If I understand correctly, this PR has a similar objective to the method proposed by cheald in #294, but takes a different approach. Is this correct? |
These are quite different approaches. @cheald makes adjustments to latents, then feeds them into the standard loss. And as cheald is using mse_loss in the end, I believe these approaches can exist in synergy! |
Thank you for the clarification! I will merge this soon. |
This is very exciting! I've been experimenting with manipulating the mean/stddev of slices of the noise prior to forward noising, and have found that I can directly manipulate the level of detail trained with certain permutations of noise. However, it still suffers from outliers early in training having too large an impact, which I've had to manage through very careful tuning of noise. I'm very excited to try adding this into my experiments - if it performs how I imagine, then it could solve a number of problems which could lead to faster and more controlled training. |
Tried a training run; the results were impressive. |
@gesen2egee Thank you for giving a test run! Would you mind sharing some of them? |
Here's my quick training experiment. I just ran each for 9 epochs. Overall, results look very promising.

Ground truth: [image]

General settings: adamw8bit, CosineAnnealingWarmRestarts w/ restart every 2 epochs, unet_lr 1e-4, and I am using masks and masked loss here. Each image pair is l2 on the left, huber_scheduled on the right. Marginal improvement, I think!

Here's where it gets fun: I'm experimenting with recentering and rescaling each noise channel individually, and then also experimenting with shifting and rescaling all noise channels together, as well as independently. This is conceptually similar to an expansion of the idea encoded in the [link].

Dependent scale, dependent shift: [image]

Independent channel scaling, dependent channel shift: [image]

Dependent scaling, independent shift: [image]

and finally, independent scaling, independent shift: [image]

My general observation here is that the huber_scheduled loss definitely does improve detail retention (look at the brick in the background!), but isn't quite learning as well. However, I suspect that is likely due to nothing more than the lower loss values damping the rate of change in the learned weights. It might be that using huber_loss would permit a larger learning rate relative to l2 loss, which would be great if it can retain the same improvement in details. |
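As an aside, here is a minimal sketch of the kind of per-channel noise manipulation described in the comment above; the function and its shift/scale parameters are hypothetical illustrations, not part of this PR:

```python
import torch

def recenter_rescale_noise(noise: torch.Tensor, shift: float = 0.0, scale: float = 1.0,
                           per_channel: bool = True) -> torch.Tensor:
    """Recenter/rescale latent noise before forward diffusion (illustrative only)."""
    # noise has shape (B, C, H, W); reduce over spatial dims only for the
    # "independent" (per-channel) variant, or over channels as well for the
    # "dependent" (all-channels-together) variant
    dims = (2, 3) if per_channel else (1, 2, 3)
    mean = noise.mean(dim=dims, keepdim=True)
    std = noise.std(dim=dims, keepdim=True)
    normalized = (noise - mean) / (std + 1e-6)   # zero mean, unit std
    return normalized * scale + shift            # then apply the chosen shift/scale

# e.g. pushing the mean of only the first latent channel, as mentioned later in the thread:
# noise[:, 0:1] = noise[:, 0:1] + 0.1
```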
Here's another example. I'm working on extending the dynamic range of my trained samples, which I'm doing by pushing the mean of the first channel in the noise. I also cranked my LR up to 4e-4 to see if this would give the huber noise an edge, but it appears to actually not be working that way. Same seed and training params for all 4 images, the left images are a "night time photo" and the right images are a "bright daytime photo". Upper is l2, lower is huber. The huber examples have distorted less, but are certainly less like the ground truth overall. Is there guidance on how to set the delta parameter (huber_c?) to achieve a middle ground?
Loss curves for each: |
Excited to try this out. Any idea how this interacts with Min-SNR-gamma, which weights the loss based on timestep? |
@cheald @drhead we chose the exponential schedule because of its simplicity, to test the claim that the delta should decrease with the (forward) diffusion timestep. It may well be suboptimal, and the idea about SNR-based scheduling sounds very reasonable! Adding some selectable huber_c schedules would, I think, be a great next addition
|
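To make the idea of selectable huber_c schedules concrete, here is a rough sketch of what exponential, SNR-based, and constant schedules could look like; the formulas below are illustrative assumptions, not the code merged in this PR:

```python
import math
import torch

def scheduled_huber_c(schedule: str, huber_c: float, timestep: int,
                      num_train_timesteps: int, alphas_cumprod: torch.Tensor) -> float:
    """Return the pseudo-Huber delta for a given forward-diffusion timestep (sketch).

    huber_c is treated as the smallest delta, reached at the highest-noise timestep;
    larger deltas push the loss toward L2, smaller ones toward L1.
    """
    if schedule == "exponential":
        # decays from 1 at t=0 down to huber_c at t=num_train_timesteps
        alpha = -math.log(huber_c) / num_train_timesteps
        return math.exp(-alpha * timestep)
    elif schedule == "snr":
        # tie the delta to the noise level sigma of the sampled timestep,
        # padding with huber_c so it never reaches zero at pure noise
        sigma = (((1.0 - alphas_cumprod[timestep]) / alphas_cumprod[timestep]) ** 0.5).item()
        return (1.0 - huber_c) / (1.0 + sigma) ** 2 + huber_c
    elif schedule == "constant":
        return huber_c
    raise ValueError(f"unknown schedule: {schedule}")
```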
There is another alternative for the Huber function: the pseudo-Huber loss divided by delta, as opposed to the "math" version we use. The difference is that the math version converges to zero when delta goes to zero, while the divided version transitions between L1 and L2. While the divided version gave us worse results for resilience, it may be better perceptually (we didn't analyze the perceptual part much, unfortunately). The former version, suggested by OpenAI, is used in Diffusers for LCM training. Here's the gif comparison: [outfile.mp4] I think it may be worth adding it here too and making some more experiments.

Edit: fixed the parabola's coefficient to suit the losses (1/2 a^2).

The problem may lie at the low end of the delta values: when delta is zero (snr = 0), the OpenAI Huber function lies above L2, so the model will actively learn there and thus be vulnerable to outliers, while our math variant will not really take into account what is happening at pure noise. That may explain why the OpenAI loss function fails so badly in our resilience experiments (huber_scheduled_old). A good compromise may be adding a minimal delta value to pad it at zero SNR, which is essentially what we did in our experiments.

Another edit: the pseudo-Huber loss computation needs a *2 multiplier to better correspond to MSE's coefficients: from the Taylor expansion, the pseudo-Huber loss's asymptotic is ~1/2 * a^2, which leads to a discrepancy because MSE's a^2 parabola is far away from the formed curve at a=1. |
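For readers following along, the two variants being compared can be written out roughly as follows (a sketch under the definitions given above, with `a = pred - target` and `d` the per-timestep delta; not necessarily the exact code in this repository):

```python
import torch

def pseudo_huber_divided(pred: torch.Tensor, target: torch.Tensor, d: float) -> torch.Tensor:
    # OpenAI-style form used for LCM training in Diffusers:
    #   sqrt(a^2 + d^2) - d  ->  |a| (L1) as d -> 0, and ~ a^2 / (2d) for |a| << d
    a2 = (pred - target) ** 2
    return torch.sqrt(a2 + d**2) - d

def pseudo_huber_math(pred: torch.Tensor, target: torch.Tensor, d: float) -> torch.Tensor:
    # "math" form, i.e. the divided form multiplied back by d; it goes to zero as d -> 0,
    # and the extra *2 restores MSE's a^2 behaviour for small errors (see the Taylor
    # expansion note below), since d * (sqrt(a^2 + d^2) - d) ~ a^2 / 2 for |a| << d
    a2 = (pred - target) ** 2
    return 2.0 * d * (torch.sqrt(a2 + d**2) - d)
```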
@drhead the second main guy in my team agreed that |
The Taylor expansion of sqrt near zero gives 1/2 a^2, which differs from the a^2 of the standard MSE loss. This change scales them better against one another.
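Spelled out (added here for clarity, following the comment above):

$$
\delta\left(\sqrt{a^{2}+\delta^{2}}-\delta\right)
=\delta^{2}\left(\sqrt{1+\left(\tfrac{a}{\delta}\right)^{2}}-1\right)
\approx\delta^{2}\cdot\tfrac{1}{2}\left(\tfrac{a}{\delta}\right)^{2}
=\tfrac{a^{2}}{2}
\qquad\text{for } |a|\ll\delta,
$$

so multiplying the pseudo-Huber loss by 2 recovers the $a^{2}$ behaviour of MSE in the small-error regime.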
Good suggestion! Btw I'm making the experiments with different losses/schedules right now. Will post it here soon |
@drhead (on the David Revoy dataset https://drive.google.com/drive/folders/1Z4gVNugFK2RXQEO2yiohFrIbhP00tIOo?usp=drive_link) Subjectively, I think I'm torn between SNR Smooth L1 and SNR Huber (they both have strong and weak sides). Constant L2's parameter is 1 because it's the final delta value. Robustness is another thing, and it needs its own tests. All the generation samples from the schedule/loss-type experiments above (16 per each): https://drive.google.com/drive/folders/1DnU-o_TT9JH8l1JS_WuQfeZ-k6uafCJo?usp=drive_link |
Thanks again for the PR and great discussion! I have created a brief description to add to the release notes to explain the new features that this PR offers. Any comments would be appreciated.

Scheduled Huber Loss has been introduced to each training script. This is a method to improve robustness against outliers or anomalies (data corruption) in the training data.

With the traditional MSE (L2) loss function, the impact of outliers could be significant, potentially leading to a degradation in the quality of generated images. On the other hand, while the Huber loss function can suppress the influence of outliers, it tends to compromise the reproduction of fine details in images.

To address this, the proposed method employs a clever application of the Huber loss function. By scheduling the use of Huber loss in the early stages of training (when noise is high) and MSE in the later stages, it strikes a balance between outlier robustness and fine detail reproduction.

Experimental results have confirmed that this method achieves higher accuracy on data containing outliers compared to pure Huber loss or MSE. The increase in computational cost is minimal.

The newly added arguments loss_type, huber_schedule, and huber_c allow for the selection of the loss function type (Huber, smooth L1, MSE), the scheduling method (exponential, constant, SNR), and Huber's parameter. This enables optimization based on the characteristics of the dataset.

See #1228 for details. |
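As a usage note (the flag values here are only an example, following the description above): the new options are passed to the training scripts like any other argument, e.g. by appending `--loss_type smooth_l1 --huber_schedule snr --huber_c 0.1` to the usual training command line.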
I think it will be better to use 'snr' by default, as it has been shown to give better quality |
Is there any information about when the different scheduling methods and loss function types should be used? For example, one for a large dataset and another for a small dataset, or one for training on photos and another for training on anime artwork? |
Thank you! I have updated it. |
sd-scripts/library/train_util.py Line 3249 in dfa3079
Edit: Oops, I referred to the dev branch; it's corrected on main |
Sorry, I updated the main branch directly 🙇‍♂️ Thank you for your confirmation! |
I did want to follow up on this since some further experiments led me to notice some tradeoffs with the outcomes of training on this loss function. While the benefits to image coherency still seem very clear, it seems that on my training setup, one of the side effects of this loss function is a loss of control over image contrast -- manifesting as things like "dark" in a prompt being less powerful on zero terminal SNR models. Incidentally, I have a very good idea of what the root cause might be, and it makes perfect sense given what this loss function does.

TL;DR of the above post is that the VAE used by SD 1.5 (and SD 2.1 and Midjourney and the Pixart series of models and DALL-E 3 and EDM2 and probably many others) tends to produce a high-magnitude artifact, especially on large images and desaturated images (noting that desaturated in latent space is closer to sepia). My theory is that the purpose of this artifact is to abuse the normalization layers of the model: the artifact desaturates the image by having a high magnitude that causes other values to decrease significantly as they pass through the model's normalization layers. @madebyollin made some very helpful graphics demonstrating some of these effects: https://gist.github.com/madebyollin/ff6aeadf27b2edbc51d05d5f97a595d9

With this in mind, it does make perfect sense why this would happen, especially on my training setup. I'm training a high-resolution model (actually with several different resolution groups) with v-prediction. With v-prediction, higher noise timesteps are closer to x_0 prediction, and the terminal timestep is outright x_0 prediction. This means that the prediction target contains something that is both high magnitude and extremely important for image reconstruction. Errors are generally likely to be proportional to the magnitude of the target, which is not good when our loss function is more relaxed on large errors at this point. The loss function can't take into account the fact that this will cause a massive error in pixel space -- for MSE loss, this would be less of a problem, since larger errors would be pulled in much harder. While the loss objective does effectively ignore large outliers from the data in this way, the VAE artifact is sadly not separable from this. It is possible that it could be mitigated by finding some metric for a latent that would indicate consistency in its saturation level, but the solution that makes more sense is to just move to a model that doesn't use the SD1.5 VAE.

As a final note, I have found the effects of this to be very similar to the effects of FreeU, both in its improvements of image coherence and in having a similar issue with weakening of saturation (and they do combine together well; I was using it with my samples above). FreeU's underlying theory is that high-frequency information in the skip connections converges too fast in the decoder, and therefore the authors choose to modulate backbone features and skip connections to place more emphasis on low-frequency feature maps. That might be related to why this improves coherency -- mitigating the impact of outliers allows low-frequency features to be better represented. I'm sure there's a lot more research to be done on optimizing hierarchical denoising with things like this.
All of that being said, I do think that these issues are not likely to be a problem for epsilon prediction since the VAE artifact is not directly part of the prediction target, and it is not likely to be much of an issue on SDXL or for several other models using a different VAE or for pixel-space models. I'll be trying it out on SD3 when I can, hopefully their VAE doesn't have the same issues. |
Can I ask which loss function I'm supposed to use to actually enable scheduled Huber loss in kohya? There is no explanation. Do I need to set smooth_l1, or huber? What are the settings I'm supposed to use? |
There are some details explained in the main README: about-scheduled-huber-loss |
Hey, I just tried out this new Huber loss parameter... and it's amazing. Thanks for all the hard work by the people here! |
What’s the practical effect of increasing or decreasing the value of huber_c? |
My experience is total destruction of the training almost instantly. Even a 0.01 in either direction resulted in total devastation/noise. |
As heavily discussed in #294, the presence of even a small number of outliers can heavily distort the image output.
While alternative losses such as Huber loss have been proposed, the problem was the loss of fine details when training on images. After researching the mathematical formulations of diffusion models, we came to the conclusion that it's possible to schedule the Huber loss, making it smoothly transition from a Huber loss with L1 asymptotics on the first reverse-diffusion timesteps (when the image is only beginning to form and is most vulnerable to concept outliers) to the standard L2 MSE loss, for which diffusion models are originally formulated, on the last reverse-diffusion timesteps, when the fine details of the image are forming.
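One way to write the idea down (a sketch consistent with the discussion in this thread rather than a verbatim statement of the implementation): with $a$ the prediction error and $\delta(t)$ a delta that decreases with the forward timestep $t$,

$$
L_{\delta(t)}(a) = 2\,\delta(t)\left(\sqrt{a^{2}+\delta(t)^{2}}-\delta(t)\right)
\;\approx\;
\begin{cases}
2\,\delta(t)\,|a|, & |a|\gg\delta(t)\quad\text{(L1-like at high noise, robust to outliers)}\\
a^{2}, & |a|\ll\delta(t)\quad\text{(MSE-like at low noise, preserving fine detail)}
\end{cases}
$$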
Our method shows greater stability and resilience (measured as the similarity to clean pictures on corrupted runs minus the similarity to clean pictures on clean runs) than both pure Huber and L2 losses.
The experiments confirm that this schedule indeed greatly improves resilience.
Our paper: https://arxiv.org/abs/2403.16728
Diffusers discussion huggingface/diffusers#7488
Most importantly, this approach adds virtually no computational cost over the standard L2 computation (and requires only minimal code changes to the training scripts).
cc @cheald @kohya-ss