[Feature request] Add grad norm monitoring/logging #1407
Comments
@gau-nernst thanks for this suggestion. Actually we had a similar discussion a few months back in #897, maybe I was too much of a stickler about it at that time 😅. In addition to the comments I left there, I agree with your point on not slowing down training. I think your proposal to only calculate at the logging step is reasonable, but in practice I think many of our configs set […]. One alternative to using […]. Otherwise there is the […]. Personally I am open to either of these approaches; I would be interested to hear your thoughts on the pros and cons here as well. Agree that doing this properly for FSDP will need a bit more thought (I assume we would want the norm across all ranks and not per-rank? Also I believe if we use […]). But fine to punt it for now.
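For reference, a minimal sketch of the "only compute at the logging step" idea; `step`, `log_every_n_steps`, and `metrics` below are illustrative placeholders rather than torchtune's actual config keys or logger API:

```python
import torch

def total_grad_norm(model: torch.nn.Module) -> torch.Tensor:
    """L2 norm over all parameter gradients (single-device case)."""
    grads = [p.grad for p in model.parameters() if p.grad is not None]
    # The norm of the per-parameter norms equals the norm of the flattened gradient vector.
    return torch.linalg.vector_norm(
        torch.stack([torch.linalg.vector_norm(g) for g in grads])
    )

# Inside the training loop, only pay the cost on logging steps:
# if step % log_every_n_steps == 0:
#     metrics["grad_norm"] = total_grad_norm(model).item()
```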
From the discussion in #897, I agree with you that we should not enable gradient clipping by default. It should be set explicitly by the user (burned many times when default hparams are different across HF models 🌚). Default […]. In terms of benchmarks, we can also check how much calculating the grad norm every step is going to cost us. Maybe it's not that much? 🤔
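A rough way to answer the cost question, as a sketch; `model` and `dummy_input` are placeholders, and absolute numbers depend on model size and device, so only the relative difference between the two settings is meaningful:

```python
import time
import torch

def bench(model: torch.nn.Module, dummy_input: torch.Tensor,
          steps: int = 100, compute_grad_norm: bool = True) -> float:
    """Time a forward/backward loop with or without per-step grad-norm computation."""
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(steps):
        loss = model(dummy_input).float().sum()
        loss.backward()
        if compute_grad_norm:
            norms = [torch.linalg.vector_norm(p.grad)
                     for p in model.parameters() if p.grad is not None]
            _ = torch.linalg.vector_norm(torch.stack(norms)).item()
        model.zero_grad(set_to_none=True)
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    return time.perf_counter() - start

# Compare bench(..., compute_grad_norm=True) against compute_grad_norm=False.
```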
I think this is nice on paper but a nightmare to design and maintain 😢. Realistically, I'm not sure if there are that many other useful metrics to log (apart from task-specific things, which should be hard-coded in their own respective recipes already). For now, I think it is reasonable to have […].
So now the question is what to do when […].
@gau-nernst I think your proposal makes sense.
In the absence of any data, I would lean towards the second option just to be safe. However, if we do find that the perf impact of logging grad norm is negligible, the first option would be fine too (and simpler). For benchmarking purposes we may want to look at distributed too, since inevitably we will want to add it at some point and the `clip_grad_norm` call is likely to be more expensive in that case.
Hey @gau-nernst, are you working on this one already? If not, we may have someone who can help out here.
I'm not working on this. You can assign this to someone else.
This is not implemented for distributed recipes yet, right? So maybe we can keep this issue open. I was adding this feature to my codebase with FSDP2 and thought it might be useful for torchtune too.
It's pretty straightforward to support this feature with FSDP2. Another separate issue: I think all metrics logged by torchtune are "local" metrics, e.g. the logged loss value is the loss on rank 0 only. To get an "accurate" loss value, we would need to do an all-reduce. Might not be so important...
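For the distributed case, here is a sketch of one way to compute a global grad norm (and, with the same pattern, a global loss). It assumes each rank's gradients cover a disjoint shard of the parameters, as with FSDP2; the `to_local()` handling of DTensor grads is an assumption for illustration, not torchtune's actual implementation:

```python
import torch
import torch.distributed as dist

def global_grad_norm(model: torch.nn.Module) -> torch.Tensor:
    """Total L2 grad norm across ranks, assuming each rank holds a disjoint grad shard."""
    device = next(model.parameters()).device
    local_sq = torch.zeros((), device=device)
    for p in model.parameters():
        if p.grad is None:
            continue
        g = p.grad
        # With FSDP2 the grad may be a DTensor; reduce over the local shard only.
        if hasattr(g, "to_local"):
            g = g.to_local()
        local_sq += torch.linalg.vector_norm(g).pow(2)
    dist.all_reduce(local_sq, op=dist.ReduceOp.SUM)
    return local_sq.sqrt()

# The same pattern makes the logged loss a global metric instead of rank-0 only:
# loss_for_logging = loss.detach().clone()
# dist.all_reduce(loss_for_logging, op=dist.ReduceOp.SUM)
# loss_for_logging /= dist.get_world_size()
```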
Personally, I have found that monitoring grad norm is useful for understanding the stability of training. It is also useful for setting an appropriate clipping value (though I don't think torchtune supports grad norm clipping atm?).
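As a side note on sharing the cost between clipping and logging: PyTorch's `torch.nn.utils.clip_grad_norm_` returns the total (pre-clip) norm, so one call can cover both. A log-only sketch, with `model` as a placeholder:

```python
import torch

def log_only_grad_norm(model: torch.nn.Module) -> float:
    # clip_grad_norm_ returns the total pre-clip norm; with max_norm=inf the clip
    # coefficient clamps to 1.0, so gradients are left unchanged and this is
    # effectively "compute and log only". Swap in a real threshold once typical
    # norm values are known.
    total_norm = torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=float("inf"))
    return total_norm.item()
```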
Some considerations: […]