
[RFC] Gradient clipping hooks in the LightningModule #6346

Closed
carmocca opened this issue Mar 4, 2021 · 11 comments · Fixed by #9584
Labels: design (Includes a design discussion), feature (Is an improvement or enhancement), help wanted (Open to be worked on), refactor

Comments

@carmocca
Contributor

carmocca commented Mar 4, 2021

🚀 Feature

Add clipping hooks to the LightningModule

Motivation

It's currently very difficult to customize the gradient clipping logic.

Pitch

class LightningModule:
    def clip_gradients(self, optimizer, optimizer_idx):
        ...

The default implementation would be the same as we currently provide, where the trainer's clipping flags are used.

Maybe those would be deprecated in favor of LightningModule properties.
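
A minimal sketch of such a default, assuming the hook simply forwards the trainer's flags to the accelerator's existing clipping logic (the attribute and argument names below are illustrative, not a final API):

class LightningModule:
    def clip_gradients(self, optimizer, optimizer_idx):
        # default: keep today's behaviour by forwarding the trainer's clipping flags
        self.trainer.accelerator.clip_gradients(
            optimizer,
            clip_val=self.trainer.gradient_clip_val,
            gradient_clip_algorithm=self.trainer.gradient_clip_algorithm,
        )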

class LightningOptimizer:
    def step(self, closure=None):
        # fall back to a no-op closure when none is given
        if closure is None:
            closure = do_nothing_closure

        def wrapper_closure():
            closure()
            # clip right before the real optimizer step
            self._trainer.call_hook("clip_gradients", self.optimizer)

        self.optimizer.step(closure=wrapper_closure)
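
Here do_nothing_closure is assumed to be a trivial helper along these lines:

def do_nothing_closure():
    # placeholder closure for optimizers that are stepped without one
    return None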

We need to evaluate the limitations, since clipping is currently tied to the plugins.

Additional context

This would fix #5096, #6123 (comment), #5671, and #5982, and would allow new clipping techniques to be implemented easily without having to merge them into Lightning.
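
For example, with the proposed hook a user could clip each optimizer's gradients differently without touching Lightning internals; a hypothetical override using the signature from the pitch above:

from torch import nn
from pytorch_lightning import LightningModule


class LitGAN(LightningModule):
    def __init__(self):
        super().__init__()
        self.generator = nn.Linear(32, 32)
        self.discriminator = nn.Linear(32, 1)

    def clip_gradients(self, optimizer, optimizer_idx):
        if optimizer_idx == 0:
            # generator optimizer: clip by global norm
            nn.utils.clip_grad_norm_(self.generator.parameters(), max_norm=1.0)
        else:
            # discriminator optimizer: clip element-wise by value
            nn.utils.clip_grad_value_(self.discriminator.parameters(), clip_value=0.5)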

cc: @rohitgr7 who has been pushing for this for a while

@carmocca carmocca added feature Is an improvement or enhancement help wanted Open to be worked on refactor design Includes a design discussion labels Mar 4, 2021
@carmocca carmocca added this to the 1.3 milestone Mar 4, 2021
@tchaton
Contributor

tchaton commented Mar 4, 2021

@SeanNaren @awaelchli Any thoughts?

@rohitgr7
Contributor

rohitgr7 commented Mar 4, 2021

Thanks @carmocca for bringing this up ☺️. This has actually been on my TODO list, but I kept pushing it back because of the ongoing refactors. It would be great to have this, since it's part of optimization just like optimizer_step.

@edenlightning edenlightning removed this from the v1.3 milestone Apr 27, 2021
@edenlightning edenlightning added this to the v1.4 milestone May 9, 2021
@edenlightning
Contributor

@rohitgr7 friendly ping?

@edenlightning edenlightning modified the milestones: v1.4, v1.5 Jul 1, 2021
@ericharper
Contributor

Being able to customize gradient clipping is still really important to the NeMo team. I hope these hooks can be added soon.

@rohitgr7
Contributor

Yes, working on it. Will update soon.

@ananthsub
Contributor

ananthsub commented Sep 17, 2021

@rohitgr7 @carmocca Some of the issues we ought to resolve here:

Using self.trainer.accelerator as part of the default implementation in the LightningModule is too fragile: someone will override the whole method with a custom implementation, and it will fail when different trainer flags are used. At the same time, it's not good to have two different places where gradient clipping is implemented: once in the plugin and once in the LightningModule.

It also goes against the principle of #7315

@tchaton
Contributor

tchaton commented Sep 22, 2021

  • on_after_backward

Hey @ananthsub,

1) I believe we could pass the clip_val and gradient_clip_algorithm as extra arguments to the function. However, users might ignore both if they have a custom implementation.

2) DeepSpeed and FSDP don't support gradient clipping, as the gradients are concatenated. I don't think we should block the feature entirely for two plugins that aren't mature enough for this feature yet.

  1. The PrecisionPlugin should raise a MisconfigurationException if the clipping values are provided and the plugin doesn't support them. If the user overrides clip_gradients, we would just emit a warning (see the sketch after this list).

  2. on_after_backward is actually the wrong place to implement this: it won't work with accumulate_grad_batches since it is called after every backward call; clipping should happen in the on_before_optimizer_step hook instead. The fact that we ourselves are unsure where to apply gradient clipping makes me think users will be too. A dedicated hook seems simpler.

  3. Should we do loop -> lightning module -> trainer -> accelerator, or loop -> accelerator -> lightning module? I would prefer the first one, as it makes maintenance simpler and the code more minimal, but I don't have a strong opinion there.
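
A rough sketch of point 1, assuming a clip_gradients extension point on the precision plugin (the class name and signature here are illustrative, not the final API):

from pytorch_lightning.plugins import PrecisionPlugin
from pytorch_lightning.utilities.exceptions import MisconfigurationException


class NoClippingPrecisionPlugin(PrecisionPlugin):  # hypothetical plugin that cannot clip
    def clip_gradients(self, optimizer, clip_val, gradient_clip_algorithm=None):
        if clip_val is not None and clip_val > 0:
            # the Trainer clipping flags were set, but this plugin cannot honour them
            raise MisconfigurationException(
                "This plugin does not support gradient clipping via the Trainer flags."
            )
        # if the user overrides LightningModule.clip_gradients instead, Lightning
        # would only emit a warning rather than raising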

Best,
T.C

@carmocca
Contributor Author

+1 to passing the existing arguments to the hook so that they can be used for a default implementation

@rohitgr7
Contributor

+1 to passing the existing arguments to the hook so that they can be used for a default implementation

  1. So are you suggesting keeping the trainer arguments and adding arguments to the new clip_gradients hook? I'd prefer not to, since it might confuse users a little bit, like having the same hparams in two different locations, but that's just a personal opinion.

  2. Regarding the loop -> lightning module -> trainer -> accelerator part: I think we have the same workflow for optimizer_step, so this looks OK to me.

@rohitgr7
Contributor

OK, it seems like keeping the trainer arguments is a good option, and we can pass them to this hook directly as defaults.

For now I can think of two ways to design the API.

  1. Let users override clip_gradients directly and call super() when they want the built-in behaviour:

def clip_gradients(self, opt, opt_idx, clip_val, clip_algo):
    if opt_idx == 0:
        # use the internal implementation via super()
        super().clip_gradients(opt, clip_val, clip_algo)
    else:
        # implement your own clipping here
        ...

On this one, @awaelchli pointed out that setting the values in the hook's signature and then calling super() is not something we have anywhere else in Lightning, so it would be very unfamiliar to users.

  2. Or we can put the current implementation inside clip_gradients and make configure_gradient_clipping available to users:

def configure_gradient_clipping(self, optimizer, optimizer_idx, clip_val, clip_algo):
    if optimizer_idx == 0:
        # Lightning will handle the clipping here
        self.clip_gradients(optimizer, clip_val, clip_algo)
    else:
        # implement your own clipping here
        ...
but open to suggestions :)
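
With option 2, the default behaviour (when the user overrides nothing) could just pass the trainer values through, so the existing flags keep working; a minimal sketch using the names above:

class LightningModule:
    def configure_gradient_clipping(self, optimizer, optimizer_idx, clip_val, clip_algo):
        # default: clip with the values coming from the Trainer flags
        self.clip_gradients(optimizer, clip_val, clip_algo)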

@tchaton
Contributor

tchaton commented Sep 27, 2021

Hey @rohitgr7,

Interesting, it is a great point!

I like option 2.

@ananthsub @carmocca Any thoughts?

Best,
T.C
