
Trainer(precision=16) fails with optim.lr_scheduler.ReduceLROnPlateau #2078

Closed
naokishibuya opened this issue Jun 5, 2020 · 5 comments · Fixed by #2356
Labels
help wanted Open to be worked on

Comments

@naokishibuya

🐛 Bug

To Reproduce

Steps to reproduce the behavior:

  1. Create a pl.LightningModule that returns your optimizer along with an optim.lr_scheduler.ReduceLROnPlateau scheduler from configure_optimizers (a minimal sketch follows the traceback below)
  2. Create a pl.Trainer with precision=16
  3. Run your training (i.e., trainer.fit(model))
  4. See error
Traceback (most recent call last):
  File "main.py", line 65, in <module>
    main()
  File "main.py", line 61, in main
    trainer.fit(model)
  File "/workspace/pytorch-lightning/pytorch_lightning/trainer/trainer.py", line 889, in fit
    self.dp_train(model)
  File "/workspace/pytorch-lightning/pytorch_lightning/trainer/distrib_parts.py", line 223, in dp_train
    self.reinit_scheduler_properties(optimizers, self.lr_schedulers)
  File "/workspace/pytorch-lightning/pytorch_lightning/trainer/optimizers.py", line 122, in reinit_scheduler_properties
    scheduler.__class__.__mro__[idx].__init__(scheduler, optimizer)
UnboundLocalError: local variable 'idx' referenced before assignment
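
For reference, a minimal repro along the lines of steps 1–3 might look like the sketch below (the module, data, and trainer arguments besides precision=16 are illustrative, not taken from the original report):

import torch
from torch import nn, optim
from torch.utils.data import DataLoader, TensorDataset
import pytorch_lightning as pl


class PlateauModel(pl.LightningModule):  # hypothetical module name
    def __init__(self):
        super().__init__()
        self.layer = nn.Linear(32, 1)

    def forward(self, x):
        return self.layer(x)

    def training_step(self, batch, batch_idx):
        x, y = batch
        loss = nn.functional.mse_loss(self(x), y)
        return {"loss": loss}

    def train_dataloader(self):
        x, y = torch.randn(64, 32), torch.randn(64, 1)
        return DataLoader(TensorDataset(x, y), batch_size=8)

    def configure_optimizers(self):
        optimizer = optim.SGD(self.parameters(), lr=0.1)
        scheduler = optim.lr_scheduler.ReduceLROnPlateau(optimizer, patience=2)
        return [optimizer], [scheduler]


model = PlateauModel()
trainer = pl.Trainer(precision=16, gpus=1, max_epochs=1)  # gpus/max_epochs illustrative; 16-bit needs a GPU backend
trainer.fit(model)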

The error occurs in pytorch-lightning/pytorch_lightning/trainer/optimizers.py, line 122:

def reinit_scheduler_properties(self, optimizers: list, schedulers: list):
    # Reinitialize optimizer.step properties added by schedulers
    for scheduler in schedulers:
        for optimizer in optimizers:
            scheduler = scheduler['scheduler']
            # check that we dont mix users optimizers and schedulers
            if scheduler.optimizer == optimizer:
                # Find the mro belonging to the base lr scheduler class
                for i, mro in enumerate(scheduler.__class__.__mro__):
                    if mro == optim.lr_scheduler._LRScheduler:
                        idx = i
                scheduler.__class__.__mro__[idx].__init__(scheduler, optimizer)

The idx local variable is unassigned because optim.lr_scheduler.ReduceLROnPlateau is not a subclass of optim.lr_scheduler._LRScheduler.
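
This can be verified directly (a quick check, not part of the original report; behavior as of the PyTorch versions current when this issue was filed):

from torch import optim

# ReduceLROnPlateau does not inherit from the base _LRScheduler class,
# so the loop over __mro__ above never assigns idx.
print(issubclass(optim.lr_scheduler.ReduceLROnPlateau,
                 optim.lr_scheduler._LRScheduler))   # False
print(optim.lr_scheduler.ReduceLROnPlateau.__mro__)
# (<class 'torch.optim.lr_scheduler.ReduceLROnPlateau'>, <class 'object'>)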

I could work around the error by adding a specific check for optim.lr_scheduler.ReduceLROnPlateau but I'm not sure if this is a good solution.

def reinit_scheduler_properties(self, optimizers: list, schedulers: list):
    # Reinitialize optimizer.step properties added by schedulers
    for scheduler in schedulers:
        for optimizer in optimizers:
            scheduler = scheduler['scheduler']
            # check that we dont mix users optimizers and schedulers
            if scheduler.optimizer == optimizer:
                # Find the mro belonging to the base lr scheduler class
                for i, mro in enumerate(scheduler.__class__.__mro__):
                    if mro == optim.lr_scheduler._LRScheduler:
                        idx = i
                    elif mro == optim.lr_scheduler.ReduceLROnPlateau:
                        idx = i
                scheduler.__class__.__mro__[idx].__init__(scheduler, optimizer)

Related issue in PyTorch:

ReduceLROnPlateau parent class is not _LRScheduler #21981
pytorch/pytorch#21981

naokishibuya added the help wanted (Open to be worked on) label on Jun 5, 2020
@github-actions
Contributor

github-actions bot commented Jun 5, 2020

Hi! Thanks for your contribution, great first issue!

@SkafteNicki
Member

@naokishibuya good catch. It seems like a problem that should be solved upstream in pytorch, but for now we can solve this locally. Would you be up for a PR?

@Anjum48

Anjum48 commented Jun 9, 2020

When I tried this fix, it solved the error, but unfortunately ReduceLROnPlateau stopped working for me (i.e., there was no indication of the LR decreasing with verbose=True or on TensorBoard). If I switched back to precision=32, it worked normally again.

@SkafteNicki
Member

I think the fix is actually working; however, calling only __init__(scheduler, optimizer) will reset all other arguments (patience, mode, etc.) to their default values for the ReduceLROnPlateau scheduler. A solution to this is to copy over these properties:

__init__(scheduler, optimizer, patience=scheduler.patience, mode=scheduler.mode, ...)

Again I think this is a bit hacky, and a proper solution upstream in pytorch is better.
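
A sketch of what that property-copying could look like for the ReduceLROnPlateau branch inside reinit_scheduler_properties (the keyword list is illustrative and assumes the constructor arguments ReduceLROnPlateau exposes; it is not the final fix):

# Hypothetical sketch: re-init ReduceLROnPlateau while preserving its settings.
if isinstance(scheduler, optim.lr_scheduler.ReduceLROnPlateau):
    optim.lr_scheduler.ReduceLROnPlateau.__init__(
        scheduler,
        optimizer,
        mode=scheduler.mode,
        factor=scheduler.factor,
        patience=scheduler.patience,
        threshold=scheduler.threshold,
        cooldown=scheduler.cooldown,
        min_lr=scheduler.min_lrs,  # stored internally as a per-group list
        eps=scheduler.eps,
    )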

@Anjum48

Anjum48 commented Jun 11, 2020

I think this does the trick for me:

def reinit_scheduler_properties(self, optimizers: list, schedulers: list):
    # Reinitialize optimizer.step properties added by schedulers
    for scheduler in schedulers:
        for optimizer in optimizers:
            scheduler = scheduler["scheduler"]
            # check that we dont mix users optimizers and schedulers
            if scheduler.optimizer == optimizer:
                # Find the mro belonging to the base lr scheduler class
                state = None
                for i, mro in enumerate(scheduler.__class__.__mro__):
                    if (
                        mro == optim.lr_scheduler._LRScheduler
                        or mro == optim.lr_scheduler.ReduceLROnPlateau
                    ):
                        idx = i
                        state = scheduler.state_dict()
                scheduler.__class__.__mro__[idx].__init__(scheduler, optimizer)
                if state is not None:
                    scheduler.load_state_dict(state)

Happy to open a PR if it looks ok to you guys
