diff --git a/CHANGELOG.md b/CHANGELOG.md
index 1087b684622d8..50701237fafa1 100644
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -15,6 +15,8 @@ The format is based on [Keep a Changelog](http://keepachangelog.com/en/1.0.0/).

 ### Changed

+- Changed the order of `backward`, `step`, `zero_grad` to `zero_grad`, `backward`, `step` ([#6147](https://github.com/PyTorchLightning/pytorch-lightning/pull/6147))
+

 ### Deprecated

@@ -30,6 +32,9 @@ The format is based on [Keep a Changelog](http://keepachangelog.com/en/1.0.0/).

 - Fixed multiple early stopping callbacks ([#6197](https://github.com/PyTorchLightning/pytorch-lightning/pull/6197))

+- Fixed LBFGS optimizer support which didn't converge in automatic optimization ([#6147](https://github.com/PyTorchLightning/pytorch-lightning/pull/6147))
+
+
 - Prevent `WandbLogger` from dropping values ([#5931](https://github.com/PyTorchLightning/pytorch-lightning/pull/5931))

diff --git a/docs/source/common/optimizers.rst b/docs/source/common/optimizers.rst
index 3f7cd7f224a97..4f77f23cab605 100644
--- a/docs/source/common/optimizers.rst
+++ b/docs/source/common/optimizers.rst
@@ -23,17 +23,16 @@ to manually manage the optimization process. To do so, do the following:

 * Override your LightningModule ``automatic_optimization`` property to return ``False``
 * Drop or ignore the optimizer_idx argument
-* Use `self.manual_backward(loss)` instead of `loss.backward()`.
+* Use ``self.manual_backward(loss)`` instead of ``loss.backward()``.

-.. note:: This is only recommended for experts who need ultimate flexibility. Lightning will handle only precision and accelerators logic. The users are left with zero_grad, accumulated_grad_batches, model toggling, etc..
+.. note:: This is only recommended for experts who need ultimate flexibility. Lightning will handle only precision and accelerator logic. The user is left with ``optimizer.zero_grad()``, gradient accumulation, model toggling, etc.

-.. warning:: Before 1.2, ``optimzer.step`` was calling ``zero_grad`` internally. From 1.2, it is left to the users expertize.
+.. warning:: Before 1.2, ``optimizer.step`` was calling ``optimizer.zero_grad()`` internally. From 1.2, it is left to the user's expertise.

 .. tip:: To perform ``accumulate_grad_batches`` with one optimizer, you can do as such.

 .. tip:: ``self.optimizers()`` will return ``LightningOptimizer`` objects. You can access your own optimizer with ``optimizer.optimizer``. However, if you use your own optimizer to perform a step, Lightning won't be able to support accelerators and precision for you.

-
 .. code-block:: python

     def training_step(batch, batch_idx, optimizer_idx):
@@ -41,14 +40,14 @@ to manually manage the optimization process. To do so, do the following:

         loss = self.compute_loss(batch)
         self.manual_backward(loss)
-        opt.step()

         # accumulate gradient batches
         if batch_idx % 2 == 0:
+            opt.step()
             opt.zero_grad()

-.. tip:: It is a good practice to provide the optimizer with a ``closure`` function that performs a ``forward`` and ``backward`` pass of your model. It is optional for most optimizers, but makes your code compatible if you switch to an optimizer which requires a closure.
+.. tip:: It is good practice to provide the optimizer with a ``closure`` function that performs a ``forward`` and ``backward`` pass of your model. It is optional for most optimizers, but makes your code compatible if you switch to an optimizer which requires a closure. See also `the PyTorch docs `_.

 Here is the same example as above using a ``closure``.
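The closure-based example referenced just above is unchanged context and therefore not visible in this diff. A minimal sketch of the pattern the tip describes, with ``zero_grad`` called inside the closure before ``manual_backward``, might look like the following. The module, loss function, and batch shape here are illustrative assumptions, not code from this PR; only ``manual_backward``, ``LightningOptimizer.step(closure=...)``, and the ``automatic_optimization`` flag are taken from the changes above.

.. code-block:: python

    import torch
    import torch.nn.functional as F
    import pytorch_lightning as pl


    class ManualOptimModel(pl.LightningModule):
        def __init__(self):
            super().__init__()
            self.layer = torch.nn.Linear(32, 2)
            # opt out of automatic optimization (1.2-era API)
            self.automatic_optimization = False

        def training_step(self, batch, batch_idx):
            # with a single optimizer, ``self.optimizers()`` returns it directly
            opt = self.optimizers()

            def closure():
                # forward and backward both live inside the closure
                output = self.layer(batch)  # assumes ``batch`` is a float tensor of shape (N, 32)
                loss = F.mse_loss(output, torch.zeros_like(output))
                opt.zero_grad()
                self.manual_backward(loss)
                return loss

            # passing the closure keeps this code compatible with optimizers
            # such as LBFGS that re-evaluate the loss inside ``step``
            opt.step(closure=closure)

        def configure_optimizers(self):
            return torch.optim.SGD(self.layer.parameters(), lr=0.1)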
@@ -71,7 +70,6 @@ Here is the same example as above using a ``closure``. .. code-block:: python # Scenario for a GAN. - def training_step(...): opt_gen, opt_dis = self.optimizers() @@ -137,8 +135,12 @@ Here is an example on how to use it: Automatic optimization ====================== -With Lightning most users don't have to think about when to call .backward(), .step(), .zero_grad(), since -Lightning automates that for you. +With Lightning most users don't have to think about when to call ``.zero_grad()``, ``.backward()`` and ``.step()`` +since Lightning automates that for you. + +.. warning:: + Before 1.2.2, ``.zero_grad()`` was called after ``.backward()`` and ``.step()`` internally. + From 1.2.2, Lightning calls ``.zero_grad()`` before ``.backward()``. Under the hood Lightning does the following: @@ -147,33 +149,33 @@ Under the hood Lightning does the following: for epoch in epochs: for batch in data: loss = model.training_step(batch, batch_idx, ...) + optimizer.zero_grad() loss.backward() optimizer.step() - optimizer.zero_grad() - for scheduler in schedulers: - scheduler.step() + for lr_scheduler in lr_schedulers: + lr_scheduler.step() In the case of multiple optimizers, Lightning does the following: .. code-block:: python for epoch in epochs: - for batch in data: - for opt in optimizers: - disable_grads_for_other_optimizers() - train_step(opt) - opt.step() + for batch in data: + for opt in optimizers: + loss = model.training_step(batch, batch_idx, optimizer_idx) + opt.zero_grad() + loss.backward() + opt.step() - for scheduler in schedulers: - scheduler.step() + for lr_scheduler in lr_schedulers: + lr_scheduler.step() Learning rate scheduling ------------------------ -Every optimizer you use can be paired with any `LearningRateScheduler `_. -In the basic use-case, the scheduler (or multiple schedulers) should be returned as the second output from the ``.configure_optimizers`` -method: +Every optimizer you use can be paired with any `Learning Rate Scheduler `_. +In the basic use-case, the scheduler (or multiple schedulers) should be returned as the second output from the ``.configure_optimizers`` method: .. testcode:: @@ -262,7 +264,7 @@ returned as a dict which can contain the following keywords: Use multiple optimizers (like GANs) ----------------------------------- -To use multiple optimizers return > 1 optimizers from :meth:`pytorch_lightning.core.LightningModule.configure_optimizers` +To use multiple optimizers return two or more optimizers from :meth:`pytorch_lightning.core.LightningModule.configure_optimizers` .. testcode:: @@ -283,13 +285,15 @@ Lightning will call each optimizer sequentially: .. code-block:: python for epoch in epochs: - for batch in data: - for opt in optimizers: - train_step(opt) - opt.step() + for batch in data: + for opt in optimizers: + loss = train_step(batch, batch_idx, optimizer_idx) + opt.zero_grad() + loss.backward() + opt.step() - for scheduler in schedulers: - scheduler.step() + for lr_scheduler in lr_schedulers: + lr_scheduler.step() ---------- @@ -334,7 +338,7 @@ Here we add a learning-rate warm up # update params optimizer.step(closure=closure) -.. note:: The default ``optimizer_step`` is relying on the internal ``LightningOptimizer`` to properly perform a step. It handles TPUs, AMP, accumulate_grad_batches, zero_grad, and much more ... +.. note:: The default ``optimizer_step`` is relying on the internal ``LightningOptimizer`` to properly perform a step. It handles TPUs, AMP, accumulate_grad_batches and much more ... .. 
testcode:: @@ -364,6 +368,11 @@ Using the closure functions for optimization When using optimization schemes such as LBFGS, the `second_order_closure` needs to be enabled. By default, this function is defined by wrapping the `training_step` and the backward steps as follows +.. warning:: + Before 1.2.2, ``.zero_grad()`` was called outside the closure internally. + From 1.2.2, the closure calls ``.zero_grad()`` inside, so there is no need to define your own closure + when using similar optimizers to :class:`torch.optim.LBFGS` which requires reevaluation of the loss with the closure in ``optimizer.step()``. + .. testcode:: def second_order_closure(pl_module, split_batch, batch_idx, opt_idx, optimizer, hidden): diff --git a/docs/source/starter/introduction_guide.rst b/docs/source/starter/introduction_guide.rst index 9a78310d17400..2ee31304299e0 100644 --- a/docs/source/starter/introduction_guide.rst +++ b/docs/source/starter/introduction_guide.rst @@ -361,9 +361,9 @@ The training step is what happens inside the training loop. # TRAINING STEP # .... # TRAINING STEP + optimizer.zero_grad() loss.backward() optimizer.step() - optimizer.zero_grad() In the case of MNIST, we do the following @@ -377,9 +377,9 @@ In the case of MNIST, we do the following loss = F.nll_loss(logits, y) # ------ TRAINING STEP END ------ + optimizer.zero_grad() loss.backward() optimizer.step() - optimizer.zero_grad() In Lightning, everything that is in the training step gets organized under the :func:`~pytorch_lightning.core.LightningModule.training_step` function in the LightningModule. diff --git a/docs/source/starter/new-project.rst b/docs/source/starter/new-project.rst index 0a07e9903b348..156db1e060023 100644 --- a/docs/source/starter/new-project.rst +++ b/docs/source/starter/new-project.rst @@ -248,7 +248,7 @@ as long as you return a loss with an attached graph from the `training_step`, Li .. code-block:: python def training_step(self, batch, batch_idx): - loss = self.encoder(batch[0]) + loss = self.encoder(batch) return loss .. _manual_opt: @@ -267,19 +267,18 @@ Turn off automatic optimization and you control the train loop! def training_step(self, batch, batch_idx, optimizer_idx): # access your optimizers with use_pl_optimizer=False. Default is True - (opt_a, opt_b, opt_c) = self.optimizers(use_pl_optimizer=True) + opt_a, opt_b = self.optimizers(use_pl_optimizer=True) - loss_a = self.generator(batch[0]) - - # use this instead of loss.backward so we can automate half precision, etc... - self.manual_backward(loss_a, opt_a, retain_graph=True) - self.manual_backward(loss_a, opt_a) - opt_a.step() + loss_a = self.generator(batch) opt_a.zero_grad() + # use `manual_backward()` instead of `loss.backward` to automate half precision, etc... + self.manual_backward(loss_a) + opt_a.step() - loss_b = self.discriminator(batch[0]) - self.manual_backward(loss_b, opt_b) - ... 
+ loss_b = self.discriminator(batch) + opt_b.zero_grad() + self.manual_backward(loss_b) + opt_b.step() Predict or Deploy diff --git a/pytorch_lightning/core/optimizer.py b/pytorch_lightning/core/optimizer.py index b6aca3b457a1f..8b6548f438756 100644 --- a/pytorch_lightning/core/optimizer.py +++ b/pytorch_lightning/core/optimizer.py @@ -129,15 +129,10 @@ def toggle_model(self, sync_grad: bool = True): def __optimizer_step(self, closure: Optional[Callable] = None, profiler_name: str = None, **kwargs): trainer = self._trainer optimizer = self._optimizer - model = trainer.lightning_module with trainer.profiler.profile(profiler_name): trainer.accelerator.optimizer_step(optimizer, self._optimizer_idx, lambda_closure=closure, **kwargs) - if self._trainer.train_loop.automatic_optimization: - trainer.train_loop.on_before_zero_grad(optimizer) - model.optimizer_zero_grad(trainer.current_epoch, trainer.batch_idx, optimizer, self._optimizer_idx) - def step(self, *args, closure: Optional[Callable] = None, **kwargs): """ Call this directly from your training_step when doing optimizations manually. diff --git a/pytorch_lightning/trainer/training_loop.py b/pytorch_lightning/trainer/training_loop.py index 2acfcd59b722c..23c0abf877a97 100644 --- a/pytorch_lightning/trainer/training_loop.py +++ b/pytorch_lightning/trainer/training_loop.py @@ -742,7 +742,13 @@ def training_step_and_backward(self, split_batch, batch_idx, opt_idx, optimizer, result = self.training_step(split_batch, batch_idx, opt_idx, hiddens) self._curr_step_result = result - if not self._skip_backward and self.trainer.train_loop.automatic_optimization: + if not self._skip_backward and self.automatic_optimization: + is_first_batch_to_accumulate = batch_idx % self.trainer.accumulate_grad_batches == 0 + + if is_first_batch_to_accumulate: + self.on_before_zero_grad(optimizer) + self.optimizer_zero_grad(batch_idx, optimizer, opt_idx) + # backward pass if result is not None: with self.trainer.profiler.profile("model_backward"): diff --git a/tests/callbacks/test_callbacks.py b/tests/callbacks/test_callbacks.py index 3eaaf81ca0a1e..8d01841f3636c 100644 --- a/tests/callbacks/test_callbacks.py +++ b/tests/callbacks/test_callbacks.py @@ -73,20 +73,20 @@ def test_trainer_callback_system(torch_save, tmpdir): call.on_train_epoch_start(trainer, model), call.on_batch_start(trainer, model), call.on_train_batch_start(trainer, model, ANY, 0, 0), - call.on_after_backward(trainer, model), call.on_before_zero_grad(trainer, model, trainer.optimizers[0]), + call.on_after_backward(trainer, model), call.on_train_batch_end(trainer, model, ANY, ANY, 0, 0), call.on_batch_end(trainer, model), call.on_batch_start(trainer, model), call.on_train_batch_start(trainer, model, ANY, 1, 0), - call.on_after_backward(trainer, model), call.on_before_zero_grad(trainer, model, trainer.optimizers[0]), + call.on_after_backward(trainer, model), call.on_train_batch_end(trainer, model, ANY, ANY, 1, 0), call.on_batch_end(trainer, model), call.on_batch_start(trainer, model), call.on_train_batch_start(trainer, model, ANY, 2, 0), - call.on_after_backward(trainer, model), call.on_before_zero_grad(trainer, model, trainer.optimizers[0]), + call.on_after_backward(trainer, model), call.on_train_batch_end(trainer, model, ANY, ANY, 2, 0), call.on_batch_end(trainer, model), call.on_train_epoch_end(trainer, model, ANY), diff --git a/tests/core/test_lightning_module.py b/tests/core/test_lightning_module.py index 17449b96d7cab..8270867ae863c 100644 --- a/tests/core/test_lightning_module.py +++ 
b/tests/core/test_lightning_module.py @@ -11,7 +11,7 @@ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. # See the License for the specific language governing permissions and # limitations under the License. -from unittest.mock import Mock, patch +from unittest.mock import Mock import pytest from torch import nn @@ -74,7 +74,7 @@ def test_property_logger(tmpdir): assert model.logger == logger -def test_automatic_optimization(tmpdir): +def test_automatic_optimization_raises(tmpdir): class TestModel(BoringModel): @@ -95,70 +95,6 @@ def optimizer_step(self, *_, **__): trainer.fit(model) -def test_automatic_optimization_num_calls(tmpdir): - - with patch("torch.optim.SGD.step") as sgd_step, \ - patch("torch.optim.SGD.zero_grad") as sgd_zero_grad, \ - patch("torch.optim.Adam.step") as adam_step, \ - patch("torch.optim.Adam.zero_grad") as adam_zero_grad: - - class TestModel(BoringModel): - - def training_step(self, batch, batch_idx, optimizer_idx): - output = self.layer(batch) - loss = self.loss(batch, output) - return {"loss": loss} - - def configure_optimizers(self): - optimizer = SGD(self.layer.parameters(), lr=0.1) - optimizer_2 = Adam(self.layer.parameters(), lr=0.1) - return [optimizer, optimizer_2] - - def optimizer_step( - self, - epoch, - batch_idx, - optimizer, - optimizer_idx, - optimizer_closure, - on_tpu, - using_native_amp, - using_lbfgs, - ): - - assert optimizer_closure.__name__ == "train_step_and_backward_closure" - - # update generator opt every 2 steps - if optimizer_idx == 0: - if batch_idx % 2 == 0: - assert isinstance(optimizer, SGD) - optimizer.step(closure=optimizer_closure) - - # update discriminator opt every 4 steps - if optimizer_idx == 1: - if batch_idx % 4 == 0: - assert isinstance(optimizer, Adam) - optimizer.step(closure=optimizer_closure) - - model = TestModel() - model.training_epoch_end = None - - trainer = Trainer( - max_epochs=1, - default_root_dir=tmpdir, - limit_train_batches=8, - limit_val_batches=1, - accumulate_grad_batches=1, - ) - - trainer.fit(model) - - assert sgd_step.call_count == 4 - assert sgd_zero_grad.call_count == 4 - assert adam_step.call_count == 2 - assert adam_zero_grad.call_count == 2 - - def test_params_groups_and_state_are_accessible(tmpdir): class TestModel(BoringModel): diff --git a/tests/core/test_lightning_optimizer.py b/tests/core/test_lightning_optimizer.py index fab6ceccbfd88..fd37f812c2337 100644 --- a/tests/core/test_lightning_optimizer.py +++ b/tests/core/test_lightning_optimizer.py @@ -11,10 +11,10 @@ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. # See the License for the specific language governing permissions and # limitations under the License. -from unittest.mock import patch +from unittest.mock import patch, DEFAULT import torch -from torch.optim import Adam, Optimizer +from torch.optim import Adam, Optimizer, SGD from pytorch_lightning import Trainer from pytorch_lightning.core.optimizer import LightningOptimizer @@ -76,11 +76,9 @@ def configure_optimizers(self): assert trainer._lightning_optimizers[0].__repr__() == expected -@patch("torch.optim.Adam.step", autospec=True) -@patch("torch.optim.SGD.step", autospec=True) -def test_lightning_optimizer_manual_optimization(mock_sgd_step, mock_adam_step, tmpdir): +def test_lightning_optimizer_manual_optimization_and_accumulated_gradients(tmpdir): """ - Test that the user can use our LightningOptimizer. Not recommended for now. + Test that the user can use our LightningOptimizer. Not recommended. 
""" class TestModel(BoringModel): @@ -90,29 +88,26 @@ def __init__(self): self.automatic_optimization = False def training_step(self, batch, batch_idx, optimizer_idx=None): - (opt_1, opt_2) = self.optimizers() + opt_1, opt_2 = self.optimizers() assert isinstance(opt_1, LightningOptimizer) assert isinstance(opt_2, LightningOptimizer) - output = self.layer(batch) - loss_1 = self.loss(batch, output) - self.manual_backward(loss_1) - opt_1.step() - opt_1.zero_grad() - - output = self.layer(batch) - loss_2 = self.loss(batch, output) - self.manual_backward(loss_2) + def closure(opt): + output = self.layer(batch) + loss = self.loss(batch, output) + opt.zero_grad() + self.manual_backward(loss) if batch_idx % 2 == 0: - opt_2.step() - opt_2.zero_grad() + closure(opt_1) + opt_1.step() + + closure(opt_2) + opt_2.step() def configure_optimizers(self): optimizer_1 = torch.optim.SGD(self.layer.parameters(), lr=0.1) optimizer_2 = torch.optim.Adam(self.layer.parameters(), lr=0.1) - optimizer_1 = LightningOptimizer(optimizer_1) - lr_scheduler = torch.optim.lr_scheduler.StepLR(optimizer_1, step_size=1) return [optimizer_1, optimizer_2], [lr_scheduler] @@ -125,66 +120,18 @@ def configure_optimizers(self): limit_val_batches=1, max_epochs=1, weights_summary=None, + accumulate_grad_batches=999, # does not do anything if manual optimization ) - trainer.fit(model) - - assert len(mock_sgd_step.mock_calls) == 8 - assert len(mock_adam_step.mock_calls) == 4 - - -@patch("torch.optim.Adam.step", autospec=True) -@patch("torch.optim.SGD.step", autospec=True) -def test_lightning_optimizer_manual_optimization_and_accumulated_gradients(mock_sgd_step, mock_adam_step, tmpdir): - """ - Test that the user can use our LightningOptimizer. Not recommended. - """ - - class TestModel(BoringModel): - - def __init__(self): - super().__init__() - self.automatic_optimization = False - - def training_step(self, batch, batch_idx, optimizer_idx=None): - (opt_1, opt_2) = self.optimizers() - assert isinstance(opt_1, LightningOptimizer) - assert isinstance(opt_2, LightningOptimizer) - - output = self.layer(batch) - loss_1 = self.loss(batch, output) - self.manual_backward(loss_1) - opt_1.step() - - def closure(): - output = self.layer(batch) - loss_2 = self.loss(batch, output) - self.manual_backward(loss_2) - opt_2.step(closure=closure) + with patch.multiple(torch.optim.SGD, zero_grad=DEFAULT, step=DEFAULT) as sgd, \ + patch.multiple(torch.optim.Adam, zero_grad=DEFAULT, step=DEFAULT) as adam: + trainer.fit(model) - def configure_optimizers(self): - optimizer_1 = torch.optim.SGD(self.layer.parameters(), lr=0.1) - optimizer_2 = torch.optim.Adam(self.layer.parameters(), lr=0.1) - optimizer_1 = LightningOptimizer(optimizer_1) + assert sgd["step"].call_count == 4 + assert adam["step"].call_count == 8 - lr_scheduler = torch.optim.lr_scheduler.StepLR(optimizer_1, step_size=1) - return [optimizer_1, optimizer_2], [lr_scheduler] - - model = TestModel() - model.training_step_end = None - model.training_epoch_end = None - trainer = Trainer( - default_root_dir=tmpdir, - limit_train_batches=8, - limit_val_batches=1, - max_epochs=1, - weights_summary=None, - accumulate_grad_batches=2, - ) - trainer.fit(model) - - assert len(mock_sgd_step.mock_calls) == 8 - assert len(mock_adam_step.mock_calls) == 8 + assert sgd["zero_grad"].call_count == 4 + assert adam["zero_grad"].call_count == 8 def test_state(tmpdir): @@ -237,93 +184,119 @@ def test_state(tmpdir): assert optimizer.state == lightning_optimizer.state -def 
test_lightning_optimizer_automatic_optimization(tmpdir): +def test_lightning_optimizer_automatic_optimization_optimizer_zero_grad(tmpdir): """ - Test lightning optimize works with in automatic_optimization + Test overriding zero_grad works in automatic_optimization + """ + class TestModel(BoringModel): + + def training_step(self, batch, batch_idx, optimizer_idx=None): + return super().training_step(batch, batch_idx) + + def training_epoch_end(self, outputs): + ... + + def optimizer_zero_grad(self, epoch, batch_idx, optimizer, optimizer_idx): + if isinstance(optimizer, SGD) and batch_idx % 2 == 0: + optimizer.zero_grad() + if isinstance(optimizer, Adam) and batch_idx % 5 == 0: + optimizer.zero_grad() + + def configure_optimizers(self): + optimizer_1 = torch.optim.SGD(self.layer.parameters(), lr=0.1) + optimizer_2 = torch.optim.Adam(self.layer.parameters(), lr=0.1) + lr_scheduler = torch.optim.lr_scheduler.StepLR(optimizer_1, step_size=1) + return [optimizer_1, optimizer_2], [lr_scheduler] + + model = TestModel() + trainer = Trainer( + default_root_dir=tmpdir, + limit_train_batches=20, + limit_val_batches=1, + max_epochs=1, + weights_summary=None, + ) + + with patch("torch.optim.Adam.zero_grad") as adam_zero_grad, \ + patch("torch.optim.SGD.zero_grad") as sgd_zero_grad: + trainer.fit(model) + + assert adam_zero_grad.call_count == 4 + assert sgd_zero_grad.call_count == 10 + + +def test_lightning_optimizer_automatic_optimization_optimizer_step(tmpdir): + """ + Test overriding step works in automatic_optimization """ class TestModel(BoringModel): def training_step(self, batch, batch_idx, optimizer_idx=None): - output = self.layer(batch) - loss = self.loss(batch, output) - return {"loss": loss} + return super().training_step(batch, batch_idx) def training_epoch_end(self, outputs): - outputs = sum(outputs, []) - torch.stack([x["loss"] for x in outputs]).mean() + ... 
- def optimizer_step( - self, epoch, batch_idx, optimizer, optimizer_idx, optimizer_closure, on_tpu, using_native_amp, using_lbfgs - ): + def optimizer_step(self, epoch, batch_idx, optimizer, optimizer_idx, optimizer_closure, **_): assert optimizer_closure.__name__ == "train_step_and_backward_closure" - optimizer_closure() - if batch_idx % 2 == 0: + # not passing the closure to the optimizer because step is mocked + # zero_grad is called inside the closure + if isinstance(optimizer, SGD) and batch_idx % 2 == 0: + optimizer_closure() optimizer.step() - optimizer.zero_grad() + if isinstance(optimizer, Adam) and batch_idx % 4 == 0: + optimizer_closure() + optimizer.step() # not passing the closure here because it's a mock def configure_optimizers(self): optimizer_1 = torch.optim.SGD(self.layer.parameters(), lr=0.1) optimizer_2 = torch.optim.Adam(self.layer.parameters(), lr=0.1) - optimizer_1 = LightningOptimizer(optimizer_1) - lr_scheduler = torch.optim.lr_scheduler.StepLR(optimizer_1, step_size=1) return [optimizer_1, optimizer_2], [lr_scheduler] model = TestModel() + trainer = Trainer( default_root_dir=tmpdir, - limit_train_batches=10, + limit_train_batches=8, limit_val_batches=1, max_epochs=1, weights_summary=None, ) - trainer.fit(model) + with patch.multiple(torch.optim.SGD, zero_grad=DEFAULT, step=DEFAULT) as sgd, \ + patch.multiple(torch.optim.Adam, zero_grad=DEFAULT, step=DEFAULT) as adam: + trainer.fit(model) -def test_lightning_optimizer_automatic_optimization_optimizer_zero_grad(tmpdir): + assert sgd["step"].call_count == 4 + assert adam["step"].call_count == 2 + + assert sgd["zero_grad"].call_count == 4 + assert adam["zero_grad"].call_count == 2 + + +def test_lightning_optimizer_automatic_optimization_lbfgs_zero_grad(tmpdir): """ - Test lightning optimize works with optimizer_zero_grad overrides in automatic_optimization + Test zero_grad is called the same number of times as LBFGS requires + for reevaluation of the loss in automatic_optimization. 
""" + class TestModel(BoringModel): + def configure_optimizers(self): + return torch.optim.LBFGS(self.parameters()) - with patch("torch.optim.Adam.zero_grad") as adam_zero_grad, \ - patch("torch.optim.SGD.zero_grad") as sgd_zero_grad: - - class TestModel(BoringModel): + model = TestModel() + trainer = Trainer( + default_root_dir=tmpdir, + limit_train_batches=1, + limit_val_batches=1, + max_epochs=1, + weights_summary=None, + ) - def training_step(self, batch, batch_idx, optimizer_idx=None): - output = self.layer(batch) - loss = self.loss(batch, output) - return {"loss": loss} - - def training_epoch_end(self, outputs): - outputs = sum(outputs, []) - torch.stack([x["loss"] for x in outputs]).mean() - - def optimizer_zero_grad(self, epoch: int, batch_idx: int, optimizer: Optimizer, optimizer_idx: int): - if optimizer_idx == 0: - if batch_idx % 2 == 0: - optimizer.zero_grad() - - if optimizer_idx == 1: - if batch_idx % 5 == 0: - optimizer.zero_grad() - - def configure_optimizers(self): - optimizer_1 = torch.optim.SGD(self.layer.parameters(), lr=0.1) - optimizer_2 = torch.optim.Adam(self.layer.parameters(), lr=0.1) - lr_scheduler = torch.optim.lr_scheduler.StepLR(optimizer_1, step_size=1) - return [optimizer_1, optimizer_2], [lr_scheduler] - - model = TestModel() - trainer = Trainer( - default_root_dir=tmpdir, - limit_train_batches=20, - limit_val_batches=1, - max_epochs=1, - weights_summary=None, - ) + with patch("torch.optim.LBFGS.zero_grad") as zero_grad: trainer.fit(model) - assert adam_zero_grad.call_count == 4 - assert sgd_zero_grad.call_count == 10 + lbfgs = model.optimizers() + max_iter = lbfgs.param_groups[0]["max_iter"] + assert zero_grad.call_count == max_iter diff --git a/tests/models/test_hooks.py b/tests/models/test_hooks.py index 416a858537245..62a252eaa3128 100644 --- a/tests/models/test_hooks.py +++ b/tests/models/test_hooks.py @@ -446,12 +446,12 @@ def teardown(self, stage: str): 'on_epoch_start', 'on_train_epoch_start', 'on_train_batch_start', - 'on_after_backward', 'on_before_zero_grad', + 'on_after_backward', 'on_train_batch_end', 'on_train_batch_start', - 'on_after_backward', 'on_before_zero_grad', + 'on_after_backward', 'on_train_batch_end', 'on_train_epoch_end', 'on_epoch_end', diff --git a/tests/trainer/test_trainer.py b/tests/trainer/test_trainer.py index 167930425dab1..23861cdd2563b 100644 --- a/tests/trainer/test_trainer.py +++ b/tests/trainer/test_trainer.py @@ -26,6 +26,7 @@ import pytest import torch from omegaconf import OmegaConf +from torch.optim import SGD from torch.utils.data import DataLoader import tests.helpers.utils as tutils @@ -199,129 +200,32 @@ def test_strict_model_load(monkeypatch, tmpdir, tmpdir_server, url_ckpt): assert not failed, "Model should be loaded due to strict=False." 
-@pytest.mark.parametrize( - ["schedule", "expected"], - [ - pytest.param({ - 1: 2, - 3: 4 - }, [1, 2, 4]), - pytest.param(3, [3, 3, 3]), - pytest.param(4, [4, 4, 4]), - ], -) -def test_gradient_accumulation_scheduling(tmpdir, schedule, expected): - """ - Test grad accumulation by the freq of optimizer updates - """ - - # test incorrect configs - with pytest.raises(IndexError): - assert Trainer(accumulate_grad_batches={-1: 3, 1: 4, 4: 6}) - with pytest.raises(IndexError): - assert Trainer(accumulate_grad_batches={-2: 3}) - - with pytest.raises(TypeError): - assert Trainer(accumulate_grad_batches={}) - with pytest.raises(TypeError): - assert Trainer(accumulate_grad_batches=[[2, 3], [4, 6]]) - with pytest.raises(TypeError): - assert Trainer(accumulate_grad_batches={1: 2, 3.0: 4}) - with pytest.raises(TypeError): - assert Trainer(accumulate_grad_batches={1: 2.5, 3: 5}) - - model = EvalModelTemplate() - - trainer = Trainer( - accumulate_grad_batches=schedule, - limit_train_batches=0.7, # not to be divisible by accumulate_grad_batches on purpose - limit_val_batches=0.8, - max_epochs=4, - default_root_dir=tmpdir, - ) - - model.old_optimizer_step = model.optimizer_step - - # test optimizer call freq matches scheduler - def _optimizer_step( - epoch, - batch_idx, - optimizer, - optimizer_idx, - second_order_closure=None, - on_tpu=False, - using_native_amp=False, - using_lbfgs=False, - ): - # only test the first 12 batches in epoch - if batch_idx < 12: - if epoch == 0: - # reset counter when starting epoch - if batch_idx == expected[0] - 1: - model.prev_called_batch_idx = expected[0] - 1 - - # use this opportunity to test once - assert trainer.accumulate_grad_batches == expected[0] - - # separate check for last batch with accumulate 1 step - if expected[0] == 1 and (batch_idx + 1) == trainer.num_training_batches: - assert batch_idx == model.prev_called_batch_idx - elif (batch_idx + 1) == trainer.num_training_batches: - # prev_called_batch_idx - schedule + modulus remainder - assert batch_idx == (model.prev_called_batch_idx - expected[0] + (batch_idx + 1) % expected[0]) - else: - assert batch_idx == model.prev_called_batch_idx - model.prev_called_batch_idx += expected[0] - - elif 1 <= epoch <= 2: - # reset counter when starting epoch - if batch_idx == expected[1] - 1: - model.prev_called_batch_idx = expected[1] - 1 - - # use this opportunity to test once - assert trainer.accumulate_grad_batches == expected[1] - - if trainer.num_training_batches == batch_idx + 1: - # prev_called_batch_idx - schedule + modulus remainder - assert batch_idx == (model.prev_called_batch_idx - expected[1] + (batch_idx + 1) % expected[1]) - else: - assert batch_idx == model.prev_called_batch_idx - model.prev_called_batch_idx += expected[1] - - else: - if batch_idx == expected[2] - 1: - model.prev_called_batch_idx = expected[2] - 1 - - # use this opportunity to test once - assert trainer.accumulate_grad_batches == expected[2] - - if (batch_idx + 1) == trainer.num_training_batches: - # prev_called_batch_idx - schedule + modulus remainder - assert batch_idx == (model.prev_called_batch_idx - expected[2] + (batch_idx + 1) % expected[2]) - else: - assert batch_idx == model.prev_called_batch_idx - model.prev_called_batch_idx += expected[2] - - model.old_optimizer_step( - epoch, batch_idx, optimizer, optimizer_idx, second_order_closure, on_tpu, using_native_amp, using_lbfgs +@pytest.mark.parametrize("accumulate_grad_batches", (1, 2, 3)) +def test_trainer_accumulate_grad_batches_zero_grad(tmpdir, accumulate_grad_batches): + with 
patch("torch.optim.SGD.zero_grad") as sgd_zero_grad: + model = BoringModel() + trainer = Trainer( + default_root_dir=tmpdir, + limit_train_batches=20, + limit_val_batches=1, + max_epochs=1, + weights_summary=None, + accumulate_grad_batches=accumulate_grad_batches, ) + trainer.fit(model) + + assert sgd_zero_grad.call_count == math.ceil(trainer.limit_train_batches / accumulate_grad_batches) @pytest.mark.parametrize( ["accumulate_grad_batches", "limit_train_batches"], [ - pytest.param({ - 1: 2, - 3: 4 - }, 1.0), - pytest.param({ - 1: 2, - 3: 4 - }, 0.5), # not to be divisible by accumulate_grad_batches on purpose - pytest.param(3, 1.0), - pytest.param(3, 0.8), # not to be divisible by accumulate_grad_batches on purpose - pytest.param(4, 1.0), - pytest.param(4, 0.7), # not to be divisible by accumulate_grad_batches on purpose + ({1: 2, 3: 4}, 1.0), + ({1: 2, 3: 4}, 0.5), # not to be divisible by accumulate_grad_batches on purpose + (3, 1.0), + (3, 0.8), # not to be divisible by accumulate_grad_batches on purpose + (4, 1.0), + (4, 0.7), # not to be divisible by accumulate_grad_batches on purpose ], ) def test_gradient_accumulation_scheduling_last_batch(tmpdir, accumulate_grad_batches, limit_train_batches): @@ -329,20 +233,19 @@ def test_gradient_accumulation_scheduling_last_batch(tmpdir, accumulate_grad_bat class CurrentModel(BoringModel): - def on_batch_start(self, batch, batch_idx, dataloader_idx): + def on_batch_start(self, *_): self.on_train_batch_start_state_dict = self.state_dict() - def on_batch_end(self, outputs, batch, batch_idx, dataloader_idx): + def on_batch_end(self, outputs, batch, batch_idx, *_): self.on_train_batch_start_end_dict = self.state_dict() for key in self.on_train_batch_start_end_dict.keys(): + equal = torch.equal( + self.on_train_batch_start_state_dict[key], self.on_train_batch_start_end_dict[key] + ) if (batch_idx + 1) == self.trainer.num_training_batches: - assert torch.equal( - self.on_train_batch_start_state_dict[key], self.on_train_batch_start_end_dict[key] - ) + assert equal else: - assert not torch.equal( - self.on_train_batch_start_state_dict[key], self.on_train_batch_start_end_dict[key] - ) + assert not equal model = CurrentModel() @@ -1809,3 +1712,91 @@ def training_step(self, batch, batch_idx): model = TestModel() trainer = Trainer(default_root_dir=tmpdir, fast_dev_run=True, gpus=1) trainer.fit(model, train_data) + + +def test_train_loop_system(tmpdir): + """ + Test the following methods are called in the order in automatic optimization. + 1. optimizer.step (skip when gradient accumulation) + 2. model.training_step + 3. optimizer.zero_grad (run when the first batch of gradient accumulation) + 4. model.backward + + Note that the order is NOT `training_step`->`zero_grad`->`backward`->`step`. + This is because `optimizer.step(closure)` calls `closure()` which then calls + the three remaining methods `training_step`, `zero_grad` and `backward` inside. 
+ """ + called_methods = [] + + trainer_options = dict( + default_root_dir=tmpdir, + max_epochs=1, + limit_train_batches=5, + limit_val_batches=1, + limit_test_batches=1, + progress_bar_refresh_rate=0, + ) + + class TestOptimizer(SGD): + def step(self, *args, **kwargs): + called_methods.append("step") + return super().step(*args, **kwargs) + + def zero_grad(self, *args, **kwargs): + called_methods.append("zero_grad") + return super().zero_grad(*args, **kwargs) + + class TestModel(BoringModel): + def configure_optimizers(self): + return TestOptimizer(self.parameters(), lr=0.1) + + def training_step(self, *args, **kwargs): + called_methods.append("training_step") + return super().training_step(*args, **kwargs) + + def backward(self, *args, **kwargs): + called_methods.append("backward") + return super().backward(*args, **kwargs) + + model = TestModel() + trainer = Trainer(**trainer_options) + + # No methods are called yet. + assert called_methods == [] + + trainer.fit(model) + assert called_methods == [ + "step", + "training_step", + "zero_grad", + "backward", + ] * trainer.limit_train_batches + + called_methods.clear() + trainer = Trainer(**trainer_options, accumulate_grad_batches=3) + + # No methods are called yet. + assert called_methods == [] + + trainer.fit(model) + assert called_methods == [ + # 0 + "training_step", + "zero_grad", + "backward", + # 1 + "training_step", + "backward", + # 2 + "step", + "training_step", + "backward", + # 3 + "training_step", + "zero_grad", + "backward", + # 4 + "step", + "training_step", + "backward", + ]