diff --git a/CHANGELOG.md b/CHANGELOG.md
index 1087b684622d8..50701237fafa1 100644
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -15,6 +15,8 @@ The format is based on [Keep a Changelog](http://keepachangelog.com/en/1.0.0/).

 ### Changed

+- Changed the order of `backward`, `step`, `zero_grad` to `zero_grad`, `backward`, `step` ([#6147](https://github.com/PyTorchLightning/pytorch-lightning/pull/6147))
+

 ### Deprecated

@@ -30,6 +32,9 @@ The format is based on [Keep a Changelog](http://keepachangelog.com/en/1.0.0/).

 - Fixed multiple early stopping callbacks ([#6197](https://github.com/PyTorchLightning/pytorch-lightning/pull/6197))

+- Fixed LBFGS optimizer support which didn't converge in automatic optimization ([#6147](https://github.com/PyTorchLightning/pytorch-lightning/pull/6147))
+
+
 - Prevent `WandbLogger` from dropping values ([#5931](https://github.com/PyTorchLightning/pytorch-lightning/pull/5931))

diff --git a/docs/source/common/optimizers.rst b/docs/source/common/optimizers.rst
index 3f7cd7f224a97..4f77f23cab605 100644
--- a/docs/source/common/optimizers.rst
+++ b/docs/source/common/optimizers.rst
@@ -23,17 +23,16 @@ to manually manage the optimization process. To do so, do the following:

 * Override your LightningModule ``automatic_optimization`` property to return ``False``
 * Drop or ignore the optimizer_idx argument
-* Use `self.manual_backward(loss)` instead of `loss.backward()`.
+* Use ``self.manual_backward(loss)`` instead of ``loss.backward()``.

-.. note:: This is only recommended for experts who need ultimate flexibility. Lightning will handle only precision and accelerators logic. The users are left with zero_grad, accumulated_grad_batches, model toggling, etc..
+.. note:: This is only recommended for experts who need ultimate flexibility. Lightning will handle only precision and accelerator logic. The user is left with ``optimizer.zero_grad()``, gradient accumulation, model toggling, etc.

-.. warning:: Before 1.2, ``optimzer.step`` was calling ``zero_grad`` internally. From 1.2, it is left to the users expertize.
+.. warning:: Before 1.2, ``optimizer.step`` was calling ``optimizer.zero_grad()`` internally. From 1.2, it is left to the user's expertise.

 .. tip:: To perform ``accumulate_grad_batches`` with one optimizer, you can do as such.

 .. tip:: ``self.optimizers()`` will return ``LightningOptimizer`` objects. You can access your own optimizer with ``optimizer.optimizer``. However, if you use your own optimizer to perform a step, Lightning won't be able to support accelerators and precision for you.

-
 .. code-block:: python

     def training_step(batch, batch_idx, optimizer_idx):
@@ -41,14 +40,14 @@ to manually manage the optimization process. To do so, do the following:

         loss = self.compute_loss(batch)
         self.manual_backward(loss)
-        opt.step()

         # accumulate gradient batches
         if batch_idx % 2 == 0:
+            opt.step()
             opt.zero_grad()

-.. tip:: It is a good practice to provide the optimizer with a ``closure`` function that performs a ``forward`` and ``backward`` pass of your model. It is optional for most optimizers, but makes your code compatible if you switch to an optimizer which requires a closure.
+.. tip:: It is good practice to provide the optimizer with a ``closure`` function that performs a ``forward`` and ``backward`` pass of your model. It is optional for most optimizers, but makes your code compatible if you switch to an optimizer which requires a closure. See also `the PyTorch docs `_.

 Here is the same example as above using a ``closure``.
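The closure-based example referenced just above is unchanged context and therefore not visible in this diff. A minimal sketch of the pattern the tip describes, with ``zero_grad`` called inside the closure before ``manual_backward``, might look like the following. The module, loss function, and batch shape here are illustrative assumptions, not code from this PR; only ``manual_backward``, ``LightningOptimizer.step(closure=...)``, and the ``automatic_optimization`` flag are taken from the changes above.

.. code-block:: python

    import torch
    import torch.nn.functional as F
    import pytorch_lightning as pl


    class ManualOptimModel(pl.LightningModule):
        def __init__(self):
            super().__init__()
            self.layer = torch.nn.Linear(32, 2)
            # opt out of automatic optimization (1.2-era API)
            self.automatic_optimization = False

        def training_step(self, batch, batch_idx):
            # with a single optimizer, ``self.optimizers()`` returns it directly
            opt = self.optimizers()

            def closure():
                # forward and backward both live inside the closure
                output = self.layer(batch)  # assumes ``batch`` is a float tensor of shape (N, 32)
                loss = F.mse_loss(output, torch.zeros_like(output))
                opt.zero_grad()
                self.manual_backward(loss)
                return loss

            # passing the closure keeps this code compatible with optimizers
            # such as LBFGS that re-evaluate the loss inside ``step``
            opt.step(closure=closure)

        def configure_optimizers(self):
            return torch.optim.SGD(self.layer.parameters(), lr=0.1)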
@@ -71,7 +70,6 @@ Here is the same example as above using a ``closure``. .. code-block:: python # Scenario for a GAN. - def training_step(...): opt_gen, opt_dis = self.optimizers() @@ -137,8 +135,12 @@ Here is an example on how to use it: Automatic optimization ====================== -With Lightning most users don't have to think about when to call .backward(), .step(), .zero_grad(), since -Lightning automates that for you. +With Lightning most users don't have to think about when to call ``.zero_grad()``, ``.backward()`` and ``.step()`` +since Lightning automates that for you. + +.. warning:: + Before 1.2.2, ``.zero_grad()`` was called after ``.backward()`` and ``.step()`` internally. + From 1.2.2, Lightning calls ``.zero_grad()`` before ``.backward()``. Under the hood Lightning does the following: @@ -147,33 +149,33 @@ Under the hood Lightning does the following: for epoch in epochs: for batch in data: loss = model.training_step(batch, batch_idx, ...) + optimizer.zero_grad() loss.backward() optimizer.step() - optimizer.zero_grad() - for scheduler in schedulers: - scheduler.step() + for lr_scheduler in lr_schedulers: + lr_scheduler.step() In the case of multiple optimizers, Lightning does the following: .. code-block:: python for epoch in epochs: - for batch in data: - for opt in optimizers: - disable_grads_for_other_optimizers() - train_step(opt) - opt.step() + for batch in data: + for opt in optimizers: + loss = model.training_step(batch, batch_idx, optimizer_idx) + opt.zero_grad() + loss.backward() + opt.step() - for scheduler in schedulers: - scheduler.step() + for lr_scheduler in lr_schedulers: + lr_scheduler.step() Learning rate scheduling ------------------------ -Every optimizer you use can be paired with any `LearningRateScheduler `_. -In the basic use-case, the scheduler (or multiple schedulers) should be returned as the second output from the ``.configure_optimizers`` -method: +Every optimizer you use can be paired with any `Learning Rate Scheduler `_. +In the basic use-case, the scheduler (or multiple schedulers) should be returned as the second output from the ``.configure_optimizers`` method: .. testcode:: @@ -262,7 +264,7 @@ returned as a dict which can contain the following keywords: Use multiple optimizers (like GANs) ----------------------------------- -To use multiple optimizers return > 1 optimizers from :meth:`pytorch_lightning.core.LightningModule.configure_optimizers` +To use multiple optimizers return two or more optimizers from :meth:`pytorch_lightning.core.LightningModule.configure_optimizers` .. testcode:: @@ -283,13 +285,15 @@ Lightning will call each optimizer sequentially: .. code-block:: python for epoch in epochs: - for batch in data: - for opt in optimizers: - train_step(opt) - opt.step() + for batch in data: + for opt in optimizers: + loss = train_step(batch, batch_idx, optimizer_idx) + opt.zero_grad() + loss.backward() + opt.step() - for scheduler in schedulers: - scheduler.step() + for lr_scheduler in lr_schedulers: + lr_scheduler.step() ---------- @@ -334,7 +338,7 @@ Here we add a learning-rate warm up # update params optimizer.step(closure=closure) -.. note:: The default ``optimizer_step`` is relying on the internal ``LightningOptimizer`` to properly perform a step. It handles TPUs, AMP, accumulate_grad_batches, zero_grad, and much more ... +.. note:: The default ``optimizer_step`` is relying on the internal ``LightningOptimizer`` to properly perform a step. It handles TPUs, AMP, accumulate_grad_batches and much more ... .. 
testcode:: @@ -364,6 +368,11 @@ Using the closure functions for optimization When using optimization schemes such as LBFGS, the `second_order_closure` needs to be enabled. By default, this function is defined by wrapping the `training_step` and the backward steps as follows +.. warning:: + Before 1.2.2, ``.zero_grad()`` was called outside the closure internally. + From 1.2.2, the closure calls ``.zero_grad()`` inside, so there is no need to define your own closure + when using similar optimizers to :class:`torch.optim.LBFGS` which requires reevaluation of the loss with the closure in ``optimizer.step()``. + .. testcode:: def second_order_closure(pl_module, split_batch, batch_idx, opt_idx, optimizer, hidden): diff --git a/docs/source/starter/introduction_guide.rst b/docs/source/starter/introduction_guide.rst index 9a78310d17400..2ee31304299e0 100644 --- a/docs/source/starter/introduction_guide.rst +++ b/docs/source/starter/introduction_guide.rst @@ -361,9 +361,9 @@ The training step is what happens inside the training loop. # TRAINING STEP # .... # TRAINING STEP + optimizer.zero_grad() loss.backward() optimizer.step() - optimizer.zero_grad() In the case of MNIST, we do the following @@ -377,9 +377,9 @@ In the case of MNIST, we do the following loss = F.nll_loss(logits, y) # ------ TRAINING STEP END ------ + optimizer.zero_grad() loss.backward() optimizer.step() - optimizer.zero_grad() In Lightning, everything that is in the training step gets organized under the :func:`~pytorch_lightning.core.LightningModule.training_step` function in the LightningModule. diff --git a/docs/source/starter/new-project.rst b/docs/source/starter/new-project.rst index 0a07e9903b348..156db1e060023 100644 --- a/docs/source/starter/new-project.rst +++ b/docs/source/starter/new-project.rst @@ -248,7 +248,7 @@ as long as you return a loss with an attached graph from the `training_step`, Li .. code-block:: python def training_step(self, batch, batch_idx): - loss = self.encoder(batch[0]) + loss = self.encoder(batch) return loss .. _manual_opt: @@ -267,19 +267,18 @@ Turn off automatic optimization and you control the train loop! def training_step(self, batch, batch_idx, optimizer_idx): # access your optimizers with use_pl_optimizer=False. Default is True - (opt_a, opt_b, opt_c) = self.optimizers(use_pl_optimizer=True) + opt_a, opt_b = self.optimizers(use_pl_optimizer=True) - loss_a = self.generator(batch[0]) - - # use this instead of loss.backward so we can automate half precision, etc... - self.manual_backward(loss_a, opt_a, retain_graph=True) - self.manual_backward(loss_a, opt_a) - opt_a.step() + loss_a = self.generator(batch) opt_a.zero_grad() + # use `manual_backward()` instead of `loss.backward` to automate half precision, etc... + self.manual_backward(loss_a) + opt_a.step() - loss_b = self.discriminator(batch[0]) - self.manual_backward(loss_b, opt_b) - ... 
+ loss_b = self.discriminator(batch) + opt_b.zero_grad() + self.manual_backward(loss_b) + opt_b.step() Predict or Deploy diff --git a/pytorch_lightning/core/optimizer.py b/pytorch_lightning/core/optimizer.py index b6aca3b457a1f..8b6548f438756 100644 --- a/pytorch_lightning/core/optimizer.py +++ b/pytorch_lightning/core/optimizer.py @@ -129,15 +129,10 @@ def toggle_model(self, sync_grad: bool = True): def __optimizer_step(self, closure: Optional[Callable] = None, profiler_name: str = None, **kwargs): trainer = self._trainer optimizer = self._optimizer - model = trainer.lightning_module with trainer.profiler.profile(profiler_name): trainer.accelerator.optimizer_step(optimizer, self._optimizer_idx, lambda_closure=closure, **kwargs) - if self._trainer.train_loop.automatic_optimization: - trainer.train_loop.on_before_zero_grad(optimizer) - model.optimizer_zero_grad(trainer.current_epoch, trainer.batch_idx, optimizer, self._optimizer_idx) - def step(self, *args, closure: Optional[Callable] = None, **kwargs): """ Call this directly from your training_step when doing optimizations manually. diff --git a/pytorch_lightning/trainer/training_loop.py b/pytorch_lightning/trainer/training_loop.py index 2acfcd59b722c..23c0abf877a97 100644 --- a/pytorch_lightning/trainer/training_loop.py +++ b/pytorch_lightning/trainer/training_loop.py @@ -742,7 +742,13 @@ def training_step_and_backward(self, split_batch, batch_idx, opt_idx, optimizer, result = self.training_step(split_batch, batch_idx, opt_idx, hiddens) self._curr_step_result = result - if not self._skip_backward and self.trainer.train_loop.automatic_optimization: + if not self._skip_backward and self.automatic_optimization: + is_first_batch_to_accumulate = batch_idx % self.trainer.accumulate_grad_batches == 0 + + if is_first_batch_to_accumulate: + self.on_before_zero_grad(optimizer) + self.optimizer_zero_grad(batch_idx, optimizer, opt_idx) + # backward pass if result is not None: with self.trainer.profiler.profile("model_backward"): diff --git a/tests/callbacks/test_callbacks.py b/tests/callbacks/test_callbacks.py index 3eaaf81ca0a1e..8d01841f3636c 100644 --- a/tests/callbacks/test_callbacks.py +++ b/tests/callbacks/test_callbacks.py @@ -73,20 +73,20 @@ def test_trainer_callback_system(torch_save, tmpdir): call.on_train_epoch_start(trainer, model), call.on_batch_start(trainer, model), call.on_train_batch_start(trainer, model, ANY, 0, 0), - call.on_after_backward(trainer, model), call.on_before_zero_grad(trainer, model, trainer.optimizers[0]), + call.on_after_backward(trainer, model), call.on_train_batch_end(trainer, model, ANY, ANY, 0, 0), call.on_batch_end(trainer, model), call.on_batch_start(trainer, model), call.on_train_batch_start(trainer, model, ANY, 1, 0), - call.on_after_backward(trainer, model), call.on_before_zero_grad(trainer, model, trainer.optimizers[0]), + call.on_after_backward(trainer, model), call.on_train_batch_end(trainer, model, ANY, ANY, 1, 0), call.on_batch_end(trainer, model), call.on_batch_start(trainer, model), call.on_train_batch_start(trainer, model, ANY, 2, 0), - call.on_after_backward(trainer, model), call.on_before_zero_grad(trainer, model, trainer.optimizers[0]), + call.on_after_backward(trainer, model), call.on_train_batch_end(trainer, model, ANY, ANY, 2, 0), call.on_batch_end(trainer, model), call.on_train_epoch_end(trainer, model, ANY), diff --git a/tests/core/test_lightning_module.py b/tests/core/test_lightning_module.py index 17449b96d7cab..8270867ae863c 100644 --- a/tests/core/test_lightning_module.py +++ 
b/tests/core/test_lightning_module.py @@ -11,7 +11,7 @@ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. # See the License for the specific language governing permissions and # limitations under the License. -from unittest.mock import Mock, patch +from unittest.mock import Mock import pytest from torch import nn @@ -74,7 +74,7 @@ def test_property_logger(tmpdir): assert model.logger == logger -def test_automatic_optimization(tmpdir): +def test_automatic_optimization_raises(tmpdir): class TestModel(BoringModel): @@ -95,70 +95,6 @@ def optimizer_step(self, *_, **__): trainer.fit(model) -def test_automatic_optimization_num_calls(tmpdir): - - with patch("torch.optim.SGD.step") as sgd_step, \ - patch("torch.optim.SGD.zero_grad") as sgd_zero_grad, \ - patch("torch.optim.Adam.step") as adam_step, \ - patch("torch.optim.Adam.zero_grad") as adam_zero_grad: - - class TestModel(BoringModel): - - def training_step(self, batch, batch_idx, optimizer_idx): - output = self.layer(batch) - loss = self.loss(batch, output) - return {"loss": loss} - - def configure_optimizers(self): - optimizer = SGD(self.layer.parameters(), lr=0.1) - optimizer_2 = Adam(self.layer.parameters(), lr=0.1) - return [optimizer, optimizer_2] - - def optimizer_step( - self, - epoch, - batch_idx, - optimizer, - optimizer_idx, - optimizer_closure, - on_tpu, - using_native_amp, - using_lbfgs, - ): - - assert optimizer_closure.__name__ == "train_step_and_backward_closure" - - # update generator opt every 2 steps - if optimizer_idx == 0: - if batch_idx % 2 == 0: - assert isinstance(optimizer, SGD) - optimizer.step(closure=optimizer_closure) - - # update discriminator opt every 4 steps - if optimizer_idx == 1: - if batch_idx % 4 == 0: - assert isinstance(optimizer, Adam) - optimizer.step(closure=optimizer_closure) - - model = TestModel() - model.training_epoch_end = None - - trainer = Trainer( - max_epochs=1, - default_root_dir=tmpdir, - limit_train_batches=8, - limit_val_batches=1, - accumulate_grad_batches=1, - ) - - trainer.fit(model) - - assert sgd_step.call_count == 4 - assert sgd_zero_grad.call_count == 4 - assert adam_step.call_count == 2 - assert adam_zero_grad.call_count == 2 - - def test_params_groups_and_state_are_accessible(tmpdir): class TestModel(BoringModel): diff --git a/tests/core/test_lightning_optimizer.py b/tests/core/test_lightning_optimizer.py index fab6ceccbfd88..fd37f812c2337 100644 --- a/tests/core/test_lightning_optimizer.py +++ b/tests/core/test_lightning_optimizer.py @@ -11,10 +11,10 @@ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. # See the License for the specific language governing permissions and # limitations under the License. -from unittest.mock import patch +from unittest.mock import patch, DEFAULT import torch -from torch.optim import Adam, Optimizer +from torch.optim import Adam, Optimizer, SGD from pytorch_lightning import Trainer from pytorch_lightning.core.optimizer import LightningOptimizer @@ -76,11 +76,9 @@ def configure_optimizers(self): assert trainer._lightning_optimizers[0].__repr__() == expected -@patch("torch.optim.Adam.step", autospec=True) -@patch("torch.optim.SGD.step", autospec=True) -def test_lightning_optimizer_manual_optimization(mock_sgd_step, mock_adam_step, tmpdir): +def test_lightning_optimizer_manual_optimization_and_accumulated_gradients(tmpdir): """ - Test that the user can use our LightningOptimizer. Not recommended for now. + Test that the user can use our LightningOptimizer. Not recommended. 
""" class TestModel(BoringModel): @@ -90,29 +88,26 @@ def __init__(self): self.automatic_optimization = False def training_step(self, batch, batch_idx, optimizer_idx=None): - (opt_1, opt_2) = self.optimizers() + opt_1, opt_2 = self.optimizers() assert isinstance(opt_1, LightningOptimizer) assert isinstance(opt_2, LightningOptimizer) - output = self.layer(batch) - loss_1 = self.loss(batch, output) - self.manual_backward(loss_1) - opt_1.step() - opt_1.zero_grad() - - output = self.layer(batch) - loss_2 = self.loss(batch, output) - self.manual_backward(loss_2) + def closure(opt): + output = self.layer(batch) + loss = self.loss(batch, output) + opt.zero_grad() + self.manual_backward(loss) if batch_idx % 2 == 0: - opt_2.step() - opt_2.zero_grad() + closure(opt_1) + opt_1.step() + + closure(opt_2) + opt_2.step() def configure_optimizers(self): optimizer_1 = torch.optim.SGD(self.layer.parameters(), lr=0.1) optimizer_2 = torch.optim.Adam(self.layer.parameters(), lr=0.1) - optimizer_1 = LightningOptimizer(optimizer_1) - lr_scheduler = torch.optim.lr_scheduler.StepLR(optimizer_1, step_size=1) return [optimizer_1, optimizer_2], [lr_scheduler] @@ -125,66 +120,18 @@ def configure_optimizers(self): limit_val_batches=1, max_epochs=1, weights_summary=None, + accumulate_grad_batches=999, # does not do anything if manual optimization ) - trainer.fit(model) - - assert len(mock_sgd_step.mock_calls) == 8 - assert len(mock_adam_step.mock_calls) == 4 - - -@patch("torch.optim.Adam.step", autospec=True) -@patch("torch.optim.SGD.step", autospec=True) -def test_lightning_optimizer_manual_optimization_and_accumulated_gradients(mock_sgd_step, mock_adam_step, tmpdir): - """ - Test that the user can use our LightningOptimizer. Not recommended. - """ - - class TestModel(BoringModel): - - def __init__(self): - super().__init__() - self.automatic_optimization = False - - def training_step(self, batch, batch_idx, optimizer_idx=None): - (opt_1, opt_2) = self.optimizers() - assert isinstance(opt_1, LightningOptimizer) - assert isinstance(opt_2, LightningOptimizer) - - output = self.layer(batch) - loss_1 = self.loss(batch, output) - self.manual_backward(loss_1) - opt_1.step() - - def closure(): - output = self.layer(batch) - loss_2 = self.loss(batch, output) - self.manual_backward(loss_2) - opt_2.step(closure=closure) + with patch.multiple(torch.optim.SGD, zero_grad=DEFAULT, step=DEFAULT) as sgd, \ + patch.multiple(torch.optim.Adam, zero_grad=DEFAULT, step=DEFAULT) as adam: + trainer.fit(model) - def configure_optimizers(self): - optimizer_1 = torch.optim.SGD(self.layer.parameters(), lr=0.1) - optimizer_2 = torch.optim.Adam(self.layer.parameters(), lr=0.1) - optimizer_1 = LightningOptimizer(optimizer_1) + assert sgd["step"].call_count == 4 + assert adam["step"].call_count == 8 - lr_scheduler = torch.optim.lr_scheduler.StepLR(optimizer_1, step_size=1) - return [optimizer_1, optimizer_2], [lr_scheduler] - - model = TestModel() - model.training_step_end = None - model.training_epoch_end = None - trainer = Trainer( - default_root_dir=tmpdir, - limit_train_batches=8, - limit_val_batches=1, - max_epochs=1, - weights_summary=None, - accumulate_grad_batches=2, - ) - trainer.fit(model) - - assert len(mock_sgd_step.mock_calls) == 8 - assert len(mock_adam_step.mock_calls) == 8 + assert sgd["zero_grad"].call_count == 4 + assert adam["zero_grad"].call_count == 8 def test_state(tmpdir): @@ -237,93 +184,119 @@ def test_state(tmpdir): assert optimizer.state == lightning_optimizer.state -def 
test_lightning_optimizer_automatic_optimization(tmpdir): +def test_lightning_optimizer_automatic_optimization_optimizer_zero_grad(tmpdir): """ - Test lightning optimize works with in automatic_optimization + Test overriding zero_grad works in automatic_optimization + """ + class TestModel(BoringModel): + + def training_step(self, batch, batch_idx, optimizer_idx=None): + return super().training_step(batch, batch_idx) + + def training_epoch_end(self, outputs): + ... + + def optimizer_zero_grad(self, epoch, batch_idx, optimizer, optimizer_idx): + if isinstance(optimizer, SGD) and batch_idx % 2 == 0: + optimizer.zero_grad() + if isinstance(optimizer, Adam) and batch_idx % 5 == 0: + optimizer.zero_grad() + + def configure_optimizers(self): + optimizer_1 = torch.optim.SGD(self.layer.parameters(), lr=0.1) + optimizer_2 = torch.optim.Adam(self.layer.parameters(), lr=0.1) + lr_scheduler = torch.optim.lr_scheduler.StepLR(optimizer_1, step_size=1) + return [optimizer_1, optimizer_2], [lr_scheduler] + + model = TestModel() + trainer = Trainer( + default_root_dir=tmpdir, + limit_train_batches=20, + limit_val_batches=1, + max_epochs=1, + weights_summary=None, + ) + + with patch("torch.optim.Adam.zero_grad") as adam_zero_grad, \ + patch("torch.optim.SGD.zero_grad") as sgd_zero_grad: + trainer.fit(model) + + assert adam_zero_grad.call_count == 4 + assert sgd_zero_grad.call_count == 10 + + +def test_lightning_optimizer_automatic_optimization_optimizer_step(tmpdir): + """ + Test overriding step works in automatic_optimization """ class TestModel(BoringModel): def training_step(self, batch, batch_idx, optimizer_idx=None): - output = self.layer(batch) - loss = self.loss(batch, output) - return {"loss": loss} + return super().training_step(batch, batch_idx) def training_epoch_end(self, outputs): - outputs = sum(outputs, []) - torch.stack([x["loss"] for x in outputs]).mean() + ... 
- def optimizer_step( - self, epoch, batch_idx, optimizer, optimizer_idx, optimizer_closure, on_tpu, using_native_amp, using_lbfgs - ): + def optimizer_step(self, epoch, batch_idx, optimizer, optimizer_idx, optimizer_closure, **_): assert optimizer_closure.__name__ == "train_step_and_backward_closure" - optimizer_closure() - if batch_idx % 2 == 0: + # not passing the closure to the optimizer because step is mocked + # zero_grad is called inside the closure + if isinstance(optimizer, SGD) and batch_idx % 2 == 0: + optimizer_closure() optimizer.step() - optimizer.zero_grad() + if isinstance(optimizer, Adam) and batch_idx % 4 == 0: + optimizer_closure() + optimizer.step() # not passing the closure here because it's a mock def configure_optimizers(self): optimizer_1 = torch.optim.SGD(self.layer.parameters(), lr=0.1) optimizer_2 = torch.optim.Adam(self.layer.parameters(), lr=0.1) - optimizer_1 = LightningOptimizer(optimizer_1) - lr_scheduler = torch.optim.lr_scheduler.StepLR(optimizer_1, step_size=1) return [optimizer_1, optimizer_2], [lr_scheduler] model = TestModel() + trainer = Trainer( default_root_dir=tmpdir, - limit_train_batches=10, + limit_train_batches=8, limit_val_batches=1, max_epochs=1, weights_summary=None, ) - trainer.fit(model) + with patch.multiple(torch.optim.SGD, zero_grad=DEFAULT, step=DEFAULT) as sgd, \ + patch.multiple(torch.optim.Adam, zero_grad=DEFAULT, step=DEFAULT) as adam: + trainer.fit(model) -def test_lightning_optimizer_automatic_optimization_optimizer_zero_grad(tmpdir): + assert sgd["step"].call_count == 4 + assert adam["step"].call_count == 2 + + assert sgd["zero_grad"].call_count == 4 + assert adam["zero_grad"].call_count == 2 + + +def test_lightning_optimizer_automatic_optimization_lbfgs_zero_grad(tmpdir): """ - Test lightning optimize works with optimizer_zero_grad overrides in automatic_optimization + Test zero_grad is called the same number of times as LBFGS requires + for reevaluation of the loss in automatic_optimization. 
""" + class TestModel(BoringModel): + def configure_optimizers(self): + return torch.optim.LBFGS(self.parameters()) - with patch("torch.optim.Adam.zero_grad") as adam_zero_grad, \ - patch("torch.optim.SGD.zero_grad") as sgd_zero_grad: - - class TestModel(BoringModel): + model = TestModel() + trainer = Trainer( + default_root_dir=tmpdir, + limit_train_batches=1, + limit_val_batches=1, + max_epochs=1, + weights_summary=None, + ) - def training_step(self, batch, batch_idx, optimizer_idx=None): - output = self.layer(batch) - loss = self.loss(batch, output) - return {"loss": loss} - - def training_epoch_end(self, outputs): - outputs = sum(outputs, []) - torch.stack([x["loss"] for x in outputs]).mean() - - def optimizer_zero_grad(self, epoch: int, batch_idx: int, optimizer: Optimizer, optimizer_idx: int): - if optimizer_idx == 0: - if batch_idx % 2 == 0: - optimizer.zero_grad() - - if optimizer_idx == 1: - if batch_idx % 5 == 0: - optimizer.zero_grad() - - def configure_optimizers(self): - optimizer_1 = torch.optim.SGD(self.layer.parameters(), lr=0.1) - optimizer_2 = torch.optim.Adam(self.layer.parameters(), lr=0.1) - lr_scheduler = torch.optim.lr_scheduler.StepLR(optimizer_1, step_size=1) - return [optimizer_1, optimizer_2], [lr_scheduler] - - model = TestModel() - trainer = Trainer( - default_root_dir=tmpdir, - limit_train_batches=20, - limit_val_batches=1, - max_epochs=1, - weights_summary=None, - ) + with patch("torch.optim.LBFGS.zero_grad") as zero_grad: trainer.fit(model) - assert adam_zero_grad.call_count == 4 - assert sgd_zero_grad.call_count == 10 + lbfgs = model.optimizers() + max_iter = lbfgs.param_groups[0]["max_iter"] + assert zero_grad.call_count == max_iter diff --git a/tests/models/test_hooks.py b/tests/models/test_hooks.py index 416a858537245..62a252eaa3128 100644 --- a/tests/models/test_hooks.py +++ b/tests/models/test_hooks.py @@ -446,12 +446,12 @@ def teardown(self, stage: str): 'on_epoch_start', 'on_train_epoch_start', 'on_train_batch_start', - 'on_after_backward', 'on_before_zero_grad', + 'on_after_backward', 'on_train_batch_end', 'on_train_batch_start', - 'on_after_backward', 'on_before_zero_grad', + 'on_after_backward', 'on_train_batch_end', 'on_train_epoch_end', 'on_epoch_end', diff --git a/tests/trainer/test_trainer.py b/tests/trainer/test_trainer.py index 167930425dab1..23861cdd2563b 100644 --- a/tests/trainer/test_trainer.py +++ b/tests/trainer/test_trainer.py @@ -26,6 +26,7 @@ import pytest import torch from omegaconf import OmegaConf +from torch.optim import SGD from torch.utils.data import DataLoader import tests.helpers.utils as tutils @@ -199,129 +200,32 @@ def test_strict_model_load(monkeypatch, tmpdir, tmpdir_server, url_ckpt): assert not failed, "Model should be loaded due to strict=False." 
-@pytest.mark.parametrize( - ["schedule", "expected"], - [ - pytest.param({ - 1: 2, - 3: 4 - }, [1, 2, 4]), - pytest.param(3, [3, 3, 3]), - pytest.param(4, [4, 4, 4]), - ], -) -def test_gradient_accumulation_scheduling(tmpdir, schedule, expected): - """ - Test grad accumulation by the freq of optimizer updates - """ - - # test incorrect configs - with pytest.raises(IndexError): - assert Trainer(accumulate_grad_batches={-1: 3, 1: 4, 4: 6}) - with pytest.raises(IndexError): - assert Trainer(accumulate_grad_batches={-2: 3}) - - with pytest.raises(TypeError): - assert Trainer(accumulate_grad_batches={}) - with pytest.raises(TypeError): - assert Trainer(accumulate_grad_batches=[[2, 3], [4, 6]]) - with pytest.raises(TypeError): - assert Trainer(accumulate_grad_batches={1: 2, 3.0: 4}) - with pytest.raises(TypeError): - assert Trainer(accumulate_grad_batches={1: 2.5, 3: 5}) - - model = EvalModelTemplate() - - trainer = Trainer( - accumulate_grad_batches=schedule, - limit_train_batches=0.7, # not to be divisible by accumulate_grad_batches on purpose - limit_val_batches=0.8, - max_epochs=4, - default_root_dir=tmpdir, - ) - - model.old_optimizer_step = model.optimizer_step - - # test optimizer call freq matches scheduler - def _optimizer_step( - epoch, - batch_idx, - optimizer, - optimizer_idx, - second_order_closure=None, - on_tpu=False, - using_native_amp=False, - using_lbfgs=False, - ): - # only test the first 12 batches in epoch - if batch_idx < 12: - if epoch == 0: - # reset counter when starting epoch - if batch_idx == expected[0] - 1: - model.prev_called_batch_idx = expected[0] - 1 - - # use this opportunity to test once - assert trainer.accumulate_grad_batches == expected[0] - - # separate check for last batch with accumulate 1 step - if expected[0] == 1 and (batch_idx + 1) == trainer.num_training_batches: - assert batch_idx == model.prev_called_batch_idx - elif (batch_idx + 1) == trainer.num_training_batches: - # prev_called_batch_idx - schedule + modulus remainder - assert batch_idx == (model.prev_called_batch_idx - expected[0] + (batch_idx + 1) % expected[0]) - else: - assert batch_idx == model.prev_called_batch_idx - model.prev_called_batch_idx += expected[0] - - elif 1 <= epoch <= 2: - # reset counter when starting epoch - if batch_idx == expected[1] - 1: - model.prev_called_batch_idx = expected[1] - 1 - - # use this opportunity to test once - assert trainer.accumulate_grad_batches == expected[1] - - if trainer.num_training_batches == batch_idx + 1: - # prev_called_batch_idx - schedule + modulus remainder - assert batch_idx == (model.prev_called_batch_idx - expected[1] + (batch_idx + 1) % expected[1]) - else: - assert batch_idx == model.prev_called_batch_idx - model.prev_called_batch_idx += expected[1] - - else: - if batch_idx == expected[2] - 1: - model.prev_called_batch_idx = expected[2] - 1 - - # use this opportunity to test once - assert trainer.accumulate_grad_batches == expected[2] - - if (batch_idx + 1) == trainer.num_training_batches: - # prev_called_batch_idx - schedule + modulus remainder - assert batch_idx == (model.prev_called_batch_idx - expected[2] + (batch_idx + 1) % expected[2]) - else: - assert batch_idx == model.prev_called_batch_idx - model.prev_called_batch_idx += expected[2] - - model.old_optimizer_step( - epoch, batch_idx, optimizer, optimizer_idx, second_order_closure, on_tpu, using_native_amp, using_lbfgs +@pytest.mark.parametrize("accumulate_grad_batches", (1, 2, 3)) +def test_trainer_accumulate_grad_batches_zero_grad(tmpdir, accumulate_grad_batches): + with 
patch("torch.optim.SGD.zero_grad") as sgd_zero_grad: + model = BoringModel() + trainer = Trainer( + default_root_dir=tmpdir, + limit_train_batches=20, + limit_val_batches=1, + max_epochs=1, + weights_summary=None, + accumulate_grad_batches=accumulate_grad_batches, ) + trainer.fit(model) + + assert sgd_zero_grad.call_count == math.ceil(trainer.limit_train_batches / accumulate_grad_batches) @pytest.mark.parametrize( ["accumulate_grad_batches", "limit_train_batches"], [ - pytest.param({ - 1: 2, - 3: 4 - }, 1.0), - pytest.param({ - 1: 2, - 3: 4 - }, 0.5), # not to be divisible by accumulate_grad_batches on purpose - pytest.param(3, 1.0), - pytest.param(3, 0.8), # not to be divisible by accumulate_grad_batches on purpose - pytest.param(4, 1.0), - pytest.param(4, 0.7), # not to be divisible by accumulate_grad_batches on purpose + ({1: 2, 3: 4}, 1.0), + ({1: 2, 3: 4}, 0.5), # not to be divisible by accumulate_grad_batches on purpose + (3, 1.0), + (3, 0.8), # not to be divisible by accumulate_grad_batches on purpose + (4, 1.0), + (4, 0.7), # not to be divisible by accumulate_grad_batches on purpose ], ) def test_gradient_accumulation_scheduling_last_batch(tmpdir, accumulate_grad_batches, limit_train_batches): @@ -329,20 +233,19 @@ def test_gradient_accumulation_scheduling_last_batch(tmpdir, accumulate_grad_bat class CurrentModel(BoringModel): - def on_batch_start(self, batch, batch_idx, dataloader_idx): + def on_batch_start(self, *_): self.on_train_batch_start_state_dict = self.state_dict() - def on_batch_end(self, outputs, batch, batch_idx, dataloader_idx): + def on_batch_end(self, outputs, batch, batch_idx, *_): self.on_train_batch_start_end_dict = self.state_dict() for key in self.on_train_batch_start_end_dict.keys(): + equal = torch.equal( + self.on_train_batch_start_state_dict[key], self.on_train_batch_start_end_dict[key] + ) if (batch_idx + 1) == self.trainer.num_training_batches: - assert torch.equal( - self.on_train_batch_start_state_dict[key], self.on_train_batch_start_end_dict[key] - ) + assert equal else: - assert not torch.equal( - self.on_train_batch_start_state_dict[key], self.on_train_batch_start_end_dict[key] - ) + assert not equal model = CurrentModel() @@ -1809,3 +1712,91 @@ def training_step(self, batch, batch_idx): model = TestModel() trainer = Trainer(default_root_dir=tmpdir, fast_dev_run=True, gpus=1) trainer.fit(model, train_data) + + +def test_train_loop_system(tmpdir): + """ + Test the following methods are called in the order in automatic optimization. + 1. optimizer.step (skip when gradient accumulation) + 2. model.training_step + 3. optimizer.zero_grad (run when the first batch of gradient accumulation) + 4. model.backward + + Note that the order is NOT `training_step`->`zero_grad`->`backward`->`step`. + This is because `optimizer.step(closure)` calls `closure()` which then calls + the three remaining methods `training_step`, `zero_grad` and `backward` inside. 
+ """ + called_methods = [] + + trainer_options = dict( + default_root_dir=tmpdir, + max_epochs=1, + limit_train_batches=5, + limit_val_batches=1, + limit_test_batches=1, + progress_bar_refresh_rate=0, + ) + + class TestOptimizer(SGD): + def step(self, *args, **kwargs): + called_methods.append("step") + return super().step(*args, **kwargs) + + def zero_grad(self, *args, **kwargs): + called_methods.append("zero_grad") + return super().zero_grad(*args, **kwargs) + + class TestModel(BoringModel): + def configure_optimizers(self): + return TestOptimizer(self.parameters(), lr=0.1) + + def training_step(self, *args, **kwargs): + called_methods.append("training_step") + return super().training_step(*args, **kwargs) + + def backward(self, *args, **kwargs): + called_methods.append("backward") + return super().backward(*args, **kwargs) + + model = TestModel() + trainer = Trainer(**trainer_options) + + # No methods are called yet. + assert called_methods == [] + + trainer.fit(model) + assert called_methods == [ + "step", + "training_step", + "zero_grad", + "backward", + ] * trainer.limit_train_batches + + called_methods.clear() + trainer = Trainer(**trainer_options, accumulate_grad_batches=3) + + # No methods are called yet. + assert called_methods == [] + + trainer.fit(model) + assert called_methods == [ + # 0 + "training_step", + "zero_grad", + "backward", + # 1 + "training_step", + "backward", + # 2 + "step", + "training_step", + "backward", + # 3 + "training_step", + "zero_grad", + "backward", + # 4 + "step", + "training_step", + "backward", + ]