rename Model steps #1051

Merged (37 commits, Mar 5, 2020)

Commits (37, all by williamFalcon, Mar 5, 2020):

4972608  training_end renamed to training_step_end
738692c  training_end renamed to training_step_end
84560fd  training_end renamed to training_step_end
a3dbe56  training_end renamed to training_step_end
165ca51  training_end to training_step_end
890364e  training_end to training_step_end
f76fdb8  training_end to training_step_end
39b66c3  training_end to training_step_end
e7e1ce9  fix lost model reference
9db4d1f  training_end to training_step_end
e13beac  training_end to training_step_end
684fc47  training_end to training_step_end
c508dfb  training_end to training_step_end
8b8679c  training_end to training_step_end
2afc704  training_end to training_step_end
697cb5d  training_end to training_step_end
4938b81  training_end to training_step_end
78e2435  training_end to training_step_end
8d980a9  training_end to training_step_end
bfa3fdd  training_end to training_step_end
77baa64  training_end to training_step_end
3f7d5e0  training_end to training_step_end
b13b348  training_end to training_step_end
e3ac274  training_end to training_step_end
0a27d95  training_end to training_step_end
a106c47  training_end to training_step_end
bc4db9f  training_end to training_step_end
8199a73  training_end to training_step_end
aa97340  training_end to training_step_end
8568674  training_end to training_step_end
b964b70  training_end to training_step_end
5a9e405  training_end to training_step_end
2184e52  training_end to training_step_end
c56b09c  training_end to training_step_end
f687043  training_end to training_step_end
a401841  training_end to training_step_end
7268839  training_end to training_step_end

22 changes: 5 additions & 17 deletions docs/source/experiment_reporting.rst
@@ -34,47 +34,35 @@ Log metrics

To plot metrics into whatever logger you passed in (TensorBoard, Comet, Neptune, etc.):

-1. Training_end, validation_end, test_end will all log anything in the "log" key of the return dict.
+1. training_epoch_end, validation_epoch_end, test_epoch_end will all log anything in the "log" key of the return dict.

Member (review comment): Perhaps these should say validation_epoch_end or validation_step_end?

.. code-block:: python

-    def training_end(self, outputs):
+    def training_epoch_end(self, outputs):
        loss = some_loss()
        ...

        logs = {'train_loss': loss}
        results = {'log': logs}
        return results

-    def validation_end(self, outputs):
+    def validation_epoch_end(self, outputs):
        loss = some_loss()
        ...

        logs = {'val_loss': loss}
        results = {'log': logs}
        return results

-    def test_end(self, outputs):
+    def test_epoch_end(self, outputs):
        loss = some_loss()
        ...

        logs = {'test_loss': loss}
        results = {'log': logs}
        return results

-2. Most of the time, you only need training_step and not training_end. You can also return logs from here:
-
-.. code-block:: python
-
-    def training_step(self, batch, batch_idx):
-        loss = some_loss()
-        ...
-
-        logs = {'train_loss': loss}
-        results = {'log': logs}
-        return results
-
-3. In addition, you can also use any arbitrary functionality from a particular logger from within your LightningModule.
+2. In addition, you can also use any arbitrary functionality from a particular logger from within your LightningModule.
For instance, here we log images using tensorboard.

.. code-block:: python
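
The original snippet is collapsed in this view. As a rough sketch of what such
direct logger access can look like, assuming the TensorBoard logger (whose
experiment attribute is a SummaryWriter); the module, layer sizes, and tag
names here are illustrative, not from this PR:

.. code-block:: python

    import pytorch_lightning as pl
    import torch
    import torch.nn.functional as F

    class LitModel(pl.LightningModule):
        def __init__(self):
            super().__init__()
            self.layer = torch.nn.Linear(28 * 28, 10)

        def training_step(self, batch, batch_idx):
            x, y = batch
            loss = F.cross_entropy(self.layer(x.view(x.size(0), -1)), y)

            # self.logger.experiment is the underlying experiment object;
            # with the TensorBoard logger it is a SummaryWriter, so any of
            # its methods (add_image, add_histogram, ...) can be called
            self.logger.experiment.add_image('example_images', x[0], batch_idx)

            return {'loss': loss, 'log': {'train_loss': loss}}
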
2 changes: 1 addition & 1 deletion docs/source/hooks.rst
@@ -26,7 +26,7 @@ Training loop
- on_batch_start
- tbptt_split_batch
- training_step
-- training_end (optional)
+- training_step_end (optional)
- backward
- on_after_backward
- optimizer.step()
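
For reference, a minimal sketch of where the renamed hook sits in this loop;
the model, loss, and hook bodies are illustrative, not part of this PR:

.. code-block:: python

    import pytorch_lightning as pl
    import torch
    import torch.nn.functional as F

    class HookOrderModel(pl.LightningModule):
        def __init__(self):
            super().__init__()
            self.layer = torch.nn.Linear(32, 2)

        def training_step(self, batch, batch_idx):
            # called once per batch (once per batch piece under dp/ddp2)
            x, y = batch
            return {'loss': F.cross_entropy(self.layer(x), y)}

        def training_step_end(self, step_output):
            # called right after training_step and before backward;
            # under dp/ddp2 it receives the gathered per-piece outputs
            return step_output

        def on_after_backward(self):
            # called after the backward pass, before optimizer.step()
            pass
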
45 changes: 40 additions & 5 deletions docs/source/multi_gpu.rst
@@ -165,12 +165,13 @@ you will only be operating on one of those pieces.
y_0 = batch

For most metrics, this doesn't really matter. However, if you want
-full batch statistics or want to use the outputs of the training_step
-to do something like a softmax, you can use the `training_end` step.
+to add something to your computational graph (like softmax)
+using all batch parts, you can use the `training_step_end` step.

.. code-block:: python

-    def training_end(self, outputs):
+    def training_step_end(self, outputs):
        # only use when on dp
        outputs = torch.cat(outputs, dim=1)
        softmax = torch.softmax(outputs, dim=1)
        out = softmax.mean()
@@ -195,9 +196,43 @@ In pseudocode, the full sequence is:
out = gpu_model(batch_split)
all_results.append(out)

-# calculate statistics for all parts of the batch
-full out = model.training_end(all_results)
+# use the full batch for something like softmax
+full_out = model.training_step_end(all_results)

+To illustrate why this is needed, let's look at DataParallel:

+.. code-block:: python
+
+    def training_step(self, batch, batch_idx):
+        x, y = batch
+        y_hat = self.forward(x)
+
+        # on dp or ddp2, if we did softmax now it would be wrong
+        # because batch is actually a piece of the full batch
+        return y_hat
+
+    def training_step_end(self, batch_parts_outputs):
+        # batch_parts_outputs has the outputs of each part of the batch
+
+        # do the softmax here, over the full batch
+        outputs = torch.cat(batch_parts_outputs, dim=1)
+        softmax = torch.softmax(outputs, dim=1)
+        out = softmax.mean()
+
+        return out

+If `training_step_end` is defined, it will be called regardless of TPU, dp, ddp, etc., which means
+it behaves the same no matter the backend.
+
+The validation and test steps have the same option when using dp:

+.. code-block:: python
+
+    def validation_step_end(self, batch_parts_outputs):
+        ...
+
+    def test_step_end(self, batch_parts_outputs):
+        ...
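
As a rough sketch of how the validation variant can be used under dp to get a
full-batch metric; the model, metric, and dict field names are illustrative,
not part of this PR (this assumes dp gathers dict values by concatenating the
per-piece tensors):

.. code-block:: python

    import pytorch_lightning as pl
    import torch

    class LitModel(pl.LightningModule):
        def __init__(self):
            super().__init__()
            self.layer = torch.nn.Linear(32, 4)

        def forward(self, x):
            return self.layer(x)

        def validation_step(self, batch, batch_idx):
            # under dp this sees only one piece of the batch
            x, y = batch
            return {'logits': self.forward(x), 'y': y}

        def validation_step_end(self, batch_parts_outputs):
            # outputs gathered from every batch piece; computing the
            # metric here makes it match single-GPU behavior
            logits = batch_parts_outputs['logits']
            y = batch_parts_outputs['y']
            acc = (logits.argmax(dim=1) == y).float().mean()
            return {'val_acc': acc}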

Implement Your Own Distributed (DDP) training
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^