
Commit

Fixes recovering when the model expects metrics to be ready (allenai#5293)

* Fixes recovering when the model expects metrics to be ready

* Changelog
dirkgr authored Jul 1, 2021
1 parent 7428155 commit 5378533
Showing 2 changed files with 18 additions and 11 deletions.
1 change: 1 addition & 0 deletions CHANGELOG.md
@@ -29,6 +29,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 - Ensured `ensure_model_can_train_save_and_load` is consistently random.
 - Fixed weight tying logic in `T5` transformer module. Previously input/output embeddings were always tied. Now this is optional,
   and the default behavior is taken from the `config.tie_word_embeddings` value when instantiating `from_pretrained_module()`.
+- Fixed recovering training jobs with models that expect `get_metrics()` to not be called until they have seen at least one batch.

### Changed

28 changes: 17 additions & 11 deletions allennlp/training/gradient_descent_trainer.py
@@ -555,17 +555,23 @@ def _train_epoch(self, epoch: int) -> Dict[str, float]:
         if self._distributed:
             dist.barrier()

-        metrics = training_util.get_metrics(
-            self.model,
-            train_loss,
-            train_reg_loss,
-            batch_loss=None,
-            batch_reg_loss=None,
-            num_batches=self._batches_in_epoch_completed,
-            reset=True,
-            world_size=self._world_size,
-            cuda_device=self.cuda_device,
-        )
+        if self._epochs_completed < self._start_after_epochs_completed or (
+            self._epochs_completed == self._start_after_epochs_completed
+            and self._batches_in_epoch_completed - 1 < self._start_after_batches_in_epoch_completed
+        ):
+            metrics = {}
+        else:
+            metrics = training_util.get_metrics(
+                self.model,
+                train_loss,
+                train_reg_loss,
+                batch_loss=None,
+                batch_reg_loss=None,
+                num_batches=self._batches_in_epoch_completed,
+                reset=True,
+                world_size=self._world_size,
+                cuda_device=self.cuda_device,
+            )

         for (worker, memory) in cpu_memory_usage:
             metrics["worker_" + str(worker) + "_memory_MB"] = memory / (1024 * 1024)
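The guard added in this diff can be sketched in isolation. When a training job is recovered from a checkpoint, the trainer fast-forwards through batches it already completed, so the model has not actually seen a batch in the new process; calling `get_metrics()` then would break models that assert otherwise. A minimal standalone sketch, assuming the attribute names from the diff (the `RecoverySketch` class and its `epoch_metrics` method are hypothetical stand-ins, not the real `GradientDescentTrainer` API):

```python
from typing import Dict


class RecoverySketch:
    """Hypothetical stand-in for the trainer state involved in the fix."""

    def __init__(
        self,
        epochs_completed: int,
        batches_in_epoch_completed: int,
        start_after_epochs_completed: int,
        start_after_batches_in_epoch_completed: int,
    ) -> None:
        # Mirrors the attributes referenced in the diff.
        self._epochs_completed = epochs_completed
        self._batches_in_epoch_completed = batches_in_epoch_completed
        self._start_after_epochs_completed = start_after_epochs_completed
        self._start_after_batches_in_epoch_completed = (
            start_after_batches_in_epoch_completed
        )

    def _still_fast_forwarding(self) -> bool:
        # Same condition as the diff: before the checkpointed epoch, or in
        # the checkpointed epoch but before the checkpointed batch.
        return self._epochs_completed < self._start_after_epochs_completed or (
            self._epochs_completed == self._start_after_epochs_completed
            and self._batches_in_epoch_completed - 1
            < self._start_after_batches_in_epoch_completed
        )

    def epoch_metrics(self) -> Dict[str, float]:
        if self._still_fast_forwarding():
            # Don't touch the model's metrics until recovery is passed.
            return {}
        # The real trainer calls training_util.get_metrics(...) here;
        # a placeholder value stands in for it.
        return {"loss": 0.0}
```

With these names, a trainer still replaying epoch 0 of a checkpoint taken after epoch 1 returns `{}`, while a trainer past the recovery point computes metrics normally.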
