
Commit

Fixes recovering when the model expects metrics to be ready (allenai#5293)

* Fixes recovering when the model expects metrics to be ready

* Changelog
dirkgr authored Jul 1, 2021
1 parent 7428155 commit 5378533
Showing 2 changed files with 18 additions and 11 deletions.
1 change: 1 addition & 0 deletions CHANGELOG.md
@@ -29,6 +29,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 - Ensured `ensure_model_can_train_save_and_load` is consistently random.
 - Fixed weight tying logic in `T5` transformer module. Previously input/output embeddings were always tied. Now this is optional,
   and the default behavior is taken from the `config.tie_word_embeddings` value when instantiating `from_pretrained_module()`.
+- Fixed recovering training jobs with models that expect `get_metrics()` to not be called until they have seen at least one batch.

### Changed

28 changes: 17 additions & 11 deletions allennlp/training/gradient_descent_trainer.py
@@ -555,17 +555,23 @@ def _train_epoch(self, epoch: int) -> Dict[str, float]:
         if self._distributed:
             dist.barrier()

-        metrics = training_util.get_metrics(
-            self.model,
-            train_loss,
-            train_reg_loss,
-            batch_loss=None,
-            batch_reg_loss=None,
-            num_batches=self._batches_in_epoch_completed,
-            reset=True,
-            world_size=self._world_size,
-            cuda_device=self.cuda_device,
-        )
+        if self._epochs_completed < self._start_after_epochs_completed or (
+            self._epochs_completed == self._start_after_epochs_completed
+            and self._batches_in_epoch_completed - 1 < self._start_after_batches_in_epoch_completed
+        ):
+            metrics = {}
+        else:
+            metrics = training_util.get_metrics(
+                self.model,
+                train_loss,
+                train_reg_loss,
+                batch_loss=None,
+                batch_reg_loss=None,
+                num_batches=self._batches_in_epoch_completed,
+                reset=True,
+                world_size=self._world_size,
+                cuda_device=self.cuda_device,
+            )

         for (worker, memory) in cpu_memory_usage:
             metrics["worker_" + str(worker) + "_memory_MB"] = memory / (1024 * 1024)
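The guard added in this diff can be sketched in isolation. When a training job is recovered from a checkpoint, the trainer fast-forwards through batches it already completed, so the model has not actually seen a batch in the new process; calling `get_metrics()` then would break models that assert otherwise. A minimal standalone sketch, assuming the attribute names from the diff (the `RecoverySketch` class and its `epoch_metrics` method are hypothetical stand-ins, not the real `GradientDescentTrainer` API):

```python
from typing import Dict


class RecoverySketch:
    """Hypothetical stand-in for the trainer state involved in the fix."""

    def __init__(
        self,
        epochs_completed: int,
        batches_in_epoch_completed: int,
        start_after_epochs_completed: int,
        start_after_batches_in_epoch_completed: int,
    ) -> None:
        # Mirrors the attributes referenced in the diff.
        self._epochs_completed = epochs_completed
        self._batches_in_epoch_completed = batches_in_epoch_completed
        self._start_after_epochs_completed = start_after_epochs_completed
        self._start_after_batches_in_epoch_completed = (
            start_after_batches_in_epoch_completed
        )

    def _still_fast_forwarding(self) -> bool:
        # Same condition as the diff: before the checkpointed epoch, or in
        # the checkpointed epoch but before the checkpointed batch.
        return self._epochs_completed < self._start_after_epochs_completed or (
            self._epochs_completed == self._start_after_epochs_completed
            and self._batches_in_epoch_completed - 1
            < self._start_after_batches_in_epoch_completed
        )

    def epoch_metrics(self) -> Dict[str, float]:
        if self._still_fast_forwarding():
            # Don't touch the model's metrics until recovery is passed.
            return {}
        # The real trainer calls training_util.get_metrics(...) here;
        # a placeholder value stands in for it.
        return {"loss": 0.0}
```

With these names, a trainer still replaying epoch 0 of a checkpoint taken after epoch 1 returns `{}`, while a trainer past the recovery point computes metrics normally.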
