Fix checkpoint callback & Trainer.test(_) issue for TPUs #6654

Merged · 11 commits · Mar 25, 2021
3 changes: 3 additions & 0 deletions CHANGELOG.md
@@ -168,6 +168,9 @@ The format is based on [Keep a Changelog](http://keepachangelog.com/en/1.0.0/).
- Fixed comparing required versions ([#6434](https://github.com/PyTorchLightning/pytorch-lightning/pull/6434))


- Fixed a checkpoint callback issue with TPUs when `checkpoint_callback` is set to `False` ([#6654](https://github.com/PyTorchLightning/pytorch-lightning/pull/6654))


## [1.2.4] - 2021-03-16

### Changed
3 changes: 2 additions & 1 deletion pytorch_lightning/plugins/training_type/tpu_spawn.py
@@ -130,7 +130,8 @@ def barrier(self, name: Optional[str] = None) -> None:
rendezvous(f"pl.Trainer.{name}")

def transfer_distrib_spawn_state_on_fit_end(self, results):
best_model_path = self.lightning_module.trainer.checkpoint_callback.best_model_path
checkpoint_callback = self.lightning_module.trainer.checkpoint_callback
best_model_path = checkpoint_callback.best_model_path if checkpoint_callback else None
Comment on lines +133 to +134
@ananthsub (Contributor) · Mar 24, 2021:
What happens if there are multiple checkpoint callbacks attached? Should we save once per path?

@awaelchli @carmocca this is going to be amplified if people are tracking multiple versions of "best model paths" at the same time, in an example like this:

trainer = Trainer(..., callbacks=[checkpoint1, checkpoint2])
trainer.fit(module)
trainer.test()  # <--- which checkpoint path is used for running this?

Should this raise an error due to ambiguity?

Contributor:
I think I'd rather use the first best path and log the path used when running test (see the sketch after this diff).


        if self.mp_queue is not None:
            rank_zero_warn("cleaning up ddp environment...")
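
As a side note on the review thread above: a minimal sketch of the "use the first best path and log it" idea (illustrative only, not part of this PR; the helper name is hypothetical). It picks the first attached ModelCheckpoint's best_model_path and logs which path test would use; an empty list also covers checkpoint_callback=False.

import logging

from pytorch_lightning.callbacks import ModelCheckpoint

log = logging.getLogger(__name__)


def first_best_model_path(trainer):
    """Return best_model_path of the first ModelCheckpoint attached to the trainer, or None."""
    checkpoint_callbacks = [cb for cb in trainer.callbacks if isinstance(cb, ModelCheckpoint)]
    if not checkpoint_callbacks:
        # No ModelCheckpoint attached (e.g. checkpoint_callback=False): nothing to restore from.
        return None
    best_path = checkpoint_callbacks[0].best_model_path
    log.info("Using checkpoint path %s for test", best_path)
    return best_path
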
15 changes: 15 additions & 0 deletions tests/models/test_tpu.py
@@ -383,3 +383,18 @@ def test_tpu_precision_16_clip_gradients(mock_clip_grad_norm, clip_val, tmpdir):
        mock_clip_grad_norm.assert_called()
    else:
        mock_clip_grad_norm.assert_not_called()


@RunIf(tpu=True)
@pl_multi_process_test
def test_if_test_works_with_checkpoint_false(tmpdir):
    """Ensure that the model trains properly when `checkpoint_callback` is set to False."""

    # Train a model on TPU
    model = BoringModel()
    trainer = Trainer(
        max_epochs=1,
        tpu_cores=8,
        default_root_dir=tmpdir,
        fast_dev_run=True,
        checkpoint_callback=False,
    )
    trainer.fit(model)
    assert trainer.state == TrainerState.FINISHED, f"Training failed with {trainer.state}"
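
For context, a minimal end-to-end sketch of the scenario this PR fixes (illustrative only; assumes a TPU runtime and that BoringModel is importable from the test helpers). With checkpoint_callback=False there is no ModelCheckpoint, so trainer.checkpoint_callback is None; before this change the TPU spawn plugin read checkpoint_callback.best_model_path unconditionally, which also broke a follow-up Trainer.test() call.

from pytorch_lightning import Trainer
from tests.helpers.boring_model import BoringModel  # assumed import path

model = BoringModel()
# No ModelCheckpoint is attached, so trainer.checkpoint_callback is None.
trainer = Trainer(max_epochs=1, tpu_cores=8, checkpoint_callback=False)
trainer.fit(model)

# With the guard in transfer_distrib_spawn_state_on_fit_end, best_model_path is
# simply None here instead of raising AttributeError, and test() can run.
trainer.test(model)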