Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[HOT-BUG] Checkpoint in callbacks list fails #4275

Closed
tchaton opened this issue Oct 21, 2020 · 4 comments · Fixed by #4284
Closed

[HOT-BUG] Checkpoint in callbacks list fails #4275

tchaton opened this issue Oct 21, 2020 · 4 comments · Fixed by #4284
Labels
bug Something isn't working good first issue Good for newcomers help wanted Open to be worked on priority: 0 High priority task
Milestone

Comments

@tchaton
Copy link
Contributor

tchaton commented Oct 21, 2020

🐛 Bug

Please reproduce using the BoringModel and post here

The ModelCheckpoint is not properly setup when provided through the list of callbacks.

def test_checkpoint_within_callbacks_list(tmpdir):
    """
    This test validates that the checkpoint can be called when provided to callacks list
    """

    os.environ['PL_DEV_DEBUG'] = '1'

    checkpoint_callback = ModelCheckpoint(monitor='val_loss', filepath=osp.join(tmpdir, "{epoch:02d}"))

    class ExtendedBoringModel(BoringModel):

        def validation_step(self, batch, batch_idx):
            output = self.layer(batch)
            loss = self.loss(batch, output)
            return {"val_loss": loss}

    model = ExtendedBoringModel()
    model.validation_step_end = None
    model.validation_epoch_end = None
    trainer = pl.Trainer(max_epochs=1, 
                         limit_train_batches=2, 
                         limit_val_batches=2, 
                         limit_test_batches=2, 
                         callbacks=[checkpoint_callback])

    trainer.fit(model)
    assert os.listdir(tmpdir) == ['epoch=00.ckpt]']

tests/checkpointing/test_model_checkpoint.py:590: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
pytorch_lightning/trainer/trainer.py:439: in fit
    results = self.accelerator_backend.train()
pytorch_lightning/accelerators/cpu_accelerator.py:48: in train
    results = self.train_or_test()
pytorch_lightning/accelerators/accelerator.py:66: in train_or_test
    results = self.trainer.train()
pytorch_lightning/trainer/trainer.py:482: in train
    self.train_loop.run_training_epoch()
pytorch_lightning/trainer/training_loop.py:569: in run_training_epoch
    self.trainer.run_evaluation(test_mode=False)
pytorch_lightning/trainer/trainer.py:609: in run_evaluation
    self.evaluation_loop.on_evaluation_end()
pytorch_lightning/trainer/evaluation_loop.py:109: in on_evaluation_end
    self.trainer.call_hook('on_validation_end', *args, **kwargs)
pytorch_lightning/trainer/trainer.py:822: in call_hook
    trainer_hook(*args, **kwargs)
pytorch_lightning/trainer/callback_hook.py:177: in on_validation_end
    callback.on_validation_end(self, self.get_model())
pytorch_lightning/callbacks/model_checkpoint.py:167: in on_validation_end
    self.save_checkpoint(trainer, pl_module)
pytorch_lightning/callbacks/model_checkpoint.py:213: in save_checkpoint
    self._save_top_k_checkpoints(monitor_candidates, trainer, pl_module, epoch, filepath)
pytorch_lightning/callbacks/model_checkpoint.py:494: in _save_top_k_checkpoints
    self._update_best_and_save(filepath, current, epoch, trainer, pl_module)
pytorch_lightning/callbacks/model_checkpoint.py:543: in _update_best_and_save
    self._save_model(filepath, trainer, pl_module)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

self = <pytorch_lightning.callbacks.model_checkpoint.ModelCheckpoint object at 0x13a775b80>
filepath = '/private/var/folders/q9/1sc07dt91w9581rl1vt6vd4w0000gn/T/pytest-of-thomas/pytest-1058/test_checkpoint_within_callbac0/epoch=00.ckpt'
trainer = <pytorch_lightning.trainer.trainer.Trainer object at 0x13a7756a0>, pl_module = ExtendedBoringModel(
  (layer): Linear(in_features=32, out_features=2, bias=True)
)

    def _save_model(self, filepath: str, trainer, pl_module):
        # in debugging, track when we save checkpoints
        trainer.dev_debugger.track_checkpointing_history(filepath)
    
        # make paths
        if trainer.is_global_zero:
            self._fs.makedirs(os.path.dirname(filepath), exist_ok=True)
    
        # delegate the saving to the trainer
        if self.save_function is not None:
            self.save_function(filepath, self.save_weights_only)
        else:
>           raise ValueError(".save_function() not set")
E           ValueError: .save_function() not set

pytorch_lightning/callbacks/model_checkpoint.py:297: ValueError
----------------------------------------------------------------------------------------- Captured log call ------------------------------------------------------------------------------------------
INFO     lightning:distributed.py:49 GPU available: False, used: False
INFO     lightning:distributed.py:49 TPU available: False, using: 0 TPU cores
INFO     lightning:lightning.py:1290 
  | Name  | Type   | Params
---------------------------------
0 | layer | Linear | 66
========================================================================================== warnings summary ==========================================================================================
.venv/lib/python3.8/site-packages/comet_ml/monkey_patching.py:19
  /Users/thomas/Documents/projects/pytorch-lightning/.venv/lib/python3.8/site-packages/comet_ml/monkey_patching.py:19: DeprecationWarning: the imp module is deprecated in favour of importlib; see the module's documentation for alternative uses
    import imp

.venv/lib/python3.8/site-packages/pandas/compat/__init__.py:120
  /Users/thomas/Documents/projects/pytorch-lightning/.venv/lib/python3.8/site-packages/pandas/compat/__init__.py:120: UserWarning: Could not import the lzma module. Your installed Python is incomplete. Attempting to use lzma compression will result in a RuntimeError.
    warnings.warn(msg)

.venv/lib/python3.8/site-packages/wandb/util.py:35
  /Users/thomas/Documents/projects/pytorch-lightning/.venv/lib/python3.8/site-packages/wandb/util.py:35: DeprecationWarning: Using or importing the ABCs from 'collections' instead of from 'collections.abc' is deprecated since Python 3.3, and in 3.9 it will stop working
    from collections import namedtuple, Mapping, Sequence

.venv/lib/python3.8/site-packages/wandb/vendor/graphql-core-1.1/graphql/type/directives.py:55
  /Users/thomas/Documents/projects/pytorch-lightning/.venv/lib/python3.8/site-packages/wandb/vendor/graphql-core-1.1/graphql/type/directives.py:55: DeprecationWarning: Using or importing the ABCs from 'collections' instead of from 'collections.abc' is deprecated since Python 3.3, and in 3.9 it will stop working
    assert isinstance(locations, collections.Iterable), 'Must provide locations for directive.'

tests/checkpointing/test_model_checkpoint.py::test_checkpoint_within_callbacks_list
  /Users/thomas/Documents/projects/pytorch-lightning/pytorch_lightning/utilities/distributed.py:45: UserWarning: The dataloader, val dataloader 0, does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` (try 12 which is the number of cpus on this machine) in the `DataLoader` init to improve performance.
    warnings.warn(*args, **kwargs)

tests/checkpointing/test_model_checkpoint.py::test_checkpoint_within_callbacks_list
  /Users/thomas/Documents/projects/pytorch-lightning/pytorch_lightning/utilities/distributed.py:45: UserWarning: The dataloader, train dataloader, does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` (try 12 which is the number of cpus on this machine) in the `DataLoader` init to improve performance.
    warnings.warn(*args, **kwargs)

-- Docs: https://docs.pytest.org/en/stable/warnings.html
====================================================================================== short test summary info =======================================================================================
FAILED tests/checkpointing/test_model_checkpoint.py::test_checkpoint_within_callbacks_list - ValueError: .save_function() not set

To Reproduce

Expected behavior

Environment

Please copy and paste the output from our
environment collection script
(or fill out the checklist below manually).

You can get the script and run it with:

wget https://raw.githubusercontent.com/PyTorchLightning/pytorch-lightning/master/tests/collect_env_details.py
# For security purposes, please check the contents of collect_env_details.py before running it.
python collect_env_details.py
  • PyTorch Version (e.g., 1.0):
  • OS (e.g., Linux):
  • How you installed PyTorch (conda, pip, source):
  • Build command you used (if compiling from source):
  • Python version:
  • CUDA/cuDNN version:
  • GPU models and configuration:
  • Any other relevant information:

Additional context

@tchaton tchaton added bug Something isn't working help wanted Open to be worked on good first issue Good for newcomers Important labels Oct 21, 2020
@Borda Borda added priority: 0 High priority task and removed Important labels Oct 21, 2020
@Borda Borda added this to the 1.0.x milestone Oct 21, 2020
@ydcjeff
Copy link
Contributor

ydcjeff commented Oct 21, 2020

isn't passing ModelCheckpoint in callbacks flag WIP?
here is the PR: #3990
@awaelchli

@awaelchli
Copy link
Contributor

Yes exactly. ModelCheckpoint (currently) is special, meaning there can only be one and you have to pass it to Trainer(checkpoint_callback=), passing to callbacks argument is not allowed.

In my PR I try to remove this limitation partially, and it requires several stages of changes and multiple PRs to get this done.

@williamFalcon
Copy link
Contributor

the solution here is to move the save function inside the callback. no need to have these separated anymore

@awaelchli
Copy link
Contributor

That's exactly what I'm doing here:
#3990

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working good first issue Good for newcomers help wanted Open to be worked on priority: 0 High priority task
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants