[CLI] Can't launch test command from checkpoint because "fit" key added to top level of CLI config #11463

Closed
F-Barto opened this issue Jan 13, 2022 · 9 comments · Fixed by #11532

@F-Barto

F-Barto commented Jan 13, 2022

🐛 Bug

Hello everyone and many thanks for your awesome work.

The example given here is a dummy one; I reduced my issue to a simple, reproducible example.
Let's consider the fit and test stages. Between those two stages, the model has the same parameters, but the data might be different (a different split).

When using the CLI for the test command, I expected to only have to give a config override for the data, but I have to provide the config for everything.

python trainer.py fit --config=config.yaml 

Followed by

python trainer.py test --ckpt_path "lightning_logs/version_0/checkpoints/epoch=9-step=319.ckpt" --config=test_data.yaml

won't work as is and will raise a TypeError: empty(): ....

I can't use the config from the log file:

python trainer.py test --ckpt_path "lightning_logs/version_0/checkpoints/epoch=9-step=319.ckpt" --config="lightning_logs/version_0/config.yaml" --config=test_data.yaml

as it raises test_trainer.py: error: 'Configuration check failed :: No action for destination key "fit.model.chans" to check its value.', because a "fit" key is added by the CLI at the top level of the saved config file.

I have to use:

python trainer.py test --ckpt_path "lightning_logs/version_0/checkpoints/epoch=9-step=319.ckpt"  --config=config.yaml  --config=test_data.yaml

This makes a trained network (or its ckpt, or log dir) not self-contained, and it is easy to use the wrong config at test time.
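
As a possible stopgap (just a sketch on my side, not something Lightning provides, and it may still leave fit-only keys behind), the extra top-level key could be stripped before reusing the saved config:

strip_subcommand_key.py

import sys
import yaml  # PyYAML

# usage: python strip_subcommand_key.py lightning_logs/version_0/config.yaml stripped.yaml
src, dst = sys.argv[1], sys.argv[2]
with open(src) as f:
    cfg = yaml.safe_load(f)
# the CLI nests everything under the subcommand name, e.g. "fit"
cfg = cfg.get("fit", cfg)
with open(dst, "w") as f:
    yaml.safe_dump(cfg, f)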

To Reproduce

config.yaml

model:
  chans: 32
data:
  size: 32
  length: 64
  batch_size: 2

test_data.yaml

data:
  size: 16
  length: 62
  batch_size: 2

In reality, the data parameters would differ much more (e.g., split_file: train.txt vs. split_file: test.txt); here they only differ enough to illustrate the point.

trainer.py

import pytorch_lightning as pl
from typing import Optional

import torch
from torch.utils.data import Dataset, DataLoader

from pytorch_lightning import LightningModule
from pytorch_lightning.utilities.cli import LightningCLI


class RandomDataset(Dataset):
    def __init__(self, size, length):
        super().__init__()
        self.len = length
        self.data = torch.randn(length, size)

    def __getitem__(self, index):
        return self.data[index]

    def __len__(self):
        return self.len

class RandomDataModule(pl.LightningDataModule):

    def __init__(self, size, length, batch_size):
        super().__init__()
        self.size = size
        self.length = length
        self.batch_size = batch_size

    def setup(self, stage: Optional[str] = None):
        """
        Args:
            stage: used to separate setup logic for trainer.{fit,validate,test}.
                If setup is called with stage = None, we assume all stages have been set-up.
        """

        if stage in (None, "fit"):
            self.train_dataset = RandomDataset(self.size, self.length)

        if stage in (None, "test"):
            self.test_dataset = RandomDataset(self.size, self.length)

    def train_dataloader(self):
        return DataLoader(self.train_dataset, batch_size=self.batch_size)

    def test_dataloader(self):
        return DataLoader(self.test_dataset, batch_size=self.batch_size)


class BoringModel(LightningModule):
    def __init__(self, chans: int):
        super().__init__()
        self.layer = torch.nn.Linear(chans, 2)

    def forward(self, x):
        return self.layer(x)

    def training_step(self, batch, batch_idx):
        loss = self(batch).sum()
        self.log("train_loss", loss)
        return {"loss": loss}

    def test_step(self, batch, batch_idx):
        metric = self(batch).sum()
        return {"metric": metric}

    def test_epoch_end(self, outputs):
        list_of_metrics = [output['metric'] for output in outputs]
        avg_metric = torch.stack(list_of_metrics).mean()
        self.log("metric", avg_metric)

    def configure_optimizers(self):
        return torch.optim.SGD(self.layer.parameters(), lr=0.1)


class CLI(LightningCLI):
    def add_arguments_to_parser(self, parser) -> None:
        parser.set_defaults({"trainer.max_epochs": 10})


CLI(
    BoringModel,
    RandomDataModule,
    save_config_overwrite=True,
)

Expected behavior

I would expect the following sequence of commands to work:

train

python trainer.py fit --config=config.yaml 

test

python trainer.py test --ckpt_path "lightning_logs/version_0/checkpoints/epoch=9-step=319.ckpt" --config=test_data.yaml

or (for test):

python trainer.py test --ckpt_path "lightning_logs/version_0/checkpoints/epoch=9-step=319.ckpt" --config="lightning_logs/version_0/config.yaml" --config=test_data.yaml

but lightning_logs/version_0/config.yaml now has an additional "fit" key:

fit:
  model:
    chans: 32
  data:
    size: 32
    length: 64
    batch_size: 2

Environment

  • Packages:
    • pyTorch_version: 1.10.0
    • pytorch-lightning: 1.6.0.dev0
    • jsonargparse: 4.1.1

Additional context

Related issues:
#10460

Have a good day.

cc @carmocca @mauvilsa @rbracco

@F-Barto F-Barto added the bug Something isn't working label Jan 13, 2022
@carmocca carmocca added the lightningcli pl.cli.LightningCLI label Jan 13, 2022
@carmocca carmocca self-assigned this Jan 13, 2022
@carmocca carmocca added this to the 1.5.x milestone Jan 13, 2022
@mauvilsa
Contributor

To not have fit in the saved config, the save-config callback would need to be instantiated with the subcommand parser and only that subcommand's config.

https://github.com/PyTorchLightning/pytorch-lightning/blob/d95c0d5a4441249346197dfa1f7b459d1ffd9fbf/pytorch_lightning/utilities/cli.py#L702-L703
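
Roughly, the distinction looks like this with plain jsonargparse (a standalone sketch of the idea, not the actual Lightning code behind that link):

from jsonargparse import ArgumentParser

fit_parser = ArgumentParser()
fit_parser.add_argument("--model.chans", type=int, default=32)

parser = ArgumentParser()
subcommands = parser.add_subcommands()
subcommands.add_subcommand("fit", fit_parser)

cfg = parser.parse_args(["fit", "--model.chans=16"])

# dumping with the top-level parser nests everything under the subcommand,
# which is what currently ends up in lightning_logs/version_0/config.yaml ("fit: ...")
print(parser.dump(cfg))

# dumping the subcommand's slice with the subcommand parser drops that key,
# producing a config that can be fed back to fit or test
print(fit_parser.dump(cfg.fit))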

@carmocca
Contributor

@mauvilsa would you like to work on fixing this?

@F-Barto
Author

F-Barto commented Jan 14, 2022

I was just wondering. Correct me if I am wrong, but using self.save_hyperparameters() is recommended practice and so the checkpoint should contain the hyperparams necessary to load the model.

Would it be possible to do something like this:

  1. Init the model with the hparams stored in the checkpoint
  2. Load the weights, also using what is stored in the checkpoint
  3. Let the user give only the datamodule config for the test stage

This would allow one to only have to run:
python trainer.py test --ckpt_path "lightning_logs/version_0/checkpoints/epoch=9-step=319.ckpt" --config=test_data.yaml

Maybe 1 and 2 could be done with load_from_checkpoint when the model is given to the CLI:

CLI(
    BoringModel,
    RandomDataModule,
    save_config_overwrite=True,
)

However, when dealing with CLI(), both the class_path and init_args would need to have been saved in the ckpt.
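
To make steps 1-3 concrete, this is roughly the manual equivalent outside of the CLI (a sketch; it assumes BoringModel and RandomDataModule from the repro above, and that BoringModel calls self.save_hyperparameters() in __init__, which the repro currently does not, otherwise chans would have to be passed explicitly):

import pytorch_lightning as pl

ckpt = "lightning_logs/version_0/checkpoints/epoch=9-step=319.ckpt"
model = BoringModel.load_from_checkpoint(ckpt)           # steps 1-2: hparams + weights from the ckpt
dm = RandomDataModule(size=32, length=62, batch_size=2)  # step 3: only the test-time data config
pl.Trainer().test(model, datamodule=dm)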

@mauvilsa
Contributor

@mauvilsa would you like to work on fixing this?

I could, but I'm not sure when. Independent of this, is there any reason why we shouldn't do it, or anything else that needs to be done? One minor issue I can think of is that by removing the subcommand key there is no way to know which subcommand was executed.

@mauvilsa
Contributor

I was just wondering. Correct me if I am wrong, but using self.save_hyperparameters() is recommended practice and so the checkpoint should contain the hyperparams necessary to load the model.

I think it would be great if the model hyperparameters were saved in the checkpoint so that for test/predict the model can be loaded without the need to give a config. Note, though, that LightningCLI supports more than what self.save_hyperparameters() is currently able to deliver. More specifically, with LightningCLI it is possible to have models that work via composition, so what __init__ sees are instances of classes, and save_hyperparameters has no way to know what is needed to instantiate them. There are also linked arguments, which might need special handling. I guess there are ways in which save_hyperparameters could be extended to support this, but I have only used fit up to now and haven't looked into the details of save_hyperparameters, so there isn't much more I can say about it right now.
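
To illustrate the kind of composition meant here (a made-up example, not from this issue's repro): the model only receives an already-built instance in __init__, while the information needed to build it, the class_path and init_args, lives in the CLI config and is never seen by save_hyperparameters:

import torch
from pytorch_lightning import LightningModule

class ComposedModel(LightningModule):
    def __init__(self, backbone: torch.nn.Module):
        super().__init__()
        # the model only holds the instance; which class it is and how it was
        # constructed is known only to the CLI config below
        self.backbone = backbone

and the corresponding part of the CLI config:

model:
  backbone:
    class_path: torch.nn.Linear
    init_args:
      in_features: 32
      out_features: 2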

@carmocca
Contributor

carmocca commented Jan 18, 2022

So there are two topics being discussed here:

  1. A bug. The saved config includes the subcommand used, even though the user passed a config for a specific one.

is there any reason why we shouldn't do this or what else needs to be done?

I don't think so.

  2. A feature. That the CLI uses the ckpt path or hyperparameters yaml to load the model configuration.

This should be discussed in a separate issue. However, saving hyperparameters is not required with the CLI, and is perhaps even discouraged, since the CLI config can do much more, as @mauvilsa mentioned above.

Load the weights, also using what is stored in the checkpoint

We need the Trainer to be set up for this. This is out of the scope of the CLI.

@carmocca
Contributor

I opened #11532, which fixes:

python trainer.py test --ckpt_path "lightning_logs/version_0/checkpoints/epoch=9-step=319.ckpt" --config="lightning_logs/version_0/config.yaml" --config=test_data.yaml

@F-Barto
Author

F-Barto commented Jan 19, 2022

Thank you very much @carmocca and @mauvilsa !

I agree with you on the two points you mentioned, @carmocca.
For 2., I think the question can be extended further.

How do we, as users, use the Lightning CLI to run the standard model usage pipeline (train-val, test, and predict)? What parameters do I have to provide to the CLI at each stage (configs, checkpoint, config overrides, ...)? Why do I need to configure the optimizer and scheduler for the test stage (if using linked arguments, it throws an error in the test stage if they are not configured)?
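
(For context, by configuring the optimizer and scheduler through linked arguments I mean something along these lines; a minimal sketch, not my actual setup:)

import torch
from pytorch_lightning.utilities.cli import LightningCLI

class CLIWithOptimizer(LightningCLI):
    def add_arguments_to_parser(self, parser) -> None:
        # registers optimizer and scheduler arguments and links them to configure_optimizers;
        # these entries then also have to be present in the config when running `test`
        parser.add_optimizer_args(torch.optim.SGD)
        parser.add_lr_scheduler_args(torch.optim.lr_scheduler.StepLR)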

The current documentation is not clear about what we should do to execute the test and predict commands through the CLI.

Maybe related are:
#8385
#7508

I'm currently in the process of publishing a research paper, and I would really like to use your CLI in my released code.
While it would be awesome to have an answer to these questions, I believe they fall outside the scope of this thread.

Many thanks again and have a good day! I am looking forward to the merge of the PR 👀

@carmocca
Contributor

How do we, as users, use the Lightning CLI to run the standard model usage pipeline (train-val, test, and predict)?

For these use cases, you have 3 options:

Why do I need to configure the optimizer and scheduler for the test stage (if using linked arguments, it throws an error in the test stage if they are not configured)?

We should totally skip this in that case. Can you open a separate issue?

I am looking forward to the merge of the PR 👀

The patch will be included in the 1.5.9 release.
