[CLI] Can't launch test command from checkpoint because "fit" key added to top level of CLI config #11463

Closed
F-Barto opened this issue Jan 13, 2022 · 9 comments · Fixed by #11532

@F-Barto

F-Barto commented Jan 13, 2022

🐛 Bug

Hello everyone and many thanks for your awesome work.

The example given here is a dummy one; I reduced my issue to a simple, reproducible example.
Let's consider the fit and test stages. Between those two stages, the model has the same parameters, but the data might be different (a different split).

When using the CLI for the test command, I expected to only have to give a config override for the data, but I have to provide the config for everything.

python trainer.py fit --config=config.yaml 

Followed by

python trainer.py test --ckpt_path "lightning_logs/version_0/checkpoints/epoch=9-step=319.ckpt" --config=test_data.yaml

won't work as is and will raise a TypeError: empty(): ....

I can't use the config from the log file:

python trainer.py test --ckpt_path "lightning_logs/version_0/checkpoints/epoch=9-step=319.ckpt" --config="lightning_logs/version_0/config.yaml" --config=test_data.yaml

as it raises test_trainer.py: error: 'Configuration check failed :: No action for destination key "fit.model.chans" to check its value.', because a "fit" key is added by the CLI at the top level of the saved config file.

I have to use:

python trainer.py test --ckpt_path "lightning_logs/version_0/checkpoints/epoch=9-step=319.ckpt"  --config=config.yaml  --config=test_data.yaml

This makes a trained network (or its ckpt, or log dir) not self-contained, and it is easy to use the wrong config at test time.
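
As a possible stopgap (just a sketch on my side, not something Lightning provides, and it may still leave fit-only keys behind), the extra top-level key could be stripped before reusing the saved config:

strip_subcommand_key.py

import sys
import yaml  # PyYAML

# usage: python strip_subcommand_key.py lightning_logs/version_0/config.yaml stripped.yaml
src, dst = sys.argv[1], sys.argv[2]
with open(src) as f:
    cfg = yaml.safe_load(f)
# the CLI nests everything under the subcommand name, e.g. "fit"
cfg = cfg.get("fit", cfg)
with open(dst, "w") as f:
    yaml.safe_dump(cfg, f)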

To Reproduce

config.yaml

model:
  chans: 32
data:
  size: 32
  length: 64
  batch_size: 2

test_data.yaml

data:
  size: 16
  length: 62
  batch_size: 2

In reality, the data parameters would differ much more (e.g., split_file: train.txt vs. split_file: test.txt); here they only differ enough to illustrate the point.

trainer.py

import pytorch_lightning as pl
from typing import Optional

import torch
from torch.utils.data import Dataset, DataLoader

from pytorch_lightning import LightningModule
from pytorch_lightning.utilities.cli import LightningCLI


class RandomDataset(Dataset):
    def __init__(self, size, length):
        super().__init__()
        self.len = length
        self.data = torch.randn(length, size)

    def __getitem__(self, index):
        return self.data[index]

    def __len__(self):
        return self.len

class RandomDataModule(pl.LightningDataModule):

    def __init__(self, size, length, batch_size):
        super().__init__()
        self.size = size
        self.length = length
        self.batch_size = batch_size

    def setup(self, stage: Optional[str] = None):
        """
        Args:
            stage: used to separate setup logic for trainer.{fit,validate,test}.
                If setup is called with stage = None, we assume all stages have been set-up.
        """

        if stage in (None, "fit"):
            self.train_dataset = RandomDataset(self.size, self.length)

        if stage in (None, "test"):
            self.test_dataset = RandomDataset(self.size, self.length)

    def train_dataloader(self):
        return DataLoader(self.train_dataset, batch_size=self.batch_size)

    def test_dataloader(self):
        return DataLoader(self.test_dataset, batch_size=self.batch_size)


class BoringModel(LightningModule):
    def __init__(self, chans: int):
        super().__init__()
        self.layer = torch.nn.Linear(chans, 2)

    def forward(self, x):
        return self.layer(x)

    def training_step(self, batch, batch_idx):
        loss = self(batch).sum()
        self.log("train_loss", loss)
        return {"loss": loss}

    def test_step(self, batch, batch_idx):
        metric = self(batch).sum()
        return {"metric": metric}

    def test_epoch_end(self, outputs):
        list_of_metrics = [output['metric'] for output in outputs]
        avg_metric = torch.stack(list_of_metrics).mean()
        self.log("metric", avg_metric)

    def configure_optimizers(self):
        return torch.optim.SGD(self.layer.parameters(), lr=0.1)


class CLI(LightningCLI):
    def add_arguments_to_parser(self, parser) -> None:
        parser.set_defaults({"trainer.max_epochs": 10})


CLI(
    BoringModel,
    RandomDataModule,
    save_config_overwrite=True,
)

Expected behavior

I would expect the following sequence of commands to work:

train

python trainer.py fit --config=config.yaml 

test

python trainer.py test --ckpt_path "lightning_logs/version_0/checkpoints/epoch=9-step=319.ckpt" --config=test_data.yaml

or (for test):

python trainer.py test --ckpt_path "lightning_logs/version_0/checkpoints/epoch=9-step=319.ckpt" --config="lightning_logs/version_0/config.yaml" --config=test_data.yaml

but lightning_logs/version_0/config.yaml now has an additional "fit" key:

fit:
  model:
    chans: 32
  data:
    size: 32
    length: 64
    batch_size: 2

Environment

  • Packages:
    • pyTorch_version: 1.10.0
    • pytorch-lightning: 1.6.0.dev0
    • jsonargparse: 4.1.1

Additional context

Related issues:
#10460

Have a good day.

cc @carmocca @mauvilsa @rbracco

@F-Barto F-Barto added the bug Something isn't working label Jan 13, 2022
@carmocca carmocca added the lightningcli pl.cli.LightningCLI label Jan 13, 2022
@carmocca carmocca self-assigned this Jan 13, 2022
@carmocca carmocca added this to the 1.5.x milestone Jan 13, 2022
@mauvilsa
Contributor

To not have fit in the saved config, the save-config callback would need to be instantiated with the subcommand parser and only that subcommand's config.

https://github.com/PyTorchLightning/pytorch-lightning/blob/d95c0d5a4441249346197dfa1f7b459d1ffd9fbf/pytorch_lightning/utilities/cli.py#L702-L703
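
Roughly, the distinction looks like this with plain jsonargparse (a standalone sketch of the idea, not the actual Lightning code behind that link):

from jsonargparse import ArgumentParser

fit_parser = ArgumentParser()
fit_parser.add_argument("--model.chans", type=int, default=32)

parser = ArgumentParser()
subcommands = parser.add_subcommands()
subcommands.add_subcommand("fit", fit_parser)

cfg = parser.parse_args(["fit", "--model.chans=16"])

# dumping with the top-level parser nests everything under the subcommand,
# which is what currently ends up in lightning_logs/version_0/config.yaml ("fit: ...")
print(parser.dump(cfg))

# dumping the subcommand's slice with the subcommand parser drops that key,
# producing a config that can be fed back to fit or test
print(fit_parser.dump(cfg.fit))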

@carmocca
Contributor

@mauvilsa would you like to work on fixing this?

@F-Barto
Author

F-Barto commented Jan 14, 2022

I was just wondering. Correct me if I am wrong, but using self.save_hyperparameters() is recommended practice and so the checkpoint should contain the hyperparams necessary to load the model.

Would it be possible to do something like this:

  1. Init the model with the hparams stored in the checkpoint
  2. Load the weights, also using what is stored in the checkpoint
  3. Let the user give only the datamodule config for the test stage

This would allow one to only have to run:
python trainer.py test --ckpt_path "lightning_logs/version_0/checkpoints/epoch=9-step=319.ckpt" --config=test_data.yaml

Maybe 1 and 2 could be done with load_from_checkpoint when the model is given to the CLI:

CLI(
    BoringModel,
    RandomDataModule,
    save_config_overwrite=True,
)

However, when dealing with CLI(), both the class_path and init_args would need to have been saved in the ckpt.
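
To make steps 1-3 concrete, this is roughly the manual equivalent outside of the CLI (a sketch; it assumes BoringModel and RandomDataModule from the repro above, and that BoringModel calls self.save_hyperparameters() in __init__, which the repro currently does not, otherwise chans would have to be passed explicitly):

import pytorch_lightning as pl

ckpt = "lightning_logs/version_0/checkpoints/epoch=9-step=319.ckpt"
model = BoringModel.load_from_checkpoint(ckpt)           # steps 1-2: hparams + weights from the ckpt
dm = RandomDataModule(size=32, length=62, batch_size=2)  # step 3: only the test-time data config
pl.Trainer().test(model, datamodule=dm)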

@mauvilsa
Contributor

@mauvilsa would you like to work on fixing this?

I could, but I'm not sure when. Independent of this, is there any reason why we shouldn't do it, or anything else that needs to be done? One minor issue I can think of is that by removing the subcommand key there is no way to know which subcommand was executed.

@mauvilsa
Contributor

I was just wondering. Correct me if I am wrong, but using self.save_hyperparameters() is recommended practice and so the checkpoint should contain the hyperparams necessary to load the model.

I think it would be great if the model hyperparameters were saved in the checkpoint so that for test/predict the model can be loaded without the need to give a config. Note, though, that LightningCLI supports more than what self.save_hyperparameters() is currently able to deliver. More specifically, with LightningCLI it is possible to have models that work via composition, so what __init__ sees are instances of classes, and save_hyperparameters has no way to know what is needed to instantiate them. There are also linked arguments, which might need special handling. I guess there are ways in which save_hyperparameters could be extended to support this, but I have only used fit up to now and haven't looked into the details of save_hyperparameters, so there isn't much more I can say about it right now.
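
To illustrate the kind of composition meant here (a made-up example, not from this issue's repro): the model only receives an already-built instance in __init__, while the information needed to build it, the class_path and init_args, lives in the CLI config and is never seen by save_hyperparameters:

import torch
from pytorch_lightning import LightningModule

class ComposedModel(LightningModule):
    def __init__(self, backbone: torch.nn.Module):
        super().__init__()
        # the model only holds the instance; which class it is and how it was
        # constructed is known only to the CLI config below
        self.backbone = backbone

and the corresponding part of the CLI config:

model:
  backbone:
    class_path: torch.nn.Linear
    init_args:
      in_features: 32
      out_features: 2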

@carmocca
Contributor

carmocca commented Jan 18, 2022

So there are two topics being discussed here:

  1. A bug. The saved config includes the subcommand used, even though the user passed a config for a specific one.

is there any reason why we shouldn't do this or what else needs to be done?

I don't think so.

  2. A feature. That the CLI uses the ckpt path or hyperparameters yaml to load the model configuration.

This should be discussed in a separate issue. However, saving hyperparameters is not required with the CLI, and is perhaps even discouraged, since the CLI config can do much more, as @mauvilsa mentioned above.

Load the weights, also using what is stored in the checkpoint

We need the Trainer to be set up for this. This is out of the scope of the CLI.

@carmocca
Contributor

I opened #11532, which fixes:

python trainer.py test --ckpt_path "lightning_logs/version_0/checkpoints/epoch=9-step=319.ckpt" --config="lightning_logs/version_0/config.yaml" --config=test_data.yaml

@F-Barto
Author

F-Barto commented Jan 19, 2022

Thank you very much @carmocca and @mauvilsa !

I agree with you on the two points you mentioned, @carmocca.
For 2., I think the question can be extended further.

How do we, as users, use the Lightning CLI to run the standard model usage pipeline (train-val, test, and predict)? What parameters do I have to provide to the CLI at each stage (configs, checkpoint, config overrides, ...)? Why do I need to configure the optimizer and scheduler for the test stage (if using linked arguments, it throws an error in the test stage if they are not configured)?
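
(For context, by configuring the optimizer and scheduler through linked arguments I mean something along these lines; a minimal sketch, not my actual setup:)

import torch
from pytorch_lightning.utilities.cli import LightningCLI

class CLIWithOptimizer(LightningCLI):
    def add_arguments_to_parser(self, parser) -> None:
        # registers optimizer and scheduler arguments and links them to configure_optimizers;
        # these entries then also have to be present in the config when running `test`
        parser.add_optimizer_args(torch.optim.SGD)
        parser.add_lr_scheduler_args(torch.optim.lr_scheduler.StepLR)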

The current documentation is not clear about what we should do to execute the test and predict commands through the CLI.

Maybe related are:
#8385
#7508

I'm currently in the process of publishing a research paper, and I would really like to use your CLI in my released code.
While it would be awesome to have an answer to these questions, I believe they fall outside the scope of this thread.

Many thanks again and have a good day! I am looking forward to the merge of the PR 👀

@carmocca
Contributor

How do we, as users, use the Lightning CLI to run the standard model usage pipeline (train-val, test, and predict)?

For these use cases, you have 3 options:

Why do I need to configure the optimizer and scheduler for the test stage (if using linked arguments, it throws an error in the test stage if they are not configured)?

We should totally skip this in that case. Can you open a separate issue?

I am looking forward to the merge of the PR 👀

The patch will be included in the 1.5.9 release.
