
Simple reproducibility with minimum boilerplate train_cli #4492

Merged 46 commits into master on Apr 6, 2021
Conversation

@mauvilsa (Contributor) commented Nov 3, 2020

What does this PR do?

This is a proposal that is much easier to explain with actual code in a pull request. The exact idea is not discussed in any single issue; rather, it relates to several, though the most relevant discussion is the one in #3720.

I have added to the pull request an rst file which describes the proposal in more detail, so please have a look: lightning_cli.rst.

The main motivation I have for this is to standardize and remove as much boilerplate code as possible, so that trainings can be trivially reproduced from a single config file. This config file is automatically saved in the log dir.
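
To make this concrete, here is a minimal sketch of what such a training script (the trainer.py used below) could look like. The LightningCLI class name follows what was ultimately merged (see the changelog further below); the import path is assumed, and LitModel / LitDataModule are hypothetical user classes:

from pytorch_lightning.utilities.cli import LightningCLI  # class name as merged; import path assumed
from mycode import LitModel, LitDataModule  # hypothetical user classes

# Parses the command line and/or a --config file, instantiates the model,
# datamodule and trainer, runs trainer.fit(), and writes the fully resolved
# config into the log directory so the run can be reproduced later.
cli = LightningCLI(LitModel, LitDataModule)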

The short summary is, users would run a training like:

    python trainer.py --config config.yaml

And then they would be able to reproduce this by running:

    python trainer.py --config lightning_logs/version_X/config.yaml

The point is not to rerun a training that has already been done, but that the config file includes all of the information needed to reproduce it. This is possible because the config file also includes all of the settings from user-defined classes.

This proposal overlaps with already existing features, namely argument parsing and omegaconf. However, I believe it goes well beyond those in a much simpler way, providing several additional advantages. Specifically:

  • A single standardized config file including settings for the trainer and user-defined classes.
  • Encourages users to write better code with good type hints and docstrings (see the sketch after this list).
  • Implementing training command line tools requires minimal boilerplate code.
  • Supports more complex trainer arguments: list, dict, see test_lightning_cli.py#L157-L165.
  • The training tool's help automatically includes defaults and the descriptions from docstrings.
  • Supports parsing settings from the command line, config files and environment variables.
  • Supports tab argument completion if argcomplete is installed.
  • A very small amount of code in comparison to the number of features it provides.
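
As a sketch of the type hints and docstrings point (LitModel here is a hypothetical user class), the parser reads the __init__ signature and its docstring, so every parameter becomes a documented, type-checked option:

import pytorch_lightning as pl

class LitModel(pl.LightningModule):
    """Example model whose __init__ signature is exposed through the CLI.

    Args:
        hidden_dim: Size of the hidden layer.
        learning_rate: Learning rate for the optimizer.
    """

    def __init__(self, hidden_dim: int = 128, learning_rate: float = 1e-3):
        super().__init__()
        self.save_hyperparameters()

Both parameters, their defaults and their docstring descriptions then appear in the generated help and in the saved config file.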

The unit tests in test_lightning_cli.py are in part a copy of the existing test_trainer_cli.py tests. Almost all functionality from the previous code is supported. The only exception is that command line arguments for booleans are stricter, not allowing for example "0" or "t"; see the selected tests in test_trainer_cli.py#L110-L119. Support for these could certainly be added, though being stricter could actually be a good thing, and with argument completion this is even less of an issue.
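
For example, assuming a boolean trainer option exposed as --trainer.fast_dev_run (the exact flag is illustrative):

    python trainer.py --trainer.fast_dev_run true   # accepted
    python trainer.py --trainer.fast_dev_run t      # rejected as an invalid boolean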

Since parts of this proposal are new, I surely hope to get lots of feedback to improve the idea further.

Fixes #3076
Fixes #3720
Fixes #4119

Before submitting

  • Was this discussed/approved via a Github issue? (no need for typos and docs improvements)
  • Did you read the contributor guideline, Pull Request section?
  • Did you make sure your PR does only one thing, instead of bundling different changes together? Otherwise, we ask you to create a separate PR for every change.
  • Did you make sure to update the documentation with your changes?
  • Did you write any new necessary tests?
  • Did you verify new and existing tests pass locally with your changes?
  • If you made a notable change (that affects users), did you update the CHANGELOG?

PR review

  • Is this pull request ready for review? (if not, please submit in draft mode)

Next steps:

  • Deprecate the current argparse support
  • Migrate examples

@pep8speaks commented Nov 3, 2020

Hello @mauvilsa! Thanks for updating this PR.

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2021-04-06 12:03:23 UTC

@mauvilsa (Contributor, Author) commented Nov 3, 2020

I'm pinging a few more people who could be interested in this: @carmocca @odedbd

@Borda added the "feature" (Is an improvement or enhancement) label on Nov 3, 2020
@Borda added this to the 1.1 milestone on Nov 3, 2020
@Borda added the "design" (Includes a design discussion) label on Nov 3, 2020
@Borda (Member) commented Nov 3, 2020

@nateraw mind having a look? This seems like a complete redo of our CLI... :]

@carmocca (Contributor) commented Nov 3, 2020

Would we deprecate the current argparse functionality (and bolts implementation) if this lands?

@SeanNaren (Contributor) commented:

This is so so epic! Thanks for an awesome PR 👍

How would this work with shared parameters, i.e. I define two models with encoder_layers as an input arg, and I want the same arg to go to both of them? Also curious about inheritance.

I love the trainer_cli idea, but it seems that as soon as you go towards custom args, we lose all the niceties of having no boilerplate with the LightningArgsParser. It would be nice to have some intermediate form that allows us to keep things self-contained in PL without having to define the args object/instantiate everything ourselves as in normal PL!

@mauvilsa (Contributor, Author) commented Nov 3, 2020

How would this work with shared parameters, i.e. I define two models with encoder_layers as an input arg, and I want the same arg to go to both of them?

I would say there are many ways in which shared parameters could be handled, and the best way to do it would probably depend on the specific case. This would require a bit more thought.

Also curious about inheritance.

Arguments of base classes are also added correctly; see for instance the example in the unit test https://github.com/omni-us/jsonargparse/blob/master/jsonargparse_tests/signatures_tests.py#L16-L71. The only requirement is that *args and/or **kwargs must be present. I am not sure if there are cases which this does not cover, but it could be improved further.
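
For illustration, a minimal sketch of that inheritance behaviour (both classes are hypothetical): as long as the subclass accepts and forwards **kwargs, the parameters of the base class are exposed as well.

import pytorch_lightning as pl

class BaseModel(pl.LightningModule):
    def __init__(self, encoder_layers: int = 12):
        super().__init__()
        self.encoder_layers = encoder_layers

class FineTuneModel(BaseModel):
    def __init__(self, learning_rate: float = 1e-3, **kwargs):
        # Forwarding **kwargs is what lets the parser also pick up
        # encoder_layers from BaseModel in addition to learning_rate.
        super().__init__(**kwargs)
        self.learning_rate = learning_rate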

I love the trainer_cli idea, but it seems that as soon as you go towards custom args, we lose all the niceties of having no boilerplate with the LightningArgsParser. It would be nice to have some intermediate form that allows us to keep things self-contained in PL without having to define the args object/instantiate everything ourselves as in normal PL!

@SeanNaren I am not sure what you mean by "allows us to keep things self-contained in PL without having to define the args object/instantiate everything ourselves as in normal PL". Could you please explain a bit more?

@SeanNaren (Contributor) commented Nov 3, 2020

If I want to specify custom args, I shouldn't need to define all the boilerplate again. I would love to keep it like one of these two variants (not a really well-thought-out proposal, but hopefully it adds to the discussion):

from pytorch_lightning.utilities.jsonargparse_utils import trainer_cli
from mycode import LitModel, LitDataModule, CustomCallback

custom_callback = CustomCallback()

trainer_cli(LitModel, LitDataModule, callbacks=[custom_callback])

from pytorch_lightning.utilities.jsonargparse_utils import LightningArgumentParser, trainer_cli
from mycode import LitModel, LitDataModule

# Define parser
parser = LightningArgumentParser(description='pytorch-lightning trainer',
                                 parse_as_dict=True)
parser.add_argument('--notification_email', default='will@email.com')

trainer_cli(LitModel, LitDataModule, parser=parser)

@teddykoker (Contributor) commented Nov 3, 2020

I think @SeanNaren has a good point. Should we implement a system like this, we would definitely need to allow arbitrary additional arguments to be added.

Additionally, I think we need to draw the line somewhere as to where Lightning ends and users can use their own solutions. Right now there are many solutions for Python configuration and CLIs: typer, hydra, click, fire. I am not sure it makes sense to build something like this into Lightning when each user will have their own use case, and this would add another dependency to worry about breaking.

That being said, I could be wrong, and nevertheless this PR should promote a thoughtful discussion.

@mauvilsa (Contributor, Author) commented Nov 4, 2020

@teddykoker I was certainly expecting comments about adding a dependency, and that this might end up being the reason why this is not merged. Putting that aside for a moment, I think the most important topic to discuss here is whether Lightning should have this kind of easy, no-boilerplate reproducibility feature out of the box. People who just want to run some experiment shouldn't need to know about the variety of Python solutions for configuration, or be required to learn how to use one of them and add this logic to their code. I personally think the core idea of this PR fits the Lightning philosophy perfectly, but this is just one opinion. I would like to hear opinions about this from as many people as possible.

@mauvilsa (Contributor, Author) commented Nov 4, 2020

If in the end there is a consensus that this is a desired feature, the implementation details would be the second topic to discuss. Many options are possible, including implementing everything in Lightning for the sake of avoiding a dependency, though I would advise against this. Modularity and separation of concerns are basic software engineering principles, and it makes more sense to use a package specifically designed for this than to reimplement it. I can continue working on this pull request based on the ongoing discussion, though I would do so using jsonargparse, since I am its main developer and can easily steer its development towards whatever the Lightning community requires.

@mergify bot removed the "has conflicts" label on Mar 30, 2021
Auto-merge was automatically disabled on March 31, 2021 06:46: the head branch was pushed to by a user without write access.

@mauvilsa (Contributor, Author) commented:

Added a unit test for callbacks given as an argument from the command line. Also included in the documentation that the all extras_require is needed.

@Borda requested review from awaelchli, carmocca and a team, and removed the request for nateraw, on March 31, 2021 07:27
@awaelchli (Contributor) commented:

Is LightningCLI restricted to Lightning classes?
So some_object = LightningCLI(MyCustomClass) won't work, even if type annotations are given? Will it be possible to support profilers, loggers, etc. too?
Why are the Trainer methods embedded inside the CLI? Is it not possible to call trainer.fit() or test at the desired place?
I have only read the docs so far, so apologies, but it sounds like this is not possible.

@mauvilsa (Contributor, Author) commented Apr 5, 2021

Hi @awaelchli

Is LightningCLI restricted to Lightning classes? So some_object = LightningCLI(MyCustomClass) won't work, even if type annotations are given?

It works for custom classes. That is the idea: people implement their own modules and data modules and then use the CLI. The only requirement is that the classes inherit from LightningModule and LightningDataModule, which is also the requirement for Lightning's Trainer, right?

Will it be possible to support profilers, loggers etc too?

Yes, they should work. There are no unit tests specifically for profilers and loggers, but these have type hints similar to callbacks, for which there are unit tests, so there is no reason why they wouldn't work. Moreover, there are several unit tests in jsonargparse that already check this kind of behavior. Trainer has too many parameters and possible values for me to test everything; people need to try it out in their use cases to identify problems.

Why are the Trainer methods embedded inside the cli? Is it not possible to call trainer.fit() or test at the desired place?

I am not entirely sure what you mean. The purpose of LightningCLI is to load the configuration and then run trainer.fit, avoiding the need to write this boilerplate. If you want to do something slightly different, you can implement a new CLI that inherits from LightningCLI and only modify some of its methods.
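
For illustration, a hedged sketch of that subclassing route; the after_fit hook and the self.trainer / self.model / self.datamodule attributes are assumptions about this API, the import path is assumed, and LitModel / LitDataModule are hypothetical user classes:

from pytorch_lightning.utilities.cli import LightningCLI
from mycode import LitModel, LitDataModule

class TestAfterFitCLI(LightningCLI):
    def after_fit(self):
        # Run a test pass right after training instead of calling
        # trainer.test() manually somewhere else in the script.
        self.trainer.test(self.model, datamodule=self.datamodule)

cli = TestAfterFitCLI(LitModel, LitDataModule)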

I only read the docs so far, so apologies but it sounds like this is not possible.

No problem. I am glad that someone else is looking at it. I guess it will become a bit clearer if you try it out with a simple example.

@carmocca added the "ready" (PRs ready to be merged) label on Apr 6, 2021
@carmocca (Contributor) left a comment:


Ready to go! 💃

@carmocca enabled auto-merge (squash) on April 6, 2021 13:03
@tchaton (Contributor) left a comment:


LGTM !

@carmocca merged commit b7f3a3c into Lightning-AI:master on Apr 6, 2021
@Borda (Member) commented Apr 6, 2021

🎉 💟 🐰 great work @mauvilsa and @carmocca!

facebook-github-bot pushed a commit to facebookresearch/d2go that referenced this pull request Apr 14, 2021
…ter) to github/third-party/PyTorchLightning/pytorch-lightning

Summary:
### New commit log messages
## [UnReleased] - 2021-MM-DD

### Added

- Added more explicit exception message when trying to execute `trainer.test()` or `trainer.validate()` with `fast_dev_run=True` ([#6667](Lightning-AI/pytorch-lightning#6667))

- Added `LightningCLI` class to provide simple reproducibility with minimum boilerplate training cli. ([#4492](Lightning-AI/pytorch-lightning#4492))

- Trigger warning when non-metric logged value with multi processes hasn't been reduced ([#6417](Lightning-AI/pytorch-lightning#6417))

- Added `gradient_clip_algorithm` argument to Trainer for gradient clipping by value ([#6123](Lightning-AI/pytorch-lightning#6123)).

- Added a way to print to terminal without breaking up the progress bar ([#5470](Lightning-AI/pytorch-lightning#5470))

- Added support to checkpoint after training steps in `ModelCheckpoint` callback ([#6146](Lightning-AI/pytorch-lightning#6146))

- Added `checkpoint` parameter to callback's `on_save_checkpoint` hook ([#6072](Lightning-AI/pytorch-lightning#6072))

- Added `RunningStage.SANITY_CHECKING` ([#4945](Lightning-AI/pytorch-lightning#4945))

- Added `TrainerState.{FITTING,VALIDATING,TESTING,PREDICTING,TUNING}` ([#4945](Lightning-AI/pytorch-lightning#4945))

- Added `Trainer.validate()` method to perform one evaluation epoch over the validation set ([#4948](Lightning-AI/pytorch-lightning#4948))

- Added `LightningEnvironment` for Lightning-specific DDP ([#5915](Lightning-AI/pytorch-lightning#5915))

- Added `teardown()` hook to LightningDataModule ([#4673](Lightning-AI/pytorch-lightning#4673))

- Added `auto_insert_metric_name` parameter to `ModelCheckpoint` ([#6277](Lightning-AI/pytorch-lightning#6277))

- Added arg to `self.log` that enables users to give custom names when dealing with multiple dataloaders ([#6274](Lightning-AI/pytorch-lightning#6274))

- Added `teardown` method to `BaseProfiler` to enable subclasses defining post-profiling steps outside of `__del__` ([#6370](Lightning-AI/pytorch-lightning#6370))

- Added `setup` method to `BaseProfiler` to enable subclasses defining pre-profiling steps for every process ([#6633](Lightning-AI/pytorch-lightning#6633))

- Added no return warning to predict ([#6139](Lightning-AI/pytorch-lightning#6139))

- Added `Trainer.predict` config validation ([#6543](Lightning-AI/pytorch-lightning#6543))

- Added `AbstractProfiler` interface ([#6621](Lightning-AI/pytorch-lightning#6621))

- Added support for including module names for forward in the autograd trace of `PyTorchProfiler` ([#6349](Lightning-AI/pytorch-lightning#6349))

- Added support for the PyTorch 1.8.1 autograd profiler ([#6618](Lightning-AI/pytorch-lightning#6618))

- Added `outputs` parameter to callback's `on_validation_epoch_end` & `on_test_epoch_end` hooks ([#6120](Lightning-AI/pytorch-lightning#6120))

- Added `configure_sharded_model` hook ([#6679](Lightning-AI/pytorch-lightning#6679))

- Added support for `precision=64`, enabling training with double precision ([#6595](Lightning-AI/pytorch-lightning#6595))

- Added support for DDP communication hooks ([#6736](Lightning-AI/pytorch-lightning#6736))

- Added `artifact_location` argument to `MLFlowLogger` which will be passed to the `MlflowClient.create_experiment` call ([#6677](Lightning-AI/pytorch-lightning#6677))

- Added `model` parameter to precision plugins' `clip_gradients` signature ([#6764](Lightning-AI/pytorch-lightning#6764))

### Changed

- Renamed `pytorch_lightning.callbacks.swa` to `pytorch_lightning.callbacks.stochastic_weight_avg` ([#6259](Lightning-AI/pytorch-lightning#6259))

- Refactor `RunningStage` and `TrainerState` usage ([#4945](Lightning-AI/pytorch-lightning#4945))

- Changed `trainer.evaluating` to return `True` if validating or testing ([#4945](Lightning-AI/pytorch-lightning#4945))

- Changed `setup()` and `teardown()` stage argument to take any of `{fit,validate,test,predict}` ([#6386](Lightning-AI/pytorch-lightning#6386))

- Changed profilers to save separate report files per state and rank ([#6621](Lightning-AI/pytorch-lightning#6621))

- Changed `PyTorchProfiler` to use `torch.autograd.profiler.record_function` to record functions ([#6349](Lightning-AI/pytorch-lightning#6349))

### Deprecated

- `period` has been deprecated in favor of `every_n_val_epochs` in the `ModelCheckpoint` callback ([#6146](Lightning-AI/pytorch-lightning#6146))

- Deprecated `trainer.running_sanity_check` in favor of `trainer.sanity_checking` ([#4945](Lightning-AI/pytorch-lightning#4945))

- Deprecated `Profiler(output_filename)` in favor of `dirpath` and `filename` ([#6621](Lightning-AI/pytorch-lightning#6621))

- Deprecated `PytorchProfiler(profiled_functions)` in favor of `record_functions` ([#6349](Lightning-AI/pytorch-lightning#6349))

- Deprecated metrics in favor of `torchmetrics` ([#6505](Lightning-AI/pytorch-lightning#6505),
    [#6530](Lightning-AI/pytorch-lightning#6530),
    [#6540](Lightning-AI/pytorch-lightning#6540),
    [#6547](Lightning-AI/pytorch-lightning#6547),
    [#6515](Lightning-AI/pytorch-lightning#6515),
    [#6572](Lightning-AI/pytorch-lightning#6572),
    [#6573](Lightning-AI/pytorch-lightning#6573),
    [#6584](Lightning-AI/pytorch-lightning#6584),
    [#6636](Lightning-AI/pytorch-lightning#6636),
    [#6637](Lightning-AI/pytorch-lightning#6637),
    [#6649](Lightning-AI/pytorch-lightning#6649),
    [#6659](Lightning-AI/pytorch-lightning#6659),
)

### Removed

- Removed support for passing a bool value to `profiler` argument of Trainer ([#6164](Lightning-AI/pytorch-lightning#6164))

- Removed no return warning from val/test step ([#6139](Lightning-AI/pytorch-lightning#6139))

- Removed passing a `ModelCheckpoint` instance to `Trainer(checkpoint_callback)` ([#6166](Lightning-AI/pytorch-lightning#6166))

- Removed deprecated Trainer argument `enable_pl_optimizer` and `automatic_optimization` ([#6163](Lightning-AI/pytorch-lightning#6163))

- Removed deprecated metrics ([#6161](Lightning-AI/pytorch-lightning#6161))
    * from `pytorch_lightning.metrics.functional.classification` removed `to_onehot`, `to_categorical`, `get_num_classes`, `roc`, `multiclass_roc`, `average_precision`, `precision_recall_curve`, `multiclass_precision_recall_curve`
    * from `pytorch_lightning.metrics.functional.reduction` removed `reduce`, `class_reduce`

- Removed deprecated `ModelCheckpoint` arguments `prefix`, `mode="auto"` ([#6162](Lightning-AI/pytorch-lightning#6162))

- Removed `mode='auto'` from `EarlyStopping` ([#6167](Lightning-AI/pytorch-lightning#6167))

- Removed legacy references for magic keys in the `Result` object ([#6016](Lightning-AI/pytorch-lightning#6016))

- Removed deprecated `LightningModule` `hparams` setter ([#6207](Lightning-AI/pytorch-lightning#6207))

- Removed legacy code to log or include metrics in the progress bar by returning them in a dict with the `"log"/"progress_bar"` magic keys. Use `self.log` instead ([#6734](Lightning-AI/pytorch-lightning#6734))

- Removed `optimizer_idx` argument from `training_step` in manual optimization ([#6093](Lightning-AI/pytorch-lightning#6093))

### Fixed

- Set better defaults for `rank_zero_only.rank` when training is launched with SLURM and torchelastic ([#6802](Lightning-AI/pytorch-lightning#6802))

- Made the `Plugin.reduce` method more consistent across all Plugins to reflect a mean-reduction by default ([#6011](Lightning-AI/pytorch-lightning#6011))

- Move lightning module to correct device type when using LightningDistributedWrapper ([#6070](Lightning-AI/pytorch-lightning#6070))

- Do not print top-k verbose log with `ModelCheckpoint(monitor=None)` ([#6109](Lightning-AI/pytorch-lightning#6109))

- Fixed csv extension check ([#6436](Lightning-AI/pytorch-lightning#6436))

- Fixed `ModelCheckpoint(monitor=None, save_last=True)` not saving checkpoints ([#6136](Lightning-AI/pytorch-lightning#6136))

- Fixed `ModelCheckpoint(save_top_k=0, save_last=True)` not saving the `last` checkpoint ([#6136](Lightning-AI/pytorch-lightning#6136))

- Fixed `.teardown(stage='fit')` getting called during `trainer.test` ([#6386](Lightning-AI/pytorch-lightning#6386))

- Fixed `.on_fit_{start,end}()` getting called during `trainer.test` ([#6386](Lightning-AI/pytorch-lightning#6386))

- Fixed LightningModule `all_gather` on cpu tensors ([#6416](Lightning-AI/pytorch-lightning#6416))

- Fixed torch distributed not available in setup hook for DDP ([#6506](Lightning-AI/pytorch-lightning#6506))

- Fixed `EarlyStopping` logic when `min_epochs` or `min_steps` requirement is not met ([#6705](Lightning-AI/pytorch-lightning#6705))

## [1.2.7] - 2021-04-06

### Fixed

- Fixed resolve a bug with omegaconf and xm.save ([#6741](Lightning-AI/pytorch-lightning#6741))
- Fixed an issue with IterableDataset when __len__ is not defined ([#6828](Lightning-AI/pytorch-lightning#6828))
- Sanitize None params during pruning ([#6836](Lightning-AI/pytorch-lightning#6836))
- Enforce an epoch scheduler interval when using SWA ([#6588](Lightning-AI/pytorch-lightning#6588))
- Fixed TPU Colab hang issue, post training ([#6816](Lightning-AI/pytorch-lightning#6816))
- Fixed a bug where `TensorBoardLogger` would give a warning and not log correctly to a symbolic link `save_dir` ([#6730](Lightning-AI/pytorch-lightning#6730))

## [1.2.6] - 2021-03-30

### Changed

- Changed the behavior of `on_epoch_start` to run at the beginning of validation & test epoch ([#6498](Lightning-AI/pytorch-lightning#6498))

### Removed

- Removed legacy code to include `step` dictionary returns in `callback_metrics`. Use `self.log_dict` instead. ([#6682](Lightning-AI/pytorch-lightning#6682))

### Fixed

- Fixed `DummyLogger.log_hyperparams` raising a `TypeError` when running with `fast_dev_run=True` ([#6398](Lightning-AI/pytorch-lightning#6398))
- Fixed error on TPUs when there was no `ModelCheckpoint` ([#6654](Lightning-AI/pytorch-lightning#6654))
- Fixed `trainer.test` freeze on TPUs ([#6654](Lightning-AI/pytorch-lightning#6654))
- Fixed a bug where gradients were disabled after calling `Trainer.predict` ([#6657](Lightning-AI/pytorch-lightning#6657))
- Fixed bug where no TPUs were detected in a TPU pod env ([#6719](Lightning-AI/pytorch-lightning#6719))

## [1.2.5] - 2021-03-23

### Changed

- Update Gradient Clipping for the TPU Accelerator ([#6576](Lightning-AI/pytorch-lightning#6576))
- Refactored setup for typing friendly ([#6590](Lightning-AI/pytorch-lightning#6590))

### Fixed

- Fixed a bug where `all_gather` would not work correctly with `tpu_cores=8` ([#6587](Lightning-AI/pytorch-lightning#6587))
- Fixed comparing required versions ([#6434](Lightning-AI/pytorch-lightning#6434))
- Fixed duplicate logs appearing in console when using the python logging module ([#6275](Lightning-AI/pytorch-lightning#6275))
- Added Autocast in validation, test and predict modes for Native AMP ([#6565](Lightning-AI/pytorch-lightning#6565))

Reviewed By: shuyingsunshine21

Differential Revision: D27528929

fbshipit-source-id: 311c88f71461c2c79bbf185e28d7a6d683ccc26f
Labels: design (Includes a design discussion), feature (Is an improvement or enhancement), priority: 1 (Medium priority task), ready (PRs ready to be merged)