
Add on_epoch_start to run at the beginning of every loop irrespective of train/val/test #6498

Merged
6 commits merged into master on Mar 25, 2021

Conversation

rohitgr7
Contributor

@rohitgr7 rohitgr7 commented Mar 12, 2021

What does this PR do?

Added on_epoch_start to run at the beginning of every loop, irrespective of train/val/test. This is the common hook that lets users do anything that needs to happen for all of train/val/test. If something needs to be done separately for train, val, or test, we already have the on_train/val/test_epoch_start hooks.
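
For example (an illustrative sketch, not part of this PR's diff; hook signatures assumed as in Lightning 1.2):

```python
import pytorch_lightning as pl

class MyModel(pl.LightningModule):
    def on_epoch_start(self):
        # With this change, runs at the start of every loop:
        # training, validation, and test.
        self.print("epoch starting (any loop)")

    def on_train_epoch_start(self):
        # Loop-specific hook, still available for train-only logic.
        self.print("training epoch starting")
```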

Before submitting

  • Was this discussed/approved via a GitHub issue? (not for typos and docs)
  • Did you read the contributor guideline, Pull Request section?
  • Did you make sure your PR does only one thing, instead of bundling different changes together?
  • Did you make sure to update the documentation with your changes? (if necessary)
  • Did you write any new necessary tests? (not for typos and docs)
  • Did you verify new and existing tests pass locally with your changes?
  • Did you update the CHANGELOG? (not for typos, docs, test updates, or internal minor changes/refactorings)

PR review

Anyone in the community is free to review the PR once the tests have passed.
Before you start reviewing, make sure you have read the Review guidelines. In short, see the following bullet list:

  • Is this pull request ready for review? (if not, please submit in draft mode)
  • Check that all items from Before submitting are resolved
  • Make sure the title is self-explanatory and the description concisely explains the PR
  • Add labels and milestones (and optionally projects) to the PR so it can be classified

Did you have fun?

Make sure you had fun coding 🙃

@rohitgr7
Contributor Author

quick question here:
https://github.com/PyTorchLightning/pytorch-lightning/blob/680e83adab38c2d680b138bdc39d48fc35c0cb58/pytorch_lightning/trainer/evaluation_loop.py#L120-L124
https://github.com/PyTorchLightning/pytorch-lightning/blob/680e83adab38c2d680b138bdc39d48fc35c0cb58/pytorch_lightning/trainer/evaluation_loop.py#L314-L321

on_epoch_end is called after every epoch irrespective of train/val/test, but this is not the case with on_epoch_start. Is this intended behavior, or was it missed? Let me know and I'll update the PR and tests accordingly if required ✌️

cc @carmocca @tchaton

@codecov

codecov bot commented Mar 12, 2021

Codecov Report

Merging #6498 (37e0c88) into master (dcd9dd8) will decrease coverage by 45%.
The diff coverage is 80%.

@@           Coverage Diff            @@
##           master   #6498     +/-   ##
========================================
- Coverage      93%     47%    -45%     
========================================
  Files         161     161             
  Lines       11518   11413    -105     
========================================
- Hits        10661    5374   -5287     
- Misses        857    6039   +5182     

@carmocca
Contributor

Is this intended behavior, or was it missed? Let me know and I'll update the PR and tests accordingly if required ✌️

I'd say missed. I don't see any reason for it to act differently.

@rohitgr7 rohitgr7 changed the title [WIP] Update docs for on_epoch_start/on_epoch_end Add on_epoch_start to run at the beginning of every loop irrespective of train/val/test Mar 14, 2021
@rohitgr7 rohitgr7 added this to the 1.2.x milestone Mar 14, 2021
@rohitgr7 rohitgr7 added design Includes a design discussion callback feature Is an improvement or enhancement labels Mar 14, 2021
@rohitgr7 rohitgr7 changed the title Add on_epoch_start to run at the beginning of every loop irrespective of train/val/test [WIP] Add on_epoch_start to run at the beginning of every loop irrespective of train/val/test Mar 14, 2021
@rohitgr7 rohitgr7 marked this pull request as draft March 14, 2021 13:59
@rohitgr7 rohitgr7 changed the title [WIP] Add on_epoch_start to run at the beginning of every loop irrespective of train/val/test Add on_epoch_start to run at the beginning of every loop irrespective of train/val/test Mar 14, 2021
@rohitgr7 rohitgr7 marked this pull request as ready for review March 14, 2021 15:28
Contributor

@tchaton tchaton left a comment


LGTM! Is it called only once for training?

Contributor

@awaelchli awaelchli left a comment


Can we make a "backward incompatible changes" section in the release notes, like PyTorch does?

This PR is an example where the changelog entry is not enough.

@rohitgr7
Contributor Author

Can we make a "backward incompatible changes" section in the release notes, like PyTorch does?

This PR is an example where the changelog entry is not enough.

Should I add a section for it in the changelog? @awaelchli

@awaelchli
Contributor

No, I believe these release notes are a special section on GitHub that gets created separately. At least it looks like that :) @Borda knows.

Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
@Borda
Member

Borda commented Mar 15, 2021

No, I believe these release notes are a special section on GitHub that gets created separately. At least it looks like that :) @Borda knows.

Not sure what "incompatible change" means here; we should always stay compatible with any past API for at least 0.2 versions.
So the process is deprecation: it gets written in the Deprecated section for a particular version v0.X,
and when it is truly removed in v0.(X+2) we write it again in the Removed section... 🐰

@rohitgr7
Contributor Author

No, I believe these release notes are a special section on GitHub that gets created separately. At least it looks like that :) @Borda knows.

Not sure what "incompatible change" means here; we should always stay compatible with any past API for at least 0.2 versions.
So the process is deprecation: it gets written in the Deprecated section for a particular version v0.X,
and when it is truly removed in v0.(X+2) we write it again in the Removed section... 🐰

@Borda here nothing is being removed or deprecated. Only the behavior is changed, since it was missed in the past when its counterpart (on_epoch_end) was updated. So I guess @awaelchli is suggesting that a simple changelog entry might not be enough here.

@rohitgr7 rohitgr7 added the _Will label Mar 15, 2021
Member

@Borda Borda left a comment


Maybe I am missing something; can we update the PR description?
I see some rename from on_train_epoch_start, but it is not mentioned anywhere...
Also, should the generic on_epoch_start have a phase argument?

@rohitgr7 rohitgr7 requested a review from Borda March 16, 2021 11:20
Contributor

@ananthsub ananthsub left a comment


These generic on_epoch_start and on_epoch_end hooks cause a lot of confusion for users who don't know how an epoch is defined in Lightning. Nor do they account for validation being run multiple times within the train epoch when val_check_interval is configured (see the sketch below).

The existing hooks are explicit about when they're applied. If anything, I think we should remove the generic on_epoch_end hook and ask users to share code across the existing train/val/test epoch-end hooks.

Being more explicit in the naming is also a motivation for #6420
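
(Illustrative sketch, not from this thread: val_check_interval is a standard Trainer argument; when set below 1.0, validation, and hence any generic epoch hook attached to it, runs more than once per training epoch.)

```python
from pytorch_lightning import Trainer

# Validation runs twice within each training epoch here, so a generic
# "epoch" hook tied to the validation loop also fires twice per
# training epoch.
trainer = Trainer(max_epochs=3, val_check_interval=0.5)
```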

@rohitgr7
Contributor Author

The existing hooks are explicit about when they're applied. If anything, I think we should remove the generic on_epoch_end hook and ask users to share code across the existing train/val/test epoch-end hooks.

Exactly what I suggested on Slack, but William suggested making it generic so it runs irrespective of train/val/test. Alternatively, a user can build such generic logic themselves via a separate helper they define:

def _common_epoch_start(self):
    # shared logic for train, val, and test epoch start
    ...

def on_train_epoch_start(self):
    self._common_epoch_start()

def on_validation_epoch_start(self):
    self._common_epoch_start()

def on_test_epoch_start(self):
    # do something else
    ...

Or we could just name them on_loop_start (at least better than "epoch", IMO) if required.

I just updated this to match the way on_epoch_end currently behaves on master.
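
For reference, a hedged sketch (Callback hook signatures assumed as in Lightning 1.2) of what the generic hook looks like from the callback side once this lands:

```python
import pytorch_lightning as pl

class EpochStartLogger(pl.Callback):
    def on_epoch_start(self, trainer, pl_module):
        # With this PR, fires at the start of the train, validation,
        # and test loops alike, mirroring on_epoch_end.
        print("generic epoch start")

    def on_validation_epoch_start(self, trainer, pl_module):
        # Loop-specific hook, still available for validation-only logic.
        print("validation epoch start")
```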

@Borda Borda merged commit 9be092d into master Mar 25, 2021
@Borda Borda deleted the docs/on_ep branch March 25, 2021 13:20
@carmocca carmocca mentioned this pull request Mar 29, 2021
carmocca pushed a commit that referenced this pull request Mar 29, 2021
… of train/val/test (#6498)

* update docs

* add hook and update docs

* update tests

* chlog

* Update CHANGELOG.md

Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>

* chlog

Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
Borda pushed a commit that referenced this pull request Mar 30, 2021
… of train/val/test (#6498)

* update docs

* add hook and update docs

* update tests

* chlog

* Update CHANGELOG.md

Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>

* chlog

Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
lexierule pushed a commit that referenced this pull request Mar 30, 2021
… of train/val/test (#6498)

* update docs

* add hook and update docs

* update tests

* chlog

* Update CHANGELOG.md

Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>

* chlog

Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
facebook-github-bot pushed a commit to facebookresearch/d2go that referenced this pull request Apr 14, 2021
…ter) to github/third-party/PyTorchLightning/pytorch-lightning

Summary:
### New commit log messages
## [UnReleased] - 2021-MM-DD

### Added

- Added more explicit exception message when trying to execute `trainer.test()` or `trainer.validate()` with `fast_dev_run=True` ([#6667](Lightning-AI/pytorch-lightning#6667))

- Added `LightningCLI` class to provide simple reproducibility with minimum boilerplate training cli. ([#4492](Lightning-AI/pytorch-lightning#4492))

- Trigger warning when non-metric logged value with multi processes hasn't been reduced ([#6417](Lightning-AI/pytorch-lightning#6417))

- Added `gradient_clip_algorithm` argument to Trainer for gradient clipping by value ([#6123](Lightning-AI/pytorch-lightning#6123)).

- Added a way to print to terminal without breaking up the progress bar ([#5470](Lightning-AI/pytorch-lightning#5470))

- Added support to checkpoint after training steps in `ModelCheckpoint` callback ([#6146](Lightning-AI/pytorch-lightning#6146))

- Added `checkpoint` parameter to callback's `on_save_checkpoint` hook ([#6072](Lightning-AI/pytorch-lightning#6072))

- Added `RunningStage.SANITY_CHECKING` ([#4945](Lightning-AI/pytorch-lightning#4945))

- Added `TrainerState.{FITTING,VALIDATING,TESTING,PREDICTING,TUNING}` ([#4945](Lightning-AI/pytorch-lightning#4945))

- Added `Trainer.validate()` method to perform one evaluation epoch over the validation set ([#4948](Lightning-AI/pytorch-lightning#4948))

- Added `LightningEnvironment` for Lightning-specific DDP ([#5915](Lightning-AI/pytorch-lightning#5915))

- Added `teardown()` hook to LightningDataModule ([#4673](Lightning-AI/pytorch-lightning#4673))

- Added `auto_insert_metric_name` parameter to `ModelCheckpoint` ([#6277](Lightning-AI/pytorch-lightning#6277))

- Added arg to `self.log` that enables users to give custom names when dealing with multiple dataloaders ([#6274](Lightning-AI/pytorch-lightning#6274))

- Added `teardown` method to `BaseProfiler` to enable subclasses defining post-profiling steps outside of `__del__` ([#6370](Lightning-AI/pytorch-lightning#6370))

- Added `setup` method to `BaseProfiler` to enable subclasses defining pre-profiling steps for every process ([#6633](Lightning-AI/pytorch-lightning#6633))

- Added no return warning to predict ([#6139](Lightning-AI/pytorch-lightning#6139))

- Added `Trainer.predict` config validation ([#6543](Lightning-AI/pytorch-lightning#6543))

- Added `AbstractProfiler` interface ([#6621](Lightning-AI/pytorch-lightning#6621))

- Added support for including module names for forward in the autograd trace of `PyTorchProfiler` ([#6349](Lightning-AI/pytorch-lightning#6349))

- Added support for the PyTorch 1.8.1 autograd profiler ([#6618](Lightning-AI/pytorch-lightning#6618))

- Added `outputs` parameter to callback's `on_validation_epoch_end` & `on_test_epoch_end` hooks ([#6120](Lightning-AI/pytorch-lightning#6120))

- Added `configure_sharded_model` hook ([#6679](Lightning-AI/pytorch-lightning#6679))

- Added support for `precision=64`, enabling training with double precision ([#6595](Lightning-AI/pytorch-lightning#6595))

- Added support for DDP communication hooks ([#6736](Lightning-AI/pytorch-lightning#6736))

- Added `artifact_location` argument to `MLFlowLogger` which will be passed to the `MlflowClient.create_experiment` call ([#6677](Lightning-AI/pytorch-lightning#6677))

- Added `model` parameter to precision plugins' `clip_gradients` signature ([#6764](Lightning-AI/pytorch-lightning#6764))

### Changed

- Renamed `pytorch_lightning.callbacks.swa` to `pytorch_lightning.callbacks.stochastic_weight_avg` ([#6259](Lightning-AI/pytorch-lightning#6259))

- Refactor `RunningStage` and `TrainerState` usage ([#4945](Lightning-AI/pytorch-lightning#4945))

- Changed `trainer.evaluating` to return `True` if validating or testing ([#4945](Lightning-AI/pytorch-lightning#4945))

- Changed `setup()` and `teardown()` stage argument to take any of `{fit,validate,test,predict}` ([#6386](Lightning-AI/pytorch-lightning#6386))

- Changed profilers to save separate report files per state and rank ([#6621](Lightning-AI/pytorch-lightning#6621))

- Changed `PyTorchProfiler` to use `torch.autograd.profiler.record_function` to record functions ([#6349](Lightning-AI/pytorch-lightning#6349))

### Deprecated

- `period` has been deprecated in favor of `every_n_val_epochs` in the `ModelCheckpoint` callback ([#6146](Lightning-AI/pytorch-lightning#6146))

- Deprecated `trainer.running_sanity_check` in favor of `trainer.sanity_checking` ([#4945](Lightning-AI/pytorch-lightning#4945))

- Deprecated `Profiler(output_filename)` in favor of `dirpath` and `filename` ([#6621](Lightning-AI/pytorch-lightning#6621))

- Deprecated `PytorchProfiler(profiled_functions)` in favor of `record_functions` ([#6349](Lightning-AI/pytorch-lightning#6349))

- Deprecated metrics in favor of `torchmetrics` ([#6505](Lightning-AI/pytorch-lightning#6505),
    [#6530](Lightning-AI/pytorch-lightning#6530),
    [#6540](Lightning-AI/pytorch-lightning#6540),
    [#6547](Lightning-AI/pytorch-lightning#6547),
    [#6515](Lightning-AI/pytorch-lightning#6515),
    [#6572](Lightning-AI/pytorch-lightning#6572),
    [#6573](Lightning-AI/pytorch-lightning#6573),
    [#6584](Lightning-AI/pytorch-lightning#6584),
    [#6636](Lightning-AI/pytorch-lightning#6636),
    [#6637](Lightning-AI/pytorch-lightning#6637),
    [#6649](Lightning-AI/pytorch-lightning#6649),
    [#6659](Lightning-AI/pytorch-lightning#6659),
)

### Removed

- Removed support for passing a bool value to `profiler` argument of Trainer ([#6164](Lightning-AI/pytorch-lightning#6164))

- Removed no return warning from val/test step ([#6139](Lightning-AI/pytorch-lightning#6139))

- Removed passing a `ModelCheckpoint` instance to `Trainer(checkpoint_callback)` ([#6166](Lightning-AI/pytorch-lightning#6166))

- Removed deprecated Trainer argument `enable_pl_optimizer` and `automatic_optimization` ([#6163](Lightning-AI/pytorch-lightning#6163))

- Removed deprecated metrics ([#6161](Lightning-AI/pytorch-lightning#6161))
    * from `pytorch_lightning.metrics.functional.classification` removed `to_onehot`, `to_categorical`, `get_num_classes`, `roc`, `multiclass_roc`, `average_precision`, `precision_recall_curve`, `multiclass_precision_recall_curve`
    * from `pytorch_lightning.metrics.functional.reduction` removed `reduce`, `class_reduce`

- Removed deprecated `ModelCheckpoint` arguments `prefix`, `mode="auto"` ([#6162](Lightning-AI/pytorch-lightning#6162))

- Removed `mode='auto'` from `EarlyStopping` ([#6167](Lightning-AI/pytorch-lightning#6167))

- Removed legacy references for magic keys in the `Result` object ([#6016](Lightning-AI/pytorch-lightning#6016))

- Removed deprecated `LightningModule` `hparams` setter ([#6207](Lightning-AI/pytorch-lightning#6207))

- Removed legacy code to log or include metrics in the progress bar by returning them in a dict with the `"log"/"progress_bar"` magic keys. Use `self.log` instead ([#6734](Lightning-AI/pytorch-lightning#6734))

- Removed `optimizer_idx` argument from `training_step` in manual optimization ([#6093](Lightning-AI/pytorch-lightning#6093))

### Fixed

- Set better defaults for `rank_zero_only.rank` when training is launched with SLURM and torchelastic ([#6802](Lightning-AI/pytorch-lightning#6802))

- Made the `Plugin.reduce` method more consistent across all Plugins to reflect a mean-reduction by default ([#6011](Lightning-AI/pytorch-lightning#6011))

- Move lightning module to correct device type when using LightningDistributedWrapper ([#6070](Lightning-AI/pytorch-lightning#6070))

- Do not print top-k verbose log with `ModelCheckpoint(monitor=None)` ([#6109](Lightning-AI/pytorch-lightning#6109))

- Fixed csv extension check ([#6436](Lightning-AI/pytorch-lightning#6436))

- Fixed `ModelCheckpoint(monitor=None, save_last=True)` not saving checkpoints ([#6136](Lightning-AI/pytorch-lightning#6136))

- Fixed `ModelCheckpoint(save_top_k=0, save_last=True)` not saving the `last` checkpoint ([#6136](Lightning-AI/pytorch-lightning#6136))

- Fixed `.teardown(stage='fit')` getting called during `trainer.test` ([#6386](Lightning-AI/pytorch-lightning#6386))

- Fixed `.on_fit_{start,end}()` getting called during `trainer.test` ([#6386](Lightning-AI/pytorch-lightning#6386))

- Fixed LightningModule `all_gather` on cpu tensors ([#6416](Lightning-AI/pytorch-lightning#6416))

- Fixed torch distributed not available in setup hook for DDP ([#6506](Lightning-AI/pytorch-lightning#6506))

- Fixed `EarlyStopping` logic when `min_epochs` or `min_steps` requirement is not met ([#6705](Lightning-AI/pytorch-lightning#6705))

## [1.2.7] - 2021-04-06

### Fixed

- Fixed a bug with omegaconf and `xm.save` ([#6741](Lightning-AI/pytorch-lightning#6741))
- Fixed an issue with IterableDataset when __len__ is not defined ([#6828](Lightning-AI/pytorch-lightning#6828))
- Sanitize None params during pruning ([#6836](Lightning-AI/pytorch-lightning#6836))
- Enforce an epoch scheduler interval when using SWA ([#6588](Lightning-AI/pytorch-lightning#6588))
- Fixed TPU Colab hang issue, post training ([#6816](Lightning-AI/pytorch-lightning#6816))
- Fixed a bug where `TensorBoardLogger` would give a warning and not log correctly to a symbolic link `save_dir` ([#6730](Lightning-AI/pytorch-lightning#6730))

## [1.2.6] - 2021-03-30

### Changed

- Changed the behavior of `on_epoch_start` to run at the beginning of validation & test epoch ([#6498](Lightning-AI/pytorch-lightning#6498))

### Removed

- Removed legacy code to include `step` dictionary returns in `callback_metrics`. Use `self.log_dict` instead. ([#6682](Lightning-AI/pytorch-lightning#6682))

### Fixed

- Fixed `DummyLogger.log_hyperparams` raising a `TypeError` when running with `fast_dev_run=True` ([#6398](Lightning-AI/pytorch-lightning#6398))
- Fixed error on TPUs when there was no `ModelCheckpoint` ([#6654](Lightning-AI/pytorch-lightning#6654))
- Fixed `trainer.test` freeze on TPUs ([#6654](Lightning-AI/pytorch-lightning#6654))
- Fixed a bug where gradients were disabled after calling `Trainer.predict` ([#6657](Lightning-AI/pytorch-lightning#6657))
- Fixed bug where no TPUs were detected in a TPU pod env ([#6719](Lightning-AI/pytorch-lightning#6719))

## [1.2.5] - 2021-03-23

### Changed

- Update Gradient Clipping for the TPU Accelerator ([#6576](Lightning-AI/pytorch-lightning#6576))
- Refactored setup to be typing-friendly ([#6590](Lightning-AI/pytorch-lightning#6590))

### Fixed

- Fixed a bug where `all_gather` would not work correctly with `tpu_cores=8` ([#6587](Lightning-AI/pytorch-lightning#6587))
- Fixed comparing required versions ([#6434](Lightning-AI/pytorch-lightning#6434))
- Fixed duplicate logs appearing in console when using the python logging module ([#6275](Lightning-AI/pytorch-lightning#6275))
- Added Autocast in validation, test and predict modes for Native AMP ([#6565](Lightning-AI/pytorch-lightning#6565))

Reviewed By: shuyingsunshine21

Differential Revision: D27528929

fbshipit-source-id: 311c88f71461c2c79bbf185e28d7a6d683ccc26f