
Add on_epoch_start to run at the beginning of every loop irrespective of train/val/test #6498

Merged
6 commits merged into master on Mar 25, 2021

Conversation

rohitgr7
Contributor

@rohitgr7 rohitgr7 commented Mar 12, 2021

What does this PR do?

Added on_epoch_start to run at the beginning of every loop, irrespective of train/val/test. This is the common hook that lets users do anything that needs to happen for all of train/val/test. If something needs to be done separately for train, val, or test, we already have the on_train/val/test_epoch_start hooks.
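
For example (an illustrative sketch, not part of this PR's diff; hook signatures assumed as in Lightning 1.2):

```python
import pytorch_lightning as pl

class MyModel(pl.LightningModule):
    def on_epoch_start(self):
        # With this change, runs at the start of every loop:
        # training, validation, and test.
        self.print("epoch starting (any loop)")

    def on_train_epoch_start(self):
        # Loop-specific hook, still available for train-only logic.
        self.print("training epoch starting")
```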

Before submitting

  • Was this discussed/approved via a GitHub issue? (not for typos and docs)
  • Did you read the contributor guideline, Pull Request section?
  • Did you make sure your PR does only one thing, instead of bundling different changes together?
  • Did you make sure to update the documentation with your changes? (if necessary)
  • Did you write any new necessary tests? (not for typos and docs)
  • Did you verify new and existing tests pass locally with your changes?
  • Did you update the CHANGELOG? (not for typos, docs, test updates, or internal minor changes/refactorings)

PR review

Anyone in the community is free to review the PR once the tests have passed.
Before you start reviewing, make sure you have read the Review guidelines. In short, see the following bullet list:

  • Is this pull request ready for review? (if not, please submit in draft mode)
  • Check that all items from Before submitting are resolved
  • Make sure the title is self-explanatory and the description concisely explains the PR
  • Add labels and milestones (and optionally projects) to the PR so it can be classified

Did you have fun?

Make sure you had fun coding 🙃

@rohitgr7
Contributor Author

quick question here:
https://github.com/PyTorchLightning/pytorch-lightning/blob/680e83adab38c2d680b138bdc39d48fc35c0cb58/pytorch_lightning/trainer/evaluation_loop.py#L120-L124
https://github.com/PyTorchLightning/pytorch-lightning/blob/680e83adab38c2d680b138bdc39d48fc35c0cb58/pytorch_lightning/trainer/evaluation_loop.py#L314-L321

on_epoch_end is called after every epoch irrespective of train/val/test, but this is not the case with on_epoch_start. Is this intended behavior, or was it missed? Let me know and I'll update the PR and tests accordingly if required ✌️

cc @carmocca @tchaton

@codecov

codecov bot commented Mar 12, 2021

Codecov Report

Merging #6498 (37e0c88) into master (dcd9dd8) will decrease coverage by 45%.
The diff coverage is 80%.

@@           Coverage Diff            @@
##           master   #6498     +/-   ##
========================================
- Coverage      93%     47%    -45%     
========================================
  Files         161     161             
  Lines       11518   11413    -105     
========================================
- Hits        10661    5374   -5287     
- Misses        857    6039   +5182     

@carmocca
Contributor

Is this intended behavior, or was it missed? Let me know and I'll update the PR and tests accordingly if required ✌️

I'd say missed. I don't see any reason for it to act differently.

@rohitgr7 rohitgr7 changed the title [WIP] Update docs for on_epoch_start/on_epoch_end Add on_epoch_start to run at the beginning of every loop irrespective of train/val/test Mar 14, 2021
@rohitgr7 rohitgr7 added this to the 1.2.x milestone Mar 14, 2021
@rohitgr7 rohitgr7 added design Includes a design discussion callback feature Is an improvement or enhancement labels Mar 14, 2021
@rohitgr7 rohitgr7 changed the title Add on_epoch_start to run at the beginning of every loop irrespective of train/val/test [WIP] Add on_epoch_start to run at the beginning of every loop irrespective of train/val/test Mar 14, 2021
@rohitgr7 rohitgr7 marked this pull request as draft March 14, 2021 13:59
@rohitgr7 rohitgr7 changed the title [WIP] Add on_epoch_start to run at the beginning of every loop irrespective of train/val/test Add on_epoch_start to run at the beginning of every loop irrespective of train/val/test Mar 14, 2021
@rohitgr7 rohitgr7 marked this pull request as ready for review March 14, 2021 15:28
Contributor

@tchaton tchaton left a comment


LGTM! Is it called only once for training?

Contributor

@awaelchli awaelchli left a comment


Can we make a "backward incompatible changes" section in the release notes, like PyTorch does?

This PR is an example where the changelog entry is not enough.

@rohitgr7
Contributor Author

Can we make a "backward incompatible changes" section in the release notes, like PyTorch does?

This PR is an example where the changelog entry is not enough.

Should I add a section for it in the changelog? @awaelchli

@awaelchli
Contributor

No, I believe these release notes are a special section on GitHub that gets created separately. At least it looks like that :) @Borda knows.

Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
@Borda
Member

Borda commented Mar 15, 2021

No, I believe these release notes are a special section on GitHub that gets created separately. At least it looks like that :) @Borda knows.

Not sure what "incompatible change" means here; we should always stay compatible with any past API for at least 0.2 versions.
So the process is deprecation: it gets written in the Deprecated section for a particular version v0.X,
and when it is truly removed in v0.(X+2) we write it again in the Removed section... 🐰

@rohitgr7
Contributor Author

No, I believe these release notes are a special section on GitHub that gets created separately. At least it looks like that :) @Borda knows.

Not sure what "incompatible change" means here; we should always stay compatible with any past API for at least 0.2 versions.
So the process is deprecation: it gets written in the Deprecated section for a particular version v0.X,
and when it is truly removed in v0.(X+2) we write it again in the Removed section... 🐰

@Borda here nothing is being removed or deprecated. Only the behavior is changed, since it was missed in the past when its counterpart (on_epoch_end) was updated. So I guess @awaelchli is suggesting that a simple changelog entry might not be enough here.

@rohitgr7 rohitgr7 added the _Will label Mar 15, 2021
Member

@Borda Borda left a comment


Maybe I am missing something; can we update the PR description?
I see some rename from on_train_epoch_start, but it is not mentioned anywhere...
Also, should the generic on_epoch_start have a phase argument?

@rohitgr7 rohitgr7 requested a review from Borda March 16, 2021 11:20
Contributor

@ananthsub ananthsub left a comment


These generic on_epoch_start and on_epoch_end hooks cause a lot of confusion for users who don't know how an epoch is defined in Lightning. Nor do they account for validation being run multiple times within the train epoch when val_check_interval is configured (see the sketch below).

The existing hooks are explicit about when they're applied. If anything, I think we should remove the generic on_epoch_end hook and ask users to share code across the existing train/val/test epoch-end hooks.

Being more explicit in the naming is also a motivation for #6420
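
(Illustrative sketch, not from this thread: val_check_interval is a standard Trainer argument; when set below 1.0, validation, and hence any generic epoch hook attached to it, runs more than once per training epoch.)

```python
from pytorch_lightning import Trainer

# Validation runs twice within each training epoch here, so a generic
# "epoch" hook tied to the validation loop also fires twice per
# training epoch.
trainer = Trainer(max_epochs=3, val_check_interval=0.5)
```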

@rohitgr7
Contributor Author

The existing hooks are explicit about when they're applied. If anything, I think we should remove the generic on_epoch_end hook and ask users to share code across the existing train/val/test epoch-end hooks.

Exactly what I suggested on Slack, but William suggested making it generic so it runs irrespective of train/val/test. Alternatively, a user can build such generic logic themselves via a separate helper they define:

def _common_epoch_start(self):
    # shared logic for train, val, and test epoch start
    ...

def on_train_epoch_start(self):
    self._common_epoch_start()

def on_validation_epoch_start(self):
    self._common_epoch_start()

def on_test_epoch_start(self):
    # do something else
    ...

Or we could just name them on_loop_start (at least better than "epoch", IMO) if required.

I just updated this to match the way on_epoch_end currently behaves on master.
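
For reference, a hedged sketch (Callback hook signatures assumed as in Lightning 1.2) of what the generic hook looks like from the callback side once this lands:

```python
import pytorch_lightning as pl

class EpochStartLogger(pl.Callback):
    def on_epoch_start(self, trainer, pl_module):
        # With this PR, fires at the start of the train, validation,
        # and test loops alike, mirroring on_epoch_end.
        print("generic epoch start")

    def on_validation_epoch_start(self, trainer, pl_module):
        # Loop-specific hook, still available for validation-only logic.
        print("validation epoch start")
```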

@Borda Borda merged commit 9be092d into master Mar 25, 2021
@Borda Borda deleted the docs/on_ep branch March 25, 2021 13:20
@carmocca carmocca mentioned this pull request Mar 29, 2021
carmocca pushed a commit that referenced this pull request Mar 29, 2021
… of train/val/test (#6498)

* update docs

* add hook and update docs

* update tests

* chlog

* Update CHANGELOG.md

Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>

* chlog

Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
Borda pushed a commit that referenced this pull request Mar 30, 2021
… of train/val/test (#6498)

* update docs

* add hook and update docs

* update tests

* chlog

* Update CHANGELOG.md

Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>

* chlog

Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
lexierule pushed a commit that referenced this pull request Mar 30, 2021
… of train/val/test (#6498)

* update docs

* add hook and update docs

* update tests

* chlog

* Update CHANGELOG.md

Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>

* chlog

Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
facebook-github-bot pushed a commit to facebookresearch/d2go that referenced this pull request Apr 14, 2021
…ter) to github/third-party/PyTorchLightning/pytorch-lightning

Summary:
### New commit log messages
## [UnReleased] - 2021-MM-DD

### Added

- Added more explicit exception message when trying to execute `trainer.test()` or `trainer.validate()` with `fast_dev_run=True` ([#6667](Lightning-AI/pytorch-lightning#6667))

- Added `LightningCLI` class to provide simple reproducibility with minimum boilerplate training cli. ([#4492](Lightning-AI/pytorch-lightning#4492))

- Trigger warning when non-metric logged value with multi processes hasn't been reduced ([#6417](Lightning-AI/pytorch-lightning#6417))

- Added `gradient_clip_algorithm` argument to Trainer for gradient clipping by value ([#6123](Lightning-AI/pytorch-lightning#6123)).

- Added a way to print to terminal without breaking up the progress bar ([#5470](Lightning-AI/pytorch-lightning#5470))

- Added support to checkpoint after training steps in `ModelCheckpoint` callback ([#6146](Lightning-AI/pytorch-lightning#6146))

- Added `checkpoint` parameter to callback's `on_save_checkpoint` hook ([#6072](Lightning-AI/pytorch-lightning#6072))

- Added `RunningStage.SANITY_CHECKING` ([#4945](Lightning-AI/pytorch-lightning#4945))

- Added `TrainerState.{FITTING,VALIDATING,TESTING,PREDICTING,TUNING}` ([#4945](Lightning-AI/pytorch-lightning#4945))

- Added `Trainer.validate()` method to perform one evaluation epoch over the validation set ([#4948](Lightning-AI/pytorch-lightning#4948))

- Added `LightningEnvironment` for Lightning-specific DDP ([#5915](Lightning-AI/pytorch-lightning#5915))

- Added `teardown()` hook to LightningDataModule ([#4673](Lightning-AI/pytorch-lightning#4673))

- Added `auto_insert_metric_name` parameter to `ModelCheckpoint` ([#6277](Lightning-AI/pytorch-lightning#6277))

- Added arg to `self.log` that enables users to give custom names when dealing with multiple dataloaders ([#6274](Lightning-AI/pytorch-lightning#6274))

- Added `teardown` method to `BaseProfiler` to enable subclasses defining post-profiling steps outside of `__del__` ([#6370](Lightning-AI/pytorch-lightning#6370))

- Added `setup` method to `BaseProfiler` to enable subclasses defining pre-profiling steps for every process ([#6633](Lightning-AI/pytorch-lightning#6633))

- Added no return warning to predict ([#6139](Lightning-AI/pytorch-lightning#6139))

- Added `Trainer.predict` config validation ([#6543](Lightning-AI/pytorch-lightning#6543))

- Added `AbstractProfiler` interface ([#6621](Lightning-AI/pytorch-lightning#6621))

- Added support for including module names for forward in the autograd trace of `PyTorchProfiler` ([#6349](Lightning-AI/pytorch-lightning#6349))

- Added support for the PyTorch 1.8.1 autograd profiler ([#6618](Lightning-AI/pytorch-lightning#6618))

- Added `outputs` parameter to callback's `on_validation_epoch_end` & `on_test_epoch_end` hooks ([#6120](Lightning-AI/pytorch-lightning#6120))

- Added `configure_sharded_model` hook ([#6679](Lightning-AI/pytorch-lightning#6679))

- Added support for `precision=64`, enabling training with double precision ([#6595](Lightning-AI/pytorch-lightning#6595))

- Added support for DDP communication hooks ([#6736](Lightning-AI/pytorch-lightning#6736))

- Added `artifact_location` argument to `MLFlowLogger` which will be passed to the `MlflowClient.create_experiment` call ([#6677](Lightning-AI/pytorch-lightning#6677))

- Added `model` parameter to precision plugins' `clip_gradients` signature ([#6764](Lightning-AI/pytorch-lightning#6764))

### Changed

- Renamed `pytorch_lightning.callbacks.swa` to `pytorch_lightning.callbacks.stochastic_weight_avg` ([#6259](Lightning-AI/pytorch-lightning#6259))

- Refactor `RunningStage` and `TrainerState` usage ([#4945](Lightning-AI/pytorch-lightning#4945))

- Changed `trainer.evaluating` to return `True` if validating or testing ([#4945](Lightning-AI/pytorch-lightning#4945))

- Changed `setup()` and `teardown()` stage argument to take any of `{fit,validate,test,predict}` ([#6386](Lightning-AI/pytorch-lightning#6386))

- Changed profilers to save separate report files per state and rank ([#6621](Lightning-AI/pytorch-lightning#6621))

- Changed `PyTorchProfiler` to use `torch.autograd.profiler.record_function` to record functions ([#6349](Lightning-AI/pytorch-lightning#6349))

### Deprecated

- `period` has been deprecated in favor of `every_n_val_epochs` in the `ModelCheckpoint` callback ([#6146](Lightning-AI/pytorch-lightning#6146))

- Deprecated `trainer.running_sanity_check` in favor of `trainer.sanity_checking` ([#4945](Lightning-AI/pytorch-lightning#4945))

- Deprecated `Profiler(output_filename)` in favor of `dirpath` and `filename` ([#6621](Lightning-AI/pytorch-lightning#6621))

- Deprecated `PytorchProfiler(profiled_functions)` in favor of `record_functions` ([#6349](Lightning-AI/pytorch-lightning#6349))

- Deprecated metrics in favor of `torchmetrics` ([#6505](Lightning-AI/pytorch-lightning#6505),
    [#6530](Lightning-AI/pytorch-lightning#6530),
    [#6540](Lightning-AI/pytorch-lightning#6540),
    [#6547](Lightning-AI/pytorch-lightning#6547),
    [#6515](Lightning-AI/pytorch-lightning#6515),
    [#6572](Lightning-AI/pytorch-lightning#6572),
    [#6573](Lightning-AI/pytorch-lightning#6573),
    [#6584](Lightning-AI/pytorch-lightning#6584),
    [#6636](Lightning-AI/pytorch-lightning#6636),
    [#6637](Lightning-AI/pytorch-lightning#6637),
    [#6649](Lightning-AI/pytorch-lightning#6649),
    [#6659](Lightning-AI/pytorch-lightning#6659),
)

### Removed

- Removed support for passing a bool value to `profiler` argument of Trainer ([#6164](Lightning-AI/pytorch-lightning#6164))

- Removed no return warning from val/test step ([#6139](Lightning-AI/pytorch-lightning#6139))

- Removed passing a `ModelCheckpoint` instance to `Trainer(checkpoint_callback)` ([#6166](Lightning-AI/pytorch-lightning#6166))

- Removed deprecated Trainer argument `enable_pl_optimizer` and `automatic_optimization` ([#6163](Lightning-AI/pytorch-lightning#6163))

- Removed deprecated metrics ([#6161](Lightning-AI/pytorch-lightning#6161))
    * from `pytorch_lightning.metrics.functional.classification` removed `to_onehot`, `to_categorical`, `get_num_classes`, `roc`, `multiclass_roc`, `average_precision`, `precision_recall_curve`, `multiclass_precision_recall_curve`
    * from `pytorch_lightning.metrics.functional.reduction` removed `reduce`, `class_reduce`

- Removed deprecated `ModelCheckpoint` arguments `prefix`, `mode="auto"` ([#6162](Lightning-AI/pytorch-lightning#6162))

- Removed `mode='auto'` from `EarlyStopping` ([#6167](Lightning-AI/pytorch-lightning#6167))

- Removed legacy references for magic keys in the `Result` object ([#6016](Lightning-AI/pytorch-lightning#6016))

- Removed deprecated `LightningModule` `hparams` setter ([#6207](Lightning-AI/pytorch-lightning#6207))

- Removed legacy code to log or include metrics in the progress bar by returning them in a dict with the `"log"/"progress_bar"` magic keys. Use `self.log` instead ([#6734](Lightning-AI/pytorch-lightning#6734))

- Removed `optimizer_idx` argument from `training_step` in manual optimization ([#6093](Lightning-AI/pytorch-lightning#6093))

### Fixed

- Set better defaults for `rank_zero_only.rank` when training is launched with SLURM and torchelastic ([#6802](Lightning-AI/pytorch-lightning#6802))

- Made the `Plugin.reduce` method more consistent across all Plugins to reflect a mean-reduction by default ([#6011](Lightning-AI/pytorch-lightning#6011))

- Move lightning module to correct device type when using LightningDistributedWrapper ([#6070](Lightning-AI/pytorch-lightning#6070))

- Do not print top-k verbose log with `ModelCheckpoint(monitor=None)` ([#6109](Lightning-AI/pytorch-lightning#6109))

- Fixed csv extension check ([#6436](Lightning-AI/pytorch-lightning#6436))

- Fixed `ModelCheckpoint(monitor=None, save_last=True)` not saving checkpoints ([#6136](Lightning-AI/pytorch-lightning#6136))

- Fixed `ModelCheckpoint(save_top_k=0, save_last=True)` not saving the `last` checkpoint ([#6136](Lightning-AI/pytorch-lightning#6136))

- Fixed `.teardown(stage='fit')` getting called during `trainer.test` ([#6386](Lightning-AI/pytorch-lightning#6386))

- Fixed `.on_fit_{start,end}()` getting called during `trainer.test` ([#6386](Lightning-AI/pytorch-lightning#6386))

- Fixed LightningModule `all_gather` on cpu tensors ([#6416](Lightning-AI/pytorch-lightning#6416))

- Fixed torch distributed not available in setup hook for DDP ([#6506](Lightning-AI/pytorch-lightning#6506))

- Fixed `EarlyStopping` logic when `min_epochs` or `min_steps` requirement is not met ([#6705](Lightning-AI/pytorch-lightning#6705))

## [1.2.7] - 2021-04-06

### Fixed

- Fixed a bug with omegaconf and `xm.save` ([#6741](Lightning-AI/pytorch-lightning#6741))
- Fixed an issue with IterableDataset when __len__ is not defined ([#6828](Lightning-AI/pytorch-lightning#6828))
- Sanitize None params during pruning ([#6836](Lightning-AI/pytorch-lightning#6836))
- Enforce an epoch scheduler interval when using SWA ([#6588](Lightning-AI/pytorch-lightning#6588))
- Fixed TPU Colab hang issue, post training ([#6816](Lightning-AI/pytorch-lightning#6816))
- Fixed a bug where `TensorBoardLogger` would give a warning and not log correctly to a symbolic link `save_dir` ([#6730](Lightning-AI/pytorch-lightning#6730))

## [1.2.6] - 2021-03-30

### Changed

- Changed the behavior of `on_epoch_start` to run at the beginning of validation & test epoch ([#6498](Lightning-AI/pytorch-lightning#6498))

### Removed

- Removed legacy code to include `step` dictionary returns in `callback_metrics`. Use `self.log_dict` instead. ([#6682](Lightning-AI/pytorch-lightning#6682))

### Fixed

- Fixed `DummyLogger.log_hyperparams` raising a `TypeError` when running with `fast_dev_run=True` ([#6398](Lightning-AI/pytorch-lightning#6398))
- Fixed error on TPUs when there was no `ModelCheckpoint` ([#6654](Lightning-AI/pytorch-lightning#6654))
- Fixed `trainer.test` freeze on TPUs ([#6654](Lightning-AI/pytorch-lightning#6654))
- Fixed a bug where gradients were disabled after calling `Trainer.predict` ([#6657](Lightning-AI/pytorch-lightning#6657))
- Fixed bug where no TPUs were detected in a TPU pod env ([#6719](Lightning-AI/pytorch-lightning#6719))

## [1.2.5] - 2021-03-23

### Changed

- Update Gradient Clipping for the TPU Accelerator ([#6576](Lightning-AI/pytorch-lightning#6576))
- Refactored setup to be typing-friendly ([#6590](Lightning-AI/pytorch-lightning#6590))

### Fixed

- Fixed a bug where `all_gather` would not work correctly with `tpu_cores=8` ([#6587](Lightning-AI/pytorch-lightning#6587))
- Fixed comparing required versions ([#6434](Lightning-AI/pytorch-lightning#6434))
- Fixed duplicate logs appearing in console when using the python logging module ([#6275](Lightning-AI/pytorch-lightning#6275))
- Added Autocast in validation, test and predict modes for Native AMP ([#6565](Lightning-AI/pytorch-lightning#6565))

Reviewed By: shuyingsunshine21

Differential Revision: D27528929

fbshipit-source-id: 311c88f71461c2c79bbf185e28d7a6d683ccc26f