[RFC] Ensure error handling is supported across all Trainer entry points #8723

ananthsub · 2021-08-04T20:42:19Z

🚀 Feature

Motivation

We are auditing the Lightning components and APIs to assess opportunities for improvements:

One item that came up was error handling in the Trainer.

Currently, Lightning has error handling for when trainer.fit() is called. This allows for component cleanup before re-raising the exception to the parent program. https://github.com/PyTorchLightning/pytorch-lightning/blob/49d03f87fed0458cbb146d38243be56be4cb9689/pytorch_lightning/trainer/trainer.py#L1057-L1079

However, this error handling currently applies only during trainer.fit(). Instead, we should ensure this try/catch applies to all top-level trainer functions, such as trainer.validate(), trainer.test(), and trainer.predict(). This can be very useful to power features such as error collection datasets.

Pitch

The _run function houses most of the execution logic: all the top-level trainer entry points are funneled through here for processing: https://github.com/PyTorchLightning/pytorch-lightning/blob/963c26764682fa4cf64c93c5a7572ae0040e9c32/pytorch_lightning/trainer/trainer.py#L854

We could wrap _run with the try/catch and rename the current _run to _run_impl

def _run(self, model: "pl.LightningModule") -> ...:
    try:
        return self._run_impl(model)
    except KeyboardInterrupt:
        ...
    except BaseException:
        ....
        raise

lifting the logic from _run_train here for the shutdown: https://github.com/PyTorchLightning/pytorch-lightning/blob/49d03f87fed0458cbb146d38243be56be4cb9689/pytorch_lightning/trainer/trainer.py#L1061-L1079

Alternatives

The proposal above misses some of the misconfiguration errors which take place inside of fit/validate/test/predict before _run is called. To ensure no gaps, we could have corresponding _fit_impl, _validate_impl, _test_impl, and _predict_impl functions in the trainer , such that fit becomes:

def fit():
    try:
        return _fit_impl()
    # exception handling

Both of these proposals would remove the need for error handling inside of _run_train specifically

Additional context

If you enjoy Lightning, check out our other projects! ⚡

_{Metrics: Machine learning metrics for distributed, scalable PyTorch applications.

Flash: The fastest way to get a Lightning baseline! A collection of tasks for fast prototyping, baselining, finetuning and solving problems with deep learning

Bolts: Pretrained SOTA Deep Learning models, callbacks and more for research and production with PyTorch Lightning and PyTorch

Lightning Transformers: Flexible interface for high performance research using SOTA Transformers leveraging Pytorch Lightning, Transformers, and Hydra.}

The text was updated successfully, but these errors were encountered:

ananthsub · 2021-08-05T03:07:13Z

List of things to address:

We call on_train_end in the error handling, but this doesn't apply for other trainer entry points

ananthsub added feature Is an improvement or enhancement help wanted Open to be worked on refactor labels Aug 4, 2021

ananthsub added this to the v1.5 milestone Aug 4, 2021

ananthsub self-assigned this Aug 4, 2021

ananthsub changed the title ~~Ensure error handling is supported across all Trainer entry points~~ [RFC] Ensure error handling is supported across all Trainer entry points Aug 4, 2021

daniellepintz mentioned this issue Aug 9, 2021

Add error handling for all trainer entry points #8819

Merged

12 tasks

ananthsub closed this as completed in #8819 Aug 18, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[RFC] Ensure error handling is supported across all Trainer entry points #8723

[RFC] Ensure error handling is supported across all Trainer entry points #8723

ananthsub commented Aug 4, 2021

ananthsub commented Aug 5, 2021

[RFC] Ensure error handling is supported across all Trainer entry points #8723

[RFC] Ensure error handling is supported across all Trainer entry points #8723

Comments

ananthsub commented Aug 4, 2021

🚀 Feature

Motivation

Pitch

Alternatives

Additional context

If you enjoy Lightning, check out our other projects! ⚡

ananthsub commented Aug 5, 2021