Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[RFC] Ensure error handling is supported across all Trainer entry points #8723

Closed
ananthsub opened this issue Aug 4, 2021 · 1 comment · Fixed by #8819
Closed

[RFC] Ensure error handling is supported across all Trainer entry points #8723

ananthsub opened this issue Aug 4, 2021 · 1 comment · Fixed by #8819
Assignees
Labels
feature Is an improvement or enhancement help wanted Open to be worked on refactor
Milestone

Comments

@ananthsub
Copy link
Contributor

🚀 Feature

Motivation

We are auditing the Lightning components and APIs to assess opportunities for improvements:

One item that came up was error handling in the Trainer.

Currently, Lightning has error handling for when trainer.fit() is called. This allows for component cleanup before re-raising the exception to the parent program. https://github.com/PyTorchLightning/pytorch-lightning/blob/49d03f87fed0458cbb146d38243be56be4cb9689/pytorch_lightning/trainer/trainer.py#L1057-L1079

However, this error handling currently applies only during trainer.fit(). Instead, we should ensure this try/catch applies to all top-level trainer functions, such as trainer.validate(), trainer.test(), and trainer.predict(). This can be very useful to power features such as error collection datasets.

Pitch

The _run function houses most of the execution logic: all the top-level trainer entry points are funneled through here for processing: https://github.com/PyTorchLightning/pytorch-lightning/blob/963c26764682fa4cf64c93c5a7572ae0040e9c32/pytorch_lightning/trainer/trainer.py#L854

We could wrap _run with the try/catch and rename the current _run to _run_impl

def _run(self, model: "pl.LightningModule") -> ...:
    try:
        return self._run_impl(model)
    except KeyboardInterrupt:
        ...
    except BaseException:
        ....
        raise

lifting the logic from _run_train here for the shutdown: https://github.com/PyTorchLightning/pytorch-lightning/blob/49d03f87fed0458cbb146d38243be56be4cb9689/pytorch_lightning/trainer/trainer.py#L1061-L1079

Alternatives

The proposal above misses some of the misconfiguration errors which take place inside of fit/validate/test/predict before _run is called. To ensure no gaps, we could have corresponding _fit_impl, _validate_impl, _test_impl, and _predict_impl functions in the trainer , such that fit becomes:

def fit():
    try:
        return _fit_impl()
    # exception handling

Both of these proposals would remove the need for error handling inside of _run_train specifically

Additional context


If you enjoy Lightning, check out our other projects! ⚡

  • Metrics: Machine learning metrics for distributed, scalable PyTorch applications.

  • Flash: The fastest way to get a Lightning baseline! A collection of tasks for fast prototyping, baselining, finetuning and solving problems with deep learning

  • Bolts: Pretrained SOTA Deep Learning models, callbacks and more for research and production with PyTorch Lightning and PyTorch

  • Lightning Transformers: Flexible interface for high performance research using SOTA Transformers leveraging Pytorch Lightning, Transformers, and Hydra.

@ananthsub ananthsub added feature Is an improvement or enhancement help wanted Open to be worked on refactor labels Aug 4, 2021
@ananthsub ananthsub added this to the v1.5 milestone Aug 4, 2021
@ananthsub ananthsub self-assigned this Aug 4, 2021
@ananthsub ananthsub changed the title Ensure error handling is supported across all Trainer entry points [RFC] Ensure error handling is supported across all Trainer entry points Aug 4, 2021
@ananthsub
Copy link
Contributor Author

List of things to address:

  • We call on_train_end in the error handling, but this doesn't apply for other trainer entry points

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature Is an improvement or enhancement help wanted Open to be worked on refactor
Projects
None yet
Development

Successfully merging a pull request may close this issue.

1 participant