ananthsub changed the title from "Ensure error handling is supported across all Trainer entry points" to "[RFC] Ensure error handling is supported across all Trainer entry points" on Aug 4, 2021.
🚀 Feature
Motivation
We are auditing the Lightning components and APIs to assess opportunities for improvement. One item that came up was error handling in the Trainer.

Currently, Lightning has error handling for when trainer.fit() is called. This allows for component cleanup before re-raising the exception to the parent program: https://github.com/PyTorchLightning/pytorch-lightning/blob/49d03f87fed0458cbb146d38243be56be4cb9689/pytorch_lightning/trainer/trainer.py#L1057-L1079

However, this error handling currently applies only during trainer.fit(). Instead, we should ensure this try/catch applies to all top-level trainer functions, such as trainer.validate(), trainer.test(), and trainer.predict(). This can be very useful to power features such as error collection datasets.
Pitch
The _run function houses most of the execution logic: all the top-level trainer entry points are funneled through here for processing: https://github.com/PyTorchLightning/pytorch-lightning/blob/963c26764682fa4cf64c93c5a7572ae0040e9c32/pytorch_lightning/trainer/trainer.py#L854

We could wrap _run with the try/catch and rename the current _run to _run_impl, lifting the logic from _run_train here for the shutdown: https://github.com/PyTorchLightning/pytorch-lightning/blob/49d03f87fed0458cbb146d38243be56be4cb9689/pytorch_lightning/trainer/trainer.py#L1061-L1079
Alternatives
The proposal above misses some of the misconfiguration errors which take place inside of fit/validate/test/predict before _run is called. To ensure no gaps, we could have corresponding _fit_impl, _validate_impl, _test_impl, and _predict_impl functions in the trainer, such that fit wraps _fit_impl in the same try/catch.

Both of these proposals would remove the need for error handling inside of _run_train specifically.

Additional context
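As context for the alternative above, a minimal sketch of what the *_impl split could look like for fit; validate/test/predict would follow the same shape with _validate_impl, _test_impl, and _predict_impl. The _teardown helper and the method bodies are illustrative assumptions, not Lightning's actual API:

```python
# Sketch: the public entry point wraps its own implementation, so
# misconfiguration errors raised before _run is even reached are
# caught by the same error handling.
class Trainer:
    def __init__(self):
        self.teardown_called = False

    def fit(self, model=None):
        try:
            return self._fit_impl(model)
        except BaseException:
            # Clean up, then re-raise to the parent program.
            self._teardown()
            raise

    def _fit_impl(self, model):
        # The current body of fit moves here, including the
        # misconfiguration checks that run before _run.
        if model is None:
            raise ValueError("model is required")
        return "fitted"

    def _teardown(self):
        # Placeholder for the shutdown logic currently in _run_train.
        self.teardown_called = True


t = Trainer()
try:
    t.fit()  # misconfiguration: error raised before _run would be called
except ValueError:
    pass
print(t.teardown_called)  # True
```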
If you enjoy Lightning, check out our other projects! ⚡
Metrics: Machine learning metrics for distributed, scalable PyTorch applications.
Flash: The fastest way to get a Lightning baseline! A collection of tasks for fast prototyping, baselining, finetuning and solving problems with deep learning
Bolts: Pretrained SOTA Deep Learning models, callbacks and more for research and production with PyTorch Lightning and PyTorch
Lightning Transformers: Flexible interface for high-performance research using SOTA Transformers, leveraging PyTorch Lightning, Transformers, and Hydra.