# TPU Pod Training, IPU Accelerator, DeepSpeed Infinity, Fully Sharded Data Parallel
Today we are excited to announce Lightning 1.4, introducing support for TPU pods, XLA profiling, IPUs, and new plugins to reach 10+ billion parameters, including DeepSpeed Infinity, Fully Sharded Data Parallel, and more!
https://devblog.pytorchlightning.ai/announcing-lightning-1-4-8cd20482aee9
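Below is a minimal, illustrative sketch of how the headline scaling plugins can be enabled with the 1.4 `Trainer`. It is not code from the release notes: the plugin registry strings (`deepspeed_stage_3_offload`, `ddp_fully_sharded`) follow the 1.4-era documentation and may differ in your installed version, and the tiny model and data exist only to make the example self-contained.

```python
# Illustrative sketch only; plugin names are assumptions based on the 1.4-era docs.
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset
import pytorch_lightning as pl


class TinyModel(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = nn.Linear(32, 2)

    def training_step(self, batch, batch_idx):
        x, y = batch
        return nn.functional.cross_entropy(self.layer(x), y)

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=1e-3)


train_loader = DataLoader(
    TensorDataset(torch.randn(64, 32), torch.randint(0, 2, (64,))), batch_size=8
)

# DeepSpeed ZeRO Stage 3 with offloading ("DeepSpeed Infinity"): shards optimizer
# states, gradients, and parameters, and can offload them to CPU/NVMe.
trainer = pl.Trainer(gpus=4, precision=16, plugins="deepspeed_stage_3_offload")

# Or FairScale Fully Sharded Data Parallel (assumed registry name):
# trainer = pl.Trainer(gpus=4, precision=16, plugins="ddp_fully_sharded")

trainer.fit(TinyModel(), train_loader)
```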
## [1.4.0] - 2021-07-27

### Added
- Added `extract_batch_size` utility and corresponding tests to extract batch dimension from multiple batch types (#8357)
- Added support for named parameter groups in `LearningRateMonitor` (#7987)
- Added `dataclass` support for `pytorch_lightning.utilities.apply_to_collection` (#7935)
- Added support to `LightningModule.to_torchscript` for saving to custom filesystems with `fsspec` (#7617)
- Added `KubeflowEnvironment` for use with the `PyTorchJob` operator in Kubeflow
- Added LightningCLI support for config files on object stores (#7521)
- Added `ModelPruning(prune_on_train_epoch_end=True|False)` to choose when to apply pruning (#7704)
- Added support for checkpointing based on a provided time interval during training (#7515)
- Progress tracking
- Added support for passing a `LightningDataModule` positionally as the second argument to `trainer.{validate,test,predict}` (#7431)
- Added argument `trainer.predict(ckpt_path)` (#7430)
- Added `clip_grad_by_value` support for TPUs (#7025)
- Added support for passing any class to `is_overridden` (#7918)
- Added `sub_dir` parameter to `TensorBoardLogger` (#6195)
- Added correct `dataloader_idx` to batch transfer hooks (#6241)
- Added `include_none=bool` argument to `apply_to_collection` (#7769)
- Added `apply_to_collections` to apply a function to two zipped collections (#7769)
- Added `ddp_fully_sharded` support (#7487)
- Added `should_rank_save_checkpoint` property to Training Plugins (#7684)
- Added `log_grad_norm` hook to `LightningModule` to customize the logging of gradient norms (#7873)
- Added `save_config_filename` init argument to `LightningCLI` to ease resolving name conflicts (#7741)
- Added `save_config_overwrite` init argument to `LightningCLI` to ease overwriting existing config files (#8059)
- Added reset dataloader hooks to Training Plugins and Accelerators (#7861)
- Added trainer stage hooks for Training Plugins and Accelerators (#7864)
- Added the `on_before_optimizer_step` hook (#8048)
- Added IPU Accelerator (#7867)
- Fault-tolerant training
  - Added `{,load_}state_dict` to `ResultCollection` (#7948)
  - Added `{,load_}state_dict` to `Loops` (#8197)
  - Set `Loop.restarting=False` at the end of the first iteration (#8362)
  - Save the loops state with the checkpoint (opt-in) (#8362)
  - Save a checkpoint to restore the state on exception (opt-in) (#8362)
  - Added `state_dict` and `load_state_dict` utilities for `CombinedLoader` + utilities for dataloader (#8364)
- Added `rank_zero_only` to the `LightningModule.log` function (#7966)
- Added `metric_attribute` to the `LightningModule.log` function (#7966)
- Added a warning if `Trainer(log_every_n_steps)` is a value too high for the training dataloader (#7734)
- Added LightningCLI support for argument links applied on instantiation (#7895)
- Added LightningCLI support for configurable callbacks that should always be present (#7964)
- Added DeepSpeed Infinity support, and updated to DeepSpeed 0.4.0 (#7234)
- Added support for `torch.nn.UninitializedParameter` in `ModelSummary` (#7642)
- Added support for `LightningModule.save_hyperparameters` when `LightningModule` is a dataclass (#7992)
- Added support for overriding `optimizer_zero_grad` and `optimizer_step` when using `accumulate_grad_batches` (#7980)
- Added `logger` boolean flag to `save_hyperparameters` (#7960)
- Added support for calling scripts using the module syntax (`python -m package.script`) (#8073)
- Added support for optimizers and learning rate schedulers to `LightningCLI` (#8093)
- Added XLA Profiler (#8014)
- Added `PrecisionPlugin.{pre,post}_backward` (#8328)
- Added `on_load_checkpoint` and `on_save_checkpoint` hooks to the `PrecisionPlugin` base class (#7831)
- Added `max_depth` parameter in `ModelSummary` (#8062)
- Added `XLAStatsMonitor` callback (#8235)
- Added `restore` function and `restarting` attribute to base `Loop` (#8247)
- Added `FastForwardSampler` and `CaptureIterableDataset` (#8307)
- Added support for `save_hyperparameters` in `LightningDataModule` (#3792)
- Added `ModelCheckpoint(save_on_train_epoch_end)` to choose when to run the saving logic (#8389)
- Added `LSFEnvironment` for distributed training with the LSF resource manager `jsrun` (#5102)
- Added support for `accelerator='cpu'|'gpu'|'tpu'|'ipu'|'auto'` (#7808); see the usage sketch after this list
- Added `tpu_spawn_debug` to plugin registry (#7933)
- Enabled traditional/manual launching of DDP processes through `LOCAL_RANK` and `NODE_RANK` environment variable assignments (#7480)
- Added `quantize_on_fit_end` argument to `QuantizationAwareTraining` (#8464)
- Added experimental support for loop specialization (#8226)
- Added support for `devices` flag to Trainer (#8440)
- Added private `prevent_trainer_and_dataloaders_deepcopy` context manager on the `LightningModule` (#8472)
- Added support for providing callables to the Lightning CLI instead of types (#8400)
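As a quick illustration of two additions above, the new `accelerator=`/`devices=` Trainer flags and the `rank_zero_only` argument to `LightningModule.log`, here is a hedged sketch. The model is a placeholder, not code shipped with the release, and the exact interplay of `accelerator='auto'` with `devices` may have evolved in later versions.

```python
# Illustrative sketch of the new accelerator/devices flags and rank-zero logging.
import torch
from torch import nn
import pytorch_lightning as pl


class LitModel(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = nn.Linear(32, 2)

    def training_step(self, batch, batch_idx):
        x, y = batch
        loss = nn.functional.cross_entropy(self.layer(x), y)
        # New in 1.4: restrict this log call to rank 0.
        self.log("train_loss", loss, rank_zero_only=True)
        return loss

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=0.1)


# New in 1.4: pick the device type via `accelerator` ('auto' detects the
# available hardware) and the number of devices via `devices`.
trainer = pl.Trainer(accelerator="auto", devices=1, max_epochs=1)
# trainer.fit(LitModel(), train_dataloaders=...)  # fit as usual
```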
### Changed
- Decoupled device parsing logic from Accelerator connector to Trainer (#8180)
- Changed the `Trainer`'s `checkpoint_callback` argument to allow only boolean values (#7539)
- Log epoch metrics before the `on_evaluation_end` hook (#7272)
- Explicitly disallow calling `self.log(on_epoch=False)` during epoch-only or single-call hooks (#7874)
- Changed these `Trainer` methods to be protected: `call_setup_hook`, `call_configure_sharded_model`, `pre_dispatch`, `dispatch`, `post_dispatch`, `call_teardown_hook`, `run_train`, `run_sanity_check`, `run_evaluate`, `run_evaluation`, `run_predict`, `track_output_for_epoch_end`
- Changed `metrics_to_scalars` to work with any collection or value (#7888)
- Changed `clip_grad_norm` to use `torch.nn.utils.clip_grad_norm_` (#7025)
- Validation is now always run inside the training epoch scope (#7357)
- `ModelCheckpoint` now runs at the end of the training epoch by default (#8389)
- `EarlyStopping` now runs at the end of the training epoch by default (#8286)
- Refactored Loops
  - Moved attributes `global_step`, `current_epoch`, `max/min_steps`, `max/min_epochs`, `batch_idx`, and `total_batch_idx` to TrainLoop (#7437)
  - Refactored result handling in training loop (#7506)
  - Moved attributes `hiddens` and `split_idx` to TrainLoop (#7507)
  - Refactored the logic around manual and automatic optimization inside the optimizer loop (#7526)
  - Simplified "should run validation" logic (#7682)
  - Simplified logic for updating the learning rate for schedulers (#7682)
  - Removed the `on_epoch` guard from the "should stop" validation check (#7701)
  - Refactored internal loop interface; added new classes `FitLoop`, `TrainingEpochLoop`, `TrainingBatchLoop` (#7871, #8077)
  - Removed `pytorch_lightning/trainer/training_loop.py` (#7985)
  - Refactored evaluation loop interface; added new classes `DataLoaderLoop`, `EvaluationLoop`, `EvaluationEpochLoop` (#7990, #8077)
  - Removed `pytorch_lightning/trainer/evaluation_loop.py` (#8056)
  - Restricted public access to several internal functions (#8024)
  - Refactored trainer `_run_*` functions and separate evaluation loops (#8065)
  - Refactored prediction loop interface; added new classes `PredictionLoop`, `PredictionEpochLoop` (#7700, #8077)
  - Removed `pytorch_lightning/trainer/predict_loop.py` (#8094)
  - Moved result teardown to the loops (#8245)
  - Improved the `Loop` API to better handle children `state_dict` and `progress` (#8334)
- Refactored logging
  - Renamed and moved `core/step_result.py` to `trainer/connectors/logger_connector/result.py` (#7736)
  - Dramatically simplified the `LoggerConnector` (#7882)
  - `trainer.{logged,progress_bar,callback}_metrics` are now updated on-demand (#7882)
  - Completely overhauled the `Result` object in favor of `ResultMetric` (#7882)
  - Improved epoch-level reduction time and overall memory usage (#7882)
  - Allow passing `self.log(batch_size=...)` (#7891)
  - Each of the training loops now keeps its own results collection (#7891)
  - Removed `EpochResultStore` and `HookResultStore` in favor of `ResultCollection` (#7909)
  - Removed `MetricsHolder` (#7909)
- Moved `ignore_scalar_return_in_dp` warning suppression to the DataParallelPlugin class (#7421)
- Changed the behaviour when logging evaluation step metrics to no longer append `/epoch_*` to the metric name (#7351)
- Raised `ValueError` when a `None` value is `self.log`-ed (#7771)
- Changed `resolve_training_type_plugins` to allow setting `num_nodes` and `sync_batchnorm` from the `Trainer` setting (#7026)
- Default `seed_everything(workers=True)` in the `LightningCLI` (#7504)
- Changed `model.state_dict()` in `CheckpointConnector` to allow `training_type_plugin` to customize the model's `state_dict()` (#7474)
- `MLflowLogger` now uses the env variable `MLFLOW_TRACKING_URI` as the default tracking URI (#7457)
- Changed `Trainer` arg and functionality from `reload_dataloaders_every_epoch` to `reload_dataloaders_every_n_epochs` (#5043)
- Changed `WandbLogger(log_model={True/'all'})` to log models as artifacts (#6231)
- `MLFlowLogger` now accepts `run_name` as a constructor argument (#7622)
- Changed `teardown()` in `Accelerator` to allow `training_type_plugin` to customize `teardown` logic (#7579)
- `Trainer.fit` now raises an error when using manual optimization with unsupported features such as `gradient_clip_val` or `accumulate_grad_batches` (#7788)
- Accelerator hooks are called regardless of whether `LightningModule` overrides the same hooks (#7826)
- Moved profilers to their own file (#7822)
- The `on_after_backward` hook is now called on accumulating iterations. Use the `on_before_optimizer_step` hook to mimic the old behaviour (#8328); see the sketch after this list
- The mixed precision loss is no longer unscaled before the `on_after_backward` hook. Use the `on_before_optimizer_step` hook to mimic the old behaviour (#8328)
- The `TrainingTypePlugin.{pre,post}_backward` hooks no longer take the `optimizer, opt_idx, should_accumulate` arguments (#8328)
- The `PrecisionPlugin.backward` hook no longer returns a value (#8328)
- The `PrecisionPlugin.backward` hook no longer takes a `should_accumulate` argument (#8328)
- Added the `on_before_backward` hook (#7865)
- `LightningCLI` now aborts with a clearer message if the config already exists, and disables saving the config during `fast_dev_run` (#7963)
- Saved the `LightningCLI` config on `setup` and only on the main process (#8017)
- Dropped the `LightningCLI` `ArgumentParser` when pickling (#8017)
- Skip `broadcast` if distributed is not initialized for the spawn plugins (#8017)
- `Trainer(resume_from_checkpoint=...)` now restores the model directly after `LightningModule.setup()`, which is before `LightningModule.configure_sharded_model()` (#7652)
- Moved `torch.cuda.set_device()` to enable collective calls earlier in setup (#8312)
- Used XLA utility API to move data to CPU (single TPU core) (#8078)
- Improved error messages in `replace_sampler` when the `DataLoader` attributes are not included in the signature or the signature is missing optional arguments (#8519)
- Moved the `DeviceDtypeModuleMixin` and `HyperparametersMixin` mixins to `core` (#8396)
- Return the `default_root_dir` as the `log_dir` when the logger is a `LoggerCollection` (#8187)
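To make the `on_after_backward` timing change concrete, here is a minimal, hedged sketch of moving gradient inspection into the new `on_before_optimizer_step` hook; the module and the norm computation are illustrative only, not code from the release.

```python
# Illustrative sketch of migrating from `on_after_backward` to `on_before_optimizer_step`.
import torch
import pytorch_lightning as pl


class GradInspectionModule(pl.LightningModule):
    def on_after_backward(self):
        # Since 1.4 this hook also fires on gradient-accumulation iterations,
        # and with mixed precision the loss is no longer unscaled beforehand.
        pass

    def on_before_optimizer_step(self, optimizer, optimizer_idx):
        # New hook with the old `on_after_backward` timing: gradients are fully
        # accumulated (and unscaled under AMP), so inspect them here.
        norms = [p.grad.norm() for p in self.parameters() if p.grad is not None]
        if norms:
            self.print("total grad norm:", torch.stack(norms).norm().item())
```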
### Deprecated
- Deprecated `LightningModule.loaded_optimizer_states_dict` (#8229)
- Standardized the dataloaders arguments of `trainer.{fit,validate,test,tune}` (#7431)
- Deprecated `DataModule` properties: `has_prepared_data`, `has_setup_fit`, `has_setup_validate`, `has_setup_test`, `has_setup_predict`, `has_teardown_fit`, `has_teardown_validate`, `has_teardown_test`, `has_teardown_predict` (#7657)
- Deprecated `TrainerModelHooksMixin` in favor of `pytorch_lightning.utilities.signature_utils` (#7422)
- Deprecated `num_nodes` and `sync_batchnorm` arguments in `DDPPlugin` and `DDPSpawnPlugin` (#7026)
- Deprecated `self.log(sync_dist_op)` in favor of `self.log(reduce_fx)` (#7891)
- Deprecated `is_overridden(model=...)` in favor of `is_overridden(instance=...)` (#7918)
- Deprecated automatically detaching returned extras with grads (#7994)
- Deprecated the default value of the `monitor` argument in the EarlyStopping callback to enforce `monitor` as a required argument (#7907)
- Deprecated importing `rank_zero_{warn,deprecation}` directly from `pytorch_lightning.utilities.distributed` (#8085)
- Deprecated the use of `CheckpointConnector.hpc_load()` in favor of `CheckpointConnector.restore()` (#7652)
- Deprecated `ModelCheckpoint(every_n_val_epochs)` in favor of `ModelCheckpoint(every_n_epochs)` (#8383); see the migration sketch after this list
- Deprecated `DDPPlugin.task_idx` in favor of `DDPPlugin.local_rank` (#8203)
- Deprecated the `Trainer.train_loop` property in favor of `Trainer.fit_loop` (#8025)
- Deprecated the `Trainer.disable_validation` property in favor of `not Trainer.enable_validation` (#8291)
- Deprecated the `mode` parameter in `ModelSummary` in favor of `max_depth` (#8062)
- Deprecated the `reload_dataloaders_every_epoch` argument of `Trainer` in favor of `reload_dataloaders_every_n_epochs` (#5043)
- Deprecated the `distributed_backend` argument for `Trainer` (#8575)
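For reference, a small before/after sketch of two of the renames above; the argument values are placeholders, not recommendations from the release.

```python
# Illustrative migration sketch (values are placeholders).
import pytorch_lightning as pl
from pytorch_lightning.callbacks import ModelCheckpoint

# Deprecated spellings:
#   ModelCheckpoint(monitor="val_loss", every_n_val_epochs=2)
#   pl.Trainer(reload_dataloaders_every_epoch=True)

# Preferred from 1.4 on:
checkpoint = ModelCheckpoint(monitor="val_loss", every_n_epochs=2)
trainer = pl.Trainer(callbacks=[checkpoint], reload_dataloaders_every_n_epochs=1)
```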
### Removed
- Dropped official support/testing for PyTorch <1.6 (#8288)
- Removed `ProfilerConnector` (#7654)
- Pruned deprecated classification metrics from `pytorch_lightning.metrics.functional.classification` (#7499)
- Removed deprecated data parallel classes `LightningDataParallel` and `LightningDistributedDataParallel` from `pytorch_lightning.overrides.data_parallel` (#7510)
- Removed deprecated trainer attributes `get_model` and `accelerator_backend` (#7502)
- Removed support for automatically monitoring the `val_loss` key with `ModelCheckpoint`. Pass your `monitor` of choice to the `ModelCheckpoint` instance instead (#8293)
- Removed support for `self.log(tbptt_reduce_fx)` and `self.log(tbptt_pad_token)`. Please open a discussion explaining your use case if you relied on these (#7644)
- Removed deprecated utils modules `model_utils`, `warning_utils`, `xla_device_utils`, and partially `argparse_utils` (#7503)
- Removed `RPCPlugin` and `RPCSequentialPlugin`. If you were successfully using these plugins, please open a GitHub discussion about your use case (#8101)
- Removed deprecated trainer attributes `on_cpu`, `on_tpu`, `use_tpu`, `on_gpu`, `use_dp`, `use_ddp`, `use_ddp2`, `use_horovod`, `use_single_gpu` (#7501)
- Removed the deprecated `optimizer` argument in `LightningModule.manual_backward()`; toggling optimizers in manual optimization should be done using `LightningModule.{un}toggle_optimizer()` (#8287); see the sketch after this list
- Removed the DeepSpeed FP16 exception, as FP32 is now supported (#8462)
- Removed environment variable `PL_EXP_VERSION` from DDP subprocesses (#7403)
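Since `manual_backward()` no longer accepts an `optimizer` argument, here is a hedged sketch of the replacement pattern with explicit optimizer toggling, following the 1.4-era manual-optimization API; the module is illustrative, not code from the release.

```python
# Illustrative sketch: manual optimization without the removed `optimizer` argument.
import torch
import pytorch_lightning as pl


class ManualOptModule(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.automatic_optimization = False  # opt into manual optimization
        self.layer = torch.nn.Linear(32, 2)

    def training_step(self, batch, batch_idx):
        opt = self.optimizers()
        # Toggling mainly matters with multiple optimizers; shown here for the API.
        self.toggle_optimizer(opt, optimizer_idx=0)
        loss = self.layer(batch).sum()  # assume `batch` is a feature tensor
        opt.zero_grad()
        self.manual_backward(loss)  # no `optimizer=` argument anymore
        opt.step()
        self.untoggle_optimizer(optimizer_idx=0)

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=0.1)
```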
### Fixed
- Fixed the `GPUStatsMonitor` callbacks to use the correct GPU IDs if `CUDA_VISIBLE_DEVICES` is set (#8260)
- Fixed `lr_scheduler` checkpointed state by calling `update_lr_schedulers` before saving checkpoints (#7877)
- Fixed ambiguous warning when both overfit and train dataloader shuffling are enabled (#7685)
- Fixed dev debugger memory growing due to tracking events even when disabled (#7875)
- Fixed `None` loss keys getting added in `training_epoch_end` when using manual optimization and not returning a loss (#7772)
- Fixed a bug where `precision=64` with `accelerator='ddp_spawn'` would throw a pickle error (#6924)
- Do not override the existing `epoch` value in `logged_metrics` when already logged by the user (#7982)
- Support for manual optimization with DeepSpeed (#7970)
- Fixed `dataloader_idx` argument value when predicting with only one `DataLoader` (#7941)
- Fixed passing the `stage` argument of `Callback.{setup,teardown}` as a keyword (#7973)
- Fixed metrics generated during validation sanity checking so that they are cleaned up at the end (#8171)
- Fixed `log_gpu_memory` metrics not being added to `logging` when nothing else is logged (#8174)
- Fixed a bug where calling `log` with a `Metric` instance would raise an error if it was a nested attribute of the model (#8181)
- Fixed a bug where using `precision=64` would cause buffers with complex dtype to be cast to real (#8208)
- Fixed `is_overridden` returning true for wrapped functions with no changes (#8296)
- Fixed a bug where `truncated_bptt_steps` would throw an AttributeError when the target RNN has multiple hidden states (#8145)
- Fixed `self.optimizers()` not returning a single optimizer if it had been wrapped (#8326)
- Fixed the `on_after_backward` hook not getting called when using manual optimization and no plugins (#8328)
- Fixed the `LightningModule.backward` hook only getting called with the `apex` plugin when using manual optimization (#8328)
- Fixed moving batch to device before sending it to the `on_*_batch_start`/`on_*_batch_end` callbacks and model hooks (#7378)
- Fixed passing a custom `DDPPlugin` when choosing `accelerator="ddp_cpu"` for the accelerator (#6208)
- Fixed missing call to `LightningModule.untoggle_optimizer` in training loop when running gradient accumulation with multiple optimizers (#8284)
- Fixed hash of LightningEnum to work with value instead of name (#8421)
- Fixed a bug where an extra checkpoint was saved at the end of training if the `val_check_interval` did not align with the number of training batches (#7724)
- Fixed `move_data_to_device` to return the batch if the object's `to` function didn't return `self` (#8433)
- Fixed progress bar updates for Pod Training (#8258)
- Fixed clearing dataloader references before attaching new dataloaders in consecutive `Trainer.{fit,validate,test,predict}` runs (#8442)
- Fixed memory leaks on GPU by moving `optimizer_states`, `ResultCollection.extra`, `ResultMetric` attributes, and `LoggerConnector` metrics to `cpu`. Also, delete the DDP wrapper on `teardown` (#8490)
- Fixed the `SWA` callback using the LightningModule `prevent_trainer_and_dataloaders_deepcopy` context manager to avoid OOM (#8472)
- Fixed the `ModelPruning` callback `on_save_checkpoint` to avoid making a `deepcopy` potentially leading to OOM (#8472)
- Fixed the sampler replacement logic for `DataLoader`s which do not define all `DataLoader` attributes as `__init__` parameters (#8519)
- Fixed DeepSpeed Windows support (#8488)
- Fixed DeepSpeed not properly setting the trainer `lr_schedulers` attribute (#8527)
- Fixed experiment version and log-dir divergence in DDP when using multiple `Trainer` instances in sequence (#7403)
- Enabled manual optimization for TPUs (#8458)
- Fixed `accumulate_grad_batches` not being recomputed during model reload (#5334)
- Fixed a `TypeError` when wrapping optimizers in the `HorovodPlugin` and running `Trainer.test` (#7840)
- Fixed `BackboneFinetuning` restoration (#8501)
- Fixed `lr_scheduler` with metric (e.g. `torch.optim.lr_scheduler.ReduceLROnPlateau`) when using `automatic_optimization = False` (#7643)
- Fixed `DeepSpeed` breaking with no schedulers (#8580)
### Contributors
@00sapo @AffineParameter @ajtritt @akihironitta @ananthsub @aniketmaurya @aslisabanci @awaelchli @bamblebam @Borda @borisdayma @carmocca @dalek-who @DavidMChan @davors72 @dcfidalgo @ddrevicky @deepsource-autofix @djthegr8 @edenlightning @edgarriba @eladsegal @ethanwharris @eugeneh101 @fepegar @gaoteng-git @gtauzin @i-aki-y @janhenriklambrechts @jiwidi @justusschock @karthikrangasai @kaushikb11 @loic-beheshti @Lucklyric @ManuelPalermo @mauvilsa @maxoppelt @neggert @nikvaessen @nisheethlahoti @pre-commit-ci @rohitgr7 @ruotianluo @satishjasthi @SeanNaren @shirayu @shuyingsunshine21 @sid-sundrani @Sileadim @simran2905 @stancld @t-vi @tchaton @theblackfly @theodumont @tilman151 @tomy0000000 @tshu-w @vatch123 @WrRan @yifuwang
If we forgot someone, let us know :]