Release TPU Pod Training, IPU Accelerator, DeepSpeed Infinity, Fully Sharded Data Parallel · Lightning-AI/pytorch-lightning

Today we are excited to announce Lightning 1.4, introducing support for TPU pods, XLA profiling, IPUs, and new plugins to reach 10+ billion parameters, including DeepSpeed Infinity, Fully Sharded Data Parallel, and more!

https://devblog.pytorchlightning.ai/announcing-lightning-1-4-8cd20482aee9

[1.4.0] - 2021-07-27

Added

Added extract_batch_size utility and corresponding tests to extract batch dimension from multiple batch types (#8357)
Added support for named parameter groups in LearningRateMonitor (#7987)
Added dataclass support for pytorch_lightning.utilities.apply_to_collection (#7935)
Added support to LightningModule.to_torchscript for saving to custom filesystems with fsspec (#7617)
Added KubeflowEnvironment for use with the PyTorchJob operator in Kubeflow
Added LightningCLI support for config files on object stores (#7521)
Added ModelPruning(prune_on_train_epoch_end=True|False) to choose when to apply pruning (#7704)
Added support for checkpointing based on a provided time interval during training (#7515)
Progress tracking
- Added dataclasses for progress tracking (#6603, #7574, #8140, #8362)
- Add {,load_}state_dict to the progress tracking dataclasses (#8140)
- Connect the progress tracking dataclasses to the loops (#8244, #8362)
- Do not reset the progress tracking dataclasses total counters (#8475)
Added support for passing a LightningDataModule positionally as the second argument to trainer.{validate,test,predict} (#7431)
Added argument trainer.predict(ckpt_path) (#7430)
Added clip_grad_by_value support for TPUs (#7025)
Added support for passing any class to is_overridden (#7918)
Added sub_dir parameter to TensorBoardLogger (#6195)
Added correct dataloader_idx to batch transfer hooks (#6241)
Added include_none=bool argument to apply_to_collection (#7769)
Added apply_to_collections to apply a function to two zipped collections (#7769)
Added ddp_fully_sharded support (#7487)
Added should_rank_save_checkpoint property to Training Plugins (#7684)
Added log_grad_norm hook to LightningModule to customize the logging of gradient norms (#7873)
Added save_config_filename init argument to LightningCLI to ease resolving name conflicts (#7741)
Added save_config_overwrite init argument to LightningCLI to ease overwriting existing config files (#8059)
Added reset dataloader hooks to Training Plugins and Accelerators (#7861)
Added trainer stage hooks for Training Plugins and Accelerators (#7864)
Added the on_before_optimizer_step hook (#8048)
Added IPU Accelerator (#7867)
Fault-tolerant training
- Added {,load_}state_dict to ResultCollection (#7948)
- Added {,load_}state_dict to Loops (#8197)
- Set Loop.restarting=False at the end of the first iteration (#8362)
- Save the loops state with the checkpoint (opt-in) (#8362)
- Save a checkpoint to restore the state on exception (opt-in) (#8362)
- Added state_dict and load_state_dict utilities for CombinedLoader + utilities for dataloader (#8364)
Added rank_zero_only to LightningModule.log function (#7966)
Added metric_attribute to LightningModule.log function (#7966)
Added a warning if Trainer(log_every_n_steps) is a value too high for the training dataloader (#7734)
Added LightningCLI support for argument links applied on instantiation (#7895)
Added LightningCLI support for configurable callbacks that should always be present (#7964)
Added DeepSpeed Infinity Support, and updated to DeepSpeed 0.4.0 (#7234)
Added support for torch.nn.UninitializedParameter in ModelSummary (#7642)
Added support for LightningModule.save_hyperparameters when LightningModule is a dataclass (#7992)
Added support for overriding optimizer_zero_grad and optimizer_step when using accumulate_grad_batches (#7980)
Added logger boolean flag to save_hyperparameters (#7960)
Added support for calling scripts using the module syntax (python -m package.script) (#8073)
Added support for optimizers and learning rate schedulers to LightningCLI (#8093)
Added XLA Profiler (#8014)
Added PrecisionPlugin.{pre,post}_backward (#8328)
Added on_load_checkpoint and on_save_checkpoint hooks to the PrecisionPlugin base class (#7831)
Added max_depth parameter in ModelSummary (#8062)
Added XLAStatsMonitor callback (#8235)
Added restore function and restarting attribute to base Loop (#8247)
Added FastForwardSampler and CaptureIterableDataset (#8307)
Added support for save_hyperparameters in LightningDataModule (#3792)
Added the ModelCheckpoint(save_on_train_epoch_end) to choose when to run the saving logic (#8389)
Added LSFEnvironment for distributed training with the LSF resource manager jsrun (#5102)
Added support for accelerator='cpu'|'gpu'|'tpu'|'ipu'|'auto' (#7808)
Added tpu_spawn_debug to plugin registry (#7933)
Enabled traditional/manual launching of DDP processes through LOCAL_RANK and NODE_RANK environment variable assignments (#7480)
Added quantize_on_fit_end argument to QuantizationAwareTraining (#8464)
Added experimental support for loop specialization (#8226)
Added support for devices flag to Trainer (#8440)
Added private prevent_trainer_and_dataloaders_deepcopy context manager on the LightningModule (#8472)
Added support for providing callables to the Lightning CLI instead of types (#8400)
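
The progress tracking entries in this list describe dataclasses that expose state_dict and load_state_dict and whose total counters are not reset. The sketch below illustrates that idea in plain Python. ProgressSketch and its field names are hypothetical stand-ins chosen for illustration; they are not Lightning's actual classes or attributes.

```python
from dataclasses import asdict, dataclass


@dataclass
class ProgressSketch:
    """Hypothetical stand-in; not Lightning's real progress class."""

    total_completed: int = 0    # accumulates over the whole run
    current_completed: int = 0  # counts within the current epoch

    def increment(self) -> None:
        self.total_completed += 1
        self.current_completed += 1

    def reset_on_epoch(self) -> None:
        # Only the "current" counter is cleared; the total keeps growing.
        self.current_completed = 0

    def state_dict(self) -> dict:
        return asdict(self)

    def load_state_dict(self, state: dict) -> None:
        self.total_completed = state["total_completed"]
        self.current_completed = state["current_completed"]


progress = ProgressSketch()
for _ in range(3):
    progress.increment()
progress.reset_on_epoch()
progress.increment()

# Round-trip the counters as a checkpoint would.
saved = progress.state_dict()
restored = ProgressSketch()
restored.load_state_dict(saved)
```

Keeping the total separate from the per-epoch counter is what allows a restored run to know how far training had progressed overall.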

Changed

Decoupled device parsing logic from Accelerator connector to Trainer (#8180)
Changed the Trainer's checkpoint_callback argument to allow only boolean values (#7539)
Log epoch metrics before the on_evaluation_end hook (#7272)
Explicitly disallow calling self.log(on_epoch=False) during epoch-only or single-call hooks (#7874)
Changed these Trainer methods to be protected: call_setup_hook, call_configure_sharded_model, pre_dispatch, dispatch, post_dispatch, call_teardown_hook, run_train, run_sanity_check, run_evaluate, run_evaluation, run_predict, track_output_for_epoch_end
Changed metrics_to_scalars to work with any collection or value (#7888)
Changed clip_grad_norm to use torch.nn.utils.clip_grad_norm_ (#7025)
Validation is now always run inside the training epoch scope (#7357)
ModelCheckpoint now runs at the end of the training epoch by default (#8389)
EarlyStopping now runs at the end of the training epoch by default (#8286)
Refactored Loops
- Moved attributes global_step, current_epoch, max/min_steps, max/min_epochs, batch_idx, and total_batch_idx to TrainLoop (#7437)
- Refactored result handling in training loop (#7506)
- Moved attributes hiddens and split_idx to TrainLoop (#7507)
- Refactored the logic around manual and automatic optimization inside the optimizer loop (#7526)
- Simplified "should run validation" logic (#7682)
- Simplified logic for updating the learning rate for schedulers (#7682)
- Removed the on_epoch guard from the "should stop" validation check (#7701)
- Refactored internal loop interface; added new classes FitLoop, TrainingEpochLoop, TrainingBatchLoop (#7871, #8077)
- Removed pytorch_lightning/trainer/training_loop.py (#7985)
- Refactored evaluation loop interface; added new classes DataLoaderLoop, EvaluationLoop, EvaluationEpochLoop (#7990, #8077)
- Removed pytorch_lightning/trainer/evaluation_loop.py (#8056)
- Restricted public access to several internal functions (#8024)
- Refactored trainer _run_* functions and separate evaluation loops (#8065)
- Refactored prediction loop interface; added new classes PredictionLoop, PredictionEpochLoop (#7700, #8077)
- Removed pytorch_lightning/trainer/predict_loop.py (#8094)
- Moved result teardown to the loops (#8245)
- Improve Loop API to better handle children state_dict and progress (#8334)
Refactored logging
- Renamed and moved core/step_result.py to trainer/connectors/logger_connector/result.py (#7736)
- Dramatically simplify the LoggerConnector (#7882)
- trainer.{logged,progress_bar,callback}_metrics are now updated on-demand (#7882)
- Completely overhaul the Result object in favor of ResultMetric (#7882)
- Improve epoch-level reduction time and overall memory usage (#7882)
- Allow passing self.log(batch_size=...) (#7891)
- Each of the training loops now keeps its own results collection (#7891)
- Remove EpochResultStore and HookResultStore in favor of ResultCollection (#7909)
- Remove MetricsHolder (#7909)
Moved ignore_scalar_return_in_dp warning suppression to the DataParallelPlugin class (#7421)
Changed the behaviour when logging evaluation step metrics to no longer append /epoch_* to the metric name (#7351)
Raised ValueError when a None value is self.log-ed (#7771)
Changed resolve_training_type_plugins to allow setting num_nodes and sync_batchnorm from Trainer setting (#7026)
Default seed_everything(workers=True) in the LightningCLI (#7504)
Changed model.state_dict() in CheckpointConnector to allow training_type_plugin to customize the model's state_dict() (#7474)
MLFlowLogger now uses the env variable MLFLOW_TRACKING_URI as default tracking URI (#7457)
Changed Trainer arg and functionality from reload_dataloaders_every_epoch to reload_dataloaders_every_n_epochs (#5043)
Changed WandbLogger(log_model={True/'all'}) to log models as artifacts (#6231)
MLFlowLogger now accepts run_name as a constructor argument (#7622)
Changed teardown() in Accelerator to allow training_type_plugin to customize teardown logic (#7579)
Trainer.fit now raises an error when using manual optimization with unsupported features such as gradient_clip_val or accumulate_grad_batches (#7788)
Accelerator hooks are called regardless of whether LightningModule overrides the same hooks (#7826)
Moved profilers to their own file (#7822)
The on_after_backward hook is now called on accumulating iterations. Use the on_before_optimizer_step hook to mimic the old behaviour (#8328)
The mixed precision loss is no longer unscaled before the on_after_backward hook. Use the on_before_optimizer_step hook to mimic the old behaviour (#8328)
The TrainingTypePlugin.{pre,post}_backward hooks no longer take the optimizer, opt_idx, should_accumulate arguments (#8328)
The PrecisionPlugin.backward hook no longer returns a value (#8328)
The PrecisionPlugin.backward hook no longer takes a should_accumulate argument (#8328)
Added the on_before_backward hook (#7865)
LightningCLI now aborts with a clearer message if config already exists and disables save config during fast_dev_run (#7963)
Saved the LightningCLI config on setup and only on the main process (#8017)
Dropped the LightningCLI ArgumentParser when pickling (#8017)
Skip broadcast if distributed not initialized for the spawn plugins (#8017)
Trainer(resume_from_checkpoint=...) now restores the model directly after LightningModule.setup(), which is before LightningModule.configure_sharded_model() (#7652)
Moved torch.cuda.set_device() to enable collective calls earlier in setup (#8312)
Used XLA utility API to move data to CPU (Single TPU core) (#8078)
Improved error messages in replace_sampler when the DataLoader attributes are not included in the signature or the signature is missing optional arguments (#8519)
Moved the DeviceDtypeModuleMixin and HyperparametersMixin mixins to core (#8396)
Return the default_root_dir as the log_dir when the logger is a LoggerCollection (#8187)
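
The entries for #8328 in this list state that on_after_backward now fires on accumulating iterations, and that on_before_optimizer_step reproduces the old behaviour. The following plain-Python simulation sketches the resulting call order. It is not Lightning's training loop: simulate_hooks is a hypothetical helper, and it assumes the optimizer steps every accumulate_grad_batches batches and on the final batch.

```python
def simulate_hooks(num_batches: int, accumulate_grad_batches: int) -> list:
    """Record which hook would fire for each batch (illustrative only)."""
    calls = []
    for batch_idx in range(num_batches):
        # Called after every backward pass, including accumulating ones.
        calls.append(f"on_after_backward[{batch_idx}]")

        is_last = batch_idx == num_batches - 1
        should_step = (batch_idx + 1) % accumulate_grad_batches == 0 or is_last
        if should_step:
            # Called only when the optimizer is about to step.
            calls.append(f"on_before_optimizer_step[{batch_idx}]")
    return calls


calls = simulate_hooks(num_batches=4, accumulate_grad_batches=2)
for name in calls:
    print(name)
```

Under these assumptions, logic that should run once per optimizer step, such as inspecting fully accumulated gradients, belongs in on_before_optimizer_step rather than on_after_backward.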

Deprecated

Deprecated LightningModule.loaded_optimizer_states_dict (#8229)
Standardized the dataloaders arguments of trainer.{fit,validate,test,tune} (#7431)
Deprecated DataModule properties: has_prepared_data, has_setup_fit, has_setup_validate, has_setup_test, has_setup_predict, has_teardown_fit, has_teardown_validate, has_teardown_test, has_teardown_predict (#7657)
Deprecated TrainerModelHooksMixin in favor of pytorch_lightning.utilities.signature_utils (#7422)
Deprecated num_nodes and sync_batchnorm arguments in DDPPlugin and DDPSpawnPlugin (#7026)
Deprecated self.log(sync_dist_op) in favor of self.log(reduce_fx) (#7891)
Deprecated is_overridden(model=...) in favor of is_overridden(instance=...) (#7918)
Deprecated automatically detaching returned extras with grads (#7994)
Deprecated default value of monitor argument in EarlyStopping callback to enforce monitor as a required argument (#7907)
Deprecated importing rank_zero_{warn,deprecation} directly from pytorch_lightning.utilities.distributed (#8085)
Deprecated the use of CheckpointConnector.hpc_load() in favor of CheckpointConnector.restore() (#7652)
Deprecated ModelCheckpoint(every_n_val_epochs) in favor of ModelCheckpoint(every_n_epochs) (#8383)
Deprecated DDPPlugin.task_idx in favor of DDPPlugin.local_rank (#8203)
Deprecated the Trainer.train_loop property in favor of Trainer.fit_loop (#8025)
Deprecated the Trainer.disable_validation property in favor of not Trainer.enable_validation (#8291)
Deprecated mode parameter in ModelSummary in favor of max_depth (#8062)
Deprecated reload_dataloaders_every_epoch argument of Trainer in favor of reload_dataloaders_every_n_epochs (#5043)
Deprecated distributed_backend argument for Trainer (#8575)
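
The deprecation of reload_dataloaders_every_epoch in favor of reload_dataloaders_every_n_epochs (#5043) replaces a boolean with an integer interval. The sketch below shows one way to reason about the migration. Both helper functions are hypothetical, and the mapping (True becomes 1, reload when the epoch index is a multiple of the interval) is an assumption about the intended semantics rather than code taken from Lightning.

```python
def legacy_flag_to_n_epochs(reload_dataloaders_every_epoch: bool) -> int:
    """Assumed mapping from the deprecated boolean to the new integer."""
    return int(reload_dataloaders_every_epoch)


def should_reload(current_epoch: int, every_n_epochs: int) -> bool:
    """Assumed rule: reload when the epoch index is a multiple of the interval."""
    return every_n_epochs > 0 and current_epoch % every_n_epochs == 0


# With an interval of 2, dataloaders would be reloaded on every other epoch.
reload_epochs = [epoch for epoch in range(6) if should_reload(epoch, 2)]
print(reload_epochs)
```

An interval of 0 disables reloading in this sketch, which corresponds to the old flag being False.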

Removed

Dropped official support/testing for PyTorch <1.6 (#8288)
Removed ProfilerConnector (#7654)
Pruned deprecated classification metrics from pytorch_lightning.metrics.functional.classification (#7499)
Removed deprecated data parallel classes LightningDataParallel and LightningDistributedDataParallel from pytorch_lightning.overrides.data_parallel (#7510)
Removed deprecated trainer attributes - get_model and accelerator_backend (#7502)
Removed support for automatically monitoring the val_loss key with ModelCheckpoint. Pass your monitor of choice to the ModelCheckpoint instance instead (#8293)
Removed support for self.log(tbptt_reduce_fx) and self.log(tbptt_pad_token). Please, open a discussion explaining your use-case if you relied on these. (#7644)
Removed deprecated utils modules model_utils, warning_utils, xla_device_utils and partially argparse_utils (#7503)
Removed RPCPlugin and RPCSequentialPlugin. If you were successfully using these plugins, please open a GitHub discussion about your use case (#8101)
Removed deprecated trainer attributes - on_cpu, on_tpu, use_tpu, on_gpu, use_dp, use_ddp, use_ddp2, use_horovod, use_single_gpu (#7501)
Removed deprecated optimizer argument in LightningModule.manual_backward(); Toggling optimizers in manual optimization should be done using LightningModule.{un}toggle_optimizer() (#8287)
Removed DeepSpeed FP16 Exception as FP32 is now supported (#8462)
Removed environment variable PL_EXP_VERSION from DDP subprocesses (#7403)

Fixed

Fixed the GPUStatsMonitor callbacks to use the correct GPU IDs if CUDA_VISIBLE_DEVICES is set (#8260)
Fixed lr_scheduler checkpointed state by calling update_lr_schedulers before saving checkpoints (#7877)
Fixed ambiguous warning when both overfit and train dataloader shuffling are enabled (#7685)
Fixed dev debugger memory growing due to tracking events even when disabled (#7875)
Fixed None loss keys getting added in training_epoch_end when using manual optimization and not returning a loss (#7772)
Fixed a bug where precision=64 with accelerator='ddp_spawn' would throw a pickle error (#6924)
Do not override the existing epoch value in logged_metrics when already logged by the user (#7982)
Support for manual optimization with DeepSpeed (#7970)
Fixed dataloader_idx argument value when predicting with only one DataLoader (#7941)
Fixed passing the stage argument of Callback.{setup,teardown} as a keyword (#7973)
Fixed metrics generated during validation sanity checking so that they are cleaned on end (#8171)
Fixed log_gpu_memory metrics not being added to logging when nothing else is logged (#8174)
Fixed a bug where calling log with a Metric instance would raise an error if it was a nested attribute of the model (#8181)
Fixed a bug where using precision=64 would cause buffers with complex dtype to be cast to real (#8208)
Fixed is_overridden returning true for wrapped functions with no changes (#8296)
Fixed a bug where truncated_bptt_steps would throw an AttributeError when the target RNN has multiple hidden states (#8145)
Fixed self.optimizers() not returning a single optimizer if it had been wrapped (#8326)
Fixed the on_after_backward hook not getting called when using manual optimization and no plugins (#8328)
Fixed the LightningModule.backward hook only getting called with the apex plugin when using manual optimization (#8328)
Fixed moving batch to device before sending it to the on_*_batch_start/on_*_batch_end callbacks and model hooks (#7378)
Fixed passing a custom DDPPlugin when choosing accelerator="ddp_cpu" for the accelerator (#6208)
Fixed missing call to LightningModule.untoggle_optimizer in training loop when running gradient accumulation with multiple optimizers (#8284)
Fixed hash of LightningEnum to work with value instead of name (#8421)
Fixed a bug where an extra checkpoint was saved at the end of training if the val_check_interval did not align with the number of training batches (#7724)
Fixed move_data_to_device to return the batch if the object's to function didn't return self (#8433)
Fixed progress bar updates for Pod Training (#8258)
Fixed clearing dataloader references before attaching new dataloaders in consecutive Trainer.{fit,validate,test,predict} runs (#8442)
Fixed memory leaks on GPU by moving optimizer_states, ResultCollection.extra, ResultMetric attributes, and LoggerConnector metrics to cpu. Also, delete the DDP wrapper on teardown (#8490)
Fixed SWA callback using LightningModule prevent_trainer_and_dataloaders_deepcopy to avoid OOM (#8472)
Fixed ModelPruning callback on_save_checkpoint to avoid making a deepcopy potentially leading to OOM (#8472)
Fixed the sampler replacement logic for DataLoaders which do not define all DataLoader attributes as __init__ parameters (#8519)
Fixed DeepSpeed Windows support (#8488)
Fixed DeepSpeed not properly setting the trainer lr_schedulers attribute (#8527)
Fixed experiment version and log-dir divergence in DDP when using multiple Trainer instances in sequence (#7403)
Enabled manual optimization for TPUs (#8458)
Fixed accumulate_grad_batches not being recomputed during model reload (#5334)
Fixed a TypeError when wrapping optimizers in the HorovodPlugin and running Trainer.test (#7840)
Fixed BackboneFinetuning restoration (#8501)
Fixed lr_scheduler with metric (e.g. torch.optim.lr_scheduler.ReduceLROnPlateau) when using automatic_optimization = False (#7643)
Fixed DeepSpeed breaking with no schedulers (#8580)
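
The LightningEnum fix in this list (#8421) makes the hash depend on the member's value instead of its name. The sketch below shows why that matters for an enum that also compares equal to plain strings. ValueHashedEnum and Stage are hypothetical classes written for illustration, and the case-insensitive comparison is an assumption, not a copy of Lightning's implementation.

```python
from enum import Enum


class ValueHashedEnum(str, Enum):
    """Hypothetical string enum whose hash follows its value."""

    def __eq__(self, other: object) -> bool:
        # Compare by value, ignoring case, against enums or plain strings.
        other_value = other.value if isinstance(other, Enum) else str(other)
        return self.value.lower() == other_value.lower()

    def __hash__(self) -> int:
        # Hashing the value keeps hash and equality consistent, so a
        # member and its string value land in the same dictionary slot.
        return hash(self.value.lower())


class Stage(ValueHashedEnum):
    FIT = "fit"
    TEST = "test"


lookup = {Stage.FIT: "training", Stage.TEST: "testing"}

# A plain string can now be used to look up an enum key.
found = lookup["fit"]
print(found)
```

If the hash were computed from the name ("FIT") while equality used the value ("fit"), objects that compare equal would hash differently and lookups like the one above would fail.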

Contributors

@00sapo @AffineParameter @ajtritt @akihironitta @ananthsub @aniketmaurya @aslisabanci @awaelchli @bamblebam @Borda @borisdayma @carmocca @dalek-who @DavidMChan @davors72 @dcfidalgo @ddrevicky @deepsource-autofix @djthegr8 @edenlightning @edgarriba @eladsegal @ethanwharris @eugeneh101 @fepegar @gaoteng-git @gtauzin @i-aki-y @janhenriklambrechts @jiwidi @justusschock @karthikrangasai @kaushikb11 @loic-beheshti @Lucklyric @ManuelPalermo @mauvilsa @maxoppelt @neggert @nikvaessen @nisheethlahoti @pre-commit-ci @rohitgr7 @ruotianluo @satishjasthi @SeanNaren @shirayu @shuyingsunshine21 @sid-sundrani @Sileadim @simran2905 @stancld @t-vi @tchaton @theblackfly @theodumont @tilman151 @tomy0000000 @tshu-w @vatch123 @WrRan @yifuwang

If we forgot someone, let us know :]
