
TPU Pod Training, IPU Accelerator, DeepSpeed Infinity, Fully Sharded Data Parallel

@kaushikb11 released this 27 Jul 15:30
· 5167 commits to master since this release
c7f8c8c

Today we are excited to announce Lightning 1.4, introducing support for TPU pods, XLA profiling, IPUs, and new plugins to reach 10+ billion parameters, including DeepSpeed Infinity, Fully Sharded Data Parallel, and more!

https://devblog.pytorchlightning.ai/announcing-lightning-1-4-8cd20482aee9

[1.4.0] - 2021-07-27

Added

  • Added extract_batch_size utility and corresponding tests to extract batch dimension from multiple batch types (#8357)
  • Added support for named parameter groups in LearningRateMonitor (#7987)
  • Added dataclass support for pytorch_lightning.utilities.apply_to_collection (#7935)
  • Added support to LightningModule.to_torchscript for saving to custom filesystems with fsspec (#7617)
  • Added KubeflowEnvironment for use with the PyTorchJob operator in Kubeflow
  • Added LightningCLI support for config files on object stores (#7521)
  • Added ModelPruning(prune_on_train_epoch_end=True|False) to choose when to apply pruning (#7704)
  • Added support for checkpointing based on a provided time interval during training (#7515)
  • Progress tracking
    • Added dataclasses for progress tracking (#6603, #7574, #8140, #8362)
    • Add {,load_}state_dict to the progress tracking dataclasses (#8140)
    • Connect the progress tracking dataclasses to the loops (#8244, #8362)
    • Do not reset the progress tracking dataclasses total counters (#8475)
  • Added support for passing a LightningDataModule positionally as the second argument to trainer.{validate,test,predict} (#7431)
  • Added argument trainer.predict(ckpt_path) (#7430)
  • Added clip_grad_by_value support for TPUs (#7025)
  • Added support for passing any class to is_overridden (#7918)
  • Added sub_dir parameter to TensorBoardLogger (#6195)
  • Added correct dataloader_idx to batch transfer hooks (#6241)
  • Added include_none=bool argument to apply_to_collection (#7769)
  • Added apply_to_collections to apply a function to two zipped collections (#7769)
  • Added ddp_fully_sharded support (#7487)
  • Added should_rank_save_checkpoint property to Training Plugins (#7684)
  • Added log_grad_norm hook to LightningModule to customize the logging of gradient norms (#7873)
  • Added save_config_filename init argument to LightningCLI to ease resolving name conflicts (#7741)
  • Added save_config_overwrite init argument to LightningCLI to ease overwriting existing config files (#8059)
  • Added reset dataloader hooks to Training Plugins and Accelerators (#7861)
  • Added trainer stage hooks for Training Plugins and Accelerators (#7864)
  • Added the on_before_optimizer_step hook (#8048)
  • Added IPU Accelerator (#7867)
  • Fault-tolerant training
    • Added {,load_}state_dict to ResultCollection (#7948)
    • Added {,load_}state_dict to Loops (#8197)
    • Set Loop.restarting=False at the end of the first iteration (#8362)
    • Save the loops state with the checkpoint (opt-in) (#8362)
    • Save a checkpoint to restore the state on exception (opt-in) (#8362)
    • Added state_dict and load_state_dict utilities for CombinedLoader + utilities for dataloader (#8364)
  • Added rank_zero_only to LightningModule.log function (#7966)
  • Added metric_attribute to LightningModule.log function (#7966)
  • Added a warning if Trainer(log_every_n_steps) is a value too high for the training dataloader (#7734)
  • Added LightningCLI support for argument links applied on instantiation (#7895)
  • Added LightningCLI support for configurable callbacks that should always be present (#7964)
  • Added DeepSpeed Infinity Support, and updated to DeepSpeed 0.4.0 (#7234)
  • Added support for torch.nn.UninitializedParameter in ModelSummary (#7642)
  • Added support for LightningModule.save_hyperparameters when LightningModule is a dataclass (#7992)
  • Added support for overriding optimizer_zero_grad and optimizer_step when using accumulate_grad_batches (#7980)
  • Added logger boolean flag to save_hyperparameters (#7960)
  • Added support for calling scripts using the module syntax (python -m package.script) (#8073)
  • Added support for optimizers and learning rate schedulers to LightningCLI (#8093)
  • Added XLA Profiler (#8014)
  • Added PrecisionPlugin.{pre,post}_backward (#8328)
  • Added on_load_checkpoint and on_save_checkpoint hooks to the PrecisionPlugin base class (#7831)
  • Added max_depth parameter in ModelSummary (#8062)
  • Added XLAStatsMonitor callback (#8235)
  • Added restore function and restarting attribute to base Loop (#8247)
  • Added FastForwardSampler and CaptureIterableDataset (#8307)
  • Added support for save_hyperparameters in LightningDataModule (#3792)
  • Added the ModelCheckpoint(save_on_train_epoch_end) to choose when to run the saving logic (#8389)
  • Added LSFEnvironment for distributed training with the LSF resource manager jsrun (#5102)
  • Added support for accelerator='cpu'|'gpu'|'tpu'|'ipu'|'auto' (#7808)
  • Added tpu_spawn_debug to plugin registry (#7933)
  • Enabled traditional/manual launching of DDP processes through LOCAL_RANK and NODE_RANK environment variable assignments (#7480)
  • Added quantize_on_fit_end argument to QuantizationAwareTraining (#8464)
  • Added experimental support for loop specialization (#8226)
  • Added support for devices flag to Trainer (#8440)
  • Added private prevent_trainer_and_dataloaders_deepcopy context manager on the LightningModule (#8472)
  • Added support for providing callables to the Lightning CLI instead of types (#8400)
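
The progress-tracking entries above (dataclasses, {,load_}state_dict, and total counters that are not reset) can be illustrated with a minimal sketch. The class and method names below (Tracker, Progress, reset_on_epoch) are hypothetical and chosen for illustration only; this is not Lightning's actual implementation, just one way such dataclasses could behave.

```python
from dataclasses import asdict, dataclass, field


@dataclass
class Tracker:
    """Hypothetical counter pair; not Lightning's real class."""
    ready: int = 0
    completed: int = 0

    def reset(self) -> None:
        self.ready = 0
        self.completed = 0


@dataclass
class Progress:
    """Keeps run-wide totals next to counters for the current epoch."""
    total: Tracker = field(default_factory=Tracker)
    current: Tracker = field(default_factory=Tracker)

    def increment_ready(self) -> None:
        self.total.ready += 1
        self.current.ready += 1

    def increment_completed(self) -> None:
        self.total.completed += 1
        self.current.completed += 1

    def reset_on_epoch(self) -> None:
        # Only the per-epoch counters are cleared; totals keep accumulating.
        self.current.reset()

    def state_dict(self) -> dict:
        # asdict recurses into nested dataclasses, giving plain dicts of ints.
        return asdict(self)

    def load_state_dict(self, state: dict) -> None:
        self.total = Tracker(**state["total"])
        self.current = Tracker(**state["current"])


progress = Progress()
for _ in range(3):
    progress.increment_ready()
    progress.increment_completed()
progress.reset_on_epoch()
progress.increment_ready()  # a batch that started but has not finished

restored = Progress()
restored.load_state_dict(progress.state_dict())
```

Because the state is a plain nested dict, it can be stored inside a checkpoint and reloaded to resume from the exact batch that was in flight.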

Changed

  • Decoupled device parsing logic from Accelerator connector to Trainer (#8180)
  • Changed the Trainer's checkpoint_callback argument to allow only boolean values (#7539)
  • Log epoch metrics before the on_evaluation_end hook (#7272)
  • Explicitly disallow calling self.log(on_epoch=False) during epoch-only or single-call hooks (#7874)
  • Changed these Trainer methods to be protected: call_setup_hook, call_configure_sharded_model, pre_dispatch, dispatch, post_dispatch, call_teardown_hook, run_train, run_sanity_check, run_evaluate, run_evaluation, run_predict, track_output_for_epoch_end
  • Changed metrics_to_scalars to work with any collection or value (#7888)
  • Changed clip_grad_norm to use torch.nn.utils.clip_grad_norm_ (#7025)
  • Validation is now always run inside the training epoch scope (#7357)
  • ModelCheckpoint now runs at the end of the training epoch by default (#8389)
  • EarlyStopping now runs at the end of the training epoch by default (#8286)
  • Refactored Loops
    • Moved attributes global_step, current_epoch, max/min_steps, max/min_epochs, batch_idx, and total_batch_idx to TrainLoop (#7437)
    • Refactored result handling in training loop (#7506)
    • Moved attributes hiddens and split_idx to TrainLoop (#7507)
    • Refactored the logic around manual and automatic optimization inside the optimizer loop (#7526)
    • Simplified "should run validation" logic (#7682)
    • Simplified logic for updating the learning rate for schedulers (#7682)
    • Removed the on_epoch guard from the "should stop" validation check (#7701)
    • Refactored internal loop interface; added new classes FitLoop, TrainingEpochLoop, TrainingBatchLoop (#7871, #8077)
    • Removed pytorch_lightning/trainer/training_loop.py (#7985)
    • Refactored evaluation loop interface; added new classes DataLoaderLoop, EvaluationLoop, EvaluationEpochLoop (#7990, #8077)
    • Removed pytorch_lightning/trainer/evaluation_loop.py (#8056)
    • Restricted public access to several internal functions (#8024)
    • Refactored trainer _run_* functions and separate evaluation loops (#8065)
    • Refactored prediction loop interface; added new classes PredictionLoop, PredictionEpochLoop (#7700, #8077)
    • Removed pytorch_lightning/trainer/predict_loop.py (#8094)
    • Moved result teardown to the loops (#8245)
    • Improve Loop API to better handle children state_dict and progress (#8334)
  • Refactored logging
    • Renamed and moved core/step_result.py to trainer/connectors/logger_connector/result.py (#7736)
    • Dramatically simplify the LoggerConnector (#7882)
    • trainer.{logged,progress_bar,callback}_metrics are now updated on-demand (#7882)
    • Completely overhaul the Result object in favor of ResultMetric (#7882)
    • Improve epoch-level reduction time and overall memory usage (#7882)
    • Allow passing self.log(batch_size=...) (#7891)
    • Each of the training loops now keeps its own results collection (#7891)
    • Remove EpochResultStore and HookResultStore in favor of ResultCollection (#7909)
    • Remove MetricsHolder (#7909)
  • Moved ignore_scalar_return_in_dp warning suppression to the DataParallelPlugin class (#7421)
  • Changed the behaviour when logging evaluation step metrics to no longer append /epoch_* to the metric name (#7351)
  • Raised ValueError when a None value is self.log-ed (#7771)
  • Changed resolve_training_type_plugins to allow setting num_nodes and sync_batchnorm from Trainer setting (#7026)
  • Default seed_everything(workers=True) in the LightningCLI (#7504)
  • Changed model.state_dict() in CheckpointConnector to allow training_type_plugin to customize the model's state_dict() (#7474)
  • MLFlowLogger now uses the env variable MLFLOW_TRACKING_URI as default tracking URI (#7457)
  • Changed Trainer arg and functionality from reload_dataloaders_every_epoch to reload_dataloaders_every_n_epochs (#5043)
  • Changed WandbLogger(log_model={True/'all'}) to log models as artifacts (#6231)
  • MLFlowLogger now accepts run_name as a constructor argument (#7622)
  • Changed teardown() in Accelerator to allow training_type_plugin to customize teardown logic (#7579)
  • Trainer.fit now raises an error when using manual optimization with unsupported features such as gradient_clip_val or accumulate_grad_batches (#7788)
  • Accelerator hooks are called regardless of whether LightningModule overrides the same hooks (#7826)
  • Moved profilers to their own file (#7822)
  • The on_after_backward hook is now called on accumulating iterations. Use the on_before_optimizer_step hook to mimic the old behaviour (#8328)
  • The mixed precision loss is no longer unscaled before the on_after_backward hook. Use the on_before_optimizer_step hook to mimic the old behaviour (#8328)
  • The TrainingTypePlugin.{pre,post}_backward hooks no longer take the optimizer, opt_idx, should_accumulate arguments (#8328)
  • The PrecisionPlugin.backward hook no longer returns a value (#8328)
  • The PrecisionPlugin.backward hook no longer takes a should_accumulate argument (#8328)
  • Added the on_before_backward hook (#7865)
  • LightningCLI now aborts with a clearer message if config already exists and disables save config during fast_dev_run (#7963)
  • Saved the LightningCLI config on setup and only on the main process (#8017)
  • Dropped the LightningCLI ArgumentParser when pickling (#8017)
  • Skip broadcast if distributed not initialized for the spawn plugins (#8017)
  • Trainer(resume_from_checkpoint=...) now restores the model directly after LightningModule.setup(), which is before LightningModule.configure_sharded_model() (#7652)
  • Moved torch.cuda.set_device() to enable collective calls earlier in setup (#8312)
  • Used XLA utility API to move data to CPU (Single TPU core) (#8078)
  • Improved error messages in replace_sampler when the DataLoader attributes are not included in the signature or the signature is missing optional arguments (#8519)
  • Moved the DeviceDtypeModuleMixin and HyperparametersMixin mixins to core (#8396)
  • Return the default_root_dir as the log_dir when the logger is a LoggerCollection (#8187)
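
The hook-order change above (on_after_backward now runs on accumulating iterations, while on_before_optimizer_step runs only when the optimizer steps) can be pictured with a toy simulation. The function simulate_hooks is hypothetical and only models the ordering described in this list; it is not Lightning's training loop and ignores edge cases such as a final partial accumulation window.

```python
from typing import List


def simulate_hooks(num_batches: int, accumulate_grad_batches: int) -> List[str]:
    """Toy model of hook ordering under gradient accumulation."""
    calls: List[str] = []
    for batch_idx in range(num_batches):
        # Backward runs for every batch, so this hook fires every iteration.
        calls.append(f"on_after_backward[{batch_idx}]")
        steps_now = (batch_idx + 1) % accumulate_grad_batches == 0
        if steps_now:
            # Fires only on iterations that end an accumulation window,
            # which is where the old on_after_backward behaviour lived.
            calls.append(f"on_before_optimizer_step[{batch_idx}]")
            calls.append(f"optimizer_step[{batch_idx}]")
    return calls


for call in simulate_hooks(num_batches=4, accumulate_grad_batches=2):
    print(call)
```

Code that inspected gradients once per optimizer step (for example, gradient-norm logging) would therefore move from on_after_backward to on_before_optimizer_step.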

Deprecated

  • Deprecated LightningModule.loaded_optimizer_states_dict (#8229)
  • Standardized the dataloaders arguments of trainer.{fit,validate,test,tune} (#7431)
  • Deprecated DataModule properties: has_prepared_data, has_setup_fit, has_setup_validate, has_setup_test, has_setup_predict, has_teardown_fit, has_teardown_validate, has_teardown_test, has_teardown_predict (#7657)
  • Deprecated TrainerModelHooksMixin in favor of pytorch_lightning.utilities.signature_utils (#7422)
  • Deprecated num_nodes and sync_batchnorm arguments in DDPPlugin and DDPSpawnPlugin (#7026)
  • Deprecated self.log(sync_dist_op) in favor of self.log(reduce_fx) (#7891)
  • Deprecated is_overridden(model=...) in favor of is_overridden(instance=...) (#7918)
  • Deprecated automatically detaching returned extras with grads (#7994)
  • Deprecated default value of monitor argument in EarlyStopping callback to enforce monitor as a required argument (#7907)
  • Deprecated importing rank_zero_{warn,deprecation} directly from pytorch_lightning.utilities.distributed (#8085)
  • Deprecated the use of CheckpointConnector.hpc_load() in favor of CheckpointConnector.restore() (#7652)
  • Deprecated ModelCheckpoint(every_n_val_epochs) in favor of ModelCheckpoint(every_n_epochs) (#8383)
  • Deprecated DDPPlugin.task_idx in favor of DDPPlugin.local_rank (#8203)
  • Deprecated the Trainer.train_loop property in favor of Trainer.fit_loop (#8025)
  • Deprecated the Trainer.disable_validation property in favor of not Trainer.enable_validation (#8291)
  • Deprecated mode parameter in ModelSummary in favor of max_depth (#8062)
  • Deprecated reload_dataloaders_every_epoch argument of Trainer in favor of reload_dataloaders_every_n_epochs (#5043)
  • Deprecated distributed_backend argument for Trainer (#8575)
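
Several of the deprecations above are plain keyword renames. As a rough aid when updating call sites, the sketch below maps old keyword names to their replacements and warns on each rename. The helper migrate_kwargs and the RENAMED_KWARGS table are hypothetical and not part of Lightning; only the old and new names themselves come from the list above.

```python
import warnings
from typing import Any, Dict, Tuple

# (owner, deprecated keyword) -> replacement keyword, taken from the list above.
RENAMED_KWARGS: Dict[Tuple[str, str], str] = {
    ("Trainer", "reload_dataloaders_every_epoch"): "reload_dataloaders_every_n_epochs",
    ("ModelCheckpoint", "every_n_val_epochs"): "every_n_epochs",
    ("log", "sync_dist_op"): "reduce_fx",
    ("is_overridden", "model"): "instance",
}


def migrate_kwargs(owner: str, kwargs: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of kwargs with deprecated names replaced.

    Values are passed through unchanged. Some renames also change the
    meaning of the value (a flag becoming an epoch interval, for example),
    so migrated values should still be reviewed by hand.
    """
    migrated: Dict[str, Any] = {}
    for name, value in kwargs.items():
        new_name = RENAMED_KWARGS.get((owner, name))
        if new_name is None:
            migrated[name] = value
            continue
        warnings.warn(
            f"{owner}({name}=...) is deprecated; use {new_name} instead",
            DeprecationWarning,
            stacklevel=2,
        )
        migrated[new_name] = value
    return migrated


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    checkpoint_kwargs = migrate_kwargs(
        "ModelCheckpoint", {"every_n_val_epochs": 2, "monitor": "val_acc"}
    )
```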

Removed

  • Dropped official support/testing for PyTorch <1.6 (#8288)
  • Removed ProfilerConnector (#7654)
  • Pruned deprecated classification metrics from pytorch_lightning.metrics.functional.classification (#7499)
  • Removed deprecated data parallel classes LightningDataParallel and LightningDistributedDataParallel from pytorch_lightning.overrides.data_parallel (#7510)
  • Removed deprecated trainer attributes - get_model and accelerator_backend (#7502)
  • Removed support for automatically monitoring the val_loss key with ModelCheckpoint. Pass your monitor of choice to the ModelCheckpoint instance instead (#8293)
  • Removed support for self.log(tbptt_reduce_fx) and self.log(tbptt_pad_token). Please, open a discussion explaining your use-case if you relied on these. (#7644)
  • Removed deprecated utils modules model_utils, warning_utils, xla_device_utils and partially argparse_utils (#7503)
  • Removed RPCPlugin and RPCSequentialPlugin. If you were successfully using these plugins, please open a GitHub discussion about your use case (#8101)
  • Removed deprecated trainer attributes - on_cpu, on_tpu, use_tpu, on_gpu, use_dp, use_ddp, use_ddp2, use_horovod, use_single_gpu (#7501)
  • Removed deprecated optimizer argument in LightningModule.manual_backward(); Toggling optimizers in manual optimization should be done using LightningModule.{un}toggle_optimizer() (#8287)
  • Removed DeepSpeed FP16 Exception as FP32 is now supported (#8462)
  • Removed environment variable PL_EXP_VERSION from DDP subprocesses (#7403)

Fixed

  • Fixed the GPUStatsMonitor callbacks to use the correct GPU IDs if CUDA_VISIBLE_DEVICES is set (#8260)
  • Fixed lr_scheduler checkpointed state by calling update_lr_schedulers before saving checkpoints (#7877)
  • Fixed ambiguous warning when both overfit and train dataloader shuffling are enabled (#7685)
  • Fixed dev debugger memory growing due to tracking events even when disabled (#7875)
  • Fixed None loss keys getting added in training_epoch_end when using manual optimization and not returning a loss (#7772)
  • Fixed a bug where precision=64 with accelerator='ddp_spawn' would throw a pickle error (#6924)
  • Do not override the existing epoch value in logged_metrics when already logged by the user (#7982)
  • Support for manual optimization with DeepSpeed (#7970)
  • Fixed dataloader_idx argument value when predicting with only one DataLoader (#7941)
  • Fixed passing the stage argument of Callback.{setup,teardown} as a keyword (#7973)
  • Fixed cleanup of metrics generated during validation sanity checking once it ends (#8171)
  • Fixed log_gpu_memory metrics not being added to logging when nothing else is logged (#8174)
  • Fixed a bug where calling log with a Metric instance would raise an error if it was a nested attribute of the model (#8181)
  • Fixed a bug where using precision=64 would cause buffers with complex dtype to be cast to real (#8208)
  • Fixed is_overridden returning true for wrapped functions with no changes (#8296)
  • Fixed a bug where truncated_bptt_steps would throw an AttributeError when the target RNN has multiple hidden states (#8145)
  • Fixed self.optimizers() not returning a single optimizer if it had been wrapped (#8326)
  • Fixed the on_after_backward hook not getting called when using manual optimization and no plugins (#8328)
  • Fixed the LightningModule.backward hook only getting called with the apex plugin when using manual optimization (#8328)
  • Fixed moving batch to device before sending it to the on_*_batch_start/on_*_batch_end callbacks and model hooks (#7378)
  • Fixed passing a custom DDPPlugin when choosing accelerator="ddp_cpu" (#6208)
  • Fixed missing call to LightningModule.untoggle_optimizer in training loop when running gradient accumulation with multiple optimizers (#8284)
  • Fixed hash of LightningEnum to work with value instead of name (#8421)
  • Fixed a bug where an extra checkpoint was saved at the end of training if the val_check_interval did not align with the number of training batches (#7724)
  • Fixed move_data_to_device to return the batch if the object's to function didn't return self (#8433)
  • Fixed progress bar updates for Pod Training (#8258)
  • Fixed clearing dataloader references before attaching new dataloaders in consecutive Trainer.{fit,validate,test,predict} runs (#8442)
  • Fixed memory leaks on GPU by moving optimizer_states, ResultCollection.extra, ResultMetric attributes, and LoggerConnector metrics to cpu. Also, delete the DDP wrapper on teardown (#8490)
  • Fixed SWA callback using LightningModule prevent_trainer_and_dataloaders_deepcopy to avoid OOM (#8472)
  • Fixed ModelPruning callback on_save_checkpoint to avoid making a deepcopy potentially leading to OOM (#8472)
  • Fixed the sampler replacement logic for DataLoaders which do not define all DataLoader attributes as __init__ parameters (#8519)
  • Fixed DeepSpeed Windows support (#8488)
  • Fixed DeepSpeed not properly setting the trainer lr_schedulers attribute (#8527)
  • Fixed experiment version and log-dir divergence in DDP when using multiple Trainer instances in sequence (#7403)
  • Enabled manual optimization for TPUs (#8458)
  • Fixed accumulate_grad_batches not being recomputed during model reload (#5334)
  • Fixed a TypeError when wrapping optimizers in the HorovodPlugin and running Trainer.test (#7840)
  • Fixed BackboneFinetuning restoration (#8501)
  • Fixed lr_scheduler with metric (e.g. torch.optim.lr_scheduler.ReduceLROnPlateau) when using automatic_optimization = False (#7643)
  • Fixed DeepSpeed breaking with no schedulers (#8580)
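
To see why hashing an enum by value rather than by name matters (the LightningEnum fix above), consider a string-valued enum whose equality is case-insensitive. The classes below are hypothetical stand-ins written for illustration, not Lightning's source. If equality compares values, the hash must be derived from the value as well; otherwise set and dict lookups disagree with ==.

```python
from enum import Enum


class CaseInsensitiveStrEnum(str, Enum):
    """Hypothetical stand-in for a string enum compared by value."""

    def __eq__(self, other: object) -> bool:
        other_value = other.value if isinstance(other, Enum) else other
        return self.value.lower() == str(other_value).lower()

    def __hash__(self) -> int:
        # Hash the same thing __eq__ compares, so equal objects hash equally.
        # Hashing the member name ("DDP_SPAWN") would not match "ddp_spawn".
        return hash(self.value.lower())


class Backend(CaseInsensitiveStrEnum):
    DDP = "ddp"
    DDP_SPAWN = "ddp_spawn"


# Equality ignores case, and hashed containers agree with it.
print(Backend.DDP == "DDP")
print(Backend.DDP_SPAWN in {"ddp", "ddp_spawn"})
print({Backend.DDP: "distributed"}["ddp"])
```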

Contributors

@00sapo @AffineParameter @ajtritt @akihironitta @ananthsub @aniketmaurya @aslisabanci @awaelchli @bamblebam @Borda @borisdayma @carmocca @dalek-who @DavidMChan @davors72 @dcfidalgo @ddrevicky @deepsource-autofix @djthegr8 @edenlightning @edgarriba @eladsegal @ethanwharris @eugeneh101 @fepegar @gaoteng-git @gtauzin @i-aki-y @janhenriklambrechts @jiwidi @justusschock @karthikrangasai @kaushikb11 @loic-beheshti @Lucklyric @ManuelPalermo @mauvilsa @maxoppelt @neggert @nikvaessen @nisheethlahoti @pre-commit-ci @rohitgr7 @ruotianluo @satishjasthi @SeanNaren @shirayu @shuyingsunshine21 @sid-sundrani @Sileadim @simran2905 @stancld @t-vi @tchaton @theblackfly @theodumont @tilman151 @tomy0000000 @tshu-w @vatch123 @WrRan @yifuwang

If we forgot someone, let us know :]