All notable changes to this project will be documented in this file.
The format is based on Keep a Changelog.
- Added process sets to concurrently run collective operations on subsets of Horovod processes in TensorFlow, PyTorch, and MXNet. (#2839, #3042, #3043, #3054, #3083, #3090)
- Added XLA support for Allreduce via `tf.function(jit_compile=True)`. (#3053)
- Added fused buffer scaling and unpack/pack kernels on GPU. (#2973)
- Added support for NCCL on CUDA 11.4. (#3182)
- Added fp16 compression for MXNet. (#2987)
- Added `terminate_on_nan` flag to Spark Lightning estimator. (#3088)
- Added `barrier()` API to torch module to support simple synchronization among ranks and to achieve parity with PyTorch DDP and similar frameworks. (#3139)
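A barrier blocks every rank until all ranks have reached the same point. The sketch below is a single-process analogy using Python's `threading.Barrier`, not Horovod's implementation; in Horovod the "ranks" are separate processes synchronized via `hvd.barrier()`.

```python
import threading

# Single-process analogy for a rank barrier: no thread proceeds past
# barrier.wait() until all "ranks" (threads here) have reached it.
NUM_RANKS = 4
barrier = threading.Barrier(NUM_RANKS)
reached, passed = [], []

def worker(rank):
    reached.append(rank)   # work done before the synchronization point
    barrier.wait()         # blocks until all NUM_RANKS workers arrive
    passed.append(rank)    # only runs once everyone has arrived

threads = [threading.Thread(target=worker, args=(r,)) for r in range(NUM_RANKS)]
for t in threads:
    t.start()
for t in threads:
    t.join()

assert sorted(passed) == list(range(NUM_RANKS))
```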
- Added params for customizing Tensorboard callback. (#3153)
- Added `hvd.cross_rank()` for keras. (#3008)
- Implemented more asynchronous dependency handling on GPU. (#2963)
- Ray: RayExecutor will now use the current placement group instead of always creating a new one. (#3134)
- Lightning: turned off shuffling for validation dataset. (#2974)
- Extended `hvd.join()` to return the last rank that joined. (#3097)
- Spark/Keras: remove bare Keras support. (#3191)
- Fixed Horovod develop/editable install mode and incremental builds. (#3074)
- Estimator/Lightning: use lightning datamodule. (#3084)
- Fixed Horovod Spark StringType and numpy type mapping issue. (#3146)
- Fixed error in Keras LearningRateScheduler. (#3135)
- Fixed bug in Lightning Profiler on Ray. (#3122)
- Fixed torch op lazy release to prevent OOM in elastic training. (#3110)
- Lightning: fixed usage of the checkpoint callback. (#3186)
- Fixed MPICH support to use Intel MPI's implementation. (#3148)
- Fixed race condition in PyTorch async dataloader. (#3120)
- Estimator: added support for loading data from S3, GCS, ADLS, and other remote filesystems. (#2927)
- Estimator: added custom Spark data loader interface. (#2938)
- LightningEstimator: added support to supply a logger and associated parameter to control the frequency of logging. (#2926)
- Estimator: added check to ensure all ranks have the same device type. (#2942)
- Changed behavior from using TensorBoardLogger to now using it as a fallback if a logger is not supplied. (#2926)
- Ray: disabled capturing child tasks in placement group. (#2920)
- Fixed `hvd.tensorflow.keras.Compression`, accidentally removed in v0.22.0. (#2945)
- TorchEstimator: fixed usage of `validation_steps` in place of `validation_steps_per_epoch`. (#2918)
- TensorFlow: fixed C++ API for TF v2.6.0. (#2932)
- PyTorch: fixed `sparse_allreduce_async` for PyTorch v0.10.0. (#2965)
- Added pytorch_lightning spark estimator which enables training pytorch_lightning models. (#2713)
- Added NVTX tracing hooks for profiling with Nsight Systems. (#2723)
- Added a generic `num_workers` API for `RayExecutor`. (#2870)
- Added support for Ray Client without code changes. (#2882)
- Added in-memory cache option for Keras Estimator. (#2896)
- Added FP16 support for GPU tensor in MXNet. (#2915)
- Added response caching for allgather operations. (#2872)
- Estimator: added petastorm `reader_pool_type` into constructor. (#2903)
- Changed `alltoall` to return the received splits as a second return value if non-uniform splits are sent. (#2631)
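With non-uniform splits, rank `i` sends `splits[j]` elements to rank `j`, and the second return value tells each receiver how many elements arrived from each peer. The following is a pure-Python simulation of those semantics for illustration only (no Horovod required; function and variable names are hypothetical):

```python
def simulated_alltoall(send_tensors, send_splits):
    """Simulate alltoall semantics with non-uniform splits.

    send_tensors[i]: flat list of values rank i sends.
    send_splits[i][j]: how many of rank i's values go to rank j.
    Returns (received, received_splits), indexed by receiving rank.
    """
    world_size = len(send_tensors)
    # Slice each rank's send buffer according to its split counts.
    chunks = []
    for i in range(world_size):
        offset, out = 0, []
        for count in send_splits[i]:
            out.append(send_tensors[i][offset:offset + count])
            offset += count
        chunks.append(out)
    received, received_splits = [], []
    for j in range(world_size):
        parts = [chunks[i][j] for i in range(world_size)]
        received.append([x for part in parts for x in part])
        # Second return value: how much arrived from each peer rank.
        received_splits.append([len(p) for p in parts])
    return received, received_splits

recv, recv_splits = simulated_alltoall(
    [[1, 2, 3], [4, 5]],   # rank 0 sends 3 values, rank 1 sends 2
    [[1, 2], [2, 0]],      # rank 0: 1 to rank 0, 2 to rank 1; rank 1: 2 to rank 0
)
assert recv == [[1, 4, 5], [2, 3]]
assert recv_splits == [[1, 2], [2, 0]]
```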
- Changed `RayExecutor` to use Ray Placement Groups for worker colocation. (#2824)
- Changed in-memory dataloader usage for Torch Estimator with the petastorm v0.11.0 release. (#2896)
- Changed RayExecutor to use Ray node ID to enable multi-container:single-host setups. (#2883)
- Added support for sparse gradients aggregation in TF1 Keras. (#2879)
- Respect `global_step` parameter for LegacyOptimizers when aggregating gradients. (#2879)
- Fixed compatibility with PyTorch 1.9.0. (#2829)
- Added `groups` parameter in `DistributedOptimizer` for custom allreduce groups. (#2523)
- Removed `num_groups` parameter in `DistributedOptimizer`, replaced with `groups`. (#2523)
- Fixed worker desynchronization deadlock issue in TensorFlow 2.4. (#2647)
- Deduped Keras `LearningRateWarmupCallback` log after gradual learning rate warmup. (#2661)
- Added support for Intel(R) MPI in horovodrun. (#2374)
- Added support for callbacks in Ray Elastic Executor. (#2639)
- Added forwarding of stdout/stderr captured to driver over Gloo. (#2646)
- Fixed `broadcast_optimizer_state` to handle NoneType params for PyTorch 1.8. (#2624)
- Fixed `local_rank` support for Ray. (#2596)
- Fixed DL estimators to obtain the output df schema without sampling the input. (#2611)
- Fixed wrong default for `horovod.tensorflow.keras.allreduce` average. (#2627)
- Added in-memory dataset caching param to `TorchEstimator`. (#2434)
- Added `val_batch_size` param to the Estimator API. (#2505)
- Added support for TorchScript modules when using `TorchEstimator`. (#2494)
- Migrated to oneCCL aligned with oneAPI specification v1.0. (#2513)
- Added knob to set cache hint for oneCCL allreduce. (#2560)
- Renamed `horovodrun` arg `--ccl-bgt-affinity` to `--thread-affinity`. (#2562)
- Changed default build parallelism from `-j8` to `-j1` to address potential race condition. (#2572)
- Fixed building Horovod for ROCm PyTorch with newer hipify script. (#2360)
- Fixed "Executable class" support for Ray. (#2510)
- Fixed TorchEstimator returning model without switching to eval mode. (#2517)
- Removed SSH reliance for Ray elastic training. (#2528)
- Fixed error handling for changing framework without reinstalling horovod. (#2529)
- Fixed "Intermediate path does not exist" error with `DBFSLocalStore`. (#2526)
- Avoided synchronization if workers are only shrunk in elastic mode. (#2514)
- Fixed Ray resource test. (#2575)
- Fixed usage of env variable `HOROVOD_GLOO_TIMEOUT_SECONDS` with `horovodrun`. (#2571)
- Added support for `backward_passes_per_step > 1` for TF Keras graph mode. (#2346)
- Added support for `backward_passes_per_step > 1` for TF Keras eager execution. (#2371)
- Added support for `backward_passes_per_step > 1` for TF LegacyOptimizer in graph mode. (#2401)
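`backward_passes_per_step > 1` means local gradients are accumulated over several backward passes and the optimizer step (with its allreduce) happens only once per group of passes. A minimal arithmetic sketch of that accumulation pattern, not Horovod's implementation:

```python
# Sketch of gradient accumulation with backward_passes_per_step = N:
# gradients are summed over N backward passes, then averaged and applied
# once, so the effective batch size grows by a factor of N.
backward_passes_per_step = 4
micro_batch_grads = [1.0, 2.0, 3.0, 4.0]  # gradient from each backward pass

accumulated = 0.0
applied_updates = []
for i, g in enumerate(micro_batch_grads, start=1):
    accumulated += g
    if i % backward_passes_per_step == 0:
        # One optimizer step per N passes, using the averaged gradient.
        applied_updates.append(accumulated / backward_passes_per_step)
        accumulated = 0.0

assert applied_updates == [2.5]  # (1 + 2 + 3 + 4) / 4, applied once
```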
- Added grouped allreduce to enable more efficient tensor fusion and deterministic training. (#2453)
- Added support for specifying `op` and `compression` in `horovod.tensorflow.keras.allreduce()`. (#2423)
- Added support for batched D2D memcopy kernel on GPU. (#2435)
- Added schema inference in Spark Estimator without sampling. (#2373)
- Added `Store.create("dbfs:/")` mapping to `DBFSLocalStore("/dbfs/...")`. (#2376)
- Changed Keras callbacks to require parameter `initial_lr` of `LearningRateScheduleCallback` and `LearningRateWarmupCallback`. (#2459)
- Changed default cycle time from 5ms to 1ms and fusion threshold from 64MB to 128MB. (#2468)
- Fixed support for TensorFlow v2.4.0. (#2381)
- Fixed averaging using CUDA half2 implementation for one-element half buffers. (#2375)
- Fixed `HOROVOD_THREAD_AFFINITY` when using oneCCL. (#2350)
- Added timeout to SSH check in horovodrun to prevent hanging. (#2448)
- Added `HOROVOD_GLOO_TIMEOUT_SECONDS` value to error messages. (#2436)
- Fixed race condition in dynamic timeline API. (#2341)
- Fixed `--log-hide-timestamp` to apply to driver logs with Gloo. (#2388)
- Fixed the search order of Eigen and Flatbuffers paths. (#2473)
- Fixed type checks in `TorchEstimator` to correctly use `isinstance()`. (#2480)
- Added Elastic Ray integration. (#2291)
- Removed dependency on SSH access for Ray. (#2275)
- Fixed building Horovod without HOROVOD_WITHOUT_MXNET when MXNet is not installed. (#2334)
- Added Databricks storage `DBFSLocalStore` and support for GPU-aware scheduling to horovod.spark Estimator. (#2234)
- Added ElasticSampler and PyTorch Elastic ImageNet example. (#2297)
- Added ability to dynamically start and stop timeline programmatically. (#2215)
- Added support for Gloo on macOS. (#2254)
- Exposed `name` argument to TensorFlow allreduce operation. (#2325)
- Added option to strip outer name scope from Horovod ops in TensorFlow. (#2328)
- Fixed usage of `VERBOSE=1` when setting custom `MAKEFLAGS`. (#2239)
- Fixed bugs in Keras Elastic Callback classes. (#2289)
- Fixed RelWithDebInfo build and made it the default with -O3 optimizations. (#2305)
- Fixed usage of `tf.cond` in TensorFlow alltoall gradient. (#2327)
- Fixed allreduce averaging for TF IndexedSlices in ROCm path. (#2279)
- Included stdexcept to handle certain compilers/frameworks that don't include it already. (#2238)
- Fixed Debug builds by setting compiler options based on CMake build type. (#2263)
- Skipped launching zero-sized send/recvs for NCCLAlltoall. (#2273)
- Fixed missing run in TF Keras elastic mode. (#2272)
- Fixed loss function in TensorFlow2 elastic synthetic benchmark. (#2265)
- Fixed usage of `HOROVOD_MIXED_INSTALL` env var in alltoall tests. (#2266)
- Removed keras requirement from Ray example. (#2262)
- Added bare-metal elastic mode implementation to enable auto-scaling and fault tolerance. (#1849)
- Added Elastic Horovod support for Spark auto-scaling. (#1956)
- Added All-to-All operation for TensorFlow, PyTorch, and MXNet. (#2143)
- Added support for `gradient_predivide_factor` and averaging in Horovod backend. (#1949)
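A `gradient_predivide_factor` of `f` splits the averaging around the summing allreduce: gradients are scaled by `1/f` before the sum and by `f/size` after it, so the net result still equals the mean over ranks. A numeric sketch of that arithmetic (the helper name is hypothetical, and this is an illustration of the scaling, not Horovod's code):

```python
# Sketch of gradient_predivide_factor semantics: with factor f, each rank
# divides its gradient by f before the summing allreduce, and the summed
# result is divided by (size / f) afterwards. Net effect: sum(g) / size.
def averaged_with_predivide(grads, f):
    size = len(grads)
    pre = [g / f for g in grads]   # per-rank scaling before the allreduce
    summed = sum(pre)              # the allreduce (sum) itself
    return summed / (size / f)     # scaling after the allreduce

grads = [2.0, 4.0, 6.0, 8.0]
# The two division steps compose to a plain mean regardless of f.
assert averaged_with_predivide(grads, 2.0) == sum(grads) / len(grads)  # 5.0
```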
- Added NCCL implementation of the allgather operation. (#1952)
- Added `HOROVOD_GPU_OPERATIONS` installation variable to simplify enabling NCCL support for all GPU operations. (#1960)
- Added TensorFlow implementation of `SyncBatchNormalization` layer. (#2075)
- Added `hvd.is_initialized()` method. (#2020)
- Added `hvd.allgather_object` function for TensorFlow, PyTorch, and MXNet. (#2166)
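The object-allgather pattern serializes an arbitrary picklable object on each rank, allgathers the byte buffers, and deserializes the full list on every rank. A hedged single-process sketch of that pattern (the function name is hypothetical; the real call operates across processes):

```python
import pickle

# Sketch of the allgather_object pattern: pickle per rank, allgather the
# byte buffers, then unpickle everything on every rank.
def simulated_allgather_object(objects_by_rank):
    # Step 1: each rank serializes its (arbitrary, picklable) object.
    payloads = [pickle.dumps(obj) for obj in objects_by_rank]
    # Step 2: the allgather makes every payload visible to every rank.
    # Step 3: each rank deserializes the gathered list in rank order.
    return [pickle.loads(p) for p in payloads]

gathered = simulated_allgather_object([{"rank": 0}, {"rank": 1}, [3, 4]])
assert gathered == [{"rank": 0}, {"rank": 1}, [3, 4]]
```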
- Added `hvd.broadcast_object` function for MXNet. (#2122)
- Added `label_shapes` parameter to KerasEstimator and TorchEstimator. (#2140)
- Added optional `modelCheckPoint` callback to KerasEstimator params. (#2124)
- Added `ssh_identity_file` argument to `horovodrun`. (#2201)
- Added support for `horovodrun` on `kubeflow/mpi-job`. (#2199)
- Added Ray integration. (#2218)
- Moved `horovod.run.runner.run` to `horovod.run`. (#2099)
- `HOROVOD_THREAD_AFFINITY` accepts multiple values, one for every Horovod rank. (#2131)
- Migrated build system for native libraries to CMake. (#2009)
- `HOROVOD_CCL_BGT_AFFINITY` is deprecated. Use `HOROVOD_THREAD_AFFINITY` instead. (#2131)
- Dropped support for Python 2. (#1954)
- Dropped support for TensorFlow < 1.15. (#2169)
- Dropped support for PyTorch < 1.2. (#2086)
- Fixed MXNet allgather implementation to correctly handle resizing the output buffer. (#2092)
- Fixed Keras Spark Estimator incompatibility with TensorFlow 1.15 due to `tf.autograph`. (#2069)
- Fixed API compatibility with PyTorch 1.6. (#2051)
- Fixed Keras API compatibility with TensorFlow 2.4.0. (#2178)
- Fixed allgather gradient for TensorFlow 2 in cases where the tensor shape is not known during graph construction. (#2121)
- Fixed running using Gloo with an imbalanced number of workers per host. (#2212)