Code to reproduce the experiments reported in this paper:

Jianyu Wang, Hao Liang, Gauri Joshi, "Overlap Local-SGD: An Algorithmic Approach to Hide Communication Delays in Distributed SGD," ICASSP 2020. (arXiv)

This repo contains implementations of the following algorithms: LocalSGD (including BMUF), OverlapLocalSGD, EASGD, and CoCoDSGD.

Please cite this paper if you use this code for your research/projects.

Dependencies and Setup

The code runs on Python 3.5 with PyTorch 1.0.0 and torchvision 0.2.1. Non-blocking communication is implemented using Python's threading module.
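
As a rough illustration of how a background thread can hide communication behind computation, the sketch below starts a simulated averaging step in a `threading.Thread` while the main thread keeps doing local work. The names `slow_average`, `local_steps`, and `overlapped_round`, and the use of `time.sleep` as a stand-in for network delay, are assumptions made for this example and are not taken from the repo.

```python
import threading
import time


def slow_average(values, result, delay=0.05):
    """Stand-in for an all-reduce: sleeps to mimic network delay, then averages."""
    time.sleep(delay)
    result["mean"] = sum(values) / len(values)


def local_steps(x, lr, steps):
    """Plain gradient descent on f(x) = x**2 / 2, whose gradient is x."""
    for _ in range(steps):
        x -= lr * x
    return x


def overlapped_round(worker_values, x, lr=0.1, steps=5):
    """Start the averaging in the background, compute locally, then join."""
    result = {}
    comm = threading.Thread(target=slow_average, args=(worker_values, result))
    comm.start()                   # communication begins without blocking
    x = local_steps(x, lr, steps)  # computation proceeds in the meantime
    comm.join()                    # blocks only if communication is unfinished
    return x, result["mean"]


x_new, mean = overlapped_round([1.0, 2.0, 3.0, 6.0], x=1.0)
```

The main thread blocks only at `join()`, so the communication delay is hidden whenever the local steps take at least as long as the simulated transfer.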

Training examples

We implement all of the above-mentioned algorithms as subclasses of torch.optim.Optimizer. Typical usage is shown below:

import distoptim

# Before training
# define the optimizer
# One can use: 1) LocalSGD (including BMUF); 2) OverlapLocalSGD; 
#              3) EASGD; 4) CoCoDSGD
# tau is the number of local updates / communication period
optimizer = distoptim.SELECTED_OPTIMIZER(tau)
# ... define model, criterion, logging, etc.

# Start training
for batch_id, (data, label) in enumerate(data_loader):
	# same as serial training
	output = model(data) # forward
	loss = criterion(output, label)
	loss.backward() # backward
	optimizer.step() # gradient step

	# additional line to average local models at workers
	# communication happens after every tau iterations
	# optimizer has its own iteration counter inside
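
To make the role of `tau` and the internal iteration counter concrete, here is a minimal pure-Python sketch of periodic model averaging, with no PyTorch or real communication involved. `ToyLocalSGD`, its `step` and `maybe_average` methods, and the quadratic objective are hypothetical names invented for this illustration; they are not the `distoptim` API.

```python
class ToyLocalSGD:
    """Keeps one scalar model per worker and averages them every tau steps."""

    def __init__(self, models, lr, tau):
        self.models = list(models)
        self.lr = lr
        self.tau = tau
        self.iteration = 0  # internal counter, so the caller need not track it

    def step(self, targets):
        # Each worker minimizes (x - target)**2 / 2 on its own data.
        self.models = [x - self.lr * (x - t) for x, t in zip(self.models, targets)]
        self.iteration += 1

    def maybe_average(self):
        # Called every iteration; only acts once every tau iterations.
        if self.iteration % self.tau != 0:
            return False
        mean = sum(self.models) / len(self.models)
        self.models = [mean] * len(self.models)
        return True


opt = ToyLocalSGD(models=[0.0, 0.0], lr=0.5, tau=2)
synced_at = []
for it in range(1, 5):
    opt.step(targets=[1.0, 3.0])
    if opt.maybe_average():
        synced_at.append(it)
```

With `tau=2`, the two toy workers drift toward their own targets for two iterations and are reset to their mean at iterations 2 and 4.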

In addition, one needs to initialize the process group as described in this documentation. In our private cluster, each machine has one GPU.

# backend = gloo or nccl
# rank: 0,1,2,3,...
# size: number of workers
# h0 is the host name of worker0, you need to change it
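
The comments above map onto the keyword arguments of `torch.distributed.init_process_group`. A minimal sketch, assuming a TCP rendezvous on worker 0: the helper `build_init_kwargs` and the port `23456` are hypothetical choices made for this example, and the host name `h0` must be replaced with a real one.

```python
def build_init_kwargs(backend, rank, size, host="h0", port=23456):
    """Collect keyword arguments for torch.distributed.init_process_group."""
    if backend not in ("gloo", "nccl"):
        raise ValueError("backend must be 'gloo' or 'nccl'")
    if not 0 <= rank < size:
        raise ValueError("rank must satisfy 0 <= rank < size")
    return {
        "backend": backend,
        "init_method": "tcp://{}:{}".format(host, port),  # port is an assumption
        "rank": rank,
        "world_size": size,
    }


kwargs = build_init_kwargs("gloo", rank=0, size=4)

# On every worker, with PyTorch installed, one would then call:
#   import torch.distributed as dist
#   dist.init_process_group(**kwargs)
```

Every worker passes the same `host`, `port`, and `size` but its own `rank`, so that all processes meet at the same address.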


	title={Overlap Local-{SGD}: An Algorithmic Approach to Hide Communication Delays in Distributed {SGD}},
	author={Wang, Jianyu and Liang, Hao and Joshi, Gauri},
	journal={arXiv preprint arXiv:2002.09539},