Optimal Sharded Data Parallel (OSDP) is an automated parallel training system that combines the advantages of both data and model parallelism. It has the following characteristics:
- Efficiency: OSDP removes the restriction of previous ZeRO-based systems that always minimize memory redundancy, and instead decides for each operator individually whether to shard its parameters (see the sketch after this list). In addition, OSDP supports operator splitting and fine-grained memory management, which enlarges the decision space.
- Flexibility: OSDP provides an efficient search engine that automatically finds the optimal parallel strategy for each operator in the computation graph and generates the corresponding execution plan.
- Usability: only a few lines of code need to be changed to deploy OSDP.
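To make the per-operator decision concrete, here is a minimal hand-written sketch (using FairScale's wrap API, which the BERT examples later in this README also rely on): one submodule is sharded while another stays replicated. OSDP's search engine makes this choice automatically for every operator, so the plan below is purely illustrative and assumes an already initialized torch.distributed process group.

import torch
import torch.nn as nn
from fairscale.nn.data_parallel import FullyShardedDataParallel as FSDP
from fairscale.nn.wrap import enable_wrap, wrap

class Block(nn.Module):
    # A hand-written per-operator sharding plan (illustration only):
    # the large MLP is sharded, the small projection stays replicated.
    def __init__(self, hidden_size):
        super().__init__()
        with enable_wrap(wrapper_cls=FSDP):
            self.mlp = wrap(nn.Linear(hidden_size, 4 * hidden_size))  # parameters sharded
        self.proj = nn.Linear(hidden_size, hidden_size)               # parameters replicated

    def forward(self, x):
        return self.proj(torch.relu(self.mlp(x)))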
Feel free to contribute code, open issues, and submit pull requests.
OSDP is introduced in the following papers:
- Youhe Jiang, Xupeng Miao, Xiaonan Nie, Bin Cui. OSDP: Optimal Sharded Data Parallel for Distributed Deep Learning. In Proceedings of the ICML Hardware Aware Efficient Training (HAET) Workshop, July 2022.
- Youhe Jiang, Fangcheng Fu, Xupeng Miao, Xiaonan Nie, Bin Cui. OSDP: Optimal Sharded Data Parallel for Distributed Deep Learning. In Proceedings of the 32nd International Joint Conference on Artificial Intelligence (IJCAI), August 2023.
The following command creates the conda environment to be used:
$ conda env create -f environment.yml
Or prepare the environment by:
$ sh prepare_env.sh
Example using OSDP:
import torch
from data_parallel.optimal_sharded_data_parallel import OptimalShardedDataParallel as OSDP
...
# Wrap the model: OSDP plans per-operator sharding from the model description
# and device information, then behaves like a regular module.
sharded_module = OSDP(my_module, model_description, device_information)
optim = torch.optim.Adam(sharded_module.parameters(), lr=0.0001)
for sample, label in dataload.next_batch:
    optim.zero_grad()
    out = sharded_module(x=sample, y=3, z=torch.Tensor([1]))
    loss = criterion(out, label)
    loss.backward()
    optim.step()
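The model_description and device_information arguments feed OSDP's search engine (see the training scripts under gpt/ and opt/ in this repository for how they are built). Purely as a hypothetical illustration of the kind of information they carry, with placeholder field names rather than the actual schema:

# Hypothetical contents, for illustration only -- consult the training
# scripts in gpt/ and opt/ for the format OSDP actually expects.
model_description = {
    "num_layers": 48,         # e.g. the 48-layer GPT-2 configuration used below
    "hidden_size": 2048,
}
device_information = {
    "memory_limit_mb": 8192,  # per-GPU memory budget the search engine plans against
}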
Execute the train_gpt2_...py file through the scripts/script_gpt2_...sh script, and launch the OSDP experiment by setting fsdp_type to OSDP (set fsdp_type to FSDP to run the comparative experiment):
$ cd gpt
$ sh scripts/script_gpt2_osdp.sh
$ sh scripts/script_gpt2_fsdp.sh
We show the system throughput and memory utilization of GPT-2 model training (48 layers with hidden_size 2048) in our environment (GPU memory limit: 8 GB):
- OSDP:
  - device memory utilization: 8057.35 MB / 8192 MB
  - overall system throughput: 192.26 seq/sec
- FSDP:
  - device memory utilization: 5656.91 MB / 8192 MB
  - overall system throughput: 158.05 seq/sec
By making fuller use of the available device memory, OSDP achieves roughly 1.22x the throughput of FSDP in this setting.
We also provide an OSDP implementation for OPT models. The following commands deploy OPT-30B (8 layers) training on a single machine with 8 GPUs and a 16 GB memory limit per GPU.
$ cd opt
$ sh scripts/train_fsdp.sh
$ sh scripts/train_osdp.sh
- OSDP:
  - device memory utilization: 15985.89 MB / 16384 MB
  - overall system throughput: 1261.98 seq/sec
- FSDP:
  - device memory utilization: 14802.88 MB / 16384 MB
  - overall system throughput: 453.18 seq/sec
Here OSDP achieves roughly 2.78x the throughput of FSDP.
Operator splitting lets OSDP search for a finer-grained execution plan for the model and mitigates memory surges during training, which in turn allows OSDP to use a larger batch size and further improve system throughput.
Example using operator splitting:
class Layer(nn.Module):
    def __init__(self, config):
        super().__init__()
        # Build the linear operator in a split-friendly form.
        self.mlp = splitted_linear(config...)
        ...

    def forward(self, input):
        # Run the linear operator in num_splits smaller pieces.
        output = splitted_linear_forward(input, self.mlp, num_splits)
        ...
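The helpers splitted_linear and splitted_linear_forward come from this repository; the sketch below is only an assumption about how such an operator split can work: the weight of a standard nn.Linear is sliced along the output dimension so that each of the num_splits matrix multiplications is smaller, which reduces the temporary-memory surge of a single large matmul.

import torch
import torch.nn as nn
import torch.nn.functional as F

def split_linear_forward_sketch(x, linear: nn.Linear, num_splits: int):
    # Illustrative sketch only; the repo's splitted_linear_forward may differ.
    # Slice the weight (and bias) along the output-feature dimension and apply
    # the layer piece by piece, so each matmul handles 1/num_splits of the
    # output features.
    w_chunks = linear.weight.chunk(num_splits, dim=0)
    b_chunks = (linear.bias.chunk(num_splits, dim=0)
                if linear.bias is not None else [None] * num_splits)
    outputs = [F.linear(x, w, b) for w, b in zip(w_chunks, b_chunks)]
    return torch.cat(outputs, dim=-1)

For example, split_linear_forward_sketch(x, nn.Linear(2048, 8192), num_splits=4) replaces one 2048-to-8192 projection with four 2048-to-2048 projections whose outputs are concatenated.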
Group Sharding runs in two stages:
- Stage 1: Intra-group Sharded Data Parallel.
- Stage 2: Inter-group All-Reduce.
It trades higher memory consumption for higher system throughput by using inter-machine bandwidth more efficiently (a code sketch of the underlying process groups follows the next list).
Communication with groups runs the sharded-data-parallel collectives in two stages:
- Stage 1: Intra-group All-Gather and Reduce-Scatter during the Sharded Data Parallel process.
- Stage 2: Inter-group All-Gather and Reduce-Scatter during the Sharded Data Parallel process.
This increases system throughput by reducing the volume of parameters communicated between machines (inter-machine bandwidth is usually much lower than intra-machine bandwidth).
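Both modes rely on splitting the world into intra-machine and inter-machine process groups. The sketch below is written against plain torch.distributed rather than this repository's code, and only assumes a node-major rank layout with an already initialized process group; it shows one way such groups can be built and how a Group-Sharding-style two-stage gradient synchronization (intra-group reduce-scatter, then inter-group all-reduce of the local shard) can be expressed. In the repository, gen_fsdp_args and the FSDP wrapper handle this internally.

import torch
import torch.distributed as dist

def build_hierarchical_groups(nnodes, nproc_per_node):
    # One intra-machine group per node, one inter-machine group per local rank.
    # Every rank must call dist.new_group for every group, in the same order.
    rank = dist.get_rank()
    intra_group = inter_group = None
    for node in range(nnodes):
        ranks = [node * nproc_per_node + i for i in range(nproc_per_node)]
        g = dist.new_group(ranks=ranks)
        if rank in ranks:
            intra_group = g
    for local in range(nproc_per_node):
        ranks = [node * nproc_per_node + local for node in range(nnodes)]
        g = dist.new_group(ranks=ranks)
        if rank in ranks:
            inter_group = g
    return intra_group, inter_group

def group_sharding_grad_sync(full_grad, intra_group, inter_group, nproc_per_node):
    # Stage 1: reduce-scatter the gradient inside the machine, so each local
    # rank keeps only its own shard (intra-group sharded data parallel).
    # Assumes full_grad's length is divisible by nproc_per_node.
    chunks = [c.contiguous() for c in full_grad.chunk(nproc_per_node)]
    shard = torch.empty_like(chunks[0])
    dist.reduce_scatter(shard, chunks, group=intra_group)
    # Stage 2: all-reduce each shard across machines (inter-group all-reduce).
    dist.all_reduce(shard, group=inter_group)
    return shard / dist.get_world_size()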
Example of training bert-large with Group Sharding on 2 machines and 4 GPUs in total (we use the auto_wrap API provided by FairScale to complete the sharded data parallel deployment):
import functools

# enable_wrap, auto_wrap, default_auto_wrap_policy and FSDP come from FairScale;
# gen_fsdp_args is provided by this repository.
fsdp_args = gen_fsdp_args(nnodes=2, nproc_per_node=2, gsdp_type='group_sharding', model_type='bert-large')
with enable_wrap(wrapper_cls=FSDP, **fsdp_args):
    my_auto_wrap_policy = functools.partial(default_auto_wrap_policy, min_num_params=1e6)
    model = auto_wrap(model, auto_wrap_policy=my_auto_wrap_policy)  # auto_wrap
Example of training bert-large with Communication with groups on 2 machines and 4 GPUs in total:
fsdp_args = gen_fsdp_args(nnodes=2, nproc_per_node=2, gsdp_type='communication_with_groups', model_type='bert-large')
with enable_wrap(wrapper_cls=FSDP, **fsdp_args):
    my_auto_wrap_policy = functools.partial(default_auto_wrap_policy, min_num_params=1e6)
    model = auto_wrap(model, auto_wrap_policy=my_auto_wrap_policy)  # auto_wrap
We provide an example of OSDP training BERT with Group Sharding and Communication with groups.
Execute the train_bert_large...py file through the scripts/script_bert_large_...sh script, and launch the Group Sharding or Communication with groups experiment by setting fsdp_args to 'group_sharding' or 'communication_with_groups' (set fsdp_args to 'none' to run the comparative experiment):
$ cd bert
$ sh scripts/script_bert_large_group_sharding.sh
$ sh scripts/script_bert_large_communication_with_groups.sh
$ sh scripts/script_bert_large_fsdp.sh
If you use OSDP in a scientific publication, we would appreciate citations to the following paper:
@misc{jiang2023osdp,
  title={OSDP: Optimal Sharded Data Parallel for Distributed Deep Learning},
  author={Youhe Jiang and Fangcheng Fu and Xupeng Miao and Xiaonan Nie and Bin Cui},
  year={2023},
  eprint={2209.13258},
  archivePrefix={arXiv},
  primaryClass={cs.DC}
}