
Oobleck
Resilient Distributed Training Framework

Oobleck is a large-model training framework with fast fault recovery, built around the concept of pipeline templates.

Getting Started

Install

Oobleck relies on conda for both installation and execution. Please install conda from the Anaconda website, then create the Oobleck environment and install the package:

conda env create -f environment.yml
conda activate oobleck
(oobleck) pip install .
...
Successfully installed oobleck-0.1.0

Run

  1. Run a master daemon. For multi-node training, you must pass a publicly reachable IP to --ip. If --port is not specified, a randomly chosen available port is used.

    python -m oobleck.elastic.master (--ip <ip>) (--port <port>)
  2. Submit a training job to the master. --node_ips is a whitespace-separated list.

    python -m oobleck.run --config_path <config_yaml> --node_ips [node_ips] (--node_port <node_port>) --master_ip <master_ip> --master_port <master_port>

    --node_port specifies the ssh port of the worker nodes. The master daemon launches agent processes on the nodes through ssh; all nodes must use the same ssh port, which is specified here, and the node running the master daemon must be able to ssh into the worker nodes without a password.

    • Target model and dataset

      Oobleck follows the HuggingFace Transformers format. Therefore, job_args.model_name must be one of the models on the HuggingFace Model Hub, and dataset_path and dataset_name in job_args must refer to a dataset on the HuggingFace Datasets Hub.

      Currently, GPT models have been tested.

    • An example of node_ips on the command line:

      --config_path examples/gpt2.yml --node_ips 192.168.0.1 192.168.0.2 192.168.0.3 192.168.0.4 --master_ip ...
    • Format of the YAML config file

      The file specified in --config_path is parsed into the DistributedJobConfiguration dataclass defined in elastic/training_util.py. Each key in the YAML file corresponds to an attribute of that dataclass; see the illustrative sketch after the example below.

      Example

      master_ip: 192.168.0.1
      master_port: 12345
      node_ips:
      - 192.168.0.2
      - 192.168.0.3
      - 192.168.0.4
      - 192.168.0.5
      job_args:
          model_name: gpt2
          dataset_path: wikitext
          dataset_name: wikitext-2-raw-v1
          ...

      All fields are required; however, each field can be given either in the YAML file or on the command line.
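
      For illustration only, here is a minimal sketch of how such a YAML file could be mapped onto a dataclass. The field and class names below are taken from the example above and are simplified assumptions; the authoritative definition is DistributedJobConfiguration in elastic/training_util.py.

      # Illustrative sketch only: the real dataclass is DistributedJobConfiguration
      # in elastic/training_util.py. Field names here are assumed from the example above.
      from dataclasses import dataclass

      import yaml  # requires PyYAML


      @dataclass
      class JobArgs:
          model_name: str    # e.g. "gpt2" (HuggingFace Model Hub)
          dataset_path: str  # e.g. "wikitext" (HuggingFace Datasets Hub)
          dataset_name: str  # e.g. "wikitext-2-raw-v1"


      @dataclass
      class JobConfig:
          master_ip: str
          master_port: int
          node_ips: list[str]
          job_args: JobArgs


      def load_job_config(path: str) -> JobConfig:
          """Parse the YAML file and build the config object."""
          with open(path) as f:
              raw = yaml.safe_load(f)
          return JobConfig(
              master_ip=raw["master_ip"],
              master_port=raw["master_port"],
              node_ips=raw["node_ips"],
              job_args=JobArgs(**raw["job_args"]),
          )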

Architecture

Oobleck consists of three types of processes: master daemon, agent, and worker.

[Figure: Oobleck architecture (initial)]

The master daemon automatically launches agents on all given nodes via ssh. When agents are launched, they establish a TCP channel with the master daemon, receive the job configuration, and launch worker processes via multiprocessing.Process on each node. The TCP channels are used only for fault tolerance, not for training.

Worker processes use torch.distributed for distributed training.
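
The following is a minimal sketch of this agent/worker split, not Oobleck's actual code; the function names, arguments, and backend choice (run_worker, launch_workers, "nccl", etc.) are assumptions made for illustration.

    # Illustrative sketch of the agent/worker structure; not Oobleck's actual code.
    # Function names, arguments, and the backend choice are assumptions.
    import multiprocessing as mp

    import torch.distributed as dist


    def run_worker(rank: int, world_size: int, master_ip: str, master_port: int) -> None:
        """Worker process: join the torch.distributed world used for training."""
        dist.init_process_group(
            backend="nccl",  # typical for GPU training; "gloo" also works for CPU tests
            init_method=f"tcp://{master_ip}:{master_port}",
            rank=rank,
            world_size=world_size,
        )
        # ... training loop runs here ...
        dist.destroy_process_group()


    def launch_workers(num_local_gpus: int, base_rank: int, world_size: int,
                       master_ip: str, master_port: int) -> list[mp.Process]:
        """Agent side: spawn one worker process per local GPU."""
        workers = []
        for local_rank in range(num_local_gpus):
            p = mp.Process(
                target=run_worker,
                args=(base_rank + local_rank, world_size, master_ip, master_port),
            )
            p.start()
            workers.append(p)
        return workers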


When one or more nodes fail, the master daemon detects the failure via a TCP channel disconnection event. It then broadcasts a node-failure event to all the other nodes, and the agents rebroadcast the event to their workers via Linux pipes. After that, the workers begin reconfiguration, i.e., re-instantiating pipelines, copying missing layers, reconstructing torch.distributed groups, and so on.
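
As a rough sketch of the agent-to-worker relay (hypothetical names and message format, not Oobleck's actual implementation), the pipe-based rebroadcast could look like this:

    # Illustrative sketch of failure-event propagation from an agent to its workers;
    # not Oobleck's actual code. Names and the message format are assumptions.
    import multiprocessing as mp
    from multiprocessing.connection import Connection


    def worker_loop(pipe: Connection) -> None:
        """Worker side: block on the pipe until the agent reports a failure."""
        event = pipe.recv()  # e.g. {"type": "node_failure", "lost_ips": [...]}
        if event["type"] == "node_failure":
            # Reconfiguration would start here: re-instantiate pipelines,
            # copy missing layers, rebuild torch.distributed groups, ...
            print(f"worker: reconfiguring after losing {event['lost_ips']}")


    def agent_relay(lost_ips: list[str], worker_pipes: list[Connection]) -> None:
        """Agent side: rebroadcast a failure event (received from the master
        over TCP) to every local worker through its pipe."""
        for pipe in worker_pipes:
            pipe.send({"type": "node_failure", "lost_ips": lost_ips})


    if __name__ == "__main__":
        agent_end, worker_end = mp.Pipe()
        w = mp.Process(target=worker_loop, args=(worker_end,))
        w.start()
        agent_relay(["192.168.0.3"], [agent_end])
        w.join()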

Citation

@inproceedings{oobleck-sosp23,
    title     = {Oobleck: Resilient Distributed Training of Large Models Using Pipeline Templates},
    author    = {Jang, Insu and Yang, Zhenning and Zhang, Zhen and Jin, Xin and Chowdhury, Mosharaf},
    booktitle = {ACM SIGOPS 29th Symposium on Operating Systems Principles (SOSP '23)},
    year      = {2023},
}

Implementation Limitations

Oobleck is an ongoing research prototype and currently lacks some features, including:

  • Different precisions: AMP (Automatic Mixed Precision)/BF16/FP16 training is not supported yet.
  • Sharing model states: Sharing model states across nodes is not supported yet (e.g., the GPT embedding layer is shared between the first and last pipeline stages). Because of this, training may not be correct.
  • Richer documentation
  • Modularization: We are working on improving Oobleck's compatibility with HuggingFace Transformers and Accelerate.
  • Checkpointing: saving and loading checkpoints is not supported yet.
