TorchElastic allows you to launch distributed PyTorch jobs in a fault-tolerant and elastic manner. For the latest documentation, please refer to our website.
torchelastic requires:
- python3 (3.8+)
- torch
- etcd
pip install torchelastic
Fault-tolerant on 4 nodes, 8 trainers/node, total 4 * 8 = 32 trainers. Run the following on all nodes:
python -m torchelastic.distributed.launch \
    --nnodes=4 \
    --nproc_per_node=8 \
    --rdzv_id=JOB_ID \
    --rdzv_backend=etcd \
    --rdzv_endpoint=ETCD_HOST:ETCD_PORT \
    YOUR_TRAINING_SCRIPT.py (--arg1 ... train script args...)
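For reference, below is a minimal sketch of what YOUR_TRAINING_SCRIPT.py could look like. It assumes the launcher makes the usual rendezvous environment variables (MASTER_ADDR, MASTER_PORT, RANK, WORLD_SIZE, LOCAL_RANK) available to each worker, so the process group can be initialized with the env:// method; the model, data, and hyperparameters are placeholders.

import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    # RANK, WORLD_SIZE, MASTER_ADDR, MASTER_PORT are assumed to be set by
    # the launcher, so env:// initialization needs no extra arguments.
    backend = "nccl" if torch.cuda.is_available() else "gloo"
    dist.init_process_group(backend=backend, init_method="env://")

    # LOCAL_RANK identifies this trainer on its node (assumption: one GPU
    # per trainer; fall back to CPU if CUDA is unavailable).
    local_rank = int(os.environ.get("LOCAL_RANK", 0))
    use_cuda = torch.cuda.is_available()
    device = torch.device(f"cuda:{local_rank}" if use_cuda else "cpu")

    # Placeholder model and synthetic data; replace with your own.
    model = torch.nn.Linear(10, 10).to(device)
    model = DDP(model, device_ids=[local_rank] if use_cuda else None)
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

    for step in range(100):
        inputs = torch.randn(32, 10, device=device)
        targets = torch.randn(32, 10, device=device)
        loss = torch.nn.functional.mse_loss(model(inputs), targets)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()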
Elastic on 1 ~ 4 nodes, 8 trainers/node, total 8 ~ 32 trainers. The job starts as soon as 1 node is healthy; you may add up to 4 nodes:
python -m torchelastic.distributed.launch \
    --nnodes=1:4 \
    --nproc_per_node=8 \
    --rdzv_id=JOB_ID \
    --rdzv_backend=etcd \
    --rdzv_endpoint=ETCD_HOST:ETCD_PORT \
    YOUR_TRAINING_SCRIPT.py (--arg1 ... train script args...)
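In the elastic setting, workers may be restarted whenever node membership changes, so a common pattern is to periodically save a checkpoint and reload it when the script (re)starts. A minimal sketch of that pattern follows; the helper names (save_checkpoint, load_checkpoint) and CHECKPOINT_PATH are hypothetical, and the path should point at storage that survives a restart.

import os
import torch
import torch.distributed as dist

# Placeholder path; use shared or node-local storage as appropriate.
CHECKPOINT_PATH = "/tmp/checkpoint.pt"

def save_checkpoint(model, optimizer, step):
    # Only one rank needs to write the checkpoint.
    if dist.get_rank() == 0:
        torch.save(
            {
                "model": model.state_dict(),
                "optimizer": optimizer.state_dict(),
                "step": step,
            },
            CHECKPOINT_PATH,
        )

def load_checkpoint(model, optimizer):
    # On startup (fresh or after a restart), resume from the latest
    # checkpoint if one exists; otherwise begin at step 0.
    if os.path.exists(CHECKPOINT_PATH):
        state = torch.load(CHECKPOINT_PATH, map_location="cpu")
        model.load_state_dict(state["model"])
        optimizer.load_state_dict(state["optimizer"])
        return state["step"]
    return 0

In the training loop sketched earlier, one would call load_checkpoint before the loop and save_checkpoint every N steps, so a restarted worker resumes near where the job left off rather than from scratch.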
We welcome PRs. See the CONTRIBUTING file.
torchelastic is BSD licensed, as found in the LICENSE file.