PyTorch Data Parallel Benchmark

Files

models.py contains 5 main components

GPTConfig specifies the size of GPT3. See examples below. All DDP/Pipeline/FSDP GPT models should take a GPTConfig instance as an example.

class GPTSmallConfig(GPTConfig):
    """ GPT3-small like network roughly 125M params """
    n_layer = 12
    n_head = 12
    n_embd = 768

class GPTLargeConfig(GPTConfig):
    """ GPT3-large like network roughly 760M params """
    n_layer = 24
    n_head = 16
    n_embd = 1536

class GPTXLConfig(GPTConfig):
    """ GPT3-XL like network roughly 1.3B params """
    n_layer = 24
    n_head = 24
    n_embd = 2048

Single-device class GPT. This is used by DDP.
sequential_gpt returns a GPT model as a nn.Sequential which is already balanced across given devices. This is used by pipeline parallelism
TODO: GPT recursively wrapped by FSDP
TODO: import ResNet models from torchvision

trainer.py entry point for torchrun

Launching Experiments

Launch Locally

Launch DDP on two GPUs in the same machine:

torchrun --nnodes=1 --nproc_per_node=2 --rdzv_id=100 --rdzv_endpoint="localhost:5678" trainer.py

Launch Pipeline + DDP on two GPUs in the same machine, where pipeline spans two devices and DDP world size is 1:

torchrun --nnodes=1 --nproc_per_node=1 --rdzv_id=100 --rdzv_endpoint="localhost:5678" trainer.py --ndevice_per_proc=2 --mode=pdp

Launch on GCP

First, ssh to the SLURM login node.

export CLUSTER_ZONE="us-central1-b"
export CLUSTER_LOGIN_NODE=$(gcloud compute instances list \
    --zones ${CLUSTER_ZONE} \
    --filter="name ~ shen*.*login." \
    --format="value(name)" | head -n1)
gcloud compute ssh ${CLUSTER_LOGIN_NODE} \
    --zone $CLUSTER_ZONE

Run sinfo to check there are at least 4 nodes idle.

Launc DDP on 4 nodes (i.e., 32 GPUs).

srun -p train -t 5:00:00 --gpus-per-node=8 --cpus-per-task=96 --nodes=4 --pty /home/shenli_fb_com/project/ptd_benchmark/launch_4_node_ddp.sh

Launch Pipeline + DDP on 4 nodes, where each pipeline spans two GPUs. So, there are 16 pipeline in total.

srun -p train -t 5:00:00 --gpus-per-node=8 --cpus-per-task=96 --nodes=4 --pty /home/shenli_fb_com/project/ptd_benchmark/launch_4_node_pipeline.sh

Name		Name	Last commit message	Last commit date
Latest commit History 57 Commits
delay		delay
.gitignore		.gitignore
README.md		README.md
fsdp.sh		fsdp.sh
fsdp_sbatch.sh		fsdp_sbatch.sh
launch_4_node_pipeline.sh		launch_4_node_pipeline.sh
launch_cluster_ddp.sh		launch_cluster_ddp.sh
launch_cluster_fsdp.sh		launch_cluster_fsdp.sh
launch_cluster_pipeline.sh		launch_cluster_pipeline.sh
launch_local_ddp.sh		launch_local_ddp.sh
launch_local_fsdp.sh		launch_local_fsdp.sh
launch_local_fsdp_small_checkpoint.sh		launch_local_fsdp_small_checkpoint.sh
launch_local_pipeline.sh		launch_local_pipeline.sh
models.py		models.py
multi_launch_cluster_fsdp.sh		multi_launch_cluster_fsdp.sh
tests.py		tests.py
trainer.py		trainer.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

PyTorch Data Parallel Benchmark

Files

Launching Experiments

Launch Locally

Launch on GCP

About

Releases

Packages

Contributors 3

Languages

mrshenli/ptd_benchmark

Folders and files

Latest commit

History

Repository files navigation

PyTorch Data Parallel Benchmark

Files

Launching Experiments

Launch Locally

Launch on GCP

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Contributors 3

Languages

Packages