This directory houses our example of fine-tuning an 11B T5 model with FSDP to create a world-class grammar checker.

To install the dependencies:

pip install -r requirements.txt
Small and large datasets are already included in the project (grammar_train.csv is the small dataset, gtrain_150K.csv the large one).
To run the benchmark:

torchrun --nnodes=1 --nproc_per_node=8 --rdzv_id=101 --rdzv_endpoint="localhost:5679" main_benchmark.py
On A100 GPUs (p4d.24xlarge) you should expect to see:
To train with multiprocessing spawn (mp.spawn):
python main.py
Or better, with torchrun:
torchrun --nnodes=1 --nproc_per_node=8 --rdzv_id=101 --rdzv_endpoint="localhost:5679" main_elastic.py
You can control the model size, dataset size, batch size, etc. in config/defaults.py.
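The actual contents of config/defaults.py are not shown here, but configs of this kind are often a simple dataclass whose fields you edit before launching. A minimal sketch, where every field name and default value is an illustrative assumption rather than the project's real config:

```python
from dataclasses import dataclass

# Hypothetical sketch of a training config like config/defaults.py.
# Field names and defaults are assumptions for illustration only.
@dataclass
class TrainConfig:
    model_name: str = "t5-11b"           # which T5 checkpoint to fine-tune
    dataset: str = "grammar_train.csv"   # small dataset; swap in gtrain_150K.csv
    batch_size: int = 4                  # per-GPU batch size
    num_epochs: int = 2
    lr: float = 3e-4

# Override defaults for a larger run on the big dataset.
cfg = TrainConfig(dataset="gtrain_150K.csv", batch_size=8)
print(cfg.model_name, cfg.dataset, cfg.batch_size)
```

Keeping all knobs in one dataclass means the training script can take a single config object instead of a long argument list.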