TorchAcc is an AI training acceleration framework developed by Alibaba Cloud’s PAI team.
TorchAcc is built on PyTorch/XLA and provides an easy-to-use interface for accelerating the training of PyTorch models. It also implements extensive GPU-specific optimizations for distributed training, memory management, and computation. Together these aim at better ease of use, higher GPU training performance, and improved scalability for distributed training.
- Rich distributed parallelism strategies
  - Data Parallelism
  - Fully Sharded Data Parallelism
  - Tensor Parallelism
  - Pipeline Parallelism
  - Context Parallelism
    - Ulysses
    - Ring Attention
    - FlashSequence (2D Sequence Parallelism)
- Memory efficient
- High performance
- Easy-to-use API: you can accelerate your Transformer models with just a few lines of code using TorchAcc.
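To make the context parallelism entries above more concrete, here is a minimal, pure-Python sketch of the data layout change behind Ulysses-style sequence parallelism. It is a conceptual illustration only, not TorchAcc's implementation. The function names are hypothetical, and the all-to-all exchange is simulated with lists instead of real GPU communication.

```python
def shard_by_sequence(seq_len, num_heads, world_size):
    """Initial layout: rank r owns a contiguous slice of tokens, for every head."""
    assert seq_len % world_size == 0, "sequence must split evenly across ranks"
    chunk = seq_len // world_size
    return [
        [(t, h) for t in range(r * chunk, (r + 1) * chunk) for h in range(num_heads)]
        for r in range(world_size)
    ]


def all_to_all_seq_to_head(shards, num_heads):
    """Simulated all-to-all: every rank sends the entries of head group d to rank d.

    Afterwards each rank holds the full sequence, but only for its own heads,
    so attention over the whole sequence can be computed locally per head.
    """
    world_size = len(shards)
    assert num_heads % world_size == 0, "heads must split evenly across ranks"
    heads_per_rank = num_heads // world_size
    out = [[] for _ in range(world_size)]
    for src in range(world_size):
        for token, head in shards[src]:
            out[head // heads_per_rank].append((token, head))
    return out


# Toy sizes chosen for illustration: 8 tokens, 4 heads, 2 ranks.
before = shard_by_sequence(seq_len=8, num_heads=4, world_size=2)
after = all_to_all_seq_to_head(before, num_heads=4)

for rank, (b, a) in enumerate(zip(before, after)):
    print(
        f"rank {rank}: before tokens={sorted({t for t, _ in b})} "
        f"heads={sorted({h for _, h in b})} | "
        f"after tokens={sorted({t for t, _ in a})} "
        f"heads={sorted({h for _, h in a})}"
    )
```

Each rank stores the same number of (token, head) entries before and after the exchange. Only the axis along which the data is split changes, from the sequence dimension to the head dimension.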
The main goal of TorchAcc is to provide a high-performance AI training framework. It uses IR abstractions at different layers and combines static-graph compilation optimizations such as XLA, dynamic-graph compilation optimizations such as BladeDISC, and distributed optimization techniques to offer an end-to-end optimization solution, from the underlying operators up to the upper-level models.
sudo docker run --gpus all --net host --ipc host --shm-size 10G -it --rm --cap-add=SYS_PTRACE dsw-registry.cn-hangzhou.cr.aliyuncs.com/pai/acc:r2.3.0-cuda12.1.0-py3.10 bash
See the contribution guide.
The following simple example trains a Transformer model with TorchAcc and illustrates how the TorchAcc API is used. You can start training with this command:
torchrun --nproc_per_node=4 benchmarks/transformer.py --bf16 --acc --disable_loss_print --fsdp_size=4 --gc
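If you launch this benchmark from a script, for example to sweep over GPU counts, the command can be assembled programmatically. The sketch below is a hypothetical helper, not part of TorchAcc. The script path and every flag are copied verbatim from the command above, and the keyword arguments are illustrative names only.

```python
import shlex


def build_torchrun_command(nproc_per_node=4, fsdp_size=4, bf16=True, gc=True):
    """Assemble the benchmark launch command as an argument list.

    All flags are taken from the README command; this helper only makes
    the process count, the FSDP size, and two switches configurable.
    """
    cmd = [
        "torchrun",
        f"--nproc_per_node={nproc_per_node}",
        "benchmarks/transformer.py",
    ]
    if bf16:
        cmd.append("--bf16")
    cmd += ["--acc", "--disable_loss_print", f"--fsdp_size={fsdp_size}"]
    if gc:
        cmd.append("--gc")
    return cmd


command = build_torchrun_command()
print(shlex.join(command))

# To actually launch training from the repository root (requires GPUs and
# an environment with TorchAcc installed), pass the list to subprocess:
#   import subprocess
#   subprocess.run(command, check=True)
```

With the default arguments, the helper reproduces the command shown above. Passing the list form to `subprocess.run` avoids shell quoting issues.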
If you are familiar with the Trainer from HuggingFace Transformers, you can easily accelerate a Transformer model using TorchAcc; see the HuggingFace Transformers tutorial.
If you want to try the latest TorchAcc features or use the TorchAcc interface more flexibly for model acceleration, you can use our LLM acceleration library, FlashModels. FlashModels integrates distributed implementations of commonly used open-source LLMs and provides many examples and benchmarks.
https://github.com/AlibabaPAI/FlashModels
Coming soon.
See the contribution guide.
You can contact us by adding our DingTalk group: