- GPT
This document describes what FasterTransformer provides for the GPT model, explaining the workflow and optimizations. We also provide a guide to help users run the GPT model on FasterTransformer. Finally, we provide benchmarks to demonstrate the speed of FasterTransformer on GPT.
GPT is a variant of the decoding model that has no encoder module and no cross multi-head attention, and that uses GeLU as the activation. In 2020, OpenAI showed in their paper that using a very large model and a large amount of training data can significantly improve the capacity of the GPT model. However, such a model cannot fit into a single GPU. For example, the largest model, GPT-3, has 175 billion parameters, which take about 350 GB in half precision. Therefore, multiple GPUs, and even multiple nodes, are necessary. To address the latency and memory bottlenecks caused by the model size, FasterTransformer provides highly efficient kernels, optimized memory usage, and model parallelism on multiple frameworks.
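The 350 GB figure follows directly from the parameter count and the size of the data type. The sketch below reproduces that arithmetic; the helper names and the 80 GB per-GPU capacity are illustrative assumptions, and the estimate covers the weights only (no activations or k/v cache).

```python
import math

# Bytes used to store one parameter in each data type.
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "bf16": 2, "int8": 1}

def weight_memory_gb(num_params, data_type):
    """Memory needed for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[data_type] / 1e9

def min_gpus_for_weights(num_params, data_type, gpu_memory_gb):
    """Lower bound on the GPU count when only the weights are considered."""
    return math.ceil(weight_memory_gb(num_params, data_type) / gpu_memory_gb)

gpt3_params = 175e9
print(weight_memory_gb(gpt3_params, "fp16"))          # 350.0
# Hypothetical 80 GB GPUs: at least 5 are needed just to hold the weights.
print(min_gpus_for_weights(gpt3_params, "fp16", 80))  # 5
```

In practice more GPUs are required than this lower bound, because buffers and the k/v cache also consume memory.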
- Checkpoint converter
- Huggingface
- Megatron
- Nemo Megatron
- TensorFlow
- Data type
- FP32
- FP16
- BF16
- INT8 weight only PTQ.
- Limitations:
- Hidden sizes must be a multiple of 64 after weights are split for TP.
- The kernel typically only gives performance benefits for small batch (typically less than 32 or 64) and when weight matrices are large.
- Weight only PTQ only works for FP16/BF16 compute.
- Only supported on Volta and newer architectures.
- Note:
- Weights are preprocessed offline based on the current GPU to optimize the weight alignment for consumption by tensorcores. Currently, we directly consume FP32/BF16/FP16 weights and quantize them just before inference. If we want to store quantized weights, they MUST be preprocessed for the GPU intended to be used with inference.
- When using the torch APIs, int8 mode is only available via the Parallel GPT Op. The Parallel GPT Op can also be used on single GPU.
- Limitations:
- INT8 with SmoothQuant
- FP8 (Experimental)
- Feature
- Multi-GPU multi-node inference
- Dynamic random seed
- Stop tokens
- Beam search and sampling are both supported
- Loading FP32 or FP16 weights
- Frameworks
- TensorFlow
- PyTorch
- C++
- Triton backend
Fig 1 demonstrates the workflow of FasterTransformer GPT. Unlike BERT and the encoder-decoder structure, GPT receives some input ids as context and generates the respective output ids as the response. In this workflow, the major bottleneck is the GptDecoderLayer (transformer block), because the time increases linearly with the number of layers. In GPT-3, the GptDecoderLayer takes about 95% of the total time.
FasterTransformer splits the whole workflow into 2 parts. The first part is “computing the k/v cache of the context (input ids)”, and the second part is “auto-regressively generating the output ids”. The operations of these two parts are similar, but the shapes of the tensors in `SelfAttention` are different. So, we use 2 different implementations to handle the two cases, as demonstrated in Fig 2. In `DecoderSelfAttention`, the sequence length of the query is always 1, so we use a custom fused masked multi-head attention kernel. On the other hand, the sequence length of the query in `ContextSelfAttention` is the maximum input length, so we use cuBLAS to leverage the tensor cores.
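The two phases can be illustrated with a toy single-head attention in plain Python. This is only a sketch of the idea, not FasterTransformer's kernels: the function names are hypothetical, and the toy model uses the token vector itself as query, key and value.

```python
import math

def attend(query, keys, values):
    """Single-head scaled dot-product attention for one query vector."""
    scale = 1.0 / math.sqrt(len(query))
    scores = [scale * sum(q * k for q, k in zip(query, key)) for key in keys]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) / total for i in range(dim)]

def context_phase(context_vectors):
    """Phase 1: process all context tokens and fill the k/v cache.

    The query length equals the input length; the causal mask is expressed
    by letting token t attend only to cache entries 0..t.
    """
    k_cache = [list(x) for x in context_vectors]  # toy model: k = v = x
    v_cache = [list(x) for x in context_vectors]
    outputs = [attend(x, k_cache[: t + 1], v_cache[: t + 1])
               for t, x in enumerate(context_vectors)]
    return outputs, k_cache, v_cache

def decode_step(x, k_cache, v_cache):
    """Phase 2: one generated token, query length 1, cache grows by one entry."""
    k_cache.append(list(x))
    v_cache.append(list(x))
    return attend(x, k_cache, v_cache)

context = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
_, k_cache, v_cache = context_phase(context)
new_token = [0.5, -0.5]
cached = decode_step(new_token, k_cache, v_cache)

# Recomputing the whole sequence from scratch gives the same last output,
# which is why the cached keys and values never need to be recomputed.
full, _, _ = context_phase(context + [new_token])
assert all(abs(a - b) < 1e-12 for a, b in zip(cached, full[-1]))
print(len(k_cache))  # 4
```

The context phase works on many query positions at once, which maps well to batched matrix multiplication, while each decode step has a single query and mostly reads the cache.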
The following examples demonstrate how to run the multi-GPU and multi-node GPT model.
- `examples/cpp/multi_gpu_gpt_example.cc`: It uses MPI to organize all GPUs.
- `examples/cpp/multi_gpu_gpt_triton_example.cc`: It uses threading within a node and MPI across nodes. This example also demonstrates how to use the Triton backend API of FasterTransformer to run the GPT model.
- `examples/pytorch/gpt/multi_gpu_gpt_example.py`: This example is similar to `examples/cpp/multi_gpu_gpt_example.cc`, but encapsulates the FasterTransformer instance in a PyTorch OP.
In summary, the workflow to run the GPT model is:
1. Initialize the NCCL communicator and set the ranks of tensor parallel and pipeline parallel by MPI or threading.
2. Load the weights according to the ranks of tensor parallel, pipeline parallel and other model hyper-parameters.
3. Create the instance of `ParallelGpt` according to the ranks of tensor parallel, pipeline parallel and other model hyper-parameters.
4. Receive the request from the client and convert the request to the format of input tensors for ParallelGpt.
5. Run forward.
6. Convert the output tensors of ParallelGpt to the response of the client and return the response.
In the C++ example codes, we skip step 4 and step 6 and load the request from `examples/cpp/multi_gpu_gpt/start_ids.csv`. In the PyTorch example codes, the request comes from the PyTorch side. In the Triton example codes, we have a complete example from step 1 to step 6.
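The rank setup in the first step can be sketched as a mapping from a global process rank to a pair of parallel ranks. The layout below (consecutive ranks form one tensor-parallel group, and each pipeline stage owns a contiguous block of layers) is an assumption made for illustration, and the function names are hypothetical.

```python
import math

def parallel_ranks(world_rank, world_size, tensor_para_size, pipeline_para_size):
    """Map a global rank to (tensor_para_rank, pipeline_para_rank).

    Assumed layout: consecutive ranks share a pipeline stage and form one
    tensor-parallel group.
    """
    if world_size != tensor_para_size * pipeline_para_size:
        raise ValueError(
            "number of processes must equal tensor_para_size * pipeline_para_size")
    return world_rank % tensor_para_size, world_rank // tensor_para_size

def layers_of_stage(num_layer, pipeline_para_size, pipeline_para_rank):
    """Contiguous block of transformer layers owned by one pipeline stage."""
    per_stage = math.ceil(num_layer / pipeline_para_size)
    first = per_stage * pipeline_para_rank
    return range(first, min(first + per_stage, num_layer))

# 8 processes, 4-way tensor parallel, 2 pipeline stages, 96 layers.
for rank in range(8):
    tp_rank, pp_rank = parallel_ranks(rank, 8, tensor_para_size=4, pipeline_para_size=2)
    layers = layers_of_stage(96, 2, pp_rank)
    print(rank, tp_rank, pp_rank, layers[0], layers[-1])
```

Each process then loads only the weight shard that matches its pair of ranks.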
The source code is in `src/fastertransformer/models/multi_gpu_gpt/ParallelGpt.cc`. The arguments, input tensors and output tensors of GPT are:
- Constructor of GPT
Classification | Name | Data Type | Description |
---|---|---|---|
[0] | max_batch_size | size_t | Deprecated, move to input |
[1] | max_seq_len | size_t | Deprecated, move to input |
[2] | max_input_len | size_t | Deprecated, move to input |
[3] | beam_width | size_t | Deprecated, move to input |
[4] | head_num | size_t | Head number for model configuration |
[5] | size_per_head | size_t | Size per head for model configuration |
[6] | inter_size | size_t | The inter size of feed forward network. It is often set to 4 * head_num * size_per_head. |
[7] | num_layer | size_t | Number of transformer layers for model configuration |
[8] | vocab_size | int | Vocabulary size for model configuration |
[9] | start_id | int | Start id for vocabulary |
[18] | temperature | float | Deprecated, move to input |
[19] | len_penalty | float | Deprecated, move to input |
[20] | repetition_penalty | float | Deprecated, move to input |
[21] | tensor_para | NcclParam | Tensor Parallel information, which is declared in src/fastertransformer/utils/nccl_utils.h |
[22] | pipeline_para | NcclParam | Pipeline Parallel information, which is declared in src/fastertransformer/utils/nccl_utils.h |
[23] | stream | cudaStream_t | CUDA stream |
[24] | cublas_wrapper | cublasMMWrapper* | Pointer of cuBLAS wrapper, which is declared in src/fastertransformer/utils/cublasMMWrapper.h |
[26] | is_free_buffer_after_forward | bool | If set to true, FasterTransformer allocates the buffer before forward and frees it after forward. When the allocator is based on a memory pool, setting this to true may help reduce memory usage during inference. |
[27] | cuda_device_prop | cudaDeviceProp* | Pointer to the CUDA device properties, which are used to get hardware properties such as the size of shared memory |
[28] | sparse | bool | Whether to use sparsity. Experimental feature |
[29] | int8_mode | int | 0 means no quantization. 1 means weight-only PTQ (experimental feature). 2 means weight and activation quantization (experimental feature). |
[30] | custom_all_reduce_comm | AbstractCustomComm | Custom all-reduce communication for model parallelism. It is only supported in 8-way tensor parallelism |
[31] | enable_custom_all_reduce | int | Flag to enable the custom all-reduce or not |
[32] | remove_padding | bool | Remove the padding of input ids or not in context phase. |
[33] | shared_contexts_ratio | float | Ratio that controls the use of the shared contexts optimization. If the compact size (which accounts only for unique prompts) is less than ratio * batch size, the optimized implementation is used. Setting shared_contexts_ratio=0 deactivates the optimization. |
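The `shared_contexts_ratio` rule in the last row can be read as a simple comparison. The sketch below is one interpretation of that description, not FasterTransformer's code; `use_shared_contexts` is a hypothetical helper.

```python
def use_shared_contexts(prompts, shared_contexts_ratio):
    """Return True when the shared contexts optimization would be used.

    compact_size counts only the unique prompts in the batch.
    """
    batch_size = len(prompts)
    compact_size = len({tuple(p) for p in prompts})
    # A ratio of 0 can never satisfy the comparison, so it disables the feature.
    return shared_contexts_ratio > 0 and compact_size < shared_contexts_ratio * batch_size

# Batch of 5 requests with only 2 distinct prompts.
batch = [[1, 2, 3]] * 4 + [[4, 5]]
print(use_shared_contexts(batch, 1.0))  # True:  2 < 1.0 * 5
print(use_shared_contexts(batch, 0.3))  # False: 2 >= 0.3 * 5
print(use_shared_contexts(batch, 0.0))  # False: optimization deactivated
```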
- Input of GPT
Name | Tensor/Parameter Shape | Location | Data Type | Description |
---|---|---|---|---|
input_ids | [batch_size, max_input_length] | GPU | int | The input ids (context) |
input_lengths | [batch_size] | GPU | int | The lengths of input ids |
prompt_learning_task_name_ids | [batch_size] | CPU | int | Optional. Task name ids for prompt learning. |
output_seq_len | [batch_size] | CPU | uint32_t | The maximum number of tokens in the result. Note that it includes the input length |
stop_words_list | [batch_size, 2, stop_words_length] | GPU | int | Optional. When FT generates words in this list, it will stop the generation. An extension of stop id |
bad_words_list | [batch_size, 2, bad_words_length] | GPU | int | Optional. The words in the list will never be sampled. |
repetition_penalty | [1] or [batch_size] | CPU | float | Optional. Repetition penalty applied to logits for both beam search and sampling. Exclusive with presence_penalty. |
presence_penalty | [1] or [batch_size] | CPU | float | Optional. Presence penalty - additive type of repetition penalty - applied to logits for both beam search and sampling. Exclusive with repetition_penalty. |
min_length | [1] or [batch_size] | CPU | int | Optional. Minimum number of tokens to generate |
random_seed | [1] or [batch_size] | CPU | unsigned long long int | Optional. Random seed to initialize the random table in sampling. |
request_prompt_lengths | [batch_size] | GPU | int | Optional. Length of prefix soft prompt embedding. This describes how many tokens of soft prompt embedding are in each sentence. |
request_prompt_embedding | [batch_size, max_prompt_length, hidden_units] | GPU | float/half/bfloat16 | Optional. FT will concatenate them with the results of the embedding lookup kernel. For prefix soft prompt embedding, the type must be float; for p/prompt tuning, the type is the same as the weights. |
request_prompt_type | [batch_size] | CPU | int | Optional. Prompt type of the request. This is necessary when the user passes the prompt embedding by input |
is_return_context_cum_log_probs | [1] | CPU | bool | Optional. Return the cumulative log probability of context or not |
is_return_context_embeddings | [1] | CPU | bool | Optional. Return the sum of context tokens encodings or not |
session_len | [1] | CPU | uint32 | Optional. The maximum time length allowed during the whole interactive generation. Only used for interactive generation feature |
continue_gen | [1] | CPU | bool | Optional. A flag to tell FasterTransformer to not discard previous tokens and continue producing token based on previous generations. Only used for interactive generation feature |
memory_len | [1] | CPU | uint32 | Optional. The maximum time memory used in attention modules. Reduces the memory footprint, but the quality of generation might degrade. |
top_p_decay | [batch_size] | GPU | float | Optional. decay values for top_p sampling |
top_p_min | [batch_size] | GPU | float | Optional. min top_p values for top p sampling |
top_p_reset_ids | [batch_size] | GPU | uint32 | Optional. reset ids for resetting top_p values for top p sampling |
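As a rough illustration of the table above, the sketch below assembles a few of the inputs as plain Python lists. It is not the FasterTransformer API: `build_request` and `pad_id` are hypothetical, and setting `output_seq_len` to the padded input length plus the number of new tokens is an assumption based on the note that it includes the input length.

```python
def build_request(prompts, new_tokens, pad_id=0,
                  repetition_penalty=None, presence_penalty=None):
    """Assemble input_ids, input_lengths and output_seq_len for a batch."""
    if repetition_penalty is not None and presence_penalty is not None:
        raise ValueError("repetition_penalty and presence_penalty are exclusive")
    input_lengths = [len(p) for p in prompts]
    max_input_length = max(input_lengths)
    # Pad every prompt to [batch_size, max_input_length].
    input_ids = [p + [pad_id] * (max_input_length - len(p)) for p in prompts]
    request = {
        "input_ids": input_ids,
        "input_lengths": input_lengths,
        # output_seq_len counts the input tokens as well as the generated ones.
        "output_seq_len": [max_input_length + new_tokens] * len(prompts),
    }
    if repetition_penalty is not None:
        request["repetition_penalty"] = [repetition_penalty]
    if presence_penalty is not None:
        request["presence_penalty"] = [presence_penalty]
    return request

request = build_request([[5, 6, 7], [8]], new_tokens=4, repetition_penalty=1.2)
print(request["input_ids"])       # [[5, 6, 7], [8, 0, 0]]
print(request["output_seq_len"])  # [7, 7]
```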
- Output of GPT
Name | Tensor/Parameter Shape | Location | Data Type | Description |
---|---|---|---|---|
output_ids | [batch_size, beam_width, max_output_seq_len] | GPU | int | The output ids. It contains the input_ids and generated ids |
sequence_length | [batch_size, beam_width] | GPU | int | The lengths of output ids |
output_log_probs | [batch_size, beam_width, request_output_seq_len] | GPU | float | Optional. It records the log probability of logits at each step for sampling. |
cum_log_probs | [batch_size, beam_width] | GPU | float | Optional. Cumulative log probability of generated sentences |
context_embeddings | [batch_size, beam_width, hidden_units] | GPU | float | Optional. Sum of context tokens encodings. |
The `beam_width` value is set by the output shape directly. When the `beam_width` of `output_ids` is larger than 1, FT uses beam search to generate tokens; otherwise, FT uses top-k or top-p sampling. When the inputs of beam search and sampling are invalid, such as beam width 1, top k 0 and top p 0.0, FT runs greedy search automatically.
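The selection rule described above can be written as a small decision function. This is a sketch of the rule as stated in the text, with a hypothetical function name and string labels; it does not reproduce FasterTransformer's internal dispatch.

```python
def choose_decoding_strategy(beam_width, top_k=0, top_p=0.0):
    """Pick the decoding strategy from the beam width and sampling inputs."""
    if beam_width > 1:
        return "beam_search"
    if top_k > 0 or top_p > 0.0:
        return "sampling"
    # beam width 1, top k 0 and top p 0.0: fall back to greedy search.
    return "greedy_search"

print(choose_decoding_strategy(beam_width=4))                      # beam_search
print(choose_decoding_strategy(beam_width=1, top_k=4, top_p=0.6))  # sampling
print(choose_decoding_strategy(beam_width=1))                      # greedy_search
```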
- Kernel optimization: many kernels are based on the kernels of the decoder and decoding modules, which are already highly optimized. To avoid recomputing the previous keys and values, we allocate a buffer to store them at each step. Although this takes some additional memory, it saves the cost of recomputation, of allocating a buffer at each step, and of concatenation.
- Memory optimization: Unlike traditional models such as BERT, GPT-3 has 175 billion parameters, taking 350 GB even if we store the model in half precision. Therefore, we must reduce the memory usage of the other parts. In FasterTransformer, we reuse the memory buffer across decoder layers. Since GPT-3 has 96 layers, these buffers need only 1/96 of the memory they would otherwise take.
- Model parallelism: In the GPT model, FasterTransformer provides both tensor parallelism and pipeline parallelism. For tensor parallelism, FasterTransformer follows the idea of Megatron. For both the self-attention block and the feed forward network block, we split the weights of the first matrix multiplication by row and the weights of the second matrix multiplication by column. With this optimization, we reduce the reduction operations to 2 per transformer block. The workflow is demonstrated in Fig 3. For pipeline parallelism, FasterTransformer splits the whole batch of requests into multiple micro batches to hide the communication bubble. FasterTransformer adjusts the micro batch size automatically for different cases. Users can adjust the model parallelism by modifying the `gpt_config.ini` file. We recommend using tensor parallelism within a node and pipeline parallelism across nodes, because tensor parallelism requires more NCCL communication.
- Multiple frameworks: Besides the C++ source code, FasterTransformer also provides a TensorFlow op, a PyTorch op and a Triton backend. Currently, the TensorFlow op only supports a single GPU, while the PyTorch op and the Triton backend support multi-GPU and multi-node. To avoid the additional work of splitting the model for model parallelism, FasterTransformer also provides a tool to split and convert a Megatron model to binary files, which FasterTransformer can then load directly.
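The Megatron-style split of the feed forward block can be checked numerically with a toy example in plain Python. This is a sketch under stated assumptions: weights are stored as [input, output] matrices, so the first matrix is sharded along its output dimension and the second along its input dimension (whether that is called "by row" or "by column" depends on the weight layout), ReLU stands in for GeLU, and all helper names are hypothetical.

```python
def matmul(a, b):
    """Plain matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def activation(m):
    """Elementwise ReLU, used here as a simple stand-in for GeLU."""
    return [[max(0.0, v) for v in row] for row in m]

def split_output_dim(w, parts):
    """Shard a [input, output] weight along its output dimension."""
    step = len(w[0]) // parts
    return [[row[i * step:(i + 1) * step] for row in w] for i in range(parts)]

def split_input_dim(w, parts):
    """Shard a [input, output] weight along its input dimension."""
    step = len(w) // parts
    return [w[i * step:(i + 1) * step] for i in range(parts)]

def all_reduce_sum(partials):
    """Elementwise sum of the per-rank partial results."""
    return [[sum(vals) for vals in zip(*rows)] for rows in zip(*partials)]

def ffn_tensor_parallel(x, w1, w2, tensor_para_size):
    """Each rank computes its shard independently; one reduction at the end."""
    partials = [
        matmul(activation(matmul(x, w1_shard)), w2_shard)
        for w1_shard, w2_shard in zip(split_output_dim(w1, tensor_para_size),
                                      split_input_dim(w2, tensor_para_size))
    ]
    return all_reduce_sum(partials)  # the only reduction in this block

x = [[1.0, -2.0]]
w1 = [[1.0, 0.0, -1.0, 2.0], [0.0, 1.0, 1.0, -1.0]]
w2 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, -1.0]]
reference = matmul(activation(matmul(x, w1)), w2)
parallel = ffn_tensor_parallel(x, w1, w2, tensor_para_size=2)
print(reference, parallel)  # [[9.0, -4.0]] [[9.0, -4.0]]
```

Because the activation is elementwise, no communication is needed between the two matrix multiplications. Applying the same pattern to the self-attention block gives one reduction there as well, which matches the 2 reductions per transformer block mentioned above.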
We provide the environment variables to tune for specific usage.
Name | Description | Default | Values accepted |
---|---|---|---|
`FMHA_ENABLE` | enable the fused multi-head attention kernels (fp16 accumulation) | disabled | `ON` = enable fmha, otherwise disabled |
`CONTEXT_ATTENTION_BMM1_HALF_ACCUM` | use fp16 accumulation for the qk gemm; only makes a difference to the unfused multi-head attention kernels | fp32 accumulation | `ON` = fp16 accumulation, otherwise fp32 accumulation |
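One way to apply these variables is to build the environment for the process that runs the example. The sketch below only prepares the environment dictionary; `ft_environment` is a hypothetical helper, and passing the result to `subprocess.run` is left to the caller.

```python
import os

def ft_environment(enable_fmha=False, bmm1_half_accum=False, base_env=None):
    """Return a copy of the environment with the two tuning variables applied.

    A variable is set to "ON" when requested and removed otherwise, so the
    documented default applies.
    """
    env = dict(os.environ if base_env is None else base_env)
    switches = (
        ("FMHA_ENABLE", enable_fmha),
        ("CONTEXT_ATTENTION_BMM1_HALF_ACCUM", bmm1_half_accum),
    )
    for name, enabled in switches:
        if enabled:
            env[name] = "ON"
        else:
            env.pop(name, None)
    return env

env = ft_environment(enable_fmha=True, base_env={"PATH": "/usr/bin"})
print(env)  # {'PATH': '/usr/bin', 'FMHA_ENABLE': 'ON'}
# Example use: subprocess.run(["./bin/multi_gpu_gpt_example"], env=env)
```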
The following guide demonstrates how to run the examples of c++, PyTorch and Triton backend.
- CMake >= 3.8 for Tensorflow, CMake >= 3.13 for PyTorch
- CUDA 11.0 or newer version
- NCCL 2.10 or newer version
- Python: Only verified on Python 3
- Tensorflow: Verified on 1.15; 1.13 and 1.14 should work.
- PyTorch: Verified on 1.8.0; >= 1.5.0 should work.
We recommend using an NGC image such as `nvcr.io/nvidia/tensorflow:22.09-tf1-py3` or `nvcr.io/nvidia/pytorch:22.09-py3`.
These components are readily available within the NGC TensorFlow Docker image below.
Ensure you have the following components:
- NVIDIA Docker and NGC container are recommended
- NVIDIA Pascal or Volta or Turing or Ampere based GPU
For more information about how to get started with NGC containers, see the following sections from the NVIDIA GPU Cloud Documentation and the Deep Learning Documentation:
- Getting Started Using NVIDIA GPU Cloud
- Accessing And Pulling From The NGC Container Registry
- Running TensorFlow
- Running PyTorch
For those unable to use the NGC container, to set up the required environment or create your own container, see the versioned NVIDIA Container Support Matrix.
You can choose the tensorflow version and python version you want. Here, we list some possible images:
To achieve the best performance, we recommend using the latest image. For example, run the image `nvcr.io/nvidia/tensorflow:22.09-tf1-py3` with:
```bash
nvidia-docker run -ti --shm-size 5g --rm nvcr.io/nvidia/tensorflow:22.09-tf1-py3 bash
git clone https://github.com/NVIDIA/FasterTransformer.git
mkdir -p FasterTransformer/build
cd FasterTransformer/build
git submodule init && git submodule update
```
- Note: the `xx` of `-DSM=xx` in the following scripts means the compute capability of your GPU. The following table shows the compute capability of common GPUs.
GPU | compute capability |
---|---|
P40 | 61 |
P4 | 61 |
V100 | 70 |
T4 | 75 |
A100 | 80 |
A30 | 80 |
A10 | 86 |
By default, `-DSM` is set to 70, 75, 80 and 86. When users set more kinds of `-DSM`, compilation takes longer. So, we suggest setting `-DSM` only for the device you use. Here, we use `xx` as an example for convenience.
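The substitution of `xx` can be scripted. The sketch below builds the cmake command line from a GPU name using a subset of the table above and the flags that appear in this guide; the helper name and the `framework` argument are hypothetical.

```python
# Compute capability of some common GPUs (subset of the table above).
COMPUTE_CAPABILITY = {"V100": 70, "T4": 75, "A100": 80, "A30": 80, "A10": 86}

def cmake_command(gpu, framework="cpp", multi_gpu=True):
    """Build the cmake invocation for a single target GPU."""
    command = ["cmake", f"-DSM={COMPUTE_CAPABILITY[gpu]}", "-DCMAKE_BUILD_TYPE=Release"]
    if framework == "pytorch":
        command.append("-DBUILD_PYT=ON")
    if multi_gpu:
        command.append("-DBUILD_MULTI_GPU=ON")
    command.append("..")
    return command

print(" ".join(cmake_command("A100", framework="pytorch")))
# cmake -DSM=80 -DCMAKE_BUILD_TYPE=Release -DBUILD_PYT=ON -DBUILD_MULTI_GPU=ON ..
```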
- build with C++

      cmake -DSM=xx -DCMAKE_BUILD_TYPE=Release -DBUILD_MULTI_GPU=ON ..
      make -j12

- build with TensorFlow

  Users need to set the path of TensorFlow. For example, if we use `nvcr.io/nvidia/tensorflow:22.09-tf1-py3`, then

      cmake -DSM=xx -DCMAKE_BUILD_TYPE=Release -DBUILD_TF=ON -DTF_PATH=/usr/local/lib/python3.8/dist-packages/tensorflow_core/ -DBUILD_MULTI_GPU=ON ..
      make -j12

- build with PyTorch

      cmake -DSM=xx -DCMAKE_BUILD_TYPE=Release -DBUILD_PYT=ON -DBUILD_MULTI_GPU=ON ..
      make -j12

  This will build the TorchScript custom class. Please make sure that `PyTorch >= 1.5.0`.

- Install required tools

      pip install -r ../examples/pytorch/gpt/requirement.txt
To run GPT on C++, users need to convert the TensorFlow or PyTorch checkpoint to binary files, and then load them with the FasterTransformer C++ API. Unfortunately, there is no published large model, so users can only verify correctness with smaller models. Currently, FasterTransformer provides two kinds of samples. The first uses the checkpoint of the OpenAI GPT-2 model (which is trained with TensorFlow); the other uses the checkpoint of Megatron (which is trained with PyTorch).
- Download vocab and merge table
They can be used in both OpenAI GPT-2 and Megatron.
wget https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-vocab.json -P ../models
wget https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-merges.txt -P ../models
To convert the OpenAI GPT model to binary, FasterTransformer provides a tool `examples/tensorflow/gpt/utils/openai_gpt_ckpt_converter.py` to convert the checkpoint. The converter requires the following arguments:
- `-i`: The path of megatron model
- `-o`: The output path of converted model
- `-t_g`: The tensor parallel size to train the model
- `-i_g`: The tensor parallel size we hope for inference
- `-h_n`: Number of heads, which is the hyper-parameter of the model
mkdir -p ../models/openai-gpt-models/
python tensorflow/utils/download_gpt2_model.py <model_name>
e.g. python ../examples/tensorflow/gpt/utils/download_gpt2_model.py 124M
mv models/124M ../models/openai-gpt-models/
python ../examples/tensorflow/gpt/utils/openai_gpt_ckpt_converter.py -o ../models/openai-gpt-models/c-model/124m/ -i ../models/openai-gpt-models/124M/model.ckpt -g 1 # convert 124M model with 1 TP mode
python ../examples/tensorflow/gpt/utils/openai_gpt_ckpt_converter.py -o ../models/openai-gpt-models/c-model/124m/ -i ../models/openai-gpt-models/124M/model.ckpt -g 4 # convert 124M model with 4 TP mode
The OpenAI repo provides many models, including `124M`, `355M`, `774M` and `1558M`.
To convert the Megatron GPT model to binary, FasterTransformer provides a tool `examples/pytorch/gpt/utils/megatron_ckpt_convert.py` to convert the checkpoint.
wget --content-disposition https://api.ngc.nvidia.com/v2/models/nvidia/megatron_lm_345m/versions/v0.0/zip -O megatron_lm_345m_v0.0.zip
mkdir -p ../models/megatron-models/345m
unzip megatron_lm_345m_v0.0.zip -d ../models/megatron-models/345m
export PYTHONPATH=$PWD/..:${PYTHONPATH}
python ../examples/pytorch/gpt/utils/megatron_ckpt_convert.py \
-head_num 16 \
-i ../models/megatron-models/345m/release/ \
-o ../models/megatron-models/c-model/345m/ \
-t_g 1 \
-i_g 1 \
--vocab-path ../models/gpt2-vocab.json \
--merges-path ../models/gpt2-merges.txt
python ../examples/pytorch/gpt/utils/megatron_ckpt_convert.py \
-head_num 16 \
-i ../models/megatron-models/345m/release/ \
-o ../models/megatron-models/c-model/345m/ \
-t_g 1 \
-i_g 8 \
--vocab-path ../models/gpt2-vocab.json \
--merges-path ../models/gpt2-merges.txt
where `t_g` means the number of GPUs used for TP during training, and `i_g` means the number of GPUs used for TP during inference.
Note that there are different checkpoint versions of Megatron. The version of the checkpoint above is 0.
For a model trained with pipeline parallelism, or when the checkpoint version is 3, you don't need to specify head_num or checkpoint_version, as they can be retrieved from model_args.
python ../examples/pytorch/gpt/utils/megatron_ckpt_convert.py -i ../models/megatron-models/345m/release/ -o ../models/megatron-models/c-model/345m/ -i_g 1
Note that the original `gpt2-10.onnx` model at https://github.com/onnx/models/raw/master/text/machine_comprehension/gpt-2/model/gpt2-10.onnx has been removed, and the model at the new link https://github.com/onnx/models/blob/main/text/machine_comprehension/gpt-2/model/gpt2-10.onnx cannot be loaded by onnx successfully.
To convert the ONNX GPT model to binary, FasterTransformer provides a tool `examples/onnx/multi_gpu_gpt/onnx_ckpt_convert.py` to convert the checkpoint.
wget https://github.com/onnx/models/blob/main/text/machine_comprehension/gpt-2/model/gpt2-10.onnx
python ../examples/onnx/multi_gpu_gpt/onnx_ckpt_convert.py -i gpt2-10.onnx -o ../models/onnx-models/c-model/124m/ -i_g 1
python ../examples/onnx/multi_gpu_gpt/onnx_ckpt_convert.py -i gpt2-10.onnx -o ../models/onnx-models/c-model/124m/ -i_g 4
git clone https://huggingface.co/gpt2-xl
python ../examples/pytorch/gpt/utils/huggingface_gpt_convert.py -i gpt2-xl/ -o ../models/huggingface-models/c-model/gpt2-xl -i_g 1
- Run GPT on C++ with multiple GPUs

  1.1 Generate the `gemm_config.in` file.

  Data Type = 0 (FP32) or 1 (FP16) or 2 (BF16)

      ./bin/gpt_gemm <batch_size> <beam_width> <max_input_len> <head_number> <size_per_head> <inter_size> <vocab_size> <data_type> <tensor_para_size> <is_append>
      E.g., ./bin/gpt_gemm 8 1 32 12 128 6144 51200 1 1 1

  If the application may have multiple different shapes (like different batch sizes), users can run the tool multiple times and set `is_append` to true. For example:

      ./bin/gpt_gemm 8 1 32 12 128 6144 51200 1 1 0 # bs 8, not append, will create a new gemm_config.in
      ./bin/gpt_gemm 16 1 32 12 128 6144 51200 1 1 1 # bs 16, append results to the existing gemm_config.in
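When several batch sizes need to be tuned, the `gpt_gemm` invocations can be generated in a loop. The sketch below follows the argument order shown above and only builds the command lines; the helper name is hypothetical.

```python
def gpt_gemm_commands(batch_sizes, beam_width, max_input_len, head_number,
                      size_per_head, inter_size, vocab_size, data_type,
                      tensor_para_size, binary="./bin/gpt_gemm"):
    """Build one gpt_gemm command per batch size.

    The first run uses is_append=0 to create a new config file; later runs
    use is_append=1 to append their results to it.
    """
    commands = []
    for index, batch_size in enumerate(batch_sizes):
        is_append = 0 if index == 0 else 1
        args = [batch_size, beam_width, max_input_len, head_number, size_per_head,
                inter_size, vocab_size, data_type, tensor_para_size, is_append]
        commands.append([binary] + [str(a) for a in args])
    return commands

for command in gpt_gemm_commands([8, 16], 1, 32, 12, 128, 6144, 51200, 1, 1):
    print(" ".join(command))
# ./bin/gpt_gemm 8 1 32 12 128 6144 51200 1 1 0
# ./bin/gpt_gemm 16 1 32 12 128 6144 51200 1 1 1
```

Each command list can be passed to `subprocess.run` from the build directory.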
  1.2 Run GPT on C++

  Users can see the details of the arguments in `examples/cpp/multi_gpu_gpt/gpt_config.ini`. It controls the model path, model size, tensor parallelism size, and some hyper-parameters.

      ./bin/multi_gpu_gpt_example

  Then use the following script to convert the token ids to sentences.

      python ../examples/pytorch/gpt/utils/gpt_token_converter.py --vocab_file=../models/gpt2-vocab.json --bpe_file=../models/gpt2-merges.txt

  By setting the `data_type` of `gpt_config.ini` to `fp16` or `bf16`, users can run the GPT model under fp16 or bf16.

  1.3 Run with tensor parallelism (TP), pipeline parallelism (PP)

  Users can use `tensor_para_size` and `pipeline_para_size` in `gpt_config.ini` to control the size of model parallelism. Note that the number of processes must be equal to `tensor_para_size * pipeline_para_size`.

      mpirun -n 8 ./bin/multi_gpu_gpt_example
      python ../examples/pytorch/gpt/utils/gpt_token_converter.py --vocab_file=../models/gpt2-vocab.json --bpe_file=../models/gpt2-merges.txt
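The process count passed to `mpirun -n` can be derived from the configuration instead of being typed by hand. The sketch below parses an inline sample with `configparser`; the section name `ft_instance_hyperparameter` and the sample values are assumptions for illustration, so check them against your own `gpt_config.ini`.

```python
import configparser

# Inline sample standing in for gpt_config.ini (section name is an assumption).
SAMPLE_CONFIG = """
[ft_instance_hyperparameter]
tensor_para_size=4
pipeline_para_size=2
"""

def required_processes(config_text, section="ft_instance_hyperparameter"):
    """Number of processes = tensor_para_size * pipeline_para_size."""
    parser = configparser.ConfigParser()
    parser.read_string(config_text)
    tensor_para_size = parser.getint(section, "tensor_para_size")
    pipeline_para_size = parser.getint(section, "pipeline_para_size")
    return tensor_para_size * pipeline_para_size

def mpirun_command(config_text, binary="./bin/multi_gpu_gpt_example"):
    """Build the mpirun command line with a matching process count."""
    return ["mpirun", "-n", str(required_processes(config_text)), binary]

print(" ".join(mpirun_command(SAMPLE_CONFIG)))
# mpirun -n 8 ./bin/multi_gpu_gpt_example
```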
  1.4 Run GPT on multi-nodes

  Since the C++ sample codes use MPI to communicate, they can be extended to multi-nodes easily, except that users need to set up the network environment for communication between the nodes. The following scripts show how to run multi-node inference on slurm.

      srun -N2 -n2 -t 600 --pty bash # Assume we get 2 nodes: prm-dgx-09 and prm-dgx-10
      srun -N2 -n2 docker pull nvcr.io/nvidia/tensorflow:22.09-tf1-py3
      srun -N2 -n2 nvidia-docker run -itd --shm-size 5g --rm --privileged --network=host --pid=host --cap-add=IPC_LOCK --device=/dev/infiniband -v $PWD:$PWD -w $PWD --name ft-test nvcr.io/nvidia/tensorflow:22.09-tf1-py3 /bin/bash
      srun -N2 -n2 nvidia-docker exec -i --env SLURM_NTASKS --env SLURM_NODEID --env SLURM_PROCID --env SLURM_STEP_NODELIST --env SLURMD_NODENAME --privileged ft-test bash -c "mkdir /root/.ssh && cp $PWD/ssh/* /root/.ssh && chmod 700 /root/.ssh && chmod 640 /root/.ssh/authorized_keys2 && chmod 400 /root/.ssh/id_rsa && apt-get update && apt-get install ssh -y && mkdir /run/sshd/ && /usr/sbin/sshd -p 11068 && nvidia-smi -lgc 1530"
      nvidia-docker exec -ti ft-test bash
      cd FasterTransformer/build
      mpirun --allow-run-as-root -np 2 -H prm-dgx-09:1,prm-dgx-10:1 -mca plm_rsh_args "-p 11068" ./bin/multi_gpu_gpt_example
      srun -N2 -n2 docker stop ft-test
- Run GPT on PyTorch

  Basically, `gpt_example.py` includes an example of how to declare a model, load a checkpoint, forward context inputs and get generated outputs in PyTorch.

  To generate outputs based on context inputs, create a text file containing the context inputs (line by line) and set `--sample_file_input` to the text file path. (By default, the script generates outputs without context inputs.) Set `--sample_file_output` to write the outputs to a file. Use `--data_type fp16/bf16` to run in FP16 or BF16.

  Run with `-h` to see more settings.

      python ../examples/pytorch/gpt/multi_gpu_gpt_example.py -h

  2.1 Run GPT with TP and PP on single node (NVIDIA DGX A100). Note that the number of processes must be equal to `tensor_para_size * pipeline_para_size`.

      # No parallelism (tensor_para_size=1, pipeline_para_size=1)
      python ../examples/pytorch/gpt/multi_gpu_gpt_example.py
      # TP (tensor_para_size=8, pipeline_para_size=1)
      mpirun -n 8 --allow-run-as-root python ../examples/pytorch/gpt/multi_gpu_gpt_example.py --tensor_para_size=8 --pipeline_para_size=1 --ckpt_path="/workspace/fastertransformer/models/megatron-models/c-model/345m/8-gpu"
      # PP (tensor_para_size=1, pipeline_para_size=8)
      mpirun -n 8 --allow-run-as-root python ../examples/pytorch/gpt/multi_gpu_gpt_example.py --tensor_para_size=1 --pipeline_para_size=8 --ckpt_path="/workspace/fastertransformer/models/megatron-models/c-model/345m/1-gpu"
      # TP and PP (tensor_para_size=4, pipeline_para_size=2)
      mpirun -n 8 --allow-run-as-root python ../examples/pytorch/gpt/multi_gpu_gpt_example.py --tensor_para_size=4 --pipeline_para_size=2 --ckpt_path="/workspace/fastertransformer/models/megatron-models/c-model/345m/4-gpu"
2.2 Run GPT with TP and PP on single-node/multi-node (NVIDIA SuperPOD)
```bash
srun -A devtech -J devtech-gpt:gpt -p luna -N1 --mpi=pmix --ntasks-per-node=8 --container-image nvcr.io/nvidia/pytorch:22.09-py3 --container-mounts /lustre/fsw/devtech/hpc-devtech/dahn/FasterTransformer:/workspace/fastertransformer --container-workdir /workspace/fastertransformer --pty bash
mkdir build && cd build
cmake -DSM=80 -DCMAKE_BUILD_TYPE=Release -DBUILD_PYT=ON .. && make -j12
```
* tensor_para_size=8, pipeline_para_size=1
```bash
srun -A devtech -p luna -N1 --mpi=pmix --ntasks-per-node=8 --container-image nvcr.io/nvidia/pytorch:22.09-py3 --container-mounts /lustre/fsw/devtech/hpc-devtech/dahn/FasterTransformer:/workspace/fastertransformer --container-workdir /workspace/fastertransformer/build python ../examples/pytorch/gpt/multi_gpu_gpt_example.py --tensor_para_size=8 --pipeline_para_size=1 --ckpt_path="/workspace/fastertransformer/models/megatron-models/c-model/345m/8-gpu"
```
* tensor_para_size=8, pipeline_para_size=2
```bash
srun -A devtech -p luna -N2 --mpi=pmix --ntasks-per-node=8 --container-image nvcr.io/nvidia/pytorch:22.09-py3 --container-mounts /lustre/fsw/devtech/hpc-devtech/dahn/FasterTransformer:/workspace/fastertransformer --container-workdir /workspace/fastertransformer/build python ../examples/pytorch/gpt/multi_gpu_gpt_example.py --tensor_para_size=8 --pipeline_para_size=2 --ckpt_path="/workspace/fastertransformer/models/megatron-models/c-model/345m/8-gpu"
```
2.3 Run LAMBADA test on PyTorch
Download the data set and run the test:
```bash
wget https://github.com/cybertronai/bflm/raw/master/lambada_test.jsonl -P ../models/megatron-models
export PYTHONPATH=$PWD/../:$PYTHONPATH
python ../examples/pytorch/gpt/utils/update_gpt_config.py \
--model-dir ../models/megatron-models/c-model/345m/1-gpu/ \
--config-ini-path ../models/megatron-models/c-model/345m/1-gpu/config.ini \
--pipeline-para-size 1 \
--tensor-para-size 1 \
--max-seq-len 512 \
--beam-width 1 \
--sampling-top-k 1 \
--sampling-top-p 0 \
--data-type fp16
python ../examples/pytorch/gpt/lambada_task_example.py \
--batch-size 64 \
--checkpoint-path ../models/megatron-models/c-model/345m/1-gpu/ \
--lib-path lib/libth_transformer.so \
--lambada-path ../models/megatron-models/lambada_test.jsonl
```
- Run GPT on TensorFlow

  Follow "Download openai-gpt model and convert" to prepare the model. Assume the TF model is put in `../models/openai-gpt-models/`.

      ./bin/gpt_gemm 4 1 32 12 64 3072 50257 1 1
      python ../examples/tensorflow/gpt/gpt_example.py --batch_size=4 \
                                                       --length=32 \
                                                       --top_k=4 \
                                                       --top_p=0.6 \
                                                       --data_type=fp16 \
                                                       --models_dir=../models/openai-gpt-models/

  Note that the TensorFlow op only supports a single GPU.
GPT now supports p/prompt-tuning. It works with NeMo checkpoints and prompt learning.

- Convert the prompt weights

  Use `examples/pytorch/gpt/utils/nemo_ckpt_convert.py` to convert the NeMo Megatron prompt weights. It will automatically generate the configuration needed for Triton backend inference.

  Note that you need to specify `start_id` and `end_id` yourself to make sure that they are consistent with the tokenizer.

- Run GPT with C++ example

  You need to specify the example gpt_config.ini as below to enable the p/prompt_tuning feature.

      [gptj_6B]
      head_num=16
      size_per_head=256
      vocab_size=50400
      decoder_layers=28
      rotary_embedding=64
      start_id=50256
      end_id=50256
      inter_size=16384
      num_tasks=2
      prompt_learning_type=2

      ;prompt learning example (soft prompt doesn't need it)
      [gptj_6B_task_0]
      task_name=task_0
      prompt_length=5

      [gptj_6B_task_1]
      task_name=task_1
      prompt_length=10

  `task_name` and `prompt_length` are specified for loading prompt weights.

  `prompt_learning_start_id` is needed for checking whether ids are prompts or normal input ids.

  `prompt_learning_type`:
- no prompt: 0
- soft_prompt: 1
- prefix_prompt: 2
- p/prompt_tuning: 3
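The prompt learning settings above can be read back with Python's `configparser`. The sketch below parses a shortened inline copy of the example configuration; the enum, the helper name and the `<model>_task_<n>` section lookup are written for illustration and are not part of FasterTransformer.

```python
import configparser
from enum import IntEnum

class PromptLearningType(IntEnum):
    NO_PROMPT = 0
    SOFT_PROMPT = 1
    PREFIX_PROMPT = 2
    P_PROMPT_TUNING = 3

# Shortened copy of the example configuration, kept inline for the sketch.
SAMPLE_CONFIG = """
[gptj_6B]
num_tasks=2
prompt_learning_type=2

[gptj_6B_task_0]
task_name=task_0
prompt_length=5

[gptj_6B_task_1]
task_name=task_1
prompt_length=10
"""

def read_prompt_tasks(config_text, model_name):
    """Return the prompt learning type and a {task_name: prompt_length} map."""
    parser = configparser.ConfigParser()
    parser.read_string(config_text)
    model = parser[model_name]
    prompt_type = PromptLearningType(model.getint("prompt_learning_type"))
    tasks = {}
    for task_id in range(model.getint("num_tasks")):
        section = parser[f"{model_name}_task_{task_id}"]
        tasks[section["task_name"]] = section.getint("prompt_length")
    return prompt_type, tasks

prompt_type, tasks = read_prompt_tasks(SAMPLE_CONFIG, "gptj_6B")
print(prompt_type.name, tasks)  # PREFIX_PROMPT {'task_0': 5, 'task_1': 10}
```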
Meta OPT and OpenAI GPT do not differ much in structure, so they share the same model and Triton backend classes.
You need to convert the Hugging Face Meta OPT models to the FasterTransformer format with `examples/pytorch/gpt/utils/huggingface_opt_convert.py`.
- Run OPT on C++ with multiple GPUs

  Users can see the details of the arguments in `examples/cpp/multi_gpu_gpt/gpt_config.ini`. It controls the model path, model size, tensor parallelism size, and some hyper-parameters.

  In order to run Meta OPT models, you need to add an additional configuration, `model_variant`, which controls `layernorm_eps`, `layernorm_type`, `activation_type` and `has_post_decoder_layernorm`.

  For example, the opt 125m model configuration would look like:

      [opt_125M]
      head_num=12
      size_per_head=64
      vocab_size=50272
      decoder_layers=12
      start_id=2
      end_id=2
      inter_size=3072
      model_variant=opt-pre ;define variant structure

  There are two model types: opt-pre = pre_layernorm, opt-post = post_layernorm

  Note that the model has a post decoder layernorm when layernorm_type is pre_layernorm.

  1.1 Support for w8a8 int8 mode with OPT (preview)
FasterTransformer supports having certain operations with both weights and activations in int8. To keep high accuracy with your model, we recommend SmoothQuant models. Fig 4 presents the data flow. You can convert a regular OPT model to a SmoothQuant one with this repo. You must also generate activation records for calibrating the scaling factors. With these, you can convert the SmoothQuant model for w8a8 inference in FT:
python3 examples/pytorch/gpt/utils/huggingface_opt_convert.py -i ../smoothquant/opt-1.3b-smooth/ -o ../nlp-models/ft/test/opt-1.3b-int8/ -i_g 1 -act_scale ../smoothquant/opt-1.3b-smooth.scales.pt
Then, set `int8_mode` to `2` in `examples/cpp/gpt/gpt_config.ini` and run `bin/multi_gpu_gpt_example`. Note that this optimization only supports OPT with pre-layernorm (`opt-pre`).
Fig 4. SmoothQuant workflow.
- Run OPT on PyTorch
We can run summarization task examples with Meta OPT models. See examples/pytorch/gpt/opt_summarization.py.
Note that the summarization tests are run with topk = 2, so the ROUGE scores of HF and FT often differ.
- Run on opt-125m model
git lfs clone https://huggingface.co/facebook/opt-125m
python ../examples/pytorch/gpt/utils/huggingface_opt_convert.py \
-i opt-125m/ \
-o opt-125m/c-model/ \
-i_g 1
python3 ../examples/pytorch/gpt/opt_summarization.py \
--summarize \
--test_hf \
--max_ite 20 \
--ft_model_location opt-125m/c-model \
--hf_model_name opt-125m
The results are similar to:
Hugging Face (total latency: 9.258284 sec)
rouge1 : 20.36984889475218
rouge2 : 4.854345624891912
rougeL : 14.82866480289381
rougeLsum : 18.23638863809613
Faster Transformers (total latency: 3.9376330000000004 sec)
rouge1 : 26.676168312282357
rouge2 : 10.004052949342602
rougeL : 19.20934213532261
rougeLsum : 24.243496576656323
- Run on opt-350m model
git lfs clone https://huggingface.co/facebook/opt-350m
python ../examples/pytorch/gpt/utils/huggingface_opt_convert.py \
-i opt-350m/ \
-o opt-350m/c-model/ \
-i_g 1
python3 ../examples/pytorch/gpt/opt_summarization.py \
--summarize \
--test_hf \
--max_ite 20 \
--ft_model_location opt-350m/c-model \
--hf_model_name opt-350m \
--data_type fp16
The results are similar to:
Hugging Face (total latency: 21.961627 sec)
rouge1 : 28.939621379501467
rouge2 : 9.858278077813752
rougeL : 19.159853526952528
rougeLsum : 26.120654334830885
Faster Transformers (total latency: 6.293255999999998 sec)
rouge1 : 26.80687566772978
rouge2 : 8.639787737378661
rougeL : 18.90520115636779
rougeLsum : 24.372302912676407
We can also run OPT summarization with int8:
python3 ../examples/pytorch/gpt/opt_summarization.py \
--summarize \
--test_hf \
--max_ite 20 \
--ft_model_location opt-350m/c-model \
--hf_model_name opt-350m \
--data_type fp16 \
--int8_mode 1
The results are similar to (from RTX 3090):
Hugging Face (total latency: 17.364539 sec)
rouge1 : 29.781707569865045
rouge2 : 10.400027824789843
rougeL : 20.295983024772482
rougeLsum : 26.529982852324874
Faster Transformers (total latency: 6.088986 sec)
rouge1 : 26.744781183506355
rouge2 : 7.118945671926842
rougeL : 17.357590762660852
rougeLsum : 24.31072167607998
- Run OPT with Triton Backends
Model configurations are generated automatically when converting the Meta OPT models.
You can then use the converted weights and configuration file to serve the model with Triton servers. Example of the config.ini generated when converting the model:
[gpt]
model_name = opt-350m/
head_num = 16
size_per_head = 64
inter_size = 4096
max_pos_seq_len = 2048
num_layer = 24
layernorm_eps = 1e-5
layernorm_type = post_layernorm
activation_type = Relu
has_post_decoder_layernorm = 0
vocab_size = 50272
start_id = 2
end_id = 2
weight_data_type = fp32
BLOOM is a variant of the GPT model that uses ALiBi, which does not need a learnt positional encoding and allows the model to generate sequences longer than the sequence length used in training.
BLOOM also has a structure similar to OpenAI GPT, so, as with OPT, FT provides the BLOOM model through the GPT classes as a variant.
Users can convert a pretrained Huggingface BLOOM model into the FasterTransformer format with examples/pytorch/gpt/utils/huggingface_bloom_convert.py.
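For readers unfamiliar with ALiBi, the following sketch shows the linear bias it adds to attention scores in place of a positional embedding. It uses the geometric slope schedule from the ALiBi paper and is restricted to power-of-two head counts (such as the 16 heads of bloom-560m). The function names are illustrative, and the code is not taken from FT's kernels.

```python
import math


def alibi_slopes(num_heads):
    """Geometric per-head slopes from the ALiBi paper; power-of-two head counts only."""
    if num_heads <= 0 or num_heads & (num_heads - 1):
        raise ValueError("this sketch only handles power-of-two head counts")
    start = 2.0 ** (-8.0 / num_heads)
    return [start ** (i + 1) for i in range(num_heads)]


def alibi_bias(slope, seq_len):
    """Causal bias added to attention scores: -slope * (query_pos - key_pos)."""
    return [
        [-slope * (q - k) if k <= q else -math.inf for k in range(seq_len)]
        for q in range(seq_len)
    ]


slopes = alibi_slopes(16)              # bloom-560m uses head_num=16
bias = alibi_bias(alibi_slopes(8)[0], 3)
for row in bias:
    print(row)
```

Because the bias depends only on the distance between query and key positions, nothing in it is tied to a maximum trained length, which is what lets the model run on longer sequences.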
- Run BLOOM on C++ with multiple GPUs
Users can find the details of the parameters in examples/cpp/multi_gpu_gpt/gpt_config.ini, which controls the checkpoint path, model size, tensor parallelism size, and other hyper-parameters. As with OPT, we need to set an additional configuration, model_variant=bloom. For example, the configuration of the bloom-560m model looks like:
[bloom_560M]
head_num=16
size_per_head=64
vocab_size=250880
decoder_layers=24
start_id=1
end_id=3
inter_size=4096
model_variant=bloom ; define variant structure
- Run BLOOM on PyTorch
We provide a LAMBADA task example for the BLOOM model. Please see examples/pytorch/gpt/bloom_lambada.py.
- Run on bloom-560m model
git clone https://huggingface.co/bigscience/bloom-560m
python ../examples/pytorch/gpt/utils/huggingface_bloom_convert.py \
--input-dir bloom-560m \
--output-dir bloom-560m/c-model \
-tp 1 -p 4 -v
wget https://github.com/cybertronai/bflm/raw/master/lambada_test.jsonl -P ../datasets/lambada
# Run HF benchmark
python ../examples/pytorch/gpt/bloom_lambada.py \
--tokenizer-path bloom-560m \
--dataset-path ../datasets/lambada/lambada_test.jsonl \
--test-hf --show-progress
# Run FT benchmark
python ../examples/pytorch/gpt/bloom_lambada.py \
--checkpoint-path bloom-560m/c-model/1-gpu \
--tokenizer-path bloom-560m \
--dataset-path ../datasets/lambada/lambada_test.jsonl \
--show-progress
The resulting accuracy is around 35.3% in both cases.
(HF) Accuracy: 35.3775% (1823/5153) (elapsed time: 23.3663 sec)
(FT) Accuracy: 35.3386% (1821/5153) (elapsed time: 10.8444 sec)
- Run BLOOM with Triton Backends
As with OPT, configurations are generated automatically when converting to an FT checkpoint, so we can run the model through a Triton server without any further step. Example of the config.ini generated when converting the model:
[gpt]
model_name=bloom-560m/
num_layer=24
head_num=16
inter_size=4096
size_per_head=64
vocab_size=250880
layernorm_eps=1e-05
weight_data_type=fp32
tensor_para_size=1
start_id=1
end_id=2
Details are in transformer_backend.
We use the checkpoint provided by modelscope. This checkpoint was trained on a Chinese dataset, so we test with some Chinese texts. In addition, Megatron-DeepSpeed needs some modifications to load the MOE checkpoint. We have put the modified Megatron-DeepSpeed code in the moe_ft branch of https://github.com/byshiue/Megatron-DeepSpeed/.
pip install git+https://github.com/microsoft/DeepSpeed.git
git clone https://github.com/byshiue/Megatron-DeepSpeed/ -b moe_ft
pip install Megatron-DeepSpeed/
pip install jieba
pip install -r ../examples/pytorch/gpt/requirement.txt
git lfs clone https://www.modelscope.cn/PAI/nlp_gpt3_text-generation_0.35B_MoE-64.git
mv nlp_gpt3_text-generation_0.35B_MoE-64 ../models
PYTHONPATH=$PWD/../ python ../examples/pytorch/gpt/utils/megatron_gpt_moe_ckpt_convert.py \
--input-dir ../models/nlp_gpt3_text-generation_0.35B_MoE-64/model \
--saved-dir ../models/nlp_gpt3_text-generation_0.35B_MoE-64/model/c-models \
--infer-gpu-num 1 \
--vocab-path ../models/gpt2-vocab.json \
--merges-path ../models/gpt2-merges.txt
echo \
'据悉,自驾
“首金”花落谁家,无疑' > sample_input_file.txt
python3 ../examples/pytorch/gpt/multi_gpu_gpt_example.py \
--tensor_para_size=1 \
--pipeline_para_size=1 \
--ckpt_path=../models/nlp_gpt3_text-generation_0.35B_MoE-64/model/c-models/1-gpu/ \
--data_type=fp16 \
--vocab_file=../models/nlp_gpt3_text-generation_0.35B_MoE-64/tokenizer.json \
--vocab_size=51200 \
--start_id=7 \
--end_id=7 \
--sample_input_file=sample_input_file.txt \
--use_jieba_tokenizer
The output should look like:
[INFO] batch 0, beam 0:
[Context]
据悉,自驾
[Output]
游的人数正在逐年增加,而且越来越多的人选择自驾游,而且越来越多的人选择自驾
[INFO] batch 1, beam 0:
[Context]
“首金”花落谁家,无疑
[Output]
是一场精彩的“战役”。 “首金”花落谁家,是一场精彩的“战役”。
modelscope also provides a 27B checkpoint, which fits on a single A100-80GB under FP16 and produces higher-quality output.
FT also supports GPT-MOE with model parallelism.
PYTHONPATH=$PWD/../ python ../examples/pytorch/gpt/utils/megatron_gpt_moe_ckpt_convert.py \
--input-dir ../models/nlp_gpt3_text-generation_0.35B_MoE-64/model \
--saved-dir ../models/nlp_gpt3_text-generation_0.35B_MoE-64/model/c-models \
--infer-gpu-num 2 \
--vocab-path ../models/gpt2-vocab.json \
--merges-path ../models/gpt2-merges.txt
mpirun -n 2 python3 ../examples/pytorch/gpt/multi_gpu_gpt_example.py \
--tensor_para_size=2 \
--pipeline_para_size=1 \
--ckpt_path=../models/nlp_gpt3_text-generation_0.35B_MoE-64/model/c-models/2-gpu/ \
--data_type=fp16 \
--vocab_file=../models/nlp_gpt3_text-generation_0.35B_MoE-64/tokenizer.json \
--vocab_size=51200 \
--start_id=7 \
--end_id=7 \
--sample_input_file=sample_input_file.txt \
--use_jieba_tokenizer
mpirun -n 2 python3 ../examples/pytorch/gpt/multi_gpu_gpt_example.py \
--tensor_para_size=1 \
--pipeline_para_size=2 \
--ckpt_path=../models/nlp_gpt3_text-generation_0.35B_MoE-64/model/c-models/1-gpu/ \
--data_type=fp16 \
--vocab_file=../models/nlp_gpt3_text-generation_0.35B_MoE-64/tokenizer.json \
--vocab_size=51200 \
--start_id=7 \
--end_id=7 \
--sample_input_file=sample_input_file.txt \
--use_jieba_tokenizer
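The three launch commands above follow a pattern that can be sketched in a few lines: the number of MPI processes equals tensor_para_size * pipeline_para_size, and the checkpoint directory is named after the tensor-parallel size used at conversion time (2-gpu for TP=2, 1-gpu for pipeline-only parallelism). The launch_plan helper is hypothetical, and the rank layout it prints (consecutive ranks forming one tensor-parallel group) is an assumption for illustration rather than something stated in this guide.

```python
def launch_plan(tensor_para_size, pipeline_para_size, ckpt_root):
    """Derive the mpirun process count and checkpoint path for a TP/PP combination."""
    world_size = tensor_para_size * pipeline_para_size
    # The converter writes one directory per tensor-parallel size (--infer-gpu-num).
    ckpt_path = f"{ckpt_root}/{tensor_para_size}-gpu/"
    ranks = []
    for rank in range(world_size):
        # Assumed layout: consecutive ranks share a pipeline stage.
        ranks.append({
            "rank": rank,
            "tp_rank": rank % tensor_para_size,
            "pp_rank": rank // tensor_para_size,
        })
    return {"mpirun_n": world_size, "ckpt_path": ckpt_path, "ranks": ranks}


tp_plan = launch_plan(2, 1, "c-models")
pp_plan = launch_plan(1, 2, "c-models")
print(tp_plan["mpirun_n"], tp_plan["ckpt_path"])
print(pp_plan["mpirun_n"], pp_plan["ckpt_path"])
```

This also explains why the pipeline-parallel run reuses the 1-gpu checkpoint: pipeline parallelism splits the model by layers, so the per-layer weights do not need to be re-sharded.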
Note that FP8 is supported starting from Hopper and CUDA 11.8. Here, we use the docker image nvcr.io/nvidia/pytorch:22.10-py3 to demonstrate.
mkdir build
cd build
cmake -DSM=90 -DCMAKE_BUILD_TYPE=Release -DBUILD_PYT=ON -DBUILD_MULTI_GPU=ON -DENABLE_FP8=ON ..
make -j12
wget --content-disposition https://api.ngc.nvidia.com/v2/models/nvidia/megatron_lm_345m/versions/v0.0/zip -O megatron_lm_345m_v0.0.zip
mkdir models/345m/ -p
unzip megatron_lm_345m_v0.0.zip -d ./models/345m
export PYTHONPATH=$PWD/..:${PYTHONPATH}
python3 ../examples/pytorch/gpt/utils/megatron_fp8_ckpt_convert.py \
-i ./models/345m/release \
-o ./models/345m/c-model/ \
-i_g 1 \
-head_num 16 \
-trained_tensor_parallel_size 1
python3 ../examples/pytorch/gpt/gpt_summarization.py \
--data_type fp8 \
--lib_path ./lib/libth_transformer.so \
--summarize \
--ft_model_location ./models/345m/c-model/
The checkpoint does not contain quantization scales, so FT initializes them directly with identity scales. The accuracy is still good, as the following results show:
rouge1 : 23.264943073521202
rouge2 : 6.43987431806994
rougeL : 16.517620811297537
rougeLsum : 21.24054457217973
The model downloading and conversion are described in Download megatron model and convert.
A common request is to take a single input and return multiple results generated with different random seeds. To achieve this, we can duplicate the input several times and set a different random seed for each sentence in the batch. You can enable this by adding --enable_random_seed. Otherwise, all random seeds are set to 0 by default.
For example, we prepare an input with batch size 4 in which all sentences are the same.
for i in {1..4} ; do echo " Article : (CNN)James Best, best known for his portrayal of bumbling sheriff Rosco P. Coltrane on TV's \"The Dukes of Hazzard,\" died Monday after a brief illness. He was 88. Best died in hospice in Hickory, North Carolina, of complications from pneumonia, said Steve Latshaw, a longtime friend and Hollywood colleague. Although he'd been a busy actor for decades in theater and in Hollywood, Best didn't become famous until 1979, when \"The Dukes of Hazzard's\" cornpone charms began beaming into millions of American homes almost every Friday night. For seven seasons, Best's Rosco P. Coltrane chased the moonshine-running Duke boys back and forth across the back roads of fictitious Hazzard County, Georgia, although his \"hot pursuit\" usually ended with him crashing his patrol car. Although Rosco was slow-witted and corrupt, Best gave him a childlike enthusiasm that got laughs and made him endearing. His character became known for his distinctive \"kew-kew-kew\" chuckle and for goofy catchphrases such as \"cuff 'em and stuff 'em! \" upon making an arrest. Among the most popular shows on TV in the early '80s, \"The Dukes of Hazzard\" ran until 1985 and spawned TV movies, an animated series and video games. Several of Best's \"Hazzard\" co-stars paid tribute to the late actor on social media. \"I laughed and learned more from Jimmie in one hour than from anyone else in a whole year,\" co-star John Schneider, who played Bo Duke, said on Twitter. \"Give Uncle Jesse my love when you see him dear friend.\" \"Jimmy Best was the most constantly creative person I have ever known,\" said Ben Jones, who played mechanic Cooter on the show, in a Facebook post. 
\"Every minute of his long life was spent acting, writing, producing, painting, teaching, fishing, or involved in another of his life's many passions.\" Born Jewel Guy on July 26, 1926, in Powderly, Kentucky, Best was orphaned at 3 and adopted by Armen and Essa Best, who renamed him James and raised him in rural Indiana. Best served in the Army during World War II before launching his acting career. TL;DR: " >> sample_input.txt ; done
Then, we run multi_gpu_gpt_example.py with --enable_random_seed:
python3 ../examples/pytorch/gpt/multi_gpu_gpt_example.py \
--ckpt_path ../models/megatron-models/c-model/345m/1-gpu/ \
--vocab_file ../models/gpt2-vocab.json \
--merges_file ../models/gpt2-merges.txt \
--sample_input_file sample_input.txt \
--max_batch_size 4 \
--time \
--top_p 0.9 \
--top_k 0 \
--shared_contexts_ratio 0.0 \
--enable_random_seed \
--output_len 8
You can see that the results differ slightly from one another, and the program shows the time cost like:
[INFO] GPT time costs: 64.25 ms
Although this method achieves our target, computing the same duplicated inputs repeatedly is wasteful. So, we can set --shared_contexts_ratio to compute the duplicated inputs only once in the context phase:
python3 ../examples/pytorch/gpt/multi_gpu_gpt_example.py \
--ckpt_path ../models/megatron-models/c-model/345m/1-gpu/ \
--vocab_file ../models/gpt2-vocab.json \
--merges_file ../models/gpt2-merges.txt \
--sample_input_file sample_input.txt \
--max_batch_size 4 \
--time \
--top_p 0.9 \
--top_k 0 \
--shared_contexts_ratio 1.0 \
--enable_random_seed \
--output_len 8
You can see that inference is faster than in the original run:
[INFO] GPT time costs: 41.69 ms
Notes:
- The results with shared_context enabled and disabled may differ because the shapes of the GEMMs change. This does not affect the quality of generation.
- We use a short output_len in this example to demonstrate the benefit of shared_context. In real applications, the more duplicated inputs there are and the longer the input length is compared to the output length, the more speedup shared_context brings.
- Since the additional overhead of enabling shared_context is negligible, we enable it by default.
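The idea of computing duplicated inputs only once can be sketched as a deduplication step before the context phase: find the unique prompts in a batch and remember which unique entry each batch row maps to. The helper names below are hypothetical, and the threshold rule in use_shared_contexts is only one plausible reading of --shared_contexts_ratio that matches the 0.0 (off) and 1.0 (on) settings used above; it is not taken from the FT source.

```python
def compact_shared_contexts(batch_prompts):
    """Return (unique prompts, batch index -> unique index) for a batch of token-id lists."""
    unique_prompts = []
    first_seen = {}
    batch_to_compact = []
    for prompt in batch_prompts:
        key = tuple(prompt)
        if key not in first_seen:
            first_seen[key] = len(unique_prompts)
            unique_prompts.append(list(prompt))
        batch_to_compact.append(first_seen[key])
    return unique_prompts, batch_to_compact


def use_shared_contexts(batch_prompts, shared_contexts_ratio):
    # Assumed rule: share contexts when the unique fraction of the batch is within the ratio.
    unique_prompts, _ = compact_shared_contexts(batch_prompts)
    batch_size = len(batch_prompts)
    return batch_size > 1 and len(unique_prompts) <= shared_contexts_ratio * batch_size


def context_tokens(prompts):
    return sum(len(p) for p in prompts)


prompt = [818, 262, 938, 3155, 286]
batch = [prompt] * 4                      # one request duplicated four times
unique, mapping = compact_shared_contexts(batch)
print(context_tokens(batch), "->", context_tokens(unique), "context tokens; mapping", mapping)
```

After the context phase, the k/v cache of each unique prompt would be expanded back to the full batch through the mapping, and the per-sentence random seeds make the four continuations diverge during generation.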
In some scenarios (like chatting), new requests are related to previous requests. Currently, users can pass all previous inputs and outputs as a new input to FT so that FT generates a new reply from these previous texts, as shown in Fig 5 and Fig 6. However, this means that we need to recompute the k/v cache of all previous inputs and outputs, which wastes time when the context is very long.
To achieve better performance and avoid useless computation, we add a new flag, continue_gen, to GPT. When this flag is on, FT keeps all results during generation and assumes the user will provide more text. FT does not recompute the k/v cache of the results it already has, and only computes the k/v cache of the new ids. The workflow becomes what we demonstrate in Fig 7. To avoid allocating the memory buffer again, users also need to set session_len to the maximum sequence length of the final sentence, not only that of the intermediate sentence.
We use multi_gpu_gpt_interactive_example to demonstrate how to use this feature. In this example, we first load examples/cpp/multi_gpu_gpt/start_ids.csv (all inputs have length 8):
818, 262, 938, 3155, 286, 1528, 11, 257
198, 464, 968, 8221, 2732, 286, 15198, 318
464, 968, 1971, 12056, 423, 257, 649, 1182
464, 968, 1971, 3782, 468, 3199, 663, 5079
818, 257, 1445, 326, 481, 1884, 787, 340
464, 968, 1971, 12056, 6, 5859, 41683, 423
198, 198, 464, 5398, 4332, 628, 628, 198
464, 717, 640, 314, 2497, 262, 3807, 11
and then generate 32 tokens with continue_gen=true to get intermediate results (the results are saved in out.interm):
818 262 938 3155 286 1528 11 257 1256 286 661 423 587 4737 502 546 262 649 1492 11 290 314 1053 587 2111 284 3280 617 286 262 2683 326 661 423 587 4737 502 13 198 198
198 464 968 8221 2732 286 15198 318 1762 351 262 1181 338 9358 5011 284 5004 262 1266 835 284 1445 262 4979 13 198 1 1135 821 1016 284 307 2045 379 262 1266 835 284 1445 262
464 968 1971 12056 423 257 649 1182 3985 11 290 339 338 257 3516 508 338 587 1088 262 4652 329 257 890 640 13 679 338 257 3516 508 338 587 1088 262 4652 329 257 890 640
464 968 1971 3782 468 3199 663 5079 1351 286 262 995 338 749 14212 661 13 198 464 1351 11 543 373 14102 416 262 968 1971 3782 11 318 1912 319 257 5526 286 517 621 352 11
818 257 1445 326 481 1884 787 340 4577 329 262 1664 284 3677 663 7303 11 262 1664 468 4987 284 3677 663 10171 287 262 1664 284 257 1448 286 7713 2957 416 262 2839 13598 4081 309
464 968 1971 12056 6 5859 41683 423 587 257 1263 636 286 262 1074 338 1943 428 1622 13 198 464 12056 423 587 1498 284 1057 262 2613 6840 11 290 484 423 587 1498 284 1057 262
198 198 464 5398 4332 628 628 198 198 464 5398 4332 628 628 198 198 464 5398 4332 628 628 198 198 464 5398 4332 628 628 198 198 464 5398 4332 628 628 198 198 464 5398 4332
464 717 640 314 2497 262 3807 11 314 373 588 11 705 5812 616 1793 11 428 318 523 3608 2637 314 373 588 11 705 40 765 284 307 287 428 3807 2637 314 373 588 11 705
Next, we load another set of inputs from examples/cpp/multi_gpu_gpt/interactive_inputs_ids (again, all inputs have length 8):
5962, 11, 314, 561, 588, 284, 910, 326
11125, 286, 2844, 291, 5028, 422, 262, 7627
392, 257, 1913, 1998, 351, 1353, 12, 28282
830, 34643, 11, 7602, 11, 4708, 6332, 1938
5, 38328, 763, 13, 1119, 481, 2148, 257
3245, 355, 257, 22080, 1074, 13, 4042, 286
14150, 26443, 262, 1230, 338, 1410, 284, 3958
5195, 4398, 470, 314, 7342, 340, 2961, 30
and pass them into FT again (note that we only need to pass the new ids because FT has already recorded all previous ids). FT then appends these new ids to the output ids, computes k/v caches for only these new ids, and generates another 32 tokens as a new response (the results are saved in out):
818 262 938 3155 286 1528 11 257 1256 286 661 423 587 4737 502 546 262 649 1492 11 290 314 1053 587 2111 284 3280 617 286 262 2683 326 661 423 587 4737 502 13 198 198 5962 11 314 561 588 284 910 326 314 1101 407 257 4336 286 262 1492 13 314 892 340 338 257 1310 1165 881 286 257 366 10919 611 1 1492 13 314 892 340 338 257 1310 1165
198 464 968 8221 2732 286 15198 318 1762 351 262 1181 338 9358 5011 284 5004 262 1266 835 284 1445 262 4979 13 198 1 1135 821 1016 284 307 2045 379 262 1266 835 284 1445 262 11125 286 2844 291 5028 422 262 7627 7784 15296 284 262 7421 7784 15296 553 531 42743 6523 3899 1024 33246 271 13 198 464 42743 318 635 2045 379 262 5885 286 3867 262 4979 422 262 7421
464 968 1971 12056 423 257 649 1182 3985 11 290 339 338 257 3516 508 338 587 1088 262 4652 329 257 890 640 13 679 338 257 3516 508 338 587 1088 262 4652 329 257 890 640 392 257 1913 1998 351 1353 12 28282 18370 13 679 338 257 3516 508 338 587 1088 262 4652 329 257 890 640 13 679 338 257 3516 508 338 587 1088 262 4652 329 257 890 640 13
464 968 1971 3782 468 3199 663 5079 1351 286 262 995 338 749 14212 661 13 198 464 1351 11 543 373 14102 416 262 968 1971 3782 11 318 1912 319 257 5526 286 517 621 352 11 830 34643 11 7602 11 4708 6332 1938 290 584 14212 661 13 198 464 1351 318 14102 416 262 968 1971 3782 290 318 3199 319 262 3052 286 262 7533 13 198 464 1351 318 20633 416 262
818 257 1445 326 481 1884 787 340 4577 329 262 1664 284 3677 663 7303 11 262 1664 468 4987 284 3677 663 10171 287 262 1664 284 257 1448 286 7713 2957 416 262 2839 13598 4081 309 5 38328 763 13 1119 481 2148 257 2472 286 720 16 13 20 2997 287 5003 290 4283 13 198 464 1730 318 2938 284 1969 287 262 1218 2063 286 428 614 13 198 464 1664 531 340
464 968 1971 12056 6 5859 41683 423 587 257 1263 636 286 262 1074 338 1943 428 1622 13 198 464 12056 423 587 1498 284 1057 262 2613 6840 11 290 484 423 587 1498 284 1057 262 3245 355 257 22080 1074 13 4042 286 262 640 11 262 12056 423 587 1498 284 1057 262 2613 6840 11 290 484 423 587 1498 284 1057 262 3245 355 257 22080 1074 13 198 464 12056 423
198 198 464 5398 4332 628 628 198 198 464 5398 4332 628 628 198 198 464 5398 4332 628 628 198 198 464 5398 4332 628 628 198 198 464 5398 4332 628 628 198 198 464 5398 4332 14150 26443 262 1230 338 1410 284 3958 262 779 286 262 1573 366 16991 1 287 262 1499 338 1743 3303 13 198 198 464 1230 338 1410 284 3958 262 779 286 262 1573 366 16991 1 287
464 717 640 314 2497 262 3807 11 314 373 588 11 705 5812 616 1793 11 428 318 523 3608 2637 314 373 588 11 705 40 765 284 307 287 428 3807 2637 314 373 588 11 705 5195 4398 470 314 7342 340 2961 30 4162 4398 470 314 1775 340 878 8348 314 373 588 11 705 40 765 284 307 287 428 3807 2637 314 373 588 11 705 40 765 284 307 287 428
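The bookkeeping behind this walkthrough can be summarized with a toy session object. It only counts token positions and generates placeholder ids, so it says nothing about FT's actual buffers. The InteractiveSession class and its fields are hypothetical, and treating every generated token as already cached is a simplification.

```python
class InteractiveSession:
    """Toy bookkeeping for continue_gen: which positions already have a k/v cache."""

    def __init__(self, session_len):
        self.session_len = session_len  # capacity of the buffer allocated up front
        self.token_ids = []             # every id seen or generated so far
        self.cached = 0                 # number of positions with k/v cache

    def step(self, new_ids, output_len):
        """Append new ids, 'generate' output_len tokens, return context tokens computed."""
        total = len(self.token_ids) + len(new_ids) + output_len
        if total > self.session_len:
            raise ValueError(f"session_len={self.session_len} is too small for {total} tokens")
        self.token_ids.extend(new_ids)
        context_computed = len(self.token_ids) - self.cached  # only the new ids
        self.token_ids.extend([0] * output_len)  # placeholder ids; a real model samples here
        self.cached = len(self.token_ids)
        return context_computed


# Two rounds of 8 input ids + 32 generated ids, as in the example above.
session = InteractiveSession(session_len=8 + 32 + 8 + 32)
first_round = session.step([818, 262, 938, 3155, 286, 1528, 11, 257], output_len=32)
second_round = session.step([5962, 11, 314, 561, 588, 284, 910, 326], output_len=32)

# Without continue_gen, the second request would resend and recompute everything so far.
recompute_without_flag = 8 + 32 + 8
print(first_round, second_round, recompute_without_flag, len(session.token_ids))
```

The final length of 80 ids per row matches the rows saved in out, and it is also the minimum session_len this two-round session needs.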
Hardware settings (A100 SuperPod architecture):
- Intra node: 8xA100-80GBs (with mclk 1593MHz, pclk 1410MHz) with AMD EPYC 7742 64-Core Processor, linked by NVSwitch
- Inter node: Linked by Infiniband, 8x200Gb/s NICs
We demonstrate the inference time of Megatron and FasterTransformer on Triton, and show the speedup of FasterTransformer compared to Megatron for GPT-175B and GPT-89B. In the GPT experiments, we updated the following parameters:
- head_num = 128
- size_per_head = 160
- num_layers = 105
- data_type = FP16
- vocab_size = 51200
- top_p = 0.9
TP means tensor parallelism, PP means pipeline parallelism.
Batch Size | Input Length | Output Length | Latency of TP-16, PP-1 (ms) | Latency of TP-32, PP-1 (ms) | Latency of TP-8, PP-3 (ms) |
---|---|---|---|---|---|
1 | 20 | 8 | 565 | 431 | 842 |
2 | 20 | 8 | 598 | 455 | 860 |
4 | 20 | 8 | 616 | 493 | 867 |
8 | 20 | 8 | 660 | 523 | 929 |
16 | 20 | 8 | 730 | 575 | 1049 |
32 | 20 | 8 | 865 | 672 | 1283 |
64 | 20 | 8 | 1191 | 942 | 1722 |
128 | 20 | 8 | 1862 | 1431 | 2124 |
256 | 20 | 8 | 3341 | 2483 | 3140 |
1 | 60 | 20 | 1379 | 1037 | 2085 |
2 | 60 | 20 | 1515 | 1110 | 2122 |
4 | 60 | 20 | 1512 | 1198 | 2184 |
8 | 60 | 20 | 1631 | 1295 | 2367 |
16 | 60 | 20 | 1868 | 1454 | 2753 |
32 | 60 | 20 | 2361 | 1804 | 3543 |
64 | 60 | 20 | 3383 | 2646 | 4117 |
128 | 60 | 20 | 5406 | 4099 | 5319 |
256 | 60 | 20 | OOM | 7203 | 8318 |
1 | 128 | 8 | 585 | 451 | 866 |
2 | 128 | 8 | 667 | 508 | 932 |
4 | 128 | 8 | 765 | 606 | 1097 |
8 | 128 | 8 | 990 | 766 | 1434 |
16 | 128 | 8 | 1377 | 1074 | 2104 |
32 | 128 | 8 | 2251 | 1741 | 2623 |
64 | 128 | 8 | 4002 | 3114 | 3578 |
128 | 128 | 8 | OOM | 5784 | 5512 |
256 | 128 | 8 | OOM | 11232 | 9614 |
- head_num = 96
- size_per_head = 128
- num_layers = 96
- data_type = FP16
- vocab_size = 51200
- top_p = 0.9
- tensor_parallel_size = 8 with NVLink
Batch_size | Input Seqlen | Output Seqlen | Megatron Latency (ms) | FT Latency (ms) | FT Speedup |
---|---|---|---|---|---|
1 | 128 | 8 | 660.38 | 488.86 | 1.35 |
2 | 128 | 8 | 687.34 | 509.47 | 1.35 |
4 | 128 | 8 | 1004.88 | 629.64 | 1.60 |
8 | 128 | 8 | 1705.07 | 749.86 | 2.27 |
12 | 128 | 8 | 2365.02 | 886.24 | 2.67 |
16 | 128 | 8 | 3111.57 | 1037.47 | 3.00 |
20 | 128 | 8 | 3723.73 | 1135.72 | 3.28 |
32 | 128 | 8 | 5778.72 | 1547.44 | 3.73 |
1 | 512 | 32 | 2384.78 | 1719.96 | 1.39 |
2 | 512 | 32 | 2503.24 | 1830.56 | 1.37 |
4 | 512 | 32 | 3658.65 | 2092.56 | 1.75 |
8 | 512 | 32 | 6238.79 | 2629.97 | 2.37 |
16 | 512 | 32 | 11409.53 | 3706.23 | 3.08 |
- head_num = 96
- size_per_head = 128
- num_layers = 48
- data_type = FP16
- vocab_size = 51200
- top_p = 0.9
- tensor_parallel_size = 8 with NVLink
Batch_size | Input Seqlen | Output Seqlen | Megatron Latency (ms) | FT Latency (ms) | FT Speedup |
---|---|---|---|---|---|
1 | 128 | 8 | 342.86 | 279.44 | 1.23 |
2 | 128 | 8 | 369.43 | 280.24 | 1.32 |
4 | 128 | 8 | 540.97 | 317.71 | 1.70 |
8 | 128 | 8 | 912.46 | 377.50 | 2.42 |
12 | 128 | 8 | 1263.39 | 445.46 | 2.84 |
16 | 128 | 8 | 1663.39 | 524.80 | 3.17 |
20 | 128 | 8 | 1991.16 | 575.83 | 3.46 |
32 | 128 | 8 | 3086.85 | 786.57 | 3.92 |
1 | 512 | 32 | 1244.81 | 887.52 | 1.40 |
2 | 512 | 32 | 1357.54 | 940.11 | 1.44 |
4 | 512 | 32 | 1970.08 | 1133.22 | 1.74 |
8 | 512 | 32 | 3341.66 | 1415.02 | 2.36 |
16 | 512 | 32 | 6090.07 | 1952.2 | 3.12 |
- head_num = 48
- size_per_head = 128
- num_layers = 44
- data_type = FP16
- vocab_size = 51200
- top_p = 0.9
TP means tensor parallelism
Batch_size | Input Length | Output Length | Latency of single GPU (ms) | Latency of 2-way TP (ms) | Latency of 4-way TP (ms) | Latency of 8-way TP (ms) |
---|---|---|---|---|---|---|
1 | 20 | 8 | 225 | 147 | 102 | 89 |
2 | 20 | 8 | 225 | 152 | 108 | 94 |
4 | 20 | 8 | 228 | 158 | 113 | 100 |
8 | 20 | 8 | 239 | 169 | 121 | 107 |
16 | 20 | 8 | 268 | 191 | 133 | 113 |
32 | 20 | 8 | 331 | 230 | 155 | 127 |
64 | 20 | 8 | 452 | 314 | 200 | 169 |
128 | 20 | 8 | 726 | 484 | 318 | 256 |
256 | 20 | 8 | 1352 | 844 | 533 | 416 |
1 | 60 | 20 | 560 | 358 | 248 | 212 |
2 | 60 | 20 | 562 | 378 | 262 | 222 |
4 | 60 | 20 | 582 | 393 | 274 | 236 |
8 | 60 | 20 | 635 | 429 | 299 | 247 |
16 | 60 | 20 | 748 | 510 | 345 | 272 |
32 | 60 | 20 | 933 | 620 | 418 | 325 |
64 | 60 | 20 | 1352 | 887 | 574 | 454 |
128 | 60 | 20 | 2218 | 1384 | 928 | 699 |
256 | 60 | 20 | 4141 | 2424 | 1574 | 1152 |
1 | 128 | 20 | 566 | 362 | 254 | 217 |
2 | 128 | 20 | 580 | 385 | 267 | 227 |
4 | 128 | 20 | 629 | 421 | 290 | 244 |
8 | 128 | 20 | 740 | 487 | 333 | 267 |
16 | 128 | 20 | 931 | 618 | 405 | 312 |
32 | 128 | 20 | 1335 | 862 | 547 | 418 |
64 | 128 | 20 | 2157 | 1379 | 832 | 634 |
128 | 128 | 20 | 3830 | 2365 | 1439 | 1072 |
256 | 128 | 20 | OOM | 4414 | 2639 | 1943 |
1 | 80 | 200 | 5609 | 3532 | 2438 | 2053 |
2 | 80 | 200 | 5588 | 3682 | 2544 | 2095 |
4 | 80 | 200 | 5661 | 3797 | 2646 | 2206 |
8 | 80 | 200 | 5838 | 3984 | 2741 | 2268 |
16 | 80 | 200 | 6167 | 4356 | 2964 | 2307 |
32 | 80 | 200 | 6864 | 4817 | 3233 | 2566 |
64 | 80 | 200 | 8290 | 6003 | 3815 | 3173 |
128 | 80 | 200 | OOM | 7884 | 5239 | 4303 |
256 | 80 | 200 | OOM | 12007 | 7603 | 6087 |
1 | 200 | 200 | 5648 | 3544 | 2481 | 2080 |
2 | 200 | 200 | 5686 | 3739 | 2597 | 2131 |
4 | 200 | 200 | 5830 | 3876 | 2719 | 2249 |
8 | 200 | 200 | 6146 | 4123 | 2851 | 2338 |
16 | 200 | 200 | 6815 | 4672 | 3152 | 2475 |
32 | 200 | 200 | 8111 | 5488 | 3634 | 2811 |
64 | 200 | 200 | 10766 | 7256 | 4536 | 3621 |
128 | 200 | 200 | OOM | 10538 | 6618 | 5229 |
256 | 200 | 200 | OOM | OOM | 10447 | 7895 |
- head_num = 32
- size_per_head = 128
- num_layers = 32
- data_type = FP16
- vocab_size = 51200
- top_p = 0.9
- tensor_para_size = 1
Batch_size | Input Seqlen | Output Seqlen | FT Latency (ms) | Memory Usage (GB) |
---|---|---|---|---|
1 | 128 | 8 | 98.29 | 15.55 |
2 | 128 | 8 | 106.74 | 15.66 |
4 | 128 | 8 | 123.47 | 15.87 |
8 | 128 | 8 | 162.51 | 16.31 |
16 | 128 | 8 | 241.16 | 17.19 |
32 | 128 | 8 | 400.35 | 18.84 |
64 | 128 | 8 | 718.07 | 22.17 |
1 | 512 | 32 | 384.70 | 15.96 |
2 | 512 | 32 | 425.88 | 16.30 |
4 | 512 | 32 | 514.93 | 16.99 |
8 | 512 | 32 | 699.62 | 18.72 |
16 | 512 | 32 | 1068.88 | 22.17 |
32 | 512 | 32 | 1814.03 | 28.73 |
64 | 512 | 32 | 3306.41 | 41.84 |
- head_num = 32
- size_per_head = 64
- num_layers = 24
- data_type = FP16
- vocab_size = 51200
- top_p = 0.9
- tensor_para_size = 1
Batch_size | Input Seqlen | Output Seqlen | FT Latency (ms) | Memory Usage (GB) |
---|---|---|---|---|
1 | 128 | 8 | 36.76 | 8.67 |
2 | 128 | 8 | 39.16 | 5.39 |
4 | 128 | 8 | 43.32 | 5.49 |
8 | 128 | 8 | 52.92 | 5.66 |
16 | 128 | 8 | 74.44 | 6.00 |
32 | 128 | 8 | 116.74 | 6.66 |
64 | 128 | 8 | 201.71 | 7.97 |
1 | 512 | 32 | 135.85 | 5.58 |
2 | 512 | 32 | 150.57 | 5.71 |
4 | 512 | 32 | 178.25 | 5.97 |
8 | 512 | 32 | 232.11 | 6.64 |
16 | 512 | 32 | 345.96 | 7.98 |
32 | 512 | 32 | 578.52 | 10.52 |
64 | 512 | 32 | 1036.21 | 15.61 |
- head_num = 16
- size_per_head = 64
- num_layers = 24
- data_type = FP16
- vocab_size = 51200
- top_p = 0.9
- tensor_para_size = 1
Batch_size | Input Seqlen | Output Seqlen | FT Latency (ms) | Memory Usage (GB) |
---|---|---|---|---|
1 | 128 | 8 | 25.43 | 3.43 |
2 | 128 | 8 | 26.42 | 3.46 |
4 | 128 | 8 | 28.00 | 3.51 |
8 | 128 | 8 | 32.56 | 3.61 |
16 | 128 | 8 | 42.87 | 3.78 |
32 | 128 | 8 | 62.61 | 4.13 |
64 | 128 | 8 | 104.51 | 4.81 |
1 | 512 | 32 | 92.01 | 3.57 |
2 | 512 | 32 | 97.87 | 3.65 |
4 | 512 | 32 | 110.70 | 3.78 |
8 | 512 | 32 | 136.45 | 4.12 |
16 | 512 | 32 | 189.91 | 4.80 |
32 | 512 | 32 | 296.15 | 6.09 |
64 | 512 | 32 | 529.18 | 8.67 |
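To relate the benchmark configurations above to model sizes, the following back-of-the-envelope sketch estimates the parameter count from head_num, size_per_head, and num_layers. It uses the common approximation of 12 * hidden^2 parameters per transformer layer, which assumes inter_size = 4 * hidden and ignores biases, layernorm weights, and position embeddings, plus one vocab_size * hidden embedding table. The approx_params helper is illustrative, and the results are rough estimates rather than exact checkpoint sizes.

```python
def approx_params(head_num, size_per_head, num_layers, vocab_size=51200):
    """Approximate GPT parameter count from the benchmark hyper-parameters."""
    hidden = head_num * size_per_head
    # 4*h^2 for attention (Q, K, V, output projection) + 8*h^2 for the FFN (inter_size = 4*h).
    per_layer = 12 * hidden * hidden
    return num_layers * per_layer + vocab_size * hidden


# (head_num, size_per_head, num_layers) of the benchmark configurations, in order.
CONFIGS = [
    (128, 160, 105),
    (96, 128, 96),
    (96, 128, 48),
    (48, 128, 44),
    (32, 128, 32),
    (32, 64, 24),
    (16, 64, 24),
]

for head_num, size_per_head, num_layers in CONFIGS:
    params = approx_params(head_num, size_per_head, num_layers)
    fp16_gb = params * 2 / 1e9  # two bytes per parameter under FP16
    print(f"heads={head_num:3d} layers={num_layers:3d}: "
          f"~{params / 1e9:7.2f}B params, ~{fp16_gb:7.1f} GB of FP16 weights")
```

For the 96-head, 96-layer configuration this gives roughly 175B parameters and about 350 GB of FP16 weights, in line with the GPT-3 figure quoted at the beginning of this document.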