LLaMA is a collection of foundation language models with 6.7B, 13.0B, 32.5B, and 65.2B parameters. The model information is shown below.
The paper reports that their code processes around 380 tokens/sec/GPU on 2048 A100 GPUs with 80GB of RAM, and the training performance is summarized in the table below.
The analyses below are performed with llm-analysis as a showcase.
For training analysis, we use flops_efficiency = 0.5 and hbm_memory_efficiency = 0.9, as observed in the literature.
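These efficiencies also line up with the reported throughput. A quick back-of-envelope check (not llm-analysis output, and using the rough 6-FLOPs-per-parameter-per-token rule) shows that 380 tokens/sec/GPU for the 65B model corresponds to roughly half of the A100's 312 TFLOPS peak:

```python
# Back-of-envelope check: does 380 tokens/sec/GPU imply ~0.5 FLOPS efficiency?
# Uses the rough 6 * num_params FLOPs-per-token rule for training and ignores
# the extra attention FLOPs that llm-analysis models explicitly.
num_params = 65.2e9           # LLaMA-65B, the model the 380 tokens/sec/GPU refers to
tokens_per_sec_per_gpu = 380  # throughput reported in the LLaMA paper
peak_flops = 312e12           # A100-SXM-80GB peak FP16/BF16 FLOPS

achieved_flops = 6 * num_params * tokens_per_sec_per_gpu
print(f"achieved: {achieved_flops / 1e12:.0f} TFLOPS/GPU")      # ~149
print(f"flops efficiency ~ {achieved_flops / peak_flops:.2f}")  # ~0.48
```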
The setup follows the paper description and is detailed in the run_train.sh script.
The output summaries are in the outputs_train directory (both raw and readable JSONs are produced).
For example, here is a full readable summary of the LLaMA-65B model.
llm-analysis provides details about time and memory, as presented below.
The table below compares the outputs from llm-analysis to the reported GPU-hours
in the LLaMA paper. The last column shows the difference, suggesting llm-analysis gives a good estimate of training time.
model | LLaMA paper GPU-hours | llm-analysis GPU-hours | diff. % |
---|---|---|---|
LLaMA-7B | 82,432 | 77,832 | -5.58 |
LLaMA-13B | 135,168 | 148,047 | +9.53 |
LLaMA-33B | 530,432 | 510,213 | -3.81 |
LLaMA-65B | 1,022,362 | 1,015,701 | -0.65 |
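For intuition, a rough 6 * params * tokens estimate with the efficiencies above lands in the same ballpark. The sketch below assumes the ~1.4T training tokens reported in the LLaMA paper for the 65B model and ignores the attention FLOPs, memory, and communication that llm-analysis models in detail:

```python
# Rough GPU-hours estimate for LLaMA-65B from the 6*N*D approximation.
num_params = 6.52e10
num_tokens = 1.4e12        # training tokens per the LLaMA paper (assumption here)
peak_flops = 312e12        # A100-SXM-80GB
flops_efficiency = 0.5

total_flops = 6 * num_params * num_tokens
gpu_hours = total_flops / (peak_flops * flops_efficiency) / 3600
print(f"~{gpu_hours:,.0f} GPU-hours")  # ~975,000, in line with the table above
```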
Below are some example analyses that illustrate the benefit of using a different training setup. There are many more you can do, so try it out!
Setting the FLOPS and memory efficiency (flops_efficiency and hbm_memory_efficiency in the script) to 1 gives the lower-bound time estimation; it shows how much the performance could be further improved through system optimizations. The corresponding output summaries are in the outputs_train_ideal directory.
The GPU-hours are halved as expected, as the training is compute-bound.
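That is exactly what the compute-bound scaling predicts: the estimated time goes as 1 / flops_efficiency, so raising it from 0.5 to 1.0 roughly halves the GPU-hours (a sketch of the scaling, not llm-analysis internals):

```python
# Compute-bound: time ~ total_flops / (num_gpus * peak_flops * flops_efficiency),
# so GPU-hours scale as 1 / flops_efficiency.
gpu_hours_fe_05 = 1_015_701             # llm-analysis estimate for LLaMA-65B above
ideal_gpu_hours = gpu_hours_fe_05 * 0.5 / 1.0
print(f"ideal lower bound ~ {ideal_gpu_hours:,.0f} GPU-hours")  # ~508,000
```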
We can see quantitatively from the new time breakdown below that the time share of tensor parallelism communication (tp_comm) and input embedding lookup (input_embedding) increases while the other parts decrease.
In this analysis, we show how the training performance changes when different GPUs are used. To list all predefined GPUs, run python -m llm_analysis.config list_gpu_configs.
Setting gpu_name to h100-sxm-80gb in the run_train.sh script uses H100-SXM 80GB GPUs. The corresponding output summaries are in the outputs_train_h100 directory.
The GPU-hours are reduced by 68% as expected, as the training is compute-bound and h100-sxm-80gb has a peak FP16 TFLOPS of 989 compared to 312 for a100-sxm-80gb.
While the peak FLOPS from A100 to H100 increases by more than 3x, the memory bandwidth only increases from 2032 GB/s to 3350 GB/s; thus tp_comm takes a larger share in the time breakdown, as shown below.
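Both the 68% figure and the growing tp_comm share follow from simple ratios of the numbers quoted above (a sketch, not llm-analysis output):

```python
# Compute-bound GPU-hours scale as 1 / peak_flops, so switching A100 -> H100:
a100_tflops, h100_tflops = 312, 989
print(f"GPU-hours reduction ~ {1 - a100_tflops / h100_tflops:.0%}")  # ~68%

# Compute gets ~3.2x faster while bandwidth grows far less, so the
# bandwidth-bound parts (e.g. tp_comm) take a larger share of the total time.
a100_bw, h100_bw = 2032, 3350        # GB/s, as quoted above
print(f"compute speedup   ~ {h100_tflops / a100_tflops:.2f}x")   # ~3.17x
print(f"bandwidth speedup ~ {h100_bw / a100_bw:.2f}x")           # ~1.65x
```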
As an example inference analysis, unless otherwise mentioned, we use ideal FLOPS and memory efficiency, w16a16e16, a context sequence length of 512, a batch size of 1, 32 tokens to generate, and A100-SXM-80GB GPUs. Note that LLaMA-65B requires 2 GPUs when using w16a16e16.
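The 2-GPU requirement follows from the weight memory alone (a rough sketch; KV cache and activations only add to it):

```python
# With 16-bit weights (w16), LLaMA-65B's weights alone exceed one 80 GB GPU.
num_params = 65.2e9
bytes_per_weight = 2                                      # w16 = 2 bytes per weight
weight_gb = num_params * bytes_per_weight / 1e9
print(f"weights ~ {weight_gb:.0f} GB vs 80 GB per A100")  # ~130 GB -> tp_size=2
```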
Below shows the prefill and decode (per token) latency for the LLaMA models.
The script used is in run_infer.sh and the output summaries are in outputs_infer_ideal.
Results with flops_efficiency (fe) = 0.7 and hbm_memory_efficiency (hbme) = 0.9 (in outputs_infer) are also presented as a comparison.
model | ideal prefill latency (ms) | ideal decode latency (ms) | prefill latency with fe=0.7, hbme=0.9 (ms) | decode latency with fe=0.7, hbme=0.9 (ms) |
---|---|---|---|---|
LLaMA-7B | 22.41 | 6.73 | 31.89 | 7.48 |
LLaMA-13B | 43.1 | 12.94 | 61.38 | 14.38 |
LLaMA-33B | 107.74 | 32.36 | 153.6 | 35.95 |
LLaMA-65B (tp_size=2) | 113.33 | 33.85 | 159.47 | 37.47 |
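These ideal numbers agree with a simple roofline-style estimate: prefill is compute-bound (roughly 2 FLOPs per parameter per input token), while decode is memory-bound (the weights are streamed once per generated token). A sketch for LLaMA-7B, using the A100 peak of 312 TFLOPS and the 2032 GB/s memory bandwidth quoted earlier:

```python
# Roofline-style sanity check of the ideal LLaMA-7B inference latencies above.
num_params = 6.7e9
seq_len = 512            # context sequence length used in this analysis
peak_flops = 312e12      # A100-SXM-80GB peak FP16 FLOPS
hbm_bw = 2032e9          # bytes/s

# Prefill: compute-bound, ~2 FLOPs per parameter per input token.
prefill_ms = 2 * num_params * seq_len / peak_flops * 1e3
print(f"prefill ~ {prefill_ms:.1f} ms")   # ~22.0 ms vs 22.41 ms in the table

# Decode: memory-bound, all 16-bit weights read once per generated token.
decode_ms = num_params * 2 / hbm_bw * 1e3
print(f"decode  ~ {decode_ms:.1f} ms")    # ~6.6 ms vs 6.73 ms in the table
```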
Below shows the time breakdown for the decode stage in ideal model inference.
Below shows the prefill and decode (per token) latency with different quantization schemes (data types) in ideal model inference. The following data type configurations are compared against w16a16e16: w4a4e16, w4a8e16, w4a16e16, and w8a8e16. The output summaries for these schemes are in output_infer_ideal_[scheme].
Note that to leverage INT4 (or INT8) GEMM, both weights and activations need to be in INT4 (or INT8).
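Put differently, the GEMM precision is set by the wider of the weight and activation types, which is why w4a16e16 matches w16a16e16 for prefill while w4a4e16 and the a8 schemes get roughly 4x and 2x prefill speedups in the tables below (a schematic mapping, not llm-analysis code):

```python
# GEMM precision for each scheme: the wider of the weight/activation widths.
schemes = {"w16a16e16": (16, 16), "w4a4e16": (4, 4), "w4a8e16": (4, 8),
           "w4a16e16": (4, 16), "w8a8e16": (8, 8)}
for name, (w_bits, a_bits) in schemes.items():
    gemm_bits = max(w_bits, a_bits)   # both operands must be in this width
    speedup = 16 // gemm_bits         # ideal compute speedup over FP16 GEMM
    print(f"{name}: {gemm_bits}-bit GEMM -> ~{speedup}x ideal prefill speedup")
```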
- prefill latency
prefill latency (ms) | w16a16e16 | w4a4e16 | w4a8e16 | w4a16e16 | w8a8e16 |
---|---|---|---|---|---|
LLaMA-7B | 22.41 | 5.9 | 11.4 | 22.41 | 11.4 |
LLaMA-13B | 43.1 | 11.2 | 21.83 | 43.1 | 21.83 |
LLaMA-33B | 107.74 | 27.69 | 54.38 | 107.74 | 54.38 |
LLaMA-65B | 113.33 (tp_size=2) | 55.2 (tp_size=1) | 108.83 (tp_size=1) | 216.08 (tp_size=1) | 108.83 (tp_size=1) |
- decode latency
decode latency (ms) | w16a16e16 | w4a4e16 | w4a8e16 | w4a16e16 | w8a8e16 |
---|---|---|---|---|---|
LLaMA-7B | 6.73 | 1.78 | 1.85 | 1.99 | 3.43 |
LLaMA-13B | 12.94 | 3.36 | 3.47 | 3.69 | 6.55 |
LLaMA-33B | 32.36 | 8.25 | 8.46 | 8.89 | 16.28 |
LLaMA-65B | 33.85 (tp_size=2) | 16.41 (tp_size=1) | 16.76 (tp_size=1) | 17.47 (tp_size=1) | 32.56 (tp_size=1) |
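Decode latency, in contrast, tracks the bytes of weights read per generated token rather than the GEMM precision, which is why the three w4 schemes land close together while w8a8e16 sits near half of w16a16e16. A rough check for LLaMA-7B, under the same 2032 GB/s bandwidth assumption as before:

```python
# Decode is memory-bound: latency ~ weight bytes / HBM bandwidth (per token).
num_params = 6.7e9
hbm_bw = 2032e9                                  # bytes/s
for name, bytes_per_weight in [("w16", 2), ("w8", 1), ("w4", 0.5)]:
    ms = num_params * bytes_per_weight / hbm_bw * 1e3
    print(f"{name}: ~{ms:.2f} ms/token")
# w16 ~6.59 ms, w8 ~3.30 ms, w4 ~1.65 ms, close to the table above;
# llm-analysis also accounts for activation/KV-cache traffic, hence its
# slightly higher numbers.
```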