Update TensorRT-LLM #2008

Merged · 1 commit · Jul 23, 2024

3 changes: 3 additions & 0 deletions .gitignore
@@ -48,3 +48,6 @@ results_trt/

# Generated files
cpp/include/tensorrt_llm/executor/version.h

# User config files
CMakeUserPresets.json
2 changes: 1 addition & 1 deletion README.md
@@ -7,7 +7,7 @@ TensorRT-LLM
[![Documentation](https://img.shields.io/badge/docs-latest-brightgreen.svg?style=flat)](https://nvidia.github.io/TensorRT-LLM/)
[![python](https://img.shields.io/badge/python-3.10.12-green)](https://www.python.org/downloads/release/python-31012/)
[![cuda](https://img.shields.io/badge/cuda-12.4.1-green)](https://developer.nvidia.com/cuda-downloads)
[![trt](https://img.shields.io/badge/TRT-10.1.0-green)](https://developer.nvidia.com/tensorrt)
[![trt](https://img.shields.io/badge/TRT-10.2.0-green)](https://developer.nvidia.com/tensorrt)
[![version](https://img.shields.io/badge/release-0.12.0.dev-green)](./tensorrt_llm/version.py)
[![license](https://img.shields.io/badge/license-Apache%202-blue)](./LICENSE)

11 changes: 11 additions & 0 deletions benchmarks/README.md
@@ -0,0 +1,11 @@
# TensorRT-LLM Benchmarks

## Overview

There are currently three workflows to benchmark TensorRT-LLM:
* [C++ benchmarks](./cpp)
- The recommended workflow that uses the TensorRT-LLM C++ API and can take advantage of the latest features of TensorRT-LLM.
* [Python benchmarks](./python)
- The Python benchmarking scripts can only benchmark the Python runtime, which does not support the latest features, such as in-flight batching.
* [The Python benchmarking suite](./suite)
- This benchmarking suite is a work in progress and is subject to large changes.
130 changes: 60 additions & 70 deletions benchmarks/cpp/README.md
@@ -1,7 +1,7 @@
# Benchmark for C++ Runtime
# Benchmark C++ Runtime

This document explains how to benchmark the models supported by TensorRT-LLM on a single GPU, a single node with
multiple GPUs or multiple nodes with multiple GPUs.
multiple GPUs or multiple nodes with multiple GPUs using the C++ runtime.

## Usage

@@ -16,58 +16,11 @@ Windows users: Follow the
instead, and be sure to set DLL paths as specified in
[Extra Steps for C++ Runtime Usage](../../windows/README.md#extra-steps-for-c-runtime-usage).

### 2. Launch C++ benchmarking (Fixed BatchSize/InputLen/OutputLen)

#### Prepare TensorRT-LLM engine(s)

Before you launch C++ benchmarking, please make sure that you have already built engine(s) using the TensorRT-LLM API; the C++ benchmarking code cannot generate engine(s) for you.

Use `trtllm-build` to build the TRT-LLM engine. Alternatively, if you have already benchmarked the Python runtime, you can reuse the previously built engine(s); please see the [Python benchmark documentation](../python/README.md).

#### Launch benchmarking

For detailed usage, run the following:
```
cd cpp/build

# You can directly execute the binary for help information
./benchmarks/gptSessionBenchmark --help
./benchmarks/bertBenchmark --help
```

Take GPT-350M as an example for single GPU

```
./benchmarks/gptSessionBenchmark \
--engine_dir "../../benchmarks/gpt_350m/" \
--batch_size "1" \
--input_output_len "60,20"

# Expected output:
# [BENCHMARK] batch_size 1 input_length 60 output_length 20 latency(ms) 40.81
```
Take GPT-175B as an example for multiple GPUs
```
mpirun -n 8 ./benchmarks/gptSessionBenchmark \
--engine_dir "../../benchmarks/gpt_175b/" \
--batch_size "1" \
--input_output_len "60,20"

# Expected output:
# [BENCHMARK] batch_size 1 input_length 60 output_length 20 latency(ms) 792.14
```

If you want to obtain context and generation logits, you can build an engine with `--gather_context_logits` and `--gather_generation_logits`, respectively. Enabling `--gather_all_token_logits` enables both of them.

If you want to print the logits, you can run `gptSessionBenchmark` with `--print_all_logits`. Note that this prints a large number of logit values and can affect performance.

*Please note that the expected outputs in this document are for reference only; actual performance numbers depend on the GPU you are using.*

### 3. Launch Batch Manager benchmarking (Inflight/V1 batching)
### 2. Launch C++ benchmarking (Inflight/V1 batching)

#### Prepare dataset

Run a preprocessing script to prepare/generate dataset into a json that gptManagerBenchmark can consume later. The processed output json has *input tokens length, input token ids and output tokens length*
Run a preprocessing script to prepare/generate a dataset into a JSON file that `gptManagerBenchmark` can consume later. The processed output JSON contains the *input token length, input token IDs, and output token length* for each request; a quick way to inspect it is sketched below.
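
For a quick sanity check of a prepared dataset, a minimal inspection sketch is shown below. The field names (`samples`, `input_len`, `input_ids`, `output_len`) are illustrative assumptions, not the authoritative schema; `prepare_dataset.py` defines the actual format for your version.

```python
# Minimal sketch for inspecting a prepared dataset. The field names used here
# (samples, input_len, input_ids, output_len) are illustrative assumptions,
# not the authoritative schema emitted by prepare_dataset.py.
import json

with open("preprocessed_dataset.json") as f:
    data = json.load(f)

samples = data["samples"] if isinstance(data, dict) else data
for sample in samples[:3]:
    print(sample["input_len"], len(sample["input_ids"]), sample["output_len"])
```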

This tool can be used in two different modes of traffic generation.

@@ -127,7 +80,8 @@ For `tokenizer`, specifying the path to the local tokenizer that have already be


#### Prepare TensorRT-LLM engines
Please make sure that the engines are built with the arguments `--use_inflight_batching` and `--remove_input_padding` if you'd like to benchmark in-flight batching; for more details, please see the documentation in the TensorRT-LLM examples.

Before you launch C++ benchmarking, please make sure that you have already built engine(s) using the `trtllm-build` command. For more details on building engine(s), please refer to the [Quick Start Guide](../../docs/source/quick-start-guide.md).

#### Launch benchmarking

@@ -139,21 +93,10 @@ cd cpp/build
./benchmarks/gptManagerBenchmark --help
```

Take GPT-350M as an example for single GPU V1 batching
```
./benchmarks/gptManagerBenchmark \
--engine_dir ../../examples/gpt/trt_engine/gpt2/fp16/1-gpu/ \
--type V1 \
--request_rate 10 \
--dataset ../../benchmarks/cpp/preprocessed_dataset.json \
--max_num_samples 500
```

Take GPT-350M as an example for 2-GPU inflight batching
```
mpirun -n 2 ./benchmarks/gptManagerBenchmark \
--engine_dir ../../examples/gpt/trt_engine/gpt2-ib/fp16/2-gpu/ \
--type IFB \
--request_rate 10 \
--dataset ../../benchmarks/cpp/preprocessed_dataset.json \
--max_num_samples 500
@@ -163,10 +106,11 @@ mpirun -n 2 ./benchmarks/gptManagerBenchmark \

#### Emulated static batching

To emulate `gptSessionBenchmark` static batching, you can use `gptManagerBenchmark` with the `--static_emulated_batch_size` and `--static_emulated_timeout` arguments.
To emulate the deprecated `gptSessionBenchmark` static batching, you can use `gptManagerBenchmark` with the `--static_emulated_batch_size` and `--static_emulated_timeout` arguments.

Given a `static_emulated_batch_size` of `n`, the server will wait for `n` requests to arrive before submitting them to the batch manager at once. If the `static_emulated_timeout` (in ms) is reached before `n` requests are collected, the batch is submitted prematurely with the current request count. New batches are only submitted once the previous batch has been processed completely.
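
The collect-then-submit behaviour described above can be summarised with a small sketch. This is a conceptual illustration only (the queue and function names are assumptions), not the actual `gptManagerBenchmark` implementation.

```python
# Conceptual sketch of the emulated static batching policy described above:
# submit a batch once `batch_size` requests have arrived, or once the timeout
# expires, whichever happens first. Illustrative only; not the actual
# gptManagerBenchmark implementation.
import queue
import time


def collect_batch(pending: queue.Queue, batch_size: int, timeout_ms: float) -> list:
    batch = []
    deadline = time.monotonic() + timeout_ms / 1000.0
    while len(batch) < batch_size:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break  # timeout reached: submit the partial batch as-is
        try:
            batch.append(pending.get(timeout=remaining))
        except queue.Empty:
            break
    # The next batch is only collected after this one has been fully processed.
    return batch
```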

`gptSessionBenchmark` uses fixed input/output lengths for benchmarking. A similar dataset for `gptManagerBenchmark` can be generated with the preprocessing script, e.g.
Datasets with fixed input/output lengths for benchmarking can be generated with the preprocessing script, e.g.
```
python prepare_dataset.py \
--output tokens-fixed-lengths.json \
@@ -181,7 +125,6 @@ Take GPT-350M as an example for single GPU with static batching
```
./benchmarks/gptManagerBenchmark \
--engine_dir ../../examples/gpt/trt_engine/gpt2/fp16/1-gpu/ \
--type IFB \
--request_rate -1 \
--static_emulated_batch_size 32 \
--static_emulated_timeout 100 \
@@ -239,7 +182,7 @@ ${HOME}/.local/bin/trtllm-build \
--lora_target_modules attn_q attn_k attn_v attn_dense mlp_h_to_4h mlp_4h_to_h mlp_gate \
--max_lora_rank ${MAX_LORA_RANK}

NUM_LORAS=(8 16 24 32 64 128 256)
NUM_LORAS=(8 16)
NUM_REQUESTS=1024

# Convert LoRA to cpp format
@@ -271,7 +214,7 @@ for nloras in ${NUM_LORAS[@]}; do
done

# Generate random lora weights for 256 adapters
python benchmarks/cpp/utils/generate_rand_loras.py ${CPP_LORA} ${EG_DIR}/loras 256
python benchmarks/cpp/utils/generate_rand_loras.py ${CPP_LORA} ${EG_DIR}/loras 16

# perform benchmarking

@@ -284,7 +227,7 @@ mpirun -n ${TP} --output-filename ${EG_DIR}/log-base-lora \
--dataset "${EG_DIR}/data/token-norm-dist.json" \
--lora_host_cache_bytes 8589934592 \
--lora_num_device_mod_layers $(( 32 * $NUM_LAYERS * $NUM_LORA_MODS * $MAX_LORA_RANK )) \
--kv_cache_free_gpu_mem_fraction 0.80 \
--kv_cache_free_gpu_mem_fraction 0.70 \
--log_level info \
--eos_id ${EOS_ID}

@@ -302,9 +245,56 @@ for nloras in ${NUM_LORAS[@]}; do
--dataset "${EG_DIR}/data/token-norm-dist-lora-${nloras}.json" \
--lora_host_cache_bytes 8589934592 \
--lora_num_device_mod_layers $(( 16 * $NUM_LAYERS * $NUM_LORA_MODS * $MAX_LORA_RANK )) \
--kv_cache_free_gpu_mem_fraction 0.80 \
--kv_cache_free_gpu_mem_fraction 0.70 \
--log_level info \
--eos_id ${EOS_ID} \
--lora_dir ${EG_DIR}/loras
done
```
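
The `--lora_num_device_mod_layers` values above are computed from the build parameters; the sketch below just restates that shell arithmetic in Python. The concrete values are placeholder assumptions for illustration, since the variable definitions are collapsed in this diff.

```python
# Restates the shell arithmetic used for --lora_num_device_mod_layers above.
# The concrete values are placeholder assumptions for illustration; use the
# NUM_LAYERS, NUM_LORA_MODS and MAX_LORA_RANK of your own engine build.
NUM_LAYERS = 40      # assumption: number of transformer layers in the model
NUM_LORA_MODS = 7    # assumption: one per --lora_target_modules entry above
MAX_LORA_RANK = 64   # assumption: matches --max_lora_rank above

# Mirrors $(( 16 * $NUM_LAYERS * $NUM_LORA_MODS * $MAX_LORA_RANK )) from the
# benchmark command above.
print(16 * NUM_LAYERS * NUM_LORA_MODS * MAX_LORA_RANK)
```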

### 3. [DEPRECATED] Launch C++ static batching benchmarking (Fixed BatchSize/InputLen/OutputLen)

#### Prepare TensorRT-LLM engine(s)

Before you launch C++ benchmarking, please make sure that you have already built engine(s) using the TensorRT-LLM API; the C++ benchmarking code cannot generate engine(s) for you.

Use `trtllm-build` to build the TRT-LLM engine. Alternatively, if you have already benchmarked the Python runtime, you can reuse the previously built engine(s); please see the [Python benchmark documentation](../python/README.md).

#### Launch benchmarking

For detailed usage, run the following:
```
cd cpp/build

# You can directly execute the binary for help information
./benchmarks/gptSessionBenchmark --help
./benchmarks/bertBenchmark --help
```

Take GPT-350M as an example for single GPU

```
./benchmarks/gptSessionBenchmark \
--engine_dir "../../benchmarks/gpt_350m/" \
--batch_size "1" \
--input_output_len "60,20"

# Expected output:
# [BENCHMARK] batch_size 1 input_length 60 output_length 20 latency(ms) 40.81
```
Take GPT-175B as an example for multiple GPUs
```
mpirun -n 8 ./benchmarks/gptSessionBenchmark \
--engine_dir "../../benchmarks/gpt_175b/" \
--batch_size "1" \
--input_output_len "60,20"

# Expected output:
# [BENCHMARK] batch_size 1 input_length 60 output_length 20 latency(ms) 792.14
```
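
The `[BENCHMARK]` lines shown above follow a simple `key value` layout, so they are easy to post-process. A minimal sketch is given below, assuming the format shown in the expected output; the helper itself is illustrative and not part of TensorRT-LLM.

```python
# Illustrative helper (not part of TensorRT-LLM) for collecting numbers from
# the "[BENCHMARK] key value key value ..." lines, assuming the layout shown
# in the expected output above.
def parse_benchmark_line(line: str) -> dict:
    tokens = line.split()
    assert tokens and tokens[0] == "[BENCHMARK]", "not a benchmark result line"
    fields = tokens[1:]
    # Pair up alternating keys and values, e.g. ("batch_size", "1").
    return {key: float(value) for key, value in zip(fields[::2], fields[1::2])}


line = "[BENCHMARK] batch_size 1 input_length 60 output_length 20 latency(ms) 40.81"
print(parse_benchmark_line(line)["latency(ms)"])  # 40.81
```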

If you want to obtain context and generation logits, you can build an engine with `--gather_context_logits` and `--gather_generation_logits`, respectively. Enabling `--gather_all_token_logits` enables both of them.

If you want to print the logits, you can run `gptSessionBenchmark` with `--print_all_logits`. Note that this prints a large number of logit values and can affect performance.

*Please note that the expected outputs in this document are for reference only; actual performance numbers depend on the GPU you are using.*
14 changes: 14 additions & 0 deletions benchmarks/cpp/gptManagerBenchmark.cpp
@@ -155,6 +155,7 @@ struct BenchmarkParams
std::optional<SizeType32> maxNumTokens{std::nullopt};
int randomSeed = 430;
std::optional<int> maxAttentionWindow{std::nullopt};
bool multiBlockMode{false};

// lora / peft params
std::optional<std::string> loraDir{std::nullopt};
@@ -820,6 +821,7 @@ class ExecutorServer
executorConfig.setDecodingConfig(texec::DecodingConfig(
benchmarkParams.medusaChoices.has_value() ? texec::DecodingMode::Medusa() : texec::DecodingMode::Auto(),
std::nullopt, benchmarkParams.medusaChoices));
executorConfig.setMultiBlockMode(benchmarkParams.multiBlockMode);

mExecutor = std::make_unique<texec::Executor>(trtEnginePath, texec::ModelType::kDECODER_ONLY, executorConfig);

@@ -1399,6 +1401,7 @@ void benchmarkGptManager(std::filesystem::path const& engineDir, TrtGptModelType
optionalParams.decodingConfig = texec::DecodingConfig(
benchmarkParams.medusaChoices.has_value() ? texec::DecodingMode::Medusa() : texec::DecodingMode::Auto(),
std::nullopt, benchmarkParams.medusaChoices);
optionalParams.multiBlockMode = benchmarkParams.multiBlockMode;

auto const jsonConfig = GptJsonConfig::parse(engineDir / "config.json");
auto const worldConfig = WorldConfig::mpi(jsonConfig.getGpusPerNode(), jsonConfig.getTensorParallelism(),
@@ -1439,6 +1442,7 @@ void benchmarkGptManager(std::filesystem::path const& engineDir, TrtGptModelType
auto startLoraLoad = std::chrono::steady_clock::now();
LoraLib loras(benchmarkParams.loraDir.value());
SizeType32 reqId = 0;
gptServer->resetBatchDeadline();
for (auto const& [taskId, p] : loras.getLoras())
{
reqId++;
@@ -1550,6 +1554,9 @@ void benchmarkExecutor(std::filesystem::path const& engineDir, TrtGptModelType m
std::vector<texec::Request> requests;
for (auto& [taskId, p] : loras.getLoras())
{
// squeeze lora configs and weights since LoraConfig requires them to be 2D tensors
p.first->squeeze(0);
p.second->squeeze(0);
texec::LoraConfig loraConfig(
taskId, texec::detail::ofITensor(p.first), texec::detail::ofITensor(p.second));
Sample s{std::vector<int32_t>{1, 2, 3, 4, 5}, 1, static_cast<int32_t>(taskId)};
@@ -1771,6 +1778,10 @@ int main(int argc, char* argv[])
options.add_options()(
"medusa_choices", "Medusa choices in the format of [[0], [0, 1], [0, 0, 1]]", cxxopts::value<std::string>());

options.add_options()("multi_block_mode",
"Distribute the work across multiple CUDA thread-blocks on the GPU for masked MHA kernel",
cxxopts::value<bool>()->default_value("false"));

auto result = options.parse(argc, argv);

if (result.count("help"))
@@ -1922,6 +1933,9 @@ int main(int argc, char* argv[])
benchmarkParams.medusaChoices = parseVectorOfVectors(result["medusa_choices"].as<std::string>());
}

// Argument: multi_block_mode
benchmarkParams.multiBlockMode = result["multi_block_mode"].as<bool>();

std::optional<TokenIdType> padId;
// Argument: Padding token id
if (result.count("pad_id"))
7 changes: 5 additions & 2 deletions benchmarks/python/README.md
@@ -1,7 +1,10 @@
# Benchmark for Python Runtime
# Benchmark Python Runtime

> [!WARNING] The Python benchmarks are not recommended for benchmarking; please use the C++ benchmarks instead.
> The Python benchmarking scripts can only benchmark the Python runtime, which does not support the latest features, such as in-flight batching.

This document explains how to benchmark the models supported by TensorRT-LLM on a single GPU, a single node with
multiple GPUs or multiple nodes with multiple GPUs.
multiple GPUs or multiple nodes with multiple GPUs using the Python runtime.

## Overview

1 change: 0 additions & 1 deletion benchmarks/python/all_reduce.py
@@ -68,7 +68,6 @@ def allreduce_benchmark(dtype: str,
]:
builder = tllm.Builder()
net = builder.create_network()
net.plugin_config.set_nccl_plugin(dtype, use_custom_all_reduce=True)
_buffers, workspace = current_all_reduce_helper(
).allocate_workspace(mapping, size * dtype_size)
