Update trtllm docs for 0.28.0
ydm-amazon committed May 29, 2024
1 parent bdf6420 commit 808bb1b
Showing 2 changed files with 45 additions and 5 deletions.
@@ -38,6 +38,46 @@ option.max_rolling_batch_size=max(256,value you override)
We understand that finding the maximum number is difficult, so we have precomputed a lookup table for you.
In future releases, we will build these numbers into our container and will not ask you to provide one.

### LMI 0.28.0

The following max token numbers were tested on the machines below with batch sizes up to 128 and input context lengths up to 3700.

| Model         | Machine  | Tensor Parallel Degree | Max Number of Tokens |
|---------------|----------|------------------------|----------------------|
| LLaMA 3 8B    | g5.12xl  | 1                      | 24000                |
| LLaMA 3 8B    | g5.12xl  | 4                      | 176000               |
| LLaMA 2 7B    | g5.12xl  | 1                      | 29000                |
| LLaMA 2 7B    | g5.12xl  | 4                      | 198000               |
| LLaMA 2 13B   | g5.12xl  | 4                      | 127000               |
| Gemma 7B      | g5.12xl  | 4                      | 125000               |
| Gemma 7B      | g5.12xl  | 1                      | 1190                 |
| Falcon 7B     | g5.12xl  | 1                      | 36000                |
| Mistral 7B    | g5.12xl  | 1                      | 35000                |
| Mistral 7B    | g5.12xl  | 4                      | 198000               |
| LLaMA 2 13B   | g6.12xl  | 4                      | 116000               |
| LLaMA 2 13B   | g5.48xl  | 8                      | 142000               |
| LLaMA 2 70B   | g5.48xl  | 8                      | 4100                 |
| LLaMA 3 70B   | g5.48xl  | 8                      | Out of Memory        |
| Mixtral 8x7B  | g5.48xl  | 8                      | 31000                |
| Falcon 40B    | g5.48xl  | 8                      | 32000                |
| CodeLLaMA 34B | g5.48xl  | 8                      | 36000                |
| LLaMA 2 13B   | p4d.24xl | 4                      | 235000               |
| LLaMA 2 70B   | p4d.24xl | 8                      | 97000                |
| LLaMA 3 70B   | p4d.24xl | 8                      | 82000                |
| Mixtral 8x7B  | p4d.24xl | 4                      | 50000                |
| Mixtral 8x7B  | p4d.24xl | 8                      | 112000               |
| Falcon 40B    | p4d.24xl | 4                      | 71000                |
| Mistral 7B    | p4d.24xl | 2                      | 245000               |
| Mistral 7B    | p4d.24xl | 4                      | 498000               |
| CodeLLaMA 34B | p4d.24xl | 4                      | 115000               |
| CodeLLaMA 34B | p4d.24xl | 8                      | 191000               |
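
As a quick illustration of how to apply a value from this table, here is a minimal `serving.properties` sketch for LLaMA 3 8B on a g5.12xl with tensor parallel degree 4 (the model id is a placeholder; adjust it to your deployment):

```
engine=MPI
option.model_id=<your model id>
option.tensor_parallel_degree=4
# 176000 comes from the lookup table row for LLaMA 3 8B on g5.12xl with TP degree 4
option.max_num_tokens=176000
```

If your model, instance type, or tensor parallel degree does not appear in the table, you will still need to supply a value for `option.max_num_tokens` yourself.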

### LMI 0.27.0

The following max token numbers were tested on the machines below with batch sizes up to 128 and input context lengths up to 3700.
serving/docs/lmi/user_guides/trt_llm_user_guide.md — 10 changes: 5 additions & 5 deletions
@@ -20,11 +20,13 @@ The below model architectures are supported for JIT model compilation and tested
* Mistral (since LMI V8 0.26.0)
* Mixtral (since LMI V8 0.26.0)
* Qwen (since LMI V8 0.26.0)
* GPT2/SantaCoder (since LMI V8 0.26.0)
* GPT2/SantaCoder/StarCoder/GPTBigCode (since LMI V8 0.26.0)
* Phi2 (since LMI V9 0.27.0)
* OPT (since LMI V9 0.27.0)
* Gemma (since LMI V10 0.28.0)

TRT-LLM LMI v9 0.27.0 containers come with [TRT-LLM 0.8.0](https://github.com/NVIDIA/TensorRT-LLM/releases/tag/v0.8.0).
For models that are not listed here but are supported by [TRT-LLM](https://github.com/NVIDIA/TensorRT-LLM/tree/v0.8.0?tab=readme-ov-file#models) with [tensorrtllm_backend](https://github.com/triton-inference-server/tensorrtllm_backend), you can instead use this [tutorial](../tutorials/trtllm_manual_convert_tutorial.md) to prepare the model manually.
TRT-LLM LMI v10 0.28.0 containers come with [TRT-LLM 0.9.0](https://github.com/NVIDIA/TensorRT-LLM/releases/tag/v0.9.0).
For models that are not listed here but are supported by [TRT-LLM](https://github.com/NVIDIA/TensorRT-LLM/tree/v0.9.0?tab=readme-ov-file#models) with [tensorrtllm_backend](https://github.com/triton-inference-server/tensorrtllm_backend), you can instead use this [tutorial](../tutorials/trtllm_manual_convert_tutorial.md) to prepare the model manually.

We will add support for more models in future versions through our CI. Please feel free to [file an issue](https://github.com/deepjavalibrary/djl-serving/issues/new/choose) if you are looking for support for a specific model.

@@ -37,7 +39,6 @@ You can leverage `tensorrtllm` with LMI using the following starter configuration:
```
engine=MPI
option.tensor_parallel_degree=max
option.rolling_batch=trtllm
option.model_id=<your model id>
# Adjust the following based on model size and instance type
option.max_num_tokens=50000
@@ -50,7 +51,6 @@ You can follow [this example](../deployment_guide/deploying-your-endpoint.md#con
````
HF_MODEL_ID=<your model id>
TENSOR_PARALLEL_DEGREE=max
OPTION_ROLLING_BATCH=trtllm
# Adjust the following based on model size and instance type
OPTION_MAX_NUM_TOKENS=50000
````
