Update trtllm docs for 0.28.0
ydm-amazon committed May 29, 2024
1 parent bdf6420 commit 808bb1b
Showing 2 changed files with 45 additions and 5 deletions.
@@ -38,6 +38,46 @@ option.max_rolling_batch_size=max(256,value you override)
We understand that finding the maximum number is difficult, so we have precomputed a lookup table for you.
In future releases, we will build these numbers into our container and will not ask you to provide one.

### LMI 0.28.0

The following max token numbers were tested on the machines below with batch sizes up to 128 and input context lengths up to 3700.

| Model         | Machine  | Tensor Parallel Degree | Max Number of Tokens |
|---------------|----------|------------------------|----------------------|
| LLaMA 3 8B    | g5.12xl  | 1                      | 24000                |
| LLaMA 3 8B    | g5.12xl  | 4                      | 176000               |
| LLaMA 2 7B    | g5.12xl  | 1                      | 29000                |
| LLaMA 2 7B    | g5.12xl  | 4                      | 198000               |
| LLaMA 2 13B   | g5.12xl  | 4                      | 127000               |
| Gemma 7B      | g5.12xl  | 4                      | 125000               |
| Gemma 7B      | g5.12xl  | 1                      | 1190                 |
| Falcon 7B     | g5.12xl  | 1                      | 36000                |
| Mistral 7B    | g5.12xl  | 1                      | 35000                |
| Mistral 7B    | g5.12xl  | 4                      | 198000               |
| LLaMA 2 13B   | g6.12xl  | 4                      | 116000               |
| LLaMA 2 13B   | g5.48xl  | 8                      | 142000               |
| LLaMA 2 70B   | g5.48xl  | 8                      | 4100                 |
| LLaMA 3 70B   | g5.48xl  | 8                      | Out of Memory        |
| Mixtral 8x7B  | g5.48xl  | 8                      | 31000                |
| Falcon 40B    | g5.48xl  | 8                      | 32000                |
| CodeLLaMA 34B | g5.48xl  | 8                      | 36000                |
| LLaMA 2 13B   | p4d.24xl | 4                      | 235000               |
| LLaMA 2 70B   | p4d.24xl | 8                      | 97000                |
| LLaMA 3 70B   | p4d.24xl | 8                      | 82000                |
| Mixtral 8x7B  | p4d.24xl | 4                      | 50000                |
| Mixtral 8x7B  | p4d.24xl | 8                      | 112000               |
| Falcon 40B    | p4d.24xl | 4                      | 71000                |
| Mistral 7B    | p4d.24xl | 2                      | 245000               |
| Mistral 7B    | p4d.24xl | 4                      | 498000               |
| CodeLLaMA 34B | p4d.24xl | 4                      | 115000               |
| CodeLLaMA 34B | p4d.24xl | 8                      | 191000               |
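
As a quick illustration of how to apply a value from this table, here is a minimal `serving.properties` sketch for LLaMA 3 8B on a g5.12xl with tensor parallel degree 4 (the model id is a placeholder; adjust it to your deployment):

```
engine=MPI
option.model_id=<your model id>
option.tensor_parallel_degree=4
# 176000 comes from the lookup table row for LLaMA 3 8B on g5.12xl with TP degree 4
option.max_num_tokens=176000
```

If your model, instance type, or tensor parallel degree does not appear in the table, you will still need to supply a value for `option.max_num_tokens` yourself.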

### LMI 0.27.0

The following max token numbers were tested on the machines below with batch sizes up to 128 and input context lengths up to 3700.
serving/docs/lmi/user_guides/trt_llm_user_guide.md — 10 changes: 5 additions & 5 deletions
@@ -20,11 +20,13 @@ The below model architectures are supported for JIT model compilation and tested
* Mistral (since LMI V8 0.26.0)
* Mixtral (since LMI V8 0.26.0)
* Qwen (since LMI V8 0.26.0)
* GPT2/SantaCoder (since LMI V8 0.26.0)
* GPT2/SantaCoder/StarCoder/GPTBigCode (since LMI V8 0.26.0)
* Phi2 (since LMI V9 0.27.0)
* OPT (since LMI V9 0.27.0)
* Gemma (since LMI V10 0.28.0)

TRT-LLM LMI v9 0.27.0 containers come with [TRT-LLM 0.8.0](https://github.com/NVIDIA/TensorRT-LLM/releases/tag/v0.8.0).
For models that are not listed here but are supported by [TRT-LLM](https://github.com/NVIDIA/TensorRT-LLM/tree/v0.8.0?tab=readme-ov-file#models) with [tensorrtllm_backend](https://github.com/triton-inference-server/tensorrtllm_backend), you can instead use this [tutorial](../tutorials/trtllm_manual_convert_tutorial.md) to prepare the model manually.
TRT-LLM LMI v10 0.28.0 containers come with [TRT-LLM 0.9.0](https://github.com/NVIDIA/TensorRT-LLM/releases/tag/v0.9.0).
For models that are not listed here but are supported by [TRT-LLM](https://github.com/NVIDIA/TensorRT-LLM/tree/v0.9.0?tab=readme-ov-file#models) with [tensorrtllm_backend](https://github.com/triton-inference-server/tensorrtllm_backend), you can instead use this [tutorial](../tutorials/trtllm_manual_convert_tutorial.md) to prepare the model manually.

We will add support for more models in future versions through our CI. Please feel free to [file an issue](https://github.com/deepjavalibrary/djl-serving/issues/new/choose) if you are looking for support for a specific model.

@@ -37,7 +39,6 @@ You can leverage `tensorrtllm` with LMI using the following starter configuration:
```
engine=MPI
option.tensor_parallel_degree=max
option.rolling_batch=trtllm
option.model_id=<your model id>
# Adjust the following based on model size and instance type
option.max_num_tokens=50000
@@ -50,7 +51,6 @@ You can follow [this example](../deployment_guide/deploying-your-endpoint.md#con
````
HF_MODEL_ID=<your model id>
TENSOR_PARALLEL_DEGREE=max
OPTION_ROLLING_BATCH=trtllm
# Adjust the following based on model size and instance type
OPTION_MAX_NUM_TOKENS=50000
````
