
[docs] Update trtllm docs for 0.28.0 #1990

Merged
merged 2 commits into deepjavalibrary:master on Jun 4, 2024

Conversation

ydm-amazon (Contributor) commented May 29, 2024

Description

  • LMI 0.28.0 max num tokens table
  • Updating the list of supported models
  • Removing the `option.rolling_batch=trtllm` line from the suggested serving.properties (see the sketch below)
  • Adding relevant new properties
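
For context, a minimal serving.properties sketch reflecting this change; the model id and tensor parallel degree are illustrative placeholders, and the `option.rolling_batch=trtllm` line is intentionally omitted:

```properties
# Minimal serving.properties sketch for the TRT-LLM LMI container (illustrative values).
# option.rolling_batch=trtllm is omitted per this PR; the container selects the
# TRT-LLM rolling batcher by default.
option.model_id=meta-llama/Llama-2-13b-hf
option.tensor_parallel_degree=4
```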

ydm-amazon force-pushed the trtllm_doc_update branch from 808bb1b to 7de7d59 on May 29, 2024, 23:22
ydm-amazon marked this pull request as ready for review on May 31, 2024, 21:39
ydm-amazon requested review from zachgk, frankfliu and a team as code owners on May 31, 2024, 21:39
ydm-amazon changed the title [wip] Update trtllm docs for 0.28.0 → Update trtllm docs for 0.28.0 on May 31, 2024
ydm-amazon changed the title Update trtllm docs for 0.28.0 → [docs] Update trtllm docs for 0.28.0 on May 31, 2024
| LLaMA 2 13B | g6.12xl | 4 | 116000 |
| LLaMA 2 13B | g5.48xl | 8 | 142000 |
| LLaMA 2 70B | g5.48xl | 8 | 4100 |
| LLaMA 3 70B | g5.48xl | 8 | Out of Memory |
Contributor: Do we want to include this in the table? It's the only one with OOM listed and feels out of place.

Contributor: I think it is ok to list the one with OOM, to inform customers that this might not work.

|---|---|---|---|---|
| option.max_input_len | >= 0.25.0 | LMI | Maximum input token size you expect the model to receive per request. This is a compilation parameter set on the model for just-in-time compilation. If you set this value too low, the model will be unable to consume long inputs. LMI also validates this at runtime for each request. | Default: `1024` |
| option.max_output_len | >= 0.25.0 | LMI | Maximum output token size you expect the model to produce per request. This is a compilation parameter set on the model for just-in-time compilation. If you set this value too low, the model will be unable to produce tokens beyond the value you set. | Default: `512` |
| option.max_num_tokens | >= 0.27.0 | LMI | Maximum total token budget the TRT-LLM engine will use. Internally, if you set this value, we extend the max input length, max output length, and batch size to what the model can actually support. This allows the model to run under more arbitrary traffic. | Default: None |
Contributor: We are now giving defaults of 16384.

ydm-amazon (author): Done
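
For illustration, a sketch of these token-size settings in serving.properties (values are examples only; per this thread, `option.max_num_tokens` now defaults to 16384 when unset):

```properties
# Illustrative token-size settings; all values are examples, not recommendations.
option.max_input_len=1024
option.max_output_len=512
# Total token budget for the TRT-LLM engine; per the thread above, the default is now 16384.
option.max_num_tokens=16384
```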

| option.smoothquant_per_token | >= 0.26.0 | Pass Through | This is only applied when `option.quantize` is set to `smoothquant`. This enables choosing a custom smoothquant scaling factor for each token at run time. This is usually a little slower but more accurate. | `true`, `false`. <br/> Default is `false` |
| option.smoothquant_per_channel | >= 0.26.0 | Pass Through | This is only applied when `option.quantize` is set to `smoothquant`. This enables choosing a custom smoothquant scaling factor for each channel at run time. This is usually a little slower but more accurate. | `true`, `false`. <br/> Default is `false` |
| option.multi_query_mode | >= 0.26.0 | Pass Through | This is only needed when `option.quantize` is set to `smoothquant`. This should be set for models that support multi-query attention, e.g. llama-70b. | `true`, `false`. <br/> Default is `false` |
| Advanced parameters: AWQ |
Contributor: We also need to add Advanced parameters: FP8

ydm-amazon (author): Added quantize, use_fp8_context_fmha, calib_size, and calib_batch_size; let me know if there is anything else I should pay attention to 👍
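
As a sketch of the SmoothQuant options documented in the table above (all values are illustrative):

```properties
# Illustrative SmoothQuant configuration; all values are examples.
option.quantize=smoothquant
# Per-token / per-channel scaling factors: usually a little slower but more accurate.
option.smoothquant_per_token=true
option.smoothquant_per_channel=true
# Only needed for models that support multi-query attention, e.g. llama-70b.
option.multi_query_mode=true
```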

ydm-amazon force-pushed the trtllm_doc_update branch from 56ca8f7 to 2237f77 on June 3, 2024, 16:46
| Advanced parameters: FP8 |
| option.quantize | >= 0.26.0 | Pass Through | Currently only supports `fp8` for Llama, Mistral, Mixtral, Baichuan, Gemma, and GPT2 models in just-in-time compilation mode. | `fp8` |
| option.use_fp8_context_fmha | >= 0.28.0 | Pass Through | Paged attention for fp8; should only be enabled on p5 instances. | `true`, `false`. <br/> Default is `false` |
| option.calib_size | >= 0.27.0 | Pass Through | This is applied when `option.quantize` is set to `fp8`. Number of samples for calibration. | Default is `32` |
Contributor: I think the default is 512.

ydm-amazon (author): Fixed, thanks
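
For reference, a sketch of the new FP8 options in serving.properties (values are illustrative; `calib_size` uses the corrected default of 512 discussed above):

```properties
# Illustrative FP8 configuration; all values are examples.
option.quantize=fp8
# Paged attention for fp8; per the table above, only enable this on p5 instances.
option.use_fp8_context_fmha=true
# Number of calibration samples; the corrected default discussed above is 512.
option.calib_size=512
```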

| option.awq_format | == 0.26.0 | Pass Through | This is only applied when `option.quantize` is set to `awq`. The AWQ format you want to use. Currently only `int4_awq` is supported. | Default value is `int4_awq` |
| option.awq_calib_size | == 0.26.0 | Pass Through | This is only applied when `option.quantize` is set to `awq`. Number of samples for calibration. | Default is `32` |
| option.q_format | >= 0.27.0 | Pass Through | This is only applied when `option.quantize` is set to `awq`. The AWQ format you want to use. Currently only `int4_awq` is supported. | Default value is `int4_awq` |
| option.calib_size | >= 0.27.0 | Pass Through | This is applied when `option.quantize` is set to `awq`. Number of samples for calibration. | Default is `32` |
Contributor: Default is 512?

ydm-amazon (author): Fixed
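
For reference, a sketch of the AWQ options using the 0.27.0+ property names (values are illustrative):

```properties
# Illustrative AWQ configuration (0.27.0+ property names); all values are examples.
option.quantize=awq
option.q_format=int4_awq
# Number of calibration samples; the corrected default discussed above is 512.
option.calib_size=512
```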

ydm-amazon force-pushed the trtllm_doc_update branch from 38eab36 to c3a018d on June 4, 2024, 16:02
ydm-amazon merged commit 93ed30d into deepjavalibrary:master on Jun 4, 2024
2 checks passed