[docs] Update trtllm docs for 0.28.0 #1990
Conversation
force-pushed from 808bb1b to 7de7d59
| LLaMA 2 13B | g6.12xl | 4 | 116000 |
| LLaMA 2 13B | g5.48xl | 8 | 142000 |
| LLaMA 2 70B | g5.48xl | 8 | 4100 |
| LLaMA 3 70B | g5.48xl | 8 | Out of Memory |
do we want to include this in the table? It's the only one with OOM listed and feels out of place
I think it is OK to list the one with OOM, to inform customers that this configuration might not work.
|------------------------------|-----------|--------------|-------------|----------|
| option.max_input_len | >= 0.25.0 | LMI | Maximum input token length you expect the model to receive per request. This is a compilation parameter set on the model for just-in-time compilation. If you set this value too low, the model will be unable to consume long inputs. LMI also validates this at runtime for each request. | Default: `1024` |
| option.max_output_len | >= 0.25.0 | LMI | Maximum output token length you expect the model to produce per request. This is a compilation parameter set on the model for just-in-time compilation. If you set this value too low, the model will be unable to produce tokens beyond the value you set. | Default: `512` |
| option.max_num_tokens | >= 0.27.0 | LMI | Maximum total number of tokens the TRTLLM engine will use. If you set this value, we internally adjust max_input_len, max_output_len, and batch size to what the model can actually support. This allows the model to run under more arbitrary traffic patterns. | Default: None |
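As a sketch of how these compilation bounds fit together in a `serving.properties` file (the model id and all values below are illustrative assumptions, not recommendations):

```properties
# Hypothetical model and parallelism settings -- adjust for your deployment
option.model_id=meta-llama/Llama-2-13b-hf
option.tensor_parallel_degree=4
# JIT compilation bounds; LMI also validates max_input_len per request at runtime
option.max_input_len=1024
option.max_output_len=512
# Alternatively, give the engine a total token budget and let it derive
# max_input_len, max_output_len, and batch size from it
option.max_num_tokens=16384
```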
we are now giving defaults of 16384
Done
| option.smoothquant_per_token | >= 0.26.0 | Pass Through | Only applied when `option.quantize` is set to `smoothquant`. Enables choosing a custom smoothquant scaling factor for each token at runtime. This is usually a little slower and more accurate. | `true`, `false`. <br/> Default is `false` |
| option.smoothquant_per_channel | >= 0.26.0 | Pass Through | Only applied when `option.quantize` is set to `smoothquant`. Enables choosing a custom smoothquant scaling factor for each channel at runtime. This is usually a little slower and more accurate. | `true`, `false`. <br/> Default is `false` |
| option.multi_query_mode | >= 0.26.0 | Pass Through | Only needed when `option.quantize` is set to `smoothquant`. This should be set for models that support multi-query attention, e.g. llama-70b. | `true`, `false`. <br/> Default is `false` |
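The SmoothQuant options above can be combined in `serving.properties` like this (a sketch; the values are illustrative):

```properties
option.quantize=smoothquant
# Per-token / per-channel scaling: usually a little slower but more accurate
option.smoothquant_per_token=true
option.smoothquant_per_channel=true
# Set for models that support multi-query attention, e.g. llama-70b
option.multi_query_mode=true
```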
| Advanced parameters: AWQ |
We also need to add Advanced parameters: FP8
Added quantize, use_fp8_context_fmha, calib_size, and calib_batch_size; let me know if there is anything else I should pay attention to 👍
force-pushed from 56ca8f7 to 2237f77
| Advanced parameters: FP8 |
| option.quantize | >= 0.26.0 | Pass Through | Currently only supports `fp8` for Llama, Mistral, Mixtral, Baichuan, Gemma, and GPT2 models in just-in-time compilation mode. | `fp8` |
| option.use_fp8_context_fmha | >= 0.28.0 | Pass Through | Paged attention for fp8; should only be turned on for p5 instances. | `true`, `false`. <br/> Default is `false` |
| option.calib_size | >= 0.27.0 | Pass Through | Only applied when `option.quantize` is set to `fp8`. Number of samples for calibration. | Default is `32` |
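Putting the FP8 rows together, a hedged `serving.properties` fragment might look like the following (values are illustrative; `use_fp8_context_fmha` assumes a p5 instance):

```properties
option.quantize=fp8
# FP8 paged attention -- only enable on p5 (H100) instances
option.use_fp8_context_fmha=true
# Number of calibration samples
option.calib_size=32
```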
I think the default is 512
Fixed, thanks
| option.awq_format | == 0.26.0 | Pass Through | Only applied when `option.quantize` is set to `awq`. The AWQ format you want to use. Currently only `int4_awq` is supported. | Default value is `int4_awq` |
| option.awq_calib_size | == 0.26.0 | Pass Through | Only applied when `option.quantize` is set to `awq`. Number of samples for calibration. | Default is `32` |
| option.q_format | >= 0.27.0 | Pass Through | Only applied when `option.quantize` is set to `awq`. The AWQ format you want to use. Currently only `int4_awq` is supported. | Default value is `int4_awq` |
| option.calib_size | >= 0.27.0 | Pass Through | Only applied when `option.quantize` is set to `awq`. Number of samples for calibration. | Default is `32` |
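For AWQ on >= 0.27.0, the parameter names above combine as follows (a sketch, not a verified config):

```properties
option.quantize=awq
# Currently only int4_awq is supported
option.q_format=int4_awq
# Number of calibration samples
option.calib_size=32
```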
default is 512?
Fixed
force-pushed from 38eab36 to c3a018d
Description