
[docs] Update trtllm docs for 0.28.0 #1990

Merged
merged 2 commits into deepjavalibrary:master on Jun 4, 2024

Conversation

ydm-amazon (Contributor) commented May 29, 2024

Description

  • LMI 0.28.0 max num tokens table
  • Updating the list of supported models
  • Removing the `option.rolling_batch=trtllm` line from the suggested serving.properties (see the sketch below)
  • Adding relevant new properties
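
For context, a minimal serving.properties sketch reflecting this change; the model id and tensor parallel degree are illustrative placeholders, and the `option.rolling_batch=trtllm` line is intentionally omitted:

```properties
# Minimal serving.properties sketch for the TRT-LLM LMI container (illustrative values).
# option.rolling_batch=trtllm is omitted per this PR; the container selects the
# TRT-LLM rolling batcher by default.
option.model_id=meta-llama/Llama-2-13b-hf
option.tensor_parallel_degree=4
```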

ydm-amazon force-pushed the trtllm_doc_update branch from 808bb1b to 7de7d59 on May 29, 2024, 23:22
ydm-amazon marked this pull request as ready for review on May 31, 2024, 21:39
ydm-amazon requested review from zachgk, frankfliu and a team as code owners on May 31, 2024, 21:39
ydm-amazon changed the title [wip] Update trtllm docs for 0.28.0 → Update trtllm docs for 0.28.0 on May 31, 2024
ydm-amazon changed the title Update trtllm docs for 0.28.0 → [docs] Update trtllm docs for 0.28.0 on May 31, 2024
| LLaMA 2 13B | g6.12xl | 4 | 116000 |
| LLaMA 2 13B | g5.48xl | 8 | 142000 |
| LLaMA 2 70B | g5.48xl | 8 | 4100 |
| LLaMA 3 70B | g5.48xl | 8 | Out of Memory |
Contributor: Do we want to include this in the table? It's the only one with OOM listed and feels out of place.

Contributor: I think it is ok to list the one with OOM, to inform customers that this might not work.

|---|---|---|---|---|
| option.max_input_len | >= 0.25.0 | LMI | Maximum input token size you expect the model to receive per request. This is a compilation parameter set on the model for just-in-time compilation. If you set this value too low, the model will be unable to consume long inputs. LMI also validates this at runtime for each request. | Default: `1024` |
| option.max_output_len | >= 0.25.0 | LMI | Maximum output token size you expect the model to produce per request. This is a compilation parameter set on the model for just-in-time compilation. If you set this value too low, the model will be unable to produce tokens beyond the value you set. | Default: `512` |
| option.max_num_tokens | >= 0.27.0 | LMI | Maximum total token budget the TRT-LLM engine will use. Internally, if you set this value, we extend the max input length, max output length, and batch size to what the model can actually support. This allows the model to run under more arbitrary traffic. | Default: None |
Contributor: We are now giving defaults of 16384.

ydm-amazon (author): Done
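
For illustration, a sketch of these token-size settings in serving.properties (values are examples only; per this thread, `option.max_num_tokens` now defaults to 16384 when unset):

```properties
# Illustrative token-size settings; all values are examples, not recommendations.
option.max_input_len=1024
option.max_output_len=512
# Total token budget for the TRT-LLM engine; per the thread above, the default is now 16384.
option.max_num_tokens=16384
```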

| option.smoothquant_per_token | >= 0.26.0 | Pass Through | This is only applied when `option.quantize` is set to `smoothquant`. This enables choosing a custom smoothquant scaling factor for each token at run time. This is usually a little slower but more accurate. | `true`, `false`. <br/> Default is `false` |
| option.smoothquant_per_channel | >= 0.26.0 | Pass Through | This is only applied when `option.quantize` is set to `smoothquant`. This enables choosing a custom smoothquant scaling factor for each channel at run time. This is usually a little slower but more accurate. | `true`, `false`. <br/> Default is `false` |
| option.multi_query_mode | >= 0.26.0 | Pass Through | This is only needed when `option.quantize` is set to `smoothquant`. This should be set for models that support multi-query attention, e.g. llama-70b. | `true`, `false`. <br/> Default is `false` |
| Advanced parameters: AWQ |
Contributor: We also need to add Advanced parameters: FP8

ydm-amazon (author): Added quantize, use_fp8_context_fmha, calib_size, and calib_batch_size; let me know if there is anything else I should pay attention to 👍
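
As a sketch of the SmoothQuant options documented in the table above (all values are illustrative):

```properties
# Illustrative SmoothQuant configuration; all values are examples.
option.quantize=smoothquant
# Per-token / per-channel scaling factors: usually a little slower but more accurate.
option.smoothquant_per_token=true
option.smoothquant_per_channel=true
# Only needed for models that support multi-query attention, e.g. llama-70b.
option.multi_query_mode=true
```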

ydm-amazon force-pushed the trtllm_doc_update branch from 56ca8f7 to 2237f77 on June 3, 2024, 16:46
| Advanced parameters: FP8 |
| option.quantize | >= 0.26.0 | Pass Through | Currently only supports `fp8` for Llama, Mistral, Mixtral, Baichuan, Gemma, and GPT2 models in just-in-time compilation mode. | `fp8` |
| option.use_fp8_context_fmha | >= 0.28.0 | Pass Through | Paged attention for fp8; should only be enabled on p5 instances. | `true`, `false`. <br/> Default is `false` |
| option.calib_size | >= 0.27.0 | Pass Through | This is applied when `option.quantize` is set to `fp8`. Number of samples for calibration. | Default is `32` |
Contributor: I think the default is 512.

ydm-amazon (author): Fixed, thanks
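
For reference, a sketch of the new FP8 options in serving.properties (values are illustrative; `calib_size` uses the corrected default of 512 discussed above):

```properties
# Illustrative FP8 configuration; all values are examples.
option.quantize=fp8
# Paged attention for fp8; per the table above, only enable this on p5 instances.
option.use_fp8_context_fmha=true
# Number of calibration samples; the corrected default discussed above is 512.
option.calib_size=512
```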

| option.awq_format | == 0.26.0 | Pass Through | This is only applied when `option.quantize` is set to `awq`. The AWQ format you want to use. Currently only `int4_awq` is supported. | Default value is `int4_awq` |
| option.awq_calib_size | == 0.26.0 | Pass Through | This is only applied when `option.quantize` is set to `awq`. Number of samples for calibration. | Default is `32` |
| option.q_format | >= 0.27.0 | Pass Through | This is only applied when `option.quantize` is set to `awq`. The AWQ format you want to use. Currently only `int4_awq` is supported. | Default value is `int4_awq` |
| option.calib_size | >= 0.27.0 | Pass Through | This is applied when `option.quantize` is set to `awq`. Number of samples for calibration. | Default is `32` |
Contributor: Default is 512?

ydm-amazon (author): Fixed
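
For reference, a sketch of the AWQ options using the 0.27.0+ property names (values are illustrative):

```properties
# Illustrative AWQ configuration (0.27.0+ property names); all values are examples.
option.quantize=awq
option.q_format=int4_awq
# Number of calibration samples; the corrected default discussed above is 512.
option.calib_size=512
```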

ydm-amazon force-pushed the trtllm_doc_update branch from 38eab36 to c3a018d on June 4, 2024, 16:02
ydm-amazon merged commit 93ed30d into deepjavalibrary:master on Jun 4, 2024
2 checks passed