TensorRT-LLM Release 0.12.0
Key Features and Enhancements
- The `ModelWeightsLoader` is enabled for LLaMA family models (experimental), see `docs/source/architecture/model-weights-loader.md`.
- Extended model support for the `LLM` class (a usage sketch follows this list).
- Speculative decoding updates, see `docs/source/speculative_decoding.md`.
- Supported the `gelu_pytorch_tanh` activation function, thanks to the contribution from @ttim in #1897.
- Added a `chunk_length` parameter to Whisper, thanks to the contribution from @MahmoudAshraf97 in #1909.
- Added a `concurrency` argument for `gptManagerBenchmark`.
- The executor API supports sending requests with different beam widths, see `docs/source/executor.md#sending-requests-with-different-beam-widths`.
- Added a `--fast_build` flag to the `trtllm-build` command (experimental).
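As a quick illustration of the high-level `LLM` class, here is a minimal sketch following the LLM API quick-start pattern; the model identifier and prompt are placeholders, and any supported Hugging Face id or local checkpoint path can be used.

```python
from tensorrt_llm import LLM, SamplingParams

# Placeholder model id; a local checkpoint path also works.
llm = LLM(model="meta-llama/Meta-Llama-3-8B-Instruct")

prompts = ["Hello, my name is"]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

# generate() returns one output object per prompt.
for output in llm.generate(prompts, sampling_params):
    print(output.outputs[0].text)
```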
API Changes

- `max_output_len` is removed from the `trtllm-build` command; to limit sequence length at the engine build stage, specify `max_seq_len` instead.
- The `use_custom_all_reduce` argument is removed from `trtllm-build`.
- The `multi_block_mode` argument is moved from the build stage (`trtllm-build` and the builder API) to the runtime.
- `context_fmha_fp32_acc` is moved to the runtime for decoder models.
- `tp_size`, `pp_size`, and `cp_size` are removed from the `trtllm-build` command.
- The `GptManager` API is deprecated in favor of the `executor` API, and it will be removed in a future release of TensorRT-LLM (a migration sketch follows this list).
- The `cpp/include/tensorrt_llm/executor/version.h` file is going to be generated.
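For code migrating to the `executor` API, the sketch below follows the shape of the Python executor bindings examples shipped in the repository; the engine directory and token ids are placeholders, and exact argument names (e.g. `max_tokens`) may differ between releases, so treat this as an assumption-laden outline rather than a definitive snippet.

```python
import tensorrt_llm.bindings.executor as trtllm

# Load a prebuilt engine (placeholder path).
executor = trtllm.Executor("/path/to/engine_dir",
                           trtllm.ModelType.DECODER_ONLY,
                           trtllm.ExecutorConfig())

# Enqueue a request with pre-tokenized input (placeholder token ids).
request = trtllm.Request(input_token_ids=[1, 2, 3, 4], max_tokens=16)
request_id = executor.enqueue_request(request)

# Block until the generated tokens arrive.
for response in executor.await_responses(request_id):
    print(response.result.output_token_ids)
```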
Model Updates
- Supported the EXAONE model, see `examples/exaone/README.md`.
- GLM model updates, see `examples/chatglm/README.md`.
- Multimodal model updates, see `examples/multimodal/README.md`.
Fixed Issues
- Fixed a typo in `cluster_infos` defined in `tensorrt_llm/auto_parallel/cluster_info.py`, thanks to the contribution from @saeyoonoh in #1987.
- Removed a duplicated flag from the command in `docs/source/reference/troubleshooting.md`, thanks to the contribution from @hattizai in #1937.
- Propagated `QuantConfig.exclude_modules` to weight-only quantization, thanks to the contribution from @fjosw in #2056 (a configuration sketch follows this list).
- Fixed an engine build failure when `max_seq_len` is not an integer. (#2018)
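To illustrate the `exclude_modules` fix above, here is a rough sketch of a weight-only quantization config that skips selected modules; the import paths and the excluded module name are assumptions based on the 0.12 code layout, not a definitive recipe.

```python
from tensorrt_llm.models.modeling_utils import QuantConfig
from tensorrt_llm.quantization import QuantAlgo

# INT8 weight-only quantization that leaves lm_head in full precision.
# With #2056, exclude_modules is honored by the weight-only path as well.
quant_config = QuantConfig(
    quant_algo=QuantAlgo.W8A16,
    exclude_modules=["lm_head"],  # illustrative module name
)
```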
Infrastructure Changes
- The base Docker image for TensorRT-LLM is updated to `nvcr.io/nvidia/pytorch:24.07-py3`.
- The base Docker image for the TensorRT-LLM Backend is updated to `nvcr.io/nvidia/tritonserver:24.07-py3`.
Known Issues
- On Windows, importing the library in Python may fail with `OSError: exception: access violation reading 0x0000000000000000`. See Installing on Windows for workarounds.