TensorRT-LLM Release 0.12.0
Key Features and Enhancements
- The `ModelWeightsLoader` is enabled for LLaMA family models (experimental), see `docs/source/architecture/model-weights-loader.md`.
- Extended model support for the `LLM` class (a usage sketch follows this list).
- Speculative decoding updates, see `docs/source/speculative_decoding.md`.
- Supported the `gelu_pytorch_tanh` activation function, thanks to the contribution from @ttim in #1897.
- Added a `chunk_length` parameter to Whisper, thanks to the contribution from @MahmoudAshraf97 in #1909.
- Added a `concurrency` argument for `gptManagerBenchmark`.
- The executor API supports sending requests with different beam widths, see `docs/source/executor.md#sending-requests-with-different-beam-widths`.
- Added a `--fast_build` flag to the `trtllm-build` command (experimental).
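As a quick illustration of the high-level `LLM` class, here is a minimal sketch following the LLM API quick-start pattern; the model identifier and prompt are placeholders, and any supported Hugging Face id or local checkpoint path can be used.

```python
from tensorrt_llm import LLM, SamplingParams

# Placeholder model id; a local checkpoint path also works.
llm = LLM(model="meta-llama/Meta-Llama-3-8B-Instruct")

prompts = ["Hello, my name is"]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

# generate() returns one output object per prompt.
for output in llm.generate(prompts, sampling_params):
    print(output.outputs[0].text)
```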
API Changes

- `max_output_len` is removed from the `trtllm-build` command; to limit sequence length at the engine build stage, specify `max_seq_len` instead.
- The `use_custom_all_reduce` argument is removed from `trtllm-build`.
- The `multi_block_mode` argument is moved from the build stage (`trtllm-build` and the builder API) to the runtime.
- `context_fmha_fp32_acc` is moved to the runtime for decoder models.
- `tp_size`, `pp_size`, and `cp_size` are removed from the `trtllm-build` command.
- The `GptManager` API is deprecated in favor of the `executor` API, and it will be removed in a future release of TensorRT-LLM (a migration sketch follows this list).
- The `cpp/include/tensorrt_llm/executor/version.h` file is going to be generated.
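For code migrating to the `executor` API, the sketch below follows the shape of the Python executor bindings examples shipped in the repository; the engine directory and token ids are placeholders, and exact argument names (e.g. `max_tokens`) may differ between releases, so treat this as an assumption-laden outline rather than a definitive snippet.

```python
import tensorrt_llm.bindings.executor as trtllm

# Load a prebuilt engine (placeholder path).
executor = trtllm.Executor("/path/to/engine_dir",
                           trtllm.ModelType.DECODER_ONLY,
                           trtllm.ExecutorConfig())

# Enqueue a request with pre-tokenized input (placeholder token ids).
request = trtllm.Request(input_token_ids=[1, 2, 3, 4], max_tokens=16)
request_id = executor.enqueue_request(request)

# Block until the generated tokens arrive.
for response in executor.await_responses(request_id):
    print(response.result.output_token_ids)
```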
Model Updates
- Supported the EXAONE model, see `examples/exaone/README.md`.
- GLM model updates, see `examples/chatglm/README.md`.
- Multimodal model updates, see `examples/multimodal/README.md`.
Fixed Issues
- Fixed a typo in `cluster_infos` defined in `tensorrt_llm/auto_parallel/cluster_info.py`, thanks to the contribution from @saeyoonoh in #1987.
- Removed a duplicated flag from the command in `docs/source/reference/troubleshooting.md`, thanks to the contribution from @hattizai in #1937.
- Propagated `QuantConfig.exclude_modules` to weight-only quantization, thanks to the contribution from @fjosw in #2056 (a configuration sketch follows this list).
- Fixed an engine build failure when `max_seq_len` is not an integer. (#2018)
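To illustrate the `exclude_modules` fix above, here is a rough sketch of a weight-only quantization config that skips selected modules; the import paths and the excluded module name are assumptions based on the 0.12 code layout, not a definitive recipe.

```python
from tensorrt_llm.models.modeling_utils import QuantConfig
from tensorrt_llm.quantization import QuantAlgo

# INT8 weight-only quantization that leaves lm_head in full precision.
# With #2056, exclude_modules is honored by the weight-only path as well.
quant_config = QuantConfig(
    quant_algo=QuantAlgo.W8A16,
    exclude_modules=["lm_head"],  # illustrative module name
)
```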
Infrastructure Changes
- The base Docker image for TensorRT-LLM is updated to `nvcr.io/nvidia/pytorch:24.07-py3`.
- The base Docker image for the TensorRT-LLM Backend is updated to `nvcr.io/nvidia/tritonserver:24.07-py3`.
Known Issues
- On Windows, importing the library in Python may fail with `OSError: exception: access violation reading 0x0000000000000000`. See Installing on Windows for workarounds.