TensorRT-LLM Engine integration #3228
Conversation
Solid overall, left some comments. The postprocessing especially needs some rework, I think, since the client side does not know how many beams to expect, so it will be hard to make sense of the returned streaming chunks.
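One way to make the streamed chunks self-describing (a sketch of this suggestion, not what the PR implements; all names below are illustrative) is to bundle the partial outputs of all beams for a batch entry into a single tagged payload:

```python
import json

# Hedged sketch: send one chunk per batch entry with all beams included,
# so the client does not need to know num_beams up front.
# `beam_texts` is an illustrative name, not taken from the PR.
def format_stream_chunk(batch_idx, beam_texts):
    return json.dumps(
        {
            "batch_index": batch_idx,
            "beams": [{"beam": i, "text": t} for i, t in enumerate(beam_texts)],
        }
    )
```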
maxBatchDelay: 100
responseTimeout: 1200
deviceType: "gpu"
asyncCommunication: true |
Can TRT-LLM handle multi-GPU inference easily? If so, we should demonstrate that we can easily integrate it with:
parallelType: "custom"
parallelLevel: 4
I tried standalone multi-GPU inference, but it didn't work for me. Although the model loaded on all 4 GPUs, inference was hanging.
streaming=True,
return_dict=True,
)
torch.cuda.synchronize() |
What is the synchronization for?
Copy-pasted from the example code, but it seems it's not needed. Works fine without it.
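For reference, a minimal sketch of the streaming call once the explicit synchronize is dropped; `runner` is assumed to be a TensorRT-LLM ModelRunner and `batch_input_ids` the tokenized inputs, mirroring the example the handler was adapted from rather than the exact PR code:

```python
# Hedged sketch: names are placeholders, not necessarily those in the handler.
def generate_streaming(runner, batch_input_ids):
    # No torch.cuda.synchronize() here; the author reports generation works
    # fine without it when consuming the streamed results.
    return runner.generate(
        batch_input_ids,
        streaming=True,    # yield results incrementally instead of all at once
        return_dict=True,  # keyed results such as output_ids and sequence_lengths
    )
```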
for beam in range(num_beams):
    output_begin = input_lengths[batch_idx]
    output_end = sequence_lengths[batch_idx][beam]
    outputs = output_ids[batch_idx][beam][
Are we sure we're doing the right thing here? output_begin is never used; output_end - 1 is used instead.
Good catch, we don't need output_begin. The example code uses this.
        output_end - 1 : output_end
    ].tolist()
    output_text = self.tokenizer.decode(outputs)
    send_intermediate_predict_response(
If we send N=num_beams intermediate results back without any order information, can we assign each partial response to its beam sequence? Would it be better to send one response per batch entry (which will be 1) with updates for all beams included as a list?
It seems like this is not needed for Llama, as num_beams > 1 is not working for Llama. Removed the inner for loop.
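For illustration, a rough sketch of the simplified postprocessing step without the inner beam loop, decoding only the newest token per chunk. Variable names mirror the diff above and the send_intermediate_predict_response call follows TorchServe's streaming utilities; the actual handler may differ in details:

```python
from ts.handler_utils.utils import send_intermediate_predict_response

def stream_newest_token(context, tokenizer, output_ids, sequence_lengths, batch_idx=0):
    """Hedged sketch: decode and stream the newest token for one batch entry."""
    beam = 0  # single beam only, so the inner beam loop is gone
    output_end = sequence_lengths[batch_idx][beam]
    # Only the most recent token is decoded per chunk, which is why the
    # output_begin offset from the example code is unused.
    newest = output_ids[batch_idx][beam][output_end - 1 : output_end].tolist()
    text = tokenizer.decode(newest)
    send_intermediate_predict_response(
        [text], context.request_ids, "Intermediate Prediction success", 200, context
    )
```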
Let's figure out the CUDA 12 situation, then this LGTM.
@@ -10,6 +10,7 @@ This will downgrade the versions of PyTorch & Triton but this doesn't cause any
```
pip install tensorrt_llm==0.10.0 --extra-index-url https://pypi.nvidia.com
pip install tensorrt-cu12==10.1.0 |
Is this CUDA 12 exclusive? In that case we should inform people to install torch with CUDA 12 as well.
It doesn't mention it, but all their docs point to CUDA 12.x. Let me mention that it's tested with CUDA 12.1.
Description
This PR shows how to integrate the TensorRT-LLM Engine with TorchServe.
Fixes #(issue)
Type of change
Please delete options that are not relevant.
Feature/Issue validation/testing
Please describe the Unit or Integration tests that you ran to verify your changes and relevant result summary. Provide instructions so it can be reproduced.
Please also list any relevant details for your test configuration.
Test A
Logs for Test A
Test B
Logs for Test B
Checklist: