Add support for Torch-TensorRT Nightly Install in Docker #1909
Conversation
gs-olive commented on Sep 19, 2023 (edited)
- Add install for Torch-TRT nightly
- Add validation to ensure Torch-TRT nightly versions are installed correctly and are functional
Using TensorRT in PyTorch requires both torch_tensorrt and NVIDIA's TensorRT SDK: https://developer.nvidia.com/tensorrt. Currently, the Docker image does not have the TensorRT SDK installed.
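A minimal functional smoke test along these lines might look as follows. This is only a sketch, not the exact validation added in this PR, and it assumes a CUDA-capable machine with the nightly torch and torch_tensorrt wheels installed:

```python
# Hypothetical smoke test: confirm torch_tensorrt nightly imports and can compile a trivial model.
import torch
import torch_tensorrt

print(f"torch: {torch.__version__}, torch_tensorrt: {torch_tensorrt.__version__}")

model = torch.nn.Sequential(
    torch.nn.Linear(16, 16),
    torch.nn.ReLU(),
).eval().cuda()
inputs = [torch.randn(8, 16).cuda()]

# Compile through the dynamo IR and run a forward pass to verify the install is functional.
trt_model = torch_tensorrt.compile(model, ir="dynamo", inputs=inputs)
out = trt_model(*inputs)
assert out.shape == (8, 16)
print("Torch-TensorRT nightly appears functional.")
```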
@xuzhao9 - the
@gs-olive You can get the latest built docker image using the following command:
Force-pushed from 674bd84 to 5a1f7ec.
Thanks for the resource! I just tested out:
pip install --pre --no-cache-dir torch torchvision torchaudio torch_tensorrt -i https://download.pytorch.org/whl/nightly/cu118
I believe the above should behave the same as the command:
pip install --pre --no-cache-dir torch torchvision torchaudio torch_tensorrt --extra-index-url https://download.pytorch.org/whl/nightly/cu118
The above appears to have the same install behavior as the previous command for
Hi @gs-olive, thanks for your effort. I am wondering if you could test
Here is the output of the command you referenced:
Compiling resnet50 with batch size 32, precision fp16, and default IR
INFO:torch_tensorrt._compile:ir was set to default, using dynamo as ir
WARNING:torch_tensorrt.dynamo.compile:The Dynamo backend is an experimental feature, for which only the following arguments are supported: {enabled_precisions, debug, workspace_size, min_block_size, max_aux_streams, version_compatible, optimization_level, torch_executed_ops, pass_through_build_failures, use_fast_partitioner, enable_experimental_decompositions, require_full_compilation}
INFO:torch_tensorrt.dynamo.compile:Compilation Settings: CompilationSettings(precision=torch.float16, debug=False, workspace_size=0, min_block_size=5, torch_executed_ops=[], pass_through_build_failures=False, max_aux_streams=None, version_compatible=False, optimization_level=None, use_python_runtime=False, truncate_long_and_double=False, use_fast_partitioner=True, enable_experimental_decompositions=False, device=Device(type=DeviceType.GPU, gpu_id=0), require_full_compilation=False)
INFO:torch_tensorrt.dynamo.conversion._TRTInterpreter:TRT INetwork construction elapsed time: 0:00:00.005506
INFO:torch_tensorrt.dynamo.conversion._TRTInterpreter:Build TRT engine elapsed time: 0:00:36.552237
INFO:torch_tensorrt.dynamo.conversion._TRTInterpreter:TRT Engine uses: 102760960 bytes of Memory
INFO:torch_tensorrt.dynamo.conversion._TRTInterpreter:TRT INetwork construction elapsed time: 0:00:03.441001
[09/21/2023-17:55:44] [TRT] [W] TensorRT encountered issues when converting weights between types and that could affect accuracy.
[09/21/2023-17:55:44] [TRT] [W] If this is not the desired behavior, please modify the weights or retrain with regularization to adjust the magnitude of the weights.
[09/21/2023-17:55:44] [TRT] [W] Check verbose logs for the list of affected weights.
[09/21/2023-17:55:44] [TRT] [W] - 1 weights are affected by this issue: Detected subnormal FP16 values.
INFO:torch_tensorrt.dynamo.conversion._TRTInterpreter:Build TRT engine elapsed time: 0:03:20.398813
INFO:torch_tensorrt.dynamo.conversion._TRTInterpreter:TRT Engine uses: 256901120 bytes of Memory
Running eval method from resnet50 on cuda in torch_trt mode with input batch size 32 and precision fp16.
INFO:numba.cuda.cudadrv.driver:init
GPU Time: 10.823 milliseconds
CPU Total Wall Time: 10.841 milliseconds
GPU 0 Peak Memory: 1.4231 GB
CPU Peak Memory: 4.0303 GB
PT2 Compilation time: 1.405 seconds
Below is the output of a similar command:
Compiling resnet50 with batch size 32, precision fp16, and torch_compile IR
Running eval method from resnet50 on cuda in torch_trt mode with input batch size 32 and precision fp16.
INFO:torch_tensorrt.dynamo.utils:Using Default Torch-TRT Runtime (as requested by user)
INFO:torch_tensorrt.dynamo.utils:Device not specified, using Torch default current device - cuda:0. If this is incorrect, please specify an input device, via the device keyword.
INFO:torch_tensorrt.dynamo.utils:Compilation Settings: CompilationSettings(precision=torch.float16, debug=False, workspace_size=0, min_block_size=5, torch_executed_ops=set(), pass_through_build_failures=False, max_aux_streams=None, version_compatible=False, optimization_level=None, use_python_runtime=False, truncate_long_and_double=False, use_fast_partitioner=True, enable_experimental_decompositions=False, device=Device(type=DeviceType.GPU, gpu_id=0), require_full_compilation=False)
INFO:torch_tensorrt.dynamo.conversion._TRTInterpreter:TRT INetwork construction elapsed time: 0:00:00.005175
INFO:torch_tensorrt.dynamo.conversion._TRTInterpreter:Build TRT engine elapsed time: 0:00:35.786096
INFO:torch_tensorrt.dynamo.conversion._TRTInterpreter:TRT Engine uses: 102760960 bytes of Memory
INFO:torch_tensorrt.dynamo.conversion._TRTInterpreter:TRT INetwork construction elapsed time: 0:00:03.359519
[09/21/2023-18:14:10] [TRT] [W] TensorRT encountered issues when converting weights between types and that could affect accuracy.
[09/21/2023-18:14:10] [TRT] [W] If this is not the desired behavior, please modify the weights or retrain with regularization to adjust the magnitude of the weights.
[09/21/2023-18:14:10] [TRT] [W] Check verbose logs for the list of affected weights.
[09/21/2023-18:14:10] [TRT] [W] - 1 weights are affected by this issue: Detected subnormal FP16 values.
INFO:torch_tensorrt.dynamo.conversion._TRTInterpreter:Build TRT engine elapsed time: 0:03:15.177063
INFO:torch_tensorrt.dynamo.conversion._TRTInterpreter:TRT Engine uses: 269746176 bytes of Memory
INFO:numba.cuda.cudadrv.driver:init
GPU Time: 11.090 milliseconds
CPU Total Wall Time: 11.111 milliseconds
GPU 0 Peak Memory: 1.3997 GB
CPU Peak Memory: 3.7539 GB
PT2 Compilation time: 249.850 seconds
LGTM!
The result looks great! Do we still need to make sure that the version of the torch_trt nightly matches the torch nightly? I am thinking that if the package sets its dependencies correctly and publishes at https://download.pytorch.org/, we should be able to install compatible versions. The downside is, if on some day the
@xuzhao9 - I see - this is a good point. I will look into adding a check similar to
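One possible shape for such a check is sketched below. This is a hedged illustration only: it assumes both nightlies encode their build date as a .devYYYYMMDD suffix in __version__, which may not hold for every release channel.

```python
# Hypothetical check that the torch and torch_tensorrt nightlies were built on the same date.
import re
import torch
import torch_tensorrt

def nightly_date(version: str) -> str:
    # Nightly wheels typically look like "2.2.0.dev20230921+cu118" (assumption).
    match = re.search(r"\.dev(\d{8})", version)
    if match is None:
        raise RuntimeError(f"Could not find a nightly date in version string: {version}")
    return match.group(1)

torch_date = nightly_date(torch.__version__)
trt_date = nightly_date(torch_tensorrt.__version__)
if torch_date != trt_date:
    raise RuntimeError(
        f"torch nightly ({torch_date}) and torch_tensorrt nightly ({trt_date}) do not match."
    )
print(f"Nightly versions match: {torch_date}")
```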
Force-pushed from 5a1f7ec to 3f84f62.
Force-pushed from 3f84f62 to ba8722d.
Force-pushed from ba8722d to aa27dc0.
@xuzhao9 - I've added install validation for the Torch-TensorRT package and separated out the
Looks good to me; we can accept this after the inline comment is addressed.
Force-pushed from aa27dc0 to 7694449.
- Add install for Torch-TRT nightly
- Add install validation
Force-pushed from 7694449 to e527a45.
@xuzhao9 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
Thanks @xuzhao9! I am wondering where I would find the results of future runs, to see whether the installation and testing are working properly?
@gs-olive Second, after it's been built in the nightly docker, ping me at #1823, and I can start another CI workflow run of the torch_trt userbenchmark; if it succeeds, the benchmark metrics can be downloaded as GitHub artifacts.
@gs-olive I tried running the userbenchmark on the latest nightly docker container with the torch_trt nightly installed: https://github.com/pytorch/benchmark/actions/runs/6341149525/job/17224134160 It seems to fail when compiling the BERT_pytorch model. The convention in userbenchmark is that a parent process runs each model in a child process, so if the child process crashes or throws an exception, the parent process can still run the next model.
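That convention could be sketched roughly as follows. This is not the actual torchbench driver; the model list and the per-model module path below are illustrative assumptions.

```python
# Hypothetical parent process that runs each model benchmark in an isolated child process.
import subprocess
import sys

models = ["resnet50", "BERT_pytorch"]  # illustrative list, not the real model discovery logic

for model in models:
    # Each model runs in its own Python child process, so a crash (segfault, OOM, exception)
    # in one model does not take down the whole benchmark run.
    # The module path below is hypothetical.
    result = subprocess.run(
        [sys.executable, "-m", "userbenchmark.torch_trt.run_one_model", "--model", model],
        capture_output=True,
        text=True,
    )
    if result.returncode != 0:
        print(f"{model} failed (exit code {result.returncode}); continuing with the next model.")
        print(result.stderr, file=sys.stderr)
    else:
        print(f"{model} completed successfully.")
```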
Thanks for this! It seems this is testing against a different IR than I had intended (it's using the default IR, but it should be
Added #1946 to fix the selected IR - after this I expect the BERT_pytorch benchmark to complete, as I had verified that one locally.
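For reference, the IR is selected through the ir argument of torch_tensorrt.compile. The sketch below shows what an explicit selection might look like; the exact value the benchmark should use (e.g. "dynamo" vs. "torch_compile") is an assumption here and is precisely what the follow-up PR adjusts.

```python
# Hypothetical illustration of selecting the Torch-TensorRT IR explicitly instead of "default".
import torch
import torch_tensorrt
import torchvision.models as models

model = models.resnet50().half().eval().cuda()
inputs = [torch.randn(32, 3, 224, 224, dtype=torch.half).cuda()]

# ir="torch_compile" routes through torch.compile with the TensorRT backend,
# while ir="dynamo" uses the ahead-of-time dynamo frontend.
trt_model = torch_tensorrt.compile(
    model,
    ir="torch_compile",
    inputs=inputs,
    enabled_precisions={torch.half},
)
out = trt_model(*inputs)
```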