Replies: 1 comment 1 reply
@gs-olive Wondering whether cudagraphs control in C++ land (something like `.enable_cudagraphs`) was able to land?
CUDAGraphs in Torch-TRT
TL;DR
Enable CUDAGraphs model acceleration in Torch-TRT to enhance performance by hiding kernel-launch-time bottlenecks. CUDAGraphs is enabled via a boolean argument to `compile`. Some considerations remain regarding dynamic shapes.
Goal(s)
Boost performance via the use of CUDAGraphs to group kernels into units, thereby improving kernel launch times and decreasing overhead.
Usecases
Proposed APIs / UX
The API would be invoked via an argument to `torch_tensorrt.compile`, as sketched below. The backend would then attempt to compile the graph with the specified input shapes, using CUDAGraphs. This argument provides a layer of abstraction to the user: the background work of CUDAGraphs is handled by Torch-TensorRT, and the acceleration benefits are immediate.
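A minimal sketch of the proposed invocation, assuming the flag is named `enable_cudagraphs` (the argument name and exact placement are illustrative, not a settled API):

```python
import torch
import torch_tensorrt

model = torch.nn.Sequential(torch.nn.Conv2d(3, 16, 3), torch.nn.ReLU()).eval().cuda()
inputs = [torch.randn(1, 3, 224, 224, device="cuda")]

# Hypothetical boolean flag requesting CUDAGraph-backed execution of the
# compiled TRT engines; the name "enable_cudagraphs" is illustrative only.
trt_model = torch_tensorrt.compile(
    model,
    ir="dynamo",
    inputs=inputs,
    enabled_precisions={torch.float},
    enable_cudagraphs=True,
)

out = trt_model(*inputs)
```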
Limitations
This feature will not work with arbitrary dynamic-shape inputs without user-provided padding. This is on account of memory and recompilation limitations. Some consideration is needed to define a mechanism for handling dynamic shape models, including when to dispatch to the CUDAGraphs implementation vs using eager execution as a fallback.
Internal Implementation
Design
The key requirements of this feature are instantiation, building, and storage mechanisms for the CUDAGraph objects.
Extensions Required to Core API implementations
The main APIs needing to change are the TRT runtime modules. There are a few candidate options for modifications:
1. All Recording in Python
Recording graph operations can be done after TRT engine building, as a wrapper over the inference execution. Some options for this exist in PyTorch's Python-side CUDA graph utilities; a minimal sketch of one such wrapper follows below.
One detail which is not yet clear about this component is whether recording the completed TRT engine will include kernel artifacts which are undesired in the end result.
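As one illustration of the wrapper pattern, here is a minimal sketch using PyTorch's `torch.cuda.CUDAGraph` utilities over an already-compiled TRT module. This is not the Torch-TRT implementation, just the general record/replay flow, under the assumption that inputs and outputs live in fixed buffers:

```python
import torch

def wrap_with_cudagraph(trt_module, example_inputs):
    # Record one CUDAGraph over the compiled module and return a replay callable.
    static_inputs = [x.clone() for x in example_inputs]

    # Warm up on a side stream before capture, as recommended for CUDA graphs.
    s = torch.cuda.Stream()
    s.wait_stream(torch.cuda.current_stream())
    with torch.cuda.stream(s):
        for _ in range(3):
            trt_module(*static_inputs)
    torch.cuda.current_stream().wait_stream(s)

    g = torch.cuda.CUDAGraph()
    with torch.cuda.graph(g):
        static_outputs = trt_module(*static_inputs)

    def run(*new_inputs):
        # Copy fresh data into the captured input buffers, then replay the graph.
        for dst, src in zip(static_inputs, new_inputs):
            dst.copy_(src)
        g.replay()
        return static_outputs

    return run
```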
2. Recording at Engine Build Time
As recommended by TRT here, the `execute_async` call can be recorded to capture the sequence of kernels run in a narrow and precise scope. This is likely the better approach, though it would require more effort, since it needs C++ modifications for the `TorchTensorRTModule` and Python modifications for the `PythonTorchTensorRTModule`.
CUDAGraphs also requires a degree of control over the CUDA stream actively in use within the user context. This includes copying input tensors to precise memory locations and other considerations for managed memory. Approach 2 above alleviates this a bit by relying on the TRT module to do at least a portion of stream management. Some of the methods in Approach 1 handle stream logic behind the scenes as well.
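A rough Python-level sketch of Approach 2, capturing only the TRT enqueue call on a dedicated stream. The actual change would live in the `TorchTensorRTModule`/`PythonTorchTensorRTModule` runtimes; here `engine` and `bindings` are assumed to be a built TRT engine and pre-allocated static device buffers:

```python
import torch

def capture_trt_execution(engine, bindings):
    # Capture exactly one TRT enqueue so the graph's scope stays narrow.
    context = engine.create_execution_context()
    capture_stream = torch.cuda.Stream()

    # Warm-up enqueue outside of capture to initialize lazy resources.
    with torch.cuda.stream(capture_stream):
        context.execute_async_v2(bindings, capture_stream.cuda_stream)
    capture_stream.synchronize()

    g = torch.cuda.CUDAGraph()
    with torch.cuda.graph(g, stream=capture_stream):
        context.execute_async_v2(bindings, capture_stream.cuda_stream)

    # Later: copy new inputs into the static buffers, then call g.replay().
    return g
```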
Data Structures
The CUDAGraphs will require a storage mechanism. This will likely depend on the selected `ir` to an extent. The graph object itself will be stored as a field of the selected runtime module. At first, there can be a single CUDAGraph, compiled for the first shape which the model encounters. Later, extensions can be made to construct a dictionary, keyed on input shapes or dynamic dimensions, to support saving multiple CUDAGraphs.
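A minimal sketch of what such storage could look like, with hypothetical names, keyed on the concrete input shapes:

```python
from typing import Dict, Optional, Tuple
import torch

class CudaGraphCache:
    # Hypothetical per-runtime-module storage: one recorded graph per shape key.
    def __init__(self) -> None:
        self.graphs: Dict[Tuple[Tuple[int, ...], ...], torch.cuda.CUDAGraph] = {}

    def _key(self, inputs) -> Tuple[Tuple[int, ...], ...]:
        return tuple(tuple(t.shape) for t in inputs)

    def get(self, inputs) -> Optional[torch.cuda.CUDAGraph]:
        # None means: fall back to eager execution or record a new graph.
        return self.graphs.get(self._key(inputs))

    def put(self, inputs, graph: torch.cuda.CUDAGraph) -> None:
        self.graphs[self._key(inputs)] = graph
```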
ir="dynamo"
When using `ir="dynamo"`, we do not have the capability to recompile the model or re-record the CUDAGraph at runtime with a different dynamic shape. Therefore, we either need to store multiple CUDAGraphs corresponding to common shapes, or a single CUDAGraph to which other shapes are "paddable". To start, we can have a CUDAGraph generated for the `opt` shape as a default case, and then expand to other more advanced cases later.
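As an illustration of the "paddable" case, a sketch of padding a smaller runtime input up to the recorded `opt` shape (inputs larger than `opt` would have to fall back to eager execution):

```python
import torch
import torch.nn.functional as F

def pad_to_opt(x: torch.Tensor, opt_shape) -> torch.Tensor:
    # Zero-pad each dimension up to the opt shape recorded in the CUDAGraph.
    if any(s > o for s, o in zip(x.shape, opt_shape)):
        raise ValueError("Input exceeds the recorded opt shape; fall back to eager")
    # F.pad takes pairs (left, right) starting from the last dimension.
    pad = []
    for s, o in reversed(list(zip(x.shape, opt_shape))):
        pad.extend([0, o - s])
    return F.pad(x, pad)
```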
ir="torch_compile"
In the `torch.compile` case, the capability for recompilation unlocks additional applications for CUDAGraphs. In much the same way that `mode="reduce-overhead"` provides CUDAGraphs capabilities for Inductor, along with dynamic recompilation on shape changes, we can use this model for supporting CUDAGraphs with dynamic shapes. Still, to start, it would be easiest to generate the CUDAGraph as part of compilation and store only one graph per engine.
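For reference, a usage sketch contrasting Inductor's existing CUDAGraphs path with a hypothetical analogue on the Torch-TensorRT backend (the `"enable_cudagraphs"` option key is illustrative only):

```python
import torch

model = torch.nn.Linear(64, 64).eval().cuda()
x = torch.randn(8, 64, device="cuda")

# Inductor's existing CUDAGraphs integration.
inductor_model = torch.compile(model, mode="reduce-overhead")

# Hypothetical analogue on the Torch-TensorRT backend.
trt_model = torch.compile(
    model,
    backend="torch_tensorrt",
    options={"enable_cudagraphs": True},
)

out = trt_model(x)
```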
Details specific for TorchScript Support
TorchScript support should follow if Approach 2 is taken, since the runtime modifications will be portable to TorchScript if they are functional for `ir="dynamo"`.
Implementation Phases
Prototype - Small/Medium
MVP (2.4.0) - Medium/Large
Extension Phase 1 - Medium
`mode="reduce-overhead"`, with dynamic shape support for `torch.compile` and smart recompilation leveraging PyTorch guards. Consider integrating with or using the Inductor functionalities directly.