docs: Updated cuda graphs doc #3357

Open · wants to merge 2 commits into base: main
Binary file added docsrc/tutorials/images/cuda_graphs.png
Binary file added docsrc/tutorials/images/cuda_graphs_breaks.png
24 changes: 18 additions & 6 deletions examples/dynamo/torch_export_cudagraphs.py
@@ -4,7 +4,13 @@
Torch Export with Cudagraphs
======================================================

CUDA Graphs allow multiple GPU operations to be launched through a single CPU operation, reducing launch overheads and improving GPU utilization. Torch-TensorRT provides a simple interface to enable CUDA graphs. This feature allows users to easily leverage the performance benefits of CUDA graphs without managing the complexities of capture and replay manually.

.. image:: /tutorials/images/cuda_graphs.png

This interactive script is intended as an overview of the process by which the Torch-TensorRT Cudagraphs integration can be used in the `ir="dynamo"` path. The functionality works similarly in the
`torch.compile` path as well.
"""

# %%
# Imports and Model Definition
@@ -69,19 +75,25 @@

# %%
# CUDA graphs with a module that contains graph breaks
# ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
#
# When CUDA Graphs are applied to a TensorRT model that contains graph breaks, each break introduces additional
# overhead. This occurs because graph breaks prevent the entire model from being executed as a single, continuous
# optimized unit. As a result, some of the performance benefits typically provided by CUDA Graphs, such as reduced
# kernel launch overhead and improved execution efficiency, may be diminished.
#
# Using a wrapped runtime module with CUDA Graphs allows you to encapsulate sequences of operations into graphs
# that can be executed efficiently, even in the presence of graph breaks. If the TensorRT module has graph
# breaks, the CUDA Graph context manager returns a wrapped_module. This module captures the entire execution
# graph, enabling efficient replay during subsequent inferences by reducing kernel launch overheads and
# improving performance; see the sketch after the SampleModel definition below.
#
# Note that initializing with the wrapper module involves a warm-up phase where the
# module is executed several times. This warm-up ensures that memory allocations and initializations are not
# recorded in CUDA Graphs, which helps maintain consistent execution paths and optimize performance.
#
# .. image:: /tutorials/images/cuda_graphs_breaks.png
# :scale: 60 %
# :align: left


class SampleModel(torch.nn.Module):
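
To make the wrapped-module flow concrete, a minimal sketch of the pattern described above follows. It assumes SampleModel contains an aten.mul op (its body is elided in this diff), and the input shape and option values (min_block_size, torch_executed_ops) are illustrative rather than the tutorial's exact settings.

    # Force a graph break by keeping one op in PyTorch (illustrative only;
    # assumes SampleModel contains a multiplication).
    sample_model = SampleModel().eval().cuda()
    sample_inputs = [torch.randn((1, 3, 224, 224), device="cuda")]  # hypothetical shape

    opt_with_break = torch_tensorrt.compile(
        sample_model,
        ir="dynamo",
        inputs=sample_inputs,
        min_block_size=1,
        torch_executed_ops={"torch.ops.aten.mul.Tensor"},
    )

    # With graph breaks present, the context manager yields a wrapped module
    # that captures the whole execution; its first invocations serve as the
    # warm-up runs described above.
    with torch_tensorrt.runtime.enable_cudagraphs(opt_with_break) as wrapped_module:
        out = wrapped_module(*sample_inputs)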
@@ -115,12 +115,12 @@ def forward(self, *inputs: torch.Tensor) -> torch.Tensor | Tuple[torch.Tensor, ...]:
contiguous_inputs[i].dtype == self.inputs[i].dtype
), f"Dtype mismatch for {i}th input. Expect {self.inputs[i].dtype}, got {contiguous_inputs[i].dtype}."

if need_cudagraphs_record:
    # If cudagraphs is enabled, this memory is reserved for future cudagraph runs
    # Clone is required to avoid re-using user-provided GPU memory
    self._input_buffers[i] = contiguous_inputs[i].clone()
else:
    self._input_buffers[i].copy_(contiguous_inputs[i])

Collaborator Author: Fixed wrong indent.

self._caller_stream = torch.cuda.current_stream()
if (
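
For background on the buffer handling in the hunk above: a replayed CUDA graph re-executes its kernels against the device addresses that were live at record time, which is why inputs are cloned into persistent buffers when recording and copied into those same buffers on later calls. A self-contained sketch of this record/replay pattern using PyTorch's stock torch.cuda.CUDAGraph API (independent of Torch-TensorRT):

    import torch

    # Static buffer: replays read from and write to these exact addresses.
    static_in = torch.randn(8, 16, device="cuda")

    # Warm up on a side stream so one-time allocations are not captured.
    s = torch.cuda.Stream()
    s.wait_stream(torch.cuda.current_stream())
    with torch.cuda.stream(s):
        for _ in range(3):
            _ = static_in * 2 + 1
    torch.cuda.current_stream().wait_stream(s)

    # Record: kernels are captured, not executed.
    g = torch.cuda.CUDAGraph()
    with torch.cuda.graph(g):
        static_out = static_in * 2 + 1

    # Replay with fresh data: copy into the static input first, mirroring
    # the copy_() call in the runtime hunk above.
    static_in.copy_(torch.randn(8, 16, device="cuda"))
    g.replay()
    print(static_out)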