This project provides a high-performance implementation of YOLOv11 object detection using TensorRT for inference acceleration. The pipeline processes images and videos in batches, leveraging CUDA for preprocessing and inference, while Non-Maximum Suppression (NMS) and the rest of postprocessing run on the CPU to refine the raw detections.
- CUDA-accelerated preprocessing and inference for faster performance.
- Batch processing for images and videos, supporting multiple inputs simultaneously.
- Postprocessing with NMS to refine detections.
- Threaded execution to handle multiple inputs concurrently, each on its own CUDA stream for maximum GPU utilization.
- Scalable pipeline: process multiple files (images or videos) in parallel.
- Outputs detections with bounding boxes, class labels, and confidence scores.
| Configuration | Inference Time (ms) | Preprocessing Time (ms) | Total Latency (ms) | FPS |
|---|---|---|---|---|
| Baseline (CUDA only) | 80 | - | 80 | 12.5 |
| TensorRT + CUDA streams (CPU preprocessing) | 30 | ~0-10 (CPU-dependent) | 30 | 33.3 |
| TensorRT + CUDA streams (CUDA preprocessing) | 20 | Overlapped on GPU | 20 | 50 |
1. Install the necessary tools for exporting the model:

   ```bash
   pip install ultralytics
   ```

2. Convert the PyTorch YOLO model to ONNX format:

   ```bash
   yolo export model=yolo11s.pt format=onnx batch=8 half=True
   ```

3. Compile the ONNX model into a TensorRT engine:

   ```bash
   trtexec --onnx=yolo11s.onnx \
           --saveEngine=yolo11s.engine \
           --memPoolSize=workspace:4G \
           --fp16
   ```

   - `--onnx`: Specifies the ONNX model file.
   - `--saveEngine`: Specifies the output TensorRT engine file.
   - `--memPoolSize`: Sets the GPU memory workspace available while building the engine.
   - `--fp16`: Enables half-precision floating-point computation for faster inference.
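At runtime, the serialized engine is deserialized with the TensorRT C++ API. A minimal sketch of that step (the `Logger` class and hard-coded path are illustrative, not the project's actual code):

```cpp
#include <NvInfer.h>
#include <fstream>
#include <iostream>
#include <memory>
#include <vector>

// Minimal logger required by the TensorRT runtime.
class Logger : public nvinfer1::ILogger {
    void log(Severity severity, const char* msg) noexcept override {
        if (severity <= Severity::kWARNING) std::cerr << msg << std::endl;
    }
};

int main() {
    // Read the serialized engine produced by trtexec.
    std::ifstream file("weights/yolo11s.engine", std::ios::binary | std::ios::ate);
    std::vector<char> blob(file.tellg());
    file.seekg(0);
    file.read(blob.data(), blob.size());

    Logger logger;
    auto runtime = std::unique_ptr<nvinfer1::IRuntime>(
        nvinfer1::createInferRuntime(logger));
    auto engine = std::unique_ptr<nvinfer1::ICudaEngine>(
        runtime->deserializeCudaEngine(blob.data(), blob.size()));
    auto context = std::unique_ptr<nvinfer1::IExecutionContext>(
        engine->createExecutionContext());
    // `context` is now ready to enqueue inference on a CUDA stream.
}
```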
The base Docker image includes TensorRT, OpenCV, and Python 3.11:

```bash
docker build -t tensorrt-opencv5-python3.11-cuda -f Dockerfile.base .
```

The inference Docker image includes the YOLOv11 pipeline:

```bash
docker build -t yolov11-cuda-trt -f Dockerfile .
```

To enter an interactive environment for development:

```bash
docker run --gpus all -it --rm \
    -v $(pwd)/yolo-cuda:/workspace/yolo-cuda \
    tensorrt-opencv5-python3.11-cuda bash
```
Run the inference executable with the following options:

```text
Usage: ./build/main <input_path> [--engine_path=PATH] [--batch_size=N] [--confidence_threshold=FLOAT]
```

Example:

```bash
./build/main ./asset/walk1.mp4,./asset/walk2.mp4 --engine_path=weights/yolo11s.engine --batch_size=8 --confidence_threshold=0.7
```

- `<input_path>`: Comma-separated list of input image or video paths.
- `--engine_path` (optional): Path to the TensorRT engine file (default: `./weights/yolo11s.engine`).
- `--batch_size` (optional): Number of inputs to process per batch (default: `8`).
- `--confidence_threshold` (optional): Confidence threshold for filtering detections (default: `0.7`).
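For illustration, splitting the comma-separated `<input_path>` argument into individual paths might look like this (a sketch, not the project's actual parser):

```cpp
#include <sstream>
#include <string>
#include <vector>

// Split "./asset/walk1.mp4,./asset/walk2.mp4" into individual file paths.
std::vector<std::string> splitInputs(const std::string& arg) {
    std::vector<std::string> paths;
    std::stringstream ss(arg);
    std::string item;
    while (std::getline(ss, item, ',')) {
        if (!item.empty()) paths.push_back(item);
    }
    return paths;
}
```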
To run inference using the Docker image:

```bash
docker run --gpus all -it --rm \
    -v $(pwd)/weights:/workspace/weights \
    -v $(pwd)/asset:/workspace/asset \
    yolov11-cuda-trt ./asset/walk1.mp4,./asset/walk2.mp4 --engine_path=weights/yolo11s.engine --batch_size=8 --confidence_threshold=0.7
```
Preprocessing:
- Resizes and normalizes input images or video frames to 640x640 resolution.
- Converts the color space from BGR to RGB.
- Batch processing: combines multiple inputs for parallel GPU processing.
- Format conversion: converts images to NCHW layout.
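For reference, a minimal CPU-side sketch of these steps using OpenCV's `cv::dnn::blobFromImages`, which performs the resize, 1/255 scaling, BGR-to-RGB swap, and NCHW conversion in one call (the project's CUDA path does the equivalent on the GPU):

```cpp
#include <opencv2/opencv.hpp>
#include <vector>

// Convert a batch of frames into a single NCHW float blob for the engine.
cv::Mat preprocess(const std::vector<cv::Mat>& frames) {
    return cv::dnn::blobFromImages(
        frames,
        1.0 / 255.0,            // normalize pixel values to [0, 1]
        cv::Size(640, 640),     // network input resolution
        cv::Scalar(),           // no mean subtraction
        /*swapRB=*/true,        // BGR -> RGB
        /*crop=*/false);
}
```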
Inference:
- Executes the TensorRT engine on the GPU.
- Processes inputs in batches for efficiency.
- Leverages CUDA streams to overlap computation and data transfer.
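A minimal sketch of launching one batch on a CUDA stream with the TensorRT C++ API. It assumes TensorRT 8.5+, pre-allocated host/device buffers, and the Ultralytics export's default tensor names `images`/`output0` (check your engine's actual I/O tensor names):

```cpp
#include <NvInfer.h>
#include <cuda_runtime.h>

// Launch one batch on its own CUDA stream so copies and compute can overlap
// with work queued on other streams.
void inferBatch(nvinfer1::IExecutionContext* context,
                const float* hInput, void* dInput, size_t inputBytes,
                float* hOutput, void* dOutput, size_t outputBytes,
                cudaStream_t stream) {
    // Host-to-device copy of the preprocessed batch, non-blocking on the host.
    cudaMemcpyAsync(dInput, hInput, inputBytes, cudaMemcpyHostToDevice, stream);

    // Bind tensors by name and enqueue inference (TensorRT >= 8.5 API).
    context->setTensorAddress("images", dInput);
    context->setTensorAddress("output0", dOutput);
    context->enqueueV3(stream);

    // Device-to-host copy of the raw detections, then wait for this stream only.
    cudaMemcpyAsync(hOutput, dOutput, outputBytes, cudaMemcpyDeviceToHost, stream);
    cudaStreamSynchronize(stream);
}
```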
Postprocessing:
- Confidence filtering: removes low-confidence detections based on a threshold.
- Non-Maximum Suppression (NMS): removes overlapping bounding boxes for the same object.
- Outputs detections with:
  - Class IDs
  - Confidence scores
  - Bounding box coordinates
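A minimal sketch of confidence filtering plus NMS using OpenCV's `cv::dnn::NMSBoxes` (an illustrative stand-in, not necessarily the project's exact NMS routine):

```cpp
#include <opencv2/opencv.hpp>
#include <vector>

// Drop detections below the confidence threshold, then suppress boxes that
// overlap a higher-scoring box by more than the NMS IoU threshold.
std::vector<int> postprocess(const std::vector<cv::Rect>& boxes,
                             const std::vector<float>& scores,
                             float confThreshold = 0.7f,
                             float nmsThreshold = 0.45f) {
    std::vector<int> keep;
    cv::dnn::NMSBoxes(boxes, scores, confThreshold, nmsThreshold, keep);
    return keep;  // indices of detections that survive filtering and NMS
}
```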
Concurrency:
- Threaded inference: input files (images or videos) are processed concurrently using multiple threads.
- CUDA streams: each thread operates on a separate CUDA stream to parallelize preprocessing, inference, and data transfer for multiple inputs.
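A minimal sketch of this threading model, with one worker thread per input file and a private CUDA stream per thread (the `processFile` worker is hypothetical):

```cpp
#include <cuda_runtime.h>
#include <string>
#include <thread>
#include <vector>

// Hypothetical per-file worker: each thread creates its own CUDA stream so
// preprocessing, inference, and copies for different inputs can overlap.
void processFile(const std::string& path) {
    cudaStream_t stream;
    cudaStreamCreate(&stream);
    // ... preprocess frames, enqueue inference, and copy results on `stream` ...
    cudaStreamDestroy(stream);
}

int main() {
    std::vector<std::string> inputs = {"./asset/walk1.mp4", "./asset/walk2.mp4"};
    std::vector<std::thread> workers;
    for (const auto& path : inputs)
        workers.emplace_back(processFile, path);
    for (auto& t : workers)
        t.join();
}
```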
Output:
- Logs inference times for each batch, frame, and individual input file.
- Logs detections with class labels, confidence scores, and bounding box details.
- Saves processed images and videos with bounding boxes drawn.
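A minimal sketch of how each detection might be drawn with OpenCV before the annotated file is saved (function name and styling are illustrative):

```cpp
#include <opencv2/opencv.hpp>
#include <string>

// Draw one detection's bounding box and "label confidence" text onto a frame.
void drawDetection(cv::Mat& frame, const cv::Rect& box,
                   const std::string& label, float confidence) {
    cv::rectangle(frame, box, cv::Scalar(0, 255, 0), 2);
    std::string text = label + " " + cv::format("%.2f", confidence);
    cv::putText(frame, text, {box.x, box.y - 5},
                cv::FONT_HERSHEY_SIMPLEX, 0.5, cv::Scalar(0, 255, 0), 1);
}
```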
Examples:
- Image input:
  - Input: `./asset/bus.jpg`
  - Command: `./build/main ./asset/bus.jpg --engine_path=weights/yolo11s.engine --confidence_threshold=0.8`
  - Output: annotated image saved as `out_bus.jpg`.
- Video input:
  - Input: `./asset/walk1.mp4`
  - Command: `./build/main ./asset/walk1.mp4 --engine_path=weights/yolo11s.engine --batch_size=4 --confidence_threshold=0.7`
  - Output: annotated video saved as `out_walk1.mp4`.
The pipeline processes inputs efficiently by leveraging GPU acceleration. Key performance characteristics:
- Preprocessing and inference: Executed on the GPU for faster computation.
- Postprocessing: Executed on the CPU for flexibility and precision.
- Throughput: Supports batch sizes up to the GPU memory limit, providing high throughput for both images and videos.
- Multi-threading: Achieves concurrent processing of multiple inputs, significantly improving throughput.
Limitations:
- Postprocessing is CPU-bound, which may bottleneck performance for large batch sizes.
- Requires a TensorRT-compatible GPU.
Sample log output:

```text
Inference time for batch in ./asset/walk.mp4: 163.39 ms, 20.4238ms/frame
[Final Detection] Class ID: 0, Confidence: 0.729492, BBox: [270, 80, 63, 428]
[Final Detection] Class ID: 0, Confidence: 0.835938, BBox: [269, 75, 67, 437]
[Final Detection] Class ID: 0, Confidence: 0.708984, BBox: [611, 227, 16, 76]
[Final Detection] Class ID: 0, Confidence: 0.706543, BBox: [260, 73, 95, 438]
[Final Detection] Class ID: 0, Confidence: 0.733398, BBox: [252, 74, 92, 436]
[Final Detection] Class ID: 0, Confidence: 0.82959, BBox: [244, 77, 124, 433]
[Final Detection] Class ID: 0, Confidence: 0.730469, BBox: [606, 226, 26, 78]
Inference time for batch in ./asset/walk.mp4: 164.353 ms, 20.5441ms/frame
[Final Detection] Class ID: 0, Confidence: 0.811523, BBox: [239, 80, 117, 430]
[Final Detection] Class ID: 0, Confidence: 0.751953, BBox: [606, 226, 30, 78]
[Final Detection] Class ID: 0, Confidence: 0.867676, BBox: [232, 86, 144, 424]
[Final Detection] Class ID: 0, Confidence: 0.759766, BBox: [606, 229, 33, 75]
[Final Detection] Class ID: 0, Confidence: 0.820312, BBox: [227, 88, 142, 422]
[Final Detection] Class ID: 0, Confidence: 0.742676, BBox: [606, 227, 33, 77]
[Final Detection] Class ID: 0, Confidence: 0.828613, BBox: [223, 91, 146, 420]
[Final Detection] Class ID: 0, Confidence: 0.839844, BBox: [221, 91, 140, 419]
[Final Detection] Class ID: 0, Confidence: 0.862793, BBox: [225, 91, 132, 419]
[Final Detection] Class ID: 0, Confidence: 0.775391, BBox: [240, 91, 112, 398]
Inference time for batch in ./asset/walk.mp4: 165.66 ms, 20.7076ms/frame
[Final Detection] Class ID: 0, Confidence: 0.737305, BBox: [443, 213, 19, 100]
[Final Detection] Class ID: 0, Confidence: 0.730469, BBox: [270, 76, 55, 438]
[Final Detection] Class ID: 0, Confidence: 0.714355, BBox: [439, 217, 19, 96]
[Final Detection] Class ID: 0, Confidence: 0.800781, BBox: [260, 76, 66, 435]
[Final Detection] Class ID: 0, Confidence: 0.796875, BBox: [254, 82, 73, 432]
[Final Detection] Class ID: 0, Confidence: 0.815918, BBox: [248, 81, 105, 431]
[Final Detection] Class ID: 0, Confidence: 0.85498, BBox: [241, 88, 102, 423]
[Final Detection] Class ID: 0, Confidence: 0.70166, BBox: [611, 228, 29, 77]
```