This project provides a high-performance implementation of YOLOv11 object detection using TensorRT for inference acceleration. The pipeline processes images and videos in batches, leveraging CUDA for preprocessing and inference, while Non-Maximum Suppression (NMS) and the rest of postprocessing run on the CPU to refine the raw detections.
- CUDA-accelerated preprocessing and inference for faster performance.
- Batch processing for images and videos, supporting multiple inputs simultaneously.
- Postprocessing with NMS to refine detections.
- Threaded execution to handle multiple inputs concurrently, each on its own CUDA stream for maximum GPU utilization.
- Scalable pipeline: process multiple files (images or videos) in parallel.
- Outputs detections with bounding boxes, class labels, and confidence scores.
| Configuration | Inference Time (ms) | Preprocessing Time (ms) | Total Latency (ms) | FPS |
|---|---|---|---|---|
| Baseline (CUDA only) | 80 | - | 80 | 12.5 |
| TensorRT + CUDA streams (CPU preprocessing) | 30 | ~0-10 (CPU-dependent) | 30 | 33.3 |
| TensorRT + CUDA streams (CUDA preprocessing) | 20 | Overlapped on GPU | 20 | 50 |
1. Install the necessary tools for exporting the model:

   ```bash
   pip install ultralytics
   ```

2. Convert the PyTorch YOLO model to ONNX format:

   ```bash
   yolo export model=yolo11s.pt format=onnx batch=8 half=True
   ```

3. Compile the ONNX model into a TensorRT engine:

   ```bash
   trtexec --onnx=yolo11s.onnx \
           --saveEngine=yolo11s.engine \
           --memPoolSize=workspace:4G \
           --fp16
   ```

   - `--onnx`: Specifies the ONNX model file.
   - `--saveEngine`: Specifies the output TensorRT engine file.
   - `--memPoolSize`: Sets the GPU memory workspace available while building the engine.
   - `--fp16`: Enables half-precision floating-point computation for faster inference.
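At runtime, the serialized engine is deserialized with the TensorRT C++ API. A minimal sketch of that step (the `Logger` class and hard-coded path are illustrative, not the project's actual code):

```cpp
#include <NvInfer.h>
#include <fstream>
#include <iostream>
#include <memory>
#include <vector>

// Minimal logger required by the TensorRT runtime.
class Logger : public nvinfer1::ILogger {
    void log(Severity severity, const char* msg) noexcept override {
        if (severity <= Severity::kWARNING) std::cerr << msg << std::endl;
    }
};

int main() {
    // Read the serialized engine produced by trtexec.
    std::ifstream file("weights/yolo11s.engine", std::ios::binary | std::ios::ate);
    std::vector<char> blob(file.tellg());
    file.seekg(0);
    file.read(blob.data(), blob.size());

    Logger logger;
    auto runtime = std::unique_ptr<nvinfer1::IRuntime>(
        nvinfer1::createInferRuntime(logger));
    auto engine = std::unique_ptr<nvinfer1::ICudaEngine>(
        runtime->deserializeCudaEngine(blob.data(), blob.size()));
    auto context = std::unique_ptr<nvinfer1::IExecutionContext>(
        engine->createExecutionContext());
    // `context` is now ready to enqueue inference on a CUDA stream.
}
```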
The base Docker image includes TensorRT, OpenCV, and Python 3.11:

```bash
docker build -t tensorrt-opencv5-python3.11-cuda -f Dockerfile.base .
```

The inference Docker image includes the YOLOv11 pipeline:

```bash
docker build -t yolov11-cuda-trt -f Dockerfile .
```

To enter an interactive environment for development:

```bash
docker run --gpus all -it --rm \
    -v $(pwd)/yolo-cuda:/workspace/yolo-cuda \
    tensorrt-opencv5-python3.11-cuda bash
```
Run the inference executable with the following options:

```text
Usage: ./build/main <input_path> [--engine_path=PATH] [--batch_size=N] [--confidence_threshold=FLOAT]
```

Example:

```bash
./build/main ./asset/walk1.mp4,./asset/walk2.mp4 --engine_path=weights/yolo11s.engine --batch_size=8 --confidence_threshold=0.7
```

- `<input_path>`: Comma-separated list of input image or video paths.
- `--engine_path` (optional): Path to the TensorRT engine file (default: `./weights/yolo11s.engine`).
- `--batch_size` (optional): Number of inputs to process per batch (default: `8`).
- `--confidence_threshold` (optional): Confidence threshold for filtering detections (default: `0.7`).
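For illustration, splitting the comma-separated `<input_path>` argument into individual paths might look like this (a sketch, not the project's actual parser):

```cpp
#include <sstream>
#include <string>
#include <vector>

// Split "./asset/walk1.mp4,./asset/walk2.mp4" into individual file paths.
std::vector<std::string> splitInputs(const std::string& arg) {
    std::vector<std::string> paths;
    std::stringstream ss(arg);
    std::string item;
    while (std::getline(ss, item, ',')) {
        if (!item.empty()) paths.push_back(item);
    }
    return paths;
}
```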
To run inference using the Docker image:

```bash
docker run --gpus all -it --rm \
    -v $(pwd)/weights:/workspace/weights \
    -v $(pwd)/asset:/workspace/asset \
    yolov11-cuda-trt ./asset/walk1.mp4,./asset/walk2.mp4 --engine_path=weights/yolo11s.engine --batch_size=8 --confidence_threshold=0.7
```
Preprocessing:
- Resizes and normalizes input images or video frames to 640x640 resolution.
- Converts the color space from BGR to RGB.
- Batch processing: combines multiple inputs for parallel GPU processing.
- Format conversion: converts images to NCHW layout.
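For reference, a minimal CPU-side sketch of these steps using OpenCV's `cv::dnn::blobFromImages`, which performs the resize, 1/255 scaling, BGR-to-RGB swap, and NCHW conversion in one call (the project's CUDA path does the equivalent on the GPU):

```cpp
#include <opencv2/opencv.hpp>
#include <vector>

// Convert a batch of frames into a single NCHW float blob for the engine.
cv::Mat preprocess(const std::vector<cv::Mat>& frames) {
    return cv::dnn::blobFromImages(
        frames,
        1.0 / 255.0,            // normalize pixel values to [0, 1]
        cv::Size(640, 640),     // network input resolution
        cv::Scalar(),           // no mean subtraction
        /*swapRB=*/true,        // BGR -> RGB
        /*crop=*/false);
}
```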
Inference:
- Executes the TensorRT engine on the GPU.
- Processes inputs in batches for efficiency.
- Leverages CUDA streams to overlap computation and data transfer.
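A minimal sketch of launching one batch on a CUDA stream with the TensorRT C++ API. It assumes TensorRT 8.5+, pre-allocated host/device buffers, and the Ultralytics export's default tensor names `images`/`output0` (check your engine's actual I/O tensor names):

```cpp
#include <NvInfer.h>
#include <cuda_runtime.h>

// Launch one batch on its own CUDA stream so copies and compute can overlap
// with work queued on other streams.
void inferBatch(nvinfer1::IExecutionContext* context,
                const float* hInput, void* dInput, size_t inputBytes,
                float* hOutput, void* dOutput, size_t outputBytes,
                cudaStream_t stream) {
    // Host-to-device copy of the preprocessed batch, non-blocking on the host.
    cudaMemcpyAsync(dInput, hInput, inputBytes, cudaMemcpyHostToDevice, stream);

    // Bind tensors by name and enqueue inference (TensorRT >= 8.5 API).
    context->setTensorAddress("images", dInput);
    context->setTensorAddress("output0", dOutput);
    context->enqueueV3(stream);

    // Device-to-host copy of the raw detections, then wait for this stream only.
    cudaMemcpyAsync(hOutput, dOutput, outputBytes, cudaMemcpyDeviceToHost, stream);
    cudaStreamSynchronize(stream);
}
```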
Postprocessing:
- Confidence filtering: removes low-confidence detections based on a threshold.
- Non-Maximum Suppression (NMS): removes overlapping bounding boxes for the same object.
- Outputs detections with:
  - Class IDs
  - Confidence scores
  - Bounding box coordinates
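A minimal sketch of confidence filtering plus NMS using OpenCV's `cv::dnn::NMSBoxes` (an illustrative stand-in, not necessarily the project's exact NMS routine):

```cpp
#include <opencv2/opencv.hpp>
#include <vector>

// Drop detections below the confidence threshold, then suppress boxes that
// overlap a higher-scoring box by more than the NMS IoU threshold.
std::vector<int> postprocess(const std::vector<cv::Rect>& boxes,
                             const std::vector<float>& scores,
                             float confThreshold = 0.7f,
                             float nmsThreshold = 0.45f) {
    std::vector<int> keep;
    cv::dnn::NMSBoxes(boxes, scores, confThreshold, nmsThreshold, keep);
    return keep;  // indices of detections that survive filtering and NMS
}
```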
Concurrency:
- Threaded inference: input files (images or videos) are processed concurrently using multiple threads.
- CUDA streams: each thread operates on a separate CUDA stream to parallelize preprocessing, inference, and data transfer for multiple inputs.
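A minimal sketch of this threading model, with one worker thread per input file and a private CUDA stream per thread (the `processFile` worker is hypothetical):

```cpp
#include <cuda_runtime.h>
#include <string>
#include <thread>
#include <vector>

// Hypothetical per-file worker: each thread creates its own CUDA stream so
// preprocessing, inference, and copies for different inputs can overlap.
void processFile(const std::string& path) {
    cudaStream_t stream;
    cudaStreamCreate(&stream);
    // ... preprocess frames, enqueue inference, and copy results on `stream` ...
    cudaStreamDestroy(stream);
}

int main() {
    std::vector<std::string> inputs = {"./asset/walk1.mp4", "./asset/walk2.mp4"};
    std::vector<std::thread> workers;
    for (const auto& path : inputs)
        workers.emplace_back(processFile, path);
    for (auto& t : workers)
        t.join();
}
```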
Output:
- Logs inference times for each batch, frame, and individual input file.
- Logs detections with class labels, confidence scores, and bounding box details.
- Saves processed images and videos with bounding boxes drawn.
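A minimal sketch of how each detection might be drawn with OpenCV before the annotated file is saved (function name and styling are illustrative):

```cpp
#include <opencv2/opencv.hpp>
#include <string>

// Draw one detection's bounding box and "label confidence" text onto a frame.
void drawDetection(cv::Mat& frame, const cv::Rect& box,
                   const std::string& label, float confidence) {
    cv::rectangle(frame, box, cv::Scalar(0, 255, 0), 2);
    std::string text = label + " " + cv::format("%.2f", confidence);
    cv::putText(frame, text, {box.x, box.y - 5},
                cv::FONT_HERSHEY_SIMPLEX, 0.5, cv::Scalar(0, 255, 0), 1);
}
```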
Examples:
- Image input:
  - Input: `./asset/bus.jpg`
  - Command: `./build/main ./asset/bus.jpg --engine_path=weights/yolo11s.engine --confidence_threshold=0.8`
  - Output: annotated image saved as `out_bus.jpg`.
- Video input:
  - Input: `./asset/walk1.mp4`
  - Command: `./build/main ./asset/walk1.mp4 --engine_path=weights/yolo11s.engine --batch_size=4 --confidence_threshold=0.7`
  - Output: annotated video saved as `out_walk1.mp4`.
The pipeline processes inputs efficiently by leveraging GPU acceleration. Key performance characteristics:
- Preprocessing and inference: Executed on the GPU for faster computation.
- Postprocessing: Executed on the CPU for flexibility and precision.
- Throughput: Supports batch sizes up to the GPU memory limit, providing high throughput for both images and videos.
- Multi-threading: Achieves concurrent processing of multiple inputs, significantly improving throughput.
Limitations:
- Postprocessing is CPU-bound, which may bottleneck performance for large batch sizes.
- Requires a TensorRT-compatible GPU.
Sample log output:

```text
Inference time for batch in ./asset/walk.mp4: 163.39 ms, 20.4238ms/frame
[Final Detection] Class ID: 0, Confidence: 0.729492, BBox: [270, 80, 63, 428]
[Final Detection] Class ID: 0, Confidence: 0.835938, BBox: [269, 75, 67, 437]
[Final Detection] Class ID: 0, Confidence: 0.708984, BBox: [611, 227, 16, 76]
[Final Detection] Class ID: 0, Confidence: 0.706543, BBox: [260, 73, 95, 438]
[Final Detection] Class ID: 0, Confidence: 0.733398, BBox: [252, 74, 92, 436]
[Final Detection] Class ID: 0, Confidence: 0.82959, BBox: [244, 77, 124, 433]
[Final Detection] Class ID: 0, Confidence: 0.730469, BBox: [606, 226, 26, 78]
Inference time for batch in ./asset/walk.mp4: 164.353 ms, 20.5441ms/frame
[Final Detection] Class ID: 0, Confidence: 0.811523, BBox: [239, 80, 117, 430]
[Final Detection] Class ID: 0, Confidence: 0.751953, BBox: [606, 226, 30, 78]
[Final Detection] Class ID: 0, Confidence: 0.867676, BBox: [232, 86, 144, 424]
[Final Detection] Class ID: 0, Confidence: 0.759766, BBox: [606, 229, 33, 75]
[Final Detection] Class ID: 0, Confidence: 0.820312, BBox: [227, 88, 142, 422]
[Final Detection] Class ID: 0, Confidence: 0.742676, BBox: [606, 227, 33, 77]
[Final Detection] Class ID: 0, Confidence: 0.828613, BBox: [223, 91, 146, 420]
[Final Detection] Class ID: 0, Confidence: 0.839844, BBox: [221, 91, 140, 419]
[Final Detection] Class ID: 0, Confidence: 0.862793, BBox: [225, 91, 132, 419]
[Final Detection] Class ID: 0, Confidence: 0.775391, BBox: [240, 91, 112, 398]
Inference time for batch in ./asset/walk.mp4: 165.66 ms, 20.7076ms/frame
[Final Detection] Class ID: 0, Confidence: 0.737305, BBox: [443, 213, 19, 100]
[Final Detection] Class ID: 0, Confidence: 0.730469, BBox: [270, 76, 55, 438]
[Final Detection] Class ID: 0, Confidence: 0.714355, BBox: [439, 217, 19, 96]
[Final Detection] Class ID: 0, Confidence: 0.800781, BBox: [260, 76, 66, 435]
[Final Detection] Class ID: 0, Confidence: 0.796875, BBox: [254, 82, 73, 432]
[Final Detection] Class ID: 0, Confidence: 0.815918, BBox: [248, 81, 105, 431]
[Final Detection] Class ID: 0, Confidence: 0.85498, BBox: [241, 88, 102, 423]
[Final Detection] Class ID: 0, Confidence: 0.70166, BBox: [611, 228, 29, 77]
```