This sample demonstrates QAT training and deployment of YOLOv5s on Orin DLA, which includes:
- YOLOv5s QAT training. - see export
- Deploy the YOLOv5s QAT model with cuDLA hybrid mode and cuDLA standalone mode.
- Convert QAT model to PTQ model and INT8 calibration cache. - see export
- Build DLA standalone loadable with TensorRT(INT8/FP16). - see data/model
- Load and run the DLA loadable with cuDLA. - see src
- Validate DLA performance and accuracy on COCO 2017 val dataset. - see test_coco_map
sudo apt update
sudo apt install libopencv-dev libjsoncpp-dev python3-pip git git-lfs
# If you want to run the mAP benchmark on the COCO dataset, install the COCO tools and download the dataset
pip3 install pycocotools
cd data/
bash download_coco_validation_set.sh
# cmake >= 3.18
# if pre-installed cmake is lower than 3.18, src/matx_reformat/build_matx_reformat.sh
# will install it for you
Refer to export/README.md.
git clone --recursive https://github.com/NVIDIA-AI-IOT/cuDLA-samples.git
If your OS version is lower than DriveOS 6.0.8.0 or JetPack 6.0, please apply trtexec-dla-standalone-trtv8.5.patch to trtexec and rebuild it (the patch targets TensorRT 8.5; for other versions you may need to apply it manually).
cp data/trtexec-dla-standalone-trtv8.5.patch /usr/src/tensorrt/
cd /usr/src/tensorrt/
git apply trtexec-dla-standalone-trtv8.5.patch
cd samples/trtexec
sudo make
Build loadable and compile matx reformat lib
# Build INT8 and FP16 loadable from ONNX in this project
bash data/model/build_dla_standalone_loadable.sh
# Build matx used in pre-/post-processing
bash src/matx_reformat/build_matx_reformat.sh
Run the sample with cuDLA hybrid mode
make clean
# Run INT8 inference on single image
make run
# Or run COCO validation
make validate_cudla_int8 # or make validate_cudla_fp16
Run the sample with cuDLA standalone mode
# "make clean" is needed when switch between hybrid mode and standalone mode
make clean
# Run INT8 inference on single image
make run USE_DLA_STANDALONE_MODE=1
# Or run COCO validation
make validate_cudla_int8 USE_DLA_STANDALONE_MODE=1
# or make validate_cudla_fp16 USE_DLA_STANDALONE_MODE=1
Run the sample with cuDLA standalone mode and a deterministic semaphore. This is for running the sample on some older DriveOS releases (we tested 6.0.6.0) and JetPack.
# "make clean" is needed when switch between hybrid mode and standalone mode
make clean
# Run INT8 inference on single image
make run USE_DLA_STANDALONE_MODE=1 USE_DETERMINISTIC_SEMAPHORE=1
# Or run COCO validation
make validate_cudla_int8 USE_DLA_STANDALONE_MODE=1 USE_DETERMINISTIC_SEMAPHORE=1
# or make validate_cudla_fp16 USE_DLA_STANDALONE_MODE=1 USE_DETERMINISTIC_SEMAPHORE=1
YOLOv5s | Official Data | DLA FP16 | DLA INT8 QAT | GPU INT8 QAT |
---|---|---|---|---|
mAP | 37.4 | 37.5 | 37.1 | 36.8 |
Note:
- We use an inference resolution of 1x3x672x672 to get this mAP.
- Falling back the last 4 layers of the last head to FP16 can increase the mAP from 37.1 to 37.3, but performance drops slightly from 4.0 ms to 4.46 ms. This can be tested with a new loadable built by
bash data/model/build_dla_standalone_loadable_v2.sh
Platform | GPU clock | Memory clock | DLA clock | TensorRT Version | DLA Version |
---|---|---|---|---|---|
Orin-X | 1275 MHz | 3200 MHz | 1331 MHz | 8.6 | 3.14 |
Batch Size | DLA INT8(int8:hwc4 in + fp16:chw16 out) (ms) | GPU INT8(int8:chw32 in + fp16:chw16 out) (ms) |
---|---|---|
1 | 3.82 | 1.82 |
2 | 7.68 | 2.91 |
4 | 15.17 | 4.99 |
8 | 30.92 | 9.19 |
12 | 46.71 | 13.27 |
16 | 62.54 | 16.87 |
- If we use int8:hwc4 in + int8:chw32 out, then DLA INT8 can reach about 2.4 ms (bs=1), but this leads to a small accuracy drop. We will optimize this in the future.
- The inference time (median GPU compute time) here is measured with trtexec, which uses a pre-launch technique to maximize throughput, so it is slightly better than the inference time measured in the pipeline.
This sample demonstrates how to use cuDLA hybrid mode and cuDLA standalone mode for a CUDA->cuDLA->CUDA pipeline. More details on cuDLA hybrid mode and cuDLA standalone mode can be found at https://docs.nvidia.com/cuda/cuda-for-tegra-appnote/index.html#memory-model.
Using cuDLA hybrid mode allows quick integration with other CUDA tasks; all we need to do is register the CUDA memory with cuDLA.
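As a rough illustration (a minimal sketch only, not the sample's actual implementation under src; run_hybrid is a hypothetical helper, and error checking, event wiring, and real tensor sizes are omitted), the hybrid-mode flow looks roughly like this:

```cpp
// Minimal cuDLA hybrid-mode sketch: allocate I/O with plain CUDA, register the
// pointers with cuDLA, and submit the task on a CUDA stream.
#include <cudla.h>
#include <cuda_runtime.h>
#include <cstddef>
#include <cstdint>
#include <vector>

void run_hybrid(const std::vector<uint8_t>& loadable, size_t inputSize, size_t outputSize)
{
    cudlaDevHandle dev = nullptr;
    cudlaCreateDevice(0 /*DLA core*/, &dev, CUDLA_CUDA_DLA);   // hybrid mode

    cudlaModule module = nullptr;
    cudlaModuleLoadFromMemory(dev, loadable.data(), loadable.size(), &module, 0);

    // Allocate I/O buffers with CUDA, then register them with cuDLA.
    void *inBuf = nullptr, *outBuf = nullptr;
    cudaMalloc(&inBuf, inputSize);
    cudaMalloc(&outBuf, outputSize);
    uint64_t *dlaIn = nullptr, *dlaOut = nullptr;
    cudlaMemRegister(dev, static_cast<uint64_t*>(inBuf), inputSize, &dlaIn, 0);
    cudlaMemRegister(dev, static_cast<uint64_t*>(outBuf), outputSize, &dlaOut, 0);

    // Submit the DLA task on a CUDA stream so it runs in-order with other CUDA work.
    cudaStream_t stream;
    cudaStreamCreate(&stream);
    cudlaTask task = {};
    task.moduleHandle     = module;
    task.numInputTensors  = 1;
    task.inputTensor      = &dlaIn;
    task.numOutputTensors = 1;
    task.outputTensor     = &dlaOut;
    cudlaSubmitTask(dev, &task, 1, stream, 0);
    cudaStreamSynchronize(stream);

    // Cleanup.
    cudlaMemUnregister(dev, dlaIn);
    cudlaMemUnregister(dev, dlaOut);
    cudaFree(inBuf);
    cudaFree(outBuf);
    cudlaModuleUnload(module, 0);
    cudlaDestroyDevice(dev);
    cudaStreamDestroy(stream);
}
```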
Using cuDLA standalone mode avoids creating a CUDA context and thus improves parallelism with other GPU tasks. cuDLA standalone mode makes use of NvSci for data transfer and synchronization with other modules such as the camera, GPU, or CPU.
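A rough sketch of the standalone-mode buffer path is shown below (registerStandaloneBuffer is a hypothetical helper; it assumes the NvSciBufObj has already been allocated and reconciled elsewhere, the NvSciSync wait/signal wiring is only hinted at in the comments, and error checks are omitted; see the sample's standalone context code for the full flow):

```cpp
// Registering an NvSciBuf-backed buffer with cuDLA in standalone mode.
#include <cudla.h>
#include <nvscibuf.h>
#include <cstdint>

uint64_t* registerStandaloneBuffer(cudlaDevHandle dev, NvSciBufObj bufObj, uint64_t size)
{
    // In standalone mode the device is created without touching CUDA:
    //   cudlaCreateDevice(0 /*DLA core*/, &dev, CUDLA_STANDALONE);
    cudlaExternalMemoryHandleDesc memDesc = {};
    memDesc.extBufObject = bufObj;   // NvSciBuf object shared with the producer/consumer
    memDesc.size         = size;

    uint64_t* dlaPtr = nullptr;
    cudlaImportExternalMemory(dev, &memDesc, &dlaPtr, 0);
    return dlaPtr;                   // usable as an input/output tensor in cudlaTask
    // Synchronization would use NvSciSync objects imported via
    // cudlaImportExternalSemaphore and passed as waitEvents/signalEvents in cudlaTask.
}
```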
Our cuDLA hybrid mode context code and standalone mode context code have no other dependencies on the rest of the sample, so they can be integrated into a user's application quickly. Just copy src/cuda_context_hybird.* or src/cuda_context_standalone.* into your own project, add the necessary include paths and link libraries (check ./Makefile), and then you can make use of our code directly.
- The scale used for the FP32-to-INT8 conversion in pre-processing is hardcoded; its value comes from the first layer of the calibration cache (check mInputScale and images: 3c00f9f4), and a small sketch of the conversion is given after the I/O format table below. For more information about QAT, check pytorch-quantization. For more information about quantization inside TensorRT, check the TensorRT Developer Guide.
- The input and output formats below need to be used for cuDLA INT8 and FP16. More info about DLA I/O formats can be found in I/O Formats on DLA. The INT8 cuDLA inference in this sample uses INT8 input (kDLA_LINEAR,kDLA_HWC4) + FP16 output (kDLA_LINEAR,kCHW16); a sketch of requesting such formats through the TensorRT API follows the table.
I/O | Format |
---|---|
INT8 Input | kDLA_LINEAR,kDLA_HWC4,kCHW32 |
FP16 Input | kDLA_LINEAR,kCHW16 |
INT8 Output | kDLA_LINEAR,kCHW32 |
FP16 Output | kDLA_LINEAR,kCHW16 |
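As an illustration of the pre-processing note above, here is a minimal sketch of the FP32-to-INT8 conversion (quantizeInput is a hypothetical helper; the sample's real pre-processing runs on the GPU, and inputScale stands for the value decoded from the images entry of the calibration cache):

```cpp
// q = clamp(round(x / scale), -128, 127), with scale taken from the calibration cache.
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <vector>

std::vector<int8_t> quantizeInput(const std::vector<float>& pixels, float inputScale)
{
    std::vector<int8_t> out(pixels.size());
    for (size_t i = 0; i < pixels.size(); ++i)
    {
        float q = std::round(pixels[i] / inputScale);
        out[i]  = static_cast<int8_t>(std::max(-128.0f, std::min(127.0f, q)));
    }
    return out;
}
```

And a hedged sketch of how such DLA I/O formats can be requested through the public TensorRT C++ API when building an engine or loadable programmatically (set_dla_io_formats is a hypothetical helper; the sample itself builds its loadables with the scripts under data/model, and additional builder configuration such as INT8/FP16 flags, DLA core selection, and calibration is still required):

```cpp
// Requesting DLA-friendly I/O formats on the network's input and output tensors.
#include <NvInfer.h>

void set_dla_io_formats(nvinfer1::INetworkDefinition& network)
{
    using namespace nvinfer1;

    // INT8 input: kDLA_LINEAR or kDLA_HWC4 (this sample's loadable uses HWC4 input).
    ITensor* input = network.getInput(0);
    input->setType(DataType::kINT8);
    input->setAllowedFormats((1U << static_cast<int>(TensorFormat::kDLA_LINEAR))
                             | (1U << static_cast<int>(TensorFormat::kDLA_HWC4)));

    // FP16 outputs: kDLA_LINEAR or kCHW16 (this sample's loadable uses CHW16 outputs).
    for (int i = 0; i < network.getNbOutputs(); ++i)
    {
        ITensor* output = network.getOutput(i);
        output->setType(DataType::kHALF);
        output->setAllowedFormats((1U << static_cast<int>(TensorFormat::kDLA_LINEAR))
                                  | (1U << static_cast<int>(TensorFormat::kCHW16)));
    }
}
```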
- NVIDIA reserves the right to change the roadmap without any implicit assumption of API and compatibility support.
https://github.com/NVIDIA/Deep-Learning-Accelerator-SW
https://developer.nvidia.com/blog/maximizing-deep-learning-performance-on-nvidia-jetson-orin-with-dla
https://developer.nvidia.com/blog/deploying-yolov5-on-nvidia-jetson-orin-with-cudla-quantization-aware-training-to-inference