Multiple TF export improvements #4824
Conversation
@zldrobit can confirm super speedup on TFLite detect.py inference:

Before:
detect: weights=['yolov5s.tflite'], source=data/images, imgsz=[640, 640], conf_thres=0.25, iou_thres=0.45, max_det=1000, device=, view_img=False, save_txt=False, save_conf=False, save_crop=False, nosave=False, classes=None, agnostic_nms=False, augment=False, visualize=False, update=False, project=runs/detect, name=exp, exist_ok=False, line_thickness=3, hide_labels=False, hide_conf=False, half=False
YOLOv5 🚀 v5.0-436-g6b44ecd torch 1.9.0+cu102 CUDA:0 (Tesla T4, 15109.75MB)
image 1/2 /content/yolov5/data/images/bus.jpg: 640x640 4 class0s, 1 class5, Done. (23.460s)
image 2/2 /content/yolov5/data/images/zidane.jpg: 640x640 2 class0s, 2 class27s, Done. (23.573s)
Speed: 5.0ms pre-process, 23516.5ms inference, 8.4ms NMS per image at shape (1, 3, 640, 640)
Results saved to runs/detect/exp

This PR:
detect: weights=['yolov5s-fp16.tflite'], source=data/images, imgsz=[640, 640], conf_thres=0.25, iou_thres=0.45, max_det=1000, device=, view_img=False, save_txt=False, save_conf=False, save_crop=False, nosave=False, classes=None, agnostic_nms=False, augment=False, visualize=False, update=False, project=runs/detect, name=exp, exist_ok=False, line_thickness=3, hide_labels=False, hide_conf=False, half=False
YOLOv5 🚀 v3.0-901-gfe2b1ec torch 1.9.0+cu102 CUDA:0 (Tesla T4, 15109.75MB)
image 1/2 /content/yolov5/data/images/bus.jpg: 640x640 4 class0s, 1 class5, Done. (0.403s)
image 2/2 /content/yolov5/data/images/zidane.jpg: 640x640 2 class0s, 2 class27s, Done. (0.320s)
Speed: 4.8ms pre-process, 361.5ms inference, 7.7ms NMS per image at shape (1, 3, 640, 640)
Results saved to runs/detect/exp
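For reference, a minimal standalone sketch of running the exported model with the TFLite Python interpreter. This is not the detect.py implementation itself (letterbox preprocessing and NMS are omitted); only the model filename and 640x640 size are taken from the logs above, the rest is illustrative:

```python
import numpy as np
import tensorflow as tf

# Minimal sketch: load the exported TFLite model and run one forward pass.
# detect.py adds letterbox preprocessing and NMS on top of this.
interpreter = tf.lite.Interpreter(model_path="yolov5s-fp16.tflite")
interpreter.allocate_tensors()

inp = interpreter.get_input_details()[0]
out = interpreter.get_output_details()[0]

# Dummy input matching the model's expected shape/dtype (real use: a preprocessed image)
im = np.zeros(inp["shape"], dtype=inp["dtype"])
interpreter.set_tensor(inp["index"], im)
interpreter.invoke()

pred = interpreter.get_tensor(out["index"])
print(pred.shape)  # raw predictions before NMS
```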
@zldrobit can also confirm trainable params are now 0:
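For context, a toy Keras sketch of the freezing step this PR applies (keras_model.trainable = False), which is what makes the trainable parameter count report 0; the small model below is just a stand-in for the converted YOLOv5 graph:

```python
import tensorflow as tf

# Stand-in Keras model (not YOLOv5) to illustrate the freezing step in this PR
keras_model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(16, 3, input_shape=(640, 640, 3)),
    tf.keras.layers.Conv2D(32, 3),
])

keras_model.trainable = False  # freeze every layer before saving/exporting

# Trainable params should now be 0, as reported by keras_model.summary()
n_trainable = sum(tf.keras.backend.count_params(w) for w in keras_model.trainable_weights)
print(f"Trainable params: {n_trainable}")  # -> Trainable params: 0
```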
@zldrobit PR is merged. Thank you for your contributions to YOLOv5 🚀 and Vision AI ⭐
@zldrobit now that the TFLite export defaults to FP16 post-training quantization, how does the …
@alexdwu13 that's a good question.
@glenn-jocher so I've actually been unable to successfully use the … But I did compare …

If you drop the 2 models into https://netron.app/ you can see that …

Since the GPU delegate should be able to natively run in FP16, it seems strange these …
@alexdwu13 yes, it's true that there is an assert to avoid using --half with CPU. This is because PyTorch is unable to run CPU inference with FP16 models, and we do dry runs with the PyTorch model here to build grids, for example:

Lines 296 to 298 in 2c2ef25

Maybe we could cast the model and image to half after this PyTorch inference?
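A rough sketch of that idea (an assumption about the approach, not the actual export.py change): keep the FP32 dry run that builds the grids, then cast afterwards. The model and im below are toy stand-ins for the loaded YOLOv5 model and its input tensor:

```python
import torch

# Toy stand-ins for the YOLOv5 model and input image tensor
model = torch.nn.Conv2d(3, 16, 3)
im = torch.zeros(1, 3, 640, 640)

_ = model(im)  # FP32 dry run (safe on CPU; builds grids in the real model)

# Cast to FP16 only afterwards; many FP16 ops are GPU-only in PyTorch,
# hence the existing assert against --half on CPU.
if torch.cuda.is_available():
    model, im = model.half().cuda(), im.half().cuda()
    _ = model(im)
```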
@alexdwu13 ok I ran an experiment. If I cast to .half() after PyTorch inference I get errors on TorchScript and ONNX export:

PyTorch: starting from /Users/glennjocher/PycharmProjects/yolov5/yolov5s.pt (14.8 MB)
TorchScript: starting export with torch 1.9.1...
TorchScript: export failure: "unfolded2d_copy" not implemented for 'Half'
ONNX: starting export with onnx 1.10.1...
ONNX: export failure: "unfolded2d_copy" not implemented for 'Half'

But TFLite export works fine, though the exported -fp16.tflite model still has dequantize blocks in it. Interestingly, all of the actual Conv layers in the -fp16.tflite model are defined in FP32, so yes, it seems like providing the model directly in FP32 is most efficient unless there is a way to have the TFLite Conv layers exist natively in FP16. @zldrobit what do you think?
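For reference, the standard TFLite FP16 post-training quantization recipe looks roughly like this (a sketch, with saved_model_dir and the output filename as placeholders); it stores weights in FP16 but keeps FP32 op signatures, which is where the Dequantize blocks come from:

```python
import tensorflow as tf

# Sketch of TFLite FP16 post-training quantization; "saved_model_dir" is a placeholder
converter = tf.lite.TFLiteConverter.from_saved_model("saved_model_dir")
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.target_spec.supported_types = [tf.float16]
tflite_model = converter.convert()

# Weights are stored in FP16 (roughly half the file size); at runtime they are
# dequantized to FP32 unless a delegate (e.g. GPU) can run them in FP16 natively.
with open("yolov5s-fp16.tflite", "wb") as f:
    f.write(tflite_model)
```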
@alexdwu13 @glenn-jocher I inspected the fp32 model @alexdwu13 provided and found that it is actually an int8 model. TFLite now supports int8 model acceleration with the GPU delegate (tensorflow/tensorflow#41485 (comment)). This explains why these lines:

Lines 191 to 192 in 39c17ce

have to be commented out. I tested …

The fp32 and fp16 models consume almost the same time. This is because …

The elapsed time is more than doubled. The …

Therefore, an fp16 model is as efficient as an fp32 model on the GPU delegate, and the fp16 model is 50% smaller.

yolov5s-fp32.tflite.zip
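For comparison, a sketch of full-integer (int8) post-training quantization, the kind of model identified above. The saved_model path and the calibration generator are placeholders; real calibration should iterate over preprocessed training images:

```python
import numpy as np
import tensorflow as tf

def representative_dataset():
    # Placeholder calibration data; replace with real preprocessed images (NHWC, float32)
    for _ in range(100):
        yield [np.random.rand(1, 640, 640, 3).astype(np.float32)]

converter = tf.lite.TFLiteConverter.from_saved_model("saved_model_dir")  # placeholder path
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_dataset
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
tflite_model = converter.convert()

with open("yolov5s-int8.tflite", "wb") as f:
    f.write(tflite_model)
```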
@zldrobit thank you for catching that! Yeah, it looks like … It looks like the reason I was getting slower execution times with the …

With the newest version of the benchmark tool – which uses … So my question is: assuming …
Very interesting thread! Then, in the prediction, … The inference time per image is about 400 ms. Do I have any options left that might allow me to make inference faster?
I cannot find any documents elaborating the mechanism of fp16 TFLite inference, but you could refer to some more general materials, like those on ARM GPUs and Nvidia GPUs.

Considering that a cell phone has less storage than a PC, mobile app developers prefer small models to large models, so they would choose fp16 or even int8 models running on the GPU. Note that fp16 precision is set to true by default when using the GPU with the TFLite Java API on Android: https://stackoverflow.com/a/62088843/3036450.
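The Android FP16 default above applies to the Java API, but the same GPU delegate can also be exercised from Python, roughly as sketched below. This is a hedged example: the delegate shared-library name and its availability depend on your platform and TFLite build:

```python
import tensorflow as tf

# Hedged sketch: the GPU delegate library name below is platform/build dependent
# and may differ on your system; on Android the Java API enables FP16 inference
# on the GPU delegate by default (see the Stack Overflow link above).
gpu_delegate = tf.lite.experimental.load_delegate("libtensorflowlite_gpu_delegate.so")
interpreter = tf.lite.Interpreter(
    model_path="yolov5s-fp16.tflite",
    experimental_delegates=[gpu_delegate],
)
interpreter.allocate_tensors()  # ops supported by the delegate now run on the GPU
```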
* Add fused conv support
* Set all saved_model values to non trainable
* Fix TFLite fp16 model export
* Fix int8 TFLite conversion
@JNaranjo-Alcazar The … PS: …
🛠️ PR Summary
Made with ❤️ by Ultralytics Actions
🌟 Summary
Improved support for TensorFlow Lite (TFLite) export and Keras model conversion 🚀
📊 Key Changes
- Set keras_model.trainable to False to freeze the model weights during export.
- Exported FP16 TFLite models are now labeled with a suffix (-fp16) in the filename.

🎯 Purpose & Impact
Overall, these changes aim to enhance the user experience by providing more efficient and understandable model export options, ensuring more consistent cross-platform model performance. 📈🔒