-
Notifications
You must be signed in to change notification settings - Fork 74.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Use xa_nnlib for svdf for Fusion F1. #47098
Conversation
Thanks for contributing to TensorFlow Lite Micro. To keep this process moving along, we'd like to make sure that you have completed the items on this list:
We would like to have a discussion on the Github issue first to determine the best path forward, and then proceed to the PR review. |
tagging @pnikam-cad @nyadla-sys @bhanuprakashbv |
Confirmed that the test passes with: ``` make -f tensorflow/lite/micro/tools/make/Makefile TARGET=xtensa OPTIMIZED_KERNEL_DIR=xtensa TARGET_ARCH=fusion_f1 XTENSA_CORE=F1_190305_swupgrade test_kernel_softmax_test -j8 ``` However, the latency improvement is only ~1000 ticks, as tested with: ``` make -f tensorflow/lite/micro/tools/make/Makefile TARGET=xtensa OPTIMIZED_KERNEL_DIR=xtensa TARGET_ARCH=fusion_f1 XTENSA_CORE=F1_190305_swupgrade test_keyword_benchmark -j8 ``` Since Softmax is currently a small fraction of the overall keyword_benchmark latency we will focus on the latency of only this particular OP. With the optimized implementation: ``` SOFTMAX took 749 ticks (0 ms). ``` Reference implementation: ``` SOFTMAX took 2052 ticks (2 ms). ``` And with the LUT hifimini implementation (for completeness): ``` SOFTMAX took 1142 ticks (1 ms). ``` The gain of ~1500 ticks ticks is still worth merging because after all the optimizations (e.g. tensorflow#47098), this will still mean a ~5% improvement for the keyword benchmark. And the benefits might be more significant for other models too.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Very cool, great speed-ups! Only minor comment nits.
(data.effective_scale_1_a), data.effective_scale_1_b), | ||
0); | ||
} | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is no longer part of the feature matmul... perhaps leave a comment here to mark the start of the time weights dot product + activation stage.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done.
The code in this change is the subset of functionality needed for int8 svdf for Hifi4 copied from https://github.com/pnikam-cad/tensorflow/blob/a737c1e3945bc70022259479ad24133a343ec906/tensorflow/lite/micro/kernels/xtensa_hifi/svdf.cc Note that the current change has not pulled in either the floating point implementation or the Hifi5 implementation. Profiled the keryword_benchmark with the following command: ``` make -f tensorflow/lite/micro/tools/make/Makefile TARGET=xtensa OPTIMIZED_KERNEL_DIR=xtensa TARGET_ARCH=fusion_f1 XTENSA_CORE=F1_190305_swupgrade run_keyword_benchmark -j8 ``` gives a latency of 38516 ticks with this change vs 152642 ticks without this change. Per OP latency with this change: ``` KeywordRunNIerations(1) took 38516 ticks (38 ms) QUANTIZE took 3758 ticks (3 ms). SVDF took 4753 ticks (4 ms). FULLY_CONNECTED took 1353 ticks (1 ms). SVDF took 4211 ticks (4 ms). FULLY_CONNECTED took 1353 ticks (1 ms). SVDF took 3145 ticks (3 ms). FULLY_CONNECTED took 1353 ticks (1 ms). SVDF took 4211 ticks (4 ms). FULLY_CONNECTED took 1353 ticks (1 ms). SVDF took 2890 ticks (2 ms). SVDF took 3583 ticks (3 ms). SVDF took 3054 ticks (3 ms). FULLY_CONNECTED took 1091 ticks (1 ms). SOFTMAX took 2042 ticks (2 ms). QUANTIZE took 366 ticks (0 ms). ``` Without this change: ``` KeywordRunNIerations(1) took 152642 ticks (152 ms) QUANTIZE took 3758 ticks (3 ms). SVDF took 38003 ticks (38 ms). FULLY_CONNECTED took 1353 ticks (1 ms). SVDF took 18803 ticks (18 ms). FULLY_CONNECTED took 1353 ticks (1 ms). SVDF took 18803 ticks (18 ms). FULLY_CONNECTED took 1353 ticks (1 ms). SVDF took 18803 ticks (18 ms). FULLY_CONNECTED took 1353 ticks (1 ms). SVDF took 13907 ticks (13 ms). SVDF took 15827 ticks (15 ms). SVDF took 15827 ticks (15 ms). FULLY_CONNECTED took 1091 ticks (1 ms). SOFTMAX took 2042 ticks (2 ms). QUANTIZE took 366 ticks (0 ms). ``` Also confirmed that the kernel_svdf_test passes with: ``` make -f tensorflow/lite/micro/tools/make/Makefile TARGET=xtensa OPTIMIZED_KERNEL_DIR=xtensa TARGET_ARCH=fusion_f1 XTENSA_CORE=F1_190305_swupgrade test_kernel_svdf_test -j8 ```
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
added a comment, as request.
de53e77
to
55633bf
Compare
The code in this change is the subset of functionality needed for int8 svdf for Hifi4 copied from https://github.com/pnikam-cad/tensorflow/blob/a737c1e3945bc70022259479ad24133a343ec906/tensorflow/lite/micro/kernels/xtensa_hifi/svdf.cc
Note that the current change has not pulled in either the floating point implementation or the Hifi5 implementation.
Profiled the keryword_benchmark with the following command:
gives a latency of 38516 ticks with this change vs 152642 ticks without this change.
Per OP latency with this change:
Without this change:
Also confirmed that the kernel_svdf_test passes with:
Progress towards http://b/177457688