Opportunities to Speed Up Conv Operation on Arm-V8? #8320
-
Thanks for this wonderful software! It's a pleasure to work with it. I am trying to optimize a convolution operation and learn how to obtain the maximum throughput on an ARMv8-A system. I think the operation could require as little as ~0.25 ms (based on peak-FLOP benchmarking), but the op currently takes ~1.5 ms.

Computation

Perf Profiling

Assembly
Potential for Improvement

I wonder if using more registers and buffering the data loads would help. Only a fraction of the registers are used, and they are consumed immediately after being loaded. Could I prefetch or buffer the data? Are there other opportunities to improve the throughput? A sketch of the kind of change I have in mind follows below.

Current Schedule
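To make the question concrete, here is a toy JIT sketch of the two ideas: keeping a small block of vector accumulators live across the whole reduction, and staging loads through a .in() wrapper. This is not my actual pipeline; the names (in, w, conv, c, x, r) and all sizes are made up.

```cpp
#include "Halide.h"
#include <cstdio>

using namespace Halide;

int main() {
    // Toy channel-reduction conv: conv(c, x) = sum over r of in(r, x) * w(r, c).
    ImageParam in(Float(32), 2, "in");  // in(input_channel, x)
    ImageParam w(Float(32), 2, "w");    // w(input_channel, output_channel)

    Var c("c"), x("x");
    RDom r(0, 64);  // 64 input channels, for example

    Func conv("conv");
    conv(c, x) = 0.0f;
    conv(c, x) += in(r.x, x) * w(r.x, c);

    // "More registers": unroll a small block of outputs so several vector
    // accumulators stay live across the whole reduction instead of one.
    conv.vectorize(c, 4);     // init stage
    conv.update()
        .reorder(c, x, r.x)   // keep the output block innermost
        .vectorize(c, 4)      // one 128-bit NEON register holds 4 floats
        .unroll(c, 2)         // 2 vectors wide in c ...
        .unroll(x, 4);        // ... times 4 outputs in x = 8 live accumulators

    // "Buffer the loads": stage the slice of `in` needed at each reduction
    // step into a small local buffer via a .in() wrapper, instead of loading
    // it directly inside the unrolled body.
    in.in(conv).compute_at(conv, r.x);

    // Dummy data, just so the sketch is self-contained and runnable.
    Buffer<float> in_buf(64, 256), w_buf(64, 8);
    in_buf.fill(1.0f);
    w_buf.fill(1.0f);
    in.set(in_buf);
    w.set(w_buf);

    Buffer<float> out = conv.realize({8, 256});
    printf("out(0, 0) = %f\n", out(0, 0));
    return 0;
}
```

Is this roughly the right direction, or is there a better-supported pattern (e.g. the prefetch scheduling directives) for hiding the load latency?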
-
If I were to optimize something like this, I'd start here and change the floats to int8s: https://github.com/halide/Halide/blob/main/apps/conv_layer/conv_layer_generator.cpp

Note that it does indeed use more vector accumulator registers (20 instead of 8) and is indeed doing some staging of inputs (the .in() scheduling calls).
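To make those two points concrete, here is a rough sketch in the spirit of that generator rather than a copy of it: int8 operands widened to 32-bit accumulators, an output block unrolled across several vector registers, and .in() wrappers staging the operands near the inner loop. The class name, layouts, block sizes, and compute_at levels below are illustrative guesses, not the tuned values from the linked file.

```cpp
#include "Halide.h"

namespace {

using namespace Halide;
using namespace Halide::ConciseCasts;

class ConvLayerInt8 : public Generator<ConvLayerInt8> {
public:
    // Layouts: input(ci, x, y, n), filter(ci, fx, fy, co), output(co, x, y, n).
    Input<Buffer<int8_t, 4>> input{"input"};
    Input<Buffer<int8_t, 4>> filter{"filter"};
    Output<Buffer<int32_t, 4>> output{"output"};

    void generate() {
        Var c("c"), x("x"), y("y"), n("n");

        RDom r(0, filter.dim(0).extent(),   // input channels
               0, filter.dim(1).extent(),   // filter x
               0, filter.dim(2).extent());  // filter y

        Func conv("conv");
        conv(c, x, y, n) = i32(0);
        // int8 * int8 widened to int16 is exact; accumulate in int32.
        conv(c, x, y, n) += i32(i16(input(r.x, x + r.y, y + r.z, n)) *
                                i16(filter(r.x, r.y, r.z, c)));
        output(c, x, y, n) = conv(c, x, y, n);

        // natural_vector_size<int32_t>() is 4 on 128-bit NEON, so this
        // (2 * vec) x 4 output block keeps 2 * 4 = 8 int32 accumulator
        // vectors live per tile. A wider block uses more of AArch64's 32
        // vector registers (the linked generator keeps ~20 accumulators
        // live), at the cost of higher register pressure.
        const int vec = natural_vector_size<int32_t>();
        Var co("co"), ci("ci"), xo("xo"), xi("xi");

        output.split(c, co, ci, 2 * vec)
            .split(x, xo, xi, 4)
            .reorder(ci, xi, xo, y, n, co)
            .vectorize(ci, vec)
            .unroll(ci)
            .unroll(xi)
            .parallel(y);

        conv.compute_at(output, xo)
            .vectorize(c, vec)
            .unroll(c)
            .unroll(x)
            .update()
            .reorder(c, x, r.x, r.y, r.z)
            .vectorize(c, vec)
            .unroll(c)
            .unroll(x);

        // Stage the operand slices for the current tile and reduction step so
        // the unrolled inner loop reads from small local buffers instead of
        // issuing strided loads from the original images.
        input.in().compute_at(conv, r.y);
        filter.in().compute_at(conv, r.y);
    }
};

}  // namespace

HALIDE_REGISTER_GENERATOR(ConvLayerInt8, conv_layer_int8)
```

On ARMv8 parts with the dot-product extension, adding Halide's ARMDotProd target feature is also worth trying for the int8 path, and inspecting the generated assembly (the number of live accumulators, whether sdot/udot show up) is the quickest way to confirm the schedule is doing what you expect.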