Opportunities to Speed Up Conv Operation on Arm-V8? #8320

FabianSchuetze · 2024-06-25T15:16:58Z

Thanks for this wonderful software! It's a pleasure to work with it.

I am trying to optimize a convolution operation and learn how to obtain the maximum throughput on an ARM-V8a system. I think the operation could require as little as ~0.25 ms (based on peak-flop benchmarking), but the op currently takes ~1.5 ms.

Computation
I am doing a convolution on int8 data with an int8 accumulator. The input size is 64x64 with 128 input and output channels. The memory ordering is W, H, channels for the input, and filter height, filter width, input channels, output channels for the filter.
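For readers who want a baseline to compare generated code against, the computation described above can be sketched as a scalar reference in plain C++. This is only a sketch under stated assumptions: the `ConvShape` struct and function names are hypothetical, the first-listed dimension is assumed to be innermost in memory, the filter is assumed to be centered on the output pixel, and edges are clamped (suggested by the `repeat_edge` stage in the schedule, but not confirmed by the post).

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical shape description; not taken from the original pipeline.
struct ConvShape {
    int width, height, in_channels, out_channels, filter_h, filter_w;
};

// 8-bit multiply-accumulate that wraps modulo 256, like NEON `mla` on .16b lanes.
// The final conversion is guaranteed to wrap since C++20 and does so on
// mainstream compilers before that.
inline std::int8_t mla_wrap(std::int8_t acc, std::int8_t a, std::int8_t b) {
    return static_cast<std::int8_t>(static_cast<std::uint8_t>(acc + a * b));
}

// Clamp an index into [0, size - 1] (assumed edge handling).
inline int clamp_index(int v, int size) {
    return std::min(std::max(v, 0), size - 1);
}

// Assumed layouts: input[(c * H + y) * W + x],
// filter[((co * Cin + ci) * Fw + fx) * Fh + fy], output[(co * H + y) * W + x].
std::vector<std::int8_t> conv_reference(const std::vector<std::int8_t>& input,
                                        const std::vector<std::int8_t>& filter,
                                        const ConvShape& s) {
    const std::size_t plane = static_cast<std::size_t>(s.width) * s.height;
    assert(input.size() == plane * s.in_channels);
    assert(filter.size() == static_cast<std::size_t>(s.filter_h) * s.filter_w *
                                s.in_channels * s.out_channels);

    std::vector<std::int8_t> output(plane * s.out_channels, 0);
    for (int co = 0; co < s.out_channels; ++co) {
        for (int y = 0; y < s.height; ++y) {
            for (int x = 0; x < s.width; ++x) {
                std::int8_t acc = 0;
                for (int ci = 0; ci < s.in_channels; ++ci) {
                    for (int fx = 0; fx < s.filter_w; ++fx) {
                        for (int fy = 0; fy < s.filter_h; ++fy) {
                            const int ix = clamp_index(x + fx - s.filter_w / 2, s.width);
                            const int iy = clamp_index(y + fy - s.filter_h / 2, s.height);
                            const std::size_t in_idx =
                                (static_cast<std::size_t>(ci) * s.height + iy) * s.width + ix;
                            const std::size_t f_idx =
                                ((static_cast<std::size_t>(co) * s.in_channels + ci) *
                                     s.filter_w + fx) * s.filter_h + fy;
                            acc = mla_wrap(acc, input[in_idx], filter[f_idx]);
                        }
                    }
                }
                output[(static_cast<std::size_t>(co) * s.height + y) * s.width + x] = acc;
            }
        }
    }
    return output;
}
```

Because the accumulator is only 8 bits wide, sums wrap silently; a reference like this is mainly useful for checking that an optimized schedule produces bit-identical results.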

Perf Profiling
Simpleperf (Android's variant of perf) shows that about 1/4 of all cycles are stalled in the backend, with about 1/2 of those stalls waiting for memory:

Performance counter statistics:

#          count  event_name              # count / runtime
  34,699,123,585  cpu-cycles              # 2.480500 GHz 
   9,837,322,760  raw-stall-backend       # 703.230 M/sec
   4,409,769,032  raw-stall-backend-mem   # 315.236 M/sec
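The fractions quoted above follow directly from the counters. As a small sketch (the function and constant names are made up for illustration; the values are copied from the simpleperf output), the arithmetic is:

```cpp
#include <cassert>
#include <cstdint>

// Counter values copied from the simpleperf output above.
constexpr std::uint64_t kCpuCycles       = 34699123585ULL;
constexpr std::uint64_t kStallBackend    = 9837322760ULL;
constexpr std::uint64_t kStallBackendMem = 4409769032ULL;

// Ratio of two raw counter values.
double fraction(std::uint64_t part, std::uint64_t total) {
    assert(total != 0);
    return static_cast<double>(part) / static_cast<double>(total);
}

// Share of all cycles stalled in the backend (roughly 0.28).
double backend_stall_share() {
    return fraction(kStallBackend, kCpuCycles);
}

// Share of backend stalls attributed to memory (roughly 0.45).
double memory_share_of_stalls() {
    return fraction(kStallBackendMem, kStallBackend);
}
```

So roughly 28% of cycles are backend stalls, and roughly 45% of those are memory stalls, which matches the "about 1/4" and "about 1/2" estimates.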

Assembly
The hot path is the following assembly code:

LBB251_5:                              // %"8_for_conv.s1.r12$x"
                                       //   Parent Loop BB251_1 Depth=1
                                       //     Parent Loop BB251_2 Depth=2
                                       //       Parent Loop BB251_3 Depth=3
                                       //         Parent Loop BB251_4 Depth=4
                                       // =>        This Inner Loop Header: Depth=5
   add	x19, x4, x6
   add	x7, x5, x6
   add	x20, x19, #576
   add	x21, x19, #1152
   add	x6, x6, #1
   ld1r	{ v17.16b }, [x19]
   cmp	x6, #3
   ldur	q16, [x7, #-34]
   ldr	q18, [x7]
   add	x7, x19, #1728
   ld1r	{ v19.16b }, [x20]
   ld1r	{ v20.16b }, [x21]
   mla	v7.16b, v17.16b, v16.16b
   mla	v6.16b, v17.16b, v18.16b
   ld1r	{ v17.16b }, [x7]
   mla	v5.16b, v19.16b, v16.16b
   mla	v4.16b, v19.16b, v18.16b
   mla	v3.16b, v20.16b, v16.16b
   mla	v2.16b, v20.16b, v18.16b
   mla	v1.16b, v17.16b, v16.16b
   mla	v0.16b, v17.16b, v18.16b
   b.ne	.LBB251_5

Potential for Improvement
Is there a profitable opportunity to improve this code?

I wonder if using more registers and buffering the data loads would help. Only a fraction of the registers are used, and they are consumed directly after being loaded. Could I prefetch/buffer the data? Are there other opportunities to improve the throughput?

Current Schedule
The code is generated with the Adams2019 auto-scheduler. The relevant part of the schedule seems to be:

    using ::Halide::Func;
    using ::Halide::MemoryType;
    using ::Halide::RVar;
    using ::Halide::TailStrategy;
    using ::Halide::Var;
    Func output = pipeline.get_func(6);
    Func conv = pipeline.get_func(5);
    Func translate = pipeline.get_func(4);
    Func repeat_edge = pipeline.get_func(3);
    Func lambda_0 = pipeline.get_func(2);
    Var _0(repeat_edge.get_schedule().dims()[0].var);
    Var _0i("_0i");
    Var _1(repeat_edge.get_schedule().dims()[1].var);
    Var _2(repeat_edge.get_schedule().dims()[2].var);
    Var c(output.get_schedule().dims()[2].var);
    Var ci("ci");
    Var x(output.get_schedule().dims()[0].var);
    Var xi("xi");
    Var y(output.get_schedule().dims()[1].var);
    Var yi("yi");
    Var yii("yii");
    RVar r12_x(conv.update(0).get_schedule().dims()[0].var);
    RVar r12_y(conv.update(0).get_schedule().dims()[1].var);
    RVar r12_z(conv.update(0).get_schedule().dims()[2].var);
    output
        .split(y, y, yi, 4, TailStrategy::ShiftInwards)
        .split(c, c, ci, 4, TailStrategy::ShiftInwards)
        .split(x, x, xi, 16, TailStrategy::ShiftInwards)
        .split(yi, yi, yii, 2, TailStrategy::ShiftInwards)
        .unroll(yii)
        .unroll(ci)
        .vectorize(xi)
        .compute_root()
        .reorder({xi, yii, ci, x, yi, y, c})
        .fuse(y, c, y)
        .parallel(y);
    conv
        .store_in(MemoryType::Stack)
        .split(x, x, xi, 16, TailStrategy::RoundUp)
        .unroll(y)
        .unroll(c)
        .vectorize(xi)
        .compute_at(output, x)
        .reorder({xi, x, y, c});
    conv.update(0)
        .split(x, x, xi, 16, TailStrategy::GuardWithIf)
        .unroll(y)
        .unroll(c)
        .vectorize(xi)
        .reorder({xi, x, y, c, r12_x, r12_y, r12_z});

Answered by abadams

abadams · 2024-06-25T22:31:07Z

Maintainer

If I were to optimize something like this, I'd start here and change the floats to int8s:

https://github.com/halide/Halide/blob/main/apps/conv_layer/conv_layer_generator.cpp

Note that it does indeed use more vector accumulator registers (20 instead of 8) and is indeed doing some staging of inputs (the .in() scheduling calls).
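To illustrate why a larger accumulator tile helps, here is a plain C++ sketch of the register-tiling idea, independent of Halide. It assumes the 20 accumulators are arranged as a 4x5 tile (the answer only gives the count), uses scalar int32 stand-ins where real code would hold one NEON vector per accumulator, and all names are hypothetical. Each reduction step stages 4 weights and 5 input values once and reuses them for 20 multiply-accumulates.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr int kTileW = 4;  // assumed: output channels per tile
constexpr int kTileH = 5;  // assumed: output pixels per tile

using Tile = std::array<std::array<std::int32_t, kTileH>, kTileW>;

// weights holds depth * kTileW values, pixels holds depth * kTileH values.
// int32 accumulators are used for readability; the thread uses int8.
Tile accumulate_tile(const std::vector<std::int8_t>& weights,
                     const std::vector<std::int8_t>& pixels,
                     int depth) {
    assert(weights.size() == static_cast<std::size_t>(depth) * kTileW);
    assert(pixels.size() == static_cast<std::size_t>(depth) * kTileH);

    Tile acc{};  // all accumulators start at zero and stay live across the loop
    for (int k = 0; k < depth; ++k) {
        // Stage the operands once per reduction step...
        std::array<std::int8_t, kTileW> w{};
        std::array<std::int8_t, kTileH> p{};
        for (int i = 0; i < kTileW; ++i) w[i] = weights[static_cast<std::size_t>(k) * kTileW + i];
        for (int j = 0; j < kTileH; ++j) p[j] = pixels[static_cast<std::size_t>(k) * kTileH + j];

        // ...then reuse each staged value across a whole row or column of the tile.
        for (int i = 0; i < kTileW; ++i) {
            for (int j = 0; j < kTileH; ++j) {
                acc[i][j] += static_cast<std::int32_t>(w[i]) * p[j];
            }
        }
    }
    return acc;
}

// Loads needed per multiply-accumulate for a tile_w x tile_h accumulator tile.
constexpr double loads_per_mac(int tile_w, int tile_h) {
    return static_cast<double>(tile_w + tile_h) / static_cast<double>(tile_w * tile_h);
}
```

The inner loop in the question issues 6 loads for 8 `mla` instructions (a 2x4 tile, `loads_per_mac(2, 4)` = 0.75), whereas a 4x5 tile would need 9 loads for 20 multiply-accumulates (0.45), leaving the load units less likely to be the bottleneck.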

1 reply

FabianSchuetze Jun 26, 2024
Author

Thanks for the reply. That is a very good point indeed. Not sure why I didn't think of this, particularly as I used this code as a reference... Will use it and report again.
