Added explicit register pressure estimate for SIMD and tuned [Dynamic]LinearQuantization operations #2945

AlexandreEichenberger · 2024-09-18T19:20:15Z

When deciding the amount of unrolling of a simd loop, it is helpful to take into account the amount of register pressure. If unrolling a loop that use a lot of vector SIMD registers, this lead to spilling and thus lower performance.

Up to now, the register pressure was estimated as the number of SIMD ops / 2, roughly. That worked well for small loops. Linear Quantization, however, has a lot of operations that are very sequential, thus do not need many registers. But since it had a lot of operations, its estimated register pressure was high, and thus prevented loop unrolling.

I have added a new "metrics" in the map that describe the generic operations used in a loop. Before it was only "generic add" -> "3", meaning this kernel uses 3 generic adds.

I have now added a new enum value, GenericOps::EstimatedVectorRegisterPressure , that optionally let us specify an estimate of the register pressure for that kernel. When none is provided, the old approach is used; when one is provided, then that value is used.

A quick test for DynamicLinearQuantization on arm64 shows a lowering from 117us (no simd) -> 46us (simd, old unrolling) -> 38us (simd, new unrolling). Unrolling (unrollVL) went from 2 to 4.

A second issue that is addressed is the Linear Quantization. The loop has "complicated code" that simdize very well until it hit the SIMD conversion of float32 into int8. LLVM compiler generates code that is slow.

I have tried various scheme to accelerate this conversion. First I tried a hybrid approach where I use vector extract and insert element to proceed with scalar elements of a vector. The code generated by mlir/llvm for extract and insert is unfortunately worst than the original code.

I then tried to bypass the extract and insert by using memory. I save a SIMD vector into a buffer (of VL elements). Then I proceed with scalar load from that buffer, and save the int8 result in another buffer. Then that buffer is re-loaded as a SIMD vector. That approach did better than the above approach, but not as well as the next approach. Also, it has a privatization issue for the parallel loops.

The approach that worked best was to have the first loop perform all of the operations until the conversion from float to int. It saved its value (in SIMD computation mode) into a temporary buffer. A second loop then iterate over all values in the temporary buffer and convert them and save them into the output. The drawback of this approach is that it need a second buffer of the same size as the input. But each loop can perform at their best performance with significant unrolling.

A lot of assembly was looked at, to detect the best pattern, in addition to performance evaluation on z16.

For quantized linear, performance went from 158us to 124us for 64K values with 1 thread, and from 35us to 30us with 6 threads
For dynamic quantized linear, performance went from 174us to 161us for 64k values with 1 thread, and remained at 50us.

Note that I also discovered that the non-sims code (from the onnx-mlir) obtained with the -disable-simd still has a lot of SIMD introduced by the LLVM backend. So simply disabling onnx-mlir simd is not sufficient to have a code with/without SIMD. Nevertheless, onnx-mlir generated simd code outperforms code generated by the backend.

Signed-off-by: Alexandre Eichenberger <alexe@us.ibm.com>

AlexandreEichenberger · 2024-09-20T01:26:26Z

src/Dialect/Mlir/DialectBuilder.cpp

@@ -2073,6 +2073,29 @@ void VectorBuilder::multiReduction(ArrayRef<Value> inputVecArray,
  }
 }

+Value VectorBuilder::extractElement(Value vector, int64_t index) const {


Added for experiments on how to get better performance, currently not in use but were verified to work during experiments associated with the PR

AlexandreEichenberger · 2024-09-20T01:27:40Z

src/Conversion/ONNXToKrnl/ONNXToKrnlCommon.hpp

@@ -663,6 +663,12 @@ int64_t computeSuitableUnrollFactor(mlir::MemRefType memRefType,
 // Cap totVL so that it is at most maxUnrollVL * archVL.
 int64_t capVLForMaxUnroll(
    mlir::MemRefType memRefType, int64_t totVL, int64_t maxUnrollVL);
+// In some type conversion loops we may have a given totVL based on a given


Also needed for some hybrid loops where multiple precisions where present in the same SIMD loop. Currently not needed in the (best performing) final version, but would be useful to similar situations in the future.

tungld

LGTM. Thanks for various optimizations!

jenkins-droid · 2024-09-24T13:20:05Z

Jenkins Linux ppc64le Build #14716 [push] Added explicit register ... started at 09:31

jenkins-droid · 2024-09-24T13:20:07Z

Jenkins Linux amd64 Build #15686 [push] Added explicit register ... started at 08:20

jenkins-droid · 2024-09-24T13:20:08Z

Jenkins Linux s390x Build #15689 [push] Added explicit register ... started at 09:20

jenkins-droid · 2024-09-24T14:25:08Z

Jenkins Linux amd64 Build #15686 [push] Added explicit register ... passed after 1 hr 5 min

jenkins-droid · 2024-09-24T14:48:17Z

Jenkins Linux s390x Build #15689 [push] Added explicit register ... passed after 1 hr 28 min

jenkins-droid · 2024-09-24T15:23:42Z

Jenkins Linux ppc64le Build #14716 [push] Added explicit register ... passed after 2 hr 3 min

added register pressure estimate to be explicit when needed

359e095

Signed-off-by: Alexandre Eichenberger <alexe@us.ibm.com>

AlexandreEichenberger marked this pull request as draft September 18, 2024 19:20

AlexandreEichenberger added 10 commits September 19, 2024 10:51

cleanup

e2cbab8

Signed-off-by: Alexandre Eichenberger <alexe@us.ibm.com>

update

e3123c1

added min unroll for conversion loops

bfd6542

Signed-off-by: Alexandre Eichenberger <alexe@us.ibm.com>

try 2 loops for quantize linear

60a8b9e

Signed-off-by: Alexandre Eichenberger <alexe@us.ibm.com>

version with extract/insert

5f798ae

Signed-off-by: Alexandre Eichenberger <alexe@us.ibm.com>

added z16 meas times

7aa3df4

Signed-off-by: Alexandre Eichenberger <alexe@us.ibm.com>

before prunning

9dfe268

Signed-off-by: Alexandre Eichenberger <alexe@us.ibm.com>

attempt with temp buffers inside main loop

3a62034

Signed-off-by: Alexandre Eichenberger <alexe@us.ibm.com>

updated tests

168d50d

Signed-off-by: Alexandre Eichenberger <alexe@us.ibm.com>

update

30b37da

AlexandreEichenberger marked this pull request as ready for review September 20, 2024 01:25

AlexandreEichenberger commented Sep 20, 2024

View reviewed changes

AlexandreEichenberger requested a review from tungld September 20, 2024 01:28

AlexandreEichenberger changed the title ~~Added register pressure estimate to be explicit when needed~~ Added explicit register pressure estimate for SIMD and tuned [Dynamic]LinearQuantization operations Sep 20, 2024

Merge branch 'main' into simd-framwork-v2

02062a3

tungld approved these changes Sep 24, 2024

View reviewed changes

AlexandreEichenberger merged commit 5c53b7e into onnx:main Sep 24, 2024
7 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Added explicit register pressure estimate for SIMD and tuned [Dynamic]LinearQuantization operations #2945

Added explicit register pressure estimate for SIMD and tuned [Dynamic]LinearQuantization operations #2945

AlexandreEichenberger commented Sep 18, 2024 •

edited

Loading

AlexandreEichenberger Sep 20, 2024

AlexandreEichenberger Sep 20, 2024

tungld left a comment

jenkins-droid commented Sep 24, 2024

jenkins-droid commented Sep 24, 2024

jenkins-droid commented Sep 24, 2024

jenkins-droid commented Sep 24, 2024

jenkins-droid commented Sep 24, 2024

jenkins-droid commented Sep 24, 2024

Added explicit register pressure estimate for SIMD and tuned [Dynamic]LinearQuantization operations #2945

Added explicit register pressure estimate for SIMD and tuned [Dynamic]LinearQuantization operations #2945

Conversation

AlexandreEichenberger commented Sep 18, 2024 • edited Loading

AlexandreEichenberger Sep 20, 2024

Choose a reason for hiding this comment

AlexandreEichenberger Sep 20, 2024

Choose a reason for hiding this comment

tungld left a comment

Choose a reason for hiding this comment

jenkins-droid commented Sep 24, 2024

jenkins-droid commented Sep 24, 2024

jenkins-droid commented Sep 24, 2024

jenkins-droid commented Sep 24, 2024

jenkins-droid commented Sep 24, 2024

jenkins-droid commented Sep 24, 2024

AlexandreEichenberger commented Sep 18, 2024 •

edited

Loading